
Detailed question - French data cleansing

I'm working on a team doing data cleansing and de-duplication within SAP. We've noticed some unusual behaviour, specifically relating to how Primary Names are handled.

Some examples are in the table below:

RAW_DATA                                | CLEANSED_DATA
2 allée de Longchamps                   | 2 all de Longchamp
92, rue Réaumur                         | 92 rue Reaumur
33, rue Juliette Récamier 69006 Lyon    | 33 rue Juliette Recamier
12 Rue du Général Patton                | 12 rue du General Patton
253, avenue du Président Wilson         | 253 av du President Wilson
40-42 rue de la Boétie                  | 40-42 rue la Boetie
8 Avenue Delcassé                       | 8 av Delcasse
22 Boulevard Maréchal Foch              | 22 bd Marechal Foch
9-11 allée de l'Arche Tour Egée         | 9-11 all de l'Arche, Tour Egée
14 rue Avaulée                          | 14 rue Avaulee
405 avenue Galilée                      | 405 av Galilee

Each record has the status code description "Data Quality corrected the following address components: region, locality, primary name".

In each case the cleansed version is factually incorrect: the missing accents and the change from "Longchamps" to "Longchamp" make the addresses lower quality than the originals.

I cannot see any parameters that have been set to make this happen, and the behaviour only seems to affect the Primary Name; other address components are fine. "Convert Latin Output to Ascii" is set to NO, as evidenced by the fact that it is just this one address component that is affected.

As far as I can see, there are two possible explanations:

  • There is a codepage issue that is stripping diacritical characters from the Primary Name - but this does not explain "Longchamps" being changed to "Longchamp" (a quick check to separate the two effects is sketched below)
  • The address directory itself contains these inaccurate entries
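
To illustrate the distinction, a quick check outside Data Services can show whether a cleansed Primary Name differs from the raw value only by its accents or whether the text itself has changed. This is a minimal Python sketch with sample pairs taken from the table above (street-type abbreviations such as "allée" -> "all" are expected standardisation, not part of the problem):

    import unicodedata

    def strip_accents(text: str) -> str:
        # Decompose each character and drop the combining accent marks,
        # e.g. "Réaumur" -> "Reaumur".
        decomposed = unicodedata.normalize("NFKD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # (raw primary name, cleansed primary name) pairs from the table above
    pairs = [
        ("allée de Longchamps", "all de Longchamp"),
        ("rue Réaumur", "rue Reaumur"),
        ("rue Juliette Récamier", "rue Juliette Recamier"),
    ]

    for raw, cleansed in pairs:
        # If the accent-stripped raw text equals the cleansed text (ignoring case),
        # only the accents were lost; otherwise the text itself was changed.
        if strip_accents(raw).lower() == cleansed.lower():
            print(f"{raw}: accents removed only")
        else:
            print(f"{raw}: text changed -> {cleansed}")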

I suspect that we'll need to raise it with the product group, but has anyone else encountered similar issues?

Best regards,

Barry Carlino


5 Answers

  • Best Answer
    Former Member
    Posted on Dec 09, 2015 at 03:29 PM

    Barry,

    Here is the information that I received from the directory group:

    For the Postcode of 92281 the street name is "ALLEE DE LONGCHAMP"

    In the data I see "ALLEE E LONGCHAMPS" for the Postcode 54500 only. The most common spelling is "ALLEE DE LONGCHAMP".

    All of the issues that you are reporting reflect the data as it is given to us; they are not issues with the software.

    Thanks,

    Wanda


  • Former Member
    Posted on Dec 09, 2015 at 02:57 PM

    Sent the following to Barry:

    Barry,

    Thanks for sending the complete data. I have looked at it and found that the diacritical characters are not present in the data. That said, I talked to the group that works with the directory data we receive from the vendors and found out that the vendors do not provide the diacritical characters in the raw data.

    Thanks,

    Wanda

    I am still checking on the Longchamp example and hopefully have a response shortly.


  • Posted on Dec 08, 2015 at 03:40 PM

    Are you using MATCH_PRIMARY_NAME or any of the other MATCH_ output fields? Diacritical characters are removed from these fields, e.g. because you want to feed the data into a Match transform.

    Of course, it could be the code page or the address directory, but I doubt that.

    FYI: I have recently done an exercise on Finnish data and DS keeps all accented characters in the target.
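
    For illustration, the kind of match standard I mean is roughly the following - an approximation for comparison purposes only, not the actual DS implementation:

        import unicodedata
        from difflib import SequenceMatcher

        def to_match_standard(text: str) -> str:
            # Approximation of a match standard: drop diacritics and fold case so
            # that variant spellings of the same street compare as identical.
            return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode().upper()

        a = "22 Boulevard Maréchal Foch"
        b = "22 BOULEVARD MARECHAL FOCH"
        print(SequenceMatcher(None, a, b).ratio())                    # well below 1.0
        print(SequenceMatcher(None, to_match_standard(a),
                              to_match_standard(b)).ratio())          # 1.0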


    • Former Member

      Thank you for your response Dirk,

      It's a good thought, but no, we are not using the MATCH_ fields; the field in question is PRIMARY_SECONDARY_ADDRESS_BEST_COMPONENT_DELIVERY.

      I can see how this would account for the removal of the diacritical characters, but not the Longchamps --> Longchamp change

  • Former Member
    Posted on Dec 08, 2015 at 07:49 PM

    Barry,

    Can you provide the full input examples with the locality, region, and postcode? When looking up 'de Longchamp*' in the reference data there are several streets that are just 'de Longchamp' and several that have 'de Longchamps' as the primary name. The one that we return would depend on the locality that the address is assigned to. It is very likely that the entry we are matching to is in a locality that only has 'de Longchamp'. The same could also be true for the missing diacritics. I would need to have the complete address to verify this.
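
    In other words, the behaviour is conceptually something like the toy sketch below - invented data and logic, not the real directory format or the GAC matching - where the locality the address is assigned to decides which spelling comes back:

        reference_data = {
            # Invented localities purely for illustration - not real directory content.
            "LOCALITY_A": ["ALLEE DE LONGCHAMP"],
            "LOCALITY_B": ["ALLEE DE LONGCHAMPS"],
        }

        def corrected_primary_name(input_name: str, assigned_locality: str) -> str:
            # Return the directory spelling held for the assigned locality if an
            # entry matches the input loosely; otherwise leave the input alone.
            for entry in reference_data.get(assigned_locality, []):
                if entry.rstrip("S") == input_name.upper().rstrip("S"):  # very loose toy match
                    return entry
            return input_name

        print(corrected_primary_name("ALLEE DE LONGCHAMPS", "LOCALITY_A"))  # ALLEE DE LONGCHAMP
        print(corrected_primary_name("ALLEE DE LONGCHAMPS", "LOCALITY_B"))  # ALLEE DE LONGCHAMPS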

    Thanks,

    Wanda Green

    GAC Test Engineer


    • Former Member

      Wanda,

      I'll email you an Excel file containing the raw input data and the immediate output from the cleansing process (we do a small amount of additional processing based on the address completeness and/or address quality score).

      Best regards,

      Barry Carlino

  • Former Member
    Posted on Dec 09, 2015 at 09:38 AM

    I can see how this would account for the removal of the diacritical characters, but not the Longchamps --> Longchamp change


    Well, as Wanda says, the full address is needed to work out the reason.


    But does the Longchamps address belong to Paris?


    Are you using the GAC transform with only the street and number assigned as input? I think Locality (and country) is the most heavily weighted factor in the GAC transform for obtaining correct results.


    paris.png (19.2 kB)

    • Former Member

      Nestor,

      The address in question is in Suresnes, near Paris.

      We have a common process which needs to deal with a range of addresses that arrive in a variety of formats. The worst case is that the entire address is in a single field with a separate COUNTRY field; the best case is that we get STREET1-4, CITY, REGION, POSTCODE and so on. Quite often we get data where the fields are incorrectly or inconsistently populated.

      Because of this we use COUNTRY and MULTILINE 1-12. We've tried using the specific fields (STREET, CITY, REGION and so on) in the past, but because of the data issues we regularly experience, the results aren't as good.
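
      Conceptually the mapping is something like the simplified sketch below (Python rather than the actual job design; the source field names are just examples):

        def to_multiline(record: dict) -> dict:
            # Pack whatever address fields the source happens to populate into
            # COUNTRY plus MULTILINE1-12, skipping anything empty or missing.
            candidate_fields = ["STREET1", "STREET2", "STREET3", "STREET4",
                                "CITY", "REGION", "POSTCODE", "FULL_ADDRESS"]
            lines = [record[f].strip() for f in candidate_fields if record.get(f, "").strip()]

            output = {"COUNTRY": record.get("COUNTRY", "")}
            for i in range(1, 13):                         # MULTILINE1 .. MULTILINE12
                output[f"MULTILINE{i}"] = lines[i - 1] if i <= len(lines) else ""
            return output

        # Worst case: one free-text field plus COUNTRY.
        print(to_multiline({"FULL_ADDRESS": "2 allée de Longchamps Suresnes", "COUNTRY": "FR"}))
        # Best case: discrete fields, funnelled through the same MULTILINE layout.
        print(to_multiline({"STREET1": "33, rue Juliette Récamier", "CITY": "Lyon",
                            "POSTCODE": "69006", "COUNTRY": "FR"}))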

      Best regards,

      Barry Carlino
