Sunday, March 20, 2011

How to change diacritic characters to non-diacritic ones

Hello,

I've found an answer on Stack Overflow explaining how to remove diacritic characters, but could you please tell me whether it is possible to change diacritic characters to non-diacritic ones?

Oh, and I'm thinking of .NET (or another platform if that's not possible).

Kind regards

From stackoverflow
  • Copying from my own answer to another question:

    Instead of creating your own table, you could instead convert the text to normalization form D, where the characters are represented as a base character plus the diacritics (for instance, "á" will be replaced by "a" followed by a combining acute accent). You can then strip everything which is not an ASCII letter.

    The tables still exist, but are now the ones from the Unicode standard.

    You could also try NFKD instead of NFD, to catch even more cases.
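
    A minimal sketch of this approach in C#, assuming a hypothetical helper named ToAscii (it is not part of .NET). It normalizes to form D, or to form KD when asked, and then keeps every ASCII character rather than only letters, so spaces and digits survive:

    ```csharp
    using System;
    using System.Linq;
    using System.Text;

    static class AsciiFolding
    {
        // Hypothetical helper, not a .NET API: decompose, then drop non-ASCII.
        public static string ToAscii(string text, bool compatibility = false)
        {
            // FormD splits U+00E1 into "a" + U+0301; FormKD additionally
            // splits compatibility characters such as the U+FB01 ligature.
            NormalizationForm form = compatibility
                ? NormalizationForm.FormKD
                : NormalizationForm.FormD;
            string decomposed = text.Normalize(form);

            // Keep only UTF-16 code units in the ASCII range (U+0000..U+007F).
            return new string(decomposed.Where(c => c <= '\u007F').ToArray());
        }
    }
    ```

    With this sketch, ToAscii("á") gives "a". The "ﬁ" ligature (U+FB01) is dropped entirely under form D but becomes "fi" when compatibility is true. Characters that have no decomposition at all, such as "ø" or "ł", are removed in both modes.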

    hop : Please don't do this, if possible. You are butchering our languages. Try to use transliteration instead.
  • It might also be worthwhile to step back and consider why you want to do this. If you are trying to remove character differences you consider insignificant, you should look at the Unicode collation algorithm. This is the standard way to disregard differences such as case or diacritics when comparing strings for searching or sorting.

    If you plan to display the modified text, consider your audience. What you can safely filter away is locale sensitive. In US English, "Igloo" = "igloo", and "resume" = "résumé", but in Turkish, a lower case I is ı (dotless), and in French, cote means quote, côté means side, and côte means coast. So, the collation language determines what differences are significant.
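
    In .NET, collation-based comparison is available through CompareOptions. The following is only a sketch: LooselyEqual is a hypothetical helper name, and the exact results depend on the globalization data of the platform the code runs on.

    ```csharp
    using System;
    using System.Globalization;

    static class CollationDemo
    {
        // Hypothetical helper: true when the two strings differ at most by
        // case or diacritics under the given culture's collation rules.
        public static bool LooselyEqual(string a, string b, CultureInfo culture)
        {
            const CompareOptions options =
                CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace;
            return string.Compare(a, b, culture, options) == 0;
        }
    }
    ```

    With full globalization data available, LooselyEqual("résumé", "Resume", CultureInfo.InvariantCulture) is expected to be true, and neither string is modified. Passing a specific culture, for example new CultureInfo("tr-TR"), applies that language's rules instead.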

    If removing diacritics is the right solution for your application, it is safest to produce your own table to which you explicitly add the characters you want to convert.
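
    A hand-written table could look like the sketch below. The class name, the method name, and the entries are all illustrative assumptions; a real table would list exactly the characters your application has decided to convert.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text;

    static class ExplicitTable
    {
        // Illustrative entries only, written as escapes to avoid source
        // encoding problems.
        private static readonly Dictionary<char, string> Map =
            new Dictionary<char, string>
            {
                { '\u00E9', "e" },   // e with acute
                { '\u00E8', "e" },   // e with grave
                { '\u00E7', "c" },   // c with cedilla
                { '\u0142', "l" },   // l with stroke (has no decomposition)
                { '\u00F8', "o" },   // o with stroke (has no decomposition)
                { '\u00DF', "ss" },  // sharp s maps to two letters
            };

        public static string Convert(string text)
        {
            var sb = new StringBuilder(text.Length);
            foreach (char c in text)
            {
                // Characters that are not in the table pass through unchanged.
                sb.Append(Map.TryGetValue(c, out string mapped) ? mapped : c.ToString());
            }
            return sb.ToString();
        }
    }
    ```

    The table can cover cases that decomposition cannot, such as "ł" or a one-to-many mapping like "ß" to "ss", and it never touches a character you did not list.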

    A general, automated approach could be devised using Unicode decomposition. With this, you can decompose a character with diacritics into its "combining" characters (the diacritic marks) and the base character with which they are combined. Filter out anything that is a combining character, and you should be left with the "non-diacritic" characters.

    The lack of discrimination in the automated method, however, could have some unexpected effects. I'd recommend a lot of testing on a representative body of text.
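
    The automated approach can be sketched in C# as follows. RemoveCombiningMarks is a hypothetical name, and the sketch assumes that filtering the NonSpacingMark category is enough for your text; spacing and enclosing marks are left alone.

    ```csharp
    using System;
    using System.Globalization;
    using System.Text;

    static class MarkStripper
    {
        public static string RemoveCombiningMarks(string text)
        {
            // Decompose into base characters followed by combining marks.
            string decomposed = text.Normalize(NormalizationForm.FormD);
            var sb = new StringBuilder(decomposed.Length);
            foreach (char c in decomposed)
            {
                // NonSpacingMark covers combining accents such as U+0301.
                if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
                {
                    sb.Append(c);
                }
            }
            // Recompose whatever is left so the result is in form C.
            return sb.ToString().Normalize(NormalizationForm.FormC);
        }
    }
    ```

    Here RemoveCombiningMarks("côté") and RemoveCombiningMarks("côte") both return "cote", which is exactly the loss of distinction described above. A character with no decomposition, such as "ł", is returned unchanged.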

    tomaszs : I think one of the uses of this is to create nice URLs.
  • For a simple example:

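    The first step in removing diacritics from a string is to decompose it:

    // Each accented letter becomes a base letter plus combining marks, which still have to be filtered out.
    string newString = myDiacriticsString.Normalize(NormalizationForm.FormD);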
    
    Feryt : This does not work: "ě".Normalize(NormalizationForm.FormD) does not return "e".
    Hans Passant : Yes it does, use String.ToCharArray() to see it.
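
    To see what Normalize actually returns, it can help to print the individual UTF-16 code units. A small sketch, with Describe as a hypothetical helper name:

    ```csharp
    using System;
    using System.Linq;
    using System.Text;

    static class CodePointDump
    {
        // Hypothetical helper: list the UTF-16 code units of a string
        // in U+XXXX notation, separated by spaces.
        public static string Describe(string text)
        {
            return string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")));
        }
    }
    ```

    Describe("ě") gives "U+011B", while Describe("ě".Normalize(NormalizationForm.FormD)) gives "U+0065 U+030C": the letter "e" followed by a combining caron. The decomposed string still renders as "ě", which is why it looks as if nothing changed until the combining mark is filtered out.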
  • Since no one has bothered to post the code to do this, here it is.

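    // Requires: using System.Text; using System.Text.RegularExpressions;
    string RemoveDiacriticals(string text)
    {
       // Decompose so that each accent becomes a separate combining character.
       text = text.Normalize(NormalizationForm.FormD);
       // Keep tab, CR, LF and printable ASCII (space through "~"); drop everything else.
       return Regex.Replace(text, @"[^\t\r\n\u0020-\u007E]", "");
    }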
    

    Note: a big reason for needing to do this is integrating with a third-party system that only handles ASCII when your data is in Unicode. This is common. Your options are basically to remove the accented characters entirely, or to strip only the accents from them so that as much of the original input as possible is preserved. Obviously, this is not a perfect solution, but it is much better than simply removing every character above ASCII 127.
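
    The difference between the two options can be shown side by side. This is only a sketch; both method names are hypothetical.

    ```csharp
    using System;
    using System.Linq;
    using System.Text;

    static class AsciiExport
    {
        // Option 1: drop every non-ASCII character outright.
        public static string DropNonAscii(string text)
        {
            return new string(text.Where(c => c <= '\u007F').ToArray());
        }

        // Option 2: decompose first, so the base letters survive the filter
        // and only the combining accents are dropped.
        public static string FoldThenDrop(string text)
        {
            return DropNonAscii(text.Normalize(NormalizationForm.FormD));
        }
    }
    ```

    For the input "Crème brûlée", DropNonAscii returns "Crme brle", whereas FoldThenDrop returns "Creme brulee".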
