
Regular Expressions: Match the base letter of a unicode string

Hello experts,

Is it possible to match the base letter of a unicode string? If so, how do I do it? So, for example, I have the word "hen" that I am looking for. In my text file, I could have "hen" (which will match) and I could have "heñ" (which currently does not match). I would like my regular expression or method thereof to be able to match both words.

So, is there a regex tactic I am not aware of that will match the base letter "n" when it comes across the Unicode character ñ (and so on for every base letter)?

Thanks for shedding the light.

gmrsecs

Basically, if you can normalize your string to its canonical decomposed form (I don't know how to do that in .NET), you can use a simple regular expression like:
1) he\u006E\p{M}*

where \u006E is the Unicode representation of 'n', and \p{M}* matches zero or more diacritic marks. So this regular expression will match 'hen', but also 'heX', where X is the combination of \u006E and a diacritic (e.g. \u0301).

Anyway, the remaining problem is the canonical decomposition.
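
For what it's worth, .NET exposes canonical decomposition through string.Normalize(NormalizationForm.FormD), so the pattern idea above can be sketched roughly as follows. This is only a minimal sketch: the class and method names (BaseLetterMatch, BuildPattern, Contains) are made up for illustration, and it assumes the search word consists of plain base letters. It places \p{M}* after every letter of the word, not just the 'n', so a diacritic on any letter is tolerated.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class BaseLetterMatch
{
    // Hypothetical helper: builds a pattern in which every character of
    // `word` may be followed by zero or more combining marks (\p{M}*).
    public static Regex BuildPattern(string word)
    {
        var sb = new StringBuilder();
        foreach (char c in word)
        {
            sb.Append(Regex.Escape(c.ToString())).Append(@"\p{M}*");
        }
        return new Regex(sb.ToString());
    }

    // Hypothetical helper: decomposes `text` (FormD) so that a character
    // such as U+00F1 becomes U+006E followed by U+0303, then searches it.
    public static bool Contains(string text, string word)
    {
        string decomposed = text.Normalize(NormalizationForm.FormD);
        return BuildPattern(word).IsMatch(decomposed);
    }

    public static void Main()
    {
        Console.WriteLine(Contains("a hen", "hen"));       // True
        Console.WriteLine(Contains("a he\u00F1", "hen"));  // True
        Console.WriteLine(Contains("a hat", "hen"));       // False
    }
}
```

Because the marks stay in the decomposed text, a match still contains the original accents, which may matter if the matched text is shown back to the user.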

ASKER CERTIFIED SOLUTION

gmrsecs


Gewgala

ASKER

Thank you gmrsecs, that's exactly what I needed. I applied NormalizationForm.FormD to my string, and then ran a regex on the normalized string that stripped out all diacritic marks. So, for example, I ran this:

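// requires: using System.Text; using System.Text.RegularExpressions;
string s = <contents of file>;
// FormD splits each accented character into its base letter plus combining marks.
string decoded = s.Normalize(NormalizationForm.FormD);

// \p{M} matches any combining mark; the verbatim string (@) keeps the backslash intact.
Regex r = new Regex(@"\p{M}+", RegexOptions.Compiled);
decoded = r.Replace(decoded, "");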

The string variable "decoded" now contains the same content as the string variable "s", except that all diacritic marks have been stripped out, so every ñ character becomes simply n, and so on. I am then able to perform my matches on the decoded string and grab everything that I need.
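
For anyone who wants to try the stripping approach end to end, here is a small self-contained sketch of it. The class and method names (DiacriticStripper, Strip) are invented for illustration, and the final recomposition to FormC is an optional extra step that was not part of the snippet above.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class DiacriticStripper
{
    // \p{M} covers all combining marks (general categories Mn, Mc, Me).
    private static readonly Regex Marks =
        new Regex(@"\p{M}+", RegexOptions.Compiled);

    // Hypothetical helper: returns `s` with combining marks removed.
    public static string Strip(string s)
    {
        // Split accented characters into base letter + combining marks.
        string decomposed = s.Normalize(NormalizationForm.FormD);
        // Drop the marks, keeping only the base letters.
        string stripped = Marks.Replace(decomposed, "");
        // Optional: recompose whatever is left into precomposed form.
        return stripped.Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        string decoded = Strip("he\u00F1 and hen");
        Console.WriteLine(decoded);                              // hen and hen
        Console.WriteLine(Regex.Matches(decoded, "hen").Count);  // 2
    }
}
```

Two caveats: letters such as ø or ł have no canonical decomposition, so they pass through unchanged, and match positions found in the stripped string will not generally line up with positions in the original string.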

Thanks!

Gewgala

ASKER

Thanks again!