Gewgala
asked on
Regular Expressions: Match the base letter of a unicode string
Hello experts,
Is it possible to match the base letter of a Unicode string? If so, how do I do it? For example, I have the word "hen" that I am looking for. In my text file, I could have "hen" (which will match) and I could have "heñ" (which currently does not match). I would like my regular expression, or some method built around it, to match both words.
So, is there a regex tactic of which I am not aware that will match the base letter "n" when it comes across the Unicode character ñ (and so on for every base letter)?
Thanks for shedding the light.
ASKER CERTIFIED SOLUTION
ASKER
Thank you gmrsecs, that's exactly what I needed. I applied NormalizationForm.FormD to my string, and then ran a regex on the same string that stripped out all diacritical marks. So, for example, I ran this:
string s = <contents of file>;
string decoded = s.Normalize(NormalizationForm.FormD);
Regex r = new Regex(@"\p{M}+", RegexOptions.Compiled);
decoded = r.Replace(decoded, "");
The string variable "decoded" now contains the same content as the string variable "s", except that all diacritical marks are stripped out, so every ñ becomes a plain n, and so on. I am then able to run my matches against the decoded string and grab everything that I need.
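For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same approach. The class name DiacriticFolding, the helper StripDiacritics, and the sample text are my own illustrations and are not from the thread. The final recomposition to FormC is an optional extra step I added, not part of what I described above.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class DiacriticFolding
{
    // \p{M} matches any Unicode combining mark. The verbatim string (@"...")
    // stops the C# compiler from treating \p as a string escape sequence.
    private static readonly Regex CombiningMarks =
        new Regex(@"\p{M}+", RegexOptions.Compiled);

    // Hypothetical helper name, chosen for illustration only.
    public static string StripDiacritics(string text)
    {
        // FormD splits precomposed characters such as U+00F1 (ñ)
        // into a base letter followed by a combining mark.
        string decomposed = text.Normalize(NormalizationForm.FormD);

        // Remove the combining marks, leaving only the base letters.
        string stripped = CombiningMarks.Replace(decomposed, "");

        // Optional: recompose anything that is left into FormC.
        return stripped.Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        string sample = "hen he\u00F1 then";   // made-up sample text
        string folded = StripDiacritics(sample);

        // Search the folded text for the whole word "hen".
        foreach (Match m in Regex.Matches(folded, @"\bhen\b"))
        {
            Console.WriteLine(m.Index + ": " + m.Value);
        }
    }
}
```

Note that the match positions refer to the folded string, not to the original one, because removing marks can shift character offsets.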
Thanks!
ASKER
Thanks again!
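1) he\u006E\p{M}*
where \u006E is the Unicode representation of 'n', and \p{M}* matches zero or more diacritical marks. So this regular expression will match 'hen', but also heX (where X is the composition of \u006E and a diacritic, e.g. \u0301).
Either way, the problem of canonical decomposition remains: the text has to be in decomposed form (for example via NormalizationForm.FormD) before \p{M}* can see the diacritic as a separate character.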
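To illustrate the decomposition issue, here is a small sketch that runs the pattern from 1) against a precomposed string, before and after normalizing to FormD. The class name DecomposedMatch and the helper ContainsHen are hypothetical names used only for this example.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class DecomposedMatch
{
    // The pattern from 1), written as a verbatim string so the backslashes
    // reach the regex engine unchanged.
    private static readonly Regex HenPattern = new Regex(@"he\u006E\p{M}*");

    // Hypothetical helper: decompose the input first, then match.
    public static bool ContainsHen(string text)
    {
        return HenPattern.IsMatch(text.Normalize(NormalizationForm.FormD));
    }

    public static void Main()
    {
        // "heñ" with ñ stored as the single code point U+00F1.
        string precomposed = "he\u00F1";

        // No literal 'n' is present until the string is decomposed.
        Console.WriteLine(HenPattern.IsMatch(precomposed));   // False

        // After FormD the text is h, e, n, U+0303, so the pattern matches.
        Console.WriteLine(ContainsHen(precomposed));          // True

        // Plain "hen" matches as well, since \p{M}* allows zero marks.
        Console.WriteLine(ContainsHen("hen"));                // True
    }
}
```

Unlike stripping the marks out, this keeps the diacritics in the decomposed text, so the matched value still carries them.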