rye004 asked:

How to create a regular expression in C# to find characters that are not English, French or German.

I am trying to build a regular expression that tells me whether a piece of text contains any characters outside the English, French and German character sets. More specifically, the allowed set is any character you can type on a standard English, German or French keyboard.

I need to go through millions of “words”, that is, groups of characters delimited by whitespace.

Any suggestions would be greatly appreciated.
David Johnson, CD

How are they encoded? UTF-8, or Unicode (UTF-16)?
In Unicode (UTF-16), that is roughly anything with a code point <= 255. In UTF-8, that means single bytes below 0x80, or two-byte sequences starting with 0xC2 or 0xC3.
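
As a rough sketch of the range check described above, and assuming the text has already been decoded into a .NET string (UTF-16), a word could be tested by checking that every char is at or below U+00FF. The class and method names are hypothetical. Note that this range also admits non-letter symbols, and it misses a few French characters such as œ (U+0153).

```csharp
using System.Linq;

public static class Latin1Check
{
    // Hypothetical helper: returns true if every UTF-16 code unit in the
    // word lies in U+0000..U+00FF (ASCII plus the Latin-1 Supplement).
    public static bool IsLatin1Only(string word)
    {
        return word.All(c => c <= '\u00FF');
    }
}
```

For example, Latin1Check.IsLatin1Only("Stra\u00DFe") passes because ß is U+00DF, while a word containing œ (U+0153) does not.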
rye004 (ASKER)

The words are in a text file saved with Unicode encoding. Many thanks!

Since this is .NET, you should be able to list all of the characters you want to allow in a negated character class:

e.g.

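// Requires: using System.Text.RegularExpressions;
// "(french chars here)" is a placeholder: replace it, parentheses included,
// with the French letters you want to allow.
if (Regex.IsMatch(input, "[^a-zA-ZäöüÄÖÜß(french chars here)]"))
{
    // Found a character outside the allowed set
}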

I'm afraid I don't speak or read French, so I don't know which characters it uses, but if you know them, you should be able to insert them at the marked place in the pattern above, without the parentheses.
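
For illustration only, here is one way the placeholder might be filled in. The list of French letters (accented vowels, ç, and the ligatures æ and œ) is an assumption about what should count as French, and the class and method names are hypothetical; adjust the set to the keyboards you need to cover. Digits and punctuation are not in the class, so they would be reported as offending too.

```csharp
using System.Text.RegularExpressions;

public static class WordFilter
{
    // Assumed allowed set: ASCII letters, German umlauts and ß, and the
    // accented letters and ligatures commonly used in French.
    private static readonly Regex Offending = new Regex(
        "[^a-zA-ZäöüÄÖÜßàâæçéèêëîïôœùûÿÀÂÆÇÉÈÊËÎÏÔŒÙÛŸ]",
        RegexOptions.Compiled);

    // Returns true if the word contains at least one character
    // outside the allowed set.
    public static bool HasOffendingCharacter(string word)
    {
        return Offending.IsMatch(word);
    }
}
```

Building the Regex once and reusing it should suit a pass over millions of words. If the file may contain decomposed accents (a base letter followed by a combining mark), calling word.Normalize() before the check would probably be needed, since the pattern only lists precomposed letters.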
ASKER CERTIFIED SOLUTION
Bob Learned
This solution is only available to members.
rye004 (ASKER)

Thank you.