thomers1
asked on
How to convert special characters in a string to their "normal" counterparts
For a profanity filter, I need to convert special characters (e.g. with accents) in a string to "normal" ones, for example:
è to e
í to i
ú to u
etc.
How can I do this efficiently?
ASKER
Thanks. Hmm, I forgot that multiple variations exist for each "normal" letter, like
Ê Ë È É
should all be replaced with "E" (or converted to lowercase first, and then replaced by "e").
Is there anything to consider when comparing chars with special characters (e.g. how do I encode them if I can't type them on my keyboard)?
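One way to cover all variations of a letter without listing them is Unicode decomposition. The following is a minimal sketch, not the accepted solution from this thread: it assumes `java.text.Normalizer` (Java 6+) is available, and the class and method names are made up for illustration. Writing the accented characters as `\uXXXX` escapes also sidesteps the "can't type them" problem, and keeps the source independent of the file's encoding.

```java
import java.text.Normalizer;
import java.util.Locale;

public class AccentStripper {
    // Hypothetical helper: decompose each accented letter into its base letter
    // plus combining marks (NFD), then delete the combining marks (category M).
    public static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // The four accented capital E variants from the question, as escapes.
        String accented = "\u00CA\u00CB\u00C8\u00C9";
        System.out.println(stripAccents(accented).toLowerCase(Locale.ROOT)); // prints "eeee"
    }
}
```

Letters that have no canonical decomposition (for example ø or ß) pass through unchanged, so a profanity filter would probably still need a small explicit mapping for those.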
ASKER CERTIFIED SOLUTION
This solution is only available to members.
ASKER
thanks! :-)
ASKER
Hmm, it seems I was too fast closing this question.
In your example, the length of find is 8, while repl is 4, which means it doesn't work.
Well - obviously you need to ensure they're the same length ;-)
ASKER
Nope, what I meant is: if I use your example from above with 4 characters each, find.length() returns 8, while repl.length() returns 4.
find = "ÊËÈÉ"
repl = "EEEE"
To be on the safe side:
for(int i = 0;i < Math.min(find.length(),repl.length());i++)
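The accepted solution is not visible in this thread, so the following is only a guess at the kind of parallel-string replacement being discussed. The names `FIND` and `REPL` mirror the `find` and `repl` strings above; the class and method are hypothetical. The accented characters are written as Unicode escapes so that their count does not depend on how the source file is decoded.

```java
public class CharMapper {
    // The four accented capitals from the example, written as Unicode escapes.
    static final String FIND = "\u00CA\u00CB\u00C8\u00C9";
    // Replacement at the same index as the character it replaces.
    static final String REPL = "EEEE";

    public static String replaceMapped(String input) {
        // Fail loudly instead of silently truncating to the shorter string.
        if (FIND.length() != REPL.length()) {
            throw new IllegalStateException("FIND and REPL must have the same length");
        }
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            int pos = FIND.indexOf(c);
            // Keep characters that have no mapping.
            out.append(pos >= 0 ? REPL.charAt(pos) : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(FIND.length() + " / " + REPL.length()); // prints "4 / 4"
        System.out.println(replaceMapped("\u00CAtre"));            // prints "Etre"
    }
}
```

Clamping the loop with `Math.min` avoids an exception, but if the lengths differ because of an encoding problem the mapping is still wrong, which is why this sketch checks the lengths up front instead.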
ASKER
I think the first line does not initialize the find string as UTF-8 encoded.
If I iterate over the characters of find, I get this (see code).
Obviously, this can't be used to compare to the repl string.
0:
1: ä
2:
3: ã
4:
5: à
6:
7: â
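A length of 8 for four characters is what one would expect if the source file is saved as UTF-8 but read with a single-byte encoding, since each of these letters takes two bytes in UTF-8. The sketch below reproduces that effect under an assumption: it uses ISO-8859-1 as a stand-in single-byte charset, because a MacRoman charset is not guaranteed to be present in every JVM. The class and method names are illustrative only.

```java
import java.nio.charset.StandardCharsets;

public class MisdecodeDemo {
    // Encode the text as UTF-8, then decode the bytes with a single-byte
    // charset, as a compiler would if it assumed the wrong source encoding.
    public static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String find = "\u00CA\u00CB\u00C8\u00C9";
        System.out.println(find.length());            // prints 4
        System.out.println(misdecode(find).length()); // prints 8
    }
}
```

If that is the cause here, two likely remedies are compiling with `javac -encoding UTF-8` or writing the literals as Unicode escapes, so the string no longer depends on the platform default.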
Can you show me the code you're running? Also, using the following, please tell me the result of passing 'file.encoding' as a parameter to it
http://technojeeves.com/joomla/index.php/free/54-javasystemproperties
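If the linked utility is not at hand, the same value can be read directly. This is a small sketch using only standard calls (`System.getProperty` and `Charset.defaultCharset`); the class and method names are assumptions for illustration.

```java
import java.nio.charset.Charset;

public class EncodingReport {
    // Report the file.encoding system property and the JVM's default charset.
    public static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```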
ASKER
aah, file.encoding is "MacRoman"
final String prepare_find = "ÊËÈÉ";
final String prepare_repl = "eeee";
System.out.println("find: " + prepare_find.length());
System.out.println("repl: " + prepare_repl.length());
for (int i = 0; i < prepare_find.length(); i++) {
    System.out.println(i + ": " + prepare_find.charAt(i));
}
Ah OK. Refer to the MacRoman chart for your default chars. You might be better off installing a full UTF-8 locale in the end.
ASKER
"You might be better off installing a full UTF-8 locale in the end"
How do I do that on Mac OS X?
Don't know, I'm afraid. I've never been a Mac user, but I'm assuming its latest incarnations support a UTF-8 environment. Having said that, most of the exotic accented chars should be in MacRoman, since they appear in ISO8859-1:
http://technojeeves.com/joomla/index.php/free/48-iso8859-1