thomers1


How to convert special characters in a string to their "normal" counterparts

for a profanity filter, i need to convert special characters (e.g. with accents) in a string to "normal" ones, example:

è to e
í to i
ú to u

etc.

how to do this efficiently?
CEHJ

I would just do
final String find = "èíú";
final String repl = "eiu";
 
// (in loop)
s = s.replace(find.charAt(i), repl.charAt(i)); // literal char-for-char replacement, no regex involved
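
For reference, the loop above could be fleshed out roughly as follows. This is only a minimal sketch: the class name `SimpleAccentReplace` and the method `replaceAccents` are made up for illustration, and the accented characters are written as Unicode escapes so the example does not depend on the encoding of the source file. It uses `String.replace(char, char)`, which replaces literally instead of treating its arguments as regular expressions.

```java
final class SimpleAccentReplace {
    // Accented characters and their replacements, matched position by position.
    // Written as Unicode escapes so this source file stays plain ASCII.
    static final String FIND = "\u00E8\u00ED\u00FA"; // e-grave, i-acute, u-acute
    static final String REPL = "eiu";

    // Hypothetical helper: replaces each character of FIND with its counterpart in REPL.
    static String replaceAccents(String s) {
        for (int i = 0; i < FIND.length(); i++) {
            // String.replace(char, char) performs a literal, non-regex replacement.
            s = s.replace(FIND.charAt(i), REPL.charAt(i));
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(replaceAccents("caf\u00E8 s\u00ED men\u00FA")); // prints "cafe si menu"
    }
}
```

Each pass scans the whole string once, so the cost grows with the number of entries in `FIND`; for a short list like this that should not matter.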


thomers1

ASKER

thanks - hmm, i forgot that multiple variations will exist for each "normal" letter

like
Ê Ë È É
should all be replaced with "E" (or converted into lowercase first, and then replaced by "e")

is there something to consider when comparing chars with special characters? (e.g. how do i encode them, if i can't type them on my keyboard).
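
One possible way to cover every accented variant of a letter without listing each one is Unicode decomposition with `java.text.Normalizer`. Note that this is a different technique from the find/replace strings used elsewhere in this thread, and the names `AccentStripper` and `strip` are invented for this sketch. Characters that cannot be typed on the keyboard can be written as Unicode escapes inside string literals, as the `main` method does.

```java
import java.text.Normalizer;

final class AccentStripper {
    // Hypothetical helper: removes accents by decomposing and dropping the marks.
    static String strip(String s) {
        // NFD splits an accented letter into its base letter plus combining mark(s).
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        // Remove every combining mark, leaving only the base letters.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // E-circumflex, E-diaeresis, E-grave, E-acute written as escapes.
        System.out.println(strip("\u00CA\u00CB\u00C8\u00C9")); // prints "EEEE"
    }
}
```

Letters that have no decomposition into a base letter plus accent (for example ø or ß) pass through unchanged, so a profanity filter may still want a small explicit table for those.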
ASKER CERTIFIED SOLUTION
CEHJ
This solution is only available to members.
thanks! :-)
:-)
hmm, seems i was too fast closing this question

in your example, the length of find is 8, while repl is 4 - which means it doesn't work
Well - obviously you need to ensure they're the same length ;-)
nope what i meant is, if i use your example from above with 4 characters each, find.length() returns 8, while repl.length() returns 4.




find = "ÊËÈÉ"
repl = "EEEE"


To be on the safe side:
for(int i = 0;i < Math.min(find.length(),repl.length());i++)


i think, the first line does not initialize the find string as UTF-8 encoded.

if i iterate over the characters of find, i get this (see code).

obviously, this can't be used to compare to the repl string.




0: 
1: ä
2: 
3: ã
4: 
5: à
6: 
7: â
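
The doubling from 4 to 8 characters is consistent with UTF-8 bytes being decoded with a single-byte charset: each of these letters takes two bytes in UTF-8, and a single-byte charset turns every byte into its own character. The sketch below reproduces the effect under that assumption. `MisdecodeDemo` and `misdecode` are hypothetical names, and ISO-8859-1 is used only as a stand-in single-byte charset because every JVM provides it; passing `Charset.forName("x-MacRoman")` instead should match the listing above more closely, if your JDK ships that charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MisdecodeDemo {
    // Encodes the text as UTF-8, then decodes those bytes with another charset.
    // This imitates a UTF-8 source file being read with the wrong encoding.
    static String misdecode(String text, Charset wrong) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrong);
    }

    public static void main(String[] args) {
        // E-circumflex, E-diaeresis, E-grave, E-acute written as escapes.
        String find = "\u00CA\u00CB\u00C8\u00C9";
        String garbled = misdecode(find, StandardCharsets.ISO_8859_1);
        System.out.println(find.length());    // prints 4
        System.out.println(garbled.length()); // prints 8
    }
}
```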


Can you show me the code you're running? Also, using the following, please tell me the result of passing 'file.encoding' as a parameter to it

http://technojeeves.com/joomla/index.php/free/54-javasystemproperties
aah, file.encoding is "MacRoman"
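
The same property can also be read directly from Java, without the linked tool. A small sketch (the class name `EncodingCheck` and the method `describe` are made up for illustration):

```java
import java.nio.charset.Charset;

final class EncodingCheck {
    // Summarises the encoding-related settings of the running JVM.
    static String describe() {
        return "file.encoding = " + System.getProperty("file.encoding")
                + ", default charset = " + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the source file is saved as UTF-8 but compiled under a different default, the compiler's assumption can be overridden with javac's `-encoding` option (for example `javac -encoding UTF-8 EncodingCheck.java`), which may be simpler than changing the system locale.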



final String prepare_find = "ÊËÈÉ";
final String prepare_repl = "eeee";
 
System.out.println("find: " + prepare_find.length());
System.out.println("repl: " + prepare_repl.length());
 
for (int i=0; i<prepare_find.length(); i++) {
   System.out.println(i + ": " + prepare_find.charAt(i));
}


Ah OK. Refer to the MacRoman chart for your default chars. You might be better to install a full UTF-8 locale in the end
"You might be better to install a full UTF-8 locale in the end"
How do i do that on Mac OS X?

Don't know i'm afraid. I've never been a Mac user, but i'm assuming its latest incarnations support a UTF-8 environment. Having said that, most of the exotic accented chars should be in MacRoman, since they appear in ISO8859-1:

http://technojeeves.com/joomla/index.php/free/48-iso8859-1