• Status: Solved

How to convert special characters in a string to their "normal" counterparts

for a profanity filter, i need to convert special characters (e.g. with accents) in a string to "normal" ones, example:

è to e
í to i
ú to u

etc.

how to do this efficiently?
thomers1
Asked:
 
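CEHJ Commented:
I would just do
final String find = "èíú";
final String repl = "eiu";
 
// Loop over the pairs; replace(char, char) swaps characters literally, with no regex involved
for (int i = 0; i < find.length(); i++) {
    s = s.replace(find.charAt(i), repl.charAt(i));
}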

 
thomers1 Author Commented:
thanks - hmm, i forgot that multiple variations exist for each "normal" letter

like
Ê Ë È É
should all be replaced with "E" (or converted to lowercase first, and then replaced by "e")

is there something to consider when comparing chars with special characters? (e.g. how do i encode them if i can't type them on my keyboard?)
 
CEHJ Commented:
Multiples don't matter: see below.

The fact that you can't type them doesn't matter, but they must exist in the encoding in which the source file is saved, and the compiler must read the file using that same encoding. You can always use Unicode escapes for untypable ones. It's best to make sure your source code is saved as UTF-8.
find = "ÊËÈÉ";
repl = "EEEE";
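
As a minimal sketch of the Unicode-escape suggestion above: writing the accented letters as escapes keeps the source file pure ASCII, so the literals should come out the same whatever encoding the compiler assumes. The class name AccentTable and the method fold are hypothetical names chosen for illustration, and the table covers only the four letters mentioned in this thread.

```java
final class AccentTable {
    // Escapes for U+00CA, U+00CB, U+00C8, U+00C9 (E with circumflex,
    // diaeresis, grave, acute). Hypothetical table, four letters only.
    static final String FIND = "\u00CA\u00CB\u00C8\u00C9";
    static final String REPL = "EEEE";

    // Replaces every character of FIND with the character at the same
    // index in REPL.
    static String fold(String s) {
        if (FIND.length() != REPL.length()) {
            throw new IllegalStateException("FIND and REPL must have equal length");
        }
        for (int i = 0; i < FIND.length(); i++) {
            s = s.replace(FIND.charAt(i), REPL.charAt(i));
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(FIND.length());      // prints 4
        System.out.println(fold("\u00CAtre"));  // prints Etre
    }
}
```

Checking the two lengths once up front, instead of silently looping to the shorter one, makes a mismatched table fail loudly.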


 
thomers1 Author Commented:
thanks! :-)
 
CEHJ Commented:
:-)
 
thomers1 Author Commented:
hmm, seems i was too fast closing this question

in your example, the length of find is 8, while repl is 4 - which means it doesn't work
 
CEHJ Commented:
Well - obviously you need to ensure they're the same length ;-)
 
thomers1 Author Commented:
nope, what i meant is: if i use your example from above with 4 characters each, find.length() returns 8, while repl.length() returns 4.

find = "ÊËÈÉ";
repl = "EEEE";

 
CEHJ Commented:
To be on the safe side:
for (int i = 0; i < Math.min(find.length(), repl.length()); i++)

 
thomers1 Author Commented:
i think, the first line does not initialize the find string as UTF-8 encoded.

if i iterate over the characters of find, i get this (see code).

obviously, this can't be used to compare to the repl string.




0: 
1: ä
2: 
3: ã
4: 
5: à
6: 
7: â
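
The doubled length is consistent with a UTF-8 source file being read with a single-byte encoding: each of these letters takes two bytes in UTF-8, and a single-byte charset turns every byte into its own char. The sketch below reproduces that effect at runtime. It uses ISO-8859-1 as a stand-in because a MacRoman charset may not be present in every Java runtime, so the garbled characters it produces will differ from the listing above; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encodes the text as UTF-8, then decodes the bytes with a single-byte
    // charset, mimicking a compiler that reads a UTF-8 source file with the
    // wrong encoding. ISO-8859-1 is an assumption standing in for MacRoman.
    static String misread(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Escapes for the four accented E letters, so this file is pure ASCII.
        String find = "\u00CA\u00CB\u00C8\u00C9";
        System.out.println(find.length());           // prints 4
        System.out.println(misread(find).length());  // prints 8
    }
}
```

If this is what is happening, telling the compiler the real encoding of the file (for example `javac -encoding UTF-8`) or switching to Unicode escapes should bring the length back to 4.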

 
CEHJ Commented:
Can you show me the code you're running? Also, using the following, please tell me the result of passing 'file.encoding' as a parameter to it

http://technojeeves.com/joomla/index.php/free/54-javasystemproperties
 
thomers1 Author Commented:
aah, file.encoding is "MacRoman"



final String prepare_find = "ÊËÈÉ";
final String prepare_repl = "eeee";
 
System.out.println("find: " + prepare_find.length());
System.out.println("repl: " + prepare_repl.length());
 
for (int i=0; i<prepare_find.length(); i++) {
   System.out.println(i + ": " + prepare_find.charAt(i));
}

 
CEHJ Commented:
Ah OK. Refer to the MacRoman chart for your default chars. You might be better to install a full UTF-8 locale in the end
 
thomers1 Author Commented:
"You might be better to install a full UTF-8 locale in the end"
How do i do that on Mac OS X?

 
CEHJ Commented:
Don't know i'm afraid. I've never been a Mac user, but i'm assuming its latest incarnations support a UTF-8 environment. Having said that, most of the exotic accented chars should be in MacRoman, since they appear in ISO8859-1:

http://technojeeves.com/joomla/index.php/free/48-iso8859-1
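
A different technique from the lookup table discussed in this thread, sketched here only as an option for the original question: java.text.Normalizer can decompose an accented letter into its base letter plus combining marks, and the marks can then be removed with a regex. This handles all the variants of a letter without listing them. The class and method names are hypothetical, and letters that have no decomposition (for example ø or ß) pass through unchanged, so a small table may still be needed for those.

```java
import java.text.Normalizer;

final class DiacriticStripper {
    // Decomposes each accented letter into base letter + combining marks
    // (NFD), then deletes the marks (Unicode category M).
    static String strip(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Escapes for the letters from this thread: four accented capital E
        // variants, then e grave, i acute, u acute.
        String input = "\u00CA\u00CB\u00C8\u00C9 \u00E8\u00ED\u00FA";
        System.out.println(strip(input)); // prints EEEE eiu
    }
}
```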