How to simplify / convert characters from I18N to ASCII

I have a case where I need to convert any I18N character within a sentence that has an ASCII equivalent into plain ASCII. For example, if I get the German city München, I would like it to be converted to Munchen.

I can come up with several re-mapping ideas, but I was wondering if there's either a standardized or open-source solution to this...
mreuringAsked:
CEHJCommented:
Actually it 'should' be Muenchen. Removing umlauts, otherwise, is trivial
Mick BarryJava DeveloperCommented:
The following should give you what you need:

http://java.sun.com/docs/books/tutorial/i18n/text/string.html
mreuringAuthor Commented:
@CEHJ You are right, and that would be one of the things I'd hoped to find packed into a nice open-source effort. Barring such a project, for now I'd rather convert to Munchen than use (for instance) M%fcnchen. Replacing specifically just umlauts would be trivial, but can the same be said for 'any character with umlauts, accents, etc.'? I can only imagine building a large static map and praying to god I didn't miss one...

@objects That was the first place I looked, and all it shows are conversions that preserve information, resulting in an encoded string, which is what I don't want. I need to lose information...
Mick BarryJava DeveloperCommented:
>  I can only imagine building a large static map and pray to god I didn't miss one....

That's probably what you need,
or a big regex to use with replaceAll().
I will ask around here and see if anyone has seen anything.
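The static-map idea can be sketched as follows. This is only an illustration: the class name `GermanTransliterator` and its table are hypothetical, deliberately incomplete, and cover just the German expansions mentioned earlier in the thread (u-umlaut to "ue" and so on). It also assumes precomposed input; a letter followed by a separate combining mark would pass through unchanged.

```java
import java.util.HashMap;
import java.util.Map;

class GermanTransliterator {
    // Hypothetical, incomplete table: German letters and their ASCII expansions.
    private static final Map<Character, String> MAP = new HashMap<>();
    static {
        MAP.put('\u00e4', "ae"); // a with diaeresis
        MAP.put('\u00f6', "oe"); // o with diaeresis
        MAP.put('\u00fc', "ue"); // u with diaeresis
        MAP.put('\u00c4', "Ae"); // A with diaeresis
        MAP.put('\u00d6', "Oe"); // O with diaeresis
        MAP.put('\u00dc', "Ue"); // U with diaeresis
        MAP.put('\u00df', "ss"); // sharp s
    }

    // Replace every mapped character; leave everything else untouched.
    static String transliterate(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String replacement = MAP.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("M\u00fcnchen")); // Muenchen
    }
}
```

The weakness is the one already raised: any character missing from the table silently survives, so the table has to be maintained per language.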
CEHJCommented:
>>Replacing specifially just umlauts would be trivial, but can the same be said for 'any character with umlauts, accents, etc..'?

Actually I should really qualify my comment about triviality. There can be complexities due to things like Unicode 'composed' encodings: a character can be written as a base letter followed by a separate combining diacritic character rather than simply appearing as one character.

Also, you might be safer just using the non-diacritic versions, as while Muenchen might be technically correct, you wouldn't want that happening in, say, Turkish

If you want to get into this seriously, you should look at something like http://site.icu-project.org/

Otherwise, your most efficient option would probably be to make a substituting FilterReader
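To make the point about composed characters concrete, here is a small sketch using `java.text.Normalizer` (available since Java 6). The class and method names are made up for illustration. It shows that the same visible word can be two different `String` values, which is why a naive character map can miss input that arrives in decomposed form.

```java
import java.text.Normalizer;

class ComposedVsDecomposed {
    // Precomposed form: a single code point, U+00FC (u with diaeresis).
    static final String COMPOSED = "M\u00fcnchen";
    // Decomposed form: plain 'u' followed by U+0308 (combining diaeresis).
    static final String DECOMPOSED = "Mu\u0308nchen";

    // True when both strings are canonically equivalent, i.e. equal after NFD.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        System.out.println(COMPOSED.equals(DECOMPOSED));                       // false
        System.out.println(COMPOSED.length() + " vs " + DECOMPOSED.length());  // 7 vs 8
        System.out.println(canonicallyEqual(COMPOSED, DECOMPOSED));            // true
    }
}
```

Normalizing to one form before any mapping step means both spellings are handled the same way.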

CEHJCommented:
>  I can only imagine building a large static map and pray to god I didn't miss one....

You'll probably be OK just confining yourself to anything > 0x7F in ISO-8859-1
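One way to see how big such a map would need to be, sketched under the assumption that `java.text.Normalizer` is available: walk the letters above 0x7F in ISO-8859-1, decompose each one, strip the combining marks, and list whatever is still non-ASCII. Those leftovers (sharp s and the AE ligature, for example) are the ones that would need hand-written entries. The class name `Latin1Survey` is hypothetical.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

class Latin1Survey {
    // Decompose (NFD) and drop combining marks (Unicode general category M).
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // Letters in 0x80..0xFF that are still non-ASCII after stripping marks.
    static List<Character> needManualMapping() {
        List<Character> result = new ArrayList<>();
        for (char c = 0x80; c <= 0xFF; c++) {
            if (!Character.isLetter(c)) {
                continue; // skip symbols, punctuation and control characters
            }
            String stripped = stripMarks(String.valueOf(c));
            if (!stripped.matches("\\p{ASCII}*")) {
                result.add(c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Print code point and Unicode name of each character left over.
        for (char c : needManualMapping()) {
            System.out.printf("U+%04X %s%n", (int) c, Character.getName(c));
        }
    }
}
```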
mreuringAuthor Commented:
@both Thanks for the suggestions; at least I have options now! I'm going to have a long read on the ICU project and then discuss with my team what we want to do. I will close this today; I'm just giving objects' various contacts a chance to respond.

@objects Thank you for asking around!

@CEHJ Thank you for pointing out Unicode 'composed'; I hadn't even thought of that yet :(
Mick BarryJava DeveloperCommented:
As there isn't really a standard mapping, the solution looks like it is going to depend on what you want to do with unmappable chars.
For example Java already has a CharsetEncoder class that sort of does what you want, just depends on how much control you want over the mapping.
http://java.sun.com/javase/6/docs/api/java/nio/charset/CharsetEncoder.html
CEHJCommented:
CharsetDecoder/CharsetEncoder aren't really appropriate - the problem is not encoding, but transformation
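A short sketch of what the encoder route actually produces may help here. It assumes the US-ASCII charset with its default replacement and uses `StandardCharsets` (Java 7 and later); the class name `AsciiEncoderDemo` is made up. The encoder can only replace or report an unmappable character, it does not pick a look-alike ASCII letter.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class AsciiEncoderDemo {
    // Encode to US-ASCII, substituting the encoder's replacement for anything unmappable.
    static String encodeToAscii(String text) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .onMalformedInput(CodingErrorAction.REPLACE);
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(text));
            // Decode again only so the result can be shown as a String.
            return StandardCharsets.US_ASCII.decode(bytes).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE actions; rethrown unchecked for simplicity.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeToAscii("M\u00fcnchen")); // M?nchen
    }
}
```

So the encoder gives control over what to do with unmappable characters, but the mapping from an accented letter to its base letter still has to come from somewhere else.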
mreuringAuthor Commented:
Mostly due to your remark concerning diacritic 'composition', I managed to google a shortcut using Normalizer and replaceAll.
This is good enough for now, and we're looking at improving it in the next release cycle by making use of ICU.
mreuringAuthor Commented:
For those interested:
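	import java.text.Normalizer;

	public static String simplify(final String text) {
		// Decompose each accented character into its base letter plus combining marks (NFD).
		final String normalised = Normalizer.normalize(text, Normalizer.Form.NFD);
		// Drop everything that is non-ASCII or not a word character (this also removes spaces and punctuation).
		final String simplified = normalised.replaceAll("[^\\p{ASCII}]|\\W", "");
		return simplified;
	}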
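Note that the `\W` alternative in the regex above also removes spaces and punctuation, so a whole sentence collapses into one word. If the sentence structure should survive, a variant along these lines might be closer to the original question. It is only a sketch: the name `simplifyKeepingSpaces` is hypothetical, and it relies on `java.text.Normalizer` with `Normalizer.Form.NFD` (Java 6 and later).

```java
import java.text.Normalizer;

class AsciiSimplifier {
    // Decompose, then drop only the non-ASCII characters (which includes the
    // combining marks produced by NFD). ASCII spaces and punctuation are kept.
    static String simplifyKeepingSpaces(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        System.out.println(simplifyKeepingSpaces("M\u00fcnchen ist sch\u00f6n!")); // Munchen ist schon!
        System.out.println(simplifyKeepingSpaces("Stra\u00dfe"));                  // Strae
    }
}
```

The second call shows the limit of this shortcut: characters with no canonical decomposition, such as the German sharp s, are simply dropped rather than mapped, which is where a hand-written table or ICU would have to take over.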

CEHJCommented:
:-)

Using a FilterReader really is very convenient. Try the partial (deals with 'i' and 'o' upper and lower case) implementation in the attached jar, using the code below
import net.proteanit.io.DiacriticRemovingReader;
import java.io.*;

public class RemoveDiacritics {
    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            usage();
            System.exit(1);
        }
        else {

            Reader in = new DiacriticRemovingReader(new FileReader(args[0]));
            int b = -1;
            while ((b = in.read()) > -1) {
                System.out.print((char)b);
            }

            in.close();
        }
    }

    private static void usage() {
        System.err.println("Usage: java RemoveDiacritics <file from which to remove diacritic characters, substituting 'plain' ones and printing to stdout>");
    }
}

rem-diacrits.zip
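The attached implementation is not reproduced in this thread. For readers without the jar, a substituting `FilterReader` could look roughly like the sketch below. This is not the `DiacriticRemovingReader` from the attachment: the class name `SubstitutingReader`, its mapping, and the `filter` helper are all hypothetical, and like the attachment it is described as partial, it only handles some precomposed variants of 'i' and 'o' in upper and lower case.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class SubstitutingReader extends FilterReader {
    SubstitutingReader(Reader in) {
        super(in);
    }

    // Hypothetical one-to-one mapping for accented 'i' and 'o' (precomposed forms only).
    static char substitute(char c) {
        switch (c) {
            case '\u00ec': case '\u00ed': case '\u00ee': case '\u00ef': return 'i';
            case '\u00cc': case '\u00cd': case '\u00ce': case '\u00cf': return 'I';
            case '\u00f2': case '\u00f3': case '\u00f4': case '\u00f5': case '\u00f6': return 'o';
            case '\u00d2': case '\u00d3': case '\u00d4': case '\u00d5': case '\u00d6': return 'O';
            default: return c;
        }
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        return c < 0 ? c : substitute((char) c); // pass end-of-stream (-1) through
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        // When n is -1 the loop body never runs.
        for (int i = off; i < off + n; i++) {
            buf[i] = substitute(buf[i]);
        }
        return n;
    }

    // Convenience helper for trying it out: run a whole string through the reader.
    static String filter(String text) {
        try (Reader r = new SubstitutingReader(new StringReader(text))) {
            StringBuilder out = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the substitution is one character in, one character out, this style of reader cannot expand a character into two (for example "ue") without extra buffering.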
Mick BarryJava DeveloperCommented:
>  the problem is not encoding, but transformation

by definition, encoding *is* a transformation :)
