Reversing corrupted UTF8 Strings

Quite likely I can't find anything because I'm using the wrong keywords, but I'm looking for a tool or piece of Java code to retrieve information from corrupted UTF-8 strings.
In an internationalised application that I've become responsible for, every now and again a new form gets added. Quite often people don't realise the implications of international forms and just assume that a form will work for any language it is filled in with. Of course this is not so: for Poland and Russia, for instance, some characters, if not all, have to be stored as UTF-8. This does not always happen, but the resulting corruption of the string follows a predictable pattern. As far as I know that means it should be a reversible process. I am, however, at a loss and short of time to figure it out myself, and was hoping somebody, somewhere might have a simple piece of code to 'fix' a corrupted string.
I was hoping it'd be as simple as taking the characters in pairs and creating a new UTF-8 character by merging the char codes of the two characters into one char code; however, my attempts at doing so fail horribly...
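
Editorial note: merging the two char codes directly cannot work, because each byte of a UTF-8 sequence carries marker bits that have to be masked off before the payload bits are combined. The sketch below shows that step for a two-byte sequence only. The class and method names are hypothetical and chosen for illustration; it assumes each corrupted char holds exactly one original byte (0x00 to 0xFF).

```java
public class Utf8PairDemo {
    // Hypothetical helper: combines a lead byte (110xxxxx) and a continuation
    // byte (10xxxxxx) into the code point they encode.
    public static int mergePair(int lead, int continuation) {
        if ((lead & 0xE0) != 0xC0 || (continuation & 0xC0) != 0x80) {
            throw new IllegalArgumentException("not a two-byte UTF-8 sequence");
        }
        // 5 payload bits from the lead byte, 6 from the continuation byte
        return ((lead & 0x1F) << 6) | (continuation & 0x3F);
    }

    public static void main(String[] args) {
        // U+00D0 and U+00A4 are the two chars a Cyrillic capital Ef (U+0424)
        // turns into when its UTF-8 bytes are read as ISO-8859-1.
        int codePoint = mergePair(0x00D0, 0x00A4);
        System.out.println(Integer.toHexString(codePoint)); // prints 424
    }
}
```

Longer sequences follow the same idea with more continuation bytes, which is why letting the platform's UTF-8 decoder do the work is usually simpler than hand-merging pairs.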

Hoping you guys can come up with something,

 Martin
mreuringAsked:

Venci75Commented:
How do these UTF-8 strings get corrupted? Do you use HTML forms for filling in these strings?
mreuringAuthor Commented:
Yes, a form with some message bundles supplying validation and labels is built in JSP and posts to a Struts action. Depending on how carefully the developer creates everything in UTF-8 encoding, and on the platform settings, this might work without any intervention. As this turned out to malfunction on live servers, we're using a default filter that forces all requests to be handled as UTF-8 encoded. However, I now have on my hands a few weeks' worth of gathered information that didn't get filtered, and so I believe the request was handled as ANSI-encoded where it was filled with UTF-8 information; the result is corrupted information. I'm looking for a way to reverse this process.
As an example, save the following as a standard Windows-encoded HTML file and load it in a browser that allows you to switch encoding:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
      <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
      <title> new document </title>
</head>
<body>
ФамилиÑ
</body>
</html>

I used Mozilla/Firefox to switch the encoding to UTF-8 and the result was the original Russian text that had been mangled.
Venci75Commented:
>>>> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
In that case you should read your strings as iso-8859-1 encoded.
mreuringAuthor Commented:
Well, don't take this small simulation too literally; it's a test case in which the process gets reversed. What I'm trying to find out is how to do this using some Java code. Reading the strings in ISO-8859-1 encoding just results in the above. How would I force the string to 'convert' to UTF-8 without actually changing the bytes, so that the characters revert to their former Russian glory...
mightyoneCommented:
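For example, converting a string from one encoding to another is quite easy:

// needs: import java.io.UnsupportedEncodingException;
try
{
    String original = "ФамилиÑ";      // your corrupted String
    // Recover the raw bytes that were wrongly decoded as ISO-8859-1
    byte[] obytes = original.getBytes("ISO-8859-1");
    // Decode those bytes as UTF-8. new String(obytes) without a charset name
    // would use your default system encoding instead.
    String utf8 = new String(obytes, "UTF-8");
}
catch (UnsupportedEncodingException uee) {}   // both charset names are required on every Java platform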
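
Editorial note: a self-contained sketch of the same idea, which first reproduces the corruption and then reverses it so the round trip can be checked. It assumes Java 7 or later for `StandardCharsets`, and it assumes the data really was decoded as ISO-8859-1. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    // Simulates the corruption: UTF-8 bytes wrongly decoded as ISO-8859-1.
    public static String corrupt(String original) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Reverses it: recover the original bytes, then decode them as UTF-8.
    public static String repair(String corrupted) {
        byte[] rawBytes = corrupted.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The Russian word from the HTML sample, written with Unicode escapes
        String original = "\u0424\u0430\u043c\u0438\u043b\u0438\u044f";
        String corrupted = corrupt(original);
        String repaired = repair(corrupted);
        System.out.println(corrupted.length());        // prints 14 (two chars per Cyrillic letter)
        System.out.println(repaired.equals(original)); // prints true
    }
}
```

Using `Charset` constants instead of charset names avoids the checked `UnsupportedEncodingException` altogether.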
WebstormCommented:
Hi mreuring,

Try this method:

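    // Treats every char of s as one byte of a UTF-8 sequence and rebuilds the
    // characters those bytes encode. Chars above 0xFF are not expected here.
    public static String decodeUTF8(String s)
    {
        // sumb accumulates the code point, more counts the continuation bytes still expected
        int l=s.length(),b,sumb=0,i,more=-1;
        StringBuffer sbuf=new StringBuffer(l);
        for (i=0;i<l;i++)
        {
            b=s.charAt(i);
            if ((b&0xc0)==0x80)          // 10xxxxxx: continuation byte, adds 6 bits
            {
                sumb=(sumb<<6)|(b&0x3f);
                // sequence complete; note the cast truncates code points above U+FFFF
                if (--more==0) sbuf.append((char)sumb);
            }
            else if ((b&0x80)==0x00)     // 0xxxxxxx: plain ASCII, copied as is
            {
                sbuf.append((char)b);
            }
            else if ((b&0xe0)==0xc0)     // 110xxxxx: lead byte of a 2-byte sequence
            {
                sumb=b&0x1f;
                more=1;
            }
            else if ((b&0xf0)==0xe0)     // 1110xxxx: lead byte of a 3-byte sequence
            {
                sumb=b&0x0f;
                more=2;
            }
            else if ((b&0xf8)==0xf0)     // 11110xxx: lead byte of a 4-byte sequence
            {
                sumb=b&0x07;
                more=3;
            }
            else if ((b&0xfc)==0xf8)     // 111110xx: 5-byte lead (obsolete form, no longer valid UTF-8)
            {
                sumb=b&0x03;
                more=4;
            }
            else // if ((b & 0xfe) == 0xfc)   6-byte lead (obsolete form)
            {
                sumb=b&0x01;
                more=5;
            }
        }
        return sbuf.toString();
    }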

mightyoneCommented:
???

what should that do? it looks like the wrong way round?

by the way, check ibm.com for ICU

good luck
CEHJCommented:
You can only reverse encoding when the input into the original was valid in the first place
Venci75Commented:
When you get the strings in your servlet/JSP, they are already corrupted, as CEHJ said. You must either change the encoding of your page or the encoding of your VM in order to prevent this.
mreuringAuthor Commented:
I'll try to address the short remarks here first:
CEHJ - Up to the point where a servlet started accessing the request as ANSI or ISO encoded, nothing was wrong with the encoding. That's why in general we now just use a filter to set the request's encoding to UTF-8 before any servlet has access to it. The above HTML snippet shows that Mozilla, at least, is able to reverse the corrupted characters to their former Russian UTF-8 characters.

Venci75 - Changing the encoding of the page doesn't always work, and changing the encoding of the VM is not an option. As mentioned above, we have chosen to change the encoding of the request instead.

However, I would like to stress that I am not interested in the correct handling of the incoming data at the servlet level; I'm looking for a way to correct existing data. I will try and test mightyone's suggestion first and will get back on that.

  Martin
Venci75Commented:
What do you mean by "Changing the encoding of the page doesn't always work"?
As I said, when you receive the Strings in your servlet they are already corrupted!
CEHJCommented:
What I meant earlier is the following: if Java gets 'garbage in' in terms of characters, it will often decode anything it can't handle as '?'. When that happens, there is no way in which you can derive the original data from the result of the decoding.
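
Editorial note: one way to tell the recoverable case from the lossy one is to attempt the repair with a strict decoder and reject anything that does not come out as well-formed UTF-8. The sketch below assumes the corruption was a UTF-8 to ISO-8859-1 misread. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictRepair {
    // Returns the repaired text, or null when the input cannot be the result
    // of UTF-8 bytes having been decoded as ISO-8859-1.
    public static String repairOrNull(String suspect) {
        // A char above U+00FF cannot have come out of an ISO-8859-1 decode.
        if (!StandardCharsets.ISO_8859_1.newEncoder().canEncode(suspect)) {
            return null;
        }
        byte[] raw = suspect.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // REPORT makes the decoder throw instead of substituting a
            // replacement character for malformed input.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return null; // bytes were lost or replaced, e.g. by '?'
        }
    }
}
```

A null result means the string should be left alone or inspected by hand. Plain ASCII passes through unchanged, and a genuine ISO-8859-1 string whose bytes happen to form valid UTF-8 would still be converted, so this check is a heuristic rather than a proof.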
mreuringAuthor Commented:
Venci - It receives requests from a client as UTF-8 when the page encoding is properly set; however, it does not always handle them accordingly. I am increasingly frustrated by Java's seeming inability to detect encodings, but again, this was just the cause of our corrupted data and it's already fixed; it won't happen again. I just needed a way to convert our already corrupted data back.

CEHJ - I know that sometimes it is quite incapable of handling the data and this will result in useless data, lacking the full UTF-8 bytes. However, in a post above http:#12484557 I have supplied an extract of one of these strings, and you can clearly recognise a pattern in these corrupted snippets of data.

Mightyone - Your code example has been proven to work in a small Swing application I have built for test purposes (http:#12486429). So that resolves it. Reading the bytes and subsequently creating a new string from those bytes while explicitly setting its encoding to UTF-8 is the trick I was looking for. Thank you so very much.

Martin
CEHJCommented:
So, by the accepted answer, the question really seems to have been 'how do i convert between character encodings'? (if that inline String is not encodable as iso-8859-1, the code will fail)
mreuringAuthor Commented:
The problem with the wording you choose, and I'm wondering whether you misunderstand the above solution, is that most people will then try to provide a way of converting the string that alters the bytes of said string. That's not what I was looking for.
The reason I'm wondering whether you grasp this distinction is that you say 'if that inline String is not encodable as iso-8859-1, the code will fail', whereas all the strings I have were interpreted as ISO-8859-1 while the actual data was UTF-8 encoded. Taking the above-mentioned string as an example (let's see if EE can handle UTF-8 while we're at it):

The original string: Фамили
The result of misinterpreted request: Фамили

As you can see, the resulting string has an obvious repetition (Ð appears as every other character); it is the first byte of a two-byte UTF-8 character, in particular of those characters most often found in the Russian language. It's just a matter of grabbing the bytes and telling Java that they're not ISO-8859-1 but UTF-8 instead. When using mightyone's almost laughably simple piece of code that's exactly what happens, and I'm inclined to say that it will always work for this particular form of encoding corruption.
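
Editorial note: the "will always work" claim rests on ISO-8859-1 mapping each of the 256 byte values to its own character, so that decoding and re-encoding loses nothing. The sketch below checks that property. The class and method names are hypothetical. If the text was instead decoded with windows-1252, which is what "ANSI" often means on Windows, the guarantee may not hold, since that code page leaves a few byte values undefined.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1RoundTrip {
    // True when every byte value survives decode-then-encode with ISO-8859-1.
    public static boolean allBytesSurvive() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String decoded = new String(all, StandardCharsets.ISO_8859_1);
        byte[] back = decoded.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(all, back);
    }

    public static void main(String[] args) {
        System.out.println(allBytesSurvive()); // prints true
    }
}
```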

All I need to figure out is how to word it so that everybody actually understands what happened,

 Martin
