Solved

Reversing corrupted UTF8 Strings

Posted on 2004-11-03
Last Modified: 2012-08-14
Quite likely I can't find anything because I'm using the wrong keywords, but I'm looking for a tool or piece of Java code to recover information from corrupted UTF-8 strings.
In an internationalised application that I've become responsible for, every now and again a new form gets added. Quite often people don't realise the implications of international forms and just assume that a form will work for any language it is filled in with. Of course this is not so: for Poland and Russia, for instance, some characters, if not all, have to be stored as UTF-8. This does not always happen, but the resulting corruption of the string follows a predictable pattern. As far as I know, that means it should be a reversible process. I am, however, at a loss and short of time to figure it out myself, and was hoping somebody, somewhere, might have a simple piece of code to 'fix' a corrupted string.
I was hoping it'd be as simple as taking the characters in pairs and creating a new UTF-8 character by merging the char codes of the two characters into one char code; however, my attempts at doing so fail horribly...

Hoping you guys can come up with something,

 Martin
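
A note on the pair-merging idea: simply gluing the two char codes together (for example 0xD0 and 0xA4 into 0xD0A4) cannot work, because each UTF-8 byte carries marker bits (110xxxxx for the lead byte, 10yyyyyy for the continuation byte) that have to be stripped before the payload bits are joined. The following is a minimal sketch of that arithmetic for two-byte sequences only; the class and method names are made up for illustration and are not taken from the thread.

```java
final class Utf8PairMerge {
    /**
     * Combines the lead and continuation byte of a two-byte UTF-8 sequence
     * (110xxxxx 10yyyyyy) into the character with the bits xxxxxyyyyyy.
     * Each argument is a char holding one raw byte value (0-255).
     */
    static char mergePair(char lead, char continuation) {
        boolean validLead = lead <= 0xFF && (lead & 0xE0) == 0xC0;
        boolean validContinuation = continuation <= 0xFF && (continuation & 0xC0) == 0x80;
        if (!validLead || !validContinuation) {
            throw new IllegalArgumentException("not a two-byte UTF-8 sequence");
        }
        // Drop the marker bits, then join 5 payload bits with 6 payload bits.
        return (char) (((lead & 0x1F) << 6) | (continuation & 0x3F));
    }

    public static void main(String[] args) {
        // Bytes D0 A4 are the UTF-8 form of U+0424 (Cyrillic capital Ef).
        char merged = mergePair('\u00D0', '\u00A4');
        System.out.println(Integer.toHexString(merged)); // prints 424
    }
}
```

This only covers the two-byte case that dominates Cyrillic text; one-, three- and four-byte sequences need their own masks.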
Question by:mreuring
15 Comments
 
LVL 9

Expert Comment

by:Venci75
ID: 12483968
How do these UTF-8 strings get corrupted? Do you use HTML forms for filling in these strings?
 
LVL 17

Author Comment

by:mreuring
ID: 12484557
Yes, a form with some message bundles supplying validation and labels is built in JSP and posts to a Struts action. Depending on how accurately the developer creates everything in UTF-8 encoding, and on the platform settings, this might work without any intervention. As this turned out to malfunction on live servers, we're now using a default filter that forces all requests to be handled as UTF-8 encoded. However, I now have on my hands a few weeks' worth of gathered information that didn't get filtered, so I believe the request was handled as ANSI-encoded while it was filled with UTF-8 information; the result is corrupted information. I'm looking for a way to reverse this process.
As an example, save the following as a standard Windows-encoded HTML file and load it in a browser that allows you to switch encoding:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
      <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
      <title> new document </title>
</head>
<body>
ФамилиÑ
</body>
</html>

I used Mozilla/Firefox to switch the encoding to UTF-8, and the result was the original Russian text that had been mangled.
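
The corruption described in this post can be reproduced in a few lines of Java. This is only a sketch of the suspected mechanism (UTF-8 bytes decoded as ISO-8859-1), not the actual servlet code path; the class and method names are hypothetical, and it assumes Java 7 or later for `StandardCharsets`. Unicode escapes are used so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    /** Simulates the bug: the UTF-8 bytes of the text are read back as ISO-8859-1. */
    static String misdecode(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Seven Cyrillic letters (the Russian word for "surname").
        String original = "\u0424\u0430\u043C\u0438\u043B\u0438\u044F";
        String corrupted = misdecode(original);
        System.out.println(original.length());  // prints 7
        System.out.println(corrupted.length()); // prints 14, one char per UTF-8 byte
        // The first char is U+00D0, the lead byte that keeps repeating in the sample.
        System.out.println(corrupted.charAt(0) == '\u00D0'); // prints true
    }
}
```

Because every byte value maps to exactly one ISO-8859-1 character, no information is lost at this step, which is why the pattern stays predictable.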
 
LVL 9

Expert Comment

by:Venci75
ID: 12484846
>>>> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
In that case you should read your strings as iso-8859-1 encoded.
 
LVL 17

Author Comment

by:mreuring
ID: 12486085
Well, don't take this small simulation too literally; it's a test case in which the process gets reversed. What I'm trying to find out is how to do this in Java code. Reading the strings in ISO-8859-1 encoding just results in the above. How would I force the string to 'convert' to UTF-8 without actually changing the bytes, so that the characters revert to their former Russian glory...
 
LVL 6

Accepted Solution

by:
mightyone earned 1000 total points
ID: 12486429
Converting a string from one encoding to another is quite easy, e.g.:

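import java.io.UnsupportedEncodingException;

try
{
    String original = "ФамилиÑ";      // your corrupted String
    // ISO-8859-1 maps each char 0-255 to exactly one byte, so this recovers the raw bytes
    byte[] obytes = original.getBytes("ISO-8859-1");
    // Name the charset explicitly; new String(obytes) would use your default system encoding
    String utf8 = new String(obytes, "UTF-8");
}
catch (UnsupportedEncodingException uee)
{
    // Not expected: every Java platform is required to support ISO-8859-1 and UTF-8
}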
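
For readers on a newer JDK, the same idea can be written as a self-contained class. This is a sketch rather than the poster's exact code: it assumes Java 7 or later, where the `StandardCharsets` constants remove the need for the checked exception, and the class and method names are invented for illustration.

```java
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    /** Reinterprets a string whose chars are really UTF-8 bytes, one byte per char. */
    static String repair(String corrupted) {
        // Step 1: get the original bytes back (lossless for chars 0-255).
        byte[] rawBytes = corrupted.getBytes(StandardCharsets.ISO_8859_1);
        // Step 2: decode those same bytes with the charset they were written in.
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Bytes D0 A4 D0 B0 D0 BC, written as escapes so the file encoding does not matter.
        String corrupted = "\u00D0\u00A4\u00D0\u00B0\u00D0\u00BC";
        String fixed = repair(corrupted);
        System.out.println(fixed.equals("\u0424\u0430\u043C")); // prints true
    }
}
```

Plain ASCII text passes through unchanged, since ASCII bytes mean the same thing in both encodings.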
 
LVL 13

Expert Comment

by:Webstorm
ID: 12487196
Hi mreuring,

Try this method:

    // Treats each char of s as one raw UTF-8 byte (0-255) and decodes the byte sequence by hand.
    public static String decodeUTF8(String s)
    {
        int l=s.length(),b,sumb=0,i,more=-1;
        StringBuffer sbuf=new StringBuffer(l);
        for (i=0;i<l;i++)
        {
            b=s.charAt(i);
            if ((b&0xc0)==0x80)
            {
                sumb=(sumb<<6)|(b&0x3f);
                if (--more==0) sbuf.append((char)sumb);
            }
            else if ((b&0x80)==0x00)
            {
                sbuf.append((char)b);
            }
            else if ((b&0xe0)==0xc0)
            {
                sumb=b&0x1f;
                more=1;
            }
            else if ((b&0xf0)==0xe0)
            {
                sumb=b&0x0f;
                more=2;
            }
            else if ((b&0xf8)==0xf0)
            {
                sumb=b&0x07;
                more=3;
            }
            else if ((b&0xfc)==0xf8)
            {
                sumb=b&0x03;
                more=4;
            }
            else // if ((b & 0xfe) == 0xfc)
            {
                sumb=b&0x01;
                more=5;
            }
        }
        return sbuf.toString();
    }

 
LVL 6

Expert Comment

by:mightyone
ID: 12487865
???

what should that do? it looks like the wrong way round?

by the way, check ibm.com for ICU

good luck
 
LVL 86

Expert Comment

by:CEHJ
ID: 12488643
You can only reverse encoding when the input into the original was valid in the first place
 
LVL 9

Expert Comment

by:Venci75
ID: 12490839
When you get the strings in your servlet/JSP, they are already corrupted, as CEHJ said. You must change either the encoding of your page or the encoding of your VM in order to prevent this.
 
LVL 17

Author Comment

by:mreuring
ID: 12491071
I'm trying to address the short remarks here first:
CEHJ - Up to the point where a servlet started accessing the request as being ANSI or ISO encoded, nothing was wrong with the encoding. That's why in general we now just use a filter to set the request's encoding to UTF-8 before any servlet has access to it. The above HTML snippet shows that Mozilla, at least, is able to reverse the corrupted characters to their former Russian UTF-8 characters.

Venci75 - Changing the encoding of the page doesn't always work, and changing the encoding of the VM is not an option. As mentioned above, we have chosen to change the encoding of the request instead.

However, I would like to stress that I am not interested in the correct handling of the incoming data at the servlet level; I'm looking for a way to correct existing data. I will test mightyone's suggestion first and will get back on that.

  Martin
 
LVL 9

Expert Comment

by:Venci75
ID: 12491102
What do you mean by "Changing the encoding of the page doesn't always work"?
As I said - when you receive the Strings in your servlet, they are already corrupted!
 
LVL 86

Expert Comment

by:CEHJ
ID: 12491489
What I meant earlier is the following: if Java gets 'garbage in' in terms of characters, it will often decode anything it can't handle as '?'. When that happens, there is no way in which you can derive the original data from the result of the decoding.
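
The lossy case described here is easy to demonstrate in the encoding direction. A small sketch, assuming Java 7 or later and using hypothetical names: when a string that contains characters outside ISO-8859-1 is encoded with `String.getBytes`, each unmappable character is replaced by the charset's replacement byte, which is '?' for ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

final class LossyEncodingDemo {
    /** Encodes to ISO-8859-1 and decodes again; unmappable characters do not survive. */
    static String roundTripLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String cyrillic = "\u0424\u0430\u043C"; // three Cyrillic letters
        System.out.println(roundTripLatin1(cyrillic)); // prints ???
    }
}
```

Once the data has been reduced to question marks like this, the original bytes are gone. A string that only contains chars in the range 0-255 does not hit this case.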
 
LVL 17

Author Comment

by:mreuring
ID: 12492023
Venci - It receives requests from a client as UTF-8 when the page encoding is properly set; however, it does not always handle them accordingly. I am increasingly frustrated by Java's seeming inability to detect encodings, but again, this was just the cause of our corrupted data and it's already fixed; it won't happen again. I just needed a way to convert our already corrupted data back.

CEHJ - I know that sometimes it is quite incapable of handling the data and this will result in useless data that lacks the full UTF-8 byte sequence. However, in a post above (http:#12484557) I supplied an extract of one of these strings, and you can clearly recognise a pattern in these corrupted snippets of data.

Mightyone - Your code example has been proven to work in a small Swing application I built for test purposes (http:#12486429). So that resolves it. Reading the bytes and subsequently creating a new string from those bytes while explicitly setting its encoding to UTF-8 is the trick I was looking for. Thank you very much.

Martin
 
LVL 86

Expert Comment

by:CEHJ
ID: 12492136
So, by the accepted answer, the question really seems to have been 'how do i convert between character encodings'? (if that inline String is not encodable as iso-8859-1, the code will fail)
 
LVL 17

Author Comment

by:mreuring
ID: 12492242
The problem with the wording you chose, and I'm wondering whether you misunderstand the above solution, is that most people will then try to provide a way of converting the string that alters the bytes of said string. That's not what I was looking for.
The reason I'm wondering whether you grasp this distinction is that you say 'if that inline String is not encodable as iso-8859-1, the code will fail', whereas all the strings I have were interpreted as ISO-8859-1 while the actual data was UTF-8 encoded. Taking the above-mentioned string as an example (let's see if EE can handle UTF-8 while we're at it):

The original string: Фамили
The result of misinterpreted request: Фамили

As you can see, there's an obvious repetition in the resulting string (Ð appears as every other character). This denotes the first byte of a two-byte UTF-8 character, particularly of those characters most often found in the Russian language. It's just a matter of grabbing the bytes and telling Java that they are not ISO-8859-1 but UTF-8 instead. With mightyone's almost laughably simple piece of code that's exactly what happens, and I'm inclined to say that it will always work for this particular form of encoding corruption.

All I need to figure out is how to word it so that everybody actually understands what happened,

 Martin
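
When a stored data set mixes corrupted and correct rows, blindly reinterpreting everything would damage the rows that were fine. One possible guard, sketched below, only repairs strings that look like misread UTF-8. It assumes the misreading was ISO-8859-1 (not Windows-1252, which would produce chars above 0xFF for some bytes) and Java 7 or later; the class and method names are hypothetical, and this is a heuristic, not a guarantee.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class GuardedRepair {
    /** Returns the repaired text, or the input unchanged if it does not look like misread UTF-8. */
    static String repairIfCorrupted(String text) {
        boolean sawHighByte = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0xFF) {
                return text; // cannot be the result of an ISO-8859-1 misreading
            }
            if (c >= 0x80) {
                sawHighByte = true;
            }
        }
        if (!sawHighByte) {
            return text; // pure ASCII is identical in both encodings
        }
        byte[] rawBytes = text.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // A strict decoder throws instead of silently substituting characters.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(rawBytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return text; // the bytes are not valid UTF-8, so leave the data alone
        }
    }
}
```

A genuine ISO-8859-1 string such as "café" is left untouched because its bytes do not form valid UTF-8, and a string that was truncated in the middle of a two-byte sequence is also returned unchanged rather than partially repaired.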
