ayeen asked:
Convert files in any Unicode encoding to UTF-8 in Java


We have an upload mechanism in our web app, and everything is fine as long as the user uploads a plain file. This is what the relevant part of the code looks like:


// raw uploaded file (Struts FormFile, defined in the form bean)
FormFile myFile = myForm.getTheFile();

// final product should be a BufferedReader
// (note: no charset is given here, so the platform default encoding is used)
BufferedReader read = new BufferedReader(new InputStreamReader(myFile.getInputStream()));


My question is: how do you convert files to UTF-8 regardless of their encoding? Here are some of the extension names I have to handle (yes, I also have to deal with both big- and little-endian byte orders):

utxt
utf8
utf-8  
utf16  
utf-16  
utf-16le  
utf-16be  
ucs2
ucs-2
ucs-2le
ucs-2be
cs2le


Now, somewhere I have seen the code below, but it only converts an ISO-8859-1 text file to UTF-8:

String asString = new String(contents, "ISO8859_1"); // decode the raw bytes as ISO-8859-1 (plain text file)
byte[] newBytes = asString.getBytes("UTF8");         // re-encode the string as UTF-8


And I have seen this from the java site:

http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html


Based on the info at that URL, does that mean I just have to change "ISO8859_1" to "UnicodeBig", "UTF-16", etc., depending on what encoding my raw file is in? Like:

//if Sixteen-bit Unicode Transformation Format, little-endian byte order, with byte-order mark
String asString = new String(contents, "UnicodeLittle");
byte[] newBytes = asString.getBytes("UTF8");


If that is all I have to do, how do I turn the byte[] newBytes into a BufferedReader? A BufferedReader has to be my final product after converting the files to UTF-8.
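For reference, the usual way to get from a byte[] back to a BufferedReader is to wrap the bytes in a ByteArrayInputStream and decode them with an InputStreamReader that is told to use UTF-8. A minimal sketch (the class and method names are made up for illustration; newBytes stands for the converted bytes from the snippet above):

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

// Minimal sketch: wrap already-converted UTF-8 bytes back into a BufferedReader.
public class BytesToReader {
    static BufferedReader toUtf8Reader(byte[] newBytes) throws IOException {
        return new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(newBytes), "UTF8"));
    }
}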


Thanks in advance!
ASKER CERTIFIED SOLUTION
Mick Barry (Australia)
ayeen (ASKER):
Thanks, objects.

This is what I'm doing:

//-----------------------------------------------------------------------------------------------------
BufferedReader read = new BufferedReader(new InputStreamReader(myFile.getInputStream(), getEncodingTypeCode("utf16")));

ByteArrayOutputStream temp = new ByteArrayOutputStream();
BufferedWriter write = new BufferedWriter(new OutputStreamWriter(temp, "UTF8"));
//-----------------------------------------------------------------------------------------------------
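To make that snippet do something end to end, here is a hedged sketch of the copy loop between that reader and writer (the class and method names are made up for illustration; the reader/writer setup mirrors the snippet above):

import java.io.*;

// Sketch of the full read-decode-reencode loop: decode the upload with its
// source encoding, then let the OutputStreamWriter re-encode it as UTF-8.
public class TranscodeSketch {
    public static byte[] toUtf8(InputStream in, String sourceEncoding) throws IOException {
        BufferedReader read = new BufferedReader(new InputStreamReader(in, sourceEncoding));
        ByteArrayOutputStream temp = new ByteArrayOutputStream();
        BufferedWriter write = new BufferedWriter(new OutputStreamWriter(temp, "UTF8"));

        // copy character data line by line; note that readLine() drops the original
        // line terminators and newLine() writes the platform separator instead
        String line;
        while ((line = read.readLine()) != null) {
            write.write(line);
            write.newLine();
        }
        write.flush();
        return temp.toByteArray();
    }
}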




Is there a way to convert "any" Unicode file to UTF-8 without having to know beforehand what encoding the file uses?

Currently I just check the extension names, something like:

//-----------------------------------------------------------------------------------------------------
private static String getEncodingTypeCode(String extension_name) {

      if (extension_name.equalsIgnoreCase("utf8")) {
            return "UTF8";
      }
      if (extension_name.equalsIgnoreCase("utf-8")) {
            return "UTF8";
      }
      if (extension_name.equalsIgnoreCase("utf16")) {
            return "UTF-16";
      }
      if (extension_name.equalsIgnoreCase("utf-16")) {
            return "UTF-16";
      }
      if (extension_name.equalsIgnoreCase("utf-16le")) {
            return "UnicodeLittle";
      }
      if (extension_name.equalsIgnoreCase("utf-16be")) {
            return "UnicodeBig";
      }
      // unrecognized extension: fall back to an empty string and let the caller decide
      return "";
}
//-----------------------------------------------------------------------------------------------------


but that "if" condition can go on and on...i still need to add ucs, ucs2, ucs-2l2, Cp037, Cp273 etc (basically everything here: http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html)....

And aside from those "if" conditions, if it's a .txt file I also need to check the byte order mark to see whether it's a plain .txt file or a Unicode one...

So I have something like:

//-----------------------------------------------------------------------------------------------------
      // compare the first bytes of the file (as hex strings) against the known byte order marks
      // (values from java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html)
      if (first_byte.equalsIgnoreCase("FE") && sec_byte.equalsIgnoreCase("FF")) {
            return "UnicodeBig";      // UTF-16, big-endian BOM (FE FF)
      }
      else if (first_byte.equalsIgnoreCase("FF") && sec_byte.equalsIgnoreCase("FE")) {
            return "UnicodeLittle";   // UTF-16, little-endian BOM (FF FE)
      }
      else if (first_byte.equalsIgnoreCase("EF") && sec_byte.equalsIgnoreCase("BB") && third_byte.equalsIgnoreCase("BF")) {
            return "UTF8";            // UTF-8 BOM (EF BB BF)
      }
//-----------------------------------------------------------------------------------------------------
      

That would work fine, but a problem arises when the .txt file has Unicode characters but is unmarked (no byte order mark)...
And of course I also still need to check the BOM (byte order mark) for all the other extension names so I know what encoding to pass to my BufferedReader.
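For the BOM-marked cases, the checks above can be folded into one helper that both detects and consumes the mark, so the BOM never reaches the decoded text. This is a sketch of the idea, not part of the accepted answer; it assumes the upload stream is wrapped in a PushbackInputStream with a 3-byte pushback buffer, and as the reply below points out, files without a BOM still require knowing the encoding some other way:

import java.io.IOException;
import java.io.PushbackInputStream;

// Sketch: read up to three bytes, see whether they form a known BOM, and push back
// whatever does not belong to the BOM so the caller can keep reading from byte zero.
// Returns null when no BOM is found; a robust version would also loop until it has
// three bytes or hits end of stream.
public class BomSniffer {

    public static String sniffBom(PushbackInputStream in) throws IOException {
        byte[] head = new byte[3];
        int n = in.read(head, 0, 3);
        if (n <= 0) {
            return null; // empty stream
        }
        int b0 = head[0] & 0xFF;
        int b1 = n > 1 ? head[1] & 0xFF : -1;
        int b2 = n > 2 ? head[2] & 0xFF : -1;

        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
            return "UTF8";                        // UTF-8 BOM, all 3 bytes consumed
        }
        if (b0 == 0xFE && b1 == 0xFF) {
            if (n > 2) in.unread(head, 2, n - 2); // give back the byte after the BOM
            return "UnicodeBigUnmarked";          // UTF-16 big-endian, BOM already consumed
        }
        if (b0 == 0xFF && b1 == 0xFE) {
            if (n > 2) in.unread(head, 2, n - 2);
            return "UnicodeLittleUnmarked";       // UTF-16 little-endian, BOM already consumed
        }
        in.unread(head, 0, n);                    // no BOM: give everything back
        return null;
    }
}

Usage would be along the lines of PushbackInputStream in = new PushbackInputStream(myFile.getInputStream(), 3); String enc = BomSniffer.sniffBom(in); with the extension-based lookup as the fallback when enc comes back null.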

I'm curious whether there's a shorter way to convert any file to UTF-8 so I don't have to go through all this checking anymore and can just convert whatever file, marked or not, to UTF-8 without having to determine its encoding first.




objects: Not really, you need to know the encoding to be able to tell the Reader what to use.
ayeen (ASKER):
I see... Oh well, I guess I don't have a choice but to list each one of those encoding types.

Anyway, I tried your solution, but I noticed the BOM (byte order mark) is still there after I convert a file to UTF-8. Is there a way to remove the BOM before converting to UTF-8?

Thanks again!
objects: You need to remove it manually.
ayeen (ASKER):
Yeah, I figured there was no shortcut for that either...

For the sake of those reading this, here's how I removed the BOM:

// encoding_type can be UTF8, UnicodeBig, UnicodeLittle, etc.
BufferedReader read = new BufferedReader(new InputStreamReader(file.getInputStream(), encoding_type));
ByteArrayOutputStream temp = new ByteArrayOutputStream();
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(temp, "UTF8"));

// the BOM shows up at the start of the first line, so skip over it when writing
String line = read.readLine();

// offset = 1 for BE/LE and 2 for UTF8
writer.write(line, offset, line.length() - offset);
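For completeness, a slightly more defensive variant of the same idea (again an assumption of mine, not part of the accepted answer) strips the leading U+FEFF only when it is actually there, so unmarked files pass through untouched:

import java.io.*;

// Sketch: decode with the detected encoding, drop a leading U+FEFF (the BOM as a
// character) if one is present, and re-encode everything else as UTF-8.
public class BomStrippingConverter {

    public static byte[] convertToUtf8(InputStream in, String sourceEncoding) throws IOException {
        Reader reader = new InputStreamReader(in, sourceEncoding);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(out, "UTF8");

        int first = reader.read();
        // U+FEFF at the very start is a byte order mark, not content, so skip it
        if (first != -1 && first != 0xFEFF) {
            writer.write(first);
        }
        // copy the rest of the character stream unchanged
        char[] buf = new char[4096];
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);
        }
        writer.flush();
        return out.toByteArray();
    }
}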

ayeen (ASKER):
The solution is excellent but somewhat vague, so I had to do additional research and some code experiments to fill in the missing parts.