ayeen (United States of America) asked:

Convert files in any unicode encoding type to UTF-8 in Java


I have an upload mechanism in our web app, and everything is OK if the user just uploads a simple file. Here is how part of the code looks:


// raw file
FormFile myFile = myForm.getTheFile();//using struts..so this is defined somewhere in my form bean

// final product should be BufferedReader
BufferedReader read = new BufferedReader(new InputStreamReader(myFile.getInputStream()));


My question is: how do you convert files to UTF-8 regardless of their encoding? Here is a list of the extension names I have to handle (yes, I also have to handle big- and little-endian byte orders):

utxt
utf8
utf-8  
utf16  
utf-16  
utf-16le  
utf-16be  
ucs2
ucs-2
ucs-2le
ucs-2be
cs2le


Now somewhere, I have seen this code (see below) but this just converts a text file to UTF-8:

String asString = new String(contents, "ISO8859_1");//for text file
byte[] newBytes = asString.getBytes("UTF8");


And I have seen this from the java site:

http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html


Based on the info at that URL, does that mean I just have to change "ISO8859_1" to "UnicodeBig" or "UTF-16" etc., depending on what encoding my raw file is in?
Like:

//if Sixteen-bit Unicode Transformation Format, little-endian byte order, with byte-order mark
String asString = new String(contents, "UnicodeLittle");
byte[] newBytes = asString.getBytes("UTF8");


If that is all I have to do, how do I turn the byte[] newBytes into a BufferedReader? BufferedReader should be my final product after converting the files to UTF-8.
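To answer the cast question directly: a byte[] can't be cast to a BufferedReader, but it can be wrapped via ByteArrayInputStream. A minimal sketch (the class and method names here are mine, not from the thread):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class BytesToReader {
    // Wrap already re-encoded UTF-8 bytes in a BufferedReader.
    static BufferedReader toReader(byte[] utf8Bytes) throws IOException {
        return new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(utf8Bytes), "UTF8"));
    }

    public static void main(String[] args) throws IOException {
        byte[] newBytes = "hello".getBytes("UTF8");
        BufferedReader read = toReader(newBytes);
        System.out.println(read.readLine());
    }
}
```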


Thanks in advance!
Java · Java EE · JSP

ASKER CERTIFIED SOLUTION
Mick Barry

ayeen

ASKER
thanks objects..

this is what I'm doing:

//-----------------------------------------------------------------------------------------------------
BufferedReader read = new BufferedReader(new InputStreamReader(myFile.getInputStream(), getEncodingTypeCode("utf16")));

ByteArrayOutputStream temp = new ByteArrayOutputStream();
BufferedWriter write = new BufferedWriter(new OutputStreamWriter(temp, "UTF8"));
//-----------------------------------------------------------------------------------------------------
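The snippet above wires up the reader and writer but leaves out the copy loop; a minimal sketch of the full transcode under those assumptions (the method name and buffer size are mine):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

public class Transcode {
    // Decode the stream with the given source charset, re-encode as UTF-8.
    static byte[] toUtf8(InputStream in, String sourceEncoding) throws IOException {
        Reader read = new InputStreamReader(in, sourceEncoding);
        ByteArrayOutputStream temp = new ByteArrayOutputStream();
        Writer write = new OutputStreamWriter(temp, "UTF8");
        char[] buf = new char[4096];
        int n;
        while ((n = read.read(buf)) != -1) {
            write.write(buf, 0, n);
        }
        write.flush(); // push out any chars still buffered in the encoder
        return temp.toByteArray();
    }
}
```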




is there a way to convert "any" Unicode file to UTF-8 without having to know beforehand what encoding the file is in?

currently I just check the extension names, something like:

//-----------------------------------------------------------------------------------------------------
private static String getEncodingTypeCode (String extension_name){
      if(extension_name.equalsIgnoreCase("utf8") || extension_name.equalsIgnoreCase("utf-8"))
      {
            return "UTF8";
      }
      if(extension_name.equalsIgnoreCase("utf16") || extension_name.equalsIgnoreCase("utf-16"))
      {
            return "UTF-16";
      }
      if(extension_name.equalsIgnoreCase("utf-16le"))
      {
            return "UnicodeLittle";
      }
      if(extension_name.equalsIgnoreCase("utf-16be"))
      {
            return "UnicodeBig";
      }
      return ""; //no match found
}
//-----------------------------------------------------------------------------------------------------
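One way to keep that "if" chain from growing without bound is a lookup table. A hedged sketch (the class name and map entries are illustrative; in particular, mapping utxt and ucs2 to the generic "UTF-16" decoder, which detects the byte order from the BOM, is my assumption, not something from the thread):

```java
import java.util.HashMap;
import java.util.Map;

public class EncodingLookup {
    // Extension name (lowercase) -> Java charset name.
    private static final Map<String, String> ENCODINGS = new HashMap<String, String>();
    static {
        ENCODINGS.put("utxt", "UTF-16");          // assumption: BOM-marked Unicode text
        ENCODINGS.put("utf8", "UTF8");
        ENCODINGS.put("utf-8", "UTF8");
        ENCODINGS.put("utf16", "UTF-16");
        ENCODINGS.put("utf-16", "UTF-16");
        ENCODINGS.put("utf-16le", "UnicodeLittle");
        ENCODINGS.put("utf-16be", "UnicodeBig");
        ENCODINGS.put("ucs2", "UTF-16");          // assumption: treat UCS-2 as UTF-16
        ENCODINGS.put("ucs-2", "UTF-16");
        ENCODINGS.put("ucs-2le", "UnicodeLittle");
        ENCODINGS.put("ucs-2be", "UnicodeBig");
    }

    static String getEncodingTypeCode(String extension) {
        String enc = ENCODINGS.get(extension.toLowerCase());
        return enc != null ? enc : "ISO8859_1"; // fallback for plain text
    }
}
```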


but that "if" chain can go on and on... I still need to add ucs, ucs2, ucs-2le, Cp037, Cp273, etc. (basically everything here: http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html)....

and aside from those "if" conditions, if it's a .txt file I also need to check the byte order mark to see whether it's just a plain .txt file or a Unicode one...

so I have something like:

//-----------------------------------------------------------------------------------------------------
      //BOM values from java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html
      if(first_byte.equalsIgnoreCase("FE") && sec_byte.equalsIgnoreCase("FF")){
            return "UnicodeBig";    //UTF-16, big-endian BOM
      }
      else if (first_byte.equalsIgnoreCase("FF") && sec_byte.equalsIgnoreCase("FE")){
            return "UnicodeLittle"; //UTF-16, little-endian BOM
      }
      else if (first_byte.equalsIgnoreCase("EF") && sec_byte.equalsIgnoreCase("BB") && third_byte.equalsIgnoreCase("BF")){
            return "UTF8";          //UTF-8 BOM
      }
//-----------------------------------------------------------------------------------------------------
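The same check can also be run on the raw bytes before the reader is built, pushing back anything that turns out not to be a BOM. A sketch under these assumptions (the class and method names are mine; the PushbackInputStream must be constructed with a pushback buffer of at least 3 bytes; "UnicodeBigUnmarked"/"UnicodeLittleUnmarked" are the historical Sun names for BOM-less UTF-16, used here because the BOM has already been consumed):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;

public class BomSniffer {
    // Reads up to 3 bytes, pushes back any that are not part of a BOM,
    // and returns the charset name (or the fallback for unmarked files).
    static String sniff(PushbackInputStream in, String fallback) throws IOException {
        byte[] b = new byte[3];
        int n = in.read(b, 0, 3);
        if (n >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF8";                        // UTF-8 BOM consumed
        }
        if (n >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            if (n == 3) in.unread(b, 2, 1);       // keep the first data byte
            return "UnicodeBigUnmarked";
        }
        if (n >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            if (n == 3) in.unread(b, 2, 1);
            return "UnicodeLittleUnmarked";
        }
        if (n > 0) in.unread(b, 0, n);            // no BOM: push everything back
        return fallback;
    }
}
```

Usage: build the stream as `new PushbackInputStream(myFile.getInputStream(), 3)`, call `sniff`, then hand the same stream to the InputStreamReader with the returned charset.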
      

that would work fine, but a problem arises when the .txt file has Unicode characters but is unmarked (no byte order mark)...
and of course I also still need to check the BOM (byte order mark) for all the other extension names so I know what encoding to pass to my BufferedReader.

I'm curious whether there's a shorter way to convert any file to UTF-8, so I don't have to go through all this checking and can just convert whatever file, marked or not, without having to detect the encoding first..




Mick Barry

not really; you need to know the encoding to be able to tell the Reader what to use
ayeen

ASKER
i see... oh well, I guess I don't have a choice but to handle each one of those encoding types...

anyway, I tried your solution, but I noticed the BOM (byte order mark) is still there after I convert a file to UTF-8. Is there a way to remove the BOM before converting it to UTF-8?

thanks again!
Mick Barry

you need to remove them manually
ayeen

ASKER
yeah, I figured there's no shortcut to that too...

for the sake of those reading this, here's how I removed the BOM...

//encoding_type = can be UTF8, UnicodeBig, UnicodeLittle, etc
BufferedReader read = new BufferedReader(new InputStreamReader(file.getInputStream(), encoding_type));
ByteArrayOutputStream temp = new ByteArrayOutputStream();
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(temp, "UTF8"));

String line = read.readLine();

//offset = 1 for BE/LE and 2 for UTF8
writer.write(line, offset, line.length()-offset);

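A charset-independent alternative sketch: after decoding, a BOM normally shows up as the single character U+FEFF, so the first line can simply be checked for that character rather than tracking per-encoding offsets (the helper name is mine; if your decoder leaves a different number of leading characters, as the offsets above suggest, adjust accordingly):

```java
public class BomStripper {
    // After decoding, every BOM variant (UTF-8, UTF-16 BE/LE) normally
    // becomes the single character U+FEFF, so this check does not
    // depend on the original encoding.
    static String stripBom(String line) {
        if (line != null && line.length() > 0 && line.charAt(0) == '\uFEFF') {
            return line.substring(1);
        }
        return line;
    }
}
```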
ayeen

ASKER
the solution is excellent, but a bit vague; I had to do additional research and some code experiments to fill in the missing parts of the solution...