ayeen
asked on
Convert files in any unicode encoding type to UTF-8 in Java
I have this upload mechanism in our web app, and everything is OK if the user just uploads a plain file. This is how part of the code looks:
// raw file
FormFile myFile = myForm.getTheFile();
// final product should be a BufferedReader
BufferedReader read = new BufferedReader(new InputStreamReader(myFile.getInputStream()));
My question is: how do you convert files to UTF-8 regardless of their encoding type? Here are some of the extension names I have to handle (yes, I also have to handle big- and little-endian byte orders):
utxt
utf8
utf-8
utf16
utf-16
utf-16le
utf-16be
ucs2
ucs-2
ucs-2le
ucs-2be
cs2le
Now somewhere, I have seen this code (see below) but this just converts a text file to UTF-8:
String asString = new String(contents, "ISO8859_1");//for text file
byte[] newBytes = asString.getBytes("UTF8");
And I have seen this from the java site:
http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html
Based on the info at that URL, does that mean I just have to change "ISO8859_1" to "UnicodeBig" or "UTF-16" etc. depending on what encoding my raw file is in?
Like:
//if Sixteen-bit Unicode Transformation Format, little-endian byte order, with byte-order mark
String asString = new String(contents, "UnicodeLittle");
byte[] newBytes = asString.getBytes("UTF8");
If that is all I have to do, how do I turn the byte[] newBytes into a BufferedReader? Because a BufferedReader should be my final product after converting the files to UTF-8.
Thanks in advance!
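For reference, no cast is needed to get from the converted bytes back to a BufferedReader: wrap them in a ByteArrayInputStream and decode that as UTF-8. A minimal sketch (the sample text and the ISO8859_1 source encoding are just for illustration):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class Utf8Wrap {
    public static void main(String[] args) throws IOException {
        // pretend these bytes came from the uploaded file in a known encoding
        byte[] contents = "h\u00e9llo".getBytes("ISO8859_1");

        // decode with the source encoding, re-encode as UTF-8
        String asString = new String(contents, "ISO8859_1");
        byte[] newBytes = asString.getBytes("UTF8");

        // wrap the UTF-8 bytes in a stream and read them back as characters
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(newBytes), "UTF8"));
        System.out.println(reader.readLine()); // prints "héllo"
    }
}
```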
ASKER CERTIFIED SOLUTION
not really, you need to know the encoding to be able to tell the Reader what to use
ASKER
i see.... oh well, i guess i dont have a choice but to put each one of those encoding types...
anyway, i tried your solution but i noticed the BOM (byte order marks) are still there after i convert a file to UTF8. is there a way to remove the BOM before converting it to UTF8?
thanks again!
you need to remove them manually
ASKER
yeah, i figured there's no shortcut to that too....
for the sake of those reading this, here's how i removed the BOM...
//encoding_type can be UTF8, UnicodeBig, UnicodeLittle, etc.
BufferedReader read = new BufferedReader(new InputStreamReader(file.getInputStream(), encoding_type));
ByteArrayOutputStream temp = new ByteArrayOutputStream();
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(temp, "UTF8"));
String line = read.readLine();
//offset = 1 for BE/LE and 2 for UTF8
writer.write(line, offset, line.length() - offset);
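A possible alternative to the per-encoding offset arithmetic: once the stream has been decoded with the correct charset, a BOM (if the decoder passes it through at all) shows up as the single character U+FEFF, so it can be dropped uniformly. A sketch, assuming the encoding name is already known; the class and method names here are made up:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

public class BomStrip {
    // drops a leading U+FEFF from an already-decoded reader, if one is present
    public static Reader skipBom(Reader in) throws IOException {
        PushbackReader pr = new PushbackReader(in, 1);
        int first = pr.read();
        if (first != -1 && first != '\uFEFF') {
            pr.unread(first); // no BOM: put the character back
        }
        return pr;
    }

    public static void main(String[] args) throws IOException {
        String withBom = "\uFEFFhello";
        BufferedReader r = new BufferedReader(skipBom(new StringReader(withBom)));
        System.out.println(r.readLine()); // prints "hello"
    }
}
```

Note that some decoders (e.g. "UTF-16") consume the BOM themselves, in which case skipBom is simply a no-op.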
ASKER
the solution is excellent but kinda vague, so i had to do additional research and some code experiments to fill in the missing parts of the solution...
ASKER
this is what im doing:
//------------------------
BufferedReader read = new BufferedReader(new InputStreamReader(myFile.getInputStream(), encoding_type));
ByteArrayOutputStream temp = new ByteArrayOutputStream();
BufferedWriter write = new BufferedWriter(new OutputStreamWriter(temp, "UTF8"));
//------------------------
is there a way to convert "any" unicode files to UTF8 without having to know before hand what type of encoding type the file is?
currently i just check the extension names, something like:
//------------------------
private static String getEncodingTypeCode(String extension_name) {
    if (extension_name.equalsIgnoreCase("utf8")) {
        return "UTF8";
    }
    if (extension_name.equalsIgnoreCase("utf-8")) {
        return "UTF8";
    }
    if (extension_name.equalsIgnoreCase("utf16")) {
        return "UTF-16";
    }
    if (extension_name.equalsIgnoreCase("utf-16")) {
        return "UTF-16";
    }
    if (extension_name.equalsIgnoreCase("utf-16le")) {
        return "UnicodeLittle";
    }
    if (extension_name.equalsIgnoreCase("utf-16be")) {
        return "UnicodeBig";
    }
    return "UTF8"; // default so the method always returns something
}
//------------------------
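The if-chain above could be collapsed into a table lookup, so adding an encoding is one line instead of a new branch. A sketch, where the extension-to-encoding mapping is a guess based on the extension list earlier in the thread:

```java
import java.util.HashMap;
import java.util.Map;

public class EncodingTable {
    // extension (lower-case) -> canonical Java encoding name
    private static final Map<String, String> ENCODINGS = new HashMap<String, String>();
    static {
        ENCODINGS.put("utxt", "UTF8");
        ENCODINGS.put("utf8", "UTF8");
        ENCODINGS.put("utf-8", "UTF8");
        ENCODINGS.put("utf16", "UTF-16");
        ENCODINGS.put("utf-16", "UTF-16");
        ENCODINGS.put("utf-16le", "UnicodeLittle");
        ENCODINGS.put("utf-16be", "UnicodeBig");
        // ucs2, ucs-2, Cp037, Cp273, ... can be added one line each
    }

    public static String getEncodingTypeCode(String extensionName) {
        String enc = ENCODINGS.get(extensionName.toLowerCase());
        return enc != null ? enc : "UTF8"; // fall back for plain .txt etc.
    }

    public static void main(String[] args) {
        System.out.println(getEncodingTypeCode("UTF-16LE")); // prints "UnicodeLittle"
    }
}
```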
but that "if" condition can go on and on...i still need to add ucs, ucs2, ucs-2l2, Cp037, Cp273 etc (basically everything here: http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html)....
and aside from those "if" conditions, if it's a .txt file i also need to check the byte order marks to see if it's just a plain .txt file or if its a unicode...
so i have something like:
//------------------------
if (first_byte.equalsIgnoreCase("FEFF")) {
    return sEncodingType = "UnicodeBig"; // java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html
}
else if (first_byte.equalsIgnoreCase("FFFE")) {
    return sEncodingType = "UnicodeLittle";
}
else if (first_byte.equalsIgnoreCase("EFBBBF")) {
    return sEncodingType = "UTF8";
}
//------------------------
that would work fine, but a problem arises when the .txt file has unicode characters but is unmarked (no byte order mark)...
and of course i also still need to check the BOM (byte order mark) for all the other extension names so i know what encoding type to pass to my BufferedReader.
i'm curious if there's a shorter way to convert any file to UTF8, so i don't have to go through all this checking anymore and can just convert whatever file, marked or not, to UTF8 without first checking what encoding the file has..
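For the marked files at least, BOM sniffing on the raw bytes can replace most of the extension checks: read the first few bytes, match them against the known BOM signatures, and push back whatever was not consumed. A sketch under that assumption (the class name is made up, and unmarked files still need a guess or an external hint):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;

public class BomSniffer {
    // returns a Java encoding name if the stream starts with a known BOM, else null
    public static String sniffBom(PushbackInputStream in) throws IOException {
        byte[] head = new byte[3];
        int n = in.read(head, 0, 3);
        String enc = null;
        int used = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            enc = "UTF8"; used = 3;                    // UTF-8 BOM consumed
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            enc = "UnicodeBigUnmarked"; used = 2;      // UTF-16BE, BOM consumed
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            enc = "UnicodeLittleUnmarked"; used = 2;   // UTF-16LE, BOM consumed
        }
        if (n > 0) {
            in.unread(head, used, n - used); // push back the unconsumed bytes
        }
        return enc;
    }

    public static void main(String[] args) throws IOException {
        byte[] utf16le = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(utf16le), 3);
        String enc = sniffBom(in);
        BufferedReader r = new BufferedReader(new InputStreamReader(in, enc));
        System.out.println(enc + ": " + r.readLine()); // prints "UnicodeLittleUnmarked: hi"
    }
}
```

This is essentially what the extension checks above do by hand; it just moves the decision from the file name to the file contents, which also covers mislabeled files.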