I have an upload mechanism in our web app, and everything works fine when the user uploads a plain file. This is how part of the code looks:
// raw file
FormFile myFile = myForm.getTheFile(); // using Struts, so this is defined somewhere in my form bean

// final product should be a BufferedReader
BufferedReader read = new BufferedReader(new InputStreamReader(myFile.getInputStream()));
My question is: how do I convert files to UTF-8 regardless of their encoding? Here is part of the list of extension names I have to handle (yes, I also have to handle big- and little-endian):
utxt
utf8
utf-8
utf16
utf-16
utf-16le
utf-16be
ucs2
ucs-2
ucs-2le
ucs-2be
cs2le
Now, somewhere I have seen this code (see below), but it just converts a text file to UTF-8:
String asString = new String(contents, "ISO8859_1"); // for a text file
byte[] newBytes = asString.getBytes("UTF8");
And I have seen this page on the Java site:
http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html
Based on the info at that URL, does that mean I just have to change "ISO8859_1" to "UnicodeBig", "UTF-16", etc., depending on what type my raw file is?
Like:
//if Sixteen-bit Unicode Transformation Format, little-endian byte order, with byte-order mark
String asString = new String(contents, "UnicodeLittle");
byte[] newBytes = asString.getBytes("UTF8");
If that is all I have to do, how do I get from the byte[] newBytes to a BufferedReader? A BufferedReader should be my final product after converting the files to UTF-8.
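I'm guessing I can wrap newBytes in a ByteArrayInputStream and decode it back as UTF-8; a minimal sketch (variable names are mine, untested):
//------------------------
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;

// newBytes already holds the UTF-8 encoded contents
ByteArrayInputStream bytesIn = new ByteArrayInputStream(newBytes);

// decode the bytes as UTF-8 and wrap the reader, so no "cast" is needed
BufferedReader read = new BufferedReader(new InputStreamReader(bytesIn, "UTF8"));
//------------------------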
Thanks in advance!
This is what I'm doing:
//------------------------
BufferedReader read = new BufferedReader(new InputStreamReader(myFile.getInputStream()));
ByteArrayOutputStream temp = new ByteArrayOutputStream();
BufferedWriter write = new BufferedWriter(new OutputStreamWriter(temp, "UTF8"));
//------------------------
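The piece I still haven't shown is the copy loop that pumps the decoded characters into that UTF-8 writer; roughly (untested, and it assumes read was opened with the right source encoding):
//------------------------
// copy characters from the decoded reader into the UTF-8 writer
char[] buffer = new char[4096];
int n;
while ((n = read.read(buffer)) != -1) {
    write.write(buffer, 0, n);
}
write.flush();

byte[] utf8Bytes = temp.toByteArray(); // the upload, re-encoded as UTF-8
//------------------------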
Is there a way to convert "any" Unicode file to UTF-8 without having to know beforehand what encoding the file uses?
Currently I just check the extension names, something like:
//------------------------
private static String getEncodingTypeCode(String extension_name) {
    if (extension_name.equalsIgnoreCase("utf8")) {
        return "UTF8";
    }
    if (extension_name.equalsIgnoreCase("utf-8")) {
        return "UTF8";
    }
    if (extension_name.equalsIgnoreCase("utf16")) {
        return "UTF-16";
    }
    if (extension_name.equalsIgnoreCase("utf-16")) {
        return "UTF-16";
    }
    if (extension_name.equalsIgnoreCase("utf-16le")) {
        return "UnicodeLittle";
    }
    if (extension_name.equalsIgnoreCase("utf-16be")) {
        return "UnicodeBig";
    }
    return ""; // fall through: encoding unknown
}
//------------------------
but that "if" condition can go on and on...i still need to add ucs, ucs2, ucs-2l2, Cp037, Cp273 etc (basically everything here: http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html)....
And aside from those "if" conditions, if it's a .txt file I also need to check the byte order mark to see whether it's a plain .txt file or a Unicode one...
So I have something like:
//------------------------
if (first_byte.equalsIgnoreCase("FEFF")) { // UTF-16 big-endian BOM
    return "UnicodeBig";
}
else if (first_byte.equalsIgnoreCase("FFFE")) { // UTF-16 little-endian BOM
    return "UnicodeLittle";
}
else if (first_byte.equalsIgnoreCase("EFBBBF")) { // UTF-8 BOM
    return "UTF8";
}
//------------------------
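For reference, this is roughly how I picture building that first_byte string from the stream (a sketch; readBomPrefix is my own helper name, and in real code the consumed bytes would need to be pushed back with mark()/reset() or a PushbackInputStream so the reader doesn't lose them):
//------------------------
import java.io.IOException;
import java.io.InputStream;

// read up to three leading bytes and render them as an uppercase hex
// string (e.g. "FEFF", "FFFE", "EFBBBF"); non-BOM bytes simply won't
// match any of the comparisons above
private static String readBomPrefix(InputStream in) throws IOException {
    byte[] b = new byte[3];
    int count = in.read(b);
    // the UTF-8 BOM is three bytes: EF BB BF
    if (count == 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
        return "EFBBBF";
    }
    // UTF-16 BOMs are two bytes: FE FF (big-endian) or FF FE (little-endian)
    if (count >= 2) {
        return String.format("%02X%02X", b[0] & 0xFF, b[1] & 0xFF);
    }
    return ""; // empty or one-byte file: no BOM possible
}
//------------------------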
That would work fine, but a problem arises when the .txt file has Unicode characters yet is unmarked (no byte order mark)...
And of course I also still need to check the BOM (byte order mark) for all the other extension names so I know what encoding to pass to my BufferedReader.
I'm curious whether there's a shorter way to convert any file to UTF-8, so I don't have to go through all this checking anymore and can just convert whatever file, marked or not, to UTF-8 without first working out what encoding it has...
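The closest thing I've found so far is content sniffing, e.g. ICU4J's CharsetDetector. I haven't tried it yet, so this is only a sketch of how I imagine it plugging in (detection is a statistical guess, so it can pick the wrong charset on short or ambiguous files):
//------------------------
import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

// contents is the upload's raw byte[]; let ICU guess the charset,
// then decode with the guess and re-encode as UTF-8
CharsetDetector detector = new CharsetDetector();
detector.setText(contents);
CharsetMatch match = detector.detect();

String asString = new String(contents, match.getName());
byte[] newBytes = asString.getBytes("UTF8");
//------------------------
If that works, the extension and BOM checks become a fallback rather than the whole mechanism.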