I have run into a problem that I can't seem to find a solution to.
My users are copying and pasting from MS Word. My DB is Oracle with its character set set to "UTF-8".
Oracle's thin driver automatically converts strings to the DB's default character set.
When Java tries to encode Unicode to UTF-8 and runs into an unknown character (typically one in the "high ASCII" range, i.e. above 127), it substitutes '?' or some other weird character.
How do I prevent this?
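The symptom can be reproduced outside the DB. This is a minimal sketch, assuming the pasted text contains Word's curly quotes (U+201C and U+201D) and using US-ASCII purely as a stand-in for whatever character set lacks them. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SubstitutionDemo {
    // Encodes the string to bytes in the given charset and decodes it back,
    // imitating one conversion hop between the application and the DB.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        // Assumed input: curly quotes as Word would paste them.
        String pasted = "\u201Ctest3\u201D";

        // UTF-8 can represent the curly quotes, so the round trip is lossless.
        System.out.println(roundTrip(pasted, StandardCharsets.UTF_8).equals(pasted)); // prints true

        // US-ASCII cannot, so getBytes(Charset) substitutes '?' for each quote.
        System.out.println(roundTrip(pasted, StandardCharsets.US_ASCII)); // prints ?test3?
    }
}
```

If the UTF-8 round trip is lossless, as it is here, the '?' is presumably introduced by some other conversion step along the way.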
I tried different encodings using a simple driver like:
class UnicodeConversionTest
{
    public static void main(String[] args)
    {
        try {
            String str = new String("`test3`");
            String utfStr = new String(str.getBytes("UTF-8"), "UTF-8");
            System.out.println("Converted:" + str + " to:" + utfStr);
        } catch (Exception e) {
            e.printStackTrace(System.out);
        }
    }
}
But that didn't work. Then I tried a more elaborate conversion:
import sun.io.CharToByteConverter;
import sun.io.ByteToCharConverter;

public class UnicodeTest {
    public UnicodeTest() {
    }

    public static void main(String[] args) {
        UnicodeTest unicodeTest1 = new UnicodeTest();
        try {
            ByteToCharConverter fromUnicode = ByteToCharConverter.getConverter("US-ASCII");
            char[] subChars = { ' ' };
            fromUnicode.setSubstitutionMode(true);
            fromUnicode.setSubstitutionChars(subChars);
            String originalStr = new String("test3");
            char[] convertedChars = fromUnicode.convertAll(originalStr.getBytes());
            String convertedStr = new String(convertedChars);
            //String convertedStr = new String(originalStr.getBytes("US-ASCII"), "US-ASCII");
            System.out.println("String:" + originalStr + " converted to:" + convertedStr);
        } catch (Exception e) {
            e.printStackTrace(System.out);
        }
    }
}
I also tried a variation of the second code snippet that inserts into the DB, just to see the results, and it was a no-go.
I don't want '?' replacing the unknown chars. I would rather strip them or replace them with ' ', but I haven't been able to get that to work (using the second bit of code).
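For reference, the strip-or-replace behavior described above could be sketched with java.nio.charset.CharsetEncoder instead of the sun.io converters. This is only a sketch under assumptions: US-ASCII stands in for the target character set, and the class and method names (UnmappableCleaner, replaceUnmappable, stripUnmappable) are hypothetical. It has not been tried against the Oracle driver.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

class UnmappableCleaner {
    // Replaces every character the target charset cannot encode with a space.
    static String replaceUnmappable(String input, String charsetName) {
        return clean(input, charsetName, CodingErrorAction.REPLACE);
    }

    // Drops every character the target charset cannot encode.
    static String stripUnmappable(String input, String charsetName) {
        return clean(input, charsetName, CodingErrorAction.IGNORE);
    }

    private static String clean(String input, String charsetName, CodingErrorAction action) {
        Charset cs = Charset.forName(charsetName);
        CharsetEncoder encoder = cs.newEncoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                // Only used when the action is REPLACE; a space instead of the default '?'.
                .replaceWith(new byte[] { (byte) ' ' });
        try {
            // Encode with the chosen error action, then decode back to a String.
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(input));
            return cs.decode(bytes).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE or IGNORE, but encode() declares it.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String pasted = "\u201Ctest3\u201D"; // assumed input: curly quotes from Word
        System.out.println("[" + replaceUnmappable(pasted, "US-ASCII") + "]"); // prints [ test3 ]
        System.out.println("[" + stripUnmappable(pasted, "US-ASCII") + "]");   // prints [test3]
    }
}
```

The cleaned string would then be what gets passed to the insert, so nothing is left for a later conversion step to turn into '?'.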
Any ideas on what I am doing wrong?
Thanx,
CJ