yes.
This works:
convertedStr = convertedStr.replace('\uff
But I was hoping for a solution that wouldn't require me to replace the chars manually.
CJ
I have run into a problem that I can't seem to find a solution to.
My users are copying and pasting from MS Word. My DB is Oracle with its encoding set to "UTF-8".
Oracle's thin driver automatically converts to the DB's default character set.
When Java tries to encode Unicode to UTF-8 and runs into an unknown character (typically a character in the high-ASCII range), it substitutes '?' or some other weird character.
How do I prevent this?
I tried different encodings using a simple driver like:
class UnicodeConversionTest
{
public static void main(String[] args)
{
try {
String str = new String("`test3`");
String utfStr = new String(str.getBytes("UTF-8
System.out.println("Converted:" + str + " to:" + utfStr);
} catch (Exception e) {
e.printStackTrace(System.out);
}
}
}
But that didn't work. Then I tried a more elaborate conversion:
import sun.io.CharToByteConverter
import sun.io.ByteToCharConverter
public class UnicodeTest {
public UnicodeTest() {
}
public static void main(String[] args) {
UnicodeTest unicodeTest1 = new UnicodeTest();
try {
ByteToCharConverter fromUnicode = ByteToCharConverter.getCon
char[] subChars = { ' ' };
fromUnicode.setSubstitutio
fromUnicode.setSubstitutio
String originalStr = new String("test3");
char[] convertedChars = fromUnicode.convertAll(ori
String convertedStr = new String(convertedChars);
//String convertedStr = new String(originalStr.getByte
System.out.println("String
} catch (Exception e) {
e.printStackTrace(System.out);
}
}
}
I tried a variation of the second code snippet that inserts into the DB, just to see the results, and it was a no-go.
I don't want '?' replacing the unknown chars. I would rather strip them or replace them with ' ', but I haven't been able to get that to work (using the second bit of code).
Any ideas on what I am doing wrong?
Thanx,
CJ
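A minimal sketch of one way to strip or blank out unmappable characters instead of getting '?', using the standard java.nio.charset.CharsetEncoder. The class and method names are hypothetical, and it assumes the conversion happens in application code against a single-byte target such as ISO-8859-1. It would not change what the JDBC driver does internally, and against a true UTF-8 target nothing would be replaced because every character is mappable.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original thread.
class UnmappableCharDemo {

    // Encode text, writing a space for every character the target charset cannot represent.
    static byte[] encodeReplacingWithSpace(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith(new byte[] { ' ' });
        return encode(encoder, text);
    }

    // Encode text, silently dropping every character the target charset cannot represent.
    static byte[] encodeDropping(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
        return encode(encoder, text);
    }

    private static byte[] encode(CharsetEncoder encoder, String text) {
        try {
            ByteBuffer buffer = encoder.encode(CharBuffer.wrap(text));
            byte[] out = new byte[buffer.remaining()];
            buffer.get(out);
            return out;
        } catch (CharacterCodingException e) {
            // Not expected here: the error actions are REPLACE or IGNORE, never REPORT.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Curly single quotes (U+2018, U+2019), as Word might paste them.
        String pasted = "\u2018test\u2019";
        Charset latin1 = StandardCharsets.ISO_8859_1;
        System.out.println("[" + new String(encodeReplacingWithSpace(pasted, latin1), latin1) + "]"); // [ test ]
        System.out.println("[" + new String(encodeDropping(pasted, latin1), latin1) + "]");           // [test]
    }
}
```

The resulting bytes contain only characters the target charset can hold, so a later conversion has nothing left to replace.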
This question has been solved and verified by the asker.
It sounds to me like the problem you're having is that the output device can't handle the odd characters! I have had the same result in the past.
Have you tried a round trip, meaning inserting from Java and then selecting from Java, showing the final result in a JTextField?
I am pretty sure that the driver works just fine, and that a call to Statement.execute or Statement.executeUpdate would encode your strings correctly.
When I use the following (for testing purposes):
String insertSql = "insert into unicode_test (string_id, string_value) values (?,?)";
PreparedStatement ps = conn.prepareStatement(insertSql);
ps.setInt(1, maxID);
ps.setString(2, convertedStr);
int rowsInserted = ps.executeUpdate();
The DB gets an inverted '?' stored as the character.
So the conversion that the JDBC driver is doing uses a weird character, and I don't want that character to be stored or displayed in my tools or site.
Retrieving the value from the DB and displaying it returns the string with the '?' or inverted '?'.
CJ
The driver's conversion results in the string being stored with those inverted '?' or '\ufffd' chars in the DB. I don't want those in the DB. Since the JDBC driver's conversion is doing this, I want to convert before the driver does, so that unsupported chars are not replaced with ugly characters in the DB.
CJ
Well, the other languages querying the DB (besides Java) are Perl, ColdFusion and C.
They display the inverted '?' because that is what they get back from the DB.
ColdFusion and Perl also insert into the DB. ColdFusion is having the same problem as Java. Perl reads an environment variable called 'NLS_LANG' that fixes this issue, as we set the encoding to 'WE8ISO8859P1'.
The OCI Oracle driver supports the 'NLS_LANG' setting and I set it for that, but the thin driver (which is used 100% of the time) ignores all client settings.
CJ
Is the DB driver set up correctly, or is it auto-configured?
Regarding Unicode and UTF-8: either the DB supports UTF-8 or it doesn't; Unicode never enters the stage at that level. That is what UTF-8 is for, encoding a multi-byte character set into discrete 8-bit values. It sounds more like there is an 8-bit to 7-bit conversion problem.
Have you tried printing out the actual byte values? Have you tried storing the UTF-8 byte buffer as binary in the DB?
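As a rough illustration of printing the actual byte values, the sketch below renders the UTF-8 bytes of a string as comma-separated hex. The class and method names are made up for this example; only String.getBytes and String.format from the standard library are used.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original thread.
class ByteDump {

    // Return the UTF-8 bytes of a string as comma-separated lowercase hex.
    static String utf8Hex(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            // Mask to 0..255 so bytes above 7f are not printed as negative ints.
            sb.append(String.format("%02x", bytes[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("`test`"));             // 60,74,65,73,74,60
        System.out.println(utf8Hex("\u2018test\u2019"));   // e2,80,98,74,65,73,74,e2,80,99
    }
}
```

Comparing this output for the string just before it is handed to the driver with the bytes the database reports for the stored value should show at which step, if any, a character is being changed.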
Reading the following, it appears to me that a mapping exists for all characters:
http://www.sun.com/develop
Or am I missing something?
There's also a bit of code to encode UTF-8 that may (or may not) help you.
Still waiting on verification that non-Java clients can read/write UTF-8 to the DB. Supposedly in Perl they set the environment (NLS_LANG charset to the Western setting, some "WEP..." string, and it works) but I want proof :-)
objects: the encode method didn't work (given in the URL)
I used the following code:
char[] charArray = originalStr.toCharArray();
int[] scalorArray = new int[charArray.length];
for (int i=0;i<charArray.length;i++
scalorArray[i]= Character.getNumericValue(
System.out.println("Conver
I think that should be the way to convert Unicode to scalar values.
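One possible reason the snippet above did not give the expected numbers: Character.getNumericValue returns the number a character represents as a digit (3 for '3', -1 for a character with no numeric meaning), not its Unicode scalar value. A small sketch of reading scalar values instead, assuming Java 8 or later for String.codePoints; the class and method names are hypothetical.

```java
// Hypothetical helper, not part of the original thread.
class ScalarValues {

    // Return the Unicode scalar value (code point) of each character in the string.
    static int[] scalarValues(String text) {
        return text.codePoints().toArray();
    }

    public static void main(String[] args) {
        // Left curly quote, 't', '3', right curly quote.
        String s = "\u2018t3\u2019";
        for (int cp : scalarValues(s)) {
            System.out.printf("U+%04X%n", cp);
        }
        // For contrast: getNumericValue reports digit values, not code points.
        System.out.println(Character.getNumericValue('3'));      // 3
        System.out.println(Character.getNumericValue('\u2018')); // -1
    }
}
```

For characters in the Basic Multilingual Plane a plain cast, (int) charArray[i], gives the same number; codePoints() additionally handles surrogate pairs.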
How can I verify that the driver supports UTF-8? All the documentation seems to point that way.
CJ
>> Perl reads a environment setting var called 'NLS_LANG' that fixes this issue as we set the encoding to 'WE8ISO8859P1'.
Check with your DBA. I remember that you must specify whether you need multi-byte support during Oracle database creation. WE8ISO8859P1 is a single-byte scheme.
Another point is the data source: how do you get the original string? Is it entered by the user in a Java GUI?
I would try this:
Get the data from the GUI, say string "original", and display it back in the GUI. Then save it to the database without any conversion using the prepared statement's setString() method. Retrieve it from the database immediately, without conversion, into string "fromdb", and display "fromdb" in the GUI again.
I guess you might have already done that. What was the result?
we don't do any conversion in the gui. We get the string and use a prepared statement to insert it.
But high ascii characters like (ALT-0147) and (ALT-0148)
show up as inverted chars in the DB and then are retrieved as such. When you paste them into the gui they show up as small squares.
the results of the dump:
Typ=1 Len=6 CharacterSet=UTF8: 60,74,65,73,74,60
Typ=1 Len=6 CharacterSet=UTF8: 60,74,65,73,74,60
Typ=1 Len=6 CharacterSet=UTF8: 60,74,65,73,74,60
Typ=1 Len=6 CharacterSet=UTF8: 60,74,65,73,74,60
Typ=1 Len=6 CharacterSet=UTF8: 60,74,65,73,74,60
Typ=1 Len=10 CharacterSet=UTF8: e2,80,98,74,65,73,74,e2,80
Typ=1 Len=10 CharacterSet=UTF8: e2,80,98,74,65,73,74,e2,80
Typ=1 Len=10 CharacterSet=UTF8: e2,80,98,74,65,73,74,e2,80
CJ
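To read a dump like the one above, the hex bytes can be decoded back into characters. The sketch below is only an illustration with hypothetical names; it decodes the complete leading bytes of the two kinds of rows and leaves out the trailing bytes that are cut off in the listing.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original thread.
class DumpDecoder {

    // Parse comma-separated hex bytes and decode them as UTF-8.
    static String decodeUtf8Hex(String hexCsv) {
        String[] parts = hexCsv.split(",");
        byte[] bytes = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            bytes[i] = (byte) Integer.parseInt(parts[i].trim(), 16);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String plain = decodeUtf8Hex("60,74,65,73,74,60");
        String curly = decodeUtf8Hex("e2,80,98,74,65,73,74");
        System.out.println(plain);                                  // `test`
        System.out.printf("U+%04X%n", (int) curly.charAt(0));       // U+2018
    }
}
```

If this reading is right, the Len=10 rows begin with e2,80,98, which is the UTF-8 form of U+2018 (a left curly quote) rather than a replacement character, so at least those rows may be stored correctly and the odd glyphs could come from whatever tool displays them.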
>> show up as inverted chars in the DB and then are retrieved as such.
When we say "show up", it must be through some front end. So what you see is the presentation of your data by that particular front-end application (be it the DB or third-party tools or utilities).
>> When you paste them into the gui they show up as small squares.
Copy and paste again passes the data through the Windows clipboard, which does a conversion.
Why not try retrieving it directly using Java code and displaying it in a Java GUI?
I have some experience saving/retrieving encoded data. Here are my comments.
First, you need to know exactly what the NLS_CHARACTERSET setting for the database is. Use something like:
select * from NLS_DATABASE_PARAMETERS;
You will probably have either WE8ISO8859P1 or UTF8.
1. If your database encoding is UTF8, you should NOT encode characters yourself! The reason is that the driver will encode the output string automatically, so your string will be UTF-8 encoded twice.
Every high-ASCII character is UTF-8 encoded as two or three bytes. For example, "é" (hE9) will be encoded as two bytes "Ã©" (hC3 hA9). If you encode it yourself, you will have 4 or more bytes, because the "Ã" character (hC3) will be UTF-8 encoded again.
To check what you have, save some fixed string, say "==é==", and read it back. See how many characters were saved in the database: you should see either "==é==" or, if it was encoded twice, "==Ã©==". You should retrieve exactly your 5 characters, because the driver should decode UTF-8 to Unicode. Compare the hexadecimal values you retrieved with those you tried to save.
Win CP-1252 glyphs (e.g. TM or mdash) in the range 128 (80h) to 159 (9Fh) should be treated differently - they should not be UTF-8 encoded with others. I can explain why and how if you are interested.
2. If your database encoding is WE8ISO8859P1, you can use your own application-level UTF-8 encoding. But in this case every other client reading from the DB should be aware of it, so that it can decode it "manually" as well.
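The double-encoding effect described in point 1 can be reproduced without a database. This is a sketch under the assumption that the second encoding step sees the first step's bytes as ISO-8859-1 characters, which is what produces "Ã©"; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original thread.
class DoubleEncodingDemo {

    // Encode once, as a UTF-8 database driver would.
    static byte[] encodeOnce(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Simulate encoding twice: the UTF-8 bytes are misread as ISO-8859-1
    // characters, and that misread string is UTF-8 encoded again.
    static byte[] encodeTwice(String text) {
        String misread = new String(encodeOnce(text), StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sample = "==\u00e9=="; // "==é=="
        System.out.println(encodeOnce(sample).length);  // 6 bytes: 3d 3d c3 a9 3d 3d
        System.out.println(encodeTwice(sample).length); // 8 bytes: 3d 3d c3 83 c2 a9 3d 3d
    }
}
```

Seeing c3 83 c2 a9 in the stored bytes where c3 a9 was expected is a typical sign that the string was encoded by the application and then again by the driver.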
Thanks CJ :)
http://www.evalu8.com.au
"Giving everyone a voice"
>>Comment from kfahrut
>>Win CP-1252 glyphs (e.g. TM or mdash) in the range 128 (80h) to 159 (9Fh) should be treated differently - they should not be UTF-8 encoded with others. I can explain why and how if you are interested.
I am experiencing the same problems. Would you please explain this encoding matter to me?
Thx plsql
OK, here's my offline addition. Hope it helps.
Here's the situation we have: our editors edit texts and submit them as ISO-8859-1 to an Oracle database with ISO-8859-1 encoding. The issue is that when they enter characters like mdash or TM (as Unicode characters), Microsoft tools convert those to CP-1252, so that mdash becomes 97h and TM becomes 99h. Though CP-1252 characters in the range 128 (80h) to 159 (9Fh) are illegal in ISO-8859-1, both the web browser and the Oracle database accept and save them.
So now we have CP-1252 characters in an ISO-8859-1 database. If we read them back as if they were ISO-8859-1 and send them to the browser as ISO-8859-1, it works! We even tried it on the Netscape browser on Apple, and we saw all Windows CP-1252 characters properly on the ISO-8859-1 HTML page. I would never have guessed that Netscape would support CP-1252 characters on Apple as if they were true ISO-8859-1 chars!
Now comes the problem. If we read those characters from the database and send the response back UTF-8 encoded, the CP-1252 characters are UTF-8 encoded simply by prepending C2h, so that mdash (97h) becomes C2h 97h. And now we can't see those UTF-8 encoded CP-1252 characters in either MS IE or Netscape (with UTF-8 encoding for the HTML page). For example, if we send them back through web services (which use UTF-8), the client can't see mdashes, TMs, etc.
To solve the issue we have (more than) two approaches.
1. Before saving supposedly ISO-8859-1 characters to the database, scan the string for the CP-1252 codes that are illegal in ISO-8859-1: 128 (80h) to 159 (9Fh).
Substitute them with HTML or XML named or numeric entities, so that instead of saving the one-byte mdash character 97h, you save "&mdash;".
2. If you already have those parasite CP-1252 chars in your ISO-8859-1 database, then before UTF-8-encoding the string and sending it to the client, either do the same (substitute CP-1252 characters with HTML or XML named or numeric entities) or, which is what I am currently doing, have the Java program substitute Unicode chars 0080h to 009Fh with their true Unicode equivalents, so that the mdash character 0097h is replaced with the true Unicode 2014h, TM 0099h with 2122h, etc.
Now if this Java Unicode string, without CP-1252 chars, is encoded with UTF-8, those characters will be properly UTF-8 encoded as three bytes: E2 80 94h (mdash) and E2 84 A2h (TM).
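Both approaches can be sketched in a few lines of Java. This is not the poster's actual code, just one way to do it: it leans on the JDK's "windows-1252" charset to look up the true Unicode equivalent of each char in the 0080h to 009Fh range, and all class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original thread.
class Cp1252Repair {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Map one char: U+0080..U+009F goes to the character CP-1252 assigns to the
    // same byte value; every other char is returned unchanged.
    static char mapC1(char c) {
        if (c < 0x80 || c > 0x9F) {
            return c;
        }
        char mapped = new String(new byte[] { (byte) c }, CP1252).charAt(0);
        // A few byte values are unassigned in CP-1252; keep the original char for those.
        return mapped == '\uFFFD' ? c : mapped;
    }

    // Approach 2: substitute the true Unicode equivalents.
    static String remapC1ToUnicode(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            sb.append(mapC1(text.charAt(i)));
        }
        return sb.toString();
    }

    // Approach 1: substitute decimal numeric entities.
    static String c1ToNumericEntities(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            char mapped = mapC1(c);
            if (mapped != c) {
                sb.append("&#").append((int) mapped).append(';');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String fromDb = "a\u0097b"; // mdash stored as 97h and read back as ISO-8859-1
        String fixed = remapC1ToUnicode(fromDb);
        for (byte b : fixed.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b & 0xff); // 61 E2 80 94 62
        }
        System.out.println();
        System.out.println(c1ToNumericEntities(fromDb)); // a&#8212;b
    }
}
```

The numeric form is used for the entities because it is valid in both HTML and XML, whereas named entities such as "&mdash;" are not predefined in plain XML.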
Though this might be too late to answer, I've had no problems just using String.getBytes("UTF-8") and, with a parameterized query, setBytes(). The point is that you _cannot_ output a string once you have it as bytes, because that needs a backward conversion to some charset, and this last step produces those "unrecognized character" question marks.
This worked for me using Interbase, but I believe it would work with any DB that does not perform any character collation and stores strings as it gets them.
When you retrieve the results from a query, just perform the steps backward: from the result set, call getBytes(), and construct a string from these bytes with UTF-8 encoding.
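The encode and decode halves of that approach look roughly like this. The sketch leaves the database out so that it runs on its own; the comments mark where the JDBC calls PreparedStatement.setBytes and ResultSet.getBytes would sit. The class and method names are hypothetical, and it assumes the column can hold raw bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original thread.
class Utf8BytesRoundTrip {

    // Before insert: turn the string into UTF-8 bytes.
    // With JDBC these bytes would be passed to ps.setBytes(index, bytes).
    static byte[] toColumnBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // After select: rebuild the string from the stored bytes.
    // With JDBC the bytes would come from rs.getBytes(column).
    static String fromColumnBytes(byte[] stored) {
        return new String(stored, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u2018test\u2019"; // curly quotes around "test"
        byte[] stored = toColumnBytes(original);
        String restored = fromColumnBytes(stored);
        System.out.println(stored.length);             // 10
        System.out.println(original.equals(restored)); // true
    }
}
```

The trade-off is that the database no longer knows the column holds text, so other clients have to decode the bytes as UTF-8 themselves.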
by: CEHJ, posted on 2003-02-13 at 10:51:44 (ID: 7944022)
>>System.out.println("Converted:" + str + " to:" + utfStr);
You're talking about '?' getting printed out unexpectedly by the above code, I take it?
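If the '?' only appears on output, the string itself may be fine and the stream's encoding may be the culprit. The sketch below shows the effect with a PrintStream writing into memory instead of a console; the class and method names are hypothetical, and whether this matches the asker's environment is an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper, not part of the original thread.
class ConsoleEncodingDemo {

    // Print text through a PrintStream that uses the given encoding
    // and return the bytes that were actually written.
    static byte[] printWith(String text, String encoding) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, encoding)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String quote = "\u2018"; // left curly quote
        byte[] ascii = printWith(quote, "US-ASCII");
        byte[] utf8 = printWith(quote, "UTF-8");
        System.out.println((char) ascii[0]); // ?
        System.out.println(utf8.length);     // 3
    }
}
```

System.out behaves like the first case whenever the platform's console encoding cannot represent the character, which is why checking byte values is more reliable than reading printed output.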