sniles
Turkish, Thai (& other) fonts/charsets are being corrupted
My application transmits and receives data over DataInputStream and DataOutputStream objects (attached to Sockets). For Turkish, Thai and several other character sets and fonts, users are reporting that the text is being transmitted (or received) incorrectly. Characters are being changed from what was typed.
How should I correct this?
Use Readers and Writers instead of DataInputStream or String.getBytes().
"The most important reason for adding the Reader and Writer hierarchies in Java 1.1 is for internationalization. The old IO stream hierarchy supports only 8-bit byte streams and doesn’t handle the 16-bit Unicode characters well. Since Unicode is used for internationalization (and Java’s native char is 16-bit Unicode), the Reader and Writer hierarchies were added to support Unicode in all IO operations. In addition, the new libraries are designed for faster operations than the old. "
Because you're using Thai/Turkish fonts, which use a 16-bit Unicode representation, switch to Readers/Writers to avoid problems.
ASKER
I tried this. It certainly sounds like it should be right on the money, but my user is still reporting that the problem exists. Now instead of junk characters, "?" is inserted.
Here are the interesting bits of the code:
private OutputStreamWriter out;
private InputStreamReader in;
private Socket nc = null;
...
out = new OutputStreamWriter(nc.getOutputStream());
in = new InputStreamReader(nc.getInputStream());
...
StringBuffer inbuff = new StringBuffer();
int inint;
while ((inint = in.read()) != -1) {
    char inchar = (char) inint;
    if (inchar == '\r') continue;
    if (inchar == '\n') break;
    inbuff.append(inchar);
}
buff = inbuff.toString();
...
out.write(s);
out.write("\r\n");
out.flush();
Try
in = new InputStreamReader(nc.getInputStream(), "Unicode");
to specify that you are reading Unicode chars.
Otherwise, if you're using UTF format, specify
in = new InputStreamReader(nc.getInputStream(), "UTF");
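For reference, here is a minimal sketch of what the UTF suggestion looks like in current standard Java, where the charset is named "UTF-8". The thread does not establish which name, if any, the Microsoft VM accepts, and the class and helper names below are hypothetical. The point it illustrates is that ASCII protocol text stays at one byte per character, while only the non-ASCII characters expand to multi-byte sequences:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class Utf8WriterSketch {
    // Hypothetical helper: encodes one protocol line the way a socket writer would.
    static byte[] encodeLine(String line) {
        // The in-memory sink stands in for nc.getOutputStream().
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            Writer out = new OutputStreamWriter(sink, StandardCharsets.UTF_8);
            out.write(line);
            out.write("\r\n");
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        // 13 ASCII bytes, then Greek small letter beta as two bytes, then CR LF.
        byte[] wire = encodeLine("PRIVMSG bob :\u03B2");
        for (byte b : wire) {
            System.out.print(Integer.toHexString(b & 0xFF) + " ");
        }
        System.out.println();
    }
}
```

Whether this helps in practice depends on the receiving clients decoding the message text as UTF-8 as well, which the thread does not say.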
Hi sniles,
In addition to sgoms' comments, use the
readUTF and writeUTF methods of DataInputStream and DataOutputStream.
Best of luck
ASKER
I tried these suggestions, however I'm still having trouble. When I added the "Unicode" parameter to the InputStreamReader constructor, my TCP/IP connection failed with "Missing byte-order mark". This implies that *all* the data (including the command codes (IRC protocol - RFC 1459)) must be in unicode. I'm afraid I can't do that, as I can't always control (and change the data encoding of) the server.
When I added "UTF", I received a "Could not load class: sun.io.ByteToCharUTC" message. I'm using VJ++ 6.0 and Internet Explorer 5.0.
What I'd love is something that would just let me send these bytes (over socket) as they were received from the user's input (which does work -- local echo formats them correctly), then receive these bytes with the high-order bit intact, so that I can just pass
ASKER
I've been able to narrow the problem down a bit. The problem is in writing the string -- not in receiving it.
From another type of client (not the problem software, and hereafter called "the good client"), I can send data such as Arial font's Greek Small Letter Beta (U+03B2) and have it be received and displayed correctly by my problem client software. However, when my problem client software *writes* a character, the "good" client displays only a "?".
Now I have:
out = new BufferedWriter(new OutputStreamWriter(nc.getOutputStream()));
in = new BufferedReader(new InputStreamReader(nc.getInputStream()));
I output data with:
debugMsgs.add("DATA sent: "+s, Color.lightGray);
out.write(s,0,s.length());
out.newLine();
out.flush();
I input data with:
buff=in.readLine();
debugMsgs.add("DATA RCVD: "+buff, Color.lightGray);
When I run this, and try to send Greek Small Letter Beta, the debug message *does* show the correct character. However, the "good" client (which apparently *can* handle sending & receiving of these characters) sees only a "?".
When I then use the "good" client to send Greek Small Letter Beta, it shows up correctly in my "debugMsgs", and on my client's output display.
2 instances of the "good" client can send and receive this character correctly.
I sure hope this helps! This one has me seeing double!
>> the "good" client displays only a "?".
Where do you display it? On the screen? It is possible that your VM (AWT) is not configured to display all Unicode characters.
Check the symbol's Unicode code directly.
ASKER
I don't think that's it. The "?" appears on the "good" client, which is capable of displaying that character (because it can send it, locally echo it, and display it when it is sent to it by another "good" client).
Also, I'm certain my "problem" client can display it because it displays it when the "good" client sends it to it, and when it locally echoes.
My screens look like this:
Scenario 1: "Problem" client sends data:
1. User types the beta character in the input area.
2. The beta character displays correctly in the local echo.
3. The beta character displays correctly in the "DebugMsg": "DATA sent: "
4. The "good" client receives this data as "?". ** this is the problem **
Scenario 2: "Good" client sends data:
1. User types the beta character into the "good" client input area.
2. The beta character displays correctly in the local echo.
3. The beta character appears correct in the "problem" client's DebugMsg: "DATA received: "
4. The beta character appears correctly in the "problem" client's display.
All I want you to do is to print the char code as integer.
char ch = st.charAt(0);
System.out.println("" + (long)ch);
What OSes / JDK versions do you have on both machines?
ASKER
The character (at output from "problem" client) printed as 946.
Using NT 4.0 SP4, JDK 1.1 (under Visual J++ 6.0 and IE 5.0). Using same machine for both clients.
So you receive the correct character code and the only problem is that you can't display it?
ASKER
No, the problem seems to be that when the problem client sends the character code, it corrupts it.
If character code is sent by a good client, my problem client can receive and display the character correctly.
Maybe I'm not clear enough.
I want you to display the character CODE (not the CHARACTER) on both machines (before sending and after receiving) and compare them.
If you get two different codes, please post a small compilable example that reproduces the problem, so that we can try it ourselves...
ASKER
Here's what happens when I try to send the Greek Small Letter Beta character (Unicode U+03B2):
Character code printed just before sending (in the sender), using the code you asked me to insert: 946 (which is correct; 946 is hex 3B2).
Character code printed just after receipt (in the receiver), using the same code: 63 (which is hex 3F, the value for the question mark).
I have posted the sample code you requested to:
http://www.tenet.net/home/steve/ServerTCP.java
and
http://www.tenet.net/home/steve/ClientTCP.java
If this helps to clarify, here is the breakdown of what works and what doesn't.
Good --> Good        Displays OK
Good --> Problem     Displays OK
Problem --> Good     Fails to display
Problem --> Problem  Fails to display
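The change from 946 to 63 is what Java's encoders do when the target charset has no mapping for a character: the character is replaced with the charset's substitution byte, which is '?' (0x3F) for the common single-byte charsets. A minimal sketch, assuming the writer's default encoding is a Western single-byte code page (ISO-8859-1 is used here as a stand-in; the thread does not state the actual default, and the class name is hypothetical):

```java
import java.nio.charset.StandardCharsets;

final class ReplacementDemo {
    // Encodes with a charset that has no mapping for Greek letters.
    static byte[] encodeLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String beta = "\u03B2";                   // Greek small letter beta
        System.out.println((int) beta.charAt(0)); // 946
        byte[] wire = encodeLatin1(beta);
        // The unmappable character has been replaced by a single '?' byte.
        System.out.println(wire.length + " byte(s), first = " + wire[0]); // 1 byte(s), first = 63
    }
}
```

This would also be consistent with the table above: the receive path only sees characters the default charset can represent, while the send path loses anything outside it.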
ASKER
Whoops! I just realized I posted an earlier copy of the ClientTCP.java file. It wasn't complete and I don't think it loaded the test string correctly. If you've already downloaded ClientTCP.java, could you download what's there now?
Thanks.
BufferedReader rcv_packet = new BufferedReader(new DataInputReader(tcpsocket.getInputStream()));
So what is DataInputReader?
ASKER
Sorry again. Wrong version of the ServerTCP.java. It's been replaced.
I compiled both, ran them and verified that I am getting the expected (i.e. erroneous -- reproduced problem) output.
I can still see this line inside your code:
//Connect Stream for communication
BufferedReader rcv_packet = new BufferedReader(new DataInputReader(tcpsocket.getInputStream()));
So what is DataInputReader?
ASKER
I just downloaded it and double-checked. That line is gone and has been replaced by:
//Connect Stream for communication
BufferedReader rcv_packet = new BufferedReader(new InputStreamReader(tcpsocket.getInputStream()));
File change date: 12/7/99 9:16 am
File size: 2,413 bytes
Could you try again, please?
sniles,
On the client side, set up your output stream as
out = new BufferedWriter(new OutputStreamWriter(nc.getOutputStream(), "Unicode"));
and on the server side receive it as
BufferedReader rcv_packet = new BufferedReader(new InputStreamReader(tcpsocket.getInputStream(), "Unicode"));
It prints ? as the char, but correctly prints 946 as the code.
I tested it. It's working fine.
Previously you posted a comment saying that you face problems sending commands if you use this. Can you post your entire code if it still causes problems?
-sgoms
ASKER
That would require me to change the server side of the communication, which I cannot do. I can only change the client side.
Try this scenario.
On the client side:
out = new DataOutputStream(nc.getOutputStream());
bytes = test_string.getBytes("Unicode");
out.write(bytes, 0, bytes.length);
// send the data as bytes encoded in Unicode format
On the server side (no changes):
DataInputStream rcv_packet = new DataInputStream(tcpsocket.getInputStream());
message = rcv_packet.readLine();
charAt(0) will not work in this case because it will only print 255, but I tried it and it printed the same chars.
Try it and let me know how it goes.
-sgoms
ASKER
Unfortunately, the ServerTCP is just a testing stub... it is not my actual server program. Your example converts the entire string to unicode, which my server would need to be recoded for -- something I cannot do.
sniles,
I am doing the encoding only on the client side. The server side remains unchanged. You can read the data from the server as a string using readLine.
Only the data that is in Unicode format needs to be encoded using getBytes("Unicode") on the client side.
Did you test this logic on your side? THE SERVER REMAINS UNALTERED.
-sgoms
ASKER
The server is an IRC-style server written in C++. It's not Java. When it reads in the bytes, instead of the command verb it expects it's getting a unicode version of the command verb, which it can't understand.
Example: it needs:
PRIVMSG #channelNameHere :dataHere
The bytes it expects in its protocol are:
P R I V M S G
0 1 2 3 4 5 6
(byte position)
With your suggestion, it gets:
0x00 P    0x00 R    0x00 I    0x00 V    ...
0    1    2    3    4    5    6    7
(byte position)
The server (the real server, not the stub I provided you with) is a Windows C++ application which does not handle Unicode in its protocol.
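The byte layout described above can be reproduced with a short sketch. It uses the standard "UTF-16BE" charset, which writes each char as two big-endian bytes with no byte-order mark, to match the illustration; the "Unicode" encoding used in the thread additionally prepends 0xFE 0xFF. The class and helper names are hypothetical:

```java
import java.nio.charset.StandardCharsets;

final class WideCommandDemo {
    // Renders a byte array as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Every ASCII letter of the command verb gains a leading zero byte.
        byte[] wide = "PRIVMSG".getBytes(StandardCharsets.UTF_16BE);
        System.out.println(hex(wide)); // 00 50 00 52 00 49 00 56 00 4D 00 53 00 47
    }
}
```

A server that compares the first seven bytes against "PRIVMSG" will therefore never match a verb encoded this way.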
From the client side, encode whatever data needs to be sent as Unicode; send everything else as a normal byte array.
////
char greek = 0x3B2;
String test_string1 = (new Character(greek)).toString();
String test_string2 = "PRIVMSG #channelNameHere :dataHere";
////
byte[] bytes1 = test_string1.getBytes("Unicode");
out.write(bytes1, 0, bytes1.length);
byte[] bytes2 = test_string2.getBytes();
out.write(bytes2, 0, bytes2.length);
out.flush();
///
Will this suit your purposes? No tampering with the server side.
-sgoms
ASKER
That seems like a promising idea, however when the data is passed through the server to the other client, it is not properly reconstructed.
I send (as string test_string2 in your example above):
PRIVMSG bob :
Then I send (as test_string1 in your example above, using .getBytes("Unicode")):
0x2561
The client (using a BufferedReader) sees:
PRIVMSG bob :þÿ%a
The part after the colon is:
0xFE 0xFF 0x25 0x61 (each in its own char).
It looks like I need to do something on the receiving end to pull appropriate bytes in as unicode chars.
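The four bytes reported above are a byte-order mark (0xFE 0xFF) followed by the big-endian encoding of U+2561. A minimal sketch of the two ways a receiver can interpret them, assuming the receiving reader decodes with a single-byte Western charset (ISO-8859-1 stands in for it here; the class and helper names are hypothetical):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomDecodeDemo {
    static String decode(byte[] payload, Charset charset) {
        return new String(payload, charset);
    }

    public static void main(String[] args) {
        byte[] payload = {(byte) 0xFE, (byte) 0xFF, 0x25, 0x61};

        // One char per byte: thorn, y-diaeresis, '%', 'a'.
        String asLatin1 = decode(payload, StandardCharsets.ISO_8859_1);
        System.out.println(asLatin1.length()); // 4

        // The UTF-16 decoder consumes the byte-order mark and yields one char.
        String asUtf16 = decode(payload, StandardCharsets.UTF_16);
        System.out.println(Integer.toHexString(asUtf16.charAt(0))); // 2561
    }
}
```

So the bytes arrive intact; the receiver needs to know which part of the line to decode as 16-bit data, which is the missing piece the asker identifies.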
It seems that I can't follow you. You have a server that expects plain text commands, and you want to send Unicode commands to it?
What's the problem with sending plain text?
String st = "/nick";
OutputStream os = ...
os.write(st.getBytes());
.. Turkish, Thai ...
How can you send a different '/nick' command in Thai?
Or does your server expect 'plain commands' and 'Unicode text messages'?
ASKER
The server expects plain text commands. It doesn't care about the message data. I have a Thai user (call him "Larry") who types Thai into his input area. Larry's client software takes this input and creates a command that looks like this:
PRIVMSG Bob: whateverHeTypedInGoesHere
The Server's job is to receive this PRIVMSG command and route it to Bob.
Bob's client software's job is to recognize this PRIVMSG command, then display the message part of it so that it appears as the same message that Larry typed in.
You're correct that under this scheme, the Thai users must use plain text names for themselves. At this point, I'm just shooting for the typed-in message data.
Sorry if that all got confusing. I was hoping to present this to you without having to bog you down with too many details. The problem turned out to be a bit more complex than I first thought.
ASKER CERTIFIED SOLUTION
ASKER
Thanks! You were right. I wound up needing to just transmit everything as 8-bit values, then maintain a shift-in and shift-out Unicode flag in the text. Then, in my code, I do the byte separating on output and byte addition upon input. This seems to work well, and the user is happy (well, as happy as they ever get, anyways :) )
Thanks also to sgoms and ravindra76!
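The accepted solution and the asker's actual wire format are not shown in the thread, so the following is only one possible sketch of a shift-flag scheme, not the asker's code. Everything in it is an assumption: the class name, the use of the ASCII SO (0x0E) and SI (0x0F) control codes as markers, and in particular the choice to write each escaped char as four hex digits instead of two raw bytes. That substitution is deliberate: raw byte splitting can emit 0x0A or 0x0D (for example from Thai U+0E0A and U+0E0D), which a line-based protocol would treat as end of line.

```java
final class ShiftCodec {
    static final char SHIFT_OUT = 0x0E; // hypothetical marker: escaped run begins
    static final char SHIFT_IN  = 0x0F; // hypothetical marker: escaped run ends

    /** Encodes text so that every emitted char is in the 7-bit ASCII range. */
    static String encode(String text) {
        StringBuilder sb = new StringBuilder();
        boolean wide = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            // Non-ASCII chars, and the markers themselves, must be escaped.
            boolean needsEscape = c > 0x7F || c == SHIFT_OUT || c == SHIFT_IN;
            if (needsEscape != wide) {
                sb.append(needsEscape ? SHIFT_OUT : SHIFT_IN);
                wide = needsEscape;
            }
            if (wide) {
                sb.append(String.format("%04X", (int) c));
            } else {
                sb.append(c);
            }
        }
        if (wide) sb.append(SHIFT_IN);
        return sb.toString();
    }

    /** Reverses encode(); assumes well-formed input (4 hex digits per escaped char). */
    static String decode(String wire) {
        StringBuilder sb = new StringBuilder();
        boolean wide = false;
        int i = 0;
        while (i < wire.length()) {
            char c = wire.charAt(i);
            if (c == SHIFT_OUT) {
                wide = true;
                i++;
            } else if (c == SHIFT_IN) {
                wide = false;
                i++;
            } else if (wide) {
                sb.append((char) Integer.parseInt(wire.substring(i, i + 4), 16));
                i += 4;
            } else {
                sb.append(c);
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "PRIVMSG bob :\u03B2";
        String wire = encode(original);
        System.out.println(decode(wire).equals(original)); // true
    }
}
```

Because the encoded line is pure ASCII, it survives any writer or server that handles 8-bit text, and only clients that understand the markers need to change.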
sniles,
Glad that you got it solved.
What I found was:
On the client side, when you send the Unicode character with value 946 and get its byte array, it gets printed as
-1(11111111 11111111)
-2(11111111 11111110)
78(11111111 11001110)
-3(11111111 11111101)
On the server side the data was altered to
(11111111)
(11111110)
(11001110)
(11111101)
Because char is unsigned, you lose the first eight bits.
If you use message.getBytes() and print,
byte[] b = message.getBytes();
for (int i = 0; i < b.length; i++)
    System.out.println((long) b[i]);
you get the actual data.
-sgoms
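As background to the negative numbers in that listing: Java's byte type is signed, so any value above 0x7F prints as a negative number and is sign-extended when widened. Masking with 0xFF recovers the unsigned value. A minimal sketch (the class and helper names are hypothetical):

```java
final class SignedByteDemo {
    // Returns the byte's value in the range 0..255.
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xFE;
        System.out.println(b);           // -2
        System.out.println(unsigned(b)); // 254

        // Casting straight to char sign-extends first, giving 0xFFFE.
        System.out.println(Integer.toHexString((char) b));             // fffe
        // Masking first gives the intended 0x00FE.
        System.out.println(Integer.toHexString((char) unsigned(b)));   // fe
    }
}
```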