sniles

asked on

Turkish, Thai (& other) fonts/charsets are being corrupted

My application transmits and receives data over DataInputStreams and DataOutputStreams (attached to Sockets).  For Turkish, Thai and several other character sets and fonts, users are reporting that the text is being transmitted (or received) incorrectly.  Characters are being changed from what was typed.

How should I correct this?
heyhey_

Use Readers and Writers instead of DataInputStream or String.getBytes().
"The most important reason for adding the Reader and Writer hierarchies in Java 1.1 is for internationalization. The old IO stream hierarchy supports only 8-bit byte streams and doesn’t handle the 16-bit Unicode characters well. Since Unicode is used for internationalization (and Java’s native char is 16-bit Unicode), the Reader and Writer hierarchies were added to support Unicode in all IO operations. In addition, the new libraries are designed for faster operations than the old. "

Because you're sending Thai/Turkish text, which needs Java's 16-bit Unicode chars, switch to Readers/Writers to avoid problems.
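
A minimal sketch of that advice, assuming both ends of the connection agree on UTF-8 (the thread does not establish which encoding the server or the other clients actually use). In-memory streams stand in for the socket streams, the class and method names are illustrative only, and `StandardCharsets` is a modern API that did not exist in JDK 1.1.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetRoundTrip {

    // Writes one line through a Writer and reads it back through a Reader,
    // naming the same encoding on both sides instead of relying on the default.
    static String roundTrip(String line) {
        try {
            ByteArrayOutputStream wire = new ByteArrayOutputStream(); // stand-in for the socket
            Writer out = new BufferedWriter(
                    new OutputStreamWriter(wire, StandardCharsets.UTF_8));
            out.write(line);
            out.write("\r\n");
            out.flush();

            BufferedReader in = new BufferedReader(new InputStreamReader(
                    new ByteArrayInputStream(wire.toByteArray()), StandardCharsets.UTF_8));
            return in.readLine(); // readLine() strips the line terminator
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String received = roundTrip("PRIVMSG bob :\u03B2");
        // Print the character CODE of the last char, not the glyph.
        System.out.println((int) received.charAt(received.length() - 1)); // prints 946
    }
}
```

The important part is that the writer and the reader name the same encoding; which encoding that should be depends on what the other end expects.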
sniles

ASKER

I tried this. It certainly sounds like it should be right on the money, but my user is still reporting that the problem exists. Now instead of junk characters, "?" is inserted.

Here are the interesting bits of the code:

private      OutputStreamWriter  out;
private InputStreamReader  in;      
private      Socket nc = null;
  o
  o
  o
out = new OutputStreamWriter(nc.getOutputStream());
in = new InputStreamReader(nc.getInputStream( ));
  o
  o
StringBuffer inbuff = new StringBuffer();
int inint;
                  
while ( (inint = in.read()) != -1) {
  char inchar = (char) inint;
  if (inchar == '\r') continue;
  if (inchar == '\n') break;
                        
  inbuff.append(inchar);
}
buff = inbuff.toString();
  o
  o
out.write(s);
out.write("\r\n");
out.flush();


Try:

in = new InputStreamReader(nc.getInputStream( ),"Unicode");

to specify that you are reading Unicode chars.
Otherwise, if you're using a UTF format, specify:

in = new InputStreamReader(nc.getInputStream( ),"UTF");
Hi sniles,

In addition to sgoms's comments, use the
readUTF and writeUTF methods of DataInputStream and DataOutputStream.

Best of luck
sniles

ASKER

I tried these suggestions, however I'm still having trouble.  When I added the "Unicode" parameter to the InputStreamReader constructor, my TCP/IP connection failed with "Missing  byte-order mark".  This implies that *all* the data (including the command codes (IRC protocol - RFC 1459)) must be in unicode.  I'm afraid I can't do that, as I can't always control (and change the data encoding of) the server.

When I added "UTF", I received a "Could not load class: sun.io.ByteToCharUTC" message.  I'm using VJ++ 6.0 and Internet Explorer 5.0.

What I'd love is something that would just let me send these bytes (over socket) as they were received from the user's input (which does work -- local echo formats them correctly), then receive these bytes  with the high-order bit intact, so that I can just pass
sniles

ASKER

I've been able to narrow the problem down a bit. The problem is in writing the string -- not in receiving it.

From another type of client (not the problem software, and hereafter called "the good client"), I can send data such as Arial font's Greek Small Letter Beta (U+03B2) and have it received and displayed correctly by my problem client software.  However, when my problem client software *writes* a character, the "good" client displays only a "?".

Now I have:

out = new BufferedWriter(new OutputStreamWriter(nc.getOutputStream()));
in = new BufferedReader(new InputStreamReader(nc.getInputStream( )));


I output data with:

debugMsgs.add("DATA sent: "+s, Color.lightGray);                  
out.write(s,0,s.length());
out.newLine();
out.flush();


I input data with:

buff=in.readLine();
debugMsgs.add("DATA RCVD: "+buff, Color.lightGray);      


When I run this and try to send Greek Small Letter Beta, the debug message *does* show the correct character. However, the "good" client (which apparently *can* handle sending & receiving of these characters) sees only a "?".

When I then use the "good" client to send Greek Small Letter Beta, it shows up correctly in my "debugMsgs", and on my client's output display.

2 instances of the "good" client can send and receive this character correctly.

I sure hope this helps!  This one has me seeing double!
>> the "good" client displays only a "?".

Where do you display it? On the screen? It is possible that your VM (AWT) is not configured to display all Unicode characters.

Check the character's Unicode code directly.
sniles

ASKER

I don't think that's it.  The "?" appears on the "good" client, which is capable of displaying that character (because it can send it, locally echo it, and display it when it is sent to it by another "good" client).

Also, I'm certain my "problem" client can display it because it displays it when the "good" client sends it to it, and when it locally echoes.

My screens look like this:

Scenario 1: "Problem" client sends data:
1. User types the beta character in the input area.
2. The beta character displays correctly in the local echo.
3. The beta character displays correctly in the "DebugMsg": "DATA sent: "
4. The "good" client receives this data as "?".  ** this is the problem **

Scenario 2: "Good" client sends data:
1. User types the beta character into the "good" client input area.
2. The beta character displays correctly in the local echo.
3. The beta character appears correct in the "Problem" client's DebugMsg: "DATA RCVD: "
4. The beta character appears correctly in the "problem" client's display.
All I want you to do is print the char code as an integer.

char ch = st.charAt(0);
System.out.println("" + (long)ch);

What OS and JDK versions do you have on both machines?
sniles

ASKER

The character (at output from "problem" client) printed as 946.

Using NT 4.0 SP4, JDK 1.1 (under Visual J++ 6.0 and IE 5.0).  Using same machine for both clients.
So you receive the correct character code and the only problem is that you can't display it?
sniles

ASKER

No, the problem seems to be that when the problem client sends the character code, it corrupts it.

If character code is sent by a good client, my problem client can receive and display the character correctly.
Maybe I wasn't clear enough.
I want you to display the character CODE (not the CHARACTER) on both machines (before sending and after receiving) and compare the two.

If you get two different codes, please post a small compilable example that reproduces the problem, so that we can try it ourselves.
sniles

ASKER

Here's what happens when I try to send the Greek Small Letter Beta character (Unicode 3B2):

Print of the character just before send (in the sender), using the code you asked me to insert:  946     (which is correct; 946 is hex 3B2)

Print of the character just after receipt (in the receiver), using the same code:  63     (which is hex 3F, the value for the question mark)

I have posted the sample code you requested to:
http://www.tenet.net/home/steve/ServerTCP.java
and
http://www.tenet.net/home/steve/ClientTCP.java

If this helps to clarify, here is the breakdown of what works and what doesn't.

Good --> Good   Displays OK
Good --> Problem  Displays OK
Problem --> Good Fails to display
Problem --> Problem Fails to Display
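
One plausible explanation for 946 turning into 63, sketched under assumptions: a Writer created without an encoding argument uses the platform default encoding, and if that default is a single-byte encoding with no Greek letters, the encoder substitutes "?" for every character it cannot map. ISO-8859-1 stands in here for whatever the default was on the machine in question, which the thread does not state; the class and method names are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class UnmappableCharDemo {

    // Encodes the string to bytes, decodes it again with the same charset,
    // and returns the code of the first character that comes back.
    static int codeAfterRoundTrip(String s, Charset charset) {
        byte[] wire = s.getBytes(charset);
        return new String(wire, charset).charAt(0);
    }

    public static void main(String[] args) {
        String beta = "\u03B2"; // Greek Small Letter Beta, decimal 946

        // ISO-8859-1 has no beta, so the encoder writes '?' (0x3F) instead.
        System.out.println(codeAfterRoundTrip(beta, StandardCharsets.ISO_8859_1)); // prints 63

        // UTF-8 can represent every Unicode character, so the code survives.
        System.out.println(codeAfterRoundTrip(beta, StandardCharsets.UTF_8)); // prints 946
    }
}
```

The substitution happens on the sending side, which would fit the table above: anything written by the "problem" client is already damaged before it reaches the wire.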

sniles

ASKER

Whoops!  I just realized I posted an earlier copy of the ClientTCP.java file.  It wasn't complete and I don't think it loaded the test string correctly.  If you've already downloaded ClientTCP.java, could you download what's there now?

Thanks.
BufferedReader rcv_packet = new BufferedReader(new DataInputReader(tcpsocket.getInputStream()));


so what is DataInputReader ?
sniles

ASKER

Sorry again.  Wrong version of the ServerTCP.java.  It's been replaced.

I compiled both, ran them and verified that I am getting the expected (i.e. erroneous -- reproduced problem) output.
I still can see this line inside your code

  //Connect Stream for communication
    BufferedReader rcv_packet = new BufferedReader(new DataInputReader(tcpsocket.getInputStream()));


so what is DataInputReader ?
sniles

ASKER

I just downloaded it and double-checked.  That line is gone and has been replaced by:

//Connect Stream for communication
BufferedReader rcv_packet = new BufferedReader(new InputStreamReader(tcpsocket.getInputStream()));

File change date: 12/7/99 9:16 am
File size: 2,413 bytes

Could you try again, please?
sniles,

On the client side, set up your output stream as:
out = new BufferedWriter( new OutputStreamWriter(nc.getOutputStream(),"Unicode"));

and on the server side receive it as:
  BufferedReader rcv_packet = new BufferedReader(new InputStreamReader(tcpsocket.getInputStream(),"Unicode"));

It prints ? as the char, but correctly prints 946 as the code.

I tested it and it's working fine.
Previously you posted a comment saying that you have problems sending commands if you use this. Can you post your entire code if it still causes problems?

-sgoms

sniles

ASKER

That would require me to change the server side of the communication, which I cannot do.  I can only change the client side.
Try this scenario.

On the client side:
out = new DataOutputStream(nc.getOutputStream());
byte[] bytes = test_string.getBytes("Unicode");
out.write( bytes, 0, bytes.length);

//send the data as bytes, encoded in Unicode format

On the server side: //no changes
DataInputStream rcv_packet = new DataInputStream(tcpsocket.getInputStream());
message = rcv_packet.readLine();

charAt(0) will not work in this case because it will only print 255, but I tried it and it printed the same chars.

Try it and let me know how it goes.
-sgoms
sniles

ASKER

Unfortunately, the ServerTCP is just a testing stub... it is not my actual server program.  Your example converts the entire string to unicode, which my server would need to be recoded for -- something I cannot do.
sniles,

I am doing the encoding only on the client side. The server side remains unchanged. You can read the data on the server as a string using readLine.

Only the data that is in Unicode format needs to be encoded with getBytes("Unicode") on the client side.

Did you test this logic on your side? The SERVER REMAINS UNALTERED.

-sgoms
sniles

ASKER

The server is an IRC-style server written in C++.  It's not Java. When it reads in the bytes, instead of the command verb it expects it's getting a unicode version of the command verb, which it can't understand.

Example: it needs:
PRIVMSG #channelNameHere :dataHere

The bytes it expects in its protocol are:
P  R  I  V  M  S  G  
0  1  2  3  4  5  6
(byte position)

With your suggestion, it gets:
0x00  P  0x00  R  0x00  I  0x00  V ...
0     1    2   3    4   5   6    7
(byte position)


The server (the real server, not the stub I provided you with) is a Windows C++ application which does not handle Unicode in its protocol.
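
The two byte layouts above can be reproduced with a short sketch. It assumes the legacy "Unicode" encoding name behaves like big-endian UTF-16 for the text itself (`UTF_16BE` is used so that no byte-order mark is added); the `hex` helper and class name are made up for the example.

```java
import java.nio.charset.StandardCharsets;

final class CommandBytes {

    // Formats a byte array as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // What the server's protocol expects: one byte per letter.
        System.out.println(hex("PRIVMSG".getBytes(StandardCharsets.US_ASCII)));
        // prints 50 52 49 56 4D 53 47

        // What it receives when the whole line is encoded as 16-bit units.
        System.out.println(hex("PRIVMSG".getBytes(StandardCharsets.UTF_16BE)));
        // prints 00 50 00 52 00 49 00 56 00 4D 00 53 00 47
    }
}
```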
On the client side, encode whatever data you need to send as Unicode; send everything else as a normal byte array.

////

char greek=0x3B2;
String test_string1 = (new Character(greek)).toString();
String test_string2 = "PRIVMSG #channelNameHere :dataHere";

////

byte[] bytes1=test_string1.getBytes("Unicode");
out.write(bytes1,0,bytes1.length);
byte[] bytes2=test_string2.getBytes();
out.write(bytes2,0,bytes2.length);
out.flush();

///

Will this suit your purposes? No tampering with the server side.
-sgoms
sniles

ASKER

That seems like a promising idea, however when the data is passed through the server to the other client, it is not properly reconstructed.

I send (as string test_string2 in your example above):
PRIVMSG bob :

Then I send (as test_string1 in your example above, using .getBytes("Unicode")):
0x2561

The client (using a BufferedReader) sees:
PRIVMSG bob :þÿ%a

The part after the colon is:
0xFE 0xFF 0x25 0x61   (each in its own char).

It looks like I need to do something on the receiving end to pull appropriate bytes in as unicode chars.
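
Those four bytes are consistent with a UTF-16 byte-order mark (0xFE 0xFF) followed by the two bytes of U+2561, read back one byte per char. A sketch, assuming the legacy "Unicode" name corresponds to what modern Java calls UTF-16 and that the receiving reader decodes with a one-byte-per-char encoding such as ISO-8859-1 (neither is confirmed in the thread); the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class BomDemo {

    // Java's UTF-16 encoder writes a big-endian byte-order mark first.
    static byte[] encode(String s) {
        return s.getBytes(StandardCharsets.UTF_16);
    }

    // What a reader sees if it treats every byte as one character.
    static String misread(byte[] wire) {
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // What a reader sees if it decodes the same bytes as UTF-16.
    static String reread(byte[] wire) {
        return new String(wire, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] wire = encode("\u2561");
        for (byte b : wire) {
            System.out.println(String.format("0x%02X", b & 0xFF)); // 0xFE, 0xFF, 0x25, 0x61
        }
        System.out.println(misread(wire).length());        // prints 4
        System.out.println((int) reread(wire).charAt(0));  // prints 9569, which is hex 2561
    }
}
```

So the data does arrive intact; the receiver has to know which bytes of the line belong to the 16-bit part before it can reassemble them.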
It seems that I can't follow you. You have a server that expects plain text commands, and you want to send Unicode commands to it?

What's the problem with sending plain text?
String st = "/nick";
OutputStream os = ...
os.write(st.getBytes());
.. Turkish, Thai ...

How can you send a different '/nick' command in Thai?

Or does your server expect 'plain commands' and 'Unicode text messages'?
sniles

ASKER

The server expects plain text commands. It doesn't care about the message data.  I have a Thai user (call him "Larry") who types Thai into his input area. Larry's client software takes this input and creates a command that looks like this:

PRIVMSG Bob :whateverHeTypedInGoesHere

The Server's job is to receive this PRIVMSG command and route it to Bob.  

Bob's client software's job is to recognize this PRIVMSG command, then display the message part of it so that it appears as the same message that Larry typed in.

You're correct that under this scheme, the Thai users must use plain text names for themselves.  At this point, I'm just shooting for the typed-in message data.

Sorry if that all got confusing. I was hoping to present this to you without having to bog you down with too many details.  The problem turned out to be a bit more complex than I first thought.
ASKER CERTIFIED SOLUTION
heyhey_

sniles

ASKER

Thanks!  You were right.  I wound up needing to just transmit everything as 8-bit values, then maintain a shift-in and shift-out Unicode flag in the text. Then, in my code, I do the byte separating on output and the byte addition on input.  This seems to work well, and the user is happy (well, as happy as they ever get, anyways :)  )
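
A hypothetical sketch of a shift-in/shift-out scheme along those lines. It is not the asker's actual code, which is not shown in the thread. The marker bytes 0x0E and 0x0F are picked only for illustration, and each non-ASCII char is written as four hex digits rather than as two raw bytes, a choice made for this sketch so that the payload can never contain CR, LF or NUL and break a line-based protocol.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class ShiftCodec {

    // Hypothetical marker bytes, chosen only for this example.
    static final int SHIFT_OUT = 0x0E; // switch to "wide" mode
    static final int SHIFT_IN = 0x0F;  // switch back to plain 7-bit mode

    // Plain ASCII chars are written as single bytes; everything else is
    // written between the markers as four uppercase hex digits per char.
    static byte[] encode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        boolean wide = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean needsWide = c > 0x7F || c == SHIFT_OUT || c == SHIFT_IN;
            if (needsWide != wide) {
                out.write(needsWide ? SHIFT_OUT : SHIFT_IN);
                wide = needsWide;
            }
            if (wide) {
                String hex = String.format("%04X", (int) c);
                for (int j = 0; j < hex.length(); j++) {
                    out.write(hex.charAt(j));
                }
            } else {
                out.write(c);
            }
        }
        if (wide) {
            out.write(SHIFT_IN); // always end the line in plain mode
        }
        return out.toByteArray();
    }

    // Reverses encode(). Truncated or malformed input is not handled here.
    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        boolean wide = false;
        int i = 0;
        while (i < data.length) {
            int b = data[i] & 0xFF;
            if (b == SHIFT_OUT) {
                wide = true;
                i++;
            } else if (b == SHIFT_IN) {
                wide = false;
                i++;
            } else if (wide) {
                String hex = new String(data, i, 4, StandardCharsets.US_ASCII);
                sb.append((char) Integer.parseInt(hex, 16));
                i += 4;
            } else {
                sb.append((char) b);
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] wire = encode("PRIVMSG bob :\u03B2");
        System.out.println(wire.length);                        // prints 19
        System.out.println((int) decode(wire).charAt(13));      // prints 946
    }
}
```

The command verb and target stay byte-for-byte what the server expects, and only the message text carries the markers, so both clients have to implement the same scheme for it to work.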

Thanks also to sgoms and ravindra76!
sniles,

Glad that you got it solved.
What I found was this: on the client side, when you send the Unicode character with value 946 and print its byte array, you get:
-1(11111111 11111111)
-2(11111111 11111110)
78(11111111 11001110)
-3(11111111 11111101)

On the server side the data was altered to:
(11111111)
(11111110)
(11001110)
(11111101)

Because char is unsigned, you lose the first eight bits.
If you use message.getBytes() and print:
byte[] b=message.getBytes();
for(int i=0;i<b.length;i++)
  System.out.println((long)b[i]);

you get the actual data.

-sgoms
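
For reference when reading dumps like the one above: Java's `byte` is signed, so any value of 0x80 or higher prints as a negative number, and masking with `0xFF` recovers the 0 to 255 value that is actually on the wire. A minimal sketch; the two byte values are chosen for illustration and the class name is made up.

```java
final class UnsignedBytes {

    // Widens a signed byte to an int in the range 0..255.
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte[] wire = { (byte) 0xFF, (byte) 0xFE };
        for (byte b : wire) {
            System.out.println(b + " -> " + unsigned(b));
        }
        // prints:
        // -1 -> 255
        // -2 -> 254
    }
}
```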