mann061997


How to read *both* ASCII and UNICODE

I read input data using a *Reader object. This is instantiated with the default byteToCharConverter. This is fine for reading ASCII - but how do I read UNICODE?
And assuming I had an appropriate converter for UNICODE, how  would I be able to read ASCII?
fadl

I don't understand exactly what your problem is.

ASCII is AFAIK a subset of UNICODE. The only difference between the two is that ASCII fits in 1 byte while UNICODE takes 2 bytes, so you must somehow decide whether you will read your input stream byte by byte or two bytes at a time. Note that Java 1.1.* has *Reader classes for things like UNICODE and *InputStream classes for good old single-byte ASCII.

Please be more specific...

Michal
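
To make that split concrete, here is a minimal sketch, assuming a hypothetical file input.txt and the historic Sun encoding name 8859_1: the bare InputStream hands back raw bytes, while the Reader applies a named byte-to-char converter.

    import java.io.*;

    public class BytesVersusChars {
        public static void main(String[] args) throws IOException {
            // Raw bytes: an InputStream returns them one at a time.
            InputStream bytes = new FileInputStream("input.txt");
            int b = bytes.read(); // 0..255, no conversion applied
            bytes.close();

            // Characters: a Reader applies a byte-to-char converter.
            // Naming it explicitly beats relying on the platform default.
            Reader chars = new InputStreamReader(
                    new FileInputStream("input.txt"), "8859_1");
            int c = chars.read(); // a converted char value
            chars.close();
            System.out.println(b + " / " + c);
        }
    }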
mann061997

ASKER

Ok - more to the point:
- how can I determine whether a given input is UNICODE or ASCII?
- is there a "NULL" converter, so I can use a Reader object on
  a UNICODE input stream?

I think you must know what data are coming in your stream. If you don't know whether the input data will be ASCII or UNICODE, then read all the incoming bytes into a byte[] and then go through that array looking for e.g. \n's...

Another solution could be: read the first byte; if it is e.g. 0x0D, then read the rest as ASCII, otherwise read it as UNICODE.


Michal
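
That peek-ahead idea can be sketched with a PushbackInputStream, so the inspected bytes go back on the stream before a Reader is chosen. The 0xFF 0xFE test here (the little-endian byte-order mark that comes up below) stands in for the 0x0D heuristic; this is an illustration under those assumptions, not a robust detector.

    import java.io.*;

    public class PeekEncoding {
        // Inspect the first two bytes, push them back, then pick a
        // converter. 0xFF 0xFE marks little-endian Unicode; anything
        // else falls back to single-byte text.
        public static Reader openReader(InputStream raw) throws IOException {
            PushbackInputStream in = new PushbackInputStream(raw, 2);
            int b1 = in.read();
            int b2 = in.read();
            if (b2 != -1) in.unread(b2); // push back in reverse order,
            if (b1 != -1) in.unread(b1); // so reads return b1 then b2
            String enc = (b1 == 0xFF && b2 == 0xFE) ? "UnicodeLittle"
                                                    : "8859_1";
            return new InputStreamReader(in, enc);
        }
    }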
There is no need to determine whether a particular text string is Unicode or ASCII. Use the Reader classes, as you do, and you will properly read either. Unicode is written and read in a format known as UTF-8, which has the very nice property that every ASCII character takes the exact same single byte that it would in an ASCII string.

The only issue you could have would be if you were trying to read something in another encoding, such as Big5 or a platform-specific non-Roman mapping. In that case, you would indeed need to use an InputStream and convert the resulting byte[] explicitly, specifying the converter.
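
A sketch of that explicit conversion, assuming the bytes are already in hand and that the runtime ships a Big5 converter; the sample bytes encode a single Big5 character.

    import java.io.UnsupportedEncodingException;

    public class ExplicitConvert {
        public static void main(String[] args)
                throws UnsupportedEncodingException {
            // Bytes as they might arrive from an InputStream; here a
            // hard-coded Big5 sequence purely for illustration.
            byte[] raw = { (byte) 0xA4, (byte) 0xA4 }; // one Big5 character
            // Since Java 1.1 the String constructor takes the converter
            // name directly and throws UnsupportedEncodingException if
            // the runtime has no converter by that name.
            String text = new String(raw, "Big5");
            System.out.println(text.length()); // 1
        }
    }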
Sorry russgold, but it doesn't seem to work that way - it's what I've been doing all along.
The Reader doesn't seem to detect UNICODE, so every other character is 0x00. The source data is bona fide UNICODE: it starts with 0xFF 0xFE and was created by Notepad.
Of course, I could check for the 0xFF 0xFE myself and discard every other byte, but I was hoping this would be handled by the Reader classes.
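
The interleaved 0x00 characters are easy to reproduce: with a single-byte converter such as 8859_1, each two-byte little-endian character decodes as the character followed by a NUL. A small sketch:

    import java.io.*;

    public class ShowNulls {
        public static void main(String[] args) throws IOException {
            // "AB" the way Notepad writes little-endian Unicode:
            // byte-order mark, then low byte first for each character.
            byte[] data = { (byte) 0xFF, (byte) 0xFE, 0x41, 0x00, 0x42, 0x00 };
            Reader r = new InputStreamReader(
                    new ByteArrayInputStream(data), "8859_1");
            int ch;
            while ((ch = r.read()) != -1)
                System.out.print(Integer.toHexString(ch) + " "); // ff fe 41 0 42 0
        }
    }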
It appears that I have misunderstood your question. Why are you trying to read UNICODE directly? Java uses it internally, but expects to read and write text in another format. If you simply want the full range of characters possible in UNICODE, you can use UTF-8.
ASKER CERTIFIED SOLUTION
msmolyak
I don't think you can use the same encoding to read both ASCII and Unicode. UTF-8 is not ASCII and it is not Unicode. You can read the stream as UTF-8 only if it was written as UTF-8. (UTF-8 can use between 1 and 3 bytes per character, since it needs extra bits to store the number of bytes it uses.)

Thus you would have to treat each data source individually, using the encoding which created it.
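
For instance, a sketch of one-encoding-per-source, with hypothetical file names and the historic Sun encoding names ASCII, UnicodeLittle, and UTF8:

    import java.io.*;

    public class PerSourceReaders {
        // Open each input with the converter matching how it was written.
        public static BufferedReader open(String file, String enc)
                throws IOException {
            return new BufferedReader(
                    new InputStreamReader(new FileInputStream(file), enc));
        }

        public static void main(String[] args) throws IOException {
            BufferedReader ascii   = open("legacy.txt",  "ASCII");         // 7-bit data
            BufferedReader notepad = open("notepad.txt", "UnicodeLittle"); // 0xFF 0xFE files
            BufferedReader utf8    = open("modern.txt",  "UTF8");          // UTF-8 data
            System.out.println(ascii.readLine());
            System.out.println(notepad.readLine());
            System.out.println(utf8.readLine());
        }
    }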
UnicodeLittle did the trick. What's the difference between these Unicode variants? Where can I find some info about the available encoding strings?
Unfortunately, Sun's byte-to-char converters are not documented. But at least you can look up their class names (and decompile the code if you are very adventurous). The class name's suffix is the encoding string to use.

I think the difference between UnicodeBig and UnicodeLittle is the order of the bytes (upper byte first or lower byte first). Since there are only two, it's easy to establish the right one by experimentation.
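
The byte order is easy to check by encoding a single character both ways; a quick sketch, with the output shown in the comments:

    public class ByteOrderDemo {
        public static void main(String[] args) throws Exception {
            // 'A' is U+0041. Each converter writes a byte-order mark
            // followed by the two bytes of the character.
            dump("UnicodeBig",    "A".getBytes("UnicodeBig"));    // fe ff 0 41
            dump("UnicodeLittle", "A".getBytes("UnicodeLittle")); // ff fe 41 0
        }

        static void dump(String label, byte[] b) {
            System.out.print(label + ":");
            for (int i = 0; i < b.length; i++)
                System.out.print(" " + Integer.toHexString(b[i] & 0xFF));
            System.out.println();
        }
    }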