sneeuw

How to detect Unicode and wide-char text?

Hi,

Suppose I read some text from somewhere but I don't know whether the text uses 1 or 2 bytes per character ...
How can I detect which it is?
ASKER CERTIFIED SOLUTION
Avatar of djbusychild
djbusychild

Avatar of jkr
Copy the text into a buffer and use the Win32 API 'IsTextUnicode()' (from the docs):

DWORD IsTextUnicode(
    CONST LPVOID lpBuffer, // pointer to an input buffer to be examined
    int cb,                // the size in bytes of the input buffer
    LPINT lpi              // pointer to flags that condition text examination and receive results
);

The IsTextUnicode function determines whether a buffer probably contains a form of Unicode text. The function uses various statistical and deterministic methods to make its determination, under the control of flags passed via lpi. When the function returns, the results of such tests are reported via lpi.

If all specified tests are passed, the function returns TRUE; otherwise, it returns FALSE.

If you don't want to load the whole file, use a reasonable number of bytes, which must be divisible by 2.

Feel free to ask if you need more information!
Ooops, sorry djbusychild, I was typing when you posted...
Avatar of sneeuw

ASKER

OK, I'll try it !
What do you think of this function then :

WideCharToMultiByte()

After doing a test, it seems that this function returns a question mark '?' for each character that is not a known wide character?
(So I could use this as a test case too?)

What exactly is a MultiByte then? Is this always 1-byte chars (it seemed to be in the test)?

See, I'm struggling here ... ;-)
I read text data from a CD but don't know for sure whether it is 1-byte or 2-byte char text data.
At the moment I test for wide char (this test is flaky, hence this posting in the first place). Next, if I think it is wide char, I convert using
WideCharToMultiByte()
which returns 1-byte characters.
This seems to work out OK for me, but I'm worried: what happens when the text I read out is Chinese or ...
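
On the "is MultiByte always 1 byte?" question: a multi-byte encoding uses a variable number of bytes per character, so Western text often comes out as one byte each while other scripts need more. The sketch below uses UTF-8 purely as an illustration of that idea; it is not what WideCharToMultiByte() does for an ANSI code page, and the helper name encodeUtf8 is made up for this example. It assumes the input is a single 16-bit code unit that is not a surrogate.

```cpp
#include <cstdint>
#include <string>

// Encode one 16-bit code unit (assumed not to be a surrogate,
// i.e. outside 0xD800..0xDFFF) as a UTF-8 byte sequence.
std::string encodeUtf8(std::uint16_t cp) {
    std::string out;
    if (cp < 0x80) {
        // ASCII range: one byte.
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this sketch, 'A' (0x0041) takes one byte, 'é' (0x00E9) takes two, and a Chinese character such as 0x4E2D takes three, which is why counting output bytes on a Western-only test can be misleading.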

Avatar of sneeuw

ASKER

Hi,

Unfortunately it didn't work !!!!

It returned false all the time on data I KNOW is Unicode (or is at least 2 bytes per char) !!

The data is byte-swapped but this function is supposed to be able to deal with it !
>>The data is byte-swapped

What do you mean by 'byte-swapped'? The data has to have the right byte ordering, otherwise the function WILL fail.

You could do a simple test using string literals, e.g.

const wchar_t* pUnicode = L"this is a test string";
const char* pAnsi = "this is a test string";
BTW: Have you tried 'IS_TEXT_UNICODE_REVERSE_ASCII16' or 'IS_TEXT_UNICODE_REVERSE_STATISTICS'?
Avatar of sneeuw

ASKER

> BTW: Have you tried 'IS_TEXT_UNICODE_REVERSE_ASCII16' or 'IS_TEXT_UNICODE_REVERSE_STATISTICS'?

Yes!
This is why I assume the byte-swapped data can be handled?

But I should byte-swap the data myself and try again!
Maybe this routine doesn't like byte-swapping after all.

The data is byte-swapped because (I guess) originally intended for Motorola machines.
>>The data is byte-swapped because (I guess) originally
>>intended for Motorola machines.

Well, in this case, it should work when the data is reversed to an Intel byte order...
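
For reference, reversing the byte order of 16-bit text could look roughly like the sketch below. The function name swapBytePairs and the choice of a std::vector buffer are assumptions made for this example; nothing in the thread shows how the asker actually swapped the data.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Swap the two bytes of every 16-bit code unit in place.
// If the buffer has an odd length, the trailing byte is left untouched.
void swapBytePairs(std::vector<std::uint8_t>& buf) {
    for (std::size_t i = 0; i + 1 < buf.size(); i += 2) {
        std::uint8_t tmp = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = tmp;
    }
}
```

After the swap, text written high byte first (00 41 for 'A') becomes low byte first (41 00), which is the order an Intel machine expects for a wchar_t string on Windows.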
Avatar of djbusychild
djbusychild

no prob, jkr. =)

sneeuw, you're right. the function's not going to
detect endian difference. ;)
Some info I'd like to add to this thread:  Regarding endian differences, one of the things available in Unicode is the concept of a byte order mark; you shouldn't generally have to guess since well-behaved applications that write Unicode text output *should* first output a byte order mark so that readers of the data can know how to interpret each pair of bytes.

Here's some info drawn from MSDN at http://www.msdn.microsoft.com/library/default.asp?URL=/library/psdk/winbase/unicode_42jv.htm


----- 8< quoted from MSDN 8< -----

Byte-order Mark
Always prefix a Unicode plain text file with a byte-order mark. Because Unicode plain text is a sequence of 16-bit code values, it is sensitive to the byte ordering used when the text was written.

A byte-order mark is not a control character that selects the byte order of the text; it simply informs an application receiving the file that the file is byte ordered.

Ideally, all Unicode text would follow only one set of byte-ordering rules. This is not possible, however, because microprocessors differ in the placement of the least significant byte: Intel® and MIPS® processors position the least significant byte first, whereas Motorola processors (and all byte-reversed Unicode files) position it last. With only a single set of byte-ordering rules, users of one type of microprocessor would be forced to swap the byte order every time a plain text file is read from or written to, even if the file is never transferred to another system based on a different microprocessor.

The preferred place to specify byte order is in a file header, but text files do not have headers. Therefore, Unicode has defined a character (0xFEFF) and a noncharacter (0xFFFE) as byte-order marks. They are mirror byte-images of each other.

Since the sequence 0xFEFF is exceedingly rare at the outset of regular non-Unicode text files, it can serve as an implicit marker or signature to identify the file as a Unicode file. Applications that read both Unicode and non-Unicode text files should use the presence of this sequence as an indicator that the file is most likely a Unicode file. (Compare this technique to using the MS-DOS EOF marker to terminate text files.)

When an application finds 0xFEFF at the beginning of a text file, it typically processes the file as though it were a Unicode file, although it may also perform further heuristic checks to verify that this is true. Such a check could be as simple as testing whether the variation in the low-order bytes is much higher than the variation in the high-order bytes. For example, if ASCII text is converted to Unicode text, every second byte is zero. Also, checking both for the linefeed and carriage-return characters (0x000A and 0x000D) and for even or odd file size can provide a strong indicator of the nature of the file.

When an application finds 0xFFFE at the beginning of a text file, it interprets it to mean the file is a byte-reversed Unicode file. The application can either swap the order of the bytes or alert the user that an error has occurred.

The Unicode byte-order mark character is not found in any code page, so it disappears if data is converted to ANSI. Unlike other Unicode characters, it is not replaced by a default character when it is converted. If a byte-order mark is found in the middle of a file, it is not interpreted as a Unicode character and has no effect on text output.

The Unicode value 0xFFFF is illegal in plain text files and cannot be passed between Win32 functions. The value 0xFFFF is reserved for an application's private use.
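
The byte-order mark check described in the quote can be sketched as below. It looks at raw bytes rather than 16-bit values, so the result does not depend on the machine running it. The enum ByteOrder and the function detectBom are names invented for this example, not part of any Win32 API.

```cpp
#include <cstddef>
#include <cstdint>

enum class ByteOrder { Unknown, LittleEndian, BigEndian };

// Inspect the first two bytes of a buffer for the byte-order mark U+FEFF.
// FF FE means the mark was written low byte first (Intel order);
// FE FF means it was written high byte first (Motorola order), which an
// Intel reader would see as the "byte-reversed" value 0xFFFE.
ByteOrder detectBom(const std::uint8_t* data, std::size_t size) {
    if (size < 2) {
        return ByteOrder::Unknown;
    }
    if (data[0] == 0xFF && data[1] == 0xFE) {
        return ByteOrder::LittleEndian;
    }
    if (data[0] == 0xFE && data[1] == 0xFF) {
        return ByteOrder::BigEndian;
    }
    return ByteOrder::Unknown;  // no mark present; nothing can be concluded
}
```

An Unknown result only says that no mark was found. The text may still be 16-bit Unicode without one.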

Avatar of sneeuw

ASKER

Good info but the text I read does not come from a file.
It is read out of structures describing files and their locations.  The texts are not preceded by any special characters that indicate Unicode / byte order.
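
Without a byte-order mark, one option is the simple heuristic mentioned in the MSDN text above: when ASCII-range text is stored as 16-bit Unicode, every second byte is zero. The sketch below counts zero bytes at even and odd offsets to guess both the width and the byte order. The names Guess and guessEncoding and the "at least half of the pairs" threshold are assumptions chosen for illustration. The approach only works for mostly Latin text, and it will not recognise Chinese or similar text, where neither byte of a pair is usually zero.

```cpp
#include <cstddef>
#include <cstdint>

enum class Guess { SingleByte, Utf16LittleEndian, Utf16BigEndian };

// Guess the encoding of a BOM-less buffer by counting zero bytes.
// The threshold (half of all byte pairs) is an arbitrary assumption.
Guess guessEncoding(const std::uint8_t* data, std::size_t size) {
    // An empty or odd-sized buffer cannot consist of whole 16-bit units.
    if (size < 2 || size % 2 != 0) {
        return Guess::SingleByte;
    }
    const std::size_t pairs = size / 2;
    std::size_t zeroEven = 0;  // zero bytes at offsets 0, 2, 4, ...
    std::size_t zeroOdd = 0;   // zero bytes at offsets 1, 3, 5, ...
    for (std::size_t i = 0; i < size; i += 2) {
        if (data[i] == 0) ++zeroEven;
        if (data[i + 1] == 0) ++zeroOdd;
    }
    // Low byte first: the zero high bytes sit at odd offsets.
    if (zeroOdd * 2 >= pairs && zeroOdd > zeroEven) {
        return Guess::Utf16LittleEndian;
    }
    // High byte first: the zero high bytes sit at even offsets.
    if (zeroEven * 2 >= pairs && zeroEven > zeroOdd) {
        return Guess::Utf16BigEndian;
    }
    return Guess::SingleByte;
}
```

For the bytes 00 48 00 69 ("Hi" stored high byte first) this returns Utf16BigEndian, the "byte-swapped" case discussed in this thread.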
Where do the structures come from?  Knowing that might give the information you need to determine what format they are in.
Avatar of sneeuw

ASKER

Joliet File system structures on CD
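
For Joliet specifically, the format itself should answer the question, so guessing may not be needed. As far as I know, a Joliet disc carries a Supplementary Volume Descriptor whose escape-sequence field (offset 88) holds "%/@", "%/C" or "%/E", and names read through that descriptor are 16-bit characters stored high byte first, which would explain the "byte-swapped" data. The sketch below checks a raw 2048-byte descriptor for that signature. The function name isJolietDescriptor is hypothetical, the field offsets come from memory of the ISO 9660 / Joliet layout rather than from this thread, and they should be checked against the specification before relying on them.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Return true if a raw 2048-byte volume descriptor looks like a Joliet
// Supplementary Volume Descriptor. Offsets are assumptions based on the
// ISO 9660 layout: type at 0, "CD001" at 1..5, escape sequences at 88.
bool isJolietDescriptor(const std::uint8_t* sector, std::size_t size) {
    if (size < 2048) {
        return false;  // not a full descriptor sector
    }
    if (sector[0] != 2) {
        return false;  // type 2 = supplementary volume descriptor
    }
    if (std::memcmp(sector + 1, "CD001", 5) != 0) {
        return false;  // missing ISO 9660 standard identifier
    }
    const std::uint8_t* esc = sector + 88;
    if (esc[0] != 0x25 || esc[1] != 0x2F) {
        return false;  // escape sequence does not start with "%/"
    }
    // '@', 'C' and 'E' select UCS-2 levels 1, 2 and 3.
    return esc[2] == 0x40 || esc[2] == 0x43 || esc[2] == 0x45;
}
```

If this returns true for the descriptor the directory records were reached through, the names can be treated as 2 bytes per character in high-byte-first order; names reached through the primary descriptor would be single-byte instead.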
Avatar of sneeuw

ASKER

I still need to do the byte swap test and then assign the points when it works.

In the meantime I did some more tests and reading, so I posted another question.  Please feel free to participate!

https://www.experts-exchange.com/jsp/qManageQuestion.jsp?ta=cplusprog&qid=20137274

Question : Multi Byte MBCS vs. Wide Char
Avatar of sneeuw

ASKER

Nope IsTextUnicode() returns false ALL the time !!
I don't get it ?
I did byte-swap this time and I know that helps because if I take the same input, byte-swap and next convert to Multi-Byte I end up with a correct string
sneeuw,

These questions are still open and our records show you logged in recently. Please resolve them appropriately as soon as possible. Continued disregard of your open questions will result in the force/acceptance of a comment as an answer; other actions affecting your account may also be taken. I will revisit these questions in approximately seven (7) days. Please note that the recommended minimum for an "Easy" question is 50 points.
https://www.experts-exchange.com/jsp/qShow.jsp?ta=winprog&qid=20183446
https://www.experts-exchange.com/jsp/qShow.jsp?ta=winprog&qid=20158806
https://www.experts-exchange.com/jsp/qShow.jsp?ta=cplusprog&qid=20192985
https://www.experts-exchange.com/jsp/qShow.jsp?ta=cplusprog&qid=20151309
https://www.experts-exchange.com/jsp/qShow.jsp?ta=cplusprog&qid=20137274
https://www.experts-exchange.com/jsp/qShow.jsp?ta=cplusprog&qid=20136466
https://www.experts-exchange.com/jsp/qShow.jsp?ta=delphi&qid=20088277
https://www.experts-exchange.com/jsp/qShow.jsp?ta=javascript&qid=20183228

EXPERTS: Please leave your thoughts on this question here.

Thanks,

Netminder
Community Support Moderator
Experts Exchange
I suggest splitting the points between djbusychild and jkr for their help.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!
======
Werner
Force/accepted by

Netminder
Community Support Moderator
Experts Exchange

jkr: points for you at https://www.experts-exchange.com/jsp/qShow.jsp?ta=cplusprog&qid=20270942