Solved

How to detect Unicode and wide-char text?

Posted on 2001-06-15
Last Modified: 2008-03-17
Hi,

Suppose I read some text from somewhere, but I don't know whether it uses 1 or 2 bytes per character.
How can I detect which it is?
Question by:sneeuw
 
LVL 5

Accepted Solution

by:
djbusychild earned 75 total points
ID: 6196908
This is kind of flaky... I don't believe there's a solve-it-all if the encoding is unknown.

You could try to guess...

On Win32 you can guess using IsTextUnicode():

http://msdn.microsoft.com/library/psdk/winbase/unicode_81np.htm
 
LVL 86

Expert Comment

by:jkr
ID: 6196912
Copy the text into a buffer and use the Win32 API 'IsTextUnicode()' (from the docs):

DWORD IsTextUnicode(
    CONST LPVOID lpBuffer,  // pointer to an input buffer to be examined
    int cb,                 // the size in bytes of the input buffer
    LPINT lpi               // pointer to flags that condition text examination and receive results
);

The IsTextUnicode function determines whether a buffer probably contains a form of Unicode text. The function uses various statistical and deterministic methods to make its determination, under the control of flags passed via lpi. When the function returns, the results of such tests are reported via lpi.

If all specified tests are passed, the function returns TRUE; otherwise, it returns FALSE.

If you don't want to load the whole file, use a reasonable number of bytes, which must be divisible by 2.

Feel free to ask if you need more information!
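
To make the "statistical methods" mentioned above a bit more concrete, here is a minimal portable C++ sketch of one such test. For mostly-ASCII text stored as 16-bit units, one byte of every pair is zero, so counting zero bytes at even and odd offsets hints at both the character width and the byte order. The names `Utf16Guess` and `guess_utf16` and the 3/4 threshold are assumptions made for illustration. This is not how IsTextUnicode() is implemented, and it will not recognise text made up mostly of non-Latin characters (Chinese, for instance).

```cpp
#include <cstddef>
#include <cstdint>

// Possible outcomes of the guess (hypothetical type, for illustration only).
enum class Utf16Guess { NotLikely, LittleEndian, BigEndian };

// Guess whether a buffer holds mostly-ASCII text stored as 16-bit units by
// counting zero bytes at even and odd offsets. Illustrative heuristic only;
// it is not the algorithm used by IsTextUnicode().
Utf16Guess guess_utf16(const std::uint8_t* data, std::size_t size) {
    if (data == nullptr || size < 2 || size % 2 != 0) {
        return Utf16Guess::NotLikely;  // 16-bit text needs an even byte count
    }
    std::size_t zero_even = 0;
    std::size_t zero_odd = 0;
    for (std::size_t i = 0; i + 1 < size; i += 2) {
        if (data[i] == 0) ++zero_even;
        if (data[i + 1] == 0) ++zero_odd;
    }
    const std::size_t units = size / 2;
    // Arbitrary threshold: at least 3/4 of the units have a zero byte on one
    // side and fewer than 1/4 have a zero byte on the other side.
    if (zero_odd * 4 >= units * 3 && zero_even * 4 < units) {
        return Utf16Guess::LittleEndian;  // e.g. 'A' stored as 41 00
    }
    if (zero_even * 4 >= units * 3 && zero_odd * 4 < units) {
        return Utf16Guess::BigEndian;     // e.g. 'A' stored as 00 41
    }
    return Utf16Guess::NotLikely;
}
```

For a byte-swapped (high byte first) buffer this returns `Utf16Guess::BigEndian`, which would be the cue to swap the bytes before handing the data to any Win32 function.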
 
LVL 86

Expert Comment

by:jkr
ID: 6196915
Ooops, sorry djbusychild, I was typing when you posted...
 

Author Comment

by:sneeuw
ID: 6197869
OK, I'll try it!
What do you think of this function, then:

WideCharToMultiByte()

After doing a test, it seems that this function returns a question mark '?' for each character that is not a known wide character?
(So could I use this as a test case too?)

What exactly is a MultiByte, then? Is it always 1-byte chars (it seemed to be in the test)?

See, I'm struggling here... ;-)
I read text data from a CD but don't know for sure whether it is 1-byte or 2-byte char text data.
At the moment I test for wide char (this test is flaky, hence this posting in the first place); next, if I think it is wide char, I convert using
WideCharToMultiByte()
which returns 1-byte characters.
This seems to work out OK for me, but I'm worried: what happens when the text I read out is Chinese or ...
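
On the '?' question: a '?' is what a narrowing conversion typically substitutes for a character that has no equivalent in the target code page, so by itself it does not prove that the input was not wide text. A misread byte order has the same effect: 'A' stored high byte first (00 41) and read as low byte first becomes 0x4100, which has no ANSI equivalent. Also, "multibyte" does not mean one byte per character; in double-byte code pages (Chinese or Japanese, for example) a character can take one or two bytes. The sketch below is a portable stand-in, not WideCharToMultiByte() itself: `narrow_to_ascii` is a hypothetical helper that keeps 7-bit ASCII and replaces everything else with '?'.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Narrow 16-bit code units to 7-bit ASCII. Any unit outside the ASCII range
// is replaced by '?', mimicking the "default character" substitution a real
// conversion to a code page performs. Hypothetical helper for illustration.
std::string narrow_to_ascii(const std::vector<std::uint16_t>& units) {
    std::string out;
    out.reserve(units.size());
    for (std::uint16_t u : units) {
        out.push_back(u < 0x80 ? static_cast<char>(u) : '?');
    }
    return out;
}
```

Passing the units 0x0041 0x0042 gives "AB", while the same text read with the wrong byte order (0x4100 0x4200) gives "??", just as genuinely Chinese text would. A string of question marks therefore cannot tell those two cases apart.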

 

Author Comment

by:sneeuw
ID: 6198354
Hi,

Unfortunately, it didn't work!

It returned false all the time on data I KNOW is Unicode (or is at least 2 bytes per char)!

The data is byte-swapped, but this function is supposed to be able to deal with that!
 
LVL 86

Expert Comment

by:jkr
ID: 6198507
>>The data is byte-swapped

What do you mean by 'byte-swapped'? The data has to have the right byte ordering, otherwise the function WILL fail.

You could do a simple test using string literals, e.g.

const wchar_t* pUnicode = L"this is a test string";
const char* pAnsi = "this is a test string";
 
LVL 86

Expert Comment

by:jkr
ID: 6198519
BTW: Have you tried 'IS_TEXT_UNICODE_REVERSE_ASCII16' or 'IS_TEXT_UNICODE_REVERSE_STATISTICS'?
 

Author Comment

by:sneeuw
ID: 6198530
> BTW: Have you tried 'IS_TEXT_UNICODE_REVERSE_ASCII16' or 'IS_TEXT_UNICODE_REVERSE_STATISTICS'?

Yes!
This is why I assumed byte-swapped data can be handled.

But I should byte-swap the data myself and try again!
Maybe this routine doesn't like byte-swapping after all.

The data is byte-swapped because (I guess) it was originally intended for Motorola machines.
 
LVL 86

Expert Comment

by:jkr
ID: 6198536
>>The data is byte-swapped because (I guess) originally
>>intended for Motorola machines.

Well, in this case, it should work when the data is reversed to an Intel byte order...
 
LVL 5

Expert Comment

by:djbusychild
ID: 6198582
no prob, jkr. =)

sneeuw, you're right. The function's not going to detect the endian difference. ;)
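
Reversing the data to Intel byte order before testing it, as suggested above, takes only a few lines. A minimal sketch, assuming the text sits in a plain byte buffer; `swap_utf16_bytes` is a hypothetical name:

```cpp
#include <cstddef>
#include <cstdint>

// Swap the two bytes of every 16-bit unit in place, turning high-byte-first
// data into low-byte-first data (or the other way round). A trailing odd
// byte, if any, is left untouched.
void swap_utf16_bytes(std::uint8_t* data, std::size_t size) {
    for (std::size_t i = 0; i + 1 < size; i += 2) {
        const std::uint8_t tmp = data[i];
        data[i] = data[i + 1];
        data[i + 1] = tmp;
    }
}
```

Working on bytes rather than on wchar_t values keeps the sketch independent of the size of wchar_t and of the alignment of the source buffer.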
 
LVL 3

Expert Comment

by:cwrea
ID: 6200269
Some info I'd like to add to this thread:  Regarding endian differences, one of the things available in Unicode is the concept of a byte order mark; you shouldn't generally have to guess since well-behaved applications that write Unicode text output *should* first output a byte order mark so that readers of the data can know how to interpret each pair of bytes.

Here's some info drawn from MSDN at http://www.msdn.microsoft.com/library/default.asp?URL=/library/psdk/winbase/unicode_42jv.htm


----- 8< quoted from MSDN 8< -----

Byte-order Mark
Always prefix a Unicode plain text file with a byte-order mark. Because Unicode plain text is a sequence of 16-bit code values, it is sensitive to the byte ordering used when the text was written.

A byte-order mark is not a control character that selects the byte order of the text; it simply informs an application receiving the file that the file is byte ordered.

Ideally, all Unicode text would follow only one set of byte-ordering rules. This is not possible, however, because microprocessors differ in the placement of the least significant byte: Intel® and MIPS® processors position the least significant byte first, whereas Motorola processors (and all byte-reversed Unicode files) position it last. With only a single set of byte-ordering rules, users of one type of microprocessor would be forced to swap the byte order every time a plain text file is read from or written to, even if the file is never transferred to another system based on a different microprocessor.

The preferred place to specify byte order is in a file header, but text files do not have headers. Therefore, Unicode has defined a character (0xFEFF) and a noncharacter (0xFFFE) as byte-order marks. They are mirror byte-images of each other.

Since the sequence 0xFEFF is exceedingly rare at the outset of regular non-Unicode text files, it can serve as an implicit marker or signature to identify the file as a Unicode file. Applications that read both Unicode and non-Unicode text files should use the presence of this sequence as an indicator that the file is most likely a Unicode file. (Compare this technique to using the MS-DOS EOF marker to terminate text files.)

When an application finds 0xFEFF at the beginning of a text file, it typically processes the file as though it were a Unicode file, although it may also perform further heuristic checks to verify that this is true. Such a check could be as simple as testing whether the variation in the low-order bytes is much higher than the variation in the high-order bytes. For example, if ASCII text is converted to Unicode text, every second byte is zero. Also, checking both for the linefeed and carriage-return characters (0x000A and 0x000D) and for even or odd file size can provide a strong indicator of the nature of the file.

When an application finds 0xFFFE at the beginning of a text file, it interprets it to mean the file is a byte-reversed Unicode file. The application can either swap the order of the bytes or alert the user that an error has occurred.

The Unicode byte-order mark character is not found in any code page, so it disappears if data is converted to ANSI. Unlike other Unicode characters, it is not replaced by a default character when it is converted. If a byte-order mark is found in the middle of a file, it is not interpreted as a Unicode character and has no effect on text output.

The Unicode value 0xFFFF is illegal in plain text files and cannot be passed between Win32 functions. The value 0xFFFF is reserved for an application's private use.
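
The byte-order-mark check described in the MSDN text can be sketched in a few lines of portable C++. Looking at the raw bytes rather than at 16-bit values keeps the check independent of the host CPU: the mark 0xFEFF appears as FF FE in low-byte-first (Intel-order) data and as FE FF in high-byte-first data, which is what an Intel reader sees as 0xFFFE. `Bom` and `detect_utf16_bom` are hypothetical names, and the sketch looks only for the 16-bit mark, not for other encodings.

```cpp
#include <cstddef>
#include <cstdint>

// Result of the byte-order-mark check (hypothetical type, for illustration).
enum class Bom { None, Utf16LittleEndian, Utf16BigEndian };

// Inspect the first two bytes of a buffer for the byte-order mark 0xFEFF.
Bom detect_utf16_bom(const std::uint8_t* data, std::size_t size) {
    if (data == nullptr || size < 2) {
        return Bom::None;
    }
    if (data[0] == 0xFF && data[1] == 0xFE) {
        return Bom::Utf16LittleEndian;  // 0xFEFF stored low byte first
    }
    if (data[0] == 0xFE && data[1] == 0xFF) {
        return Bom::Utf16BigEndian;     // 0xFEFF stored high byte first
    }
    return Bom::None;
}
```

A result of `Bom::None` only means no mark was found; the data may still be 16-bit text, in which case a heuristic or knowledge of the data's origin has to decide.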

 

Author Comment

by:sneeuw
ID: 6201876
Good info, but the text I read does not come from a file.
It is read out of structures describing files and their locations. The texts are not preceded by any special characters that indicate Unicode or byte order.
 
LVL 3

Expert Comment

by:cwrea
ID: 6202386
Where do the structures come from?  Knowing that might give the information you need to determine what format they are in.
 

Author Comment

by:sneeuw
ID: 6202426
Joliet File system structures on CD
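
If the names come from Joliet structures, the encoding may not need guessing at all. To my understanding of the Joliet specification, names are recorded as 16-bit UCS-2 with the high byte first (which would explain why the data looks byte-swapped on an Intel machine), and a Joliet supplementary volume descriptor is marked with one of the escape sequences `%/@`, `%/C` or `%/E`. The sketch below checks a 2048-byte descriptor for that marker. The function name is hypothetical, and the offsets used (type at byte 0, "CD001" at bytes 1 to 5, escape sequences starting at byte 88) are assumptions that should be verified against the ISO 9660 and Joliet documents before relying on them.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Return true if a 2048-byte ISO 9660 volume descriptor looks like a Joliet
// supplementary volume descriptor. Offsets are assumptions taken from a
// reading of the specifications; verify them before use.
bool is_joliet_descriptor(const std::uint8_t* sector, std::size_t size) {
    if (sector == nullptr || size < 2048) {
        return false;
    }
    if (sector[0] != 2) {
        return false;  // descriptor type 2 = supplementary volume descriptor
    }
    if (std::memcmp(sector + 1, "CD001", 5) != 0) {
        return false;  // standard identifier
    }
    const std::uint8_t* esc = sector + 88;  // escape sequences field
    if (esc[0] != 0x25 || esc[1] != 0x2F) {
        return false;  // must start with "%/"
    }
    // '@', 'C' and 'E' select UCS-2 level 1, 2 and 3 respectively.
    return esc[2] == 0x40 || esc[2] == 0x43 || esc[2] == 0x45;
}
```

If the check succeeds, names read through that descriptor can be treated as 2 bytes per character with the high byte first; names read through the primary volume descriptor remain single-byte.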
 

Author Comment

by:sneeuw
ID: 6202512
I still need to do the byte-swap test, and I'll assign the points when it works.

In the meantime I did some more tests and reading, so I posted another question. Please feel free to participate!

http://www.experts-exchange.com/jsp/qManageQuestion.jsp?ta=cplusprog&qid=20137274

Question: Multi Byte MBCS vs. Wide Char
 

Author Comment

by:sneeuw
ID: 6202594
Nope, IsTextUnicode() returns false ALL the time!
I don't get it.
I did byte-swap this time, and I know that helps, because if I take the same input, byte-swap it, and then convert it to multi-byte, I end up with a correct string.
 

Author Comment

by:sneeuw
ID: 6285495
As I said:

Nope, IsTextUnicode() returns false ALL the time!
I don't get it.
I did byte-swap this time, and I know that helps, because if I take the same input, byte-swap it, and then convert it to multi-byte, I end up with a correct string.
 
LVL 5

Expert Comment

by:Netminder
ID: 6806039
sneeuw,

These questions are still open and our records show you logged in recently. Please resolve them appropriately as soon as possible. Continued disregard of your open questions will result in the force/acceptance of a comment as an answer; other actions affecting your account may also be taken. I will revisit these questions in approximately seven (7) days. Please note that the recommended minimum for an "Easy" question is 50 points.
http://experts-exchange.com/jsp/qShow.jsp?ta=winprog&qid=20183446
http://experts-exchange.com/jsp/qShow.jsp?ta=winprog&qid=20158806
http://experts-exchange.com/jsp/qShow.jsp?ta=cplusprog&qid=20192985
http://experts-exchange.com/jsp/qShow.jsp?ta=cplusprog&qid=20151309
http://experts-exchange.com/jsp/qShow.jsp?ta=cplusprog&qid=20137274
http://experts-exchange.com/jsp/qShow.jsp?ta=cplusprog&qid=20136466
http://experts-exchange.com/jsp/qShow.jsp?ta=delphi&qid=20088277
http://experts-exchange.com/jsp/qShow.jsp?ta=javascript&qid=20183228

EXPERTS: Please leave your thoughts on this question here.

Thanks,

Netminder
Community Support Moderator
Experts Exchange
 
LVL 11

Expert Comment

by:griessh
ID: 6819185
I suggest splitting the points between djbusychild and jkr for their help.

PLEASE DO NOT ACCEPT THIS COMMENT AS AN ANSWER!
======
Werner
 
LVL 5

Expert Comment

by:Netminder
ID: 6826308
Force/accepted by

Netminder
Community Support Moderator
Experts Exchange

jkr: points for you at http://www.experts-exchange.com/jsp/qShow.jsp?ta=cplusprog&qid=20270942