• Status: Solved

Identify Unicode format

I would like to know the encoding used for a given string in C. That is, I have a character buffer and would like to know whether the data is in UTF-8 or UTF-16 format, so that I can process it accordingly.

Is there any standard function available for this? If not, how can I do it?

Thanks in Advance!
Deepak Kumar

stefan73 commented:
Hi deepakg76,
> UTF-8 format or UTF-16
You need to check the BOM (byte order mark). If present, it sits at the very beginning of a Unicode text file; note that it is optional, so its absence does not tell you the encoding.

More details:
http://www.unicode.org/unicode/faq/utf_bom.html#BOM
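
As a rough illustration of the BOM check, here is a minimal sketch in C. The names `bom_kind` and `detect_bom` are made up for this example and are not part of any standard library. It only recognises the UTF-8 and UTF-16 marks, and a result of `BOM_NONE` says nothing about the actual encoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch; not a standard API. */
typedef enum {
    BOM_NONE,      /* no recognised BOM; the encoding is still unknown */
    BOM_UTF8,      /* EF BB BF */
    BOM_UTF16_LE,  /* FF FE */
    BOM_UTF16_BE   /* FE FF */
} bom_kind;

/* Inspect the first bytes of buf for a byte order mark. */
bom_kind detect_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return BOM_UTF8;
    /* Caveat: FF FE 00 00 is also the UTF-32LE BOM; this sketch
       assumes the data is either UTF-8 or UTF-16. */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return BOM_UTF16_LE;
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return BOM_UTF16_BE;
    return BOM_NONE;
}
```

Calling `detect_bom(buf, len)` on the first bytes read from the file would then tell you which decoder to use, or that you need some other source of information.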


Cheers,
Stefan
 
deepakg76 (author) commented:
Thanks for the reply...

That is helpful only if I want to read the content from a file. What if I receive a char buffer from another application or DLL? I want to determine from the string data itself whether the buffer holds UTF-8 or UTF-16 data, so that I can process it accordingly.

Deepak
 
mjzalewski commented:
It's not possible to do this directly. You have to know what encoding format is being used by the application which sends the character buffer.

Encoding marks such as the BOM are specifically not recommended when the text data is already typed. For example, there would be no BOM stored in a database: the type of the column and the database environment would determine whether the data was UTF-8 or UTF-16.

You could use a heuristic. An odd byte length cannot be UTF-16, so given the two choices it must be UTF-8. Embedded 0x00 bytes, especially in even positions, would almost certainly mean UTF-16. But there are buffers, especially short ones, which have equally valid UTF-8 and UTF-16 interpretations.
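
The heuristic above could be sketched roughly as follows. This is only an illustration under the assumption that the buffer is either UTF-8 or UTF-16 with a known byte length. The names `enc_guess`, `is_valid_utf8` and `guess_encoding` are invented for the example, and the UTF-8 well-formedness check is an extra step beyond the two rules mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch; not a standard API. */
typedef enum { GUESS_UTF8, GUESS_UTF16, GUESS_AMBIGUOUS, GUESS_UNKNOWN } enc_guess;

/* Returns 1 if s[0..len) is well-formed UTF-8, otherwise 0. */
static int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t n;                              /* continuation bytes expected */
        if (c < 0x80) n = 0;
        else if (c >= 0xC2 && c <= 0xDF) n = 1;
        else if (c >= 0xE0 && c <= 0xEF) n = 2;
        else if (c >= 0xF0 && c <= 0xF4) n = 3;
        else return 0;                         /* invalid lead byte */
        if (n > len - i - 1) return 0;         /* truncated sequence */
        for (size_t k = 1; k <= n; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 0;
        /* Reject overlong forms, surrogates, and values above U+10FFFF. */
        if (c == 0xE0 && s[i + 1] < 0xA0) return 0;
        if (c == 0xED && s[i + 1] > 0x9F) return 0;
        if (c == 0xF0 && s[i + 1] < 0x90) return 0;
        if (c == 0xF4 && s[i + 1] > 0x8F) return 0;
        i += n + 1;
    }
    return 1;
}

/* Guess the encoding of a buffer with no BOM. A guess, not a proof. */
enc_guess guess_encoding(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    if (len == 0)
        return GUESS_UNKNOWN;
    if (len % 2 != 0)                          /* UTF-16 always has an even byte count */
        return is_valid_utf8(buf, len) ? GUESS_UTF8 : GUESS_UNKNOWN;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == 0x00)
            return GUESS_UTF16;                /* embedded zero byte: likely UTF-16 */
    /* Even length, no zero bytes: well-formed UTF-8 could still be UTF-16. */
    return is_valid_utf8(buf, len) ? GUESS_AMBIGUOUS : GUESS_UTF16;
}
```

For instance, the four bytes 41 30 42 30 are the ASCII text "A0B0" and also two valid UTF-16LE code units, so the sketch reports them as ambiguous rather than picking one.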