Solved

what encoding?

Posted on 2001-09-07
Last Modified: 2008-03-04
How can I find out what encoding an XML file is
saved in, i.e. whether it is saved in ANSI, UTF-8 or
Unicode?
Question by:slok

Accepted Solution

by:
edmund_mitchell earned 50 total points
ID: 6466041
Hello slok-

OK-

1) You can read the XML document itself:

An XML document begins with markup called a prolog. The prolog of the document entity may start with an XML declaration (external parsed entities use a text declaration instead), optionally followed by a Document Type Declaration, optionally followed by comments or processing instructions. Whitespace may appear after any of these components of the prolog.

A document entity's prolog begins with an XML declaration and takes the form:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

The version is required, but the encoding and standalone declarations are optional. Reading the XML therefore might not tell you the encoding, but you could get lucky. Encoding declarations are recommended so that XML parsers can be sure they are decoding the document correctly.
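
If you want to do step 1 by hand, here is a minimal Python sketch. The function name declared_encoding and the regular expression are my own, not from the spec, and it assumes the first bytes of the file are in an ASCII-compatible encoding (so it will not see a declaration in a UTF-16 file). It returns None when there is no declaration or no encoding in it.

```python
import re
from typing import Optional

# Hypothetical helper: matches the encoding pseudo-attribute, e.g. encoding="UTF-8".
_ENCODING_RE = re.compile(rb"""encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""")

def declared_encoding(head: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, or None."""
    # Skip a UTF-8 byte-order mark if one precedes the declaration.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    # The declaration, if present, must be the very first thing in the file.
    if not head.startswith(b"<?xml"):
        return None
    end = head.find(b"?>")
    if end == -1:
        return None
    # Only search inside the declaration itself, not the rest of the document.
    match = _ENCODING_RE.search(head[:end])
    return match.group(1).decode("ascii") if match else None
```

Reading the first couple of hundred bytes of the file in binary mode and passing them in should be enough, since the declaration has to come first.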

2)  Check the byte-order mark

 The spec dictates that if UTF-16 encoding is used, a byte-order mark must be present at the beginning of the document. If no hints to a document's encoding are available, it is assumed that UTF-8 encoding is in effect, and it would be an error if the document were not actually encoded with UTF-8.
Entities encoded in UTF-16 must begin with the Byte Order Mark described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
This is what your average parser does, in addition to looking for other clues.
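
A rough sketch of the byte-order mark check in Python, using the BOM constants from the standard codecs module. The function name bom_encoding is made up for illustration, and it only looks for the UTF-8 and UTF-16 marks discussed here.

```python
import codecs
from typing import Optional

def bom_encoding(head: bytes) -> Optional[str]:
    """Guess the encoding from a leading byte-order mark, or return None."""
    if head.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "UTF-8"
    if head.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "UTF-16BE"
    # Caveat: a UTF-32 little-endian BOM (FF FE 00 00) also starts with
    # FF FE; this sketch ignores UTF-32 and would report UTF-16LE for it.
    if head.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "UTF-16LE"
    return None
```

A result of None just means there was no mark, so you still have to look at the encoding declaration.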

3) Hope they follow the rules, and hope your XML parser enforces the rules:

 In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 must begin with a text declaration (see 4.3.1 The Text Declaration, at:
http://www.w3.org/TR/2000/REC-xml-20001006#sec-TextDecl) containing an encoding declaration.
If you want to know the pattern required for the encoding declaration, it is described in the spec and guaranteed to work better than sleeping pills:
http://www.w3.org/TR/2000/REC-xml-20001006#charencoding
Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.
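
To see why plain ASCII needs no declaration, here is a small Python illustration (is_pure_ascii is just a name I picked): text that contains only ASCII characters produces exactly the same bytes whether you encode it as ASCII or as UTF-8.

```python
def is_pure_ascii(data: bytes) -> bool:
    """True if every byte is below 0x80, i.e. the data is plain ASCII."""
    return all(byte < 0x80 for byte in data)

sample = "<greeting>hello</greeting>"
# ASCII-only text encodes to identical bytes in ASCII and in UTF-8.
same_bytes = sample.encode("ascii") == sample.encode("utf-8")
```

As soon as a file contains a byte of 0x80 or above, this no longer holds, and the encoding declaration (or a byte-order mark) starts to matter.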

To boil all this down:
It's probably best to check for a byte-order mark first and see whether the file is UTF-16. If there is no byte-order mark, check the encoding declaration. If there is no encoding declaration, it's UTF-8.
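
The order above can be sketched in Python roughly like this. Treat it as an illustration under assumptions, not what any particular parser does: detect_xml_encoding is a made-up name, a UTF-8 byte-order mark is taken at face value, and the declaration is only found when the file is in an ASCII-compatible encoding.

```python
import codecs
import re

# Hypothetical pattern for the encoding pseudo-attribute of the declaration.
_ENCODING_RE = re.compile(rb"""encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""")

def detect_xml_encoding(data: bytes) -> str:
    """Byte-order mark first, then the encoding declaration, then UTF-8."""
    # 1) A UTF-16 byte-order mark (FE FF or FF FE) settles it.
    if data.startswith(codecs.BOM_UTF16_BE) or data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16"
    # Assumption of this sketch: a UTF-8 BOM is trusted as well.
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    # 2) No mark: look for an encoding declaration at the very start.
    if data.startswith(b"<?xml"):
        end = data.find(b"?>")
        if end != -1:
            match = _ENCODING_RE.search(data[:end])
            if match:
                return match.group(1).decode("ascii")
    # 3) No mark and no declaration: the document should be UTF-8.
    return "UTF-8"
```

Open the file in binary mode, read the first few hundred bytes, and pass them to the function. Note that the result is what the file claims to be; a file that breaks the rules can still be in some other encoding.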

I hope that helps (or at least helps you go to sleep right away :) )

Edmund



Author Comment

by:slok
ID: 6470910
I'm going through the articles now.

Give me a buzz if I don't reply/close this question
by the end of the week.