What encoding?

How can I find out what encoding an XML file is
saved in, i.e. whether it is saved in ANSI, UTF-8 or
Unicode?
Asked by: slok
1 Solution
 
edmund_mitchell Commented:
Hello slok-

OK-

1) You can read the XML document itself:

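An XML document begins with markup called a prolog. A prolog can contain an XML declaration, optionally followed by comments or processing instructions, then a Document Type Declaration, then more comments or processing instructions. If the XML declaration is present, it must be the very first thing in the document; whitespace may appear after any of these components of the prolog. (Text declarations are the equivalent for external parsed entities.)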

A document entity's prolog begins with an XML declaration and takes the form:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

The version is required, but the encoding and standalone declarations are optional. Reading the XML therefore might not tell you the encoding, but you could get lucky. Encoding declarations are recommended so that XML parsers can be sure they are decoding the document correctly.
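As a rough illustration of option 1, here is a minimal Python sketch that looks for the encoding declaration in the first bytes of a file. The function name `declared_encoding` and the regular expression are my own assumptions, not anything from the spec, and the approach only works when the file is in an ASCII-compatible encoding (it will not find the declaration in a UTF-16 file).

```python
import re

# Matches the encoding pseudo-attribute inside an XML declaration.
# The pattern is a simplification written for this sketch; it works on raw
# bytes, so it only suits ASCII-compatible encodings.
_DECL_RE = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._\-]*)["\']'
)


def declared_encoding(head: bytes):
    """Return the encoding named in the XML declaration, or None if absent."""
    # Skip a UTF-8 byte-order mark (EF BB BF) if the file starts with one.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    match = _DECL_RE.match(head)
    return match.group(1).decode("ascii") if match else None
```

To try it on a real file, open the file in binary mode, read the first couple of hundred bytes, and pass them in as `head`.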

2)  Check the byte-order mark

The spec dictates that if UTF-16 encoding is used, a byte-order mark must be present at the beginning of the document. If no hints to a document's encoding are available, UTF-8 is assumed, and it is an error if the document is not actually encoded in UTF-8.
Entities encoded in UTF-16 must begin with the Byte Order Mark described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
This is what your average parser does, in addition to looking for other clues.
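A byte-order mark check can be sketched in a few lines of Python. The function name `bom_encoding` is made up for this example. The UTF-8 check is an addition of mine: UTF-8 files sometimes start with the same #xFEFF character encoded as the bytes EF BB BF. UTF-32 is deliberately ignored here, so a UTF-32 little-endian file would be misreported as UTF-16.

```python
import codecs


def bom_encoding(head: bytes):
    """Guess the encoding from a leading byte-order mark, or return None."""
    if head.startswith(codecs.BOM_UTF8):      # EF BB BF
        return "UTF-8"
    if head.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "UTF-16BE"
    if head.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "UTF-16LE"
    return None  # no byte-order mark found
```

Only the first two or three bytes of the file are needed, so it is enough to read a small chunk in binary mode.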

3) Hope they follow the rules, and hope your XML parser enforces the rules:

 In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 must begin with a text declaration (see 4.3.1 The Text Declaration, at:
http://www.w3.org/TR/2000/REC-xml-20001006#sec-TextDecl) containing an encoding declaration.
If you want to know the pattern required for the encoding declaration, it is described in the spec and guaranteed to work better than sleeping pills:
http://www.w3.org/TR/2000/REC-xml-20001006#charencoding
Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.

To boil all this down:
It's probably best to check for a byte-order mark first to see whether the file is UTF-16. If there is no byte-order mark, check the encoding declaration. If there is no encoding declaration, it's UTF-8.
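Put together, that procedure might look something like the Python sketch below. The name `guess_xml_encoding` and the regular expression are assumptions made for illustration, and it skips the extra clues a real parser looks for (UTF-16 without a byte-order mark, UTF-32, EBCDIC, external MIME headers).

```python
import codecs
import re

# Simplified pattern for the encoding declaration; an assumption of this
# sketch, usable only on ASCII-compatible byte streams.
_DECL_RE = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._\-]*)["\']'
)


def guess_xml_encoding(head: bytes) -> str:
    """Apply the three checks in order: BOM, declaration, UTF-8 default."""
    # Step 1: a UTF-16 byte-order mark settles the question.
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16"
    # A UTF-8 byte-order mark may precede the declaration, so step over it.
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    # Step 2: no UTF-16 byte-order mark, so look for an encoding declaration.
    match = _DECL_RE.match(head)
    if match:
        return match.group(1).decode("ascii")
    # Step 3: no declaration either, so the spec says to assume UTF-8.
    return "UTF-8"
```

Passing in the first few hundred bytes of the file, read in binary mode, is enough, since both the byte-order mark and the declaration sit at the very start.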

I hope that helps (or at least helps you go to sleep right away :) )

Edmund


slok (Author) Commented:
I'm going through the articles now.

Give me a buzz if I don't reply/close this question
by the end of the week.