Solved

what encoding?

Posted on 2001-09-07
Last Modified: 2008-03-04
How can I find out what encoding an XML file is
saved in, i.e. whether it is saved in ANSI, UTF-8 or
Unicode?
Question by:slok

Accepted Solution

by:
edmund_mitchell earned 50 total points
ID: 6466041
Hello slok-

OK-

1) You can read the XML document itself:

An XML document begins with a prolog. The prolog consists of an optional XML declaration, optionally followed by a Document Type Declaration; comments, processing instructions and whitespace may appear after either of these. (A text declaration is the equivalent construct for external parsed entities.)

When present, the XML declaration must be the very first thing in the document entity and takes the form:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

The version is required, but the encoding and standalone declarations are optional. Reading the XML therefore might not tell you the encoding, but you could get lucky. Encoding declarations are recommended so that XML parsers can be sure they are decoding the document correctly.
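If you want to do the "read the declaration" step programmatically, here is a minimal Python sketch. The function name declared_encoding is my own, not part of any library. It assumes you have already read the first bytes of the file in binary mode and that the file is in an ASCII-compatible encoding; a UTF-16 file will not match and simply returns None.

```python
import re

# Pattern for the encoding pseudo-attribute; the name syntax follows the
# EncName production in the XML 1.0 spec.
_ENCODING = re.compile(rb'\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def declared_encoding(head: bytes):
    """Return the encoding named in a leading XML declaration, or None."""
    # The declaration must be the very first thing in the file.
    if re.match(rb"<\?xml\s", head) is None:
        return None
    end = head.find(b"?>")
    if end == -1:
        return None  # declaration not terminated within the bytes we were given
    match = _ENCODING.search(head[:end])
    return match.group(1).decode("ascii") if match else None
```

Passing in the first kilobyte or so of the file (opened with mode "rb") should be plenty, since the declaration has to come first.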

2)  Check the byte-order mark

The spec dictates that if UTF-16 encoding is used, a byte-order mark must be present at the beginning of the document. If no hints to a document's encoding are available, it is assumed that UTF-8 encoding is in effect, and it would be an error if the document were not actually encoded with UTF-8. In the spec's own words:

"Entities encoded in UTF-16 must begin with the Byte Order Mark described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents."

This is what your average parser does, in addition to looking for other clues.
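Checking for the byte-order mark yourself is only a few lines. The sketch below uses the BOM constants from Python's standard codecs module. The function name bom_encoding is just an illustration, and it deliberately looks only at UTF-8 and UTF-16 marks (UTF-32 is out of scope here, as it is in the discussion above).

```python
import codecs


def bom_encoding(head: bytes):
    """Return the encoding implied by a leading byte-order mark, or None."""
    if head.startswith(codecs.BOM_UTF8):        # bytes EF BB BF
        return "UTF-8"
    if head.startswith(codecs.BOM_UTF16_BE):    # bytes FE FF
        return "UTF-16BE"
    if head.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE
        return "UTF-16LE"
    return None  # no mark: fall back to the encoding declaration
```

A UTF-8 file may or may not carry a mark, so a None result here does not rule UTF-8 out.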

3) Hope they follow the rules, and hope your XML parser enforces the rules:

In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 must begin with a text declaration containing an encoding declaration (see 4.3.1 The Text Declaration: http://www.w3.org/TR/2000/REC-xml-20001006#sec-TextDecl).
If you want to know the pattern required for the encoding declaration, it is described in the spec and guaranteed to work better than sleeping pills:
http://www.w3.org/TR/2000/REC-xml-20001006#charencoding
Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.

To boil all this down:
It's probably best to check for a byte-order mark first and see whether the file is UTF-16. If there is no byte-order mark, check the encoding declaration. If there is no encoding declaration either, the file should be UTF-8 (a conforming document has no other option).
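Put together, the three checks could look something like this in Python. Treat it as a sketch rather than a full detector: guess_xml_encoding is a made-up name, it trusts the file to follow the rules, and it does not verify that the bytes really decode in the encoding it reports.

```python
import codecs
import re

# An encoding declaration inside a leading "<?xml ... ?>"; [^>]*? keeps the
# match from running past the end of the declaration.
_DECL = re.compile(
    rb'^<\?xml\s[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def guess_xml_encoding(head: bytes) -> str:
    """Guess the encoding of an XML file from its first bytes."""
    # Step 1: a UTF-16 byte-order mark settles it.
    if head.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "UTF-16"
    # A UTF-8 mark is optional; skip it so the declaration check still works.
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]
    # Step 2: no UTF-16 mark, so look for an encoding declaration.
    match = _DECL.match(head)
    if match:
        return match.group(1).decode("ascii")
    # Step 3: no mark and no declaration, so the file should be UTF-8.
    return "UTF-8"
```

To use it, open the file in binary mode, read the first kilobyte or so, and pass those bytes in.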

I hope that helps (or at least helps you go to sleep right away :) )

Edmund



Author Comment

by:slok
ID: 6470910
I'm going through the articles now.

Give me a buzz if I don't reply/close this question
by the end of the week.