Solved

what encoding?

Posted on 2001-09-07
Last Modified: 2008-03-04
How can I find out what encoding an XML file is
saved in, i.e. whether it is saved in ANSI, UTF-8 or
Unicode?
Question by:slok
2 Comments
 
LVL 4

Accepted Solution

by:
edmund_mitchell earned 50 total points
ID: 6466041
Hello slok-

OK-

1) You can read the XML document itself:

An XML document begins with markup called a prolog. The prolog may contain an XML declaration (external parsed entities use a text declaration instead), optionally followed by a Document Type Declaration, with comments or processing instructions allowed around it. Whitespace may appear after any of these components of the prolog.

A document entity's prolog begins with an XML declaration and takes the form:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

The version is required, but the encoding and standalone declarations are optional. Reading the XML declaration therefore might not reveal the encoding, but you could get lucky. Encoding declarations are recommended so that XML parsers can be sure they are decoding the document correctly.
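
If you want to read the declaration programmatically, here is a rough Python sketch. The function name declared_encoding and the regular expression are my own (not from the spec), and it assumes the first bytes of the file are ASCII-compatible, so it will not see a declaration in a UTF-16 file:

```python
import re

# Hypothetical helper: matches an XML or text declaration at the very start
# of the data and captures the encoding name if one is declared.
_XML_DECL = re.compile(
    rb'^<\?xml'
    rb'(?:\s+version\s*=\s*["\'][^"\']*["\'])?'
    rb'(?:\s+encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._\-]*)["\'])?'
)

def declared_encoding(head: bytes):
    """Return the declared encoding name, or None if none is declared."""
    match = _XML_DECL.match(head)
    if match and match.group(1):
        return match.group(1).decode("ascii")
    return None
```

Pass it the first hundred or so bytes of the file, read in binary mode. A result of None only means that nothing was declared, not that the file is broken.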

2)  Check the byte-order mark

 The spec dictates that if UTF-16 encoding is used, a byte-order mark must be present at the beginning of the document. If no hints to a document's encoding are available, it is assumed that UTF-8 encoding is in effect, and it would be an error if the document were not actually encoded with UTF-8.
Entities encoded in UTF-16 must begin with the Byte Order Mark described by Annex F of [ISO/IEC 10646], Annex H of [ISO/IEC 10646-2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors must be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
This is what your average parser does, in addition to looking for other clues.
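
A minimal Python sketch of the byte-order mark check, using the BOM constants from the standard codecs module. The function name bom_encoding is made up for illustration. The UTF-8 signature (EF BB BF) is not part of the quote above, but some editors write it, so the sketch looks for it too. It ignores UTF-32, whose little-endian BOM starts with the same two bytes as UTF-16LE:

```python
import codecs

def bom_encoding(head: bytes):
    """Return the encoding implied by a leading byte-order mark, or None."""
    if head.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "UTF-8"
    if head.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "UTF-16BE"
    if head.startswith(codecs.BOM_UTF16_LE):    # FF FE (also the start of a UTF-32LE BOM)
        return "UTF-16LE"
    return None
```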

3) Hope they follow the rules, and hope your XML parser enforces the rules:

In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 must begin with a text declaration containing an encoding declaration. See 4.3.1 The Text Declaration, at:
http://www.w3.org/TR/2000/REC-xml-20001006#sec-TextDecl
If you want to know the pattern required for the encoding declaration, it is described in the spec and guaranteed to work better than sleeping pills:
http://www.w3.org/TR/2000/REC-xml-20001006#charencoding
Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.
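
A quick way to convince yourself of the ASCII point in Python (the sample text is made up):

```python
# Text containing only ASCII characters produces identical bytes under both encodings.
text = '<?xml version="1.0"?><note>plain ASCII</note>'
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same = ascii_bytes == utf8_bytes   # the two byte strings are equal
```

So a pure ASCII file is already a valid UTF-8 file, and a parser that defaults to UTF-8 reads it correctly.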

To boil all this down:
It's probably best to check for a byte-order mark first and see whether the file is UTF-16. If there is no byte-order mark, check the encoding declaration. If there is no encoding declaration either, assume UTF-8.
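
The three steps can be sketched in Python roughly like this. The function name guess_xml_encoding and the regular expression are my own, and this is a simplification of what a real parser does (no UTF-32 or EBCDIC detection, and no external information such as MIME headers):

```python
import codecs
import re

# Hypothetical pattern: an encoding="..." pseudo-attribute inside a leading <?xml ...?>.
_ENCODING_DECL = re.compile(
    rb'^<\?xml[^>]*?\sencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._\-]*)["\']'
)

def guess_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes."""
    # Step 1: a UTF-16 byte-order mark settles the question.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16"
    # A UTF-8 signature may precede the declaration; skip it before matching.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Step 2: look for an encoding declaration near the start.
    match = _ENCODING_DECL.match(data[:200])
    if match:
        return match.group(1).decode("ascii")
    # Step 3: no BOM and no declaration, so assume UTF-8.
    return "UTF-8"
```

Open the file in binary mode, read the first few hundred bytes and pass them in. The result is the name the document claims for itself; whether the bytes really are in that encoding is a separate check.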

I hope that helps (or at least helps you go to sleep right away :) )

Edmund


 
LVL 3

Author Comment

by:slok
ID: 6470910
I'm going through the articles now.

Give me a buzz if I don't reply/close this question
by the end of the week.