Solved

UTF-8 and more

Posted on 2000-02-23
Last Modified: 2008-03-17
Any idea exactly which characters we are talking about here?

"There are some characters in the XML file that are not UTF-8 compatable (e.g. octal 224, octal 223, octal 205). And this would cause the XML parser to break"

Question by:Jitu
 
LVL 1

Accepted Solution

by:
Deckmeister earned 25 total points
ID: 2553883
Hi,

UTF-8 is a transformation format of Unicode that preserves compatibility with ASCII.
Indeed, the characters that UTF-8 shares with ASCII are coded on 8 bits, with the same decimal value.
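To make the ASCII compatibility concrete, here is a minimal Python sketch. The sample string is just an illustration I picked, not something from the question; the point is that pure ASCII text produces exactly the same bytes whether you encode it as ASCII or as UTF-8.

```python
# Sketch: ASCII text yields identical bytes in ASCII and in UTF-8.
text = "Hello, XML!"  # hypothetical sample string, ASCII only

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# The two encodings agree byte for byte on ASCII input.
same_bytes = ascii_bytes == utf8_bytes

# Each byte value equals the character's ASCII code.
codes_match = list(utf8_bytes) == [ord(ch) for ch in text]

print(same_bytes, codes_match)
```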
 
LVL 1

Expert Comment

by:Deckmeister
ID: 2554010
Hi again,

I just want to add some comments to my answer:

UTF means UCS Transformation Format.
It is an exchange code (or transfer code) made to send ISO 10646 documents to a file server or over a network.
UTF-7 works with 7-bit units for data exchange, while UTF-8 works with 8-bit units; in both, a single character may take several units.
The ISO 10646 standard covers all the known alphabets, using up to 32 bits per character.

One main characteristic of UTF-8 is the preservation of the ASCII character set.
That is what I tried to explain in my answer.
All the characters of the ASCII set are coded on a single byte, whose value is the corresponding ASCII character value.

The latest versions of Navigator and Explorer support UTF-8.
You just have to add a meta element in the <head> section of a document:
<meta http-equiv="content-type" content="text/html; charset=utf-8">


XML documents use Unicode by default, which is a simplified version of ISO 10646 and which codes characters on 16 bits.
You can specify in an XML document which character set you use (through the encoding declaration), but you should use Unicode.
If that isn't possible, then use ASCII.
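As an illustration of specifying the character set in the document itself, here is a small Python sketch using the standard library's ElementTree parser. The document content and element name are made up for the example; the assumption is simply that a file saved as ISO-8859-1 says so in its XML declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the byte 0xE9 is 'é' in ISO-8859-1.
# The encoding declaration tells the parser how to read it.
doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>caf\xe9</name>'

root = ET.fromstring(doc)
decoded_text = root.text  # the parser has converted the byte to a Unicode string

print(decoded_text)
```

Without the encoding declaration, the parser would assume UTF-8 and reject the same bytes.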
 
LVL 1

Expert Comment

by:Deckmeister
ID: 2557429
Hi again,

The sentence
"There are some characters in the XML file that are not UTF-8 compatible (e.g. octal 224, octal 223, octal 205). And this would cause the XML parser to break"
most likely means the file contains non-ASCII characters (for instance 'é') that are not encoded as UTF-8.

All XML processors must accept UTF-8 and UTF-16.
If you want some examples about UTF-8, take a look at http://www.ascc.net/xml/test/wf/utf-8/application_xml/
There are some examples there, written in XML with UTF-8 (so use Explorer 5).
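If you want to see the failure for yourself, here is a minimal Python sketch. The XML snippet is invented for the test; I am only assuming that your file contains the raw bytes octal 223 and 224 (hex 93 and 94) with no encoding declaration, so the parser treats it as UTF-8.

```python
import xml.etree.ElementTree as ET

# Hypothetical document containing the raw bytes octal 223 and 224
# (0x93 and 0x94). Neither is valid on its own in UTF-8.
doc = b"<note>\x93quoted\x94</note>"

try:
    ET.fromstring(doc)
    parse_failed = False
except ET.ParseError as err:
    # The parser stops at the first byte sequence that is not valid UTF-8.
    parse_failed = True
    print("parser rejected the document:", err)
```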
 
LVL 1

Author Comment

by:Jitu
ID: 2557973
Deckmeister>
Can you please tell me exactly what these characters are? Can you type them in here, please:
octal 224, octal 223, octal 205
 
LVL 1

Expert Comment

by:Deckmeister
ID: 2563989
Your question has a value of more than 25 points.
 
LVL 1

Author Comment

by:Jitu
ID: 2564578
If you could help me with the above questions, I promise to triple it. :-)
 
LVL 1

Expert Comment

by:Deckmeister
ID: 2567777
One question:
Are octal 205, octal 223 and octal 224 ISO-Latin-1 characters? You say that they are not UTF-8 compatible (so I suppose they are not UTF-8 encoded).

In the ISO-Latin-1 chart, the characters you specify are:
Octal  Binary      Name
205    1000 0101   Next Line (NEL)
223    1001 0011   Set Transmit State (STS)
224    1001 0100   Cancel Character (CCH)
These are reserved control characters (i.e. decimal values between 128 and 159 inclusive).
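Here is a quick Python sketch for checking what those three bytes are under different encodings. The Windows-1252 line is my own guess, not something stated in the question: files written by Windows applications often use that code page, where the same byte values are printable punctuation rather than control characters.

```python
import unicodedata

# The three bytes from the question: octal 205, 223, 224 (hex 85, 93, 94).
raw = bytes([0o205, 0o223, 0o224])

# As ISO-Latin-1 they map to C1 control characters (category 'Cc').
latin1 = raw.decode("latin-1")
latin1_categories = [unicodedata.category(ch) for ch in latin1]

# Assumption: if the file came from a Windows program, the bytes may be
# Windows-1252, where they are an ellipsis and curly double quotes.
cp1252 = raw.decode("cp1252")
cp1252_names = [unicodedata.name(ch) for ch in cp1252]

# As UTF-8 the bytes are invalid: each is a continuation byte with no lead byte.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(latin1_categories, cp1252_names, valid_utf8)
```

If the Windows-1252 reading fits your data, declaring that encoding in the XML declaration or converting the file to UTF-8 would be the two obvious things to try.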

Just a correction to my previous answer:
UTF-8 characters have a length between 1 and 6 BYTES (the later RFC 3629 limits this to 4). It is a variable length.
For more information about how UTF-8 works, you can read RFC 2044.
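To show how the variable length works, here is a hand-written encoder sketch in Python. The function name utf8_encode is my own, and it only covers the 1 to 4 byte forms used for code points up to U+10FFFF; the 5 and 6 byte forms described in RFC 2044 are left out, and surrogate code points are not checked. In practice you would just call str.encode("utf-8").

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (illustrative helper, 1 to 4 bytes only)."""
    if cp < 0x80:
        # 0xxxxxxx: ASCII, one byte, value unchanged
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:
        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")


# 'A', e-acute, euro sign, and an emoji: 1, 2, 3 and 4 bytes respectively.
samples = [0x41, 0xE9, 0x20AC, 0x1F600]
encoded_lengths = [len(utf8_encode(cp)) for cp in samples]
print(encoded_lengths)
```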
 

Expert Comment

by:msonstei
ID: 2625406
Just a comment - I believe Deckmeister means UTF characters take X number of BITS to represent not BYTES.  Am I correct?
 
LVL 1

Expert Comment

by:Deckmeister
ID: 2626799
Msonstei>
No, I've said BYTES.
It may seem surprising, but it's true: UTF-8 characters have a length between 1 and 6 bytes. It is a variable length, whereas 16-bit Unicode (UCS-2) characters have a constant length of 2 bytes.
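A short Python sketch to check the byte counts yourself. The sample characters are my own choice. Note that Python's UTF-16 codec goes beyond plain 16-bit Unicode: characters outside the Basic Multilingual Plane take 4 bytes there (a surrogate pair), which is why the last sample is longer.

```python
# Sample characters (chosen for illustration): 'A', e-acute, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

utf8_lengths = []
utf16_lengths = []
for ch in samples:
    # Lengths are counted in BYTES, not bits.
    utf8_lengths.append(len(ch.encode("utf-8")))
    # "utf-16-be" avoids the 2-byte byte order mark that "utf-16" would prepend.
    utf16_lengths.append(len(ch.encode("utf-16-be")))

print("UTF-8 bytes: ", utf8_lengths)
print("UTF-16 bytes:", utf16_lengths)
```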