Solved

UTF-8 and more

Posted on 2000-02-23
Last Modified: 2008-03-17
Any idea exactly which characters we are talking about here?

"There are some characters in the XML file that are not UTF-8 compatable (e.g. octal 224, octal 223, octal 205). And this would cause the XML parser to break"

Question by:Jitu
9 Comments
 
LVL 1

Accepted Solution

by:
Deckmeister earned 75 total points
ID: 2553883
Hi,

UTF-8 is a transformation format of Unicode that preserves compatibility with ASCII.
Indeed, the characters that also exist in ASCII are coded in UTF-8 on 8 bits (a single byte), with the same value as in ASCII.
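
To make the ASCII compatibility concrete, here is a minimal Python sketch (Python is my choice for illustration; nothing in the thread depends on it). The helper name `ascii_matches_utf8` is hypothetical. It checks that each of the 128 ASCII code points encodes to one UTF-8 byte with the same numeric value.

```python
def ascii_matches_utf8() -> bool:
    """Return True if every ASCII code point is one identical byte in UTF-8."""
    for code in range(128):
        ch = chr(code)
        encoded = ch.encode("utf-8")
        # One byte, same numeric value, and identical to the ASCII encoding.
        if encoded != bytes([code]) or encoded != ch.encode("ascii"):
            return False
    return True


print(ascii_matches_utf8())
print(list("A".encode("utf-8")))  # the single byte 65, same as ASCII 'A'
```

A plain ASCII file is therefore already a valid UTF-8 file, byte for byte.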
 
LVL 1

Expert Comment

by:Deckmeister
ID: 2554010
Hi again,

I just want to add a few comments to my answer:

UTF stands for UCS Transformation Format.
It is an exchange code (or transfer code) designed for sending ISO 10646 documents to a file server or over a network.
UTF-7 exchanges data in 7-bit units, while UTF-8 uses 8-bit units (bytes); a single character may need more than one unit.
The ISO 10646 standard covers all known alphabets, using 32 bits per character.

One main characteristic of UTF-8 is that it preserves the ASCII character set.
That is what I tried to explain in my answer.
All the characters of the ASCII set are coded on a single byte, whose value is the corresponding ASCII character value.

The latest versions of Navigator and Explorer support UTF-8.
You just have to add this meta-information to the <head> section of a document:
<meta http-equiv="content-type" content="text/html; charset=utf-8">


XML documents use Unicode by default, which is a simpler version of ISO 10646 and codes characters on 16 bits.
You can specify in an XML document what character set you use, but you should use Unicode.
If it isn't possible, then use ASCII.
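
As a rough illustration of specifying the character set in an XML document, the sketch below builds the same small document twice with different `encoding` declarations and parses both with Python's standard `xml.etree.ElementTree`. The helper `make_doc`, the element name `word`, and the sample text are hypothetical, chosen only for the example.

```python
import xml.etree.ElementTree as ET

TEXT = "caf\u00e9"  # contains one non-ASCII character (e with acute accent)


def make_doc(encoding: str) -> bytes:
    """Build a tiny XML document whose declaration matches its byte encoding."""
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>'
    return (declaration + f"<word>{TEXT}</word>").encode(encoding)


utf8_doc = make_doc("utf-8")
latin1_doc = make_doc("iso-8859-1")

# The bytes on disk differ, but the parser recovers the same text from both,
# because each declaration tells it how to interpret the bytes.
print(utf8_doc != latin1_doc)
print(ET.fromstring(utf8_doc).text == ET.fromstring(latin1_doc).text == TEXT)
```

The point is that the declaration and the actual bytes have to agree; trouble starts when they do not.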
 
LVL 1

Expert Comment

by:Deckmeister
ID: 2557429
Hi again,

The sentence
"There are some characters in the XML file that are not UTF-8 compatable (e.g. octal 224, octal 223, octal 205). And this would cause the XML parser to break"
surely means you are using special characters (for instance 'é').

All XML processors must accept UTF-8 and UTF-16.
If you want some examples about UTF-8, take a look at http://www.ascc.net/xml/test/wf/utf-8/application_xml/
There are some examples there, written in xml with UTF-8 (so use Explorer 5).
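
For what it's worth, the failure described in the quoted sentence can be reproduced with a short Python sketch. It assumes the file has no encoding declaration (so the parser treats it as UTF-8) and contains the raw bytes octal 205, 223 and 224. The helper names `is_valid_utf8` and `parses`, and the `note` element, are hypothetical.

```python
import xml.etree.ElementTree as ET

# The three byte values from the error message, written in octal.
SUSPECT = bytes([0o205, 0o223, 0o224])


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def parses(xml_bytes: bytes) -> bool:
    """Return True if the XML parser accepts the document."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


bad_doc = b"<note>" + SUSPECT + b"</note>"          # raw single bytes, no declaration
good_doc = "<note>\u00e9</note>".encode("utf-8")    # same kind of text, properly encoded

print(is_valid_utf8(SUSPECT))
print(parses(bad_doc), parses(good_doc))
```

Each of those bytes on its own is a UTF-8 continuation byte with no lead byte before it, which is why a parser expecting UTF-8 rejects the file.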

 
LVL 1

Author Comment

by:Jitu
ID: 2557973
Deckmeister>
Can u pls tell me what exactly are these characters...can u type them in here pls...:
octal 224, octal 223, octal 205
 
LVL 1

Expert Comment

by:Deckmeister
ID: 2563989
Your question has a value of more than 25 points.
 
LVL 1

Author Comment

by:Jitu
ID: 2564578
If u could help me with the above Qs I promise to triple it. :-)
 
LVL 1

Expert Comment

by:Deckmeister
ID: 2567777
One question:
Are octal 205, octal 223 and octal 224 ISO-Latin-1 characters? You say that they are not UTF-8 compatible, so I suppose they are not UTF-8 characters.

In the ISO-Latin-1 (ISO 8859-1) chart, the values you specify mean:
Octal  Binary     Meaning
205    1000 0101  Next Line (NEL)
223    1001 0011  Set Transmission State (STS)
224    1001 0100  Cancel Character (CCH)
These are reserved control characters (i.e. decimal values between 128 and 159 inclusive).
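
The conversions above can be checked with a small Python sketch. The function name `describe` is hypothetical. The last column is an assumption that goes beyond this thread: if the file was written on Windows, the same bytes read as Windows-1252 (cp1252) would be printable punctuation rather than control characters.

```python
import unicodedata


def describe(octal_text: str) -> tuple[int, str, str, str, str]:
    """Return (decimal, hex, binary, Latin-1 category, cp1252 character)."""
    value = int(octal_text, 8)
    latin1_char = bytes([value]).decode("iso-8859-1")
    return (
        value,
        hex(value),
        format(value, "08b"),
        unicodedata.category(latin1_char),    # "Cc" means control character
        bytes([value]).decode("cp1252"),      # assumption: Windows-1252 source
    )


for octal_text in ("205", "223", "224"):
    decimal, hexa, binary, category, cp1252_char = describe(octal_text)
    print(octal_text, decimal, hexa, binary, category, repr(cp1252_char))
```

Under that assumption the three bytes would be an ellipsis and a pair of curly double quotes, the kind of characters a word processor inserts automatically. Whether that applies here depends on how the file was produced.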

Just a correction to my previous answer:
UTF-8 characters have a variable length of between 1 and 6 BYTES.
For more information about how UTF-8 works, you can read RFC 2044.
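
To see the variable length in practice, here is a minimal Python sketch with a few sample characters of my own choosing. One caveat from outside this thread: RFC 2044 describes sequences of up to 6 bytes, but the later RFC 3629 restricts UTF-8 to at most 4 bytes, and that is what Python's codec implements. The helper name `utf8_length` is hypothetical.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


samples = {
    "A": "ASCII letter",
    "\u00e9": "e with acute accent",
    "\u20ac": "euro sign",
    "\U0001F600": "emoji outside the 16-bit range",
}

for ch, label in samples.items():
    print(f"U+{ord(ch):04X} {label}: {utf8_length(ch)} byte(s)")
```

The byte count grows with the code point value, which is the sense in which the length is counted in bytes, not bits.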
 

Expert Comment

by:msonstei
ID: 2625406
Just a comment - I believe Deckmeister means UTF characters take X number of BITS to represent not BYTES.  Am I correct?
 
LVL 1

Expert Comment

by:Deckmeister
ID: 2626799
Msonstei>
No, I've said BYTES.
It seems amazing but it's true: UTF-8 characters have a length between 1 and 6 bytes. It is a variable length, whereas Unicode characters have a constant length of 2 bytes.