Strange html entity encoding by tidy

When feeding UTF8 characters to "tidy", I get quite weird results for some characters.


$ cat test_ISO88591.xml
<a>
 <b>åäöÅÄÖéÉüÜëË</b>
</a>

$ od -cx test_ISO88591.xml
0000000   <   a   >  \r  \n       <   b   >   å   ä   ö   Å   Ä   Ö   é
           613c    0d3e    200a    623c    e53e    f6e4    c4c5    e9d6
0000020   É   ü   Ü   ë   Ë   <   /   b   >  \r  \n   <   /   a   >
           fcc9    ebdc    3ccb    622f    0d3e    3c0a    612f    003e
0000037

$ iconv -t UTF-8 -f ISO-8859-1 < test_ISO88591.xml > test_UTF8.xml

$ tidy -xml < test_ISO88591.xml 2>/dev/null
<a>
<b>
&#229;&#228;&#246;&#197;&#196;&#214;&#233;&#201;&#252;&#220;&#235;&#203;</b>
</a>

$ tidy -xml < test_UTF8.xml 2>/dev/null
<a>
<b>
&#195;&#165;&#195;&#164;&#195;&#182;&#195;&#8230;&#195;&#8222;&#195;&#8211;&#195;&#169;&#195;&#8240;&#195;&#188;&#195;&#339;&#195;&#171;&#195;&#8249;</b>
</a>



In this example, a number of characters get "weirded up" when tidy handles the UTF8 file.
Or do I misunderstand how this should work?

For instance, the &#195;&#8249; entity is of course the same "characters" (by the look of them) as the correct entity &#195;&#171; which it should have been converted to.
But even though they both still result in the "characters ë", the version of "«" defined by the former "&#8249;" is of course not at all translatable by UTF8 - it must be &#171; in order to get the UTF8 bit-calculations correct.

The same goes for &#8222; and &#8211; and &#8240; and &#339; and &#8230;

Is this by design or is this a bug? Is tidy incapable of handling UTF8 files while still translating to &#xxx; entities?
Stefan Lennerbrant Asked:

Dave Baldwin (Fixer of Problems) Commented:
ISO88591 and UTF8 do not have the same character codes above 127.  They are different.  This site is the reference I've been using for years.  http://www.alanwood.net/

Stefan Lennerbrant (Author) Commented:
Well, as I see it, UTF8 is just a coding method for unicode, it's not a character set at all, in itself.

So, all the iso-8859-1 characters above 128 may be encoded by using two-byte UTF-8.
Of course many more characters may be encoded as well, but for now I'm aiming at iso-8859-1 characters in the range 128-255 (see my example, for instance).

But of course the "correct" character values must then be used for the encoding.
The html entity encoding "&#195;&#171;" is correct, but the encoding "&#195;&#8249;" is wrong - even though the corresponding characters (glyphs) "look the same".

And thus, "tidy" seems to choose the wrong characters for some UTF8 character values?

Dave Baldwin (Fixer of Problems) Commented:
No.  A bunch of the iso-8859-1 character positions above 128 are empty in UTF8.  They must be translated to other 'code points'.  On this page http://www.alanwood.net/demos/symbol.html , if you search for &#171; you will see that it translates to &#8596; in Unicode.  I can't find &#8249; anywhere.
DansDadUK Commented:
>> ... UTF8 is just a coding method for unicode, it's not a character set at all, in itself ...

Agreed.


>> ... So, all the iso-8859-1 characters above 128 may be encoded by using two-byte UTF-8 ...

Agreed.
ISO-8859-1 is an exact subset of Unicode.
All of the characters represented by the (single-byte) code-points in the ISO-8859-1 character set are the same as those represented by the same (double-byte) values (each with a leading 0x00 byte) in the Unicode character set; and they all translate to two-byte UTF-8 values (but these are not the same as the Unicode code-point two-byte values).


>> ... the &#195;&#8249; entity is of course the same "characters" (by the look of them) as the correct entity &#195;&#171; which it should have been converted to ...

The characters represented by entities #171 and #8249 are not the same!

#171 = #xab = U+00AB = Left Pointing Double-Angle Quotation Mark

#8249 = #x2039 = U+2039 = Single Left Pointing Quotation Mark

The UTF-8 representations are:

U+00AB -> 0xc2ab
U+2039 -> 0xe280b9
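
The two byte sequences above can be checked with a minimal shell sketch, assuming iconv and od are available. The function name utf8_bytes is made up for illustration; it takes the four hex digits of a BMP code point, writes that code point as UTF-16BE, and lets iconv convert it to UTF-8.

```shell
# Hypothetical helper: print the UTF-8 bytes (in hex) of a BMP code point.
# $1 is the code point as exactly four hex digits, e.g. 00ab or 2039.
utf8_bytes() {
    hi=$(printf '%o' "0x${1%??}")   # high byte of the code point, as octal
    lo=$(printf '%o' "0x${1#??}")   # low byte of the code point, as octal
    # Emit the code point as two UTF-16BE bytes, convert to UTF-8, dump as hex.
    printf "\\$hi\\$lo" | iconv -f UTF-16BE -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

utf8_bytes 00ab; echo    # c2ab
utf8_bytes 2039; echo    # e280b9
```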


I'm not familiar with 'tidy', so I'm not sure exactly what it is supposed to do, so perhaps some of the above is not relevant?

Stefan Lennerbrant (Author) Commented:
Sorry, my bad - I looked at the next-to-last character when stating that "&#171;" was the value that tidy should have picked. I apologize for the confusion.

The output "&#195;&#8249;" entity should of course have been "&#195;&#139;" instead.

And, with this typo/error out of the way, I do realize that tidy obviously makes the "wrong decision" on ALL html entities that shall encode UTF8 characters with the second byte between 128 and 159.
Those values, as stated, are non-printable. Well, maybe this is by design, and not an error.

So the (decimal) UTF8 two-byte code 195+171 is correctly made by tidy into "&#195;&#171;"
But the 195+139 UTF pair is "incorrectly" made into "&#195;&#8249;"

This, most probably, has to do with "tidy" not being aware that the input is UTF8 in the first place.
There is an -utf8 parameter which retains UTF8 encoding, not using html entities at all. Maybe that is the way to go.

/Stefan

DansDadUK Commented:
>> ... But the 195+139 UTF pair is "incorrectly" made into "&#195;&#8249;" ...

I do not think that this is incorrect; my reasoning is as follows:

In the ISO-8859-1 character set, code-point 139 (0x8B) defines one of the (non-graphic) C1 control-code characters (the <PLD> "Partial Line Down" control-code).
Because ISO-8859-1 is a strict subset of Unicode, it is the same non-graphic 'character' in that character set, with code-point U+008B.
The Windows Latin-1 (CP1252) character set (often used on Windows as the default single-byte encoding), uses (most of) the characters in the C1 control-code range (0x80 -> 0x9f) to define extra graphic characters (this, of course, means that CP1252 is not a strict subset of Unicode).
In CP1252, code-point 139 (0x8B) is mapped to U+2039, which is the "Single Left Pointing Quotation Mark" character.
... and hexadecimal 2039 is decimal 8249, hence the resultant entity value.

i.e. if (on your *n*x system?) you think that code-point 139 (0x8B) is a graphic character ("Single Left Pointing Quotation Mark"), then you are using or defaulting to a local single-byte character set (CP1252 or something similar), which is not a strict Unicode subset.
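
The reasoning above can be reproduced without tidy. The following is a minimal sketch, assuming an iconv that knows the CP1252 charset name; the function name cp1252_entities is made up for illustration. It reads raw bytes, treats each byte as one CP1252 character, and prints the resulting Unicode code points as numeric entities. This only mimics the byte-by-byte interpretation described here; it is not tidy's actual code.

```shell
# Hypothetical helper: interpret stdin byte-by-byte as CP1252 and print
# each resulting Unicode code point as a decimal numeric entity.
cp1252_entities() {
    # CP1252 -> UTF-16BE gives exactly two bytes per character (all are in the BMP).
    iconv -f CP1252 -t UTF-16BE | od -An -v -tu1 |
    awk '{ for (i = 1; i <= NF; i++) b[n++] = $i }
         END {
             # Combine each high/low byte pair into one code point.
             for (i = 0; i + 1 < n; i += 2) printf "&#%d;", b[i] * 256 + b[i + 1]
             print ""
         }'
}

# UTF-8 bytes of "ëË" are c3 ab c3 8b (octal 303 253 303 213).
printf '\303\253\303\213' | cp1252_entities    # &#195;&#171;&#195;&#8249;
```

The last pair matches the tidy output in the question: byte 0x8B read as CP1252 becomes U+2039, which is decimal 8249.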

I hope that the above makes sense!

Dave Baldwin (Fixer of Problems) Commented:
Because ISO-8859-1 is a strict subset of Unicode
Not exactly.  This page http://www.alanwood.net/demos/wgl4.html#w0080 shows that Unicode has no characters from codepoint 128 to 159.  While all of the characters in ISO-8859-1 can be translated to some character in UTF8, the characters in the range from 128 to 159 won't have the same code point number.

The 'translation' for numeric 'HTML entities' will be different for ISO-8859-1 and UTF8 in a number of cases.  It is not 'incorrect'.

Stefan Lennerbrant (Author) Commented:
@DansDadUK, that was a perfect explanation! I do agree completely!
Thanks a lot for that extensive and detailed description.

DansDadUK Commented:
@Dave Baldwin:

I think that we'll have to agree to disagree.

I'd rather go by more 'official' definitions than the 'alanwood.net/demos/symbol' page you refer to (which is describing use of the Symbol font, but warns "... Symbol font should not be used in Web pages. This page is not a demonstration of how to use Symbol font; it provides a warning of the problems that it causes, and shows how to use Unicode instead ...").

For example:

In Unicode, the first plane (known as the Basic Multilingual Plane) defines code-points in the range U+0000 -> U+FFFF; its first row (row zero) covers U+0000 -> U+00FF.
For the Unicode range U+0000 -> U+007F, see http://www.unicode.org/charts/PDF/U0000.pdf, which indicates that this range consists of the C0 control codes and the ISO 646 ASCII graphic characters.

For the Unicode range U+0080 -> U+00FF, see http://www.unicode.org/charts/PDF/U0080.pdf, which indicates that this range consists of the C1 control codes and the ISO 8859-1 graphic characters.

ISO-8859-1 is generally assumed to be the original DEC ISO 8859-1 (note only single-hyphen) set of graphic characters, with the addition of the C0 and C1 control-code characters, thus making it a full 256 entry characters set (and which hence matches the Unicode BMP row 0 definition exactly); I can't (just at this moment) find the official definition, but the following is taken from a Wikipedia article - see https://en.wikipedia.org/wiki/ISO/IEC_8859-1

[Image: extract from Wikipedia article]

... and the other page to which you refer ("alanwood.net/demos/wgl4.html#w0080"), stating that "... Unicode has no characters from codepoint 128 to 159 ...", is misleading; that code range does not include any glyphs (graphic characters) because they are all control-code characters, which do not have a 'shape'.

DansDadUK Commented:
... and I've now found the Unicode.org mapping of ISO-8859-1 - see http://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT which includes both the C0 (0x00 -> 0x1f and 0x7f) and C1 (0x80 -> 0x9f) control-code ranges.

DansDadUK Commented:
I've now had a chance to look up what 'tidy' is supposed to do (by looking up the 'man' page - I don't have a *n*x system to play with).

It appears that it may assume by default that the input is "Latin-1" - which means that it won't see multi-byte UTF-8 sequences as such, but will process bytes individually.

There appear to be options like:

-utf8 : use UTF-8 for both input and output
char-encoding : specifies the character encoding Tidy uses for both the input and output
input-encoding : specifies the character encoding Tidy uses for the input

e.g.:

[Image: 'tidy' input-encoding description]
So perhaps you should try using one of these options?
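
As a rough sketch of how those options might be combined, assuming HTML Tidy is installed and that test_UTF8.xml is the file from the question: the input is declared as UTF-8 while the output is restricted to ASCII, so that non-ASCII characters should come out as numeric entities. The function name tidy_utf8_to_entities is made up for illustration, and the exact output may vary between tidy versions.

```shell
# Hypothetical wrapper: read UTF-8 XML on stdin, write ASCII XML on stdout.
# --input-encoding / --output-encoding are tidy configuration options;
# -q suppresses nonessential output, and warnings on stderr are discarded.
tidy_utf8_to_entities() {
    tidy -xml -q --input-encoding utf8 --output-encoding ascii 2>/dev/null
}

# Usage, assuming the file from the question exists in the current directory:
#   tidy_utf8_to_entities < test_UTF8.xml
# To keep the characters as UTF-8 instead of entities, the -utf8 option
# mentioned above could be used:
#   tidy -xml -utf8 < test_UTF8.xml
```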

Stefan Lennerbrant (Author) Commented:
I've received so many explanations and so much extra info on this question - thank you so very much, @DansDadUK and @DaveBaldwin !

Dave Baldwin (Fixer of Problems) Commented:
You're welcome, glad to help.

DansDadUK Commented:
You are welcome; I hope that the advice led to a resolution of your issue.