Stefan Lennerbrant

asked on

Strange html entity encoding by tidy

When feeding UTF-8 characters to "tidy", I get quite weird results for some characters.

$ cat test_ISO88591.xml

$ od -cx test_ISO88591.xml
0000000   <   a   >  \r  \n       <   b   >   å   ä   ö   Å   Ä   Ö   é
           613c    0d3e    200a    623c    e53e    f6e4    c4c5    e9d6
0000020   É   ü   Ü   ë   Ë   <   /   b   >  \r  \n   <   /   a   >
           fcc9    ebdc    3ccb    622f    0d3e    3c0a    612f    003e

$ iconv -t UTF-8 -f ISO-8859-1 < test_ISO88591.xml > test_UTF8.xml

$ tidy -xml < test_ISO88591.xml 2>/dev/null

$ tidy -xml < test_UTF8.xml 2>/dev/null


In this example, a number of characters get "weirded up" when tidy handles the UTF8 file.
Or do I misunderstand how this should work?

For instance, the &#195;&#8249; entity is of course the same "characters" (by the look of them) as the correct entity &#195;&#171; which it should have been converted to.
But even though both still result in the "characters ë", the "«" look-alike defined by the former "&#8249;" cannot be translated back by UTF8; it must be &#171; for the UTF8 bit calculations to come out right.

The same goes for &#8222; and &#8211; and &#8240; and &#339; and &#8230;

Is this by design or is this a bug? Is tidy incapable of handling UTF8 files and still keep translating to &#xxx; entities?
Dave Baldwin

ISO88591 and UTF8 do not have the same character codes above 127.  They are different.  This site is the reference I've been using for years.
Stefan Lennerbrant


Well, as I see it, UTF8 is just an encoding method for Unicode; it is not a character set in itself.

So all the ISO-8859-1 characters from 128 upward can be encoded as two-byte UTF-8 sequences.
Of course many more characters can be encoded as well, but here I'm concerned with the ISO-8859-1 characters in the range 128-255 (see my example, for instance).
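To make the two-byte claim concrete, here is a minimal Python sketch (Python is not part of the original thread, and the helper name `utf8_bytes` is made up for illustration). It prints the code point and the UTF-8 bytes, in decimal, for two of the characters from the test file.

```python
def utf8_bytes(ch):
    """Return the UTF-8 encoding of a single character as a list of decimal byte values."""
    return list(ch.encode("utf-8"))


for ch in "ëË":
    # For ISO-8859-1 characters the byte value equals the Unicode code point.
    latin1_value = ch.encode("iso-8859-1")[0]
    print(f"{ch}: code point {ord(ch)}, ISO-8859-1 byte {latin1_value}, "
          f"UTF-8 bytes {utf8_bytes(ch)}")
```

The decimal values 195+171 and 195+139 are the byte pairs discussed in the rest of this thread.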

But of course the "correct" character values must then be used for the encoding.
The html entity encoding "&#195;&#171;" is correct, but the encoding "&#195;&#8249;" is wrong - even though the corresponding characters (glyphs) "look the same".

And thus, "tidy" seems to choose the wrong characters for some UTF8 character values?
Sorry, my bad - I looked at the next-to-last character when stating that "&#171;" was the value that tidy should have picked. I apologize for the confusion.

Instead, the output "&#195;&#8249;" entity should have been "&#195;&#139;". Of course.

And, with this typo/error out of the way, I do realize that tidy obviously makes the "wrong decision" on ALL html entities that encode UTF8 characters with the second byte between 128 and 159,
which, as stated, is a non-printable range. Well, maybe this is by design, and not an error.

So the (decimal) UTF two-byte code 195+171 is correctly made by tidy into "&#195;&#171;"
But the 195+139 UTF pair is "incorrectly" made into "&#195;&#8249;"
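A quick way to see which characters from the test file fall into that 128-159 second-byte range is sketched below in Python (not used in the original thread; `TEST_CHARS` and `second_byte` are illustrative names).

```python
TEST_CHARS = "åäöÅÄÖéÉüÜëË"  # the characters in the question's test file


def second_byte(ch):
    """Return the second byte of a character's two-byte UTF-8 encoding."""
    encoded = ch.encode("utf-8")
    if len(encoded) != 2:
        raise ValueError(f"{ch!r} is not a two-byte UTF-8 character")
    return encoded[1]


# Split the characters by whether the second byte lies in 128-159 (0x80-0x9F).
affected = "".join(ch for ch in TEST_CHARS if 0x80 <= second_byte(ch) <= 0x9F)
unaffected = "".join(ch for ch in TEST_CHARS if second_byte(ch) > 0x9F)

print("second byte in 128-159:", affected)
print("second byte in 160-191:", unaffected)
```

In this test set it is exactly the upper-case letters that end up with a second byte in 128-159.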

This, most probably, has to do with "tidy" not being aware that the input is UTF8 in the first place.
There is an -utf8 parameter which retains UTF8 encoding, not using html entities at all. Maybe that is the way to go.

>> ... But the 195+139 UTF pair is "incorrectly" made into "&#195;&#8249;" ...

I do not think that this is incorrect; my reasoning is as follows:

In the ISO-8859-1 character set, code-point 139 (0x8B) defines one of the (non-graphic) C1 control-code characters (the <PLD> "Partial Line Down" control-code).
Because ISO-8859-1 is a strict subset of Unicode, it is the same non-graphic 'character' in that character set, with code-point U+008B.
The Windows Latin-1 (CP1252) character set (often used on Windows as the default single-byte encoding) uses (most of) the code-points in the C1 control-code range (0x80 -> 0x9f) to define extra graphic characters (this, of course, means that CP1252 is not a strict subset of Unicode).
In CP1252, code-point 139 (0x8B) is mapped to U+2039, which is the "Single Left-Pointing Angle Quotation Mark" character.
... and hexadecimal 2039 is decimal 8249, hence the resultant entity value.
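The reasoning above can be checked with a short Python sketch (Python and the helper name `entities_as_seen_by` are not from the thread). It assumes that the misreading amounts to decoding the UTF-8 bytes one at a time as CP1252; whether tidy does exactly this internally is an assumption, but the resulting numbers match the entities reported in the question.

```python
def entities_as_seen_by(encoding, text):
    """Encode text as UTF-8, misread each byte using a single-byte encoding,
    and return the numeric entities for the resulting characters."""
    raw = text.encode("utf-8")
    misread = raw.decode(encoding)
    return "".join(f"&#{ord(c)};" for c in misread)


# Byte 0x8B read as CP1252 becomes U+2039; read as ISO-8859-1 it stays 139.
print(entities_as_seen_by("cp1252", "Ë"))
print(entities_as_seen_by("iso-8859-1", "Ë"))

# Hexadecimal 2039 is decimal 8249.
print(int("2039", 16))

# The other affected characters from the test file.
for ch in "ÅÄÖÉÜ":
    print(ch, entities_as_seen_by("cp1252", ch))
```

The values printed for the last loop are the other entity numbers listed in the question (8230, 8222, 8211, 8240 and 339).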

i.e. if (on your *n*x system?) you think that code-point 139 (0x8B) is a graphic character ("Single Left-Pointing Angle Quotation Mark"), then you are using or defaulting to a local single-byte character set (CP1252 or something similar), which is not a strict Unicode subset.

I hope that the above makes sense!
>> ... Because ISO-8859-1 is a strict subset of Unicode ...

Not exactly.  This page shows that Unicode has no characters from codepoint 128 to 159.  While all of the characters in ISO-8859-1 can be translated to some character in UTF8, the characters in the range from 128 to 159 won't have the same code point number.

The 'translation' for numeric 'HTML entities' will be different for ISO-8859-1 and UTF8 in a number of cases.  It is not 'incorrect'.
@DansDadUK, that was a perfect explanation! I do agree completely!
Thanks a lot for that extensive and detailed description.
@Dave Baldwin:

I think that we'll have to agree to disagree.

I'd rather go by more 'official' definitions than the '' page you refer to (which is describing use of the Symbol font, but warns "... Symbol font should not be used in Web pages. This page is not a demonstration of how to use Symbol font; it provides a warning of the problems that it causes, and shows how to use Unicode instead ...").

For example:

In Unicode, the first row (zero) of the first plane (known as the Basic Multilingual Plane, which spans U+0000 -> U+FFFF) defines code-points in the range U+0000 -> U+00FF.
For the Unicode range U+0000 -> U+007F, see, which indicates that this range consists of the C0 control codes and the ISO 646 ASCII graphic characters.

For the Unicode range U+0080 -> U+00FF, see, which indicates that this range consists of the C1 control codes and the ISO 8859-1 graphic characters.

ISO-8859-1 is generally assumed to be the original DEC ISO 8859-1 (note only single-hyphen) set of graphic characters, with the addition of the C0 and C1 control-code characters, thus making it a full 256-entry character set (which hence matches the Unicode BMP row 0 definition exactly); I can't (just at this moment) find the official definition, but the following is taken from a Wikipedia article - see


... and the other page (referenced by "", to which you refer), stating that "...  Unicode has no characters from codepoint 128 to 159 ...", is misleading;  that code range does not include any glyphs (graphic characters) because they are all control-code characters, which do not have a 'shape'.
... and I've now found the mapping of ISO-8859-1 - see which includes both the C0 (0x00 -> 0x1f and 0x7f) and C1 (0x80 -> 0x9f) control-code ranges.
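For anyone who wants to verify this without the linked charts, here is a small Python sketch using the standard `unicodedata` module (Python is not part of the original thread; the variable names are illustrative). It checks that every code-point in U+0080 -> U+009F is classified as a control character, and that decoding ISO-8859-1 maps each byte value to the code-point with the same number.

```python
import unicodedata

# General category of every code point in the C1 range U+0080..U+009F.
c1_categories = {unicodedata.category(chr(cp)) for cp in range(0x80, 0xA0)}
print(c1_categories)  # 'Cc' is the Unicode category for control characters

# Decode all 256 byte values as ISO-8859-1 and compare with the code points.
decoded = bytes(range(256)).decode("iso-8859-1")
is_identity = [ord(ch) for ch in decoded] == list(range(256))
print(is_identity)
```

So the range is assigned in Unicode; it simply contains characters without glyphs.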
I've now had a chance to look up what 'tidy' is supposed to do (by looking up the 'man' page - I don't have a *n*x system to play with).

It appears that it may assume by default that the input is "Latin-1" - which means that it won't see multi-byte UTF-8 sequences as such, but will process bytes individually.

There appear to be options like:

-utf8 : use UTF-8 for both input and output
char-encoding : specifies the character encoding Tidy uses for both the input and output
input-encoding : specifies the character encoding Tidy uses for the input


So perhaps you should try using one of these options?
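As a rough sketch of how that might be tried from a script, the Python below builds a tidy command line with the input encoding set explicitly and runs it only if tidy is installed. The option name `input-encoding` comes from the man page excerpt above; the `--option value` command-line form, the value `utf8`, and the helper names `tidy_command` and `run_tidy` are assumptions for illustration and have not been tested against any particular tidy version.

```python
import shutil
import subprocess


def tidy_command(input_encoding="utf8"):
    """Build a tidy command line that states the input encoding explicitly.
    The '--input-encoding utf8' spelling is an assumption, not verified."""
    return ["tidy", "-xml", "--input-encoding", input_encoding]


def run_tidy(path, input_encoding="utf8"):
    """Run tidy on a file and return its stdout as bytes, or None if tidy is missing."""
    if shutil.which("tidy") is None:
        return None
    with open(path, "rb") as source:
        result = subprocess.run(
            tidy_command(input_encoding),
            stdin=source,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,  # same effect as 2>/dev/null in the question
        )
    # tidy signals warnings through a non-zero exit code, so it is not checked here.
    return result.stdout


print(" ".join(tidy_command()))
```

Calling `run_tidy("test_UTF8.xml")` would then correspond to the last command in the question, with the encoding stated instead of guessed.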
I've received so many explanations and so much extra info on this question - thank you so very much, @DansDadUK and @DaveBaldwin !
You're welcome, glad to help.
You are welcome; I hope that the advice led to a resolution of your issue.