Solved

Strange html entity encoding by tidy

Posted on 2016-09-13
Medium Priority
153 Views
Last Modified: 2016-09-19
When inputting UTF-8 characters to "tidy", I get quite weird results for some characters.


$ cat test_ISO88591.xml
<a>
 <b>åäöÅÄÖéÉüÜëË</b>
</a>

$ od -cx test_ISO88591.xml
0000000   <   a   >  \r  \n       <   b   >   å   ä   ö   Å   Ä   Ö   é
           613c    0d3e    200a    623c    e53e    f6e4    c4c5    e9d6
0000020   É   ü   Ü   ë   Ë   <   /   b   >  \r  \n   <   /   a   >
           fcc9    ebdc    3ccb    622f    0d3e    3c0a    612f    003e
0000037

$ iconv -t UTF-8 -f ISO-8859-1 < test_ISO88591.xml > test_UTF8.xml

$ tidy -xml < test_ISO88591.xml 2>/dev/null
<a>
<b>
&#229;&#228;&#246;&#197;&#196;&#214;&#233;&#201;&#252;&#220;&#235;&#203;</b>
</a>

$ tidy -xml < test_UTF8.xml 2>/dev/null
<a>
<b>
&#195;&#165;&#195;&#164;&#195;&#182;&#195;&#8230;&#195;&#8222;&#195;&#8211;&#195;&#169;&#195;&#8240;&#195;&#188;&#195;&#339;&#195;&#171;&#195;&#8249;</b>
</a>



In this example, a number of characters get "weirded up" when tidy handles the UTF8 file.
Or do I misunderstand how this should work?

For instance, the &#195;&#8249; entity is of course the same "characters" (by the look of them) as the correct entity &#195;&#171; which it should have been converted to.
But even though they still both result in the "characters ë", the version of "«" defined by the former "&#8249;" is of course not at all translatable by UTF8 - it must be &#171; in order to get the UTF8 bit-calculations correct.

The same goes for &#8222; and &#8211; and &#8240; and &#339; and &#8230;

Is this by design or is this a bug? Is tidy incapable of handling UTF8 files and still keep translating to &#xxx; entities?
Question by:Stefan Lennerbrant
14 Comments
 
LVL 84

Expert Comment

by:Dave Baldwin
ID: 41796001
ISO88591 and UTF8 do not have the same character codes above 127.  They are different.  This site is the reference I've been using for years.  http://www.alanwood.net/
 

Author Comment

by:Stefan Lennerbrant
ID: 41796192
Well, as I see it, UTF8 is just a coding method for unicode, it's not a character set at all, in itself.

So, all the iso-8859-1 characters above 128 may be encoded by using two-byte UTF-8
Of course many more characters may be encoded as well, but now I'm aiming at iso-8859-1 characters in the range 128-255 (see my example, for instance)

But of course the "correct" character values must then be used for the encoding.
The html entity encoding "&#195;&#171;" is correct, but the encoding "&#195;&#8249;" is wrong - even though the corresponding characters (glyphs) "look the same".

And thus, "tidy" seems to choose the wrong characters for some UTF8 character values?
 
LVL 84

Accepted Solution

by:
Dave Baldwin earned 1000 total points
ID: 41796435
No.  A bunch of the iso-8859-1 character positions above 128 are empty in UTF8.  They must be translated to other 'code points'.  On this page http://www.alanwood.net/demos/symbol.html , if you search for &#171; you will see that it translates to &#8596; in Unicode.  I can't find &#8249; anywhere.

 
LVL 16

Assisted Solution

by:DansDadUK
DansDadUK earned 1000 total points
ID: 41797728
>> ... UTF8 is just a coding method for unicode, it's not a character set at all, in itself ...

Agreed.


>> ... So, all the iso-8859-1 characters above 128 may be encoded by using two-byte UTF-8 ...

Agreed.
ISO-8859-1 is an exact subset of Unicode.
All of the characters represented by the (single-byte) code-points in the ISO-8859-1 character set are the same as those represented by the same (double-byte) values (each with a leading 0x00 byte) in the Unicode character set; and they all translate to two-byte UTF-8 values (but these are not the same as the Unicode code-point two-byte values).
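
The subset claim can be checked mechanically. Here is a minimal sketch, assuming that Python's built-in "latin-1" codec is an acceptable stand-in for ISO-8859-1 (it maps all 256 byte values, control codes included); the function names are just for illustration.

```python
# Check that every ISO-8859-1 byte value decodes to the Unicode code point
# with the same number, and count how many UTF-8 bytes each code point needs.

def latin1_matches_unicode():
    """True if byte value b always decodes to code point U+00bb (same number)."""
    return all(bytes([b]).decode("latin-1") == chr(b) for b in range(256))

def utf8_length(code_point):
    """Number of bytes in the UTF-8 encoding of one code point."""
    return len(chr(code_point).encode("utf-8"))

if __name__ == "__main__":
    print(latin1_matches_unicode())
    # Set of UTF-8 lengths seen for the upper half of ISO-8859-1.
    print({utf8_length(cp) for cp in range(0x80, 0x100)})
    # Same code point number, but different bytes once UTF-8 encoded.
    print(hex(0xE5), chr(0xE5).encode("utf-8").hex())
```

So "å" is code point 0xE5 in both character sets, but its UTF-8 form is the two bytes 0xC3 0xA5, which is why the two files in the question differ byte for byte.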


>> ... the &#195;&#8249; entity is of course the same "characters" (by the look of them) as the correct entity &#195;&#171; which it should have been converted to ...

The characters represented by entities #171 and #8249 are not the same!

#171 = #xab = U+00AB = Left Pointing Double-Angle Quotation Mark

#8249 = #x2039 = U+2039 = Single Left Pointing Quotation Mark

The UTF-8 representations are:

U+00AB -> 0xc2ab
U+2039 -> 0xe280b9
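
For anyone who wants to verify these values, a small sketch using Python's standard unicodedata module (the describe helper is a made-up name for this example):

```python
import unicodedata

def describe(entity_number):
    """Return (code point label, Unicode name, UTF-8 bytes as hex) for a numeric entity."""
    ch = chr(entity_number)
    return (f"U+{entity_number:04X}", unicodedata.name(ch), ch.encode("utf-8").hex())

if __name__ == "__main__":
    for number in (171, 8249):
        print(number, describe(number))
```

The two entities come out with different names and different UTF-8 byte sequences (two bytes versus three), so they cannot be exchanged for one another.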


I'm not familiar with 'tidy', so I'm not sure exactly what it is supposed to do, so perhaps some of the above is not relevant?
 

Author Comment

by:Stefan Lennerbrant
ID: 41798156
Sorry, my bad - I looked at the next-to-last character when stating that "&#171;" was the value that tidy should have picked. I apologize for the confusion.

The output "&#195;&#8249;" entity should have been "&#195;&#139;" instead. Of course.

And, with this typo/error out of the way, I do realize that tidy obviously makes the "wrong decision" on ALL html entities that encode UTF8 characters whose second byte is between 128 and 159,
a range which, as stated, is non-printable. Well, maybe this is by design, and not an error.

So the (decimal) UTF two-byte code 195+171 is correctly made by tidy into "&#195;&#171;"
But the 195+139 UTF pair is "incorrectly" made into "&#195;&#8249;"
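
To see which characters of the test string fall into that 128-159 group, here is a rough sketch in Python; SAMPLE is the string from the test files and the function name is hypothetical.

```python
# Find the characters whose UTF-8 encoding ends in a byte in the
# range 0x80..0x9F (decimal 128..159), the range discussed above.
SAMPLE = "åäöÅÄÖéÉüÜëË"

def second_byte_in_c1_range(text):
    """Return the characters whose last UTF-8 byte lies in 0x80..0x9F."""
    return [ch for ch in text if 0x80 <= ch.encode("utf-8")[-1] <= 0x9F]

if __name__ == "__main__":
    for ch in second_byte_in_c1_range(SAMPLE):
        print(ch, list(ch.encode("utf-8")))
```

For this sample it is exactly the six uppercase letters (Å Ä Ö É Ü Ë) that are affected, which matches the six odd-looking entities in the tidy output.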

This, most probably, has to do with "tidy" not being aware that the input is UTF8 in the first place.
There is an -utf8 parameter which retains UTF8 encoding, not using html entities at all. Maybe that is the way to go.

/Stefan
 
LVL 16

Expert Comment

by:DansDadUK
ID: 41798252
>> ... But the 195+139 UTF pair is "incorrectly" made into "&#195;&#8249;" ...

I do not think that this is incorrect; my reasoning is as follows:

In the ISO-8859-1 character set, code-point 139 (0x8B) defines one of the (non-graphic) C1 control-code characters (the <PLD> "Partial Line Down" control-code).
Because ISO-8859-1 is a strict subset of Unicode, it is the same non-graphic 'character' in that character set, with code-point U+008B.
The Windows Latin-1 (CP1252) character set (often used on Windows as the default single-byte encoding), uses (most of) the characters in the C1 control-code range (0x80 -> 0x9f) to define extra graphic characters (this, of course, means that CP1252 is not a strict subset of Unicode).
In CP1252, code-point 139 (0x8B) is mapped to U+2039, which is the "Single Left Pointing Quotation Mark" character.
... and hexadecimal 2039 is decimal 8249, hence the resultant entity value.

i.e. if (on your *n*x system?) you think that code-point 139 (0x8B) is a graphic character ("Single Left Pointing Quotation Mark"), then you are using or defaulting to a local single-byte character set (CP1252 or something similar), which is not a strict Unicode subset.

I hope that the above makes sense!
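
The reasoning above can be reproduced outside of tidy. The following is only a model of the suspected behaviour, not tidy's actual code: it assumes the UTF-8 bytes are read one at a time as CP1252 and that every resulting non-ASCII character is written as a decimal entity. The function name is made up for the example.

```python
# Model: UTF-8 bytes misread byte-by-byte as Windows-1252 (CP1252),
# then written out as decimal numeric entities.
SAMPLE = "åäöÅÄÖéÉüÜëË"

def misread_as_cp1252(text):
    raw = text.encode("utf-8")     # the bytes as stored in test_UTF8.xml
    wrong = raw.decode("cp1252")   # each byte treated as one CP1252 character
    return "".join(f"&#{ord(ch)};" if ord(ch) > 127 else ch for ch in wrong)

if __name__ == "__main__":
    print(misread_as_cp1252("Ë"))   # prints &#195;&#8249;
    print(misread_as_cp1252(SAMPLE))
```

For the sample string this yields the same entity sequence as the tidy run on test_UTF8.xml in the question. Note that Python's cp1252 codec raises UnicodeDecodeError for the few byte values CP1252 leaves undefined (0x81, for instance); none of them occur in this sample.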
 
LVL 84

Expert Comment

by:Dave Baldwin
ID: 41798506
>> ... Because ISO-8859-1 is a strict subset of Unicode ...

Not exactly.  This page http://www.alanwood.net/demos/wgl4.html#w0080 shows that Unicode has no characters from codepoint 128 to 159.  While all of the characters in ISO-8859-1 can be translated to some character in UTF8, those in the range from 128 to 159 won't have the same code point number.

The 'translation' for numeric 'HTML entities' will be different for ISO-8859-1 and UTF8 in a number of cases.  It is Not 'incorrect'.
 

Author Comment

by:Stefan Lennerbrant
ID: 41798508
@DansDadUK, that was a perfect explanation! I do agree completely!
Thanks a lot for that extensive and detailed description.
 
LVL 16

Expert Comment

by:DansDadUK
ID: 41798605
@Dave Baldwin:

I think that we'll have to agree to disagree.

I'd rather go by more 'official' definitions than the 'alanwood.net/demos/symbol' page you refer to (which is describing use of the Symbol font, but warns "... Symbol font should not be used in Web pages. This page is not a demonstration of how to use Symbol font; it provides a warning of the problems that it causes, and shows how to use Unicode instead ...").

For example:

In Unicode, the first plane (known as the Basic Multilingual Plane) defines code-points in the range U+0000 -> U+FFFF; its first row (zero) covers the range U+0000 -> U+00FF.
For the Unicode range U+0000 -> U+007F, see http://www.unicode.org/charts/PDF/U0000.pdf, which indicates that this range consists of the C0 control codes and the ISO 646 ASCII graphic characters.

For the Unicode range U+0080 -> U+00FF, see http://www.unicode.org/charts/PDF/U0080.pdf, which indicates that this range consists of the C1 control codes and the ISO 8859-1 graphic characters.

ISO-8859-1 is generally assumed to be the original DEC ISO 8859-1 (note only single-hyphen) set of graphic characters, with the addition of the C0 and C1 control-code characters, thus making it a full 256 entry characters set (and which hence matches the Unicode BMP row 0 definition exactly); I can't (just at this moment) find the official definition, but the following is taken from a Wikipedia article - see https://en.wikipedia.org/wiki/ISO/IEC_8859-1

Extract from Wikipedia article

... and the other page (referenced by "alanwood.net/demos/wgl4.html#w0080") to which you refer, stating that "...  Unicode has no characters from codepoint 128 to 159 ...", is misleading;  that code range does not include any glyphs (graphic characters) because they are all control-code characters, which do not have a 'shape'.
 
LVL 16

Expert Comment

by:DansDadUK
ID: 41798654
... and I've now found the Unicode.org mapping of ISO-8859-1 - see http://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT which includes both the C0 (0x00 -> 0x1f and 0x7f) and C1 (0x80 -> 0x9f) control-code ranges.
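
The same point can be checked with Python's unicodedata module. This is a sketch only, assuming Python's "latin-1" and "cp1252" codecs follow the published mapping tables; classify is a hypothetical helper.

```python
import unicodedata

def classify(byte_value):
    """Compare how one byte is interpreted under ISO-8859-1 and under CP1252."""
    as_latin1 = bytes([byte_value]).decode("latin-1")
    # Raises UnicodeDecodeError for the few bytes CP1252 leaves undefined.
    as_cp1252 = bytes([byte_value]).decode("cp1252")
    return (
        unicodedata.category(as_latin1),   # general category under ISO-8859-1
        f"U+{ord(as_cp1252):04X}",         # code point chosen by CP1252
        unicodedata.category(as_cp1252),   # general category under CP1252
    )

if __name__ == "__main__":
    print(classify(0x8B))
    # Every code point from U+0080 to U+009F is assigned, as a control character (Cc).
    print(all(unicodedata.category(chr(cp)) == "Cc" for cp in range(0x80, 0xA0)))
```

Byte 0x8B is a control character (category Cc) when read as ISO-8859-1, but a punctuation character at U+2039 when read as CP1252.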
 
LVL 16

Expert Comment

by:DansDadUK
ID: 41799492
I've now had a chance to look up what 'tidy' is supposed to do (by looking up the 'man' page - I don't have a *n*x system to play with).

It appears that it may assume by default that the input is "Latin-1" - which means that it won't see multi-byte UTF-8 arrays as such, but will process bytes individually.

There appear to be options like:

-utf8 : use UTF-8 for both input and output
char-encoding : specifies the character encoding Tidy uses for both the input and output
input-encoding : specifies the character encoding Tidy uses for the input

e.g.:

'tidy' input-encoding description
So perhaps you should try using one of these options?
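
As an untested sketch of how that might be tried from a script: the snippet below only uses the -xml flag from the question and the input-encoding option named above, passed in tidy's usual "--option value" form. I have not run this against any particular tidy version, so treat the exact output as unknown; tidy_command and run_tidy are made-up helper names.

```python
import shutil
import subprocess

def tidy_command(input_encoding="utf8"):
    """Build a tidy command line that states the input encoding explicitly."""
    return ["tidy", "-xml", "--input-encoding", input_encoding]

def run_tidy(xml_bytes, input_encoding="utf8"):
    """Run tidy on the given bytes; return its stdout, or None if tidy is not installed."""
    if shutil.which("tidy") is None:
        return None
    result = subprocess.run(
        tidy_command(input_encoding),
        input=xml_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,   # same effect as 2>/dev/null in the question
        check=False,                 # tidy exits non-zero on warnings
    )
    return result.stdout

if __name__ == "__main__":
    with open("test_UTF8.xml", "rb") as handle:
        print(run_tidy(handle.read()))
```

If the default Latin-1 assumption is the cause, the entities produced this way should correspond to single code points (one entity per character) rather than one entity per byte.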
 

Author Comment

by:Stefan Lennerbrant
ID: 41804484
I've received so many explanations and so much extra info on this question - thank you so very much, @DansDadUK and @DaveBaldwin !
 
LVL 84

Expert Comment

by:Dave Baldwin
ID: 41804504
You're welcome, glad to help.
 
LVL 16

Expert Comment

by:DansDadUK
ID: 41804574
You are welcome; I hope that the advice led to a resolution of your issue.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

I. Introduction There's an interesting discussion going on now in an Experts Exchange Group — Attachments with no extension (http://www.experts-exchange.com/discussions/210281/Attachments-with-no-extension.html). This reminded me of questions tha…
The purpose of this article is to demonstrate how we can use conditional statements using Python.
Learn how to get help with Linux/Unix bash shell commands. Use help to read help documents for built in bash shell commands.: Use man to interface with the online reference manuals for shell commands.: Use man to search man pages for unknown command…
Learn how to navigate the file tree with the shell. Use pwd to print the current working directory: Use ls to list a directory's contents: Use cd to change to a new directory: Use wildcards instead of typing out long directory names: Use ../ to move…
Suggested Courses
Course of the Month12 days, 18 hours left to enroll

971 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question