Solved

Strange html entity encoding by tidy

Posted on 2016-09-13
Last Modified: 2016-09-19
When inputting UTF-8 characters to "tidy", I get quite weird results for some characters.


$ cat test_ISO88591.xml
<a>
 <b>åäöÅÄÖéÉüÜëË</b>
</a>

$ od -cx test_ISO88591.xml
0000000   <   a   >  \r  \n       <   b   >   å   ä   ö   Å   Ä   Ö   é
           613c    0d3e    200a    623c    e53e    f6e4    c4c5    e9d6
0000020   É   ü   Ü   ë   Ë   <   /   b   >  \r  \n   <   /   a   >
           fcc9    ebdc    3ccb    622f    0d3e    3c0a    612f    003e
0000037

$ iconv -t UTF-8 -f ISO-8859-1 < test_ISO88591.xml > test_UTF8.xml

$ tidy -xml < test_ISO88591.xml 2>/dev/null
<a>
<b>
&#229;&#228;&#246;&#197;&#196;&#214;&#233;&#201;&#252;&#220;&#235;&#203;</b>
</a>

$ tidy -xml < test_UTF8.xml 2>/dev/null
<a>
<b>
&#195;&#165;&#195;&#164;&#195;&#182;&#195;&#8230;&#195;&#8222;&#195;&#8211;&#195;&#169;&#195;&#8240;&#195;&#188;&#195;&#339;&#195;&#171;&#195;&#8249;</b>
</a>



In this example, a number of characters get "weirded up" when tidy handles the UTF-8 file.
Or do I misunderstand how this should work?

For instance, the &#195;&#8249; entity pair is of course the same "characters" (by the look of them) as the correct pair &#195;&#171;, which it should have been converted to.
But even though both still seem to result in the character "ë", the variant of "«" given by "&#8249;" is of course not translatable as UTF-8; it must be &#171; for the UTF-8 bit calculations to come out right.

The same goes for &#8222; and &#8211; and &#8240; and &#339; and &#8230;

Is this by design or is this a bug? Is tidy incapable of handling UTF8 files and still keep translating to &#xxx; entities?
Question by:stefanlennerbrant
14 Comments
 
LVL 82

Expert Comment

by:Dave Baldwin
ID: 41796001
ISO88591 and UTF8 do not have the same character codes above 127.  They are different.  This site is the reference I've been using for years.  http://www.alanwood.net/
 

Author Comment

by:stefanlennerbrant
ID: 41796192
Well, as I see it, UTF8 is just a coding method for unicode, it's not a character set at all, in itself.

So, all the iso-8859-1 characters above 128 may be encoded by using two-byte UTF-8.
Of course many more characters may be encoded as well, but now I'm aiming at iso-8859-1 characters in the range 128-255 (see my example, for instance).

But of course the "correct" character values must then be used for the encoding.
The html entity encoding "&#195;&#171;" is correct, but the encoding "&#195;&#8249;" is wrong - even though the corresponding characters (glyphs) "look the same".

And thus, "tidy" seems to choose the wrong characters for some UTF8 character values?
 
LVL 82

Accepted Solution

by:
Dave Baldwin earned 250 total points
ID: 41796435
No.  A bunch of the iso-8859-1 character positions above 128 are empty in UTF8.  They must be translated to other 'code points'.  On this page http://www.alanwood.net/demos/symbol.html , if you search for &#171; you will see that it translates to &#8596; in Unicode.  I can't find &#8249; anywhere.
 
LVL 16

Assisted Solution

by:DansDadUK
DansDadUK earned 250 total points
ID: 41797728
>> ... UTF8 is just a coding method for unicode, it's not a character set at all, in itself ...

Agreed.


>> ... So, all the iso-8859-1 characters above 128 may be encoded by using two-byte UTF-8 ...

Agreed.
ISO-8859-1 is an exact subset of Unicode.
All of the characters represented by the (single-byte) code-points in the ISO-8859-1 character set are the same as those represented by the same (double-byte) values (each with a leading 0x00 byte) in the Unicode character set. Those at 0x80 and above translate to two-byte UTF-8 values (but these are not the same as the Unicode code-point two-byte values); those below 0x80 stay single bytes in UTF-8.
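
As a quick check of the subset claim, here is a minimal Python sketch (the helper name `utf8_length` is made up for this illustration). It decodes every byte value as ISO-8859-1, confirms that the resulting Unicode code point has the same number, and shows how many UTF-8 bytes a code point needs.

```python
# Every ISO-8859-1 byte value decodes to the Unicode code point
# with the same number (0x00 .. 0xFF).
for byte_value in range(256):
    char = bytes([byte_value]).decode("iso-8859-1")
    assert ord(char) == byte_value


def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for the given code point (hypothetical helper)."""
    return len(chr(code_point).encode("utf-8"))


print(utf8_length(0x41))                    # 'A', below 0x80: 1 byte
print(utf8_length(0xE5))                    # U+00E5 (a with ring): 2 bytes
print(chr(0xE5).encode("utf-8").hex())      # c3a5, i.e. decimal 195 and 165
```

The last line matches the first entity pair in the question's UTF-8 run (&#195;&#165;): the two UTF-8 bytes of "å" were emitted as two separate entities.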


>> ... the &#195;&#8249; entity is of course the same "characters" (by the look of them) as the correct entity &#195;&#171; which it should have been converted to ...

The characters represented by entities #171 and #8249 are not the same!

#171 = #xab = U+00AB = Left Pointing Double-Angle Quotation Mark

#8249 = #x2039 = U+2039 = Single Left Pointing Quotation Mark

The UTF-8 representations are:

U+00AB -> 0xc2ab
U+2039 -> 0xe280b9
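
These two encodings can be confirmed with a short Python sketch (the function name `utf8_hex` is an assumption for this example, not part of any tool discussed here):

```python
def utf8_hex(code_point: int) -> str:
    """Return the UTF-8 encoding of a code point as a hex string."""
    return chr(code_point).encode("utf-8").hex()


# Entity numbers are just the decimal form of the code point.
assert 171 == 0x00AB
assert 8249 == 0x2039

print(utf8_hex(0x00AB))  # c2ab
print(utf8_hex(0x2039))  # e280b9
```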


I'm not familiar with 'tidy', so I'm not sure exactly what it is supposed to do, so perhaps some of the above is not relevant?
 

Author Comment

by:stefanlennerbrant
ID: 41798156
Sorry, my bad - I looked at the next-to-last character when stating that "&#171;" was the value that tidy should have picked. I apologize for the confusion.

The output entity pair "&#195;&#8249;" should have been "&#195;&#139;" instead. Of course.

And, with this typo/error out of the way, I do realize that tidy obviously makes the "wrong decision" on ALL html entities that encode UTF-8 characters whose second byte is between 128 and 159.
That range, as stated, is non-printable. Well, maybe this is by design, and not an error.

So the (decimal) UTF two-byte code 195+171 is correctly made by tidy into "&#195;&#171;"
But the 195+139 UTF pair is "incorrectly" made into "&#195;&#8249;"

This, most probably, has to do with "tidy" not being aware that the input is UTF-8 in the first place.
There is an -utf8 parameter which retains UTF8 encoding, not using html entities at all. Maybe that is the way to go.

/Stefan
 
LVL 16

Expert Comment

by:DansDadUK
ID: 41798252
>> ... But the 195+139 UTF pair is "incorrectly" made into "&#195;&#8249;" ...

I do not think that this is incorrect; my reasoning is as follows:

In the ISO-8859-1 character set, code-point 139 (0x8B) defines one of the (non-graphic) C1 control-code characters (the <PLD> "Partial Line Down" control-code).
Because ISO-8859-1 is a strict subset of Unicode, it is the same non-graphic 'character' in that character set, with code-point U+008B.
The Windows Latin-1 (CP1252) character set (often used on Windows as the default single-byte encoding) uses (most of) the characters in the C1 control-code range (0x80 -> 0x9f) to define extra graphic characters (this, of course, means that CP1252 is not a strict subset of Unicode).
In CP1252, code-point 139 (0x8B) is mapped to U+2039, which is the "Single Left Pointing Quotation Mark" character.
... and hexadecimal 2039 is decimal 8249, hence the resultant entity value.

i.e. if (on your *n*x system?) you think that code-point 139 (0x8B) is a graphic character ("Single Left Pointing Quotation Mark"), then you are using or defaulting to a local single-byte character set (CP1252 or something similar), which is not a strict Unicode subset.
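
The reasoning above can be sketched in Python. This is only a model of the observed behaviour, not tidy's actual code: it assumes each UTF-8 byte is read as a separate CP1252 character and then written as a numeric entity. The function name `entities_via_cp1252` is hypothetical.

```python
def entities_via_cp1252(text: str) -> str:
    """Model: read UTF-8 bytes one at a time as CP1252, emit numeric entities."""
    utf8_bytes = text.encode("utf-8")
    # Note: Python's cp1252 codec rejects the five unassigned bytes
    # (0x81, 0x8D, 0x8F, 0x90, 0x9D); none occur in the test string.
    as_cp1252 = utf8_bytes.decode("cp1252")
    return "".join(f"&#{ord(ch)};" for ch in as_cp1252)


# Byte 0x8B is a C1 control code in ISO-8859-1 but a graphic character in CP1252.
print(ord(bytes([0x8B]).decode("iso-8859-1")))  # 139
print(ord(bytes([0x8B]).decode("cp1252")))      # 8249

print(entities_via_cp1252("Ë"))                 # &#195;&#8249;
print(entities_via_cp1252("åäöÅÄÖéÉüÜëË"))
```

Under this assumption the last line reproduces the entity run from the question's UTF-8 test exactly, including &#8230;, &#8222;, &#8211;, &#8240; and &#339;.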

I hope that the above makes sense!
 
LVL 82

Expert Comment

by:Dave Baldwin
ID: 41798506
>> ... Because ISO-8859-1 is a strict subset of Unicode ...

Not exactly.  This page http://www.alanwood.net/demos/wgl4.html#w0080 shows that Unicode has no characters from codepoint 128 to 159.  While all of the characters in ISO-8859-1 can be translated to some character in UTF8, those in the range from 128 to 159 won't have the same code point number.

The 'translation' for numeric 'HTML entities' will be different for ISO-8859-1 and UTF8 in a number of cases.  It is not 'incorrect'.
 

Author Comment

by:stefanlennerbrant
ID: 41798508
@DansDadUK, that was a perfect explanation! I do agree completely!
Thanks a lot for that extensive and detailed description.
 
LVL 16

Expert Comment

by:DansDadUK
ID: 41798605
@Dave Baldwin:

I think that we'll have to agree to disagree.

I'd rather go by more 'official' definitions than the 'alanwood.net/demos/symbol' page you refer to (which is describing use of the Symbol font, but warns "... Symbol font should not be used in Web pages. This page is not a demonstration of how to use Symbol font; it provides a warning of the problems that it causes, and shows how to use Unicode instead ...").

For example:

In Unicode, the first plane (known as the Basic Multilingual Plane) covers code-points in the range U+0000 -> U+FFFF, and its first row (zero) defines the range U+0000 -> U+00FF.
For the Unicode range U+0000 -> U+007F, see http://www.unicode.org/charts/PDF/U0000.pdf, which indicates that this range consists of the C0 control codes and the ISO 646 ASCII graphic characters.

For the Unicode range U+0080 -> U+00FF, see http://www.unicode.org/charts/PDF/U0080.pdf, which indicates that this range consists of the C1 control codes and the ISO 8859-1 graphic characters.

ISO-8859-1 is generally assumed to be the original DEC ISO 8859-1 (note only single-hyphen) set of graphic characters, with the addition of the C0 and C1 control-code characters, thus making it a full 256-entry character set (which hence matches the Unicode BMP row 0 definition exactly); I can't (just at this moment) find the official definition, but the following is taken from a Wikipedia article - see https://en.wikipedia.org/wiki/ISO/IEC_8859-1

Extract from Wikipedia article

... and the other page to which you refer ("alanwood.net/demos/wgl4.html#w0080"), stating that "...  Unicode has no characters from codepoint 128 to 159 ...", is misleading;  that code range does not include any glyphs (graphic characters) because they are all control-code characters, which do not have a 'shape'.
 
LVL 16

Expert Comment

by:DansDadUK
ID: 41798654
... and I've now found the Unicode.org mapping of ISO-8859-1 - see http://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT which includes both the C0 (0x00 -> 0x1f and 0x7f) and C1 (0x80 -> 0x9f) control-code ranges.
 
LVL 16

Expert Comment

by:DansDadUK
ID: 41799492
I've now had a chance to look up what 'tidy' is supposed to do (by looking up the 'man' page - I don't have a *n*x system to play with).

It appears that it may assume by default that the input is "Latin-1" - which means that it won't see multi-byte UTF-8 arrays as such, but will process bytes individually.

There appear to be options like:

-utf8 : use UTF-8 for both input and output
char-encoding : specifies the character encoding Tidy uses for both the input and output
input-encoding : specifies the character encoding Tidy uses for the input

e.g.:

'tidy' input-encoding description
So perhaps you should try using one of these options?
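
A hedged sketch of trying these options from Python follows. It assumes tidy is installed and accepts `--input-encoding` and `--output-encoding` with the values `utf8` and `ascii`; I have not verified the output on any particular tidy version. The helper names `tidy_command` and `run_tidy` are made up for this example.

```python
import shutil
import subprocess


def tidy_command(input_encoding: str = "utf8", output_encoding: str = "ascii") -> list:
    """Build a tidy command line that states the input encoding explicitly."""
    return [
        "tidy", "-xml",
        "--input-encoding", input_encoding,    # assumed option value
        "--output-encoding", output_encoding,  # assumed option value
    ]


def run_tidy(xml_bytes: bytes) -> bytes:
    """Pipe XML through tidy and return stdout; diagnostics are discarded."""
    result = subprocess.run(
        tidy_command(),
        input=xml_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    return result.stdout


if shutil.which("tidy"):
    sample = "<a><b>\u00cb</b></a>".encode("utf-8")
    print(run_tidy(sample).decode("ascii", "replace"))
else:
    print("tidy not found on PATH")
```

If the input encoding is honoured, the two bytes of "Ë" should be treated as one character rather than as two separate ones.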
 

Author Comment

by:stefanlennerbrant
ID: 41804484
I've received so many explanations and so much extra info on this question - thank you so very much, @DansDadUK and @DaveBaldwin !
 
LVL 82

Expert Comment

by:Dave Baldwin
ID: 41804504
You're welcome, glad to help.
 
LVL 16

Expert Comment

by:DansDadUK
ID: 41804574
You are welcome; I hope that the advice led to a resolution of your issue.