Strange HTML entity encoding by tidy

Asked by Stefan Lennerbrant
Tags: Shell Scripting, XML, Linux

When feeding UTF-8 input to tidy, I get strange results for some characters.


$ cat test_ISO88591.xml
<a>
 <b>åäöÅÄÖéÉüÜëË</b>
</a>

$ od -cx test_ISO88591.xml
0000000   <   a   >  \r  \n       <   b   >   å   ä   ö   Å   Ä   Ö   é
           613c    0d3e    200a    623c    e53e    f6e4    c4c5    e9d6
0000020   É   ü   Ü   ë   Ë   <   /   b   >  \r  \n   <   /   a   >
           fcc9    ebdc    3ccb    622f    0d3e    3c0a    612f    003e
0000037

$ iconv -t UTF-8 -f ISO-8859-1 < test_ISO88591.xml > test_UTF8.xml

$ tidy -xml < test_ISO88591.xml 2>/dev/null
<a>
<b>
&#229;&#228;&#246;&#197;&#196;&#214;&#233;&#201;&#252;&#220;&#235;&#203;</b>
</a>

$ tidy -xml < test_UTF8.xml 2>/dev/null
<a>
<b>
&#195;&#165;&#195;&#164;&#195;&#182;&#195;&#8230;&#195;&#8222;&#195;&#8211;&#195;&#169;&#195;&#8240;&#195;&#188;&#195;&#339;&#195;&#171;&#195;&#8249;</b>
</a>

In this example, several characters get mangled when tidy handles the UTF-8 file.
Or do I misunderstand how this should work?

For instance, &#195;&#8249; looks, going by the glyphs, much like the pair &#195;&#171; ("‹" resembles "«").
But even though the two look alike when displayed, &#8249; cannot be part of a UTF-8 sequence: it is not a byte value at all, so the UTF-8 bit calculations cannot be applied to it. I would have expected a continuation-byte value (128-191) in that position, as in &#195;&#171;.
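
The bit calculation referred to above can be worked through for a two-byte sequence. This is a minimal Python sketch; the function name `decode_two_byte_utf8` is made up for illustration, and it only covers the two-byte form (110xxxxx 10yyyyyy) that the characters in this test file use.

```python
def decode_two_byte_utf8(lead: int, cont: int) -> int:
    """Combine a two-byte UTF-8 sequence (110xxxxx 10yyyyyy) into a code point."""
    # Lead bytes 0xC0 and 0xC1 are excluded because they would be overlong forms.
    if not (0xC2 <= lead <= 0xDF and 0x80 <= cont <= 0xBF):
        raise ValueError(f"not a two-byte UTF-8 sequence: {lead}, {cont}")
    # Five payload bits from the lead byte, six from the continuation byte.
    return ((lead & 0x1F) << 6) | (cont & 0x3F)


# 195, 171 (0xC3 0xAB) combine to U+00EB.
print(chr(decode_two_byte_utf8(195, 171)))  # prints ë

# U+00CB is encoded as 195, 139 (0xC3 0x8B).
print(list("Ë".encode("utf-8")))            # prints [195, 139]
print(chr(decode_two_byte_utf8(195, 139)))  # prints Ë

# 8249 is outside the byte range, so it cannot enter the calculation.
try:
    decode_two_byte_utf8(195, 8249)
except ValueError as exc:
    print("rejected:", exc)
```

So the second value of each pair has to stay in the 128-191 range for the pair to be decodable as UTF-8; for "Ë" that value is 139, and for "ë" it is 171.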

The same goes for &#8222;, &#8211;, &#8240;, &#339; and &#8230;.
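
One possible explanation, offered as an assumption rather than as a statement about tidy's internals: the output from the UTF-8 run is exactly what results if each byte of the file is read as a single Windows-1252 character and then written out as a numeric reference. The Python sketch below reproduces the line that way and maps the listed code points back to byte values; the helper name `numeric_refs` is hypothetical.

```python
def numeric_refs(text: str) -> str:
    """Write every character as a decimal numeric character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)


original = "åäöÅÄÖéÉüÜëË"

# Encode as UTF-8, then (wrongly) read the bytes back as Windows-1252.
misread = original.encode("utf-8").decode("cp1252")
print(numeric_refs(misread))

# Map the odd-looking code points back to the single byte they stand for
# in Windows-1252.
for code_point in (8230, 8222, 8211, 8240, 339, 8249):
    byte = chr(code_point).encode("cp1252")[0]
    print(f"&#{code_point}; -> byte {byte} (0x{byte:02X})")
```

The first print gives the same string as the tidy output for test_UTF8.xml, and the loop shows that the six large values correspond to bytes 0x85, 0x84, 0x96, 0x89, 0x9C and 0x8B, which are the second UTF-8 bytes of Å, Ä, Ö, É, Ü and Ë. If that is what is happening, the bytes themselves are intact and only the assumed input encoding is off, so declaring the input as UTF-8 (as far as I know tidy has a `-utf8` option and an `input-encoding` setting for this) would be the first thing to try.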

Is this by design, or is it a bug? Is tidy unable to handle UTF-8 files while still translating characters to &#xxx; entities?