Strange HTML entity encoding by tidy

Last Modified: 2016-09-19
When feeding UTF-8 characters to "tidy", I get quite weird results for some characters.


$ cat test_ISO88591.xml
<a>
 <b>åäöÅÄÖéÉüÜëË</b>
</a>

$ od -cx test_ISO88591.xml
0000000   <   a   >  \r  \n       <   b   >   å   ä   ö   Å   Ä   Ö   é
           613c    0d3e    200a    623c    e53e    f6e4    c4c5    e9d6
0000020   É   ü   Ü   ë   Ë   <   /   b   >  \r  \n   <   /   a   >
           fcc9    ebdc    3ccb    622f    0d3e    3c0a    612f    003e
0000037

$ iconv -t UTF-8 -f ISO-8859-1 < test_ISO88591.xml > test_UTF8.xml

$ tidy -xml < test_ISO88591.xml 2>/dev/null
<a>
<b>
&#229;&#228;&#246;&#197;&#196;&#214;&#233;&#201;&#252;&#220;&#235;&#203;</b>
</a>

$ tidy -xml < test_UTF8.xml 2>/dev/null
<a>
<b>
&#195;&#165;&#195;&#164;&#195;&#182;&#195;&#8230;&#195;&#8222;&#195;&#8211;&#195;&#169;&#195;&#8240;&#195;&#188;&#195;&#339;&#195;&#171;&#195;&#8249;</b>
</a>


In this example, several characters get "weirded up" when tidy handles the UTF-8 file.
Or do I misunderstand how this should work?

For instance, the pair &#195;&#8249; renders as what looks like the same "characters" as the pair &#195;&#171;, which is what I expected it to be converted to.
But even though both pairs look alike on screen, the angle quote that &#8249; refers to is not a single-byte value, so it cannot take part in a UTF-8 byte sequence. It would have to be &#171; for the UTF-8 bit calculations to come out right.

The same goes for &#8222;, &#8211;, &#8240;, &#339; and &#8230;.
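
The pairs discussed above can be checked mechanically. The sketch below is only an illustration, under the assumption that tidy read each UTF-8 byte as one Windows-1252 character before writing it out as a numeric reference; `recover` is a hypothetical helper, not part of tidy. Under that assumption, &#195;&#8249; maps back to the bytes C3 8B, which is Ë rather than ë, and every odd-looking entity in the list turns out to be the Windows-1252 reading of a byte in the 0x80-0x9F range.

```python
import re

# Entity string copied from the `tidy -xml < test_UTF8.xml` output in the question.
tidy_output = (
    "&#195;&#165;&#195;&#164;&#195;&#182;&#195;&#8230;&#195;&#8222;"
    "&#195;&#8211;&#195;&#169;&#195;&#8240;&#195;&#188;&#195;&#339;"
    "&#195;&#171;&#195;&#8249;"
)

def recover(entities: str) -> str:
    """Hypothetical helper: undo a 'UTF-8 bytes read as Windows-1252' mix-up."""
    # Turn each numeric reference back into the character it names.
    chars = "".join(chr(int(n)) for n in re.findall(r"&#(\d+);", entities))
    # Assumption: every such character stands for exactly one Windows-1252 byte.
    raw = chars.encode("cp1252")
    # Those bytes are then interpreted as the UTF-8 sequence they originally were.
    return raw.decode("utf-8")

whole_line = recover(tidy_output)       # the twelve letters from test_UTF8.xml
last_pair = recover("&#195;&#8249;")    # bytes C3 8B
other_pair = recover("&#195;&#171;")    # bytes C3 AB
```

Here `whole_line` comes back as the original åäöÅÄÖéÉüÜëË, `last_pair` as Ë and `other_pair` as ë, so the two pairs encode different letters even though they look similar when displayed.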

Is this by design or is this a bug? Is tidy unable to handle UTF-8 files while still translating characters to &#xxx; entities?
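
One plausible reading of the two tidy runs, offered as a sketch rather than a confirmed diagnosis: tidy treated both files as a single-byte encoding, and bytes in the 0x80-0x9F range were mapped through Windows-1252. The Python below does not call tidy at all; it only reproduces both entity strings from the question under that assumption. `numeric_refs` is a hypothetical helper introduced for the illustration.

```python
text = "åäöÅÄÖéÉüÜëË"

def numeric_refs(s: str) -> str:
    """Hypothetical helper: write every non-ASCII character as &#N;."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in s)

# Bytes as in test_ISO88591.xml, read as a single-byte encoding.
latin1_bytes = text.encode("iso-8859-1")
as_single_byte = numeric_refs(latin1_bytes.decode("cp1252"))

# Bytes as in test_UTF8.xml, but still read as a single-byte encoding.
utf8_bytes = text.encode("utf-8")
misread = numeric_refs(utf8_bytes.decode("cp1252"))

# The same UTF-8 bytes decoded as UTF-8, which is the result the question expects.
expected = numeric_refs(utf8_bytes.decode("utf-8"))
```

Here `as_single_byte` matches the first tidy output, `misread` matches the second one character for character, and `expected` equals `as_single_byte`. If that is indeed the cause, telling tidy the input encoding should help: tidy has a `-utf8` switch and `input-encoding` / `output-encoding` configuration options, so something like `tidy -xml --input-encoding utf8 --output-encoding ascii < test_UTF8.xml` may be worth trying. How this behaves in the tidy version used here is not verified.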