When I feed UTF-8 characters to "tidy", I get quite weird results for some of them.
$ cat test_ISO88591.xml
$ od -cx test_ISO88591.xml
0000000 < a > \r \n < b > å ä ö Å Ä Ö é
613c 0d3e 200a 623c e53e f6e4 c4c5 e9d6
0000020 É ü Ü ë Ë < / b > \r \n < / a >
fcc9 ebdc 3ccb 622f 0d3e 3c0a 612f 003e
$ iconv -t UTF-8 -f ISO-8859-1 < test_ISO88591.xml > test_UTF8.xml
$ tidy -xml < test_ISO88591.xml 2>/dev/null
$ tidy -xml < test_UTF8.xml 2>/dev/null
In this example, a number of characters get "weirded up" when tidy handles the UTF-8 file.
Or do I misunderstand how this should work?
For instance, the Ã‹ entity looks like the same "characters" as the correct entity Ã«, which is what it should have been converted to.
But even though both still display as the "characters" Ã«, the version of "«" produced by the former, "‹", cannot be translated as UTF-8 at all: it must be « for the UTF-8 bit calculations to come out right.
The same goes for „ and – and ‰ and œ and …
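To illustrate what I think is happening, here is a minimal Python sketch. It assumes (this is only a guess, nothing in the output confirms it) that the UTF-8 bytes are being read as Windows-1252. The helper names misread_as_cp1252, decode_two_byte_utf8 and as_numeric_entities are made up for this illustration and have nothing to do with tidy itself.

```python
# Sketch only: models UTF-8 input being misread as Windows-1252.
# Whether tidy really does this is an ASSUMPTION, not a confirmed fact.


def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def decode_two_byte_utf8(lead: int, cont: int) -> str:
    """Combine a 2-byte UTF-8 sequence (110xxxxx 10xxxxxx) into one character."""
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))


def as_numeric_entities(text: str) -> str:
    """Write every character as a decimal numeric character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)


if __name__ == "__main__":
    # The uppercase letters from the test file, plus lowercase e-diaeresis.
    for ch in "ÅÄÖÉÜËë":
        raw = ch.encode("utf-8")
        hex_bytes = " ".join(f"{b:02x}" for b in raw)
        # Output is ASCII only: code point, UTF-8 bytes, misread result.
        print(f"U+{ord(ch):04X}  {hex_bytes}  "
              f"{as_numeric_entities(misread_as_cp1252(ch))}")
```

Under that assumption, Ë (bytes c3 8b) comes out as Ã followed by ‹, while ë (bytes c3 ab) comes out as Ã followed by «. The odd characters listed above are exactly what Windows-1252 assigns to the second UTF-8 byte of Ä, Ö, É, Ü and Å.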
Is this by design, or is it a bug? Is tidy unable to handle UTF-8 files while still translating characters to &#xxx; entities?