In this example, there are a number of characters that get "weirded up" in tidy's handling of the UTF-8 file.
Or do I misunderstand how this should work?
For instance, the output "&#195;&#8249;" consists of course of (almost) the same "characters" (by the look of them, when rendered) as the correct "&#195;&#171;" that it should have been converted to.
But even though they both still look like roughly the same characters, the "‹" defined by the former entity (&#8249;) is of course not at all translatable back by the UTF-8 bit calculations - it must be "«" (&#171;) in order to get the UTF-8 bit calculations correct.
The same goes for "„" and "–" and "‰" and "œ" and "…".
Is this by design or is it a bug? Is tidy incapable of handling UTF-8 files while still translating to &#xxx; entities?
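What I think is happening can be sketched in a few lines of Python (my own illustration of per-byte entity escaping, not tidy's actual code):

```python
# A sketch of what an entity-encoder that assumes Latin-1 input does
# to a UTF-8 byte stream: each byte above 127 is treated as a
# separate character and escaped on its own.
text = "ë"                      # U+00EB, two bytes in UTF-8
utf8 = text.encode("utf-8")     # b'\xc3\xab' -> decimal 195, 171

# Per-byte escaping, as if each byte were a Latin-1 character:
entities = "".join(f"&#{b};" if b > 127 else chr(b) for b in utf8)
print(entities)                 # &#195;&#171;
```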
Last Comment
DansDadUK
8/22/2022 - Mon
Dave Baldwin
ISO-8859-1 and UTF-8 do not have the same character codes above 127. They are different. This site is the reference I've been using for years: http://www.alanwood.net/
Stefan Lennerbrant
ASKER
Well, as I see it, UTF-8 is just an encoding method for Unicode; it's not a character set at all in itself.
So all the ISO-8859-1 characters from 128 upwards may be encoded as two-byte UTF-8 sequences.
Of course many more characters may be encoded as well, but here I'm aiming at the ISO-8859-1 characters in the range 128-255 (see my example, for instance).
But of course the "correct" character values must then be used for the encoding.
The HTML entity encoding "&#195;&#171;" is correct, but the encoding "&#195;&#8249;" is wrong - even though the corresponding characters (glyphs) "look the same" when rendered.
And thus "tidy" seems to choose the wrong entity values for some UTF-8 byte values?
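To make the "bit calculations" concrete, here is the standard two-byte UTF-8 decoding rule sketched in Python (my own illustration, not anything tidy does):

```python
# UTF-8 two-byte decoding "by hand": 110xxxxx 10yyyyyy -> code point.
def decode2(b1, b2):
    assert b1 & 0xE0 == 0xC0 and b2 & 0xC0 == 0x80  # valid lead/continuation
    return ((b1 & 0x1F) << 6) | (b2 & 0x3F)

print(hex(decode2(195, 171)))   # 0xeb -> "ë", so the entity must be &#235;
print(hex(decode2(195, 139)))   # 0xcb -> "Ë", so the entity must be &#203;
```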
Sorry, my bad - I looked at the next-to-last character when stating that "«" (&#171;) was the value that tidy should have picked. I apologize for the confusion.
Instead, the output "&#8249;" entity should have been "&#139;". Of course.
And, with this typo out of the way, I do realize that tidy obviously makes the "wrong decision" on ALL HTML entities that encode UTF-8 bytes (the second byte of a pair) between 128 and 159.
Which range, as stated, is non-printable. Well, maybe this is by design, and not an error.
So the (decimal) UTF-8 two-byte pair 195+171 is correctly made by tidy into "&#195;&#171;" - which decodes back to the original bytes, i.e. to "ë".
But the 195+139 pair is "incorrectly" made into "&#195;&#8249;", which does not decode back to byte 139.
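To spell out what I mean by "decoding back", here is a quick Python sketch, under the assumption that the entities should round-trip to the original UTF-8 bytes:

```python
import html

def roundtrip(entity_text):
    """Decode numeric entities to code points, then try to read the
    resulting code points as Latin-1 bytes forming UTF-8."""
    chars = html.unescape(entity_text)
    try:
        return chars.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None

print(roundtrip("&#195;&#171;"))   # 'ë'  - bytes 195+171 survive
print(roundtrip("&#195;&#8249;"))  # None - 8249 is not a byte value
```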
This most probably has to do with "tidy" not being aware that the input is UTF-8 in the first place.
There is a -utf8 parameter which retains the UTF-8 encoding, not using HTML entities at all. Maybe that is the way to go.
>> ... But the 195+139 pair is "incorrectly" made into "&#195;&#8249;" ...
I do not think that this is incorrect; my reasoning is as follows:
In the ISO-8859-1 character set, code-point 139 (0x8B) defines one of the (non-graphic) C1 control-code characters (the <PLD> "Partial Line Down" control-code).
Because ISO-8859-1 is a strict subset of Unicode, it is the same non-graphic 'character' in Unicode, with code-point U+008B.
The Windows Latin-1 (CP1252) character set (often used on Windows as the default single-byte encoding) uses (most of) the code-points in the C1 control-code range (0x80 -> 0x9F) to define extra graphic characters (this, of course, means that CP1252 is not a strict subset of Unicode).
In CP1252, code-point 139 (0x8B) is mapped to U+2039, which is the "Single Left-Pointing Angle Quotation Mark" character.
... and hexadecimal 2039 is decimal 8249, hence the resultant entity value.
i.e. if (on your *n*x system?) you think that code-point 139 (0x8B) is a graphic character ("Single Left-Pointing Angle Quotation Mark"), then you are using, or defaulting to, a local single-byte character set (CP1252 or something similar) which is not a strict Unicode subset.
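If it helps, both mappings can be checked directly in Python (my illustration; tidy itself need not work this way):

```python
import unicodedata

# Byte 0x8B (139) under the two interpretations discussed above:
cp1252 = b"\x8b".decode("cp1252")
print(ord(cp1252), unicodedata.name(cp1252))
# 8249 SINGLE LEFT-POINTING ANGLE QUOTATION MARK

latin1 = b"\x8b".decode("latin-1")
print(hex(ord(latin1)), unicodedata.category(latin1))
# 0x8b Cc  (a C1 control character, no glyph)
```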
I hope that the above makes sense!
Dave Baldwin
Because ISO-8859-1 is a strict subset of Unicode
Not exactly. This page http://www.alanwood.net/demos/wgl4.html#w0080 shows that Unicode has no characters from code-point 128 to 159. While all of the characters in ISO-8859-1 can be translated to some character in Unicode, for the characters in the range 128 to 159 it won't be the same code-point number.
The 'translation' for numeric 'HTML entities' will be different for ISO-8859-1 and UTF-8 in a number of cases. It is not 'incorrect'.
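That difference can be verified mechanically - for example, in Python:

```python
# Compare the two decodings byte by byte: Latin-1 and CP1252 agree
# everywhere except in the 0x80-0x9F range.
diff = [b for b in range(128, 256)
        if bytes([b]).decode("latin-1") != bytes([b]).decode("cp1252", "replace")]
print(diff)  # only values between 128 and 159 appear
```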
Stefan Lennerbrant
ASKER
@DansDadUK, that was a perfect explanation! I do agree completely!
Thanks a lot for that extensive and detailed description.
I'd rather go by more 'official' definitions than the 'alanwood.net/demos/symbol' page you refer to (which is describing use of the Symbol font, but warns "... Symbol font should not be used in Web pages. This page is not a demonstration of how to use Symbol font; it provides a warning of the problems that it causes, and shows how to use Unicode instead ...").
For example:
In Unicode, the first plane (known as the Basic Multilingual Plane) covers code-points in the range U+0000 -> U+FFFF; its first row (row zero) defines code-points in the range U+0000 -> U+00FF.
For the Unicode range U+0000 -> U+007F, see http://www.unicode.org/charts/PDF/U0000.pdf, which indicates that this range consists of the C0 control codes and the ISO 646 ASCII graphic characters.
For the Unicode range U+0080 -> U+00FF, see http://www.unicode.org/charts/PDF/U0080.pdf, which indicates that this range consists of the C1 control codes and the ISO 8859-1 graphic characters.
ISO-8859-1 is generally assumed to be the original DEC ISO 8859-1 (note: only a single hyphen) set of graphic characters, with the addition of the C0 and C1 control-code characters, thus making it a full 256-entry character set (which hence matches the Unicode BMP row 0 definition exactly). I can't (just at this moment) find the official definition, but see the Wikipedia article at https://en.wikipedia.org/wiki/ISO/IEC_8859-1
... and the other page ("alanwood.net/demos/wgl4.html#w0080", to which you refer when stating that "... Unicode has no characters from codepoint 128 to 159 ...") is misleading; that code range does not include any glyphs (graphic characters) because they are all control-code characters, which do not have a 'shape'.
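For what it's worth, this can be confirmed from the Unicode character database (e.g. via Python's unicodedata):

```python
import unicodedata

# All code points in U+0080..U+009F are control characters ("Cc"),
# which is why no glyphs are shown for 128-159: they have no shape.
assert all(unicodedata.category(chr(cp)) == "Cc" for cp in range(0x80, 0xA0))

# ... whereas U+00A0..U+00FF are the ISO-8859-1 graphic characters:
print(unicodedata.name(chr(0xEB)))  # LATIN SMALL LETTER E WITH DIAERESIS
```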
I've now had a chance to look up what 'tidy' is supposed to do (by looking at the 'man' page - I don't have a *n*x system to play with).
It appears that it may assume by default that the input is "Latin-1" - which means that it won't see multi-byte UTF-8 sequences as such, but will process the bytes individually.
There appear to be options like:
-utf8 : use UTF-8 for both input and output
char-encoding : specifies the character encoding Tidy uses for both the input and output
input-encoding : specifies the character encoding Tidy uses for the input
So perhaps you should try using one of these options?
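For example, something like the following config fragment might do it (untested on my side; `input-encoding` is listed in the man page, and `output-encoding: ascii` is my assumption for forcing tidy to emit numeric entities for everything non-ASCII):

```
# tidy config file - pass it with: tidy -config <file> input.html
input-encoding: utf8
output-encoding: ascii
```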