kjenney
asked on
QueryPath UTF-8 Encoding
I have a page that I have specifically converted to UTF-8 to eliminate unwanted characters. I have verified the encoding and the page comes up fine locally in all browsers. When I parse the page with QueryPath (htmlqp) I am left with a phantom character:
U+00E2 â c3 a2 LATIN SMALL LETTER A WITH CIRCUMFLEX
in place of
U+0027 ' 27 APOSTROPHE
I've tried adding the options convert_from_encoding => utf-8 and strip_low_ascii but I'm still left with this character. Any ideas how to fix this?
I don't know this library, but from looking around, it seems you could try the option convert_to_encoding => utf-8, possibly combined with convert_from_encoding.
Although removing the stray character works, you may still get others when the page is modified.
This is clearly an encoding issue. Either there is a bug in the library you use (though I would hope it has been tested for this), or it is being misused. Either way, it should be fixable with encoding conversion.
I hope you won't have to modify this page soon if you stick with your fix.
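To illustrate why this looks like an encoding issue, here is a minimal sketch in Python (not PHP or QueryPath, purely to show the byte-level effect). It assumes, and this is only an assumption, that the apostrophe in the page is really the typographic U+2019 rather than the plain U+0027, and that somewhere in the pipeline the UTF-8 bytes are read as Windows-1252. Under those assumptions the first garbled character is exactly U+00E2.

```python
# Illustration only. Assumption: the page holds a typographic apostrophe
# (U+2019), not the ASCII apostrophe (U+0027) reported in the question.
original = "It\u2019s"

# UTF-8 encodes U+2019 as three bytes: e2 80 99.
raw = original.encode("utf-8")

# Assumption: some step misreads those bytes as Windows-1252.
# Byte e2 then becomes U+00E2 (a with circumflex), followed by two more
# stray characters for bytes 80 and 99.
garbled = raw.decode("cp1252")

# Reversing the wrong decode recovers the original text, which is what
# a correct encoding conversion would achieve.
repaired = garbled.encode("cp1252").decode("utf-8")
```

If the real cause is different (for example a different source encoding), the same round trip with the right pair of encodings would apply, but the exact bytes would differ.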
ASKER
No solution was given to filter out the character. Substitution worked for me.
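The thread does not say how the substitution was done. As a rough sketch of the idea, again in Python rather than PHP, a lookup table can map known garbled sequences back to a plain apostrophe. The table name, the function name, and the choice of sequences are all hypothetical; they assume the garbling comes from UTF-8 curly quotes read as Windows-1252.

```python
# Hypothetical table: garbled sequence -> intended replacement.
# Assumption: the garbling is UTF-8 curly quotes misread as Windows-1252.
MOJIBAKE_FIXES = {
    "\u00e2\u20ac\u2122": "'",  # bytes e2 80 99 (U+2019) misread
    "\u00e2\u20ac\u02dc": "'",  # bytes e2 80 98 (U+2018) misread
}


def substitute(text: str) -> str:
    """Replace each known garbled sequence with its intended character."""
    for bad, good in MOJIBAKE_FIXES.items():
        text = text.replace(bad, good)
    return text


cleaned = substitute("It\u00e2\u20ac\u2122s")
```

This only handles sequences listed in the table, so, as noted earlier in the thread, other characters can reappear when the page changes.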