kjenney


QueryPath UTF-8 Encoding

I have a page that I have specifically converted to UTF-8 to eliminate unwanted characters. I have verified the encoding and the page comes up fine locally in all browsers. When I parse the page with QueryPath (htmlqp) I am left with a phantom character:

U+00E2      â      c3 a2      LATIN SMALL LETTER A WITH CIRCUMFLEX

in place of

U+0027      '      27      APOSTROPHE

I've tried adding the options convert_from_encoding => utf-8 and strip_low_ascii but I'm still left with this character. Any ideas how to fix this?
djon2003

I don't know this library, but from looking around, you could try the option convert_to_encoding => utf-8, possibly combined with convert_from_encoding.
ASKER CERTIFIED SOLUTION
kjenney

Although removing the unwanted character works, you may still run into others when the page is modified.

This is clearly an encoding issue. Maybe there is a bug in the library you use (though I hope they have already tested for this), or maybe it is being misused. Either way, this should be fixable with encoding conversion.

Hoping you won't have to modify this page soon if you continue with your fix.
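
To illustrate the kind of encoding mismatch described above, here is a minimal Python sketch (not QueryPath or PHP code). It rests on an assumption the thread does not confirm: that the apostrophe in the page is really the typographic quote U+2019 rather than U+0027. If so, its UTF-8 bytes (e2 80 99) read as a single-byte encoding such as Windows-1252 begin with â, which matches the reported U+00E2.

```python
# Assumption: the page contains a typographic apostrophe (U+2019),
# not the plain ASCII apostrophe (U+0027).
original = "it\u2019s"

# The page is stored as UTF-8: U+2019 becomes the bytes e2 80 99.
utf8_bytes = original.encode("utf-8")

# A parser that wrongly treats those bytes as Windows-1252 produces
# three characters instead of one, the first being U+00E2 (â).
misread = utf8_bytes.decode("cp1252")

# Reversing the wrong step recovers the original text.
repaired = misread.encode("cp1252").decode("utf-8")

print(repr(misread))
print(repr(repaired))
```

If this is what is happening, the fix belongs at the point where the bytes are decoded, which is what the encoding-conversion options are meant to control.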
Avatar of kjenney
kjenney

ASKER

No solution given to filter out the character. Substitution worked for me.
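
The substitution itself is not shown in the thread, so the following is only a hedged Python sketch of what such a cleanup might look like, not the asker's code. The function name `normalize_apostrophes` is hypothetical, and the sketch assumes the stray â comes from a UTF-8 typographic apostrophe (U+2019) that was decoded as Windows-1252 or ISO-8859-1.

```python
import re

# Mis-decoded forms of U+2019 (UTF-8 bytes e2 80 99):
#   Windows-1252 -> "â€™"
#   ISO-8859-1   -> "â" followed by two invisible C1 control characters
_MOJIBAKE_APOSTROPHE = re.compile("\u00e2(?:\u20ac\u2122|\u0080\u0099)")


def normalize_apostrophes(text: str) -> str:
    """Replace mis-decoded or typographic apostrophes with a plain U+0027."""
    text = _MOJIBAKE_APOSTROPHE.sub("'", text)
    return text.replace("\u2019", "'")


print(normalize_apostrophes("it\u00e2\u20ac\u2122s"))
```

Matching the whole sequence, rather than every â, leaves legitimate uses of the letter (as in "château") untouched.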