  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1171

QueryPath UTF-8 Encoding

I have a page that I have specifically converted to UTF-8 to eliminate unwanted characters. I have verified the encoding and the page comes up fine locally in all browsers. When I parse the page with QueryPath (htmlqp) I am left with a phantom character:

U+00E2      â      c3 a2      LATIN SMALL LETTER A WITH CIRCUMFLEX

in place of

U+0027      '      27      APOSTROPHE

I've tried adding the options convert_from_encoding => utf-8 and strip_low_ascii, but I'm still left with this character. Any ideas on how to fix this?
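One plausible explanation for the symptom, sketched here in Python purely as a language-neutral illustration (QueryPath itself is PHP): if the page actually contains a curly apostrophe (U+2019) rather than U+0027, its three UTF-8 bytes can be misread as three single-byte characters, and stripping the non-printable ones leaves only the "â". That the source contains U+2019 is an assumption; the thread does not confirm it.

```python
# Assumption: the page holds U+2019 (curly apostrophe), not U+0027.
original = "it\u2019s"
utf8_bytes = original.encode("utf-8")   # three bytes for the apostrophe: e2 80 99

# Misreading the bytes as Latin-1 maps every byte to its own character.
misread = utf8_bytes.decode("latin-1")

# Dropping the two C1 control characters (0x80-0x9F) leaves a stray U+00E2.
visible = "".join(ch for ch in misread if not 0x80 <= ord(ch) <= 0x9F)

print(utf8_bytes.hex(" "))
print(visible)
# Re-encoded as UTF-8, the stray character is the c3 a2 seen in the question.
print("\u00e2".encode("utf-8").hex(" "))
```

If this is what happens, the character is not coming from the page's own encoding but from a step that decodes the UTF-8 bytes with a single-byte charset.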
Asked by: kjenney
1 Solution
 
djon2003 commented:
I don't know this library, but from looking around, it seems you could try the option convert_to_encoding => utf-8, perhaps combined with convert_from_encoding.
kjenney (author) commented:
I ended up just substituting the unwanted character. None of the options removed it.
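A minimal sketch of the substitution approach, again in Python only for illustration. It assumes the offending input is typographic punctuation and replaces it with ASCII equivalents before the text reaches the parser; the helper name normalize_quotes and the choice of characters in the mapping are hypothetical, not what was actually used here.

```python
# Hypothetical mapping: typographic quotes to their ASCII counterparts.
TYPOGRAPHIC_TO_ASCII = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (curly apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def normalize_quotes(text: str) -> str:
    """Replace curly quotes with plain ASCII quotes."""
    return text.translate(TYPOGRAPHIC_TO_ASCII)

print(normalize_quotes("it\u2019s a \u201ctest\u201d"))
```

Substituting before parsing avoids the stray character entirely, but it only covers the characters listed in the mapping.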
djon2003 commented:
Although removing the unwanted character works, you may still get other stray characters when the page is modified.

This is clearly an encoding issue. Maybe there is a bug in the library you are using (though I would hope it has been tested), or maybe it is being misused. Either way, this should be fixable with encoding conversion.

I hope you won't have to modify this page soon if you stay with your fix.
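As a rough illustration of what an encoding-conversion fix looks like at the byte level, the Python sketch below reverses a UTF-8 text that was wrongly decoded as Windows-1252. The helper repair_mojibake is hypothetical, and the sketch assumes that specific mix-up; it says nothing about how QueryPath's own options behave.

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-Windows-1252 misdecode, if the text fits that pattern."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of damage, or bytes were already lost: return unchanged.
        return text

# All three misread characters are present, so the apostrophe can be recovered.
print(repair_mojibake("it\u00e2\u20ac\u2122s"))

# Only the stray U+00E2 survives, so the original cannot be reconstructed.
print(repair_mojibake("it\u00e2s"))
```

The second call shows why conversion may not help after the fact: once the two trailing bytes have been stripped, only a substitution can clean up what remains.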
kjenney (author) commented:
No solution was given that filters out the character. Substitution worked for me.