Solved

How to zap "curly quotes" or "smart quotes" from scraped page

Posted on 2004-08-01
Last Modified: 2012-06-21
I'm scraping content (news headlines) from another web site which I then feed into a MySQL database.  But the morons who create content for the site have "special quotes" or "curly quotes" generated by Microsoft products in their text.  So when a headline has an apostrophe, it shows up as a question mark on my web pages.  So instead of "can't", you see "can?t".

I ran the HTML generated by the offending page through a hex editor and found the apostrophe it creates has a hex code of 92.  A real apostrophe has a hex code of 27.

I've always been confused as hell by character sets and getting around these kinds of problems.  Is there a function in PHP that will solve this problem?  I tried the htmlspecialchars() function with no results.  Any tips/help would be great.
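
For reference, the hex values above suggest a byte-level fix, sketched here under two assumptions: that the scraped page is in a single-byte Windows-1252 encoding (where 0x91 to 0x94 are the curly quotes), and that plain ASCII quotes are an acceptable substitute. The helper name zap_smart_quotes is hypothetical, not a PHP built-in. It should not be run on UTF-8 text, where those byte values can occur inside other characters.

```php
<?php
// Hypothetical helper (not part of PHP): replaces Windows-1252 curly-quote
// bytes with their plain ASCII equivalents.
// Assumption: $text is single-byte Windows-1252, not UTF-8.
function zap_smart_quotes(string $text): string
{
    return strtr($text, [
        "\x91" => "'",  // left single quote
        "\x92" => "'",  // right single quote / apostrophe (the byte found in the hex editor)
        "\x93" => '"',  // left double quote
        "\x94" => '"',  // right double quote
    ]);
}

$headline = "can\x92t";
echo bin2hex($headline), "\n";          // 63616e9274
echo zap_smart_quotes($headline), "\n"; // can't
```

Running this before the INSERT would store only plain ASCII quotes in MySQL, at the cost of losing the original typography.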
Question by:nysus1
 
LVL 36

Accepted Solution

by:
Zyloch earned 500 total points
ID: 11689316
You can try htmlentities()

Regards,
${Zyloch}
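
A minimal sketch of that suggestion, assuming the scraped bytes are Windows-1252. The charset is passed explicitly here because the default charset of htmlentities() has changed between PHP versions, so relying on it may not give the same result everywhere.

```php
<?php
// Assumption: the headline bytes are Windows-1252, where 0x92 is the
// right single curly quote.
$headline = "can\x92t";

// ENT_QUOTES also converts straight single and double quotes.
$safe = htmlentities($headline, ENT_QUOTES, 'Windows-1252');

echo $safe, "\n"; // can&rsquo;t
```

The browser then renders the entity as a curly apostrophe regardless of the encoding the page is served in.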
 

Author Comment

by:nysus1
ID: 11689347
You da man!  That did it.  But why would htmlentities work and not htmlspecialchars?  The PHP manual at http://us4.php.net/htmlentities says the two functions are identical 'except with htmlentities(), all characters which have HTML character entity equivalents are translated into these entities.'  Not too clear to me.

 
LVL 36

Expert Comment

by:Zyloch
ID: 11689387
Yes, it means htmlspecialchars will only change the few characters that have special meaning in HTML markup (&, <, >, " and, with ENT_QUOTES, '), but htmlentities changes every character that has a named HTML entity, which includes the curly quotes.

Regards,
${Zyloch}
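
The difference can be seen side by side in a short sketch. The sample string and the Windows-1252 charset are assumptions chosen for illustration.

```php
<?php
// Sample text with markup characters plus Windows-1252 curly quotes
// (0x93, 0x92, 0x94). The string itself is made up for this example.
$text = "<b>\x93can\x92t\x94</b> & more";

// Only &, <, > and quotes are escaped; the curly-quote bytes pass through untouched.
$special = htmlspecialchars($text, ENT_QUOTES, 'Windows-1252');

// Everything with a named entity is escaped, including the curly quotes.
$entities = htmlentities($text, ENT_QUOTES, 'Windows-1252');

echo bin2hex($special), "\n"; // hex dump, so the surviving 0x92 byte is visible
echo $entities, "\n";         // &lt;b&gt;&ldquo;can&rsquo;t&rdquo;&lt;/b&gt; &amp; more
```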

 

Author Comment

by:nysus1
ID: 11689466
OK, one more question then, if I might.  Why does htmlspecialchars need a third argument, then?  If all it is translating is the &, ", ', <, and > chars, why would it need to know which character set to use in the conversion?  Wouldn't those basic characters have the same ASCII code across the different character sets?
 
LVL 36

Expert Comment

by:Zyloch
ID: 11689717
Not necessarily. For instance, Big5 is a multi-byte encoding used mainly for Traditional Chinese, where a single character can span more than one byte, so the function needs to know the charset to read the string correctly. Most of the time, though, you'll only be using the default.

Regards,
${Zyloch}
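
One way to see the third argument matter, sketched under the assumption of PHP 5.4 or later: the same bytes are accepted when declared as Windows-1252, but rejected when declared as UTF-8, because a lone 0x92 byte is not a valid UTF-8 sequence.

```php
<?php
// Assumption: PHP 5.4+, where invalid input for the declared charset makes
// htmlspecialchars() return an empty string unless ENT_IGNORE or
// ENT_SUBSTITUTE is set.
$headline = "can\x92t";

$asCp1252 = htmlspecialchars($headline, ENT_QUOTES, 'Windows-1252');
$asUtf8   = htmlspecialchars($headline, ENT_QUOTES, 'UTF-8');

var_dump($asCp1252 === $headline); // bool(true): nothing to escape, bytes kept as-is
var_dump($asUtf8 === '');          // bool(true): 0x92 alone is invalid UTF-8
```

So even though the five special characters share the same codes, the function still has to parse the rest of the string in the declared encoding.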
 

Author Comment

by:nysus1
ID: 11690688
OK, thanks for your help and explanation.