How to zap "curly quotes" or "smart quotes" from scraped page

Posted on 2004-08-01
Last Modified: 2012-06-21
I'm scraping content (news headlines) from another web site, which I then feed into a MySQL database. But the morons who create content for the site have "special quotes" or "curly quotes" generated by Microsoft products in their text. So when a headline has an apostrophe, it shows up as a question mark on my web pages. Instead of "can't", you see "can?t".

I ran the HTML generated by the offending page through a hex editor and found that the apostrophe it creates has a hex code of 92. A real apostrophe has a hex code of 27.

I've always been confused as hell by character sets and getting around these kinds of problems. Is there a function in PHP that will solve this problem? I tried the htmlspecialchars() function with no results. Any tips/help would be great.
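
For anyone who wants to reproduce the hex-editor check from within PHP, here is a minimal sketch. It assumes the scraped text is Windows-1252 (where byte 92 hex is the right curly quote); the sample string is made up for illustration and is not taken from the scraped site.

```php
<?php
// Hypothetical sample: "can't" with the curly apostrophe written as the
// escape \x92, so the encoding of this source file does not matter.
$scraped = "can\x92t";
$plain   = "can't";

// bin2hex() prints each byte as two hex digits, like a hex editor would.
echo bin2hex($scraped), "\n"; // 63616e9274  (92 = curly apostrophe)
echo bin2hex($plain), "\n";   // 63616e2774  (27 = ASCII apostrophe)
```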
Question by:nysus1
6 Comments
 
LVL 36

Accepted Solution

by:Zyloch (earned 2000 total points)
ID: 11689316
You can try htmlentities()

Regards,
${Zyloch}
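
A rough sketch of what this can look like, under a few assumptions: the scraped text is Windows-1252, PHP 7 or later is in use, and the encoding is passed explicitly rather than relying on the default (which has changed between PHP versions). The helper `zap_cp1252_quotes()` is a hypothetical name, not a PHP built-in; it shows an alternative that swaps the curly quotes for plain ASCII ones before the text is stored.

```php
<?php
// Assumption: the scraped headline is Windows-1252 encoded.
$headline = "can\x92t";

// Option 1: turn the curly apostrophe into an HTML entity.
// The third argument tells htmlentities() how to interpret the bytes.
$asEntity = htmlentities($headline, ENT_QUOTES, 'Windows-1252');
echo $asEntity, "\n"; // can&rsquo;t

// Option 2 (hypothetical helper): replace the four Windows-1252
// curly-quote bytes with their plain ASCII counterparts.
function zap_cp1252_quotes(string $text): string
{
    return strtr($text, [
        "\x91" => "'", // left single quote
        "\x92" => "'", // right single quote / apostrophe
        "\x93" => '"', // left double quote
        "\x94" => '"', // right double quote
    ]);
}

echo zap_cp1252_quotes($headline), "\n"; // can't
```

Option 2 is only safe on single-byte Windows-1252 input. In UTF-8 text the same byte values can occur inside multi-byte characters, so a byte-level replacement like this could corrupt them.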
 

Author Comment

by:nysus1
ID: 11689347
You da man! That did it. But why would htmlentities work and not htmlspecialchars? The PHP manual at http://us4.php.net/htmlentities says the two functions are identical 'except with htmlentities(), all characters which have HTML character entity equivalents are translated into these entities.' Not too clear to me.

 
LVL 36

Expert Comment

by:Zyloch
ID: 11689387
Yes. htmlspecialchars() only converts &, ", ', <, and >, whereas htmlentities() converts every character that has a named HTML entity. That includes many characters outside the basic ASCII range, such as accented letters and curly quotes.

Regards,
${Zyloch}
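
To make the difference concrete, here is a small sketch run on the same input through both functions. The sample string and the choice of ISO-8859-1 are assumptions for illustration only.

```php
<?php
// Hypothetical ISO-8859-1 input: "café & <b>", with é written as \xE9.
$text = "caf\xE9 & <b>";

// htmlspecialchars() escapes only the markup-significant characters,
// so the \xE9 byte passes through untouched.
$special = htmlspecialchars($text, ENT_QUOTES, 'ISO-8859-1');

// htmlentities() additionally converts \xE9 to its named entity.
$entities = htmlentities($text, ENT_QUOTES, 'ISO-8859-1');

echo $entities, "\n"; // caf&eacute; &amp; &lt;b&gt;
var_dump($special === "caf\xE9 &amp; &lt;b&gt;"); // bool(true)
```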

Author Comment

by:nysus1
ID: 11689466
OK, one more question then, if I might. Why does htmlspecialchars need a third argument? If all it is translating is the &, ", ', <, and > chars, why would it need to know which character set to use in the conversion? Wouldn't those basic characters have the same ASCII code across the different character sets?
 
LVL 36

Expert Comment

by:Zyloch
ID: 11689717
Not necessarily. Big5, for instance, is a multi-byte encoding used for Traditional Chinese, so the same byte values can mean different things depending on the charset. Most of the time, though, you'll only be using the default.

Regards,
${Zyloch}
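
One place where the charset argument visibly matters, sketched here for PHP 5.4 and later (behavior in the PHP versions current when this thread was written may differ): the function checks the input against the declared encoding. The sample string is hypothetical.

```php
<?php
// Hypothetical Windows-1252 headline containing the curly apostrophe.
$headline = "can\x92t";

// Declared as UTF-8, a lone 0x92 byte is an invalid sequence, so the
// function returns an empty string (no ENT_IGNORE/ENT_SUBSTITUTE flag set).
$asUtf8 = htmlspecialchars($headline, ENT_QUOTES, 'UTF-8');
var_dump($asUtf8 === ''); // bool(true)

// Declared as Windows-1252, every byte is a valid character and the
// string comes back unchanged (it holds none of the five special chars).
$asCp1252 = htmlspecialchars($headline, ENT_QUOTES, 'Windows-1252');
var_dump($asCp1252 === $headline); // bool(true)

// ENT_SUBSTITUTE swaps the invalid byte for U+FFFD instead of
// discarding the whole string.
$substituted = htmlspecialchars($headline, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
echo bin2hex($substituted), "\n"; // 63616eefbfbd74
```

So even though the five escaped characters are plain ASCII, a wrong charset argument can still empty out the result.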
 

Author Comment

by:nysus1
ID: 11690688
OK, thanks for your help and explanation.