I am trying to pull news from sites such as CNN, Yahoo, or MSN: scrape a page and then parse the results.

Hoping to hear from Batalf on this one.
Worth 500 Points.
I have seen many ways to scrape a page. All are great.
An example would be to pull an article paragraph out of a page
and redisplay it.

I would like it to take a variable, call it word count:
set it to 300 characters, then add ... at the end.
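
A minimal sketch of the truncation step, assuming PHP 7 or later and plain text input. The function name truncate_text and its $limit parameter are hypothetical, not from any library. It counts bytes with strlen(), so multibyte text would need the mb_* functions instead.

```php
<?php
// Hypothetical helper: shorten $text to at most $limit characters
// (not counting the ellipsis) and append "..." when something was cut.
function truncate_text(string $text, int $limit = 300): string
{
    // Collapse runs of whitespace so the result reads as one paragraph.
    $text = trim(preg_replace('/\s+/', ' ', $text));

    if (strlen($text) <= $limit) {
        return $text; // Short enough already, no ellipsis needed.
    }

    $cut = substr($text, 0, $limit);

    // Back up to the last space so the final word is not cut in half.
    $lastSpace = strrpos($cut, ' ');
    if ($lastSpace !== false) {
        $cut = substr($cut, 0, $lastSpace);
    }

    // Drop trailing punctuation before adding the ellipsis.
    return rtrim($cut, ' ,;:.') . '...';
}
```

Calling truncate_text($paragraph, 300) on the scraped paragraph would give the 300-character version; the word-boundary step is a design choice and can be removed if a hard cut is preferred.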

I need it to be formatted like a paragraph.
Also, I would like to give it a keyword, and if the keyword appears, bold that keyword in the paragraph:
$keyword = "money";

<p>This is the sentence about <strong>money</strong>. It is a short sentence...</p>
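
One way the keyword step could look, as a sketch only. It assumes PHP 7 or later and that the text has already been reduced to plain text (no tags), since a match inside an HTML attribute would break the markup. The function name highlight_keyword is hypothetical.

```php
<?php
// Hypothetical helper: wrap every whole-word, case-insensitive match
// of $keyword in <strong> tags.
function highlight_keyword(string $text, string $keyword): string
{
    if ($keyword === '') {
        return $text; // Nothing to highlight.
    }

    // preg_quote() escapes regex metacharacters in the keyword.
    $pattern = '/\b(' . preg_quote($keyword, '/') . ')\b/i';

    // $1 re-inserts the matched text, preserving its original case.
    return preg_replace($pattern, '<strong>$1</strong>', $text);
}

$keyword = "money";
$text = "This is the sentence about money. It is a short sentence...";

// Escape the plain text first, then add the <strong> tags.
echo '<p>' . highlight_keyword(htmlspecialchars($text), $keyword) . '</p>';
```

The \b word boundaries keep "money" from matching inside a longer word such as "moneyed"; drop them if partial matches should be bolded too.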
-Jason
Asked by: jbrashear72
 
keteracel commented:
You'll have to write a 'scraper' for each site. It's not a trivial problem. Have a look at my answer to a previous question for scraping of DMOZ:

http://www.experts-exchange.com/Web/Web_Languages/PHP/Q_21226800.html

Also note that RSS feeds are given to you for free, whereas scraping may well be illegal (against copyright law). Use the scripts only if you are prepared for the potential consequences!
 
mensuck commented:
Hi

All of these sites have XML feeds, so using a feed would be better than scraping the page. With XML you know exactly what content you are working with; with scraping you have to know exactly which URL contains the data to scrape. Give me three example page URLs from each of your sites (with keywords) and I will show how to make a generic function that processes all of them the same way!
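
To illustrate the feed approach, here is a rough sketch using PHP's SimpleXML extension. It assumes an RSS 2.0 layout (channel/item with title, link, description); the function name feed_items and the sample feed are made up for the example, and a real feed would first be fetched from the site's feed URL.

```php
<?php
// Hypothetical helper: turn an RSS 2.0 document into an array of items.
function feed_items(string $xml): array
{
    // Collect XML parse errors internally instead of emitting warnings.
    libxml_use_internal_errors(true);
    $feed = simplexml_load_string($xml);
    if ($feed === false || !isset($feed->channel->item)) {
        return []; // Not parseable, or not an RSS 2.0 feed.
    }

    $items = [];
    foreach ($feed->channel->item as $item) {
        $items[] = [
            'title'       => (string) $item->title,
            'link'        => (string) $item->link,
            // Descriptions often carry HTML; keep only the text.
            'description' => trim(strip_tags((string) $item->description)),
        ];
    }
    return $items;
}

// Made-up sample feed, used only to show the call.
$sample = <<<XML
<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example</title>
<item><title>Markets</title><link>http://example.com/a</link>
<description>&lt;b&gt;Stocks&lt;/b&gt; rose today.</description></item>
</channel></rss>
XML;

foreach (feed_items($sample) as $entry) {
    echo $entry['title'] . ': ' . $entry['description'] . "\n";
}
```

Because every RSS 2.0 feed shares this structure, the same function would cover each site that publishes one, which is the advantage over a per-site scraper.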

Suzanne
 
jbrashear72 (author) commented:
Well, I also want to scrape from DMOZ and other sites. Those were just examples.
Not all sites offer RSS.
I don't want to rely on RSS/XML feeds.
 
bozoka45 commented:
The problem with what you're trying to do is that it might not work the same way each time for a particular site unless the HTML is consistent (i.e., they use a template). Like mensuck said, a few examples will be needed in order to help you out.
 
jbrashear72 (author) commented:
I want to pull in a complete page and strip out all tags except, say, what is in <p>,
<h1>, <h2>,
or <strong>.
Strip out this content. What I am trying to accomplish is not illegal. This is a test.
Thanks for the concern.
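
A possible starting point for that, sketched with PHP's built-in strip_tags(), which accepts a list of tags to keep. The function name keep_basic_tags is hypothetical, and the page HTML is assumed to have been fetched already (for example with file_get_contents() or cURL).

```php
<?php
// Hypothetical helper: remove every tag except <p>, <h1>, <h2>, <strong>.
function keep_basic_tags(string $html): string
{
    // strip_tags() removes tags but leaves the text between them, so
    // script and style blocks are dropped first to keep their code
    // out of the result.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);

    // The second argument lists the tags that are allowed to remain.
    return trim(strip_tags($html, '<p><h1><h2><strong>'));
}

$page = '<div><h1>Title</h1><script>alert(1)</script>'
      . '<p>Hi <a href="#">there</a> <strong>you</strong></p></div>';

echo keep_basic_tags($page);
```

Note that strip_tags() keeps any attributes on the allowed tags, and the text of removed tags (such as link text) stays in place, so site navigation and similar clutter would still need per-site filtering.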
Question has a verified solution.
