• Status: Solved

I am trying to pull news from sites such as CNN, Yahoo, or MSN: scrape the page, then parse the results.

Hoping to hear from Batalf on this one.
Worth 500 points.
I have seen many ways to scrape a page. All are great.
An example would be to pull an article paragraph out of a page
and redisplay it.

I would like it to take a variable, call it word count:
set it to 300 characters, then add ... at the end.

I need it to be formatted like a paragraph.
Also, I would like to give it a keyword, and if the keyword appears, bold that keyword in the paragraph:
$keyword = "money";

<p>this is the sentence about <strong>money</strong>. It is a short sentence...</p>
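The formatting part of the request (cut to 300 characters, append "...", bold the keyword, wrap in a paragraph) could be sketched roughly as follows. This is only a sketch in Python, since the thread does not say which language is in use; the function name `format_snippet` and the parameter `max_chars` are made up for illustration. It assumes the text has already been extracted from the page as plain text.

```python
import html
import re


def format_snippet(text, keyword, max_chars=300):
    """Truncate text to max_chars, bold the keyword, wrap the result in <p>."""
    # Collapse runs of whitespace so the result reads as one paragraph.
    text = " ".join(text.split())
    truncated = len(text) > max_chars
    if truncated:
        cut = text[:max_chars]
        # If the cut lands inside a word, back up to the previous space.
        if text[max_chars] != " " and " " in cut:
            cut = cut[:cut.rfind(" ")]
        text = cut.rstrip()
    # Escape before adding markup so a stray < or & in the source stays literal.
    escaped = html.escape(text)
    if keyword:
        # Whole-word, case-insensitive match; the original casing is preserved.
        pattern = re.compile(
            r"\b(" + re.escape(html.escape(keyword)) + r")\b", re.IGNORECASE
        )
        escaped = pattern.sub(r"<strong>\1</strong>", escaped)
    if truncated:
        escaped += "..."
    return "<p>" + escaped + "</p>"
```

Calling `format_snippet(article_text, "money")` on a long article would give back one `<p>` element of at most about 300 characters of text, ending in "...", with each whole-word occurrence of the keyword wrapped in `<strong>`.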
1 Solution

All of these sites have XML feeds, so using a feed would be better than scraping the page. With XML you know exactly what content you are working with; with scraping you have to know exactly which URL contains the data to scrape. Give me three example URLs from each of your sites (with keywords) and I will show you how to write a generic function that processes all of them!
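To illustrate why a feed is easier to work with, here is a rough sketch in Python that reads the items out of an RSS 2.0 document using only the standard library. The feed text, the example.com links, and the helper names `parse_rss_items` and `items_matching` are invented for illustration; a real feed would be downloaded from the site first and may use extra elements or namespaces.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed, standing in for a document fetched from a news site.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Markets rally</title>
      <link>http://example.com/markets</link>
      <description>Investors put money back into stocks.</description>
    </item>
    <item>
      <title>Weather update</title>
      <link>http://example.com/weather</link>
      <description>Rain is expected this weekend.</description>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return a list of dicts with title, link, and description per <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


def items_matching(items, keyword):
    """Keep only the items whose description mentions the keyword."""
    keyword = keyword.lower()
    return [i for i in items if keyword in i["description"].lower()]
```

Each item arrives with its title, link, and description already separated, so there is no page layout to reverse-engineer.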

jbrashear72Author Commented:
Well, I also want to scrape from DMOZ and other sites. Those were just examples.
Not all sites have RSS.
I don't want to rely on RSS/XML feeds.
The problem with what you're trying to do is that it might not work the same way every time for a particular site unless the HTML is uniform (i.e., they use a template). Like mensuck said, a few examples will be needed in order to help you out.
You'll have to write a 'scraper' for each site. It's not a trivial problem. Have a look at my answer to a previous question for scraping of DMOZ:


Also note that RSS feeds are given to you for free. Scraping may well be illegal (against copyright law). Use the scripts only if you are prepared for the potential consequences!
jbrashear72Author Commented:
I want to pull in a complete page and strip out all tags except, say, what is in <p>
or <h1> or <h2>
or <strong>.
Strip out this content. What I am trying to accomplish is not illegal. This is a test.
Thanks for the concern.
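One possible way to strip every tag except <p>, <h1>, <h2>, and <strong> is a small tag filter built on a standard HTML parser. The following is only a sketch in Python; the names `TagFilter` and `keep_only_allowed` are made up for illustration. It makes three assumptions that go beyond the request: attributes on the kept tags are dropped, text outside the kept tags is kept as plain text, and the contents of <script> and <style> are discarded.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "h1", "h2", "strong"}
SKIPPED_TAGS = {"script", "style"}  # assumption: their contents are dropped too


class TagFilter(HTMLParser):
    """Re-emit a page keeping only the allowed tags and the visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            # Attributes are deliberately not copied to the output.
            self.parts.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED_TAGS and not self.skip_depth:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            # Re-escape text so it cannot reintroduce markup.
            self.parts.append(escape(data, quote=False))


def keep_only_allowed(html_text):
    """Return html_text with every tag outside ALLOWED_TAGS removed."""
    parser = TagFilter()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)
```

The filtered output can then be split on the remaining <p> tags to pick out individual paragraphs. Real pages often contain malformed markup, so the result should be checked against each site.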
