I am trying to pull news from sites such as CNN, Yahoo, or MSN. Scrape and then parse the results of the page.

Posted on 2005-05-16
Last Modified: 2008-03-06
Hoping to hear from Batalf on this one.
Worth 500 Points.
I have seen many ways to scrape a page. All are great.
An example would be to pull an article paragraph out of a page
and redisplay it.

I would like it to take a variable, call it word count:
set it to 300 characters, then add ... at the end.

I need it to be formatted like a paragraph.
Also, I would like to give it a keyword, and if the keyword appears, bold that keyword in the paragraph:
$keyword = "money";

<p>This is the sentence about <strong>money</strong>. It is a short sentence...</p>
Question by:jbrashear72
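
The request above (cut to a character limit, append "...", bold a keyword, wrap in a paragraph) can be sketched roughly as follows. This is only a minimal illustration in Python, which is an assumption since the thread does not name a language; the function name `excerpt` and its parameters are hypothetical. It backs up to the last space before the limit so a word is not cut in half, which is a design choice rather than something the question asks for.

```python
import html
import re


def excerpt(text, keyword, limit=300):
    """Cut text to `limit` characters, bold the keyword, wrap in <p>."""
    # Collapse runs of whitespace so the result reads as one paragraph.
    text = " ".join(text.split())
    truncated = len(text) > limit
    if truncated:
        cut = text[:limit]
        # Back up to the last space so the final word is not split.
        if " " in cut:
            cut = cut.rsplit(" ", 1)[0]
        text = cut

    if keyword:
        # One capturing group: re.split returns non-match, match, non-match...
        pattern = re.compile(r"\b(" + re.escape(keyword) + r")\b", re.IGNORECASE)
        parts = pattern.split(text)
    else:
        parts = [text]

    out = []
    for i, part in enumerate(parts):
        # Escape every piece so the only markup is what is added here.
        piece = html.escape(part)
        # Odd indices are the keyword matches.
        out.append("<strong>" + piece + "</strong>" if i % 2 else piece)

    body = "".join(out) + ("..." if truncated else "")
    return "<p>" + body + "</p>"


if __name__ == "__main__":
    sample = "This is the sentence about money. It is a short sentence that goes on"
    print(excerpt(sample, "money", limit=50))
```

The keyword match is case-insensitive and limited to whole words; drop the `\b` anchors if partial matches such as "moneyed" should be bolded too.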
    LVL 10

    Expert Comment


    All of these sites have XML feeds, so using a feed would be better than scraping the page. With XML you know exactly what content you are working with; with scraping you have to know exactly which URL contains the data to scrape. Give me three example page URLs from each of your examples (with keywords) and I will show how to make a generic function that will process all of them using the same functions!
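
To illustrate why a feed is easier to work with than a scraped page, here is a small sketch that reads the items out of an RSS 2.0 document using Python's standard library. Python is an assumption (the thread does not name a language), and `SAMPLE_FEED` with its example.com links is made-up data standing in for a real feed; `read_items` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Made-up feed content for illustration only; a real feed would be downloaded.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>Markets rally</title>
      <link>http://example.com/markets</link>
      <description>Investors put money back into stocks.</description>
    </item>
    <item>
      <title>Weather update</title>
      <link>http://example.com/weather</link>
      <description>Rain is expected this weekend.</description>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Return a list of dicts with the title, link and description of each item."""
    root = ET.fromstring(feed_xml)
    items = []
    # Every story in an RSS 2.0 feed is an <item> element under <channel>.
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in read_items(SAMPLE_FEED):
        print(entry["title"], "-", entry["description"])
```

The same function works for any feed that follows the RSS 2.0 item layout, which is the point being made about feeds: the structure is known in advance.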

    LVL 3

    Author Comment

    Well, I also want to scrape from DMOZ and other sites. Those were just examples.
    Not all sites have RSS.
    I don't want to rely on RSS/XML feeds.
    LVL 1

    Expert Comment

    The problem with what you're trying to do is that it might not work the same way each time for a particular site unless the HTML is consistent (i.e., they use a template). Like mensuck said, a few examples will be needed in order to help you out.
    LVL 9

    Accepted Solution

    You'll have to write a 'scraper' for each site. It's not a trivial problem. Have a look at my answer to a previous question for scraping of DMOZ:

    Also note that RSS feeds are given to you for free. Scraping may well be illegal, as a violation of copyright law! Use the scripts only if you are prepared for potential consequences!
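
One way to organise "a scraper for each site" is a lookup table keyed by host name, with one small extraction function per site. The sketch below is in Python (an assumption; the thread does not name a language). The host names, the `class="story"` and `id="article"` markers, and all function names are hypothetical; a real site's markup has to be inspected by hand, and regular expressions like these break on nested tags or when the site changes its template.

```python
import re
from urllib.parse import urlparse


def scrape_site_a(page_html):
    # Hypothetical layout: the article body sits in <div class="story">.
    match = re.search(r'<div class="story">(.*?)</div>', page_html, re.DOTALL)
    return match.group(1).strip() if match else None


def scrape_site_b(page_html):
    # Hypothetical layout: the article body sits in <td id="article">.
    match = re.search(r'<td id="article">(.*?)</td>', page_html, re.DOTALL)
    return match.group(1).strip() if match else None


# One entry per supported site; the host names are placeholders.
SCRAPERS = {
    "news.example.com": scrape_site_a,
    "portal.example.org": scrape_site_b,
}


def scrape(url, page_html):
    """Pick the scraper for the URL's host and run it on the downloaded page."""
    host = urlparse(url).hostname
    scraper = SCRAPERS.get(host)
    if scraper is None:
        raise LookupError("no scraper written for " + str(host))
    return scraper(page_html)


if __name__ == "__main__":
    page = '<html><div class="story"> Hello </div></html>'
    print(scrape("http://news.example.com/a", page))
```

Downloading the page is left out on purpose; `scrape` takes the HTML as a string so each site function can be tried against saved copies of a page.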
    LVL 3

    Author Comment

    I want to pull in a complete page and strip out all tags except, say, what is in <p>,
    <h1>, <h2>,
    or <strong>.
    Strip out this content. What I am trying to accomplish is not illegal. This is a test.
    Thanks for the concern.
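
The tag stripping described above could look roughly like this, using Python's built-in `html.parser` (Python is an assumption; the thread does not name a language). The names `TagFilter`, `strip_tags`, `KEEP` and `SKIP_CONTENT` are hypothetical. Two assumptions go beyond what was asked: attributes on the kept tags are dropped, and the contents of `<script>` and `<style>` are discarded rather than shown as text. Text outside the kept tags is preserved; only the tags themselves are removed.

```python
from html import escape
from html.parser import HTMLParser

KEEP = {"p", "h1", "h2", "strong"}   # tags that survive
SKIP_CONTENT = {"script", "style"}   # tags whose contents are dropped (assumption)


class TagFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in KEEP and not self.skip_depth:
            # Re-emit the tag without its attributes.
            self.out.append("<" + tag + ">")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in KEEP and not self.skip_depth:
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        if not self.skip_depth:
            # Entities arrive decoded, so escape them again for output.
            self.out.append(escape(data, quote=False))


def strip_tags(page_html):
    """Return page_html with every tag removed except those in KEEP."""
    parser = TagFilter()
    parser.feed(page_html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


if __name__ == "__main__":
    page = '<div><h1>Title</h1><p>Some <a href="x">linked</a> text</p></div>'
    print(strip_tags(page))
```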
