dialektkid

Create RSS from web page using cURL

Hello

I am trying to scrape a site by tags using the cURL library.

See the code below. I am running WAMP and have the cURL library enabled, but the feed does not write anything besides the channel header at the top; it never writes an item while iterating through the nodes.


<?php

        $url = 'http://www.thehoneycomb.com/default.cfm';
        $title = 'The Honeycomb';
        $description = 'Events';

        $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

        header('Content-type: text/xml; charset=utf-8', true);

        echo '<?xml version="1.0" encoding="UTF-8"?'.'>' . "\n";
        echo '<rss version="2.0">' . "\n";
        echo '<channel>' . "\n";
        echo '  <title>' . $title . '</title>' . "\n";
        echo '  <link>' . $url . '</link>' . "\n";
        echo '  <description>' . $description . '</description>' . "\n";

        $curl = curl_init($url);
        curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($curl, CURLOPT_AUTOREFERER, true);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1 );
        curl_setopt($curl, CURLOPT_TIMEOUT, 2 );                

        $html = curl_exec( $curl );

        $html = @mb_convert_encoding($html, 'HTML-ENTITIES', 'utf-8');   

        curl_close( $curl );

        $dom = new DOMDocument();

        @$dom->loadHTML($html);

        $nodes = $dom->getElementsByTagName('*');

        $date = '';
        
        $description = '';

        foreach($nodes as $node){

                if($node->nodeName == 'p'){
                        $date =  strtotime($node->nodeValue);
                }

                if($node->nodeName == 'tr'){

                        $inodes = $node->childNodes;

                        foreach($inodes as $inode){

                                if($inode->nodeName == 'a' && $inode->getAttribute('class') == 'permalink'){
                                        echo '<item>' . "\n";
                                        echo '<title>' . @mb_convert_encoding(htmlspecialchars($inode->getAttribute('title')), 'utf-8') . '</title>' . "\n";
                                        echo '<link>' . $inode->getAttribute('href') . '</link>' . "\n";
                                        echo '<description>' . $inode->getAttribute('td') . '</description>' . "\n";
                                        if($date){
                                                echo '<pubDate>' . date(DATE_RSS, $date) . '</pubDate>' . "\n";
                                        }
                                        echo '</item>' . "\n";
                                }
                        }
                }
        }

        echo '</channel></rss>';

?>
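
One likely reason the loop above never emits an item: `childNodes` on a `tr` returns only its direct children (the `td` cells and whitespace text nodes), so an `a` nested inside a cell is never seen. A minimal sketch of one way around that, assuming the links really carry `class="permalink"` as the code above expects. The sample markup and the helper name `extractPermalinks` are made up for illustration and are not taken from the site.

```php
<?php
// Hypothetical helper: collect permalink anchors from every table row.
function extractPermalinks(string $html): array
{
    $dom = new DOMDocument();
    // Buffer parser warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $items = [];
    foreach ($dom->getElementsByTagName('tr') as $row) {
        // getElementsByTagName() searches all descendants of the row,
        // whereas childNodes only lists its direct children.
        foreach ($row->getElementsByTagName('a') as $link) {
            if ($link->getAttribute('class') !== 'permalink') {
                continue;
            }
            $items[] = [
                'title' => $link->getAttribute('title'),
                'link'  => $link->getAttribute('href'),
            ];
        }
    }
    return $items;
}

// Made-up sample markup standing in for the fetched page.
$sample = '<table><tr><td><a class="permalink" title="Show A" href="/a.cfm">Show A</a></td></tr>'
        . '<tr><td><a href="/other.cfm">Other</a></td></tr></table>';

foreach (extractPermalinks($sample) as $item) {
    echo '<item>' . "\n";
    // Escape values so characters such as & cannot break the XML.
    echo '  <title>' . htmlspecialchars($item['title'], ENT_XML1 | ENT_QUOTES, 'UTF-8') . '</title>' . "\n";
    echo '  <link>' . htmlspecialchars($item['link'], ENT_XML1 | ENT_QUOTES, 'UTF-8') . '</link>' . "\n";
    echo '</item>' . "\n";
}
```

If the page nests tables inside table rows, the outer row would match the inner links as well, so duplicates may need filtering.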


ASKER CERTIFIED SOLUTION
CSecurity
dialektkid

ASKER

Nothing is being written to the description field. The link field looks good! For the pub date I can just use the server time.
There is also no description available for the links and <a href> tags on that page.
Sorry, I found how to fix it.

Replace
@$td = $link['attributes']['td'];

with
@$td = strip_tags($link['contents']);
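
The same idea carries over to the DOMDocument code in the question: an `<a>` element has no `td` attribute, so `getAttribute('td')` always returns an empty string. A rough sketch, assuming the visible link text is an acceptable description. The sample markup and the helper name `linkDescription` are illustrative only.

```php
<?php
// Hypothetical helper: use the visible text of a link as its description.
function linkDescription(DOMElement $link): string
{
    // textContent concatenates all descendant text with the tags removed,
    // much like strip_tags() applied to the link's inner HTML.
    // Collapse runs of whitespace so the feed entry stays on one line.
    return trim(preg_replace('/\s+/', ' ', $link->textContent));
}

// Made-up sample markup standing in for one link on the page.
$dom = new DOMDocument();
$dom->loadHTML('<p><a class="permalink" href="/a.cfm">Friday: <b>Live  jazz</b> at 9pm</a></p>');
$link = $dom->getElementsByTagName('a')->item(0);

var_dump($link->getAttribute('td'));   // empty string: the anchor has no such attribute
echo linkDescription($link), "\n";
```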
It might be hard to work with that site, partly because its markup does not validate:

http://validator.w3.org/check?uri=http%3A%2F%2Fwww.thehoneycomb.com%2Fdefault.cfm&charset=%28detect+automatically%29&doctype=Inline&group=0

Do you own the content of the site?  If so, you might want to hire a developer to get into the ColdFusion code and just write out a native RSS feed.  If you do not own the site, please be careful about using the content in an RSS feed - make sure you have permission to republish the information so you do not run afoul of copyrights.
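
On the validator results linked above: the code in the question hides every parser warning with `@`, which makes it hard to tell how badly the page is broken. A minimal sketch of collecting the warnings instead, using libxml's internal error buffer. The sample markup and the helper name `loadHtmlWithErrors` are made up for illustration, and which complaints appear depends on the installed libxml2 version.

```php
<?php
// Hypothetical helper: parse HTML and return the document plus any parser complaints.
function loadHtmlWithErrors(string $html): array
{
    $previous = libxml_use_internal_errors(true); // buffer errors instead of emitting warnings
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $errors = libxml_get_errors();                // array of LibXMLError objects
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the earlier setting
    return [$dom, $errors];
}

// Made-up broken markup: an unknown tag and unclosed cell and row.
[$dom, $errors] = loadHtmlWithErrors(
    '<table><tr><td><bogus>Event</bogus><a href="/a.cfm">A</a></table>'
);

foreach ($errors as $error) {
    // Each LibXMLError carries the line number and a message.
    printf("line %d: %s\n", $error->line, trim($error->message));
}
echo $dom->getElementsByTagName('a')->length, " link(s) still found\n";
```

The parser usually recovers from markup like this, so the links can still be read; the error list just shows what it had to guess at.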
SOLUTION
Any progress?  Any questions?  Have you tried installing and running the code snippet I posted yet?

Best, ~Ray
Tested and working code here, with explanation of what is afoot: 12/19/09 10:48 AM, ID: 26087292

Cheers, ~Ray