Create RSS from web page using cURL

Hello,

I am trying to scrape a site by tag using the cURL library and DOMDocument.

See the code below. I am running WAMP and have the cURL extension enabled, but the feed does not write anything besides the channel header; it will not iterate through the nodes.


<?php

        $url = 'http://www.thehoneycomb.com/default.cfm';
        $title = 'The Honeycomb';
        $description = 'Events';

        $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

        header('Content-type: text/xml; charset=utf-8', true);

        echo '<?xml version="1.0" encoding="UTF-8"?'.'>' . "\n";
        echo '<rss version="2.0">' . "\n";
        echo '<channel>' . "\n";
        echo '  <title>' . $title . '</title>' . "\n";
        echo '  <link>' . $url . '</link>' . "\n";
        echo '  <description>' . $description . '</description>' . "\n";

        $curl = curl_init($url);
        curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($curl, CURLOPT_AUTOREFERER, true);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1 );
        curl_setopt($curl, CURLOPT_TIMEOUT, 2 );                

        $html = curl_exec( $curl );

        $html = @mb_convert_encoding($html, 'HTML-ENTITIES', 'utf-8');   

        curl_close( $curl );

        $dom = new DOMDocument();

        @$dom->loadHTML($html);

        $nodes = $dom->getElementsByTagName('*');

        $date = '';
        
        $description = '';

        foreach($nodes as $node){

                if($node->nodeName == 'p'){
                        $date =  strtotime($node->nodeValue);
                }

                if($node->nodeName == 'tr'){

                        $inodes = $node->childNodes;

                        foreach($inodes as $inode){

                                if($inode->nodeName == 'a' && $inode->getAttribute('class') == 'permalink'){
                                        echo '<item>' . "\n";
                                        echo '<title>' . @mb_convert_encoding(htmlspecialchars($inode->getAttribute('title')), 'utf-8') . '</title>' . "\n";
                                        echo '<link>' . $inode->getAttribute('href') . '</link>' . "\n";
                                        echo '<description>' . $inode->getAttribute('td') . '</description>' . "\n";
                                        if($date){
                                                echo '<pubDate>' . date(DATE_RSS, $date) . '</pubDate>' . "\n";
                                        }
                                        echo '</item>' . "\n";
                                }
                        }
                }
        }

        echo '</channel></rss>';

?>

Asked by dialektkid

CSecurityCommented:
Your page is complex and has a lot of nested tags, so you can't do this with a single childNodes pass; you need to use a regex-based extractor. Also, none of the links on that page have a title or date attribute. Test this:



<?php

function extract_tags( $html, $tag, $selfclosing = null, $return_the_entire_tag = false, $charset = 'ISO-8859-1' ){
 
	if ( is_array($tag) ){
		$tag = implode('|', $tag);
	}
 
	//If the user didn't specify if $tag is a self-closing tag we try to auto-detect it
	//by checking against a list of known self-closing tags.
	$selfclosing_tags = array( 'area', 'base', 'basefont', 'br', 'hr', 'input', 'img', 'link', 'meta', 'col', 'param' );
	if ( is_null($selfclosing) ){
		$selfclosing = in_array( $tag, $selfclosing_tags );
	}
 
	//The regexp is different for normal and self-closing tags because I can't figure out 
	//how to make a sufficiently robust unified one.
	if ( $selfclosing ){
		$tag_pattern = 
			'@<(?P<tag>'.$tag.')			# <tag
			(?P<attributes>\s[^>]+)?		# attributes, if any
			\s*/?>					# /> or just >, being lenient here 
			@xsi';
	} else {
		$tag_pattern = 
			'@<(?P<tag>'.$tag.')			# <tag
			(?P<attributes>\s[^>]+)?		# attributes, if any
			\s*>					# >
			(?P<contents>.*?)			# tag contents
			</(?P=tag)>				# the closing </tag>
			@xsi';
	}
 
	$attribute_pattern = 
		'@
		(?P<name>\w+)							# attribute name
		\s*=\s*
		(
			(?P<quote>[\"\'])(?P<value_quoted>.*?)(?P=quote)	# a quoted value
			|							# or
			(?P<value_unquoted>[^\s"\']+?)(?:\s+|$)			# an unquoted value (terminated by whitespace or EOF) 
		)
		@xsi';
 
	//Find all tags 
	if ( !preg_match_all($tag_pattern, $html, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE ) ){
		//Return an empty array if we didn't find anything
		return array();
	}
 
	$tags = array();
	foreach ($matches as $match){
 
		//Parse tag attributes, if any
		$attributes = array();
		if ( !empty($match['attributes'][0]) ){ 
 
			if ( preg_match_all( $attribute_pattern, $match['attributes'][0], $attribute_data, PREG_SET_ORDER ) ){
				//Turn the attribute data into a name->value array
				foreach($attribute_data as $attr){
					if( !empty($attr['value_quoted']) ){
						$value = $attr['value_quoted'];
					} else if( !empty($attr['value_unquoted']) ){
						$value = $attr['value_unquoted'];
					} else {
						$value = '';
					}
 
					//Passing the value through html_entity_decode is handy when you want
					//to extract link URLs or something like that. You might want to remove
					//or modify this call if it doesn't fit your situation.
					$value = html_entity_decode( $value, ENT_QUOTES, $charset );
 
					$attributes[$attr['name']] = $value;
				}
			}
 
		}
 

		$tag = array(
			'tag_name' => $match['tag'][0],
			'offset' => $match[0][1], 
			'contents' => !empty($match['contents'])?$match['contents'][0]:'', //empty for self-closing tags
			'attributes' => $attributes, 
		);
		if ( $return_the_entire_tag ){
			$tag['full_tag'] = $match[0][0]; 			
		}
 
		$tags[] = $tag;
	}
 
	return $tags;
}

        $url = 'http://www.thehoneycomb.com/default.cfm';
        $title = 'The Honeycomb';
        $description = 'Events';

        $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

        header('Content-type: text/xml; charset=utf-8', true);

        echo '<?xml version="1.0" encoding="UTF-8"?'.'>' . "\n";
        echo '<rss version="2.0">' . "\n";
        echo '<channel>' . "\n";
        echo '  <title>' . $title . '</title>' . "\n";
        echo '  <link>' . $url . '</link>' . "\n";
        echo '  <description>' . $description . '</description>' . "\n";

        $curl = curl_init($url);
        curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($curl, CURLOPT_AUTOREFERER, true);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1 );
        curl_setopt($curl, CURLOPT_TIMEOUT, 100 );                

        $html = curl_exec( $curl );
        curl_close( $curl );


        $title = "";
        $href  = "";
        $td    = "";
        $date  = "2009/12/18";
        $nodes = extract_tags( $html, 'a' );

        foreach ($nodes as $link)
        {
                @$title = $link['attributes']['title'];
                @$href  = $link['attributes']['href'];
                @$td    = $link['attributes']['td'];

                echo '<item>' . "\n";
                echo '<title>' . $title . '</title>' . "\n";
                echo '<link>' . $href . '</link>' . "\n";
                echo '<description>' . $td . '</description>' . "\n";
                if ($date)
                {
                        echo '<pubDate>' . $date . '</pubDate>' . "\n";
                }
                echo '</item>' . "\n";
        }

?>
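For reference, this is roughly the array shape extract_tags() returns for a single anchor. A minimal sketch with made-up markup (not taken from the actual site), assuming the function defined above is in scope:

<?php
// Sketch: inspect what extract_tags() returns for one hand-written anchor.
$links = extract_tags('<a class="permalink" href="/events/123" title="Holiday Party">Dec 18 at the Hive</a>', 'a');

var_dump($links[0]['tag_name']);   // "a"
var_dump($links[0]['attributes']); // array with class, href and title keys
var_dump($links[0]['contents']);   // "Dec 18 at the Hive"
?>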

dialektkid (Author) Commented:
Nothing is being written to the description field. The link field looks good! For the pub date I can just use the server time.
 
CSecurityCommented:
There is also no description available for the links (the <a href> tags) on that page.

 
CSecurityCommented:
Sorry, I found how to fix it.

Replace
@$td = $link['attributes']['td'];

with
@$td = strip_tags($link['contents']);
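
To illustrate why that works: an <a> tag has no td attribute, so $link['attributes']['td'] is always empty; the visible link text comes back under 'contents' instead, and strip_tags() removes any markup inside it. A quick sketch with made-up markup, again assuming the extract_tags() helper above:

<?php
// Sketch: 'td' is not an attribute of <a>, so it never shows up in the attributes array.
// The link text is returned under 'contents'; strip_tags() drops any tags inside it.
$links = extract_tags('<a class="permalink" href="/events/123">Live music, <b>9pm</b></a>', 'a');

echo isset($links[0]['attributes']['td']) ? 'td set' : 'td not set'; // td not set
echo "\n";
echo strip_tags($links[0]['contents']); // Live music, 9pm
?>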
 
Ray PaseurCommented:
It might be hard to work with that site, partially because of this:

http://validator.w3.org/check?uri=http%3A%2F%2Fwww.thehoneycomb.com%2Fdefault.cfm&charset=%28detect+automatically%29&doctype=Inline&group=0

Do you own the content of the site? If so, you might want to hire a developer to get into the ColdFusion code and just write out a native RSS feed. If you do not own the site, please be careful about using the content in an RSS feed: make sure you have permission to republish the information so you do not run afoul of copyright.
 
Ray PaseurCommented:
This will scrape the HTML and create an RSS feed.  Usually when I design a site, I try to think ahead about what kind of information I would present in RSS, and aggregate that information somewhere that makes RSS publishing easy.  From the look of the HTML here, that thought process was never undertaken.  As you can see, it is easy enough to scrape useful data out of the stuff produced by the CF machine.  But this is a "brittle" implementation since a change in the HTML might break the code we used to extract data.  That's why I think a better approach is to write the native RSS feed directly from the CF code.

More on RSS here:
http://cyber.law.harvard.edu/rss/rss.html

Best regards, ~Ray
<?php // RAY_temp_honeycomb_rss.php
error_reporting(E_ALL);
 
// SCRAPE HTML AND CREATE RSS
 
// THE URL
$url = 'http://www.thehoneycomb.com/default.cfm';
 
// LIE TO THE WEB SITE - APPEAR TO BE GOOGLE??
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// TRY TO READ THE WEB SITE
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
curl_setopt($curl, CURLOPT_AUTOREFERER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1 );
curl_setopt($curl, CURLOPT_TIMEOUT, 2 );
$htm = curl_exec($curl);
$err = curl_errno($curl);
curl_close($curl);
if ($htm === FALSE) die("CURL FAIL: $url TIMEOUT=2, CURL_ERRNO=$err");
 
 
// TRIMMING OFF THE UNWANTED INFORMATION FROM THE MALFORMED HTML
$str = 'iconWHITEspace.gif';
$poz = strpos($htm, $str);
$htm = substr($htm, $poz);
 
$str = '</TABLE>';
$poz = strpos($htm, $str);
$htm = substr($htm, 0, $poz);
 
// USE THE NIGHT ICON AS A SEPARATOR TO MAKE AN ARRAY OF DATES AND EVENTS
$arr = explode('iconNight.gif', $htm);
unset($arr[0]);
 
 
// TIDY UP EACH ELEMENT OF THE ARRAY
foreach ($arr as $ptr => $txt)
{
   // FIND THE FRONT
   $poz = strpos($txt, '<SPAN CLASS="textHeaderBold">');
   $arr[$ptr] = substr($txt, $poz);
 
   // FIND THE END
   $poz = strpos($arr[$ptr], '<A HREF="events/default.cfm">');
   $arr[$ptr] = substr($arr[$ptr],0,$poz);
 
   // REMOVE THE WHITESPACE
   $arr[$ptr] = str_replace('&nbsp;', ' ', $arr[$ptr]);
   $arr[$ptr] = preg_replace('/\s\s+/', ' ', $arr[$ptr]);
   $arr[$ptr] = trim($arr[$ptr]);
}
// ACTIVATE THIS TO VISUALIZE WHAT WE HAVE DONE
// var_dump($arr);
 
 
// CREATE THE RSS ARRAY
$rss = array();
foreach ($arr as $ptr => $txt)
{
   // END-SPAN IS A USEFUL DELIMITER
   $xyz = explode('</SPAN>', $txt);
   $abc = trim(strip_tags($xyz[0]));
   $abc = str_replace('<br>', '', $abc);
   $abc = htmlentities($abc);
   $rss[$ptr]["title"] = $abc;
 
   // LEFT-TRIM TO THE FIRST BREAK
   $poz = strpos($xyz[1], '<br>');
   $abc = substr($xyz[1], $poz);
 
   // TIDY UP
   $abc = trim($abc, '<br>');
   $abc = str_replace('<br><br>', '<br>', $abc);
   $abc = htmlentities($abc);
   $rss[$ptr]["descr"] = $abc;
}
// ACTIVATE THIS TO VISUALIZE THE ARRAY
// var_dump($rss);
 
 
// TOP OF THE RSS FEED
$title = 'The Honeycomb.com';
$descr = 'Upcoming Events';
$pdate = date(DATE_RSS);
 
// WRITE THE RSS HEADER AND TOP OF THE FEED
// MAN PAGE: http://cyber.law.harvard.edu/rss/rss.html
header('Content-type: text/xml; charset=utf-8', true);
 
echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
echo '<rss version="2.0">'                    . "\n";
echo '<channel>'                              . "\n";
 
echo '  <title>'       . $title . '</title>'       . "\n";
echo '  <link>'        . $url   . '</link>'        . "\n";
echo '  <description>' . $descr . '</description>' . "\n";
echo '  <pubDate>'     . $pdate . '</pubDate>'     . "\n";
 
// INSERT EACH ITEM
foreach ($rss as $ptr => $itm)
{
    echo '   <item>' . "\n";
    echo '      <title>' .       $itm["title"] . '</title>' . "\n";
    echo '      <description>' . $itm["descr"] . '</description>' . "\n";
    echo '   </item>' . "\n";
}
 
// WRAP UP THE FEED
echo '</channel></rss>';
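
As a quick sanity check once the script is installed, the generated feed can be parsed back with SimpleXML. A sketch only; the filename and localhost URL below are assumptions (adjust them to your setup), and it assumes allow_url_fopen is enabled:

<?php
// Sketch: fetch the feed produced by the script above and confirm it parses as XML.
// The URL is an assumption; point it at wherever you installed the script.
$feedUrl = 'http://localhost/RAY_temp_honeycomb_rss.php';

$xml = simplexml_load_file($feedUrl);
if ($xml === false) die("Feed at $feedUrl did not parse as XML");

// List the channel title and each item title to eyeball the output.
echo $xml->channel->title . PHP_EOL;
foreach ($xml->channel->item as $item)
{
    echo ' * ' . $item->title . PHP_EOL;
}
?>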

 
Ray PaseurCommented:
Any progress?  Any questions?  Have you tried installing and running the code snippet I posted yet?

Best, ~Ray
 
Ray PaseurCommented:
Tested and working code here, with explanation of what is afoot: 12/19/09 10:48 AM, ID: 26087292

Cheers, ~Ray