Create RSS from web page using cURL

Asked by dialektkid
Topics: Web Languages and Standards, PHP
Hello,

I am trying to scrape a site by its tags using the cURL library.

See the code below. I am running WAMP with the cURL extension enabled, but the feed contains only the channel header at the top (title, link, description). No items are written, as if the loop never iterates through the nodes.


<?php

        $url = 'http://www.thehoneycomb.com/default.cfm';
        $title = 'The Honeycomb';
        $description = 'Events';

        $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

        header('Content-type: text/xml; charset=utf-8', true);

        echo '<?xml version="1.0" encoding="UTF-8"?'.'>' . "\n";
        echo '<rss version="2.0">' . "\n";
        echo '<channel>' . "\n";
        echo '  <title>' . $title . '</title>' . "\n";
        echo '  <link>' . $url . '</link>' . "\n";
        echo '  <description>' . $description . '</description>' . "\n";

        $curl = curl_init($url);
        curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($curl, CURLOPT_AUTOREFERER, true);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1 );
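        curl_setopt($curl, CURLOPT_TIMEOUT, 30); // 2 seconds is often too short for a full page fetch

        $html = curl_exec( $curl );

        curl_close( $curl );

        // If the request failed, close the feed cleanly instead of parsing an empty document.
        if ($html === false || $html === '') {
                echo '</channel></rss>';
                exit;
        }

        $html = @mb_convert_encoding($html, 'HTML-ENTITIES', 'utf-8');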

        $dom = new DOMDocument();

        @$dom->loadHTML($html);

        $nodes = $dom->getElementsByTagName('*');

        $date = '';
        
        $description = '';

        foreach($nodes as $node){

                if($node->nodeName == 'p'){
                        $date =  strtotime($node->nodeValue);
                }

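                if($node->nodeName == 'tr'){

                        // childNodes only returns the row's direct children (normally the <td> cells),
                        // so a link nested inside a cell is never seen. getElementsByTagName()
                        // searches every descendant of the row instead.
                        $inodes = $node->getElementsByTagName('a');

                        foreach($inodes as $inode){

                                if($inode->getAttribute('class') == 'permalink'){
                                        echo '<item>' . "\n";
                                        echo '<title>' . @mb_convert_encoding(htmlspecialchars($inode->getAttribute('title')), 'utf-8') . '</title>' . "\n";
                                        // Escape the URL so characters such as & do not break the XML.
                                        echo '<link>' . htmlspecialchars($inode->getAttribute('href')) . '</link>' . "\n";
                                        // 'td' is an element, not an attribute of the link. Using the text of the
                                        // whole row as the description is an assumption about the intended output.
                                        echo '<description>' . htmlspecialchars(trim($node->textContent)) . '</description>' . "\n";
                                        if($date){
                                                echo '<pubDate>' . date(DATE_RSS, $date) . '</pubDate>' . "\n";
                                        }
                                        echo '</item>' . "\n";
                                }
                        }
                }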
        }

        echo '</channel></rss>';

?>
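
One possible reason no items are written can be reproduced without the site at all. The sketch below is only an illustration: the table markup and the example.com URL are made up, not taken from the page being scraped, and the real page may be structured differently. It compares what `childNodes` returns for a row with what `getElementsByTagName('a')` returns for the same row.

```php
<?php
// Hypothetical markup (an assumption): one row whose link sits inside a <td>.
$html = '<table><tr>'
      . '<td><a class="permalink" title="Gig" href="http://example.com/e/1">Gig</a></td>'
      . '<td>Doors at 8</td>'
      . '</tr></table>';

$dom = new DOMDocument();
@$dom->loadHTML($html);

$row = $dom->getElementsByTagName('tr')->item(0);

// Direct children of the row: only the cells, never the link inside them.
$childNames = array();
foreach ($row->childNodes as $child) {
    $childNames[] = $child->nodeName;
}

// Descendant search: reaches the link nested inside the first cell.
$links = array();
foreach ($row->getElementsByTagName('a') as $a) {
    if ($a->getAttribute('class') === 'permalink') {
        $links[] = $a->getAttribute('href');
    }
}

echo 'children: ', implode(',', $childNames), "\n";
echo 'links: ', implode(',', $links), "\n";
```

If the real page nests its links the same way, a check for `nodeName == 'a'` against a row's `childNodes` can never match, which would explain a feed that stops after the channel header.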