dialektkid
asked on
Create RSS from web page using cURL
Hello
I am trying to scrape a site by tags using PHP's cURL library; see the code below.
I am running WAMP with the cURL extension enabled, but the feed writes nothing beyond the channel header. It never iterates through the nodes.
<?php
$url = 'http://www.thehoneycomb.com/default.cfm';
$title = 'The Honeycomb';
$description = 'Events';
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

header('Content-type: text/xml; charset=utf-8', true);
echo '<?xml version="1.0" encoding="UTF-8"?'.'>' . "\n";
echo '<rss version="2.0">' . "\n";
echo '<channel>' . "\n";
echo ' <title>' . $title . '</title>' . "\n";
echo ' <link>' . $url . '</link>' . "\n";
echo ' <description>' . $description . '</description>' . "\n";

$curl = curl_init($url);
curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
curl_setopt($curl, CURLOPT_AUTOREFERER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1 );
curl_setopt($curl, CURLOPT_TIMEOUT, 2 );
$html = curl_exec( $curl );
$html = @mb_convert_encoding($html, 'HTML-ENTITIES', 'utf-8');
curl_close( $curl );

$dom = new DOMDocument();
@$dom->loadHTML($html);
$nodes = $dom->getElementsByTagName('*');
$date = '';
$description = '';

foreach($nodes as $node){
    if($node->nodeName == 'p'){
        $date = strtotime($node->nodeValue);
    }
    if($node->nodeName == 'tr'){
        $inodes = $node->childNodes;
        foreach($inodes as $inode){
            if($inode->nodeName == 'a' && $inode->getAttribute('class') == 'permalink'){
                echo '<item>' . "\n";
                echo '<title>' . @mb_convert_encoding(htmlspecialchars($inode->getAttribute('title')), 'utf-8') . '</title>' . "\n";
                echo '<link>' . $inode->getAttribute('href') . '</link>' . "\n";
                echo '<description>' . $inode->getAttribute('td') . '</description>' . "\n";
                if($date){
                    echo '<pubDate>' . date(DATE_RSS, $date) . '</pubDate>' . "\n";
                }
                echo '</item>' . "\n";
            }
        }
    }
}
echo '</channel></rss>';
?>
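One likely reason the inner loop above never prints an item: childNodes on a tr returns only its direct children (normally the td cells), so an a nested inside a cell is never compared. The sketch below is not the accepted solution, which is hidden. It runs against made-up markup, and the sample HTML, the $items array and the $directAnchors counter are assumptions for illustration only. It shows the difference between childNodes and getElementsByTagName() called on the row.

```php
<?php
// Hypothetical markup standing in for the real page, which is not shown in the thread.
$html = '<table>'
      . '<tr><td><a class="permalink" href="/event/1" title="Jazz &amp; Blues">Jazz night</a></td></tr>'
      . '<tr><td><a href="/about">About</a></td></tr>'
      . '</table>';

libxml_use_internal_errors(true);   // buffer parser warnings rather than suppressing them with @
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$items = [];
$directAnchors = 0;

foreach ($dom->getElementsByTagName('tr') as $row) {
    // childNodes lists direct children only, so nested anchors are not found here.
    foreach ($row->childNodes as $child) {
        if ($child->nodeName === 'a') {
            $directAnchors++;
        }
    }

    // getElementsByTagName() on the row searches all of its descendants.
    foreach ($row->getElementsByTagName('a') as $a) {
        if ($a->getAttribute('class') !== 'permalink') {
            continue;
        }
        $items[] = [
            'title'       => $a->getAttribute('title'),
            'link'        => $a->getAttribute('href'),
            'description' => trim($a->textContent), // anchor text; an <a> has no "td" attribute
        ];
    }
}

// Escape values before writing them into the XML feed.
foreach ($items as $item) {
    echo "<item>\n";
    echo '<title>' . htmlspecialchars($item['title']) . "</title>\n";
    echo '<link>' . htmlspecialchars($item['link']) . "</link>\n";
    echo '<description>' . htmlspecialchars($item['description']) . "</description>\n";
    echo "</item>\n";
}
```

With this sample markup the direct-child check finds no anchors, while the descendant search finds the one permalink. Whether the real page nests its links the same way is an assumption that would need checking against its source.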
ASKER CERTIFIED SOLUTION
This solution is only available to members.
There is also no description available for the links (the <a href> tags) on that page.
Sorry, I found how to fix it.
Replace
@$td = $link['attributes']['td'];
with
@$td = strip_tags($link['contents']);
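To illustrate what that replacement changes, here is a minimal sketch. The $link array below is a hypothetical stand-in, since the real structure comes from the members-only solution and is not visible in the thread. Only the 'attributes' and 'contents' keys are taken from the lines above, and $fromAttribute is a name introduced just for this example.

```php
<?php
// Hypothetical stand-in for one parsed link; the real array layout is an assumption.
$link = [
    'attributes' => ['href' => '/event/1', 'class' => 'permalink'],
    'contents'   => '<b>Jazz night</b> at <i>9pm</i>',
];

// An <a> tag carries no "td" attribute, so the old lookup yields nothing.
$fromAttribute = $link['attributes']['td'] ?? '';

// The new line takes the markup between <a> and </a> and strips the tags from it.
$td = strip_tags($link['contents']);

// Escape the text before placing it in the feed.
echo '<description>' . htmlspecialchars($td) . '</description>' . "\n";
```

Using ?? instead of @ avoids the undefined-index notice without hiding other errors; that is a stylistic choice, not part of the original fix.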
It might be hard to work with that site, partly because its markup does not validate:
http://validator.w3.org/check?uri=http%3A%2F%2Fwww.thehoneycomb.com%2Fdefault.cfm&charset=%28detect+automatically%29&doctype=Inline&group=0
Do you own the content of the site? If so, you might want to hire a developer to get into the ColdFusion code and just write out a native RSS feed. If you do not own the site, please be careful about using the content in an RSS feed. Make sure you have permission to republish the information so you do not run afoul of copyright.
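If the scraping route is still taken, invalid markup does not necessarily stop DOMDocument from building a tree, but the parser warnings are worth looking at rather than silencing with @. The following is a rough sketch under assumptions: the malformed sample HTML and the $messages array are invented for illustration, and the real page may fail in different ways.

```php
<?php
// Deliberately malformed sample markup (unclosed <p>, <td> and <tr>); not the real page.
$html = '<table><tr><td><p>Unclosed paragraph<td>'
      . '<a class="permalink" href="/event/1">Jazz night</a></table>';

libxml_use_internal_errors(true);   // buffer libxml errors instead of emitting PHP warnings
$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);
$errors = libxml_get_errors();      // array of LibXMLError objects, possibly empty
libxml_clear_errors();

// Keep line numbers and messages so problem spots in the page can be inspected.
$messages = [];
foreach ($errors as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}

// The parser recovers where it can, so the anchor is still reachable.
$anchors = $dom->getElementsByTagName('a');
```

Logging $messages during development can show whether a missing item is due to the page's markup or to the traversal code.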
SOLUTION
This solution is only available to members.
Any progress? Any questions? Have you tried installing and running the code snippet I posted yet?
Best, ~Ray
Tested and working code here, with explanation of what is afoot: 12/19/09 10:48 AM, ID: 26087292
Cheers, ~Ray