Create RSS from web page using cURL

Hello,

I am trying to scrape a site by tag using the cURL library and DOMDocument.

See the code below. I am running WAMP and have the cURL extension enabled, but the feed does not write anything besides the channel header; it will not iterate through the nodes.


<?php

        $url = 'http://www.thehoneycomb.com/default.cfm';
        $title = 'The Honeycomb';
        $description = 'Events';

        $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

        header('Content-type: text/xml; charset=utf-8', true);

        echo '<?xml version="1.0" encoding="UTF-8"?'.'>' . "\n";
        echo '<rss version="2.0">' . "\n";
        echo '<channel>' . "\n";
        echo '  <title>' . $title . '</title>' . "\n";
        echo '  <link>' . $url . '</link>' . "\n";
        echo '  <description>' . $description . '</description>' . "\n";

        $curl = curl_init($url);
        curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($curl, CURLOPT_AUTOREFERER, true);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1 );
        curl_setopt($curl, CURLOPT_TIMEOUT, 2 );                

        $html = curl_exec( $curl );

        $html = @mb_convert_encoding($html, 'HTML-ENTITIES', 'utf-8');   

        curl_close( $curl );

        $dom = new DOMDocument();

        @$dom->loadHTML($html);

        $nodes = $dom->getElementsByTagName('*');

        $date = '';
        
        $description = '';

        foreach($nodes as $node){

                if($node->nodeName == 'p'){
                        $date =  strtotime($node->nodeValue);
                }

                if($node->nodeName == 'tr'){

                        $inodes = $node->childNodes;

                        foreach($inodes as $inode){

                                if($inode->nodeName == 'a' && $inode->getAttribute('class') == 'permalink'){
                                        echo '<item>' . "\n";
                                        echo '<title>' . @mb_convert_encoding(htmlspecialchars($inode->getAttribute('title')), 'utf-8') . '</title>' . "\n";
                                        echo '<link>' . $inode->getAttribute('href') . '</link>' . "\n";
                                        echo '<description>' . $inode->getAttribute('td') . '</description>' . "\n";
                                        if($date){
                                                echo '<pubDate>' . date(DATE_RSS, $date) . '</pubDate>' . "\n";
                                        }
                                        echo '</item>' . "\n";
                                }
                        }
                }
        }

        echo '</channel></rss>';

?>

Asked by dialektkid

CSecurityCommented:
Your page is complex and has a lot of nested tags, so you can't do this with a single childNodes pass; you need to use a regex-based extractor. Also, none of the links on that page have a title or date attribute. Test this:



<?php

function extract_tags( $html, $tag, $selfclosing = null, $return_the_entire_tag = false, $charset = 'ISO-8859-1' ){
 
	if ( is_array($tag) ){
		$tag = implode('|', $tag);
	}
 
	//If the user didn't specify if $tag is a self-closing tag we try to auto-detect it
	//by checking against a list of known self-closing tags.
	$selfclosing_tags = array( 'area', 'base', 'basefont', 'br', 'hr', 'input', 'img', 'link', 'meta', 'col', 'param' );
	if ( is_null($selfclosing) ){
		$selfclosing = in_array( $tag, $selfclosing_tags );
	}
 
	//The regexp is different for normal and self-closing tags because I can't figure out 
	//how to make a sufficiently robust unified one.
	if ( $selfclosing ){
		$tag_pattern = 
			'@<(?P<tag>'.$tag.')			# <tag
			(?P<attributes>\s[^>]+)?		# attributes, if any
			\s*/?>					# /> or just >, being lenient here 
			@xsi';
	} else {
		$tag_pattern = 
			'@<(?P<tag>'.$tag.')			# <tag
			(?P<attributes>\s[^>]+)?		# attributes, if any
			\s*>					# >
			(?P<contents>.*?)			# tag contents
			</(?P=tag)>				# the closing </tag>
			@xsi';
	}
 
	$attribute_pattern = 
		'@
		(?P<name>\w+)							# attribute name
		\s*=\s*
		(
			(?P<quote>[\"\'])(?P<value_quoted>.*?)(?P=quote)	# a quoted value
			|							# or
			(?P<value_unquoted>[^\s"\']+?)(?:\s+|$)			# an unquoted value (terminated by whitespace or EOF) 
		)
		@xsi';
 
	//Find all tags 
	if ( !preg_match_all($tag_pattern, $html, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE ) ){
		//Return an empty array if we didn't find anything
		return array();
	}
 
	$tags = array();
	foreach ($matches as $match){
 
		//Parse tag attributes, if any
		$attributes = array();
		if ( !empty($match['attributes'][0]) ){ 
 
			if ( preg_match_all( $attribute_pattern, $match['attributes'][0], $attribute_data, PREG_SET_ORDER ) ){
				//Turn the attribute data into a name->value array
				foreach($attribute_data as $attr){
					if( !empty($attr['value_quoted']) ){
						$value = $attr['value_quoted'];
					} else if( !empty($attr['value_unquoted']) ){
						$value = $attr['value_unquoted'];
					} else {
						$value = '';
					}
 
					//Passing the value through html_entity_decode is handy when you want
					//to extract link URLs or something like that. You might want to remove
					//or modify this call if it doesn't fit your situation.
					$value = html_entity_decode( $value, ENT_QUOTES, $charset );
 
					$attributes[$attr['name']] = $value;
				}
			}
 
		}
 

		$tag = array(
			'tag_name' => $match['tag'][0],
			'offset' => $match[0][1], 
			'contents' => !empty($match['contents'])?$match['contents'][0]:'', //empty for self-closing tags
			'attributes' => $attributes, 
		);
		if ( $return_the_entire_tag ){
			$tag['full_tag'] = $match[0][0]; 			
		}
 
		$tags[] = $tag;
	}
 
	return $tags;
}

        $url = 'http://www.thehoneycomb.com/default.cfm';
        $title = 'The Honeycomb';
        $description = 'Events';

        $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

        header('Content-type: text/xml; charset=utf-8', true);

        echo '<?xml version="1.0" encoding="UTF-8"?'.'>' . "\n";
        echo '<rss version="2.0">' . "\n";
        echo '<channel>' . "\n";
        echo '  <title>' . $title . '</title>' . "\n";
        echo '  <link>' . $url . '</link>' . "\n";
        echo '  <description>' . $description . '</description>' . "\n";

        $curl = curl_init($url);
        curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
        curl_setopt($curl, CURLOPT_AUTOREFERER, true);
        curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1 );
        curl_setopt($curl, CURLOPT_TIMEOUT, 100 );                

        $html = curl_exec( $curl );
        curl_close( $curl );


        $title = "";
        $href  = "";
        $td    = "";
        $date  = "2009/12/18";
        $nodes = extract_tags( $html, 'a' );

        foreach ($nodes as $link)
        {
                @$title = $link['attributes']['title'];
                @$href  = $link['attributes']['href'];
                @$td    = $link['attributes']['td'];

                echo '<item>' . "\n";
                echo '<title>' . $title . '</title>' . "\n";
                echo '<link>' . $href . '</link>' . "\n";
                echo '<description>' . $td . '</description>' . "\n";
                if ($date)
                {
                        echo '<pubDate>' . $date . '</pubDate>' . "\n";
                }
                echo '</item>' . "\n";
        }

?>
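For reference, this is roughly the array shape extract_tags() returns for a single anchor. A minimal sketch with made-up markup (not taken from the actual site), assuming the function defined above is in scope:

<?php
// Sketch: inspect what extract_tags() returns for one hand-written anchor.
$links = extract_tags('<a class="permalink" href="/events/123" title="Holiday Party">Dec 18 at the Hive</a>', 'a');

var_dump($links[0]['tag_name']);   // "a"
var_dump($links[0]['attributes']); // array with class, href and title keys
var_dump($links[0]['contents']);   // "Dec 18 at the Hive"
?>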

dialektkid (Author) Commented:
Nothing is being written to the description field. The link field looks good! For the pub date I can just use the server time.
 
CSecurityCommented:
There is also no description available for the links (the <a href> tags) on that page.

 
CSecurityCommented:
Sorry, I found how to fix it.

Replace
@$td = $link['attributes']['td'];

with
@$td = strip_tags($link['contents']);
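
To illustrate why that works: an <a> tag has no td attribute, so $link['attributes']['td'] is always empty; the visible link text comes back under 'contents' instead, and strip_tags() removes any markup inside it. A quick sketch with made-up markup, again assuming the extract_tags() helper above:

<?php
// Sketch: 'td' is not an attribute of <a>, so it never shows up in the attributes array.
// The link text is returned under 'contents'; strip_tags() drops any tags inside it.
$links = extract_tags('<a class="permalink" href="/events/123">Live music, <b>9pm</b></a>', 'a');

echo isset($links[0]['attributes']['td']) ? 'td set' : 'td not set'; // td not set
echo "\n";
echo strip_tags($links[0]['contents']); // Live music, 9pm
?>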
 
Ray PaseurCommented:
It might be hard to work with that site, partially because of this:

http://validator.w3.org/check?uri=http%3A%2F%2Fwww.thehoneycomb.com%2Fdefault.cfm&charset=%28detect+automatically%29&doctype=Inline&group=0

Do you own the content of the site? If so, you might want to hire a developer to get into the ColdFusion code and just write out a native RSS feed. If you do not own the site, please be careful about using the content in an RSS feed: make sure you have permission to republish the information so you do not run afoul of copyright.
 
Ray PaseurCommented:
This will scrape the HTML and create an RSS feed.  Usually when I design a site, I try to think ahead about what kind of information I would present in RSS, and aggregate that information somewhere that makes RSS publishing easy.  From the look of the HTML here, that thought process was never undertaken.  As you can see, it is easy enough to scrape useful data out of the stuff produced by the CF machine.  But this is a "brittle" implementation since a change in the HTML might break the code we used to extract data.  That's why I think a better approach is to write the native RSS feed directly from the CF code.

More on RSS here:
http://cyber.law.harvard.edu/rss/rss.html

Best regards, ~Ray
<?php // RAY_temp_honeycomb_rss.php
error_reporting(E_ALL);
 
// SCRAPE HTML AND CREATE RSS
 
// THE URL
$url = 'http://www.thehoneycomb.com/default.cfm';
 
// LIE TO THE WEB SITE - APPEAR TO BE GOOGLE??
$userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

// TRY TO READ THE WEB SITE
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
curl_setopt($curl, CURLOPT_AUTOREFERER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1 );
curl_setopt($curl, CURLOPT_TIMEOUT, 2 );
$htm = curl_exec($curl);
$err = curl_errno($curl);
curl_close($curl);
if ($htm === FALSE) die("CURL FAIL: $url TIMEOUT=2, CURL_ERRNO=$err");
 
 
// TRIMMING OFF THE UNWANTED INFORMATION FROM THE MALFORMED HTML
$str = 'iconWHITEspace.gif';
$poz = strpos($htm, $str);
$htm = substr($htm, $poz);
 
$str = '</TABLE>';
$poz = strpos($htm, $str);
$htm = substr($htm, 0, $poz);
 
// USE THE NIGHT ICON AS A SEPARATOR TO MAKE AN ARRAY OF DATES AND EVENTS
$arr = explode('iconNight.gif', $htm);
unset($arr[0]);
 
 
// TIDY UP EACH ELEMENT OF THE ARRAY
foreach ($arr as $ptr => $txt)
{
   // FIND THE FRONT
   $poz = strpos($txt, '<SPAN CLASS="textHeaderBold">');
   $arr[$ptr] = substr($txt, $poz);
 
   // FIND THE END
   $poz = strpos($arr[$ptr], '<A HREF="events/default.cfm">');
   $arr[$ptr] = substr($arr[$ptr],0,$poz);
 
   // REMOVE THE WHITESPACE
   $arr[$ptr] = str_replace('&nbsp;', ' ', $arr[$ptr]);
   $arr[$ptr] = preg_replace('/\s\s+/', ' ', $arr[$ptr]);
   $arr[$ptr] = trim($arr[$ptr]);
}
// ACTIVATE THIS TO VISUALIZE WHAT WE HAVE DONE
// var_dump($arr);
 
 
// CREATE THE RSS ARRAY
$rss = array();
foreach ($arr as $ptr => $txt)
{
   // END-SPAN IS A USEFUL DELIMITER
   $xyz = explode('</SPAN>', $txt);
   $abc = trim(strip_tags($xyz[0]));
   $abc = str_replace('<br>', '', $abc);
   $abc = htmlentities($abc);
   $rss[$ptr]["title"] = $abc;
 
   // LEFT-TRIM TO THE FIRST BREAK
   $poz = strpos($xyz[1], '<br>');
   $abc = substr($xyz[1], $poz);
 
   // TIDY UP
   $abc = trim($abc, '<br>');
   $abc = str_replace('<br><br>', '<br>', $abc);
   $abc = htmlentities($abc);
   $rss[$ptr]["descr"] = $abc;
}
// ACTIVATE THIS TO VISUALIZE THE ARRAY
// var_dump($rss);
 
 
// TOP OF THE RSS FEED
$title = 'The Honeycomb.com';
$descr = 'Upcoming Events';
$pdate = date(DATE_RSS);
 
// WRITE THE RSS HEADER AND TOP OF THE FEED
// MAN PAGE: http://cyber.law.harvard.edu/rss/rss.html
header('Content-type: text/xml; charset=utf-8', true);
 
echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
echo '<rss version="2.0">'                    . "\n";
echo '<channel>'                              . "\n";
 
echo '  <title>'       . $title . '</title>'       . "\n";
echo '  <link>'        . $url   . '</link>'        . "\n";
echo '  <description>' . $descr . '</description>' . "\n";
echo '  <pubDate>'     . $pdate . '</pubDate>'     . "\n";
 
// INSERT EACH ITEM
foreach ($rss as $ptr => $itm)
{
    echo '   <item>' . "\n";
    echo '      <title>' .       $itm["title"] . '</title>' . "\n";
    echo '      <description>' . $itm["descr"] . '</description>' . "\n";
    echo '   </item>' . "\n";
}
 
// WRAP UP THE FEED
echo '</channel></rss>';
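
As a quick sanity check once the script is installed, the generated feed can be parsed back with SimpleXML. A sketch only; the filename and localhost URL below are assumptions (adjust them to your setup), and it assumes allow_url_fopen is enabled:

<?php
// Sketch: fetch the feed produced by the script above and confirm it parses as XML.
// The URL is an assumption; point it at wherever you installed the script.
$feedUrl = 'http://localhost/RAY_temp_honeycomb_rss.php';

$xml = simplexml_load_file($feedUrl);
if ($xml === false) die("Feed at $feedUrl did not parse as XML");

// List the channel title and each item title to eyeball the output.
echo $xml->channel->title . PHP_EOL;
foreach ($xml->channel->item as $item)
{
    echo ' * ' . $item->title . PHP_EOL;
}
?>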

 
Ray PaseurCommented:
Any progress?  Any questions?  Have you tried installing and running the code snippet I posted yet?

Best, ~Ray
 
Ray PaseurCommented:
Tested and working code here, with explanation of what is afoot: 12/19/09 10:48 AM, ID: 26087292

Cheers, ~Ray