Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium

x
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 3744
  • Last Modified:

RSS Feed to Mysql Database

Hi,

I am using this script from Daniel Iversen to save RSS feeds from the following URL's:
http://localhost/test/index_clean.php?feed_url=http://feeds.feedburner.com/thewhir_magazine&access_key=youraccesskey

and

http://localhost/test/index_clean.php?feed_url=http://rss.cnn.com/rss/cnn_topstories.rss&access_key=youraccesskey

What I am finding is that this script does not store all the feeds. So if there are 15 feed items it sometimes just stores 10. It varies. Am I doing something wrong?

Or is there a better script that will allow me to store RSS feeds to a database?

<?php

/*

==================================================================
==================================================================
RSS Ingest v 0.1
- by Daniel Iversen, daniel (a dot) iversen (the at sign) gmail (another dot) com
------------------------------------------------------------------

INTRO:
- - - - - - - - -
 - Just a tiny primitive PHP script that you can run on a cron to ingest
   an RSS feed into a flat database table.

 - You can run the PHP many times with different RSS URLs

 - Its so simple to use; 1 php file, 4 parameters to change and you are done

 - Its super simple code - you can easily fiddle with it

 - If you find it useful, please drop a mail ;)


INSTALLATION AND USAGE:
- - - - - - - - - - - - - - -
 - Well its free, no license or warranty for this tool - if your house burns or your cat gets sick don not blame  me ;)
 - Change the few variables in this PHP (db connection info and access key/password)
	The 4 variables you need to look for in this PHP (usually grouped) are;

	$db_hostname="     - ....your db hostname here - maybe 'localhost' is enough...
	$db_username="     - ....your db username here...
	$db_password="     - ....your db password here...
	mysql_select_db("  - ....your db name here...

$private_access_key="youraccesskey";

 - create table using SQL below
 - create table indexes if needed
 - upload PHP to your server
 - hit the PHP with the URL:
     http://<your server>/<RSSIngest path>/<PHP file>?feed_url=<your RSS URL - must stay the same as its a key>&access_key=<secret key>
	 e.g.
     http://nexle.dk/rssingest/index.php?feed_url=http://www.instapaper.com/starred/rss/580483/qU7TKdkHYNmcjNJQSMH1QODLc&access_key=ThisIsNotMyKey
 - watch for errors and things
 - if all is o.k., install the PHP URL in a cron job (e.g. the free setcronjob.com) and relax ;)
 - If you do find it useful, please shoot me a mail, would love to know
 - If you make bug fixes, please feel free to send back for everyones benefit



CREATE TABLE SQL:
- - - - - - - - - - - - - - -
	CREATE TABLE  `rssingest` (
	 `item_id` VARCHAR( 32 ) NOT NULL ,
	 `feed_url` VARCHAR( 512 ) NOT NULL ,
	 `item_content` VARCHAR( 4000 ) NULL ,
	 `item_title` VARCHAR( 255 ) NOT NULL ,
	 `item_date` TIMESTAMP NOT NULL ,
	 `item_url` VARCHAR( 512 ) NOT NULL ,
	 `item_status` CHAR( 2 ) NOT NULL ,
	 `item_category_id` INT NULL ,
	 `fetch_date` TIMESTAMP NOT NULL
	) ENGINE = MYISAM ;



NOTES:
- - - - -
 - Only tested on MySQL
 - Done in an hour so there could be lots of bugs
 - Only tested with my particular RSS feed that I needed to ingest (Instapaper)
 - No attempt to prevent SQL injection but should be o.k. since its password protected and for your own use only


==================================================================
==================================================================

*/
//require_once("dbCon/dbcon.php");

$db_hostname="localhost";
$db_username="root";
$db_password="";

$private_access_key="youraccesskey";

// Check a few bits and pieces

if(isset($_GET['feed_url']))
{
	$feed_url = $_GET['feed_url'];
}
else
{
	die("Need to pass the (consistent) 'feed url'");
}


if(isset($_GET['access_key']))
{

	if($_GET['access_key']==$private_access_key)
	{
		echo "Access key correct, proceeding...<br/><br/>";
	}
	else
	{
		die("wrong access key");
	}
}
else
{
	die("Need to pass the 'access_key' URL parameter");
}


try
{
	/*  query the database */
	// $db = getCon();

	$db = mysql_connect($db_hostname,$db_username,$db_password);
	if (!$db)
	{
		die("Could not connect: " . mysql_error());
	}
	mysql_select_db("test", $db); // Nainil

	echo "Starting to work with feed URL '" . $feed_url . "'";

	/* Parse XML from  http://www.instapaper.com/starred/rss/580483/qU7TKdkHYNmcjNJQSMH1QODLc */
	//$RSS_DOC = simpleXML_load_file('http://www.instapaper.com/starred/rss/580483/qU7TKdkHYNmcjNJQSMH1QODLc');

	libxml_use_internal_errors(true);
	$RSS_DOC = simpleXML_load_file($feed_url);
	if (!$RSS_DOC) {
		echo "Failed loading XML\n";
		foreach(libxml_get_errors() as $error) {
			echo "\t", $error->message;
		}
	}


	/* Get title, link, managing editor, and copyright from the document  */
	$rss_title = $RSS_DOC->channel->title;
	$rss_link = $RSS_DOC->channel->link;
	$rss_editor = $RSS_DOC->channel->managingEditor;
	$rss_copyright = $RSS_DOC->channel->copyright;
	$rss_date = $RSS_DOC->channel->pubDate;

	//Loop through each item in the RSS document

	foreach($RSS_DOC->channel->item as $RSSitem)
	{

		$item_id 	= md5($RSSitem->title);
		$fetch_date = date("Y-m-j G:i:s"); //NOTE: we don't use a DB SQL function so its database independant
		$item_title = $RSSitem->title;
		$item_content = $RSSitem->description;
		$item_date  = date("Y-m-j G:i:s", strtotime($RSSitem->pubDate));
		$item_url	= $RSSitem->link;

		echo "Processing item '" , $item_id , "' on " , $fetch_date 	, "<br/>";
		echo $item_title, " - ";
		echo $item_date, "<br/>";
		echo $item_url, "<br/>";

		// Does record already exist? Only insert if new item...

		$item_exists_sql = "SELECT item_id FROM rssingest where item_id = '" . $item_id . "'";
		$item_exists = mysql_query($item_exists_sql, $db);
		if(mysql_num_rows($item_exists)<1)
		{
			echo "<font color=green>Inserting new item..</font><br/>";
			$item_insert_sql = "INSERT INTO rssingest(item_id, feed_url, item_title, item_content, item_date, item_url, fetch_date) VALUES ('" . $item_id . "', '" . $feed_url . "', '" . $item_title . "', '" . $item_content . "', '" . $item_date . "', '" . $item_url . "', '" . $fetch_date . "')";
			$insert_item = mysql_query($item_insert_sql, $db);
		}
		else
		{
			echo "<font color=blue>Not inserting existing item..</font><br/>";
		}

		echo "<br/>";
	}

	// End of form //
} catch (Exception $e)
{
    echo 'Caught exception: ',  $e->getMessage(), "\n";
}
?>

Open in new window

0
nainil
Asked:
nainil
  • 3
  • 2
  • 2
4 Solutions
 
hieloCommented:
That would happen when RSS items "collide" due to having EXACTLY the same title.  You can try changing:
$item_id       = md5($RSSitem->title);

to:
$item_id       = md5($RSSitem->title . $RSSitem->description);

that way an item will NOT be inserted only if both Title AND Description are identical.
0
 
Ray PaseurCommented:
Wow, couple of red flags here:


- Well its free, no license or warranty for this tool - if your house burns or your cat gets sick don not blame  me ;)
- Done in an hour so there could be lots of bugs
- Only tested with my particular RSS feed that I needed to ingest (Instapaper)


Why would you choose something with prominent warnings like that?!

Anyway, I cannot see what might be wrong, but a conditional statement that  excludes some of the inserts might be the first place I would put diagnostic code.
0
 
nainilAuthor Commented:
@hielo : Thanks. That didn't help much. I still cannot get all feeds to populate in the DB. The strange part is that it does display all the feeds on the page but it does not insert all into the db.

@ray_paseur: Ray I agree that there are red flags. However it's always easy to use code that exists than develop something new in the time that one has. I'll put some diagnostic code to find what is going wrong.

However if you have some other suggestions or script that can help please do share. Thanks.
0
Concerto Cloud for Software Providers & ISVs

Can Concerto Cloud Services help you focus on evolving your application offerings, while delivering the best cloud experience to your customers? From DevOps to revenue models and customer support, the answer is yes!

Learn how Concerto can help you.

 
Ray PaseurCommented:
I would add error_reporting(E_ALL) and try some data visualization, like var_dump($RSS_Doc).  I think hielo is on the right track, and I would also add the pubDate field to the md5() message digest.  It seems possible to me that the same title and description could be published repeatedly with a text that said something like "Breaking news update..."  However if the publishers did this it seems unlikely that the updates would have the same timestamp as the original article.
0
 
hieloCommented:
>>I still cannot get all feeds to populate in the DB.
are the items that are NOT inserted already in the DB?  In other words, are you seeing "Not inserting existing item.." for the "failed" insertions? If so, then the reason is that it finds a duplicate item in the db? Do you actually want duplicates?

Also, change:
$item_insert_sql = "INSERT INTO rssingest(item_id, feed_url, item_title, item_content, item_date, item_url, fetch_date) VALUES ('" . $item_id . "', '" . $feed_url . "', '" . $item_title . "', '" . $item_content . "', '" . $item_date . "', '" . $item_url . "', '" . $fetch_date . "')";

Open in new window

to:
$item_insert_sql = "INSERT INTO rssingest(item_id, feed_url, item_title, item_content, item_date, item_url, fetch_date) VALUES ('" . mysql_real_escape_string($item_id) . "', '" . mysql_real_escape_string($feed_url) . "', '" . mysql_real_escape_string($item_title) . "', '" . mysql_real_escape_string($item_content) . "', '" . mysql_real_escape_string($item_date) . "', '" . mysql_real_escape_string($item_url) . "', '" . mysql_real_escape_string($fetch_date) . "')";

Open in new window

0
 
Ray PaseurCommented:
A few last thoughts then I'll sign off on this question.  You might want to add an AUTO_INCREMENT key to the table rssingest.  You might also want to mark the item_id column UNIQUE.  Then if you try to insert a duplicate, MySQL will throw error #1062.  You can easily trap this and you will know that you had one of those md5() collisions.

A die() statement before line 142 might make sense.  If your script ever gets there, the values in $RSS_DOC are not object properties, but FALSE.

Install a copy of phpMyAdmin to make data visualization easier.

Also, this feed has just 10 <item> elements.
http://www.instapaper.com/starred/rss/580483/qU7TKdkHYNmcjNJQSMH1QODLc
0
 
nainilAuthor Commented:
I guess I figured out the issue.

It was the mysql insert statement that was causing special characters to break the insert statement and hence the query was failing. Thanks for your help.
0

Featured Post

Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

  • 3
  • 2
  • 2
Tackle projects and never again get stuck behind a technical roadblock.
Join Now