PHP: preg_match - grab RSS feed from <head> data

I need help with REGEX.

I need to grab the RSS feed of a given website from the <head> tag.

For example:

http://digg.com

 <link rel="alternate" type="application/rss+xml" title="front page stories in rss" href="/rss/index.xml"/>

I want:

/rss/index.xml
jpschreibman Asked:

BenMorel Commented:
Hi, this should work ;)
Ben
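function grabRSS($html)
{
	// Match the first <link> tag whose type is application/rss+xml and capture its href.
	// Note: the pattern expects the type attribute to appear before href.
	if (preg_match('!<link[^>]+type=["\']application/rss\\+xml["\'][^>]+href=["\']([^"\'>]+)["\'][^>]*>!i', $html, $out))
	{
		return $out[1];
	}

	return false;
}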

jpschreibman (Author) Commented:
I am still having trouble:

The function below should return the full url of the rss feed.

It works when I try reddit.com, but fails at slashdot.org. What gives?
// grab RSS feed from website <head>
preg_match('!<link[^>]+type=["\']application/rss\\+xml["\'][^>]+href=["\']([^"\'>]+)["\'][^>]*>!i', $remote_page, $matches); 
$rss_url = $matches[1];
$check = strpos($rss_url, $url);
if ($check === false) { $rss_url = $url . $rss_url; }
print $rss_url; 

BenMorel Commented:
Sorry about that. It fails because the attributes are not in the same order on every site.
That seems difficult to handle in a single regexp, so I'd rather use this function, which works in both cases:
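function grabRSS($html)
{
	// Capture the attribute string of every <link ...> tag.
	preg_match_all('!<link([^>]*)>!i', $html, $link);
	
	foreach ($link[1] as $attributes)
	{
		// Keep only tags whose type is application/rss+xml, wherever it appears.
		if (preg_match('!type=["\']application/rss\\+xml["\']!i', $attributes))
		{
			// Return the href of the first matching tag.
			if (preg_match('!href=["\']([^"\']+)["\']!i', $attributes, $href))
			{
				return $href[1];
			}
		}
	}

	return false;
}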
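
The question also asks for the full URL of the feed, while the href is often relative (as in the digg.com example). A minimal sketch of that step follows, assuming the page URL is known; absoluteFeedUrl() is a hypothetical helper name, not part of the thread, and it only covers the common cases (absolute, protocol-relative, root-relative, and directory-relative hrefs).

```php
<?php
// Hypothetical helper (not from the thread): turn the href found in the
// <link> tag into a full URL, given the URL of the page it came from.
function absoluteFeedUrl($pageUrl, $href)
{
    // Already absolute (http://... or https://...): use it as is.
    if (preg_match('!^https?://!i', $href)) {
        return $href;
    }

    $parts = parse_url($pageUrl);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false; // cannot build a base without scheme and host
    }

    $base = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $base .= ':' . $parts['port'];
    }

    // Protocol-relative href such as //example.com/feed.xml
    if (strpos($href, '//') === 0) {
        return $parts['scheme'] . ':' . $href;
    }

    // Root-relative href such as /rss/index.xml
    if (strpos($href, '/') === 0) {
        return $base . $href;
    }

    // Anything else is treated as relative to the page's directory.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $base . $dir . $href;
}
```

Unlike a strpos() check for the site URL inside the href, this also handles feeds hosted on a different domain, since any href that already starts with http:// or https:// is returned unchanged.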

b0lsc0tt (IT Manager) Commented:
Do you need to check the type value or any other attributes in the tag?  It sounds like you just want the href value, so your code and expression could be simplified.

If you are interested, let me know exactly what you need from the tag above.  It sounds like the attribute order may change, but I can look at another way to do it with a little less code, for example using alternation and maybe even named groups.

bol
jpschreibman (Author) Commented:
Thanks again.
BenMorel Commented:
In fact, it could be done easily in a single preg_match if we needed only one attribute: the href.
However, <link> tags are also used for other types of files, CSS for example.
So we need to grab the type and check that it is application/rss+xml.
A regexp is order-sensitive, so we would need to handle both type then href and href then type.

So my code grabs all <link> tags, checks the type of each one until it finds the correct one, then grabs the href.
Maybe it could be simplified a bit, but not much, I'm afraid :)
The "clean" way to do that would probably be parsing the file with DOM.
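
One possible simplification, offered as a sketch rather than as the method used in this thread: a lookahead assertion lets a single pattern test for the type attribute without fixing its position relative to href. The function name grabRSSLookahead is made up for illustration, and the pattern still assumes quoted attribute values.

```php
<?php
// Sketch of a single-pattern variant (an assumption, not the thread's code):
// the lookahead checks the type attribute anywhere in the tag without
// consuming it, so href can come before or after type.
function grabRSSLookahead($html)
{
    $pattern = '!<link\b(?=[^>]*\btype=["\']application/rss\\+xml["\'])'
             . '[^>]*\bhref=["\']([^"\']+)["\']!i';

    if (preg_match($pattern, $html, $out)) {
        return $out[1];
    }

    return false;
}
```

Because `[^>]*` never crosses a `>`, the lookahead only inspects the current tag, so a stylesheet <link> that precedes the feed <link> is skipped rather than matched.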

Regards,
Ben
jpschreibman (Author) Commented:
Ben:

Could you send me a link to more info about the DOM method?

I would like to learn more.

Josh
BenMorel Commented:
More info can be found on php.net:
http://www.php.net/dom

However, you may prefer to begin with a tutorial.
The disadvantage of DOM for HTML files is that it may have trouble loading real-world HTML, which often contains a lot of errors. It is better suited to valid XML streams. The web is not "clean" :)
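
For readers who want to try the DOM route, here is a minimal sketch using PHP's DOMDocument. The function name grabRSSDom is hypothetical. DOMDocument::loadHTML() relies on libxml's HTML parser, which in practice recovers from a good deal of malformed markup and reports the problems as warnings; the sketch collects those warnings with libxml_use_internal_errors() instead of printing them. How well it copes with a particular broken page is not guaranteed.

```php
<?php
// Sketch only: grabRSSDom() is a hypothetical name, not from the thread.
function grabRSSDom($html)
{
    // Collect libxml parse warnings instead of printing them; real-world
    // pages are rarely valid, and loadHTML() tries to recover anyway.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return false;
    }

    foreach ($doc->getElementsByTagName('link') as $link) {
        // Attribute order does not matter here; compare type case-insensitively.
        if (strtolower($link->getAttribute('type')) === 'application/rss+xml'
            && $link->getAttribute('href') !== '') {
            return $link->getAttribute('href');
        }
    }

    return false;
}
```

Since the parser exposes attributes by name, the type-before-href versus href-before-type problem from the regexp versions does not arise.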

Ben