Reading XML Namespaces using PHP Without regex.

Chris Harte, 2015 Top Expert (Most Article Points)
CERTIFIED EXPERT
A developer with over twenty years experience. Started on mainframes and Cobol, now on mobiles and Android.
There are a number of people out there who will tell you that the only way to parse an RSS feed containing namespaces is to use regular expressions. They are wrong and, frankly, should know better. In this article I am going to show you how to parse an RSS feed using standard PHP libraries. Why namespaces are used in XML files is outside the scope of this document; I am just going to show you how to read them. This article assumes you already know how to code in PHP but are having difficulty extracting data from an RSS feed.

There are many functions in standard PHP for dealing with XML. I am going to use SimpleXML because I find it the easiest. Other libraries give you more control over the information gathered from the feed, but when all you want to do is read the content of an RSS feed, SimpleXML will do everything you need. I am using a specific feed that conforms to the standards, but the methods discussed in this article can be applied to any feed.

The supposed problem.
This is the source of a genuine RSS feed. It contains the usual suspects of channel, title, description, item etc. It also contains namespaces and values stored as attributes.
<?xml version="1.0"?>
<rss version="2.0"
     xmlns:media="http://search.yahoo.com/mrss/"
     xmlns:dcterms="http://purl.org/dc/terms/"
     xmlns:pbscontent="http://www.pbs.org/rss/pbscontent/"
     xmlns:pbsvideo="http://www.pbs.org/rss/pbsvideo/">
<channel>
    <title>The Local Show | PBS Video</title>
    <description>The Local Show RSS feed for PBS programming.</description>
    <link>http://video.pbs.org</link>
    <language>en-us</language>
    <generator>http://video.pbs.org</generator>
    <lastBuildDate>Fri, 15 Mar 2013 10:16:35 -0400</lastBuildDate>
    <pubDate>Fri, 15 Mar 2013 10:16:35 -0400</pubDate>
    <item>
        <title>The Local Show | KC Makers, Celebrating Extraordinary Women</title>
        <link>http://video.pbs.org/video/2338801013/</link>
        <description>This week, we celebrate the achievements of just a few of the extraordinary women who live in the Metro.</description>
        <guid>http://video.pbs.org/video/2338801013/</guid>
        <pubDate>02/25/2013</pubDate>
        <media:description>The Local Show celebrates Kansas City&#39;s Makers: Women Who Make America.</media:description>
        <media:content medium="video" duration="1611000" />
        <media:thumbnail url="http://pbs.merlin.cdn.prod.s3.amazonaws.com/Video%20Asset/KCPT/local-show/70943/images/567745_ThumbnailCOVEDefault_20130225174212.jpg.resize.142x80.jpg"
                         type="image/jpeg" height="60" width="142" />
        <media:rating scheme="urn:v-chip">nr</media:rating>
        <media:player url="http://video.pbs.org/video/2338801013/" />
        <category domain="PBS/taxonomy/topic">Arts &amp; Entertainment</category>
        <media:category scheme="http://www.pbs.org/rss/pbscontent/taxonomy/topic">Arts &amp; Entertainment</media:category>
        <category domain="PBS/taxonomy/topic">Culture &amp; Society</category>
        <media:category scheme="http://www.pbs.org/rss/pbscontent/taxonomy/topic">Culture &amp; Society</media:category>
        <category domain="PBS/taxonomy/topic">Health</category>
        <media:category scheme="http://www.pbs.org/rss/pbscontent/taxonomy/topic">Health</media:category>
        <category domain="PBS/taxonomy/topic">News &amp; Public Affairs</category>
        <media:category scheme="http://www.pbs.org/rss/pbscontent/taxonomy/topic">News &amp; Public Affairs</media:category>
        <category domain="PBS/taxonomy/topic">Parents</category>
        <media:category scheme="http://www.pbs.org/rss/pbscontent/taxonomy/topic">Parents</media:category>
        <category domain="PBS/taxonomy/topic">Technology</category>
        <media:category scheme="http://www.pbs.org/rss/pbscontent/taxonomy/topic">Technology</media:category>
        <pbsvideo:content_type>Episode</pbsvideo:content_type>
    </item>
[...Many more items]
</channel>
</rss>

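Just so there is no misunderstanding, a namespaced tag is an XML tag whose name contains a colon ':'. The part before the colon is a prefix, and each prefix is bound to a namespace URI by an xmlns: declaration on the <rss> tag. In this example the first namespaced tag is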
<media:description></media:description>

This has the value
The Local Show celebrates Kansas City&#39;s Makers: Women Who Make America.
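
To make the link between prefix and URI concrete, here is a minimal sketch. It assumes a cut-down inline sample in place of the live feed (the $sample string and the variable names are illustrative stand-ins, not the PBS data) and uses the standard getDocNamespaces() method to list the prefixes declared on the root tag.

```php
<?php
// Cut-down stand-in for the feed above (an assumption for illustration only).
$sample = <<<'XML'
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Example item</title>
      <media:description>Example media description</media:description>
    </item>
  </channel>
</rss>
XML;

$doc = simplexml_load_string($sample);

// Map of prefix => namespace URI, as declared on the root element.
$declared = $doc->getDocNamespaces();

foreach ($declared as $prefix => $uri) {
    echo "prefix '$prefix' is bound to $uri\n";
}
```

The URI is what identifies the namespace; the prefix is only a local shorthand that the feed author chose.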

The <media:content> tag contains no data, but does have two attributes, medium and duration
<media:content medium="video" duration="1611000" />

These are not read the same way as values, but can be read using a standard library method, which I will cover later. As for the code, let us start at the beginning. Get the URL, store the contents in a variable using file_get_contents, then convert them to a SimpleXMLElement object using simplexml_load_string.
$url = "http://video.pbs.org/program/local-show/rss/";

$contents = file_get_contents($url);

$xml = simplexml_load_string($contents);
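
Both file_get_contents and simplexml_load_string return false on failure, so it can help to check the result before looping over it. The sketch below is one possible way to do that, not part of the original listing: load_feed_string is a hypothetical helper name, and the two inline strings stand in for downloaded feed contents.

```php
<?php
// Hypothetical helper (the name is illustrative): parse an XML string and
// report parse problems as plain messages instead of PHP warnings.
function load_feed_string($contents)
{
    // Collect libxml errors internally rather than printing warnings.
    $previous = libxml_use_internal_errors(true);
    $xml = simplexml_load_string($contents);

    if ($xml === false) {
        foreach (libxml_get_errors() as $error) {
            echo "XML error on line {$error->line}: " . trim($error->message) . "\n";
        }
    }

    // Restore the previous error-handling state.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $xml; // SimpleXMLElement on success, false on failure
}

// Stand-ins for downloaded contents: one well formed, one with a missing close tag.
$good = load_feed_string('<rss version="2.0"><channel><title>Demo</title></channel></rss>');
$bad  = load_feed_string('<rss><channel></rss>');

echo $good === false ? "good: failed\n" : "good: parsed, title = " . $good->channel->title . "\n";
echo $bad === false ? "bad: rejected\n" : "bad: parsed\n";
```

Keeping the download and the parse as separate steps means each one can be checked on its own, which is the debugging advantage of the two-step approach.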

The manual says you can load straight from the URL with simplexml_load_file, but I prefer fetching the contents with file_get_contents first. The extra step costs next to nothing and is easier to debug, should I have to. This XML is a well-formed RSS feed and consists of the tags <channel> and <item>. To read the data within them we use nested foreach constructs
foreach ($xml->channel as $channel)
{
    foreach ($channel->item as $item)
    {
        foreach ($item as $feed)
        {
            echo "The feed : " . $feed . "<br />";
        }
    }
}
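
As a self-contained check of what a plain loop over an item sees, the sketch below runs the same kind of iteration against a small inline sample and records the tag names it visits. The $sample string and the $seen array are assumptions for illustration; getName() is the standard SimpleXML method that returns a tag name without its prefix.

```php
<?php
// Cut-down stand-in for one feed item (illustrative data only).
$sample = <<<'XML'
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Example item</title>
      <link>http://example.com/1</link>
      <media:rating scheme="urn:v-chip">nr</media:rating>
    </item>
  </channel>
</rss>
XML;

$doc  = simplexml_load_string($sample);
$seen = array();

foreach ($doc->channel->item as $item) {
    // Iterating an element directly visits only its un-prefixed children.
    foreach ($item as $tag) {
        $seen[] = $tag->getName();
        echo $tag->getName() . " = " . $tag . "\n";
    }
}
```

Only title and link are visited; the media:rating tag is skipped because it belongs to a namespace.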

The nested foreach loop echoes all the data in the standard tags but none of the namespaced tags or any of the attributes. To get to the media: namespace we use the getNamespaces() and children() methods
foreach ($xml->channel as $channel)
{
    foreach ($channel->item as $item)
    {
        $ns = $item->getNamespaces(true);  // Apply method to <item> tag

        $child = $item->children($ns["media"]);  // Extract the "media:" namespace

        foreach ($item as $feed)
        {
            echo "The feed : " . $feed . "<br />";
        }
    }
}

Then add an extra foreach to handle the new namespace variables
foreach ($xml->channel as $channel)
{
    foreach ($channel->item as $item)
    {
        $ns = $item->getNamespaces(true);

        $child = $item->children($ns["media"]);

        foreach ($item as $feed)
        {
            echo "The feed : " . $feed . "<br />";
        }

        foreach ($child as $name)
        {
            echo "the name space : " . $name . "<br />";  // Output namespace values
        }
    }
}
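
The children() method also accepts the prefix itself when its second argument is true, which saves the lookup through getNamespaces(). The sketch below compares the two calls on an inline sample; the $sample string and the variable names are illustrative assumptions rather than part of the original listing.

```php
<?php
// Cut-down stand-in for one feed item (illustrative data only).
$sample = <<<'XML'
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Example item</title>
      <media:description>Example media description</media:description>
      <media:rating scheme="urn:v-chip">nr</media:rating>
    </item>
  </channel>
</rss>
XML;

$doc  = simplexml_load_string($sample);
$item = $doc->channel->item;

// Variant 1: look the URI up from the prefix, then pass the URI.
$ns    = $item->getNamespaces(true);
$byUri = $item->children($ns['media']);

// Variant 2: pass the prefix and set the second argument to true.
$byPrefix = $item->children('media', true);

echo "by URI    : " . $byUri->description . "\n";
echo "by prefix : " . $byPrefix->description . "\n";
```

Passing the URI is the more robust choice if a feed might use a different prefix for the same namespace; passing the prefix is shorter when the prefix is known to be stable.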

The second foreach echoes the values of the namespaced tags. Some of the tags, however, contain no data and will echo blank lines to the screen. But they do have attributes. These can be accessed with the attributes() method, and they have to be accessed directly. For example, to get to the attributes of the <media:content> tag we use this foreach statement on the previously populated variable $child, addressing the content element

foreach ($child->content->attributes() as $attrib_name => $attrib_value)
{
    echo "the attribute name : " . $attrib_name . " the attribute value : " . $attrib_value;
}

This will echo the name and the value of each attribute. Since we already know the name of the attribute we want, which is medium, we can access its value directly

echo "medium : ".$child->content->attributes()->medium;
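
Each attribute comes back as a SimpleXMLElement object rather than a plain string, so it is usually worth casting it before comparing it or doing arithmetic with it. The following sketch shows the idea on an inline sample; the $sample string and the variable names are assumptions for illustration.

```php
<?php
// Cut-down stand-in for one feed item (illustrative data only).
$sample = <<<'XML'
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <media:content medium="video" duration="1611000" />
    </item>
  </channel>
</rss>
XML;

$doc   = simplexml_load_string($sample);
$media = $doc->channel->item->children('http://search.yahoo.com/mrss/');

// attributes() with no arguments returns the un-prefixed attributes.
$attributes = $media->content->attributes();

// Cast each attribute to the scalar type you actually need.
$medium   = (string) $attributes->medium;
$duration = (int) $attributes->duration;

var_dump($medium, $duration);
```

After the casts, $medium can be compared with === and $duration can be used in calculations without relying on implicit object conversion.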

In this feed there are several tags called <category> and several namespaced tags called <media:category>. SimpleXML exposes repeated tags as an array-like list. There is a method called count() that counts the elements in that list, and the result can be used to address the individual tags and attributes by index. The loop assumes every <category> has a matching <media:category> in the same position, which is true for the item shown above.

$cats = $item->category->count();

for ($i = 0; $i < $cats; $i++)
{
    echo $item->category[$i];
    echo $child->category[$i]->attributes()->scheme;
}
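
If a feed cannot be trusted to pair every <category> with a <media:category>, the index can be guarded before it is used. This is a hedged sketch of one way to do that, not the method used in the article: the $sample string deliberately leaves out one <media:category>, and $pairs and the other variable names are illustrative.

```php
<?php
// Illustrative data: three <category> tags but only two <media:category> tags.
$sample = <<<'XML'
<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <category domain="PBS/taxonomy/topic">Health</category>
      <media:category scheme="http://www.pbs.org/rss/pbscontent/taxonomy/topic">Health</media:category>
      <category domain="PBS/taxonomy/topic">Parents</category>
      <media:category scheme="http://www.pbs.org/rss/pbscontent/taxonomy/topic">Parents</media:category>
      <category domain="PBS/taxonomy/topic">Technology</category>
    </item>
  </channel>
</rss>
XML;

$doc   = simplexml_load_string($sample);
$item  = $doc->channel->item;
$media = $item->children('media', true);

$cats      = $item->category->count();
$mediaCats = $media->category->count();
$pairs     = array();

for ($i = 0; $i < $cats; $i++) {
    // Only read the scheme when a <media:category> exists at this index.
    $scheme = ($i < $mediaCats)
        ? (string) $media->category[$i]->attributes()->scheme
        : '';

    $pairs[] = array((string) $item->category[$i], $scheme);
    echo $item->category[$i] . " link : " . $scheme . "\n";
}
```

The unmatched category still gets printed, just with an empty link, instead of the loop failing on a missing element.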

Here is the complete listing that will output some of the feed data to your screen. The lack of HTML is deliberate: I am not a web designer and line breaks are as much as I need.
<?php
$url = "http://video.pbs.org/program/local-show/rss/";

$contents = file_get_contents($url);
$xml = simplexml_load_string($contents);

foreach ($xml->channel as $channel)
{
    foreach ($channel->item as $item)
    {
        $ns = $item->getNamespaces(true);
        $child = $item->children($ns["media"]);

        echo "Programme Title : " . $item->title . "<br />";
        echo "Video Link : " . $item->link . "<br />";
        echo "Description : " . $item->description . "<br />";
        echo "Transmitted  : " . $item->pubDate . "<br />";

        $cats = $item->category->count();

        echo "Found in the following categories :<br />";

        for ($i = 0; $i < $cats; $i++)
        {
            echo $item->category[$i];
            echo " link : " . $child->category[$i]->attributes()->scheme . "<br/>";
        }

        echo "Rating: " . $child->rating . "<br/>";

        // Calculate the time from milliseconds
        $x = $child->content->attributes()->duration / (60 * 1000);
        $m = floor($x);
        $s = number_format(($x - $m) * 60);

        echo "Duration :  $m minutes  $s seconds <br/><br/><br/>";
    }
}
?>
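
The duration arithmetic can also be done in whole seconds, which avoids a fractional result being rounded up to 60 seconds. The sketch below is an alternative under that assumption, not the calculation used in the listing; format_duration is a hypothetical helper name.

```php
<?php
// Hypothetical helper: split a millisecond count into whole minutes and seconds.
function format_duration($milliseconds)
{
    // Round to the nearest second first, then split with integer arithmetic.
    $totalSeconds = (int) round($milliseconds / 1000);
    $minutes      = (int) floor($totalSeconds / 60);
    $seconds      = $totalSeconds % 60;

    return "$minutes minutes $seconds seconds";
}

echo format_duration(1611000) . "\n"; // duration value from the sample item: 26 minutes 51 seconds
echo format_duration(59999) . "\n";   // rounds up to a full minute: 1 minutes 0 seconds
```

In the listing it would be called with the attribute cast to an integer, for example format_duration((int) $child->content->attributes()->duration).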

There you have it, a parsed feed with namespaces, attributes and no regex in sight.

Comments (1)

Ray

Thank you very much for your effort

I asked them to generate a smaller file for me.

Here is a full one: https://dl.dropbox.com/u/33313692/CodesAndDescription.xml
