HTML Manipulation Made Easy

Imagine that you have a web page that has a stock price buried inside a table cell, like this:
<table id="stockPrices">
  <tr>
    <td class='name' width='250'><b>Stock:</b> ABC</td>
    <td class='price' width='200'><b>Price:</b> $123.00</td>
  </tr>
  <tr>
    <td class='name' width='250'><b>Stock:</b> XYZ</td>
    <td class='price' width='200'><b>Price:</b> $100.00</td>
  </tr>
</table>
Now let's say you want to write a PHP script to pull the stock prices off of that page. You COULD write a regular expression to do it, but regexes aren't always the right tool, and one can break on a simple HTML change. What if the web page author decides to change the <td> tags slightly, or changes something else that invalidates your regular expression? You might have to start from the beginning again!
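
To make that fragility concrete, here is a minimal sketch. The pattern is a hypothetical one written against the exact price cell shown above, and the "changed" markup is an invented example of the kind of cosmetic edit a page author might make:

```php
<?php
// Hypothetical pattern, tied to the exact markup of the price cell above.
$pattern = '#<td class=\'price\' width=\'200\'><b>Price:</b> \$([0-9.]+)</td>#';

// The cell as it appears in the original page.
$original = "<td class='price' width='200'><b>Price:</b> \$123.00</td>";

// The same cell after a cosmetic change: double quotes and a wider column.
$changed = '<td class="price" width="220"><b>Price:</b> $123.00</td>';

$matchedOriginal = preg_match($pattern, $original, $m);
$matchedChanged  = preg_match($pattern, $changed);

echo "Original markup matched: " . ($matchedOriginal ? "yes, price " . $m[1] : "no") . "\n";
echo "Changed markup matched:  " . ($matchedChanged ? "yes" : "no") . "\n";
```

The price is still on the page in the second case, but the regex no longer finds it, and nothing warns you that it stopped working.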

A better way is to use a free PHP library called Simple HTML DOM, found here:

It makes it EXTREMELY easy to process an HTML page and extract data from it. The library parses an HTML page into an object and gives you advanced search commands, so you can look for HTML tags that match certain criteria and extract their contents in a variety of ways. For the above example, I might write something like this:
// Include the library (this assumes simple_html_dom.php is in the same directory)
require_once("simple_html_dom.php");

// Parse the page with the Simple HTML DOM shortcut file_get_html()
$dom = file_get_html("");

// NOTE: There is also one for str_get_html if you already have 
// the HTML in a string variable: $dom = str_get_html("<html>...</html>");

// Find all <TR> tags inside any table with the ID of "stockPrices":
$TRs = $dom->find("table[id=stockPrices] tr");

// Now loop through and grab our values:
foreach($TRs as $TR)
{
  // children(0) is the first TD inside the TR, while children(1) would be the second TD and so on...
  $stockNameTD = $TR->children(0);
  $stockPriceTD = $TR->children(1);

  // plaintext gives us the content without any HTML formatting
  $stockName = $stockNameTD->plaintext;
  $stockPrice = $stockPriceTD->plaintext;

  // You could also chain the commands together like this:
  $stockName = $TR->children(0)->plaintext;
  $stockPrice = $TR->children(1)->plaintext;
}
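
Note that plaintext only strips the tags, so the values still carry their labels. Assuming the strings come back looking like "Stock: ABC" and "Price: $123.00", a small cleanup step might look like the sketch below. The helper names stripLabel() and parsePrice() are hypothetical and not part of Simple HTML DOM:

```php
<?php
// Hypothetical helper: drop a leading "Label:" prefix and trim whitespace.
// If the text has no colon, it is returned trimmed but otherwise unchanged.
function stripLabel($text)
{
    return trim(preg_replace('/^\s*[^:]+:\s*/', '', $text));
}

// Hypothetical helper: turn a price string such as "$1,234.50" into a float.
function parsePrice($text)
{
    // Keep only digits and the decimal point, then cast.
    $digits = preg_replace('/[^0-9.]/', '', $text);
    return (float) $digits;
}

// Sample strings standing in for the plaintext values from the loop.
$stockName  = stripLabel('Stock: ABC');
$stockPrice = parsePrice(stripLabel('Price: $123.00'));

echo $stockName . " => " . $stockPrice . "\n";
```

Keeping the cleanup in separate functions means the scraping code only deals with finding elements, while formatting quirks are handled in one place.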
If you're familiar with jQuery's style of selecting elements on a page, you'll be right at home here. The selector syntax is very similar. Here are some more examples:
// Find every <TD> in the page
$TDs         = $dom->find('td'); 

// Find every <DIV> that also has class='someClass' as an attribute
$ClassyDIVs  = $dom->find('div.someClass'); 

// Same thing as above, but the longer, more generic way
$ClassyDIVs  = $dom->find('div[class=someClass]'); 

// Find every <IMG> with a width of 200
$AttrIMGs    = $dom->find('img[width=200]'); 

// Find every <A> inside of a <SPAN>
$SpanLinks   = $dom->find("span a");
All of those examples give you an array of matching elements. Let's say you only wanted the second element of $ClassyDIVs:

// Find every <DIV> that also has class='someClass' as an attribute
$ClassyDIVs = $dom->find('div.someClass');

// Get the second matching element
$SecondClassyDIV = $ClassyDIVs[1];
You can shorten this by putting the index at the end of the find method:
// Find the second <DIV> that also has class='someClass' as an attribute
$SecondClassyDIV = $dom->find('div.someClass',1);
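
One thing to watch for with the indexed form: when there is no element at that index, find() returns null rather than an element, so reading a property from the result would fail. A minimal sketch of guarding against that, assuming simple_html_dom.php is in the same directory and using an invented one-element document:

```php
<?php
// Assumption: simple_html_dom.php sits next to this script.
require_once 'simple_html_dom.php';

// A made-up document with only ONE matching <div>.
$dom = str_get_html('<div class="someClass">only one</div>');

// Index 1 asks for the second match, which does not exist here.
$secondClassyDIV = $dom->find('div.someClass', 1);

if ($secondClassyDIV === null) {
    echo "No second matching <div> found\n";
} else {
    echo $secondClassyDIV->plaintext . "\n";
}
```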

If you wanted to access any specific attribute on an element, it's right there as a property of the element:
// Find every <A> inside of a <SPAN>
$SpanLinks = $dom->find("span a");

// Show all the HREFs
foreach($SpanLinks as $SpanLink)
  echo $SpanLink->href . "\n";
Now, you can do more than just read elements. You can also modify them and regenerate the final HTML page:
// Find all images and update their src
$IMGs = $dom->find('img');
foreach($IMGs as $IMG)
  $IMG->src = "/some_image.jpg";

// Generate and display the modified HTML document
echo $dom;
However, all of this is just to whet your appetite. There are plenty of good and useful examples on the Simple HTML DOM web site. So if you're working with HTML manipulation of any sort (scraping, updating, etc.), make sure you try out this library. It can really simplify your life!

Copyright © 2012 - Jonathan Hilgeman. All Rights Reserved. 

Comments (2)

I'd have to concur with the views expressed here. I'd still love to try out the other libraries/methods at some point though.
After using regular expressions for a while, I switched to simple_html_dom for parsing data from the site for the world championships in Daegu to produce a JSON dataset.

It made the task sooooo easy. I loooove regular expressions and felt like it was a betrayal but it would've been too much work for this task.

I am not complaining and think this simple_html_dom tutorial is better than the actual website or any tutorial I can find.  I am waiting for the sequel to this article.
