
HTML Manipulation Made Easy

Imagine that you have a web page that has a stock price buried inside a table cell, like this:
<table id="stockPrices">
  <tr>
    <td class='name' width='250'><b>Stock:</b> ABC</td>
    <td class='price' width='200'><b>Price:</b> $123.00</td>
  </tr>
  <tr>
    <td class='name' width='250'><b>Stock:</b> XYZ</td>
    <td class='price' width='200'><b>Price:</b> $100.00</td>
  </tr>
</table>


Now let's say you want to write a PHP script to pull the stock prices off of that page. You COULD write a regular expression to do it, but regexes aren't always the right tool, and one can break on a simple HTML change. What if the web page author decides to change the <td> tags slightly, or changes something else that invalidates your regular expression? You might have to start from the beginning again!
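To make that fragility concrete, here is a minimal sketch. The pattern is a hypothetical one, written to match the price cell exactly as it appears above, and the "changed" markup is an invented example of a small edit a page author might make:

```php
<?php
// Hypothetical regex tied to the exact markup of the price cell
$pattern = "~<td class='price' width='200'><b>Price:</b> \\$([0-9.]+)</td>~";

// The cell as it appears in the sample table
$original = "<td class='price' width='200'><b>Price:</b> \$123.00</td>";

// Same data, but the author switched quote style and dropped the width attribute
$changed = '<td class="price"><b>Price:</b> $123.00</td>';

$matchesOriginal = preg_match($pattern, $original, $m1); // finds the price
$matchesChanged  = preg_match($pattern, $changed, $m2);  // finds nothing

echo "original matched: " . $matchesOriginal . "\n";
echo "changed matched:  " . $matchesChanged . "\n";
```

The price is still sitting in the cell in both cases. Only the markup around it changed, and that was enough to make the pattern miss.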

A better way is to use a free PHP library called Simple HTML DOM, found here:

http://simplehtmldom.sourceforge.net/

It makes it EXTREMELY easy to process an HTML page and extract data from it. The library parses an HTML page into an object and gives you CSS-style search commands so you can look for HTML tags that match certain criteria, and it lets you extract the contents in a variety of ways. In the above example, I might write something like this:
<?php
// Include the library
require("simple_html_dom.php");

// Parse the page with the Simple HTML DOM shortcut file_get_html()
$dom = file_get_html("http://www.fakestockprices.com/some_page.html");

// Stop early if the page could not be loaded or parsed
if (!$dom) {
  die("Could not load the page");
}

// NOTE: There is also str_get_html() if you already have
// the HTML in a string variable: $dom = str_get_html("<html>...</html>");

// Find all <TR> tags inside any table with the ID of "stockPrices":
$TRs = $dom->find("table[id=stockPrices] tr");

// Now loop through and grab our values:
foreach($TRs as $TR)
{
  // children(0) is the first TD inside the TR, while children(1) would be the second TD and so on...
  $stockNameTD = $TR->children(0);
  $stockPriceTD = $TR->children(1);
  
  // plaintext gives us the content without any HTML formatting
  $stockName = $stockNameTD->plaintext; 
  $stockPrice = $stockPriceTD->plaintext;
  
  // You could also chain the commands together like this:
  $stockName = $TR->children(0)->plaintext;
  $stockPrice = $TR->children(1)->plaintext;
}
?>
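The plaintext values still carry the "Stock:" and "Price:" labels, so you will probably want to clean them up before using them. Here is a rough sketch of one way to do that. The parsePrice() helper and the $rows array are hypothetical: the strings are what I'd assume plaintext returns for the sample table, hardcoded so the snippet runs without the library.

```php
<?php
/**
 * Hypothetical helper: pull a numeric price out of cell text
 * such as "Price: $123.00". Returns null when no number is found.
 */
function parsePrice(string $text): ?float
{
    // Drop thousands separators, then grab the first number in the text
    $clean = str_replace(',', '', $text);
    if (preg_match('/\d+(?:\.\d+)?/', $clean, $m)) {
        return (float) $m[0];
    }
    return null;
}

// Assumed plaintext of each row's two cells (name cell, price cell)
$rows = [
    ['Stock: ABC', 'Price: $123.00'],
    ['Stock: XYZ', 'Price: $100.00'],
];

$prices = [];
foreach ($rows as [$nameText, $priceText]) {
    // Strip the label and surrounding whitespace to get the bare name
    $name = trim(str_replace('Stock:', '', $nameText));
    $prices[$name] = parsePrice($priceText);
}

print_r($prices);
```

In a real script you would feed in $TR->children(0)->plaintext and $TR->children(1)->plaintext instead of the hardcoded strings. The small regex here only runs on plain text, so it doesn't care how the surrounding HTML is written.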


If you're familiar with jQuery's style of selecting elements on a page, you'll be right at home here. The selector syntax is very similar, although the library only supports a subset of it. Here are some more examples:
// Find every <TD> in the page
$TDs         = $dom->find('td'); 

// Find every <DIV> that also has class='someClass' as an attribute
$ClassyDIVs  = $dom->find('div.someClass'); 

// Same thing as above, but the longer, more generic way
$ClassyDIVs  = $dom->find('div[class=someClass]'); 

// Find every <IMG> with a width of 200
$AttrIMGs    = $dom->find('img[width=200]'); 

// Find every <A> inside of a <SPAN>
$SpanLinks   = $dom->find("span a");


All of those examples give you an array of matching elements. Let's say you only wanted the second element of $ClassyDIVs:

// Find every <DIV> that also has class='someClass' as an attribute
$ClassyDIVs = $dom->find('div.someClass');

// Get the second matching element
$SecondClassyDIV = $ClassyDIVs[1];


You can shorten this by putting the index at the end of the find method:
// Find the second <DIV> that also has class='someClass' as an attribute
$SecondClassyDIV = $dom->find('div.someClass',1);
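One thing to watch with the index shortcut is what happens when there aren't enough matches. As far as I know, find() with an index returns null when nothing exists at that position, so it's worth checking before you touch any properties. Here is a minimal sketch, assuming simple_html_dom.php sits in the same folder as the script; the markup is made up and has only one matching DIV:

```php
<?php
// Assumes the library file is in the same folder as this script
require "simple_html_dom.php";

// Made-up markup with only ONE matching <div>
$dom = str_get_html("<div class='someClass'>first</div><p>no second div here</p>");

$first  = $dom->find('div.someClass', 0); // the first match
$second = $dom->find('div.someClass', 1); // no second match exists

if ($second === null) {
    $result = "no second match";
} else {
    $result = $second->plaintext;
}

echo $first->plaintext . "\n";
echo $result . "\n";
```

Without the check, reading $second->plaintext would be a property access on a non-object, which is an easy mistake to make once the page you are scraping changes.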



If you want to access a specific attribute on an element, it's available as a property of the element:
// Find every <A> inside of a <SPAN>
$SpanLinks = $dom->find("span a");

// Show all the HREFs
foreach($SpanLinks as $SpanLink)
{
  echo $SpanLink->href . "\n";
}


Now, you can do more than just read elements. You can also modify them and regenerate the final HTML page:
// Find all images and update their src
$IMGs = $dom->find('img');
foreach($IMGs as $IMG)
{
  $IMG->src = "/some_image.jpg";
}

// Generate and display the modified HTML document
echo $dom;
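If you would rather keep the modified HTML than print it, the library's save() method returns it as a string, and it can also write to a file if you give it a path. Here is a small sketch, again assuming simple_html_dom.php is in the same folder; the markup and the file name are made up:

```php
<?php
// Assumes the library file is in the same folder as this script
require "simple_html_dom.php";

// Made-up markup with a single image
$dom = str_get_html("<p><img src='/old.jpg' alt='logo'></p>");

// Point every image at a new source
foreach ($dom->find('img') as $img) {
    $img->src = "/some_image.jpg";
}

// Get the modified document back as a string
$html = $dom->save();
echo $html . "\n";

// Passing a path writes the document to disk as well (hypothetical file name):
// $dom->save("modified_page.html");
```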


However, all of this is just to whet your appetite. There are plenty of good and useful examples on the Simple HTML DOM web site. So if you're working with HTML manipulation of any sort (scraping, updating, etc.), make sure you try out this library. It can really simplify your life!
Author: gr8gonzo