HTML Manipulation Made Easy

Imagine that you have a web page that has a stock price buried inside a table cell, like this:
<table id="stockPrices">
  <tr>
    <td class='name' width='250'><b>Stock:</b> ABC</td>
    <td class='price' width='200'><b>Price:</b> $123.00</td>
  </tr>
  <tr>
    <td class='name' width='250'><b>Stock:</b> XYZ</td>
    <td class='price' width='200'><b>Price:</b> $100.00</td>
  </tr>
</table>


Now let's say you want to write a PHP script to pull the stock prices off of that page. You COULD write a regular expression to do it, but regexes aren't always the right tool, and one can break on even a simple HTML change. What if the web page author changes the <td> tags slightly, or makes some other change that invalidates your regular expression? You might have to start from the beginning again!
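
To make that fragility concrete, here is a small sketch in plain PHP. The pattern is a hypothetical one written against the exact markup shown above, and the "changed" cell is an invented example of a purely cosmetic edit (double quotes and a different width):

```php
<?php
// Hypothetical pattern written against the exact markup shown above.
$pattern = '~<td class=\'price\' width=\'200\'><b>Price:</b> \$([0-9.]+)</td>~';

// The cell as it appears today.
$original = "<td class='price' width='200'><b>Price:</b> \$123.00</td>";

// The same cell after a cosmetic edit: double quotes and a new width.
$changed = '<td class="price" width="220"><b>Price:</b> $123.00</td>';

// preg_match() returns 1 when the pattern matches and 0 when it does not.
$before = preg_match($pattern, $original, $m);
$after  = preg_match($pattern, $changed);

echo $before === 1 ? "original markup: matched {$m[1]}\n" : "original markup: no match\n";
echo $after === 1 ? "changed markup: matched\n" : "changed markup: no match\n";
```

The price is still on the page in the second case, but the pattern no longer finds it.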

A better way is to use a free PHP library called Simple HTML DOM, found here:

http://simplehtmldom.sourceforge.net/

It makes it EXTREMELY easy to process an HTML page and extract data from it. The library parses an HTML page into an object and gives you advanced search commands, so you can look for HTML tags that match certain criteria and extract their contents in a variety of ways. For the example above, I might write something like this:
<?php
// Include the library
require("simple_html_dom.php");

// Parse the page with the Simple HTML DOM shortcut file_get_html()
$dom = file_get_html("http://www.fakestockprices.com/some_page.html");

// file_get_html() can return false if the page could not be loaded, so check first
if(!$dom)
{
  die("Could not load the page");
}

// NOTE: There is also str_get_html() if you already have
// the HTML in a string variable: $dom = str_get_html("<html>...</html>");

// Find all <TR> tags inside any table with the ID of "stockPrices":
$TRs = $dom->find("table[id=stockPrices] tr");

// Now loop through and grab our values:
foreach($TRs as $TR)
{
  // children(0) is the first TD inside the TR, while children(1) would be the second TD and so on...
  $stockNameTD = $TR->children(0);
  $stockPriceTD = $TR->children(1);

  // plaintext gives us the content without any HTML formatting
  $stockName = $stockNameTD->plaintext;
  $stockPrice = $stockPriceTD->plaintext;

  // You could also chain the commands together like this:
  $stockName = $TR->children(0)->plaintext;
  $stockPrice = $TR->children(1)->plaintext;

  // Show what we found for this row
  echo $stockName . " / " . $stockPrice . "\n";
}
?>
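
Note that plaintext strips the tags but keeps the label text, so for the sample table the values should come back as something like "Stock: ABC" and "Price: $123.00". Here is a minimal sketch of cleaning those up in plain PHP, with no library calls. The helper names stripLabel() and priceToFloat() are made up for this illustration, and the nullable return type assumes PHP 7.1 or later:

```php
<?php
// Hypothetical helper: drop a leading "Label:" prefix from a cell's plaintext.
function stripLabel(string $text): string
{
    return trim((string) preg_replace('/^\s*[^:]+:\s*/', '', $text));
}

// Hypothetical helper: turn "$1,234.50" into 1234.5, or null if it is not a number.
function priceToFloat(string $text): ?float
{
    $digits = str_replace(['$', ','], '', trim($text));
    return is_numeric($digits) ? (float) $digits : null;
}

// The strings below stand in for the plaintext values from the loop above.
$stockName  = stripLabel('Stock: ABC');
$stockPrice = priceToFloat(stripLabel('Price: $123.00'));

var_dump($stockName, $stockPrice);
```

Returning null for anything that does not look like a number lets the calling code decide how to handle a cell that changed format, rather than silently storing 0.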


If you're familiar with jQuery's style of selecting elements on a page, you'll be right at home here. The selector syntax is very similar, although Simple HTML DOM supports only a subset of what jQuery does. Here are some more examples:
// Find every <TD> in the page
$TDs         = $dom->find('td'); 

// Find every <DIV> that also has class='someClass' as an attribute
$ClassyDIVs  = $dom->find('div.someClass'); 

// Same thing as above, but the longer, more generic way
$ClassyDIVs  = $dom->find('div[class=someClass]'); 

// Find every <IMG> with a width of 200
$AttrIMGs    = $dom->find('img[width=200]'); 

// Find every <A> inside of a <SPAN>
$SpanLinks   = $dom->find("span a");


All of those examples give you an array of matching elements. Let's say you only wanted the second element of $ClassyDIVs:

// Find every <DIV> that also has class='someClass' as an attribute
$ClassyDIVs = $dom->find('div.someClass');

// Get the second matching element
$SecondClassyDIV = $ClassyDIVs[1];


You can shorten this by putting the index at the end of the find method:
// Find the second <DIV> that also has class='someClass' as an attribute
$SecondClassyDIV = $dom->find('div.someClass',1);



If you want to access a specific attribute of an element, it is available as a property of that element:
// Find every <A> inside of a <SPAN>
$SpanLinks = $dom->find("span a");

// Show all the HREFs
foreach($SpanLinks as $SpanLink)
{
  echo $SpanLink->href . "\n";
}


Now, you can do more than just read elements. You can also modify them and regenerate the final HTML page:
// Find all images and update their src
$IMGs = $dom->find('img');
foreach($IMGs as $IMG)
{
  $IMG->src = "/some_image.jpg";
}

// Generate and display the modified HTML document
echo $dom;
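
If you would rather capture the modified HTML in a variable than echo it, the library's save() method returns it as a string. The sketch below assumes simple_html_dom.php sits next to the script, as in the first example, and uses a tiny made-up document via str_get_html() so nothing is fetched over the network:

```php
<?php
// Assumes simple_html_dom.php sits next to this script, as in the first example.
require "simple_html_dom.php";

// A tiny made-up document, so nothing is fetched over the network.
$dom = str_get_html('<div><img src="a.png"><img src="b.png"></div>');

// Point every image at the same placeholder file.
foreach ($dom->find('img') as $IMG) {
    $IMG->src = "/some_image.jpg";
}

// save() with no argument returns the regenerated HTML as a string;
// pass a file path instead if you also want it written to disk.
$html = $dom->save();
echo $html;
```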


However, all of this is just to whet your appetite. There are plenty of good and useful examples on the Simple HTML DOM web site. So if you're working with HTML manipulation of any sort (scraping, updating, etc.), make sure you try out this library. It can really simplify your life!
Author: gr8gonzo
2 Comments

Expert Comment

by:recursion_man
I'd have to concur with the views expressed here. I'd still love to try out the other libraries/methods at some point though.
After using regular expressions for a while, I switched to simple_html_dom for parsing data from the site for the world championships in Daegu to produce a JSON dataset.

It made the task sooooo easy. I loooove regular expressions and felt like it was a betrayal but it would've been too much work for this task.

Expert Comment

by:rgb192
I am not complaining and think this simple_html_dom tutorial is better than the actual website or any tutorial I can find.  I am waiting for the sequel to this article.