<table id="stockPrices">
<tr>
<td class='name' width='250'><b>Stock:</b> ABC</td>
<td class='price' width='200'><b>Price:</b> $123.00</td>
</tr>
<tr>
<td class='name' width='250'><b>Stock:</b> XYZ</td>
<td class='price' width='200'><b>Price:</b> $100.00</td>
</tr>
</table>
Now let's say you want to write a PHP script to pull the stock prices off that page. You could write a regular expression to do it, but regexes aren't always the right tool, and a pattern tuned to this markup might break on a simple HTML change. What if the page author tweaks the <td> tags slightly, or changes something else that invalidates your regular expression? You might have to start over from scratch!
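To make that fragility concrete, here is a minimal sketch in plain PHP. The pattern, the extractPrices() helper, and the "changed" markup are all hypothetical, made up for illustration: the pattern is tuned to the exact markup of the sample table, and the changed version only swaps the quote style and drops the width attribute.

```php
<?php
// Hypothetical regex written against the exact markup of the sample table.
$pattern = <<<'REGEX'
~<td class='price' width='200'><b>Price:</b> \$([0-9.]+)</td>~
REGEX;

// The markup as it appears in the sample table.
$original = <<<'HTML'
<tr>
<td class='name' width='250'><b>Stock:</b> ABC</td>
<td class='price' width='200'><b>Price:</b> $123.00</td>
</tr>
HTML;

// Assumed change: the author switches to double quotes and drops the width.
$changed = <<<'HTML'
<tr>
<td class="name"><b>Stock:</b> ABC</td>
<td class="price"><b>Price:</b> $123.00</td>
</tr>
HTML;

// Hypothetical helper: return every price captured by the pattern.
function extractPrices(string $html, string $pattern): array
{
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}

print_r(extractPrices($original, $pattern)); // one match: 123.00
print_r(extractPrices($changed, $pattern));  // empty array: the pattern no longer matches
```

The price is still on the page in the second case, but the pattern depends on details (quote style, attribute order, the width attribute) that have nothing to do with the data you actually want.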
<?php
// Include the library
require("simple_html_dom.php");
// Parse the page with the Simple HTML DOM shortcut file_get_html()
$dom = file_get_html("http://www.fakestockprices.com/some_page.html");
// NOTE: There is also one for str_get_html if you already have
// the HTML in a string variable: $dom = str_get_html("<html>...</html>");
// Find all <TR> tags inside any table with the ID of "stockPrices":
$TRs = $dom->find("table[id=stockPrices] tr");
// Now loop through and grab our values:
foreach($TRs as $TR)
{
    // children(0) is the first TD inside the TR, while children(1) would be the second TD and so on...
    $stockNameTD = $TR->children(0);
    $stockPriceTD = $TR->children(1);

    // plaintext gives us the content without any HTML formatting
    $stockName = $stockNameTD->plaintext;
    $stockPrice = $stockPriceTD->plaintext;

    // You could also chain the commands together like this:
    $stockName = $TR->children(0)->plaintext;
    $stockPrice = $TR->children(1)->plaintext;
}
?>
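One detail to watch for: in the sample table the bold labels sit inside the same cells as the values, so plaintext would most likely give you something like "Stock: ABC" and "Price: $123.00" rather than the bare values. Here is a minimal sketch of one way to clean that up in plain PHP. The helpers stripLabel() and parsePrice() are hypothetical names, not part of Simple HTML DOM, and the input strings are assumed to be what plaintext returns for the sample rows.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): remove a leading
// "Label:" prefix such as "Stock:" or "Price:" and trim whitespace.
// Text without a colon is returned unchanged apart from trimming.
function stripLabel(string $text): string
{
    return trim(preg_replace('/^\s*[^:]*:\s*/', '', $text));
}

// Hypothetical helper: turn a labelled price like "Price: $123.00" into a float.
function parsePrice(string $text): float
{
    return (float) str_replace(['$', ','], '', stripLabel($text));
}

// Assumed plaintext values for the first row of the sample table.
$row = [
    'name'  => stripLabel('Stock: ABC'),       // 'ABC'
    'price' => parsePrice('Price: $123.00'),   // 123.0
];

var_dump($row);
```

Inside the loop you would pass $TR->children(0)->plaintext and $TR->children(1)->plaintext to these helpers instead of the literal strings. A float is fine for a quick sketch, though for real money calculations you may prefer to keep the price as a string or as an integer number of cents.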
If you're familiar with jQuery's style of selecting elements on a page, you'll be right at home here: the selector syntax is very similar. Here are some more examples:
// Find every <TD> in the page
$TDs = $dom->find('td');
// Find every <DIV> that also has class='someClass' as an attribute
$ClassyDIVs = $dom->find('div.someClass');
// Same thing as above, but the longer, more generic way
$ClassyDIVs = $dom->find('div[class=someClass]');
// Find every <IMG> with a width of 200
$AttrIMGs = $dom->find('img[width=200]');
// Find every <A> inside of a <SPAN>
$SpanLinks = $dom->find("span a");
All of those examples give you an array of matching elements. Let's say you only wanted the second element of $ClassyDIVs:
// Find every <DIV> that also has class='someClass' as an attribute
$ClassyDIVs = $dom->find('div.someClass');
// Get the second matching element
$SecondClassyDIV = $ClassyDIVs[1];
You can shorten this by passing the index as the second argument to the find method. The index is zero-based, so 1 means the second match:
// Find the second <DIV> that also has class='someClass' as an attribute
$SecondClassyDIV = $dom->find('div.someClass',1);
Once you have an element, its HTML attributes are available as properties:
// Find every <A> inside of a <SPAN>
$SpanLinks = $dom->find("span a");
// Show all the HREFs
foreach($SpanLinks as $SpanLink)
{
    echo $SpanLink->href . "\n";
}
Now, you can do more than just read elements. You can also modify them and regenerate the final HTML page:
// Find all images and update their src
$IMGs = $dom->find('img');
foreach($IMGs as $IMG)
{
    $IMG->src = "/some_image.jpg";
}
// Generate and display the modified HTML document
echo $dom;
All of this is just to whet your appetite, though. There are plenty of good and useful examples on the Simple HTML DOM web site. So if you're working with HTML manipulation of any sort (scraping, updating, etc.), make sure you try out this library. It can really simplify your life!
Comments (2)
Commented:
After using regular expressions for a while, I switched to simple_html_dom to parse data from the site for the World Championships in Daegu and produce a JSON dataset.
It made the task sooooo easy. I loooove regular expressions and felt like it was a betrayal but it would've been too much work for this task.
Commented: