Retrieve specific parts of HTML documents without the hassle of monster regular expressions

Pierre François, Senior consultant
Browsing the questions asked of the Experts on this forum, you will be amazed at how often people struggle with monster regular expressions (regex) to select the specific part of an HTML or XML file they want to extract. The examples in this article are written in PHP.

Even when their regex seems to work for their case, they realize that it is very hard to write a regex that is safe in all cases. For example, if you try to match the beginning of a paragraph, which is one of the simplest things you could imagine, you will try to match "<p>", won't you?

Well, in that case, you are going to miss <P>, as was common in early versions of HTML. I can already hear the answer of the regex fans: you only have to adapt your pattern to match "<[pP]>".

What happens when the author of the HTML takes the liberty of adding spaces between "<p" and the closing ">", as in "<p >"? Regex is still not defeated: you can rescue the code by using "<[pP] *>" as the pattern.

And what if I also want to match <p some-attribute="some-string"> along with <p> without attributes? No problem: the regex becomes "<[pP][^>]*>". Pretty simple, no? But it doesn't work when one of the attribute strings contains a ">" char, it also matches other tags that start with the same letter, such as <pre>, and it fails when the input is read line by line and the opening < and the closing > are on different lines. That's really the point: XML and HTML files are not organised line by line, as regex-based processing usually assumes.
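The limits of that last pattern can be checked with a few lines of PHP. This is only a minimal sketch: the sample strings and the variable names are made up for illustration.

```php
<?php
// Pattern taken from the paragraph above.
$pattern = '/<[pP][^>]*>/';

// Hypothetical sample tags, chosen to exercise the weak spots.
$samples = [
    'plain'       => '<p>',
    'attribute'   => '<p class="intro">',
    'gt_in_value' => '<p title="a > b">',
    'other_tag'   => '<pre>',
];

$matches = [];
foreach ($samples as $name => $html) {
    // Keep the matched text, or null when nothing matched.
    $matches[$name] = preg_match($pattern, $html, $m) ? $m[0] : null;
}

print_r($matches);
```

The first two samples match as intended. The third match stops at the ">" inside the attribute value and returns only '<p title="a >', and the fourth shows that the pattern also accepts '<pre>', which is not a paragraph at all.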

It is better to stop here. I hope you are starting to understand that using regex pattern matching on HTML/XML files is like using a knife as a screwdriver: it works, but you have to be very skilled not to break anything, and you will lose a lot of time, if you don't hurt yourself.

The appropriate tool for selecting any part of an XML document is XPath. And any HTML document can easily be turned into an XML document tree thanks to the DOMDocument classes that we can find in most programming languages. I will illustrate how to do this with PHP.

Assuming $htmlfile contains the document you want to parse, you can build an XML tree with the following PHP instructions:
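libxml_use_internal_errors(true); // collect parser warnings about sloppy HTML instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($htmlfile);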

Note: the parser normalizes all the HTML tag names to lowercase, as in XHTML, so your XPath queries must use lowercase names as well.
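The lowercase conversion is easy to observe. The following sketch assumes a small made-up document written with uppercase tags; only the DOMDocument calls come from the article.

```php
<?php
// Hypothetical input written with uppercase tag names.
$htmlfile = '<HTML><BODY><P>Hello</P></BODY></HTML>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($htmlfile);

// Look the paragraph up by its lowercase name.
$p = $doc->getElementsByTagName('p')->item(0);
$name = $p->nodeName;
$text = $p->textContent;

echo $name, ': ', $text, "\n"; // prints "p: Hello"
```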

Let us suppose we want to get all the images (IMG). You can select the corresponding list of nodes from the tree by doing:
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//img");

The query syntax here, query("//img"), is far more transparent than the corresponding regex, and it works even when the IMG tag spans several lines of text.
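To see this on a concrete case, here is a small self-contained sketch. The HTML fragment is invented for the example: it contains an uppercase IMG tag spread over several lines, with a ">" inside one of its attributes, which are exactly the cases that trouble a regex.

```php
<?php
// Hypothetical document: one multi-line IMG tag and one ordinary img tag.
$htmlfile = <<<HTML
<html><body>
<IMG
   src="a.png"
   alt="a > b">
<p>Some text <img src="b.png"></p>
</body></html>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($htmlfile);

$xpath = new DOMXPath($doc);

// Collect the src attribute of every image in the document.
$sources = [];
foreach ($xpath->query('//img') as $img) {
    $sources[] = $img->getAttribute('src');
}

print_r($sources);
```

Both images are found, whatever the case of the tag or the layout of its attributes.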

Before closing this article, here is how you can print the selected parts of the HTML file in a loop:
foreach ($nodelist as $node) {
    echo $doc->saveXML($node);
}

The versatility of XPath is huge. Not only can you select a specific tag, as in the example above with "//img", you can really navigate within the document you are parsing.
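A few queries give an idea of what this navigation looks like. The markup below, including the "gallery" class name, is purely hypothetical; the XPath expressions themselves are standard XPath 1.0.

```php
<?php
// Hypothetical page: one linked image in a gallery, one plain logo.
$htmlfile = '<html><body>'
    . '<div class="gallery"><a href="/big1.png"><img src="t1.png" alt="One"></a></div>'
    . '<p><img src="logo.png"></p>'
    . '</body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($htmlfile);
$xpath = new DOMXPath($doc);

// Images located anywhere inside the gallery div.
$gallery = $xpath->query("//div[@class='gallery']//img");

// href attribute of every link that directly wraps an image.
$links = [];
foreach ($xpath->query("//a[img]/@href") as $attr) {
    $links[] = $attr->nodeValue;
}

// Images that have no alt attribute.
$noAlt = $xpath->query("//img[not(@alt)]");

echo $gallery->length, " gallery image(s)\n";
echo implode(', ', $links), "\n";
echo $noAlt->item(0)->getAttribute('src'), " has no alt text\n";
```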

If, for example, I want to parse http://fr.weather.com/weather/today-Bruxelles-BEXX0005, I can select the frame with the current conditions with the following XPath statement:

$xpath->query("//div[@id='today_current']");

I can refine the XPath expression to select only a subset of that frame, reaching levels of granularity that are far beyond the scope of the most monstrous regex, while keeping my code perfectly readable. For a full reference about XPath, see <http://www.w3.org/TR/xpath/>.
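As a sketch of such a refinement, the example below narrows the selection to single values inside the frame. The real structure of the weather page is not reproduced here: the span elements and the "temp" and "desc" class names are assumptions made up for the illustration, and only the id today_current comes from the query above.

```php
<?php
// Hypothetical markup standing in for the weather page.
$htmlfile = '<html><body>'
    . '<div id="today_current">'
    . '<span class="temp">12</span><span class="desc">Cloudy</span>'
    . '</div>'
    . '<div id="tomorrow"><span class="temp">15</span></div>'
    . '</body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($htmlfile);
$xpath = new DOMXPath($doc);

// One expression: the temperature inside the current-conditions frame only.
$temp = $xpath->query("//div[@id='today_current']//span[@class='temp']")
              ->item(0)->textContent;

// Alternative: fetch the frame once, then query relative to it.
$frame = $xpath->query("//div[@id='today_current']")->item(0);
$desc  = $xpath->query(".//span[@class='desc']", $frame)->item(0)->textContent;

echo $temp, ' / ', $desc, "\n"; // prints "12 / Cloudy"
```

Passing the frame as the second argument of query() makes the expression relative to that node, which is handy when several values have to be read from the same part of the page.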

My conclusion is clear: if you want to dramatically improve your productivity when processing HTML pages, it is time to move from regex to XML processing.