Retrieve specific parts of HTML documents without the hassle of monster regular expressions

Browse the questions asked of the Experts on this forum and you will be amazed how often people struggle with monster regular expressions (regex) to select the specific part of an HTML or XML file they want to extract. The examples in this article are written in PHP.

Even when a regex seems to work for their case, they realize that it is very hard to write one that is safe in all cases. For example, if you try to match the beginning of a paragraph, which is one of the simplest things you could imagine, you will try to match "<p>", won't you?

Well, in that case you are going to miss <P>, as written in the first versions of HTML. I can already hear the answer of the regex fans: you only have to adapt your pattern to match "<[pP]>".

What happens when the author of the HTML uses the freedom to add spaces between "<p" and the closing ">", as in "<p >"? The regex is not defeated yet: you can rescue the code by using "<[pP] *>" as the pattern.

And what if I also want to match <p some-attribute="some-string"> along with <p> without attributes? No problem: the regex becomes "<[pP][^>]*>". Pretty simple, no? But it doesn't work when one of the attribute strings contains a ">" character, and it also matches other tags whose names start with "p", such as <pre>. Line-oriented tools fail as well when the opening "<" and the closing ">" are on different lines, and that's really the point: XML and HTML files are not organised line by line, as many regex tools assume.
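To make the weakness concrete, here is a minimal sketch that runs the last pattern against a small piece of markup. The sample string is made up for this illustration; only the pattern comes from the paragraph above.

```php
<?php
// Hypothetical sample markup, invented for this illustration:
// one paragraph whose attribute value contains ">", followed by a <pre> block.
$html = '<p title="a > b">text</p><pre>code</pre>';

// The pattern discussed above, wrapped in PCRE delimiters.
preg_match_all('/<[pP][^>]*>/', $html, $matches);

// $matches[0] holds the full-pattern matches, in document order.
print_r($matches[0]);
```

The first match stops at the ">" inside the attribute value, so the opening tag is cut short, and the second match is the <pre> tag, which is not a paragraph at all.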

It is better to stop here. I hope you are starting to see that using regex pattern matching on HTML/XML files is like using a knife as a screwdriver: it works, but you have to be very careful not to break anything, and you will lose a lot of time, if you don't hurt yourself.

The right tool for selecting any part of an XML document is XPath. And an HTML document can easily be turned into an XML tree thanks to the DOM classes (DOMDocument in PHP) that exist in most programming languages. I will illustrate how to do this with PHP.

Assuming $htmlfile contains the HTML source of the document you want to parse (as a string), you can build an XML tree with the following PHP instructions:
$doc = new DOMDocument();
$doc->loadHTML($htmlfile);

Note: the parser normalizes all HTML tag names to lowercase in the resulting tree, so XPath queries must use lowercase names.
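As a small illustration of the loading step and of the lowercase normalization, here is a sketch that parses old-style markup. The input string is hypothetical, and the calls to libxml_use_internal_errors() and libxml_clear_errors() are an addition not found in the snippet above: they only keep parser warnings about sloppy HTML from being printed.

```php
<?php
// Hypothetical input: uppercase tags and unclosed <P> elements, as in old HTML.
$htmlfile = '<HTML><BODY><P CLASS="intro">Hello<P>World</BODY></HTML>';

// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($htmlfile);

// Discard the collected warnings; we do not inspect them in this sketch.
libxml_clear_errors();

// Look the paragraphs up by their lowercase name.
$paragraphs = $doc->getElementsByTagName('p');
echo $paragraphs->length, "\n";
echo $paragraphs->item(0)->nodeName, "\n";
echo $paragraphs->item(0)->getAttribute('class'), "\n";
```

Both paragraphs are found under the lowercase name, even though the source used <P> and never closed it.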

Suppose we want to get all the images (IMG). You can select the corresponding list of nodes from the tree with:
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//img");

The query syntax here, query("//img"), is far more transparent than the corresponding regex, and it works even when the IMG tag spans several lines of text.
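The multi-line case can be checked with a short sketch. The markup below is invented for the purpose: it contains an uppercase IMG tag spread over several lines, with a ">" inside an attribute value, plus an ordinary lowercase one.

```php
<?php
// Hypothetical markup, invented for this illustration.
$htmlfile = <<<'HTML'
<html><body>
<IMG
   SRC="logo.png"
   ALT="a > b">
<p>Some text <img src="photo.jpg" alt="Photo"></p>
</body></html>
HTML;

// Keep parser warnings quiet (an addition to the article's snippet).
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmlfile);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodelist = $xpath->query('//img');

// Number of images found, then the alt text of the first one.
echo $nodelist->length, "\n";
echo $nodelist->item(0)->getAttribute('alt'), "\n";
```

The same "//img" query finds both images, and the attribute value containing ">" comes back intact.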

Here is how you can print the selected parts of the HTML file in a loop:
foreach ($nodelist as $node) {
	// Serialize each selected node back to markup.
	echo $doc->saveXML($node);
}
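Printing the markup is only one option. Often you want a value out of each node instead, and the loop can read attributes and text directly. The following sketch assumes a made-up document containing two links and collects each link's target and visible text.

```php
<?php
// Hypothetical markup, invented for this illustration.
$htmlfile = '<html><body><p>See <a href="/docs">the <b>docs</b></a>'
          . ' and <a href="/faq">the FAQ</a>.</p></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmlfile);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Map each href to the text of its link; textContent ignores nested tags.
$links = [];
foreach ($xpath->query('//a[@href]') as $node) {
    $links[$node->getAttribute('href')] = $node->textContent;
}
print_r($links);
```

Note that textContent returns the text of the whole subtree, so the nested <b> element does not get in the way.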

XPath is hugely versatile. Not only can you select a specific tag, as in the example above with "//img", you can really navigate within the document you are processing.

If, for example, I want to parse http://fr.weather.com/weather/today-Bruxelles-BEXX0005, I can select the frame with the current conditions with the following XPath expression:

$xpath->query("//div[@id='today_current']");
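Since the live page may change or be unavailable, the sketch below runs that expression against a small stand-in document. Everything in the sample markup except the id "today_current" is an assumption made for the illustration.

```php
<?php
// Hypothetical stand-in for the downloaded page; only the id
// "today_current" comes from the article, the rest is invented.
$htmlfile = <<<'HTML'
<html><body>
<div id="header">Weather</div>
<div id="today_current">
  <span class="temp">12</span>
  <span class="sky">Cloudy</span>
</div>
</body></html>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmlfile);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//div[@id='today_current']");

// Exactly one frame is expected; print its markup if it was found.
if ($nodelist->length > 0) {
    echo $doc->saveXML($nodelist->item(0)), "\n";
}
```

The predicate [@id='today_current'] keeps only the div with that id, so the header div is ignored.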

I can refine the XPath expression to select only a subset of that frame, reaching levels of granularity that are far beyond the scope of the most monstrous regex, while keeping my code perfectly readable. For a full reference on XPath, see <http://www.w3.org/TR/xpath/>.
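Two ways of narrowing the selection are sketched here, again on a made-up stand-in document: a relative query that starts from an already selected node, and a single longer expression evaluated directly to a string. The class names "temp" and "sky" are assumptions, not taken from the real page.

```php
<?php
// Hypothetical stand-in markup; the class names are invented.
$htmlfile = <<<'HTML'
<html><body>
<div id="today_current">
  <span class="temp">12</span>
  <span class="sky">Cloudy</span>
</div>
<div id="tomorrow"><span class="temp">15</span></div>
</body></html>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmlfile);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// 1) Select the frame, then query relative to it (note the leading ".").
$frame = $xpath->query("//div[@id='today_current']")->item(0);
$temp = $xpath->query(".//span[@class='temp']", $frame)->item(0)->textContent;

// 2) One expression, evaluated straight to a string value.
$sky = $xpath->evaluate("string(//div[@id='today_current']/span[@class='sky'])");

echo $temp, " / ", $sky, "\n";
```

The relative query only looks inside the selected frame, so the temperature in the second div is not picked up.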

My conclusion is clear: if you want to dramatically improve your productivity in processing HTML pages, it is high time to upgrade from regex to XML processing.