
Retrieve specific parts of HTML documents without the hassle of monster regular expressions

Browsing the questions asked of the Experts on this forum, you will be amazed at how often people struggle with monster regular expressions (regex) to extract a specific part of an HTML or XML file. The examples in this article are written in PHP.

Even when their regex seems to work for their case, they come to realize that it is very hard to write a regex that is safe in all cases. For example, to match the beginning of a paragraph, which is one of the simplest things you could imagine, you would try to match "<p>", wouldn't you?

Well, in that case you are going to miss <P>, as it was commonly written in early versions of HTML. I can already hear the answer of the regex fans: you only have to adapt your pattern to match "<[pP]>".

What happens when the author of the HTML makes use of the freedom to add spaces between "<p" and the closing ">", as in "<p >"? Regex is not defeated yet: you can rescue the code by using "<[pP] *>" as the pattern.

And what if I also want to match <p some-attribute="some-string"> along with <p> without attributes? No problem: the regex becomes "<[pP][^>]*>". Pretty simple, no? But it doesn't work when one of the attribute strings contains a ">" character, and if the input is processed line by line it also fails when the opening < and the closing > are on different lines. That's really the point: XML and HTML files are not organised line by line, or as flat text, the way regex-based tools expect.
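To make the failure concrete, here is a minimal sketch. The sample string is invented for illustration; it also includes a <pre> element to show a further weakness of the pattern that the text above does not mention, namely that it matches any tag whose name merely starts with "p". The comparison uses PHP's DOMDocument parser.

```php
<?php
// Hypothetical sample: a paragraph whose title attribute contains ">",
// followed by a <pre> element that is not a paragraph at all.
$html = '<p title="a > b">first</p><pre>code</pre>';

// The pattern from the text above.
preg_match_all('/<[pP][^>]*>/', $html, $matches);
print_r($matches[0]); // the <p> tag is cut at the first ">", and <pre> matches too

// For comparison, let a real HTML parser find the paragraphs.
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$paragraphs = $doc->getElementsByTagName('p');

echo $paragraphs->length, "\n";                         // 1
echo $paragraphs->item(0)->getAttribute('title'), "\n"; // a > b
```

The regex returns the truncated string `<p title="a >` plus the unrelated `<pre>`, while the parser reports exactly one paragraph with its attribute intact.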

It is better to stop here: I hope you are starting to see that using regex pattern matching on HTML/XML files is like using a knife as a screwdriver. It works, but you have to be very skilled to avoid breaking anything, and you will lose a lot of time, if you don't hurt yourself.

The right tool for selecting any part of an XML document is XPath. And an HTML document can easily be turned into an XML tree thanks to the DOM document classes that we can find in most programming languages. I will illustrate how to do this with PHP.

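Assuming $htmlfile contains the HTML source of the document you want to parse, as a string, you can build an XML tree with the following PHP instructions (to load directly from a file path, use loadHTMLFile() instead):
// Collect parser warnings about sloppy markup instead of printing them
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmlfile);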


Note: all HTML tag names are converted to lowercase, as in XHTML, so your XPath queries must use lowercase names.
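A quick way to check this note is to load a document written in old-style uppercase and look at the resulting node names. This is only a sketch: the sample markup and the file name logo.png are made up for the example.

```php
<?php
// Hypothetical input written with uppercase tags and attributes.
$htmlfile = '<HTML><BODY><P>Hello</P><IMG SRC="logo.png" ALT="Logo"></BODY></HTML>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($htmlfile);

// Look the elements up by their lowercase names.
$p   = $doc->getElementsByTagName('p')->item(0);
$img = $doc->getElementsByTagName('img')->item(0);

echo $p->nodeName, "\n";              // p
echo $img->getAttribute('src'), "\n"; // logo.png
```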

Let us suppose we want to get all the images (IMG). You can select the corresponding list of nodes from the tree by doing:
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//img");


The query syntax here, query("//img"), is far more transparent than the corresponding regex, and it works even when the IMG tag spans several lines of text.
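As a small illustration of that last point, the sketch below runs the same "//img" query against an invented document in which one image tag is spread over several lines and another is written in uppercase. The file names are placeholders.

```php
<?php
// Hypothetical document: the first <img> spans four lines,
// the second one uses uppercase and single quotes.
$htmlfile = "<html><body>\n<img\n   src=\"a.png\"\n   alt=\"first\">\n"
          . "<p>text <IMG SRC='b.png'></p>\n</body></html>";

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($htmlfile);

$xpath = new DOMXPath($doc);

// Collect the src attribute of every image, wherever it sits in the tree.
$sources = [];
foreach ($xpath->query('//img') as $img) {
    $sources[] = $img->getAttribute('src');
}

echo implode(', ', $sources), "\n"; // a.png, b.png
```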

Before closing this article, here is how you can print the selected parts of the HTML file in a loop:
foreach ($nodelist as $node) {
	echo $doc->saveXML($node);
}


The versatility of XPath is huge. Not only can you select a specific tag, as in the example above with "//img", you can also really navigate within the document you are parsing.

If, for example, I want to process http://fr.weather.com/weather/today-Bruxelles-BEXX0005, I can select the frame with the current conditions with the following XPath expression:

$xpath->query("//div[@id='today_current']");


I can refine the XPath to select only a subset of that frame, reaching levels of granularity that are far beyond the scope of the most monstrous regex, while keeping my code perfectly readable. For a full reference on XPath, see <http://www.w3.org/TR/xpath/>.
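Here is a rough sketch of what such a refinement could look like. The markup below is invented for the example and is not the real structure of the weather page; only the id today_current comes from the query above, while the class names temp and desc are assumptions.

```php
<?php
// Hypothetical fragment standing in for the downloaded page.
$htmlfile = <<<'HTML'
<html><body>
<div id="today_current">
  <span class="temp">12</span>
  <span class="desc">Cloudy</span>
</div>
<div id="tomorrow"><span class="temp">15</span></div>
</body></html>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($htmlfile);
$xpath = new DOMXPath($doc);

// One expression: the temperature span, but only inside the current-conditions frame.
$temp = $xpath->query("//div[@id='today_current']//span[@class='temp']")
              ->item(0)->textContent;

// Alternatively, select the frame once and run relative queries against it.
$frame = $xpath->query("//div[@id='today_current']")->item(0);
$desc  = $xpath->query(".//span[@class='desc']", $frame)->item(0)->textContent;

echo $temp, ' / ', $desc, "\n"; // 12 / Cloudy
```

Passing the frame as the second argument of query() makes it the context node, so the leading "." keeps the search inside that frame and the temperature of the other div is never considered.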

My conclusion is clear: if you want to dramatically improve your productivity in processing HTML pages, it is time to upgrade from regex to XML processing.