Community Pick: Many members of our community have endorsed this article.
Editor's Choice: This article has been selected by our editors as an exceptional contribution.

A Simple Linux script to retrieve information from the web using xpath selection

Pierre François, Senior consultant
CERTIFIED EXPERT
Have you ever been frustrated by having to click seven times to retrieve a small bit of information from the web, always the same seven clicks, scrolling down and down until you reach your target? If you know the benefits of the command line interface on systems like Unix, Linux, BSD, and so on, you know that scripting with bash spares you the hassle of repeating tasks. But is this possible when the information has to come from a webpage?

A simple example taken from real life

I will describe how to get such information from the Internet. Suppose that I need to know when it is high tide in Antwerp. You can get that information from a table that you can find at http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450, where 6450 is the code for the city of Antwerp.

By inspecting the HTML code of this page, you can discover that the information you are looking for sits inside a div tag whose class attribute is "tides". If you use Firefox, you can easily find it by right-clicking on the information you want to locate and selecting Inspect Element (Q) in the pop-up menu.

Extracting information from a webpage

The program I use to extract part of an HTML page is xmllint, which is part of the libxml2-utils package in the Ubuntu distribution. Its --xpath option lets you describe which part of the HTML file you want to select. We want to select the content of the webpage under the div tag whose class attribute is set to the string "tides".

Another command I use is wget, which fetches a webpage without invoking a browser. I will use it with the -q flag to suppress its progress messages and with -O - to send the page to the standard output:

$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450
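As a small sketch of this step, the URL can be kept in a quoted variable and the download wrapped in a helper. The names url, fetch_page and page.html are assumptions made for illustration and do not appear in the original command. Quoting is a precaution because ? is a wildcard character for the shell.

```shell
# Quote the URL so the shell never tries to expand "?" as a wildcard.
url='http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450'

# fetch_page (hypothetical helper): print the page given as $1 on standard output.
fetch_page() {
    wget -q -O - "$1"
}

# Saving one copy to a file (page.html is an arbitrary name) lets you
# experiment with xpath expressions without downloading the page each time:
#   fetch_page "$url" > page.html
```

Calling fetch_page "$url" then behaves like the wget command above.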


Filtering the result with an xpath

From the output of this command, I now have to select what to extract with an xpath. The syntax of xpath is explained on a lot of websites, such as http://en.wikipedia.org/wiki/XPath or http://www.w3schools.com/xpath/xpath_syntax.asp. To use xmllint with HTML as input, you need to add the flag --html, and to read from the standard input, you have to specify - as filename. To connect the standard output of the first command to the standard input of the second one, you just separate these commands with the operator "|". To keep the second command as quiet as the first, I redirect its standard error output to the null device /dev/null, which causes the error messages to be ignored. Combining these commands gives us the following command line:

$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '//div[@class = "tides"]' - 2>/dev/null

If you execute this command, you will see that you get too much output. First of all, there are two div tags with the attribute class set to the string "tides": the first for the tides of today, the second for the tides of tomorrow. This is not a problem, because the syntax of xpath gives you all the granularity you need to fine-tune your selection. We will add a predicate [1] to say that we only want the content under the first tag:

$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '//div[@class = "tides"][1]' - 2>/dev/null

But this gives the same result as the previous selection. Digging deeper into the syntax of xpath, you quickly understand that parentheses are missing: the predicate [1] is applied relative to each parent, and since each div tag (with attribute class="tides") is the first such div under its own parent, both are selected. Add parentheses ( ... )[1] to select only the first instance in the whole document:

$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath ' ( //div[@class = "tides"] ) [1]' - 2>/dev/null
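The effect of the parentheses can be checked offline on a tiny document. The sketch below is only an illustration: the sample HTML and the helper name count_nodes are made up, and the real page is certainly more complex. It uses the xpath function count() so that xmllint prints a number rather than markup.

```shell
# Hypothetical miniature of the page: two "tides" blocks, each one the
# first such div inside its own parent.
sample='<html><body>
<div id="today"><div class="tides"><p>today</p></div></div>
<div id="tomorrow"><div class="tides"><p>tomorrow</p></div></div>
</body></html>'

# count_nodes (hypothetical helper): print how many nodes the xpath in $1 selects.
count_nodes() {
    printf '%s' "$sample" | xmllint --html --xpath "count($1)" - 2>/dev/null
}

count_nodes '//div[@class = "tides"][1]'      # [1] is evaluated once per parent
count_nodes '(//div[@class = "tides"])[1]'    # [1] is evaluated on the whole set
```

On this sample, the first call should report 2 nodes and the second 1, which mirrors the behaviour described above.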

We are getting close to exactly what we want. We are only interested in the tides in Antwerp, not those of Ostend, so we refine our xpath to ask for the contents of the p tags following the div tag whose title attribute is set to "Marées Anvers", which is done by adding /div[@title="Marées Anvers"]/following-sibling::p to the xpath:

$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '(//div[@class = "tides"])[1]/div[@title="Marées Anvers"]/following-sibling::p' - 2>/dev/null

This is the output: 

Hoogtij: <strong>08:24</strong>  <strong>20:41</strong>
Laagtij: <strong>02:56</strong>  <strong>02:56</strong>

Hoogtij means high tide, Laagtij low tide. So we still have to exclude low tide from the output, which we can do as follows:

$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '(//div[@class = "tides"])[1]/div[@title="Marées Anvers"]/following-sibling::p[text()[1]="Hoogtij: "]/strong' - 2>/dev/null
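To see what the predicate text()[1]="Hoogtij: " does, it can be tried on a small sample. This is a sketch under assumptions: the sample HTML, its ASCII title "Tides Antwerp" and the helper name pick are invented for the illustration, and the times are simply copied from the output shown earlier.

```shell
# Hypothetical sample with the same shape as the output shown earlier.
sample='<html><body><div class="tides">
<div title="Tides Antwerp"></div>
<p>Hoogtij: <strong>08:24</strong> <strong>20:41</strong></p>
<p>Laagtij: <strong>02:56</strong> <strong>02:56</strong></p>
</div></body></html>'

# pick (hypothetical helper): evaluate the xpath in $1 against the sample.
pick() {
    printf '%s' "$sample" | xmllint --html --xpath "$1" - 2>/dev/null
}

p='//div[@title="Tides Antwerp"]/following-sibling::p'
high='[text()[1]="Hoogtij: "]'    # keep only the p whose first text node matches

pick "count(${p}/strong)"           # strong tags of both paragraphs
pick "count(${p}${high}/strong)"    # strong tags of the high tide paragraph only
```

On this sample, the two counts should be 4 and 2: the predicate keeps the p tag whose first text node is exactly "Hoogtij: ", trailing blank included.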

This gives us the following result:

<strong>08:24</strong> <strong>20:41</strong>

which is very close to what we want to get. I recommend that you do not remove the strong tags around your results by means of the xpath, as you would have no separator between the two values; it is better to add a filter to the output to remove these tags.

Making little changes with tr

The way we choose is to delete some characters from the output, and this is where the tr command comes in.
It is mainly used as a filter, which means that it expects input on the standard input stream and writes a slightly modified output to the standard output stream.

The main use of tr is to transliterate one set of characters into another. If you use tr without options, it expects two strings as arguments, like in the following example:

$ tr AB ab

will transliterate all the instances of A into a, and of B into b. For example:

$ echo 'Hello, World!' | tr A-Z a-z

will produce:

hello, world!

You will see that instead of writing 'abcde'... until 'z', we have used the shortcut 'a-z'.
The command tr can also be used with the -d option to delete a certain set of characters from the stream. An example will show how this works:

$ echo 'Some #@* string to be (!) filtered from punctuation...' | tr -d '#@*()!?'

will produce

Some  string to be  filtered from punctuation...
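The deletion leaves double blanks behind where the punctuation used to be. As a minimal sketch, the -s (squeeze) option of tr collapses each run of a repeated character into one; the function name clean is an assumption made for the illustration.

```shell
# clean (hypothetical name): delete the punctuation, then squeeze
# every run of blanks into a single blank.
clean() {
    tr -d '#@*()!?' | tr -s ' '
}

echo 'Some #@* string to be (!) filtered from punctuation...' | clean
```

This prints the sentence with single blanks: Some string to be filtered from punctuation...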

We could also have squeezed the repeated blanks into one. If you are eager to know how, take a look at the manual. With the knowledge we have acquired of tr, we can now suppress every lowercase letter and the characters < and > from the output. Our one-liner now looks like:

$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '(//div[@class = "tides"])[1]/div[@title="Marées Anvers"]/following-sibling::p[text()[1]="Hoogtij: "]/strong' - 2>/dev/null | tr -d 'a-z<>'

Oops, we still do not reach the perfect result, since some slashes are still present in the output:

08:24/20:41/

No problem: we use the tr filter again, this time to translate each slash into a newline character:

$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '(//div[@class = "tides"])[1]/div[@title="Marées Anvers"]/following-sibling::p[text()[1]="Hoogtij: "]/strong' - 2>/dev/null | tr -d 'a-z<>' | tr '/' '\n'

Finally, we get the correct result:

08:24
20:41
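The two tr stages can be tried without touching the network by feeding them a captured line. The sketch below adds a third stage, which is an assumption on my side and not part of the command above: depending on the page and on the xmllint version, blanks may remain between the selected tags, and tr -d ' ' drops them. The name to_lines is hypothetical.

```shell
# to_lines (hypothetical name): the two tr stages of the one-liner,
# plus an extra stage that removes stray blanks.
to_lines() {
    tr -d 'a-z<>' | tr '/' '\n' | tr -d ' '
}

printf '%s\n' '<strong>08:24</strong> <strong>20:41</strong>' | to_lines
```

The two times come out one per line, with no leading blank in front of the second one.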


Conclusion

You need some knowledge of the syntax of xpath to get fast results this way, but once you have it, the powerful syntax of xpath makes it easy to retrieve data from complex webpages from within a bash script. And you do not even need a script: a one-line command is enough.
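For repeated use, the one-liner can be wrapped in a function inside a script. What follows is a sketch under assumptions, not the author's script: the name high_tide, its two parameters and the final grep stage are additions made for the illustration. The grep stage drops empty lines, so the result is the same whether xmllint separates the selected nodes or not.

```shell
#!/usr/bin/env bash
# high_tide (hypothetical name). Usage: high_tide URL XPATH
# Prints one time per line; returns non-zero if nothing was selected.
high_tide() {
    local url=$1 xpath=$2 times
    times=$(wget -q -O - "$url" \
        | xmllint --html --xpath "$xpath" - 2>/dev/null \
        | tr -d 'a-z<>' \
        | tr '/' '\n' \
        | tr -d ' ' \
        | grep .)            # keep non-empty lines only
    [ -n "$times" ] || return 1
    printf '%s\n' "$times"
}
```

It would be called with the quoted URL of the page as first argument and the final xpath of this article as second argument. The non-zero return value makes it possible to detect, in a larger script, that the layout of the page has changed and the xpath no longer matches.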

Comments (5)

Puspharaj Selvaraj, Sr. System engineer

Commented:
Initiative. Using this libxml2-utils library, can I automate such things with elinks?

Thanks,
Puspharaj
Pierre François, Senior consultant
CERTIFIED EXPERT

Author

Commented:
@Puspharaj

I don't understand how to use libxml2-utils with elinks.

As far as I understand, elinks is a browser that sends output to a terminal, a screen or a window. It is not possible to redirect that output to a command line filter like xmllint and tr. That is the reason why you have to work at the level of the command line interface (the shell) and use wget as http client.
Pierre François, Senior consultant
CERTIFIED EXPERT

Author

Commented:
@Puspharaj

After Googling a bit, I found that elinks can execute scripts written in Lua, and that there is a binding from Lua to the libxml2 libraries. In that way, you should be able to restrict the content of an HTML page to the specific information you want to retrieve.

Perhaps an opportunity for you to publish a very interesting article here.
magento, Senior Tech Lead
CERTIFIED EXPERT

Commented:
Very nice article, thanks.
CERTIFIED EXPERT

Commented:
I see endless possibilities with this method of getting data and storing it in a database. A very nice article!
