A simple Linux script to retrieve information from the web using xpath selection

Have you ever been frustrated by having to click seven times to retrieve a small bit of information from the web, always the same seven clicks, scrolling down and down until you reach your target? If you know the benefits of the command line interface of Unix-like systems such as Linux and BSD, you know that scripting with bash helps you avoid repetitive tasks. But is this still possible when you have to get your information from some webpage?

A simple example taken from real life

I will describe how to get such information from the Internet. Suppose that I need to know when it is high tide in Antwerp. That information is available in a table at http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450, where 6450 is the code for the city of Antwerp.

By inspecting the HTML code of this page, you can discover that the information you are looking for sits inside a div tag whose class attribute is "tides". If you use Firefox, you can easily find it by right-clicking on the information you want to locate and selecting Inspect Element (Q) in the pop-up menu.
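The relevant fragment of the page looked roughly like the sketch below. Note that this is a hypothetical reconstruction, pieced together from the xpath selections we will make later in this article, not a verbatim copy of the page:

<!-- hypothetical structure, inferred from the xpath expressions used below -->
<div class="tides">                          <!-- tides for today -->
  <div title="Marées Ostende">Oostende</div>
  <p>Hoogtij: ...</p>
  <p>Laagtij: ...</p>
  <div title="Marées Anvers">Antwerpen</div>
  <p>Hoogtij: <strong>08:24</strong> <strong>20:41</strong></p>
  <p>Laagtij: <strong>02:56</strong> <strong>02:56</strong></p>
</div>
<div class="tides">                          <!-- tides for tomorrow -->
  ...
</div>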

Extracting information from a webpage

The program I am using to extract part of an HTML page is xmllint, which is part of the libxml2-utils package in the Ubuntu distribution. The utility xmllint has an option called --xpath that lets you describe which part of the HTML file you want to select. We want to select the content of the webpage under the div tag whose class attribute is set to the string "tides".
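If you have never used it, here is a minimal, self-contained sketch of xmllint with --xpath; the inline document and its tags are invented for the example:

$ echo '<list><item>first</item><item>second</item></list>' | xmllint --xpath '//item[2]/text()' -
second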

Another command I use is wget, which fetches a webpage from the web without invoking a browser. I will use it with the -q flag to limit the output to what is strictly necessary, and with -O - to redirect the fetched page to the standard output:

$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450



Filtering the result with an xpath

From the output of this command, I now have to select what to extract with an xpath. The syntax of xpath is explained on many websites, such as http://en.wikipedia.org/wiki/XPath or http://www.w3schools.com/xpath/xpath_syntax.asp. To use xmllint with HTML as input, you need to add the flag --html, and to read from the standard input, you have to specify - as the filename. To connect the standard output of the first command to the standard input of the second, you just separate the commands with the pipe operator "|". To make the second command as quiet as the first, I redirect its standard error output to the null device /dev/null, so that the parser's error messages are discarded. Combining these commands gives us the following command line:

$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '//div[@class = "tides"]' - 2>/dev/null


If you execute this command, you will see that you get too much output. First of all, there are two div tags with the class attribute set to the string "tides": the first for the tides of today, the second for the tides of tomorrow. This is not a problem, because the xpath syntax gives you all the granularity you need to fine-tune your selection. We will add a predicate [1] stating that we only want the content under the first tag:

$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '//div[@class = "tides"][1]' - 2>/dev/null


But this gives the same result as the previous selection. Digging more into the syntax of xpath, you quickly understand why: the positional predicate is evaluated relative to each node's own parent, and since both div tags (with attribute class="tides") are each the first such child of their respective parents, both are selected. Add parentheses ( ... )[1] if you want to select only the first node of the complete result set:

$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath ' ( //div[@class = "tides"] ) [1]' - 2>/dev/null

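The difference is easier to see on a tiny invented document. Without parentheses, every d element that is the first match inside its own parent passes the test; with parentheses, the position applies to the complete result set:

$ xml='<r><s><d class="t">one</d></s><s><d class="t">two</d></s></r>'
$ echo "$xml" | xmllint --xpath '//d[@class="t"][1]' -    # selects both d elements: each is first in its own <s>
$ echo "$xml" | xmllint --xpath '(//d[@class="t"])[1]' -  # selects only <d class="t">one</d>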

We are getting close to exactly what we want. We are only interested in the tides in Antwerp, not those of Ostend, so we refine our xpath to ask for the contents of the p tags that follow the div tag whose title attribute is set to "Marées Anvers". This is done by appending div[@title="Marées Anvers"]/following-sibling::p to the xpath:

$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '(//div[@class = "tides"])[1]/div[@title="Marées Anvers"]/following-sibling::p' - 2>/dev/null

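Before looking at the result, here is what the following-sibling axis does in isolation: it selects the sibling nodes that come after the context node. A minimal sketch on an invented document:

$ echo '<r><d title="x"/><p>a</p><p>b</p></r>' | xmllint --xpath '//d[@title="x"]/following-sibling::p' -  # selects <p>a</p> and <p>b</p>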

The full pipeline now outputs:

Hoogtij: <strong>08:24</strong>  <strong>20:41</strong>
Laagtij: <strong>02:56</strong>  <strong>02:56</strong>


Hoogtij means high tide, Laagtij low tide. So we still have to exclude low tide from the output, which we can do as follows:

$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '(//div[@class = "tides"])[1]/div[@title="Marées Anvers"]/following-sibling::p[text()[1]="Hoogtij: "]/strong' - 2>/dev/null


This gives us the following result:

<strong>08:24</strong> <strong>20:41</strong>


which is very close to what we want. I recommend that you do not remove the strong tags around the results by means of the xpath, as you would then have no separator left between the two values; it is better to add a filter after the pipeline to remove these tags.
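A quick sketch shows why: when an xpath returns the bare text nodes, xmllint prints them back to back, with nothing in between (document invented for the example):

$ echo '<p><b>08:24</b> <b>20:41</b></p>' | xmllint --xpath '//b/text()' -
08:2420:41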

Making little changes with tr

The approach we choose is to delete some characters from the output, and this is where the tr command enters.
It is mainly used as a filter, which means that it reads from the standard input stream and writes a slightly modified version to the standard output stream.

The main use of tr is to transliterate one set of characters into another. In its simplest form, tr expects two strings as arguments, like in the following example:

$ tr AB ab


will transliterate every instance of A into a, and of B into b. For example:

$ echo 'Hello, World!' | tr A-Z a-z


will produce:

hello, world!


You will see that instead of writing 'abcde' and so on up to 'z', we have used the range shortcut 'a-z'.
The command tr can also be used with the -d option to delete a certain set of characters from the stream. An example will show how this works:

$ echo 'Some #@* string to be (!) filtered from punctuation...' | tr -d '#@*()!?'


will produce

Some  string to be  filtered from punctuation...

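Incidentally, tr can also squeeze each run of repeated characters down to a single occurrence with its -s option; a minimal sketch:

$ echo 'Some  string to be  filtered' | tr -s ' '
Some string to be filtered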

That is how we could also have squeezed the repeated blanks of the previous output into one; take a look at the manual for more options. With the knowledge we have acquired of tr, we can now suppress every lowercase letter and the characters < and > from the output. Our one-liner now looks like:

$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '(//div[@class = "tides"])[1]/div[@title="Marées Anvers"]/following-sibling::p[text()[1]="Hoogtij: "]/strong' - 2>/dev/null | tr -d 'a-z<>'


Oops, we have still not reached the perfect result. The slash in each closing </strong> tag is neither a lowercase letter nor an angle bracket, so it survives the deletion, and some slashes are still present in the output:

08:24/20:41/


No problem: we are going to change these slashes into newline characters, again with the tr filter, this time translating the slash into a newline:

$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '(//div[@class = "tides"])[1]/div[@title="Marées Anvers"]/following-sibling::p[text()[1]="Hoogtij: "]/strong' - 2>/dev/null | tr -d 'a-z<>' | tr '/' '\n'


Finally, we get the correct result:

08:24
20:41


 

Conclusion

A certain knowledge of the xpath syntax is necessary to achieve fast results this way, but once you have it, you can easily retrieve data from complex webpages from within a bash script using this powerful syntax, and not even just from a script: a single command line will do.
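As a closing sketch, here is the complete one-liner wrapped in a small bash script; the file name hightide.sh and the function name hightide are of course my own invention, and the script assumes the page structure described in this article:

#!/bin/bash
# hightide.sh - print today's high-tide times for Antwerp (city code 6450).
# A sketch only: it relies on the page layout discussed above.

hightide() {
    local url='http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450'
    wget -q -O - "$url" \
        | xmllint --html --xpath \
            '(//div[@class = "tides"])[1]/div[@title="Marées Anvers"]/following-sibling::p[text()[1]="Hoogtij: "]/strong' \
            - 2>/dev/null \
        | tr -d 'a-z<>' \
        | tr '/' '\n'
}

hightide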
Author: pfrancois

6 Comments
Administrative Comment by Eric AKA Netminder:

Congratulations! Your article has been published, and has also been awarded EE-Approved status.

ericpete
Page Editor
Expert Comment by Puspharaj Selvaraj:

Nice initiative. Using this libxml2-utils library, can I automate such things with elinks?

Thanks,
Puspharaj
 
Author Comment by pfrancois:

@Puspharaj

I don't understand how to use libxml2-utils with elinks.

As far as I understand, elinks is a browser that sends its output to a terminal, a screen, or a window. It is not possible to redirect that output to a command-line filter like xmllint or tr. That is why you have to work at the level of the command-line interface (the shell) and use wget as the HTTP client.
Author Comment by pfrancois:

@Puspharaj

After Googling a bit, I found that elinks can execute scripts written in Lua, and that there are Lua bindings for the libxml2 libraries. That way, you should be able to restrict the content of an HTML page to the specific information you want to retrieve.

Perhaps an opportunity for you to publish a very interesting article here.
 
Expert Comment by magento:

Very nice article, thanks.
 
Expert Comment by Davy Paridaens:

I see endless possibilities with this method of getting data and storing it in a database. A very nice article!
