$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450
$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '//div[@class = "tides"]' - 2>/dev/null
If you execute this command, you will see that you get too much output. First of all, there are two div tags with the class attribute set to the string "tides": the first for today's tides, the second for tomorrow's. This is not a problem, because XPath syntax gives you all the granularity you need to fine-tune your selection. We will add a predicate [1] to say that we only want the content under the first tag:
$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '//div[@class = "tides"][1]' - 2>/dev/null
But this gives the same result as the previous selection. Digging deeper into XPath syntax, you quickly understand that you forgot to add parentheses: the predicate [1] is evaluated for each div relative to its own parent, and since both div tags (with attribute class="tides") are the first such child of their respective parents, both are selected. Add parentheses, ( ... )[1], if you want to select only the first instance in the whole document:
$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath ' ( //div[@class = "tides"] ) [1]' - 2>/dev/null
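The difference between the two expressions is easier to see on a small local file than on the live page. The sketch below assumes xmllint is installed. The HTML fragment and the variable names are made up for illustration: they only mimic the situation described above, with two div tags of class "tides" that are each the first such child of their own parent. It uses the XPath count() function so that each result is a single number.

```shell
# Hypothetical miniature page: two "tides" divs, each under its own parent.
page=$(mktemp)
cat > "$page" <<'EOF'
<html><body>
<div id="today"><div class="tides"><p>today</p></div></div>
<div id="tomorrow"><div class="tides"><p>tomorrow</p></div></div>
</body></html>
EOF

# Without parentheses, [1] is evaluated per parent, so both divs qualify.
n_plain=$(xmllint --html --xpath 'count(//div[@class="tides"][1])' "$page" 2>/dev/null)

# With parentheses, [1] picks the first node of the whole result set.
n_paren=$(xmllint --html --xpath 'count((//div[@class="tides"])[1])' "$page" 2>/dev/null)

echo "without parentheses: $n_plain"
echo "with parentheses:    $n_paren"
rm -f "$page"
```

On this fragment the first count should be 2 and the second 1, which is the same behaviour observed on the real page.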
We are getting close to exactly what we want. We are only interested in the tides in Antwerp, not those of Ostend, so we refine our XPath to ask for the contents of the p tags following the div tag whose title attribute is set to "Marées Anvers". This is done by adding /div[@title="Marées Anvers"]/following-sibling::p:
$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '(//div[@class = "tides"])[1]/div[@title="Marées Anvers"]/following-sibling::p' - 2>/dev/null
This is the output:
Hoogtij: <strong>08:24</strong> <strong>20:41</strong>
Laagtij: <strong>02:56</strong> <strong>02:56</strong>
Hoogtij means high tide, Laagtij low tide. We still have to exclude the low tide from the output, which we can do as follows:
$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '(//div[@class = "tides"])[1]/div[@title="Marées Anvers"]/following-sibling::p[text()[1]="Hoogtij: "]/strong' - 2>/dev/null
This gives us the following result:
<strong>08:24</strong> <strong>20:41</strong>
which is very close to what we want. I recommend that you do not remove the strong tags around your results by means of the XPath, as you would then have no separator between the two values; it is better to add a filter to the output to remove these tags. The tr command is such a filter:
$ tr AB ab
will transliterate all the instances of A into a, and of B into b. For example:
$ echo 'Hello, World!' | tr A-Z a-z
will produce:
hello, world!
You will see that instead of writing 'abcde'... up to 'z', we have used the shortcut 'a-z'. With the -d option, tr deletes the listed characters instead of translating them:
$ echo 'Some #@* string to be (!) filtered from punctuation...' | tr -d '#@*()!?'
will produce
Some  string to be  filtered from punctuation...
We could also have squeezed the repeated blanks into one. If you are eager to know how, take a look at the manual. With what we now know about tr, we can delete every lowercase letter and the characters < and > from the output. Our one-liner now looks like:
$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '(//div[@class = "tides"])[1]/div[@title="Marées Anvers"]/following-sibling::p[text()[1]="Hoogtij: "]/strong' - 2>/dev/null | tr -d 'a-z<>'
Oops, we still do not reach the perfect result, since some slashes are still present in the output:
08:24/20:41/
No problem: we change these slashes into newline characters, again with the tr filter, this time translating instead of deleting:
$ wget -q -O - http://www.skynet.be/nieuws-sport/weer/mijn-weer?cityId=6450 | xmllint --html --xpath '(//div[@class = "tides"])[1]/div[@title="Marées Anvers"]/following-sibling::p[text()[1]="Hoogtij: "]/strong' - 2>/dev/null | tr -d 'a-z<>' | tr '/' '\n'
Finally, we get the correct result:
08:24
20:41
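The two tr stages can be tried without touching the network by feeding them a saved copy of what the xmllint stage prints. The sketch below is only an illustration under stated assumptions: the sample string is copied from the output shown earlier, and clean_times is a hypothetical helper name, not part of any tool. It also adds two small safeguards that the one-liner above does not have: blanks between the tags are deleted, and empty lines are dropped.

```shell
# Hypothetical sample of what the xmllint stage prints (no network needed).
sample='<strong>08:24</strong><strong>20:41</strong>'

# clean_times: hypothetical helper; reads xmllint output on stdin and
# prints one time per line.
clean_times() {
    tr -d 'a-z<>' |   # delete lowercase letters and angle brackets
    tr '/' '\n' |     # each leftover slash ends one value
    tr -d ' ' |       # delete blanks that may separate the tags
    grep -v '^$'      # drop the empty lines left behind
}

times=$(printf '%s\n' "$sample" | clean_times)
printf '%s\n' "$times"
```

Note that tr treats 'a-z<>' as a set of single characters, not as a string, so this approach only works because the times themselves contain no lowercase letters.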
Comments (5)
Commented:
Thanks,
Puspharaj
Author
Commented: I don't understand how to use libxml2-utils with elinks.
As far as I understand, elinks is a browser that sends output to a terminal, a screen or a window. It is not possible to redirect that output to a command line filter like xmllint and tr. That is the reason why you have to work at the level of the command line interface (the shell) and use wget as http client.
Author
Commented: After Googling a bit, I found that elinks can execute scripts written in Lua, and that there is a binding from Lua to the libxml2 libraries. That way, you should be able to restrict the content of an HTML page to the specific information you want to retrieve.
Perhaps an opportunity for you to publish a very interesting article here.