How to XPath-query this HTML file from the command line?

Posted on 2013-06-08
Last Modified: 2013-06-14
Is there any way to extract the XPath "//div[@id='ps-content']" from the attached file test.htm in a single Windows command line?

I tried Saxon with this command line:
java -cp saxon9he.jar net.sf.saxon.Query -s:"test.htm" -qs:"//div[@id='ps-content']"
but it gave a strange error at line 62.

I tried a similar query in BaseX standalone, but it gave "(Line 62): Invalid character found: ')'."

Does a solution exist?
Question by:lucavilla
LVL 60

Expert Comment

by:Geert Bormans
ID: 39232672

I am suspicious about your HTML (which you did not attach, by the way).
The source for XPath must be well-formed XML, not HTML;
if your HTML is not well-formed XML, the XML parser will choke on it.

I can only comment after I have seen the HTML, but if it is not well-formed, consider one of the following:
- use a two-step process: first run TagSoup to turn your HTML into well-formed XML, then query the result
- use Saxon PE and the saxon:parse-html() function, which has TagSoup built in
- use a text-processing approach with Ruby or Python regexes

I favour the second option (but for that you need to buy Saxon PE; honestly, it will be the best $50 you ever spent, notably because you support the only industry-quality XSLT 2 processor around).
If you want to do the first in a single Windows command, you will need to wrap the two instructions in one .bat file.
The third option is a poor one, with a lot of risks, but completeness requires me to list it.
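To make the "chokes on non-well-formed XML" point concrete, here is a minimal Python sketch using only the standard library (the markup is an invented stand-in for the real page, not the actual Amazon HTML): an XML parser rejects browser-grade HTML with an unclosed &lt;br&gt;, while the same markup, once tidied TagSoup-style into well-formed XML, answers the attribute-predicate query from the question.

```python
import xml.etree.ElementTree as ET

# Typical browser-grade HTML: <br> is never closed, so this is not well-formed XML.
broken = "<html><body><div id='ps-content'>Hello<br>world</div></body></html>"
try:
    ET.fromstring(broken)
except ET.ParseError as e:
    # This is the kind of failure Saxon and BaseX reported at "line 62".
    print("XML parser chokes:", e)

# After a TagSoup-style cleanup step the same markup is well-formed,
# and the attribute predicate works as expected.
fixed = "<html><body><div id='ps-content'>Hello<br/>world</div></body></html>"
root = ET.fromstring(fixed)
div = root.find(".//div[@id='ps-content']")
print(div.attrib["id"])  # ps-content
```

The same failure mode is what both Saxon and BaseX are reporting in the question above: the error is raised by the XML parser before any XPath evaluation happens.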

Author Comment

ID: 39232886
Oops, sorry Gertone, I forgot to attach the HTML file, but after reading your excellent answer I decided to point you directly to the original of that HTML file, which is this web page:

Before searching for a command-line solution, I tested the XPath in the Google Chrome console on that page with this command: $x('//DIV[@id="ps-content"]'), and Chrome correctly returns the block of HTML that I expect.

I need a command-line solution because I want to add XPath-filtering ability to a commercial program (Website Watcher). It supports a simple scripting language which can, at best for this purpose, write a variable (e.g. the HTML of a given webpage) to a file, execute a command line, read a file (e.g. my XPath-reduced page) back into a variable, and then continue with other processing on that variable.

This commercial program also supports regular expressions, and indeed I have always done the HTML filtering with well-refined regular expressions. But now a crowd of developers tell me that XPath is better, partly because it handles malformed HTML for which I would otherwise have to build monster-looking regexes. So here I am... a little surprised to read that an XML parser "will choke" on XML that is not well-formed.

I can afford even a $100 solution if it works the way I like, so I'm very willing to try Saxon PE.
I found that XPath 3.0 (the latest XPath version) is supported on the command line by BaseX and Saxon, so can I assume they're the two best command-line XPath processors around?
Do you recommend Saxon PE over BaseX for my purpose? Do you think BaseX lacks an integrated equivalent of TagSoup and would require launching it separately, as in your point 1?

Thanks in advance for your great help, you're my illuminating genius in XML  :)

Expert Comment

by:Geert Bormans
ID: 39233098
Well, as I thought... you are trying to get some info from an HTML (not XML) Amazon webpage.
(Note: I have done a lot of Amazon crawling. They change the internals of the page very often, so you need to check regularly that the info is still where you expect it to be.)

I don't know too much about BaseX. It has TagSoup embedded for loading the database, but I am not sure whether you can set it as the default parser, nor whether it has a parse() function. Still, here is something you could try:

For Saxon PE, TagSoup is integrated.
I won't recommend it over BaseX (I don't have enough knowledge of BaseX to make recommendations), but I would use Saxon since I am familiar with it.
I suggest that you first try to switch the parser in BaseX as described in the referenced documentation.
If that doesn't work, get a trial license for Saxon and see how that behaves.

Author Comment

ID: 39239683
Hi Gertone, I finally found the right command line to launch the XPath extraction with Saxon PE; look here:

The only thing I'm still missing is how to apply that parse-html() function to that command line. Do you know how?

Accepted Solution

Geert Bormans earned 500 total points
ID: 39246935
Answered in the follow-up, I believe.


Question has a verified solution.


