
How can I run an XPath query against this HTML file from the command line?

Is there any way to evaluate the XPath "//div[@id='ps-content']" against the attached file test.htm in a single Windows command line?

I tried Saxon with this command:
java -cp "saxon9he.jar" net.sf.saxon.Query -s:"test.htm" -qs:"//div[id='ps-content']"
but it reported a strange error at line 62.

I tried a similar query in BaseX standalone, but it reported "(Line 62): Invalid character found: ')'".

Does a solution exist?
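
A side note on the Saxon command quoted above: its predicate is written [id='ps-content'], while the XPath in the question uses [@id='ps-content']. In XPath, a bare name in a predicate refers to a child element and @name refers to an attribute, so the two select different things. This may well be unrelated to the line 62 error. The minimal Python sketch below illustrates the difference with the standard-library ElementTree module, which supports only a small XPath subset; the sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample: one div carries an id attribute, the other an <id> child element.
doc = ET.fromstring(
    "<root>"
    "<div id='ps-content'>matched by attribute</div>"
    "<div><id>ps-content</id></div>"
    "</root>"
)

# @id tests the attribute named id.
by_attribute = doc.findall(".//div[@id='ps-content']")

# A bare id tests a child element named id.
by_child = doc.findall(".//div[id='ps-content']")

print(len(by_attribute), by_attribute[0].text)
print(len(by_child), by_child[0].find("id").text)
```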
1 Solution
Geert Bormans (Information Architect) commented:

I am suspicious about your HTML (which you did not attach, by the way).
The source for an XPath query should be well-formed XML, not HTML.
If your HTML is not well-formed XML, the XML parser will choke on it.
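
To make the point concrete, here is a minimal Python sketch. It uses the standard-library XML parser as a stand-in for the parsers inside Saxon or BaseX, which is an assumption for illustration only, and both sample strings are invented. The second string has an unquoted attribute value and an unclosed br tag, which browsers tolerate but an XML parser rejects.

```python
import xml.etree.ElementTree as ET

# Invented samples: the first is well-formed XML, the second is typical tag soup.
well_formed = "<html><body><div id='ps-content'><p>Hello</p></div></body></html>"
tag_soup = "<html><body><div id=ps-content><p>Hello<br></div></body></html>"

def try_parse(markup):
    """Return 'parsed' if the markup is well-formed XML, else the parser's complaint."""
    try:
        ET.fromstring(markup)
        return "parsed"
    except ET.ParseError as err:
        return f"rejected: {err}"

print(try_parse(well_formed))
print(try_parse(tag_soup))
```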

I can only comment properly once I have seen the HTML, but if it is not well-formed, consider one of the following:
- use a two-step process: first run TagSoup to turn your HTML into well-formed XML, then query the result
- use Saxon PE and the function saxon:parse-html(), which uses TagSoup internally
- use a text-processing approach with Ruby or Python regexes

I favour the second option. For that you need to buy Saxon 9 PE, but honestly it will be the best $50 you ever spent, not least because you support the only industry-quality XSLT 2.0 processor around.
If you want to do the first in a single Windows command, you will need to wrap the two instructions in one .bat file.
The third option is a poor one with a lot of risks, but completeness requires me to list it.
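
The two-step idea (repair the markup first, then query it) can be sketched in Python using only the standard library. This is not TagSoup or Saxon: html.parser stands in for the repair step, and ElementTree supports only a limited XPath subset, so treat it as an illustration under those assumptions. The names SoupToTree, extract and SAMPLE are hypothetical, and the sample markup is invented.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class SoupToTree(HTMLParser):
    """Step 1: build a well-formed element tree from lenient HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")
        self.stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        # Attributes without a value (e.g. "disabled") arrive as None.
        clean = {name: (value if value is not None else "") for name, value in attrs}
        node = ET.SubElement(self.stack[-1], tag, clean)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth].tag == tag:
                del self.stack[depth:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text after a child element belongs to that child's tail
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data

def extract(html_text, path=".//div[@id='ps-content']"):
    """Step 2: query the repaired tree and return each match as an XML string."""
    builder = SoupToTree()
    builder.feed(html_text)
    builder.close()
    results = []
    for node in builder.root.findall(path):
        node.tail = None  # do not serialize text that follows the element
        results.append(ET.tostring(node, encoding="unicode"))
    return results

SAMPLE = "<html><body><div id=ps-content><p>Hello<br>world</div><p>after</body></html>"
for fragment in extract(SAMPLE):
    print(fragment)
```

A script like this could read the page from a file and write the result to another file, which would fit a workflow that can only exchange data with an external command through files.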
lucavilla (Author) commented:
Oops, sorry Gertone, I forgot to attach the HTML file. After reading your excellent answer I decided to point you directly to the original of that HTML file, which is this web page: http://www.amazon.com/dp/1449319432

Before searching for a command-line solution I tested the XPath in the Google Chrome console on that page with the command $x('//DIV[@id="ps-content"]'), and Chrome correctly returned the block of HTML that I expected.

I need a command-line solution because I want to add XPath filtering to a commercial program (Website Watcher) that supports a simple scripting language. The best it can do for this purpose is write a variable (e.g. the HTML of a given web page) to a file, execute a command line, read a file (e.g. the page reduced by my XPath) back into a variable, and then continue processing that variable.

This commercial program also supports regular expressions, and until now I have always done the HTML filtering with well-refined regular expressions. But a crowd of developers now tell me that XPath is better, not least because it copes with malformed HTML, where I would have to build monstrous regexes to handle the same thing. So here I am, a little surprised to read that an XML parser "will choke" on anything that is not well-formed XML.

I can afford even a $100 solution if it works the way I want, so I am very willing to try Saxon PE.
I found that XPath 3.0 (the latest XPath version) is supported on the command line by BaseX and Saxon, so can I assume that they are the two best command-line XPath processors available?
Do you recommend Saxon (PE) over BaseX for my purpose? Do you think that BaseX lacks an integrated equivalent of TagSoup and would require launching it separately, as in your first option?

Thanks in advance for your great help, you're my illuminating genius in XML  :)
Geert Bormans (Information Architect) commented:
Well, as I thought: you are trying to get some information from an HTML (not XML) Amazon web page.
(Note: I have done a lot of Amazon crawling. They change the internals of the page very often, so you need to check regularly that the information is still where you expect it to be.)

I don't know too much about BaseX. It has TagSoup embedded for loading the DB,
but I am not sure whether you can set it as the default parser, or whether it has a parse() function,
though here is something you could try

For Saxon PE, TagSoup is integrated.
I won't recommend it over BaseX (I don't have enough knowledge to make a recommendation), but I would use Saxon since I am familiar with it.
I suggest that you try to switch the parser in BaseX as described in the referenced doc.
If that doesn't work, get a trial license from Saxon and see how that behaves.
lucavilla (Author) commented:
Hi Gertone, I finally found the right command line to launch the XPath extraction with Saxon PE, look here: http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/XML/Q_28154550.html

The only thing I am still missing is how to apply that parse-html() function in that command line. Do you know how?
Geert Bormans (Information Architect) commented:
Answered in the follow-up question, I believe.
Question has a verified solution.
