How to run an XPath query on this HTML file from the command line?

Posted on 2013-06-08
Medium Priority
Last Modified: 2013-06-14
Is there any way to extract the XPath "//div[@id='ps-content']" from the attached file test.htm in a single line on the Windows command line?

I tried Saxon with this line:
java -cp "saxon9he.jar" net.sf.saxon.Query -s:"test.htm" -qs:"//div[@id='ps-content']"
but it gave a strange error at line 62.

I tried a similar query in BaseX standalone, but it gave "(Line 62): Invalid character found: ')'".

Does a solution exist?
Question by:lucavilla
LVL 60

Expert Comment

by:Geert Bormans
ID: 39232672

I am suspicious about your HTML (which you did not attach, by the way).
The source for XPath should be well-formed XML, not HTML.
If your HTML is not well-formed XML, the XML parser will choke on it.
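
To make the well-formedness point concrete, here is a minimal sketch using Python's standard-library XML parser rather than Saxon or BaseX. The two sample strings are hypothetical and are not taken from the Amazon page; the only difference between them is that the second one has an unquoted attribute value and an unclosed <br>, which a browser tolerates and an XML parser does not.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample inputs, not taken from the real page.
WELL_FORMED = "<html><body><div id='ps-content'><p>Description</p></div></body></html>"
# Unquoted attribute value and a <br> that is never closed:
# acceptable to a browser, fatal to an XML parser.
TAG_SOUP = "<html><body><div id=ps-content><p>Description<br></div></body></html>"


def select_div(markup):
    """Parse markup as XML and return the div with id 'ps-content' (or None)."""
    root = ET.fromstring(markup)
    return root.find(".//div[@id='ps-content']")


def try_select(markup):
    """Return the serialized div, or the parser's complaint if parsing fails."""
    try:
        div = select_div(markup)
    except ET.ParseError as err:
        return f"XML parser rejected the input: {err}"
    if div is None:
        return "no match"
    return ET.tostring(div, encoding="unicode")


print(try_select(WELL_FORMED))  # the matching div, serialized
print(try_select(TAG_SOUP))     # a parse error message, no XPath result at all
```

The second call never reaches the XPath step, which is the same kind of failure as the "line 62" errors reported in the question: the parser gives up on the markup before any query is evaluated.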

I can only comment after I have seen the HTML, but if it is not well-formed, consider one of the following:
- use a two-step process: first use TagSoup to turn your HTML into well-formed XML, then run the XPath query on the result
- use Saxon PE and the function saxon:parse-html(), which has TagSoup built in
- use a text-processing approach with Ruby or Python regexes

I favour the second option (but for that you need to buy Saxon 9 PE; honestly, it will be the best $50 you ever spent, notably because you support the only industry-quality XSLT 2.0 processor around).
If you want to do the first in a single Windows command, you will need to wrap the two instructions in one .bat file.
The third option is a poor one, with a lot of risks, but completeness requires me to list it.
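
The two-step idea (repair the HTML first, then query it) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class SoupToTree and the function xpath_div are hypothetical names, the standard-library html.parser stands in for TagSoup and is far cruder than it, and ElementTree supports only a small XPath subset, not XPath 2.0 or 3.0.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# HTML elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SoupToTree(HTMLParser):
    """Step 1: build a well-formed element tree from sloppy HTML."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        # Synthetic wrapper so that stray top-level content still has a parent.
        self.builder.start("document", {})
        self.open_tags = ["document"]

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {k: v if v is not None else "" for k, v in attrs})
        if tag in VOID:
            self.builder.end(tag)      # close void elements immediately
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags[1:]:
            return                     # stray closing tag: ignore it
        while True:                    # close unclosed children first
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def tree(self):
        self.close()
        while self.open_tags:          # close whatever is still open
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def xpath_div(html_text, div_id):
    """Step 2: run the XPath-style query on the repaired tree."""
    parser = SoupToTree()
    parser.feed(html_text)
    return parser.tree().find(f".//div[@id='{div_id}']")


# Hypothetical tag soup: unquoted attribute, unclosed <p>, <br>, <b> and <html>.
soup = "<html><body><div id=ps-content><p>Description<br><b>bold</div><div>other</div></body>"
div = xpath_div(soup, "ps-content")
print(ET.tostring(div, encoding="unicode"))
```

With TagSoup and Saxon the shape is the same, only spread over two programs: one invocation writes the repaired XML to a file, the next one queries that file, which is why a .bat wrapper is needed for a single-command workflow.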

Author Comment

ID: 39232886
Oops, sorry Gertone, I forgot to attach the HTML file, but after reading your excellent answer I decided to point you directly to the original of that HTML file, which is this web page: http://www.amazon.com/dp/1449319432

Before searching for a command-line solution I tested the XPath in the Google Chrome console on that page with this command: $x('//DIV[@id="ps-content"]'), and Chrome correctly returned the block of HTML that I expected.

I need a command-line solution because I want to add XPath filtering to a commercial program (Website Watcher) that supports a simple scripting language. For this purpose, the most it can do is write a variable (e.g. the HTML of a given web page) to a file, execute a command line, read a file (e.g. the page reduced by my XPath) back into a variable, and continue with other processing on that variable.
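
The file-in, command, file-out workflow described here could look roughly like the sketch below. It is not XPath: it uses Python's tolerant standard-library HTML parser to copy out the raw markup of the first div with a given id, so it does not need well-formed input. The script name extract_div.py, the class DivById and the function extract_div are all hypothetical.

```python
import sys
from html.parser import HTMLParser


class DivById(HTMLParser):
    """Collect the raw markup of the first <div> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__(convert_charrefs=False)
        self.target_id = target_id
        self.depth = 0      # nesting depth of <div> inside the match
        self.parts = []     # markup pieces of the match, in order
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            if tag != "div" or dict(attrs).get("id") != self.target_id:
                return
            self.depth = 1
        elif tag == "div":
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self.depth and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth == 0 or self.done:
            return
        self.parts.append(f"</{tag}>")
        if tag == "div":
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.depth and not self.done:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth and not self.done:
            self.parts.append(f"&#{name};")


def extract_div(html_text, target_id):
    """Return the markup of the matching div, or '' if there is none."""
    parser = DivById(target_id)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


# Command-line use: python extract_div.py INPUT_FILE DIV_ID OUTPUT_FILE
if __name__ == "__main__" and len(sys.argv) == 4:
    in_path, div_id, out_path = sys.argv[1:]
    with open(in_path, encoding="utf-8", errors="replace") as src:
        result = extract_div(src.read(), div_id)
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write(result)
```

Assuming the script is saved as extract_div.py, a call such as "python extract_div.py test.htm ps-content out.htm" would fit the write-file, run-command, read-file cycle. A real XPath engine remains the better choice when the selection is more complex than a single id lookup.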

This commercial program also supports regular expressions, and indeed I have always done the HTML filtering with well-refined regular expressions. But now a crowd of developers tell me that XPath is better, partly because it supports malformed HTML, for which I would have to build monster-looking regexes. So here I am, a little surprised to read that an XML parser "will choke" on anything that is not "well-formed XML".

I can afford even a $100 solution if it works the way I like, so I'm very willing to try Saxon PE.
I found that XPath 3.0 (the latest XPath version) is supported on the command line by BaseX and Saxon, so can I assume that they're the two best command-line XPath processors on the Internet?
Do you recommend Saxon (PE) over BaseX for my purpose? Do you think that BaseX doesn't have an integrated equivalent of TagSoup and would require a separate launch of it, as in your point 1?

Thanks in advance for your great help, you're my illuminating genius in XML  :)
LVL 60

Expert Comment

by:Geert Bormans
ID: 39233098
Well, as I thought... you are trying to get some info from an HTML (not XML) Amazon web page.
(Note: I have done a lot of Amazon crawling; they change the internals of the page very often, so you need to check regularly that the info is still where you expect it to be.)

I don't know too much about BaseX. It has TagSoup embedded for loading the DB,
but I am not sure whether you can set it as the default parser, or whether it has a parse() function,
though here is something you could try

For Saxon PE, TagSoup is integrated.
I won't recommend it over BaseX (I don't have enough knowledge to make recommendations), but I would use Saxon since I am familiar with it.
I suggest that you try to switch the parser in BaseX as described in the referenced doc.
And if that doesn't work, get a trial license for Saxon and see how that behaves.

Author Comment

ID: 39239683
Hi Gertone, I finally found the right command line to launch the XPath extraction with Saxon-PE; look here: http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/XML/Q_28154550.html

The only thing I'm still missing is how to apply that parse-html() function on that command line. Do you know how?
LVL 60

Accepted Solution

Geert Bormans earned 2000 total points
ID: 39246935
Answered in the follow-up, I believe.
