Solved

How to run an XPath query on this HTML file from the command line?

Posted on 2013-06-08
Last Modified: 2013-06-14
Is there any way to extract the XPath "//div[@id='ps-content']" from the attached file test.htm in a single line on the Windows command line?

I tried Saxon with this line:
java -cp "saxon9he.jar" net.sf.saxon.Query -s:"test.htm" -qs:"//div[id='ps-content']"
but it gave a strange error at line number 62.

I tried a similar query in BaseX standalone, but it gave "(Line 62): Invalid character found: ')'".

Does a solution exist?
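
Editor's note: the Saxon attempt above writes the predicate as [id='ps-content'], while the question asks for [@id='ps-content']. The two mean different things in XPath: without the @, the predicate looks for a child element named id rather than an attribute. The following is a minimal sketch of that difference using Python's standard xml.etree.ElementTree on a small made-up document (the sample markup is an assumption, not the asker's file). It does not explain the line 62 error, which comes from parsing the input.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one div carries an id ATTRIBUTE,
# the other contains an <id> CHILD ELEMENT with the same text.
doc = ET.fromstring(
    "<html><body>"
    "<div id='ps-content'>wanted</div>"
    "<div><id>ps-content</id>other</div>"
    "</body></html>"
)

# [@id='...'] tests the attribute named id.
by_attribute = doc.findall(".//div[@id='ps-content']")

# [id='...'] tests a child element named id and its text.
by_child = doc.findall(".//div[id='ps-content']")

print([div.text for div in by_attribute])
print([div.find("id").text for div in by_child])
```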
Question by:lucavilla
LVL 60

Expert Comment

by:Geert Bormans
ID: 39232672
Hi,

I am suspicious about your HTML (which you did not attach, by the way).
The source for XPath should be well-formed XML, not HTML.
If your HTML is not well-formed XML, the XML parser will choke on it.
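
Editor's note: a minimal sketch of what "choking" looks like, using Python's standard XML parser rather than Saxon or BaseX. The markup is a made-up fragment (an assumption, not the asker's file) containing an unclosed <br>, which is legal HTML but not well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: <br> is never closed, as is common in HTML.
html = "<html><body><div id='ps-content'>Price<br>In stock</div></body></html>"


def try_parse(markup):
    """Return 'parsed' or a short description of the parse failure."""
    try:
        ET.fromstring(markup)
        return "parsed"
    except ET.ParseError as err:
        return f"not well-formed: {err}"


# The HTML form fails; the same content with <br/> is well-formed XML.
result_html = try_parse(html)
result_xhtml = try_parse(html.replace("<br>", "<br/>"))

print(result_html)
print(result_xhtml)
```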

I can only comment after I have seen the HTML, but if it is not well-formed, consider one of the following:
- use a two-step process, where the first step uses TagSoup to turn your HTML into well-formed XML
- use Saxon PE and the function saxon:parse-html(), which has TagSoup built in
- use a text-processing approach with Ruby or Python regexes

I favour the second option (for that you need to buy Saxon 9 PE; honestly, it will be the best $50 you ever spent, notably because you support the only industry-quality XSLT 2.0 processor around).
If you want to do the first in a single Windows command, you will need to wrap the two instructions in one .bat file.
The third option is a poor one, with a lot of risks, but completeness requires me to list it.
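
Editor's note: the two-step idea from the first option can be sketched in Python. This is not TagSoup and not Saxon; it swaps in Python's tolerant html.parser as the repair step and the limited XPath subset of xml.etree.ElementTree as the query step. The class name, the helper extract() and the sample markup are all hypothetical, and the repair logic is deliberately naive.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# HTML elements that never have a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class LenientTreeBuilder(HTMLParser):
    """Step 1: turn tag soup into an ElementTree (naive repair)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        element = ET.SubElement(self.stack[-1], tag,
                                {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(element)

    def handle_startendtag(self, tag, attrs):
        ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray ends.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):  # text after a child belongs to that child's tail
            current[-1].tail = (current[-1].tail or "") + data
        else:
            current.text = (current.text or "") + data


def extract(html, path):
    """Step 2: query the repaired tree and serialize each match as XML."""
    builder = LenientTreeBuilder()
    builder.feed(html)
    builder.close()
    results = []
    for element in builder.root.findall(path):
        element.tail = None  # do not serialize text that follows the match
        results.append(ET.tostring(element, encoding="unicode"))
    return results


# Hypothetical malformed input: unclosed <br>, <b>, <p> and no </html>.
sample = ("<html><body><div id='ps-content'>Price<br>In stock <b>now</div>"
          "<p>footer</body>")
matches = extract(sample, ".//div[@id='ps-content']")
print(matches)
```

A real page may defeat this repair step in ways TagSoup handles, so treat it as an illustration of the two-step shape only.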
 

Author Comment

by:lucavilla
ID: 39232886
Oops, sorry Gertone, I forgot to attach the HTML file. After reading your excellent answer, I decided to point you directly to the original of that HTML file, which is this web page: http://www.amazon.com/dp/1449319432

Before searching for a command-line solution, I tested the XPath in the Google Chrome console on that page with this command string: "$x('//DIV[@id="ps-content"]')", and Chrome correctly returns the block of HTML that I expect.

I need a command-line solution because I want to add XPath filtering to a commercial program (Website Watcher) that supports a simple scripting language. For this purpose, the best it can do is write a variable (e.g. the HTML of a given web page) to a file, execute a command line, read a file (e.g. the page reduced by my XPath) into a variable, and continue with other processing on that variable.

This commercial program also supports regular expressions, and so far I have always done the HTML filtering with well-refined regular expressions. But now a crowd of developers tell me that XPath is better, not least because it supports malformed HTML, where I would have to build monster-looking regexes to handle the same. So here I am, a little surprised to read that an XML parser "will choke" on XML that is not well-formed.
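
Editor's note: a minimal sketch of why regex filtering of nested markup tends to need "monster" patterns. The markup is a made-up fragment (an assumption, not the Amazon page). Neither the lazy nor the greedy pattern returns exactly the ps-content block, because a regular expression does not track which closing </div> belongs to which opening <div>.

```python
import re

# Hypothetical fragment: the wanted block contains nested divs
# and is followed by an unrelated sibling div.
html = (
    "<div id='ps-content'>"
    "<div class='price'>$25</div>"
    "<div class='stock'>In stock</div>"
    "</div><div id='footer'>About</div>"
)

# Lazy match: stops at the FIRST </div>, cutting the block short.
lazy = re.search(r"<div id='ps-content'>.*?</div>", html, re.S).group(0)

# Greedy match: runs to the LAST </div>, swallowing the footer too.
greedy = re.search(r"<div id='ps-content'>.*</div>", html, re.S).group(0)

print(lazy)
print(greedy)
```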

I can afford even a $100 solution if it works the way I like, so I'm very willing to try Saxon PE.
I found that XPath 3.0 (the latest XPath version) is supported on the command line by BaseX and Saxon, so can I assume that they are the two best command-line XPath processors available?
Do you recommend Saxon (PE) over BaseX for my purpose? Do you think BaseX lacks an integrated equivalent of TagSoup and would require launching it separately, as in your point 1?

Thanks in advance for your great help, you're my illuminating genius in XML  :)
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39233098
Well, as I thought, you are trying to get some info from an HTML (not XML) Amazon web page.
(Note: I have done a lot of Amazon crawling. They change the internals of the page very often, so you need to check regularly that the info is still where you expect it to be.)

I don't know too much about BaseX. It has TagSoup embedded for loading the DB,
but I am not sure whether you can set it as the default parser, or whether it has a parse() function.
Here is something you could try, though:
http://docs.basex.org/wiki/Parsers#Command_Line_2

For Saxon PE, TagSoup is integrated.
I won't recommend it over BaseX (I don't have enough knowledge to make recommendations), but I would use Saxon since I am familiar with it.
I suggest that you try to switch the parser in BaseX as described in the referenced doc.
If that doesn't work, get a trial license for Saxon and see how that behaves.
 

Author Comment

by:lucavilla
ID: 39239683
Hi Gertone, I finally found the right command line to launch the XPath extraction with Saxon-PE; look here: http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/XML/Q_28154550.html

The only thing I am still missing is how to apply that parse-html() function on that command line. Do you know how?
 
LVL 60

Accepted Solution

by:
Geert Bormans earned 500 total points
ID: 39246935
Answered in the follow-up question, I believe.