Solved

how to XPATH query this html file from command-line?

Posted on 2013-06-08
5
411 Views
Last Modified: 2013-06-14
Is there in this world a way to extract the XPATH "//div[@id='ps-content'] from the attached file test.htm in a single line of Windows command-line?

I tried with Saxon with this line:
"java -cp "saxon9he.jar" net.sf.saxon.Query -s:"test.htm" -qs:"//div[id='ps-content']"
but it gave a strange error in line number 62

I tried a similar query in Basex standalone but it gave "(Line 62): Invalid character found: ')'.

Does a solution exist?
0
Comment
Question by:lucavilla
  • 3
  • 2
5 Comments
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39232672
Hi,

I am suspicious about your HTML (which you did not attach by the way)
The source for XPATH should be wellformed XML, not HTML
if your HTML is not wellformed XML, the XML parser will choque on it,

I can only comment after I have seen the HTML, but if it is not wellformed consider one of the following
- have a two step process. 1st step use TagSoup to turn your html into wellformed xml
- use Saxon PE and the function saxon:parse() which actually does TagSoup in the box
- use a text processing approach with Ruby or Python regexes

I favour the second option (but for that you need to buy Saxon9PE, honestly, it will be the best 50$ you ever spent, notably because you support the only industry qualtity XSLT2 processor around)
If you want to do the first in a single windows command ... you will need to wrap two instructions in one bat file
The third option is a poor one, with a lot of risks, but completeness requires me to list it
0
 

Author Comment

by:lucavilla
ID: 39232886
ooppsss sorry Gertone I forgot to attach the HTML file but after reading your excellent answer I decided to point you directly to the original of that HTML file that is this web page: http://www.amazon.com/dp/1449319432

Before searching for a command-line solution I tested the XPATH in Google Chrome console over that page with this command string:  "$x('//DIV[@id="ps-content"]')" and Google Chrome correctly returns the block of HTML that I expect.

I need for a command-line solution because I need to add XPATH filtering ability in a commercial program (Website Watcher) that supports a simple scripting language where it can (at best, for this purpose) write a variable (eg. the html of a my given webpage) to a file, execute a command-line, read a file (eg. my wanted XPATH reduced page) to a variable... and continue with other processing over that variable...

This commercial program also supports regular expressions and indeed I always did the html filtering by well-refined regular expressions but now a crowd of developers tell me that XPATH is better even because it supports malformed html, where I should build monster-looking regexes to handle the same, so I'm here... a little surprise to read that an XML parser "will choque" on not "wellformed XML".

I can afford even a $100 solution if it works as I like so I'm very willing to try Saxon PE.
I found that XPATH 3.0 (last XPATH ver) is supported in command-line by BaseX and Saxon so can I assume that they're the two best command-line XPATH parsers on the Internet?
Do you recommend Saxon (PE) over BaseX for my purpose? do you think that BaseX doesn't have an integrated equivalent of TagSoup and would require a separate launch of it like in your point 1?

Thanks in advance for your great help, you're my illuminating genius in XML  :)
0
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39233098
Well, as I thought... you are trying to get some info from an html not XML, amazon webpage
(note: I have done a lot of amazon crawling, please note that they change the internals of the page very often, so you need to check regularly that the info is still where you expect it to be)

I don't know too much about baseX, it has TagSoup embedded for loading teh DB,
but I am not sure if you can set it as the default parser, nor it has a parse() function
though here is something you could try
http://docs.basex.org/wiki/Parsers#Command_Line_2

For Saxon PE, TagSoup is integrated.
I won't recommend it over BaseX (not enough knowledge to make recommendations), but I would use Saxon since I am familiar with that.
I suggest that you try to switch the parser in BaseX as is described in the referenced doc.
And if that doesn't work, get a trial licensen from saxon and see how that behaves
0
 

Author Comment

by:lucavilla
ID: 39239683
Hi Gertone, I finally found the right command-line to launch the XPATH extraction with Saxon-PE, look here: http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/XML/Q_28154550.html

What I only miss is to find how to apply that parse-html() function to that commandline.  Do you know it?
0
 
LVL 60

Accepted Solution

by:
Geert Bormans earned 500 total points
ID: 39246935
answered in follow up I believe
0

Featured Post

Gigs: Get Your Project Delivered by an Expert

Select from freelancers specializing in everything from database administration to programming, who have proven themselves as experts in their field. Hire the best, collaborate easily, pay securely and get projects done right.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Introduction In my previous article (http://www.experts-exchange.com/Microsoft/Development/MS-SQL-Server/SSIS/A_9150-Loading-XML-Using-SSIS.html) I showed you how the XML Source component can be used to load XML files into a SQL Server database, us…
I was working on a PowerPoint add-in the other day and a client asked me "can you implement a feature which processes a chart when it's pasted into a slide from another deck?". It got me wondering how to hook into built-in ribbon events in Office.
Sending a Secure fax is easy with eFax Corporate (http://www.enterprise.efax.com). First, just open a new email message. In the To field, type your recipient's fax number @efaxsend.com. You can even send a secure international fax — just include t…
Established in 1997, Technology Architects has become one of the most reputable technology solutions companies in the country. TA have been providing businesses with cost effective state-of-the-art solutions and unparalleled service that is designed…

813 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

17 Experts available now in Live!

Get 1:1 Help Now