
how to extract an XPATH from a malformed HTML page with Saxon-PE commandline

I would like to extract the XPATH //DIV[@id="ps-content"] out from this web page: http://www.amazon.com/dp/1449319432 (saved as a local file)

I would like to do it in a single command line with one of the best parsers, Saxon-PE.

So far the shortest solution that I (seemed to have) found is with these two lines:

java -jar tagsoup-1.2.1.jar <page.html >page.xhtml
java -cp saxon9pe.jar net.sf.saxon.Query -s:"page.xhtml" -qs:"//*:div[@id='ps-content']"


The first line (TagSoup) is necessary to turn the original malformed HTML into well-formed XML. However, I read that Saxon-PE has an embedded TagSoup capability (see http://saxonica.com/documentation9.4-demo/html/extensions/functions/parse-html.html). How can I integrate my two lines into a single line?
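
Editor's note: the `*:div` step in the query above matters because TagSoup, by default, places the elements it emits in the XHTML namespace, so an unprefixed `//div` step would not match them. The sketch below illustrates the effect with the JDK's built-in XPath 1.0 engine rather than Saxon. XPath 1.0 has no `*:div` syntax, so `local-name()` is used as a stand-in. The class name and sample markup are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Saxon or TagSoup.
final class NamespaceWildcardDemo {

    // Evaluates an XPath 1.0 expression against an XML string and
    // returns the string value of the result ("" if nothing matches).
    static String evaluate(String xml, String xpath) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // keep the XHTML namespace visible to XPath
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample: well-formed XHTML of the kind TagSoup would produce.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div id=\"ps-content\">abc</div></body></html>";
        // Unprefixed step: looks for a div in no namespace, finds nothing.
        System.out.println("[" + evaluate(xhtml, "//div[@id='ps-content']") + "]");
        // Namespace-agnostic step: matches the XHTML div.
        System.out.println("[" + evaluate(xhtml, "//*[local-name()='div'][@id='ps-content']") + "]");
    }
}
```

With this sample the first expression yields an empty string and the second yields `abc`.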

mccarl

Ok, this is a bit of a stab in the dark (I don't have a Saxon-PE version to test with) but I would think it would go along the lines of...

java -cp saxon9pe.jar net.sf.saxon.Query -qs:"saxon:parse-html(unparsed-text('page.html'))//*:div[@id='ps-content']"


lucavilla

ASKER

Thanks mccarl, it seems that it formally accepts it!
But it gives this error:
"Error on line 1 of *module with no systemId*:
Failed to load org.ccil.cowan.tagsoup.Parser
Query processing failed: Run-time errors were reported"

Note that TagSoup is an external component: I simply downloaded the file "tagsoup-1.2.1.jar" and copied it to the same folder as "saxon9pe.jar". Maybe I have to do something more...
How can I tell Java and/or Saxon to use that file?

mccarl

Ah yes, sorry that should have been...

java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query -qs:"saxon:parse-html(unparsed-text('page.html'))//*:div[@id='ps-content']"


Gertone (Geert Bormans)

maybe you need to put the jar on the classpath
if it is in the same dir, just put it after the -cp together with the saxon pe, separated by a semicolon on Windows

java -cp tagsoup-1.2.1.jar;saxon9pe.jar net.sf.saxon.Query

have you managed to run a simpler XPath over PE? because I assumed you needed a reference to the license file too
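
Editor's note: the classpath entries have to be joined with the platform's path separator, which is `;` on Windows and `:` on Unix-like systems. A minimal sketch of that rule, using the jar names from this thread (the class and method names are hypothetical):

```java
import java.io.File;

// Hypothetical helper, only meant to illustrate the separator rule.
final class ClasspathJoin {

    // Joins jar names with the given separator to form a -cp value.
    static String join(String separator, String... jars) {
        return String.join(separator, jars);
    }

    public static void main(String[] args) {
        // File.pathSeparator is ";" on Windows and ":" on Unix-like systems.
        String cp = join(File.pathSeparator, "saxon9pe.jar", "tagsoup-1.2.1.jar");
        System.out.println("java -cp " + cp + " net.sf.saxon.Query");
    }
}
```

Passing the two jars as separate, space-delimited arguments does not work, because `java` then treats the second jar name as the main class to run.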

Gertone (Geert Bormans)

it is funny by the way that saxon seems to assume the saxon prefix for the right namespace without requesting a namespace binding

lucavilla

ASKER

Thanks guys, I think you brought me one step closer to victory!
This command-line works:

java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query -qs:"saxon:parse-html('<div id="ps-content">abc</div>')//*:div[@id='ps-content']"

Now I only need to replace the HTML string with my file name. I tried various syntaxes without success. Any ideas?

PS: Gertone I already obtained and put in the same dir the file "saxon-license.lic" as you suggested :)

Gertone (Geert Bormans)

can't you get the file using unparsed-text()
you can't use doc since it is not wellformed
saxon:parse-html(unparsed-text('file:///c:/path/file.html'))//xpath
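
Editor's note: the reason `doc()` is ruled out is that it runs the input through an XML parser, which rejects markup that is not well-formed, whereas reading the file as plain text always succeeds. The sketch below shows the rejection with the JDK's own SAX parser, not Saxon. The class name and the sample strings are assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; it uses only the JDK's built-in XML parser.
final class WellFormedCheck {

    // Returns true if an XML parser accepts the markup, false if it rejects it.
    static boolean isWellFormed(String markup) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (Exception e) {
            // DefaultHandler rethrows fatal parse errors, so malformed input lands here.
            return false;
        }
    }

    public static void main(String[] args) {
        // Typical "tag soup": unquoted attribute value and an unclosed <br>.
        System.out.println(isWellFormed("<div id=ps-content>abc<br></div>"));
        // The same content written as well-formed XML.
        System.out.println(isWellFormed("<div id=\"ps-content\">abc<br/></div>"));
    }
}
```

The first call reports `false` and the second `true`, which is why the raw page has to go through something like TagSoup before an XPath can be applied to it.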

mccarl

The 'unparsed-text()' function that I gave in the previous posts should do what you want! What errors were you getting?

lucavilla

ASKER

With -qs:"saxon:parse-html(unparsed-text('test.htm'))//*:div[@id='ps-content']" I get this error:

Error on line 1 column 17
XPST0017 XQuery static error near #...l(unparsed-text('test.htm'))//#:
System function unparsed-text#1 is not available with this host language/version
Static error(s) in query

mccarl

Ahh, it appears that unparsed-text is only an XSLT function (not in XQuery 1.0). Try enabling XQuery 3.0 features (although I am finding mixed messages in the docs about whether Saxon-PE supports 3.0 or not)...

java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(fn:unparsed-text('page.html'))//*:div[@id='ps-content']"


(And the unparsed-text function may need to be qualified as above with the "fn:", try with it and without.)

Gertone (Geert Bormans)

I am not sure which 9PE you are using
but collection() recently got some properties
I believe collection('page.html;unparsed=yes')
works in both XSLT and XQuery

try:

java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.html;unparsed=yes'))//*:div[@id='ps-content']"

lucavilla

ASKER

Still no luck:

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(fn:unparsed-text('page.html'))//*:div[@id='ps-content']"
Error on line 1 column 17
XPST0017 XQuery static error near #...:unparsed-text('page.html'))//#:
System function unparsed-text#1 is not available with this host language/version
Static error(s) in query

___________

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"fn:parse-html(fn:unparsed-text('page.html'))//*:div[@id='ps-content']"
Error on line 1 column 14
XPST0017 XQuery static error near #...:unparsed-text('page.html'))//#:
System function unparsed-text#1 is not available with this host language/version
Static error(s) in query

___________

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.html;unparsed=yes'))//*:div[@id='ps-content']"
Error on line 1 of *module with no systemId*:
FODC0002: The file or directory file:/C:/Users/diego/Downloads/SaxonPE9-4-0-7J/page.html;unparsed=yes does not exist
Query processing failed: Run-time errors were reported

___________

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.html';'unparsed=yes'))//*:div[@id='ps-content']"
Error on line 1 column 39
XPST0003 XQuery syntax error near #...ion('page.html';'unparsed=yes'#:
expected ")", found ";"
Static error(s) in query

Gertone (Geert Bormans)

I should obviously not dump code snippets that I did not test

Here is how it works in XQuery

declare namespace saxon="http://saxon.sf.net/";

let $doc := saxon:parse-html(collection('file:///C:/Users/User/Desktop?select=test2.html;unparsed=yes'))

return $doc


I copied the amazon file as test2.html on my desktop

put the file uri with path, add a question mark
put select with the file mask (could be a single file as in this example)

I have not tested this on the commandline
(sitting in a hotel room on a thin wire)
but it should work
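
Editor's note: the recipe above (directory URI, a question mark, then `select=` with the file mask and the `unparsed=yes` flag) can be written out as a small string builder. This is only a sketch of how the URI is assembled, following the shape used in the XQuery above. It does not call Saxon, and the class and method names are hypothetical.

```java
// Hypothetical helper that assembles the collection() URI described above.
final class CollectionUriSketch {

    // directoryUri: file URI of the folder, without a query part
    // fileMask: file name or mask to select inside that folder
    static String build(String directoryUri, String fileMask) {
        return directoryUri + "?select=" + fileMask + ";unparsed=yes";
    }

    public static void main(String[] args) {
        // Values taken from the example in this post.
        System.out.println(build("file:///C:/Users/User/Desktop", "test2.html"));
    }
}
```

Note that the `;unparsed=yes` part belongs after `?select=...`; appending it directly to a bare file name is what produced the FODC0002 error in the earlier attempt.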

Gertone (Geert Bormans)

Ah, did it work?
Given I did not test, I am a bit curious

lucavilla

ASKER

To tell the truth, I did not test it, because my goal was to find a minimalist single-line command to perform XPath and XQuery extractions from (even malformed) HTML pages... and I don't know how to reduce your last solution to a single line :)

Gertone (Geert Bormans)

that is a full XQuery I used to test it with

this is the XPath that should work in your single line

saxon:parse-html(collection('file:///C:/Users/User/Desktop?select=test2.html;unparsed=yes'))

lucavilla

ASKER

The shortest formally accepted commandline is this:

java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query -qs:"saxon:parse-html(doc('test.xhtm'))//*:div[@id='ps-content']"


However it returns this empty result:

<?xml version="1.0" encoding="UTF-8"?>

mccarl

Wow, I would have thought at least a split of the points would be appropriate here!

lucavilla

ASKER

Ok right, in conclusion I found no solution yet...
I reposted the question here: https://www.experts-exchange.com/questions/28164398/how-to-extract-an-XPATH-from-a-malformed-HTML-page-with-Saxon-PE-commandline-second-try.html

;)

Gertone (Geert Bormans)

I have my saxon PE licenses hooked up in XML IDEs, so I must check if I can launch one from the command line. I had a chat with Saxonica people last week however and they claimed it should definitely work this way.

But I was wondering... you want to do all of this on the command line? One single command?
Does that mean you cannot have an XSLT or XQuery file next to it? You would just need to reference the XSLT file and could still call the actual process in one go. I would go for XSLT then, given it supports the unparsed-text function.
Note that you can still pass parameters to the XSLT, so the XSLT could be a stub file and the actual XPath could be passed as a parameter to the XSLT... it gives you all the dynamics you could possibly need

lucavilla

ASKER

Given the problem with XPath special characters on the command line, I decided to go the single command line + XSLT or XQuery file route.
Any idea about the most efficient solution?

SOLUTION

Gertone (Geert Bormans)


ASKER CERTIFIED SOLUTION

lucavilla


lucavilla

ASKER

solved