
How to extract nodes by XPath from a malformed HTML page with the Saxon-PE command line

I would like to extract the nodes matching the XPath //DIV[@id="ps-content"] from this web page: http://www.amazon.com/dp/1449319432 (saved as a local file).

I would like to do it with a single command line, using one of the best parsers available, Saxon-PE.

So far the shortest solution that I (seem to have) found takes these two lines:

java -jar tagsoup-1.2.1.jar <page.html >page.xhtml
java -cp saxon9pe.jar net.sf.saxon.Query -s:"page.xhtml" -qs:"//*:div[@id='ps-content']"



The first line (TagSoup) is needed to turn the original malformed HTML into well-formed XML. However, I read that Saxon-PE has TagSoup capability built in (see http://saxonica.com/documentation9.4-demo/html/extensions/functions/parse-html.html). How can I merge my two lines into a single line?
Asked by: lucavilla
2 Solutions
 
mccarl (IT Business Systems Analyst / Software Developer) Commented:
OK, this is a bit of a stab in the dark (I don't have Saxon-PE to test with), but I would think it would go along the lines of...
java -cp saxon9pe.jar net.sf.saxon.Query -qs:"saxon:parse-html(unparsed-text('page.html'))//*:div[@id='ps-content']"

 
lucavilla (Author) Commented:
Thanks mccarl, it seems that Saxon formally accepts it!
But it gives this error:
"Error on line 1 of *module with no systemId*:
  Failed to load org.ccil.cowan.tagsoup.Parser
Query processing failed: Run-time errors were reported"

Note that TagSoup is an external component. I simply downloaded the file "tagsoup-1.2.1.jar" and copied it to the same folder as "saxon9pe.jar". Maybe I have to do something more...
How can I tell Java and/or Saxon to use that file?
 
mccarl (IT Business Systems Analyst / Software Developer) Commented:
Ah yes, sorry, that should have been...
java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query -qs:"saxon:parse-html(unparsed-text('page.html'))//*:div[@id='ps-content']"


 
Geert Bormans Commented:
Maybe you need to put the JAR on the classpath.
If it is in the same directory, just list it after -cp together with the Saxon-PE JAR:

java -cp tagsoup-1.2.1.jar;saxon9pe.jar net.sf.saxon.Query

have you managed to run a simpler XPath over PE? because I assumed you needed a reference to the license file too
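
A side note on the classpath advice: the separator between JAR files is platform dependent (";" on Windows, ":" on Unix-like systems), which is easy to get wrong when copying a command between machines. The following is only a minimal sketch of building the -cp value portably. The class and method names are hypothetical; only the JAR names and net.sf.saxon.Query come from this thread.

```java
import java.io.File;

class ClasspathDemo {
    // Joins JAR names with the separator of the platform the JVM runs on
    // (";" on Windows, ":" on Unix-like systems).
    static String joinClasspath(String... jars) {
        return String.join(File.pathSeparator, jars);
    }

    public static void main(String[] args) {
        String cp = joinClasspath("saxon9pe.jar", "tagsoup-1.2.1.jar");
        // Prints the start of the command line discussed in this thread.
        System.out.println("java -cp " + cp + " net.sf.saxon.Query");
    }
}
```

The order of the two JARs should not matter here, since they do not contain overlapping classes.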
 
Geert Bormans Commented:
It is funny, by the way, that Saxon seems to bind the saxon prefix to the right namespace without requiring a namespace declaration.
 
lucavilla (Author) Commented:
Thanks guys, I think you brought me one step closer to victory!
This command line works:

java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query -qs:"saxon:parse-html('<div id="ps-content">abc</div>')//*:div[@id='ps-content']"

Now I only need to replace the HTML string with my file name. I tried various syntaxes without success. Any ideas?

PS: Gertone, I have already obtained the file "saxon-license.lic" and put it in the same directory, as you suggested :)
 
Geert Bormans Commented:
Can't you get the file using unparsed-text()?
You can't use doc() since the file is not well-formed.
saxon:parse-html(unparsed-text('file:///c:/path/file.html'))//xpath
 
mccarl (IT Business Systems Analyst / Software Developer) Commented:
The 'unparsed-text()' function that I gave in the previous posts should do what you want! What errors were you getting?
 
lucavilla (Author) Commented:
With -qs:"saxon:parse-html(unparsed-text('test.htm'))//*:div[@id='ps-content']" I get this error:

Error on line 1 column 17
  XPST0017 XQuery static error near #...l(unparsed-text('test.htm'))//#:
    System function unparsed-text#1 is not available with this host language/version
Static error(s) in query
 
mccarl (IT Business Systems Analyst / Software Developer) Commented:
Ahh, it appears that unparsed-text is only an XSLT function (not in XQuery 1.0). Try enabling XQuery 3.0 features (although I am finding mixed messages in the docs about whether Saxon-PE supports 3.0 or not)...
java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(fn:unparsed-text('page.html'))//*:div[@id='ps-content']"

(The unparsed-text function may need to be qualified with the "fn:" prefix as above; try it both with and without.)
 
Geert Bormans Commented:
I am not sure which 9PE you are using, but collection() recently got some new options.
I believe collection('page.html;unparsed=yes') works in both XSLT and XQuery.

Try:

java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.html;unparsed=yes'))//*:div[@id='ps-content']"
 
lucavilla (Author) Commented:
Still no luck:

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(fn:unparsed-text('page.html'))//*:div[@id='ps-content']"
Error on line 1 column 17
  XPST0017 XQuery static error near #...:unparsed-text('page.html'))//#:
    System function unparsed-text#1 is not available with this host language/version
Static error(s) in query

___________

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"fn:parse-html(fn:unparsed-text('page.html'))//*:div[@id='ps-content']"
Error on line 1 column 14
  XPST0017 XQuery static error near #...:unparsed-text('page.html'))//#:
    System function unparsed-text#1 is not available with this host language/version
Static error(s) in query

___________

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.html;unparsed=yes'))//*:div[@id='ps-content']"
Error on line 1 of *module with no systemId*:
  FODC0002: The file or directory
  file:/C:/Users/diego/Downloads/SaxonPE9-4-0-7J/page.html;unparsed=yes does not exist
Query processing failed: Run-time errors were reported

___________

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.html';'unparsed=yes'))//*:div[@id='ps-content']"
Error on line 1 column 39
  XPST0003 XQuery syntax error near #...ion('page.html';'unparsed=yes'#:
    expected ")", found ";"
Static error(s) in query
 
Geert Bormans Commented:
I should obviously not post code snippets that I have not tested.

Here is how it works in XQuery:

declare namespace saxon="http://saxon.sf.net/";

let $doc := saxon:parse-html(collection('file:///C:/Users/User/Desktop?select=test2.html;unparsed=yes'))

return $doc



I saved the Amazon page as test2.html on my desktop.

Give the directory as a file URI with its full path, then add a question mark,
then add select with the file mask (which can be a single file, as in this example).

I have not tested this on the command line
(sitting in a hotel room on a thin wire)
but it should work
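
The recipe above (directory as a file URI, a question mark, then select with the file mask, followed by ;unparsed=yes) can be sketched in Java as follows. This is only an illustration of how the string is assembled: the class and method names are hypothetical, the trailing slash is stripped purely to match the shape of the example in this thread, and the file name is assumed to contain no characters that need escaping in a URI query.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class CollectionUriDemo {
    // Builds "<directory file URI>?select=<file name>;unparsed=yes"
    // for the given HTML file (assumed to have a parent directory).
    static String unparsedCollectionUri(Path htmlFile) {
        Path absolute = htmlFile.toAbsolutePath();
        String dirUri = absolute.getParent().toUri().toString();
        // Path.toUri() adds a trailing slash for an existing directory;
        // drop it so the result has the same shape as the thread's example.
        if (dirUri.endsWith("/")) {
            dirUri = dirUri.substring(0, dirUri.length() - 1);
        }
        return dirUri + "?select=" + absolute.getFileName() + ";unparsed=yes";
    }

    public static void main(String[] args) {
        String uri = unparsedCollectionUri(Paths.get("test2.html"));
        // Prints the expression to paste into the query.
        System.out.println("saxon:parse-html(collection('" + uri + "'))");
    }
}
```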
 
Geert Bormans Commented:
Ah, did it work?
Given that I did not test it, I am a bit curious.
 
lucavilla (Author) Commented:
To tell the truth, I did not test it, because my goal was to find a minimalist single-line command to run XPath and XQuery extractions on (even malformed) HTML pages... and I don't know how to reduce your last solution to a single line :)
 
Geert Bormans Commented:
That is the full XQuery I used to test it with.

This is the XPath that should work in your single line:

saxon:parse-html(collection('file:///C:/Users/User/Desktop?select=test2.html;unparsed=yes'))
 
lucavilla (Author) Commented:
The shortest command line that is formally accepted is this:

java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query -qs:"saxon:parse-html(doc('test.xhtm'))//*:div[@id='ps-content']"



However, it returns this empty result:

<?xml version="1.0" encoding="UTF-8"?>
 
mccarl (IT Business Systems Analyst / Software Developer) Commented:
Wow, I would have thought at least a split of the points would be appropriate here!
 
lucavilla (Author) Commented:
OK, right. In conclusion, I have found no solution yet...
I reposted the question here: http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/XML/Q_28164398.html

;)
 
Geert Bormans Commented:
I have my Saxon-PE licenses hooked up in XML IDEs, so I must check whether I can launch it from the command line. I had a chat with the Saxonica people last week, however, and they claimed it should definitely work this way.

But I was wondering... do you want to do all of this on the command line, in one single command?
Does that mean you cannot have an XSLT or XQuery file next to it? You would then just need to reference the XSLT file and could still run the whole process in one go. I would go for XSLT, given that it supports the unparsed-text() function.
Note that you can still pass parameters to the XSLT, so the XSLT could be a stub file and the actual XPath could be passed to it as a parameter... it gives you all the flexibility you could possibly need.
 
lucavilla (Author) Commented:
Given the problem with XPath special characters on the command line, I decided to go the route of a single command line plus an XSLT or XQuery file.
Any idea which is the most efficient solution?
 
Geert Bormans Commented:
Saxon uses the same underlying parser for both.
I don't think it matters.

If you are just getting a single value, then XQuery sounds more natural to me.
(Note that you had better make a full file URI from the file reference.)

I think this is the XQuery you would need:

declare namespace saxon="http://saxon.sf.net/";

let $doc := saxon:parse-html(collection('file:///C:/Users/User/Desktop?select=test2.html;unparsed=yes'))//*:div[@id='ps-content']

return $doc 

 
lucavilla (Author) Commented:
Solution:

____file query.xq____
declare default element namespace 'http://www.w3.org/1999/xhtml';
doc('page.html')//div[@id='ps-content']
______________________

Command-line using Nailgun:
ng.exe --nailgun-port 2114 net.sf.saxon.Query -x:"org.ccil.cowan.tagsoup.Parser" query.xq
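
A note on why query.xq declares the XHTML default element namespace: TagSoup by default emits its elements in the namespace http://www.w3.org/1999/xhtml, so an unqualified //div matches nothing unless the namespace is declared (or a wildcard such as *:div is used, as in the earlier posts). The sketch below illustrates the same effect with the XPath 1.0 engine built into the JDK, not with Saxon; local-name() stands in for the *:div wildcard, and the class name and sample XHTML string are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XhtmlNamespaceDemo {
    // Hypothetical sample of well-formed XHTML, as a tidying step might produce.
    static final String XHTML =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<div id=\"ps-content\">abc</div></body></html>";

    // Parses the XHTML namespace-aware and evaluates an XPath 1.0 expression.
    static NodeList select(String xhtml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Unqualified name: looks for div in no namespace, so nothing matches.
        System.out.println(select(XHTML, "//div[@id='ps-content']").getLength()); // prints 0
        // Namespace-agnostic test on the local name: matches the XHTML div.
        System.out.println(
            select(XHTML, "//*[local-name()='div'][@id='ps-content']").getLength()); // prints 1
    }
}
```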
 
lucavilla (Author) Commented:
solved