Solved

How to extract an XPath match from a malformed HTML page with the Saxon-PE command line

Posted on 2013-06-11
Last Modified: 2013-07-02
I would like to extract the element matched by the XPath //DIV[@id="ps-content"] from this web page: http://www.amazon.com/dp/1449319432 (saved as a local file).

I would like to do it with a single command line, using one of the best parsers, namely Saxon-PE.

So far, the shortest solution I (seem to) have found uses these two lines:

java -jar tagsoup-1.2.1.jar <page.html >page.xhtml
java -cp saxon9pe.jar net.sf.saxon.Query -s:"page.xhtml" -qs:"//*:div[@id='ps-content']"


The first line (TagSoup) is necessary to turn the original malformed HTML into well-formed XML. However, I read that Saxon-PE has built-in TagSoup support (see http://saxonica.com/documentation9.4-demo/html/extensions/functions/parse-html.html). How can I combine my two lines into a single one?
Question by:lucavilla
25 Comments
 
LVL 36

Expert Comment

by:mccarl
ID: 39240017
Ok, this is a bit of a stab in the dark (I don't have Saxon PE version to test) but I would think it would go along the lines of...
java -cp saxon9pe.jar net.sf.saxon.Query -qs:"saxon:parse-html(unparsed-text('page.html'))//*:div[@id='ps-content']"

 

Author Comment

by:lucavilla
ID: 39240438
Thanks mccarl, it seems the command is formally accepted!
But it gives this error:
"Error on line 1 of *module with no systemId*:
  Failed to load org.ccil.cowan.tagsoup.Parser
Query processing failed: Run-time errors were reported"

Note that TagSoup is an external component: I simply downloaded the file "tagsoup-1.2.1.jar" and copied it to the same folder as "saxon9pe.jar". Maybe I have to do something more...
How can I tell Java and/or Saxon to use that file?
 
LVL 36

Expert Comment

by:mccarl
ID: 39240773
Ah yes, sorry that should have been...
java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query -qs:"saxon:parse-html(unparsed-text('page.html'))//*:div[@id='ps-content']"


 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39240847
Maybe you need to put the jar on the classpath.
If it is in the same dir, just add it to the -cp value together with the Saxon-PE jar, separated by a semicolon:

java -cp tagsoup-1.2.1.jar;saxon9pe.jar net.sf.saxon.Query

Have you managed to run a simpler XPath with PE? I ask because I assumed you needed a reference to the license file too.
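
A side note on the classpath syntax: the separator between JAR files depends on the platform (`;` on Windows, `:` on Linux and macOS). Here is a minimal Java sketch of how a launcher could assemble the value portably, assuming both JARs sit in the current directory as in this thread. The names `SaxonClasspath`, `join` and `queryCommand` are made up for illustration and are not part of Saxon or TagSoup.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: not part of Saxon or TagSoup.
final class SaxonClasspath {

    // Joins JAR paths with the platform's classpath separator
    // (";" on Windows, ":" on Unix-like systems).
    static String join(String... jars) {
        return String.join(File.pathSeparator, jars);
    }

    // Builds the start of the argument list for net.sf.saxon.Query with the
    // given JARs on the classpath; Saxon options would be appended by the caller.
    static List<String> queryCommand(String... jars) {
        return Arrays.asList("java", "-cp", join(jars), "net.sf.saxon.Query");
    }

    public static void main(String[] args) {
        // Prints the command as a single line for the current platform.
        System.out.println(String.join(" ",
                queryCommand("saxon9pe.jar", "tagsoup-1.2.1.jar")));
    }
}
```

The list returned by `queryCommand` could be handed to `ProcessBuilder` once the Saxon options have been appended.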
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39240851
It is funny, by the way, that Saxon seems to bind the saxon prefix to the right namespace without requiring a namespace declaration.
 

Author Comment

by:lucavilla
ID: 39240926
Thanks guys, I think you brought me one step closer to victory!
This command line works:

java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query -qs:"saxon:parse-html('<div id="ps-content">abc</div>')//*:div[@id='ps-content']"

Now I only need to replace the HTML string with my file name. I tried various syntaxes at random without success. Any ideas?

PS: Gertone I already obtained and put in the same dir the file "saxon-license.lic" as you suggested  :)
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39240956
Can't you get the file using unparsed-text()?
You can't use doc() since the file is not well-formed.
saxon:parse-html(unparsed-text('file:///c:/path/file.html'))//xpath
 
LVL 36

Expert Comment

by:mccarl
ID: 39240992
The 'unparsed-text()' function that I gave in the previous posts should do what you want! What errors were you getting?
 

Author Comment

by:lucavilla
ID: 39241000
With   -qs:"saxon:parse-html(unparsed-text('test.htm'))//*:div[@id='ps-content']"  I get this error:

Error on line 1 column 17
  XPST0017 XQuery static error near #...l(unparsed-text('test.htm'))//#:
    System function unparsed-text#1 is not available with this host language/version
Static error(s) in query
 
LVL 36

Expert Comment

by:mccarl
ID: 39241089
Ahh, it appears that unparsed-text is only an XSLT function (not in XQuery 1.0). Try enabling XQuery 3.0 features (although I am finding mixed messages in the docs about whether Saxon-PE supports 3.0 or not)...
java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(fn:unparsed-text('page.html'))//*:div[@id='ps-content']"

(And the unparsed-text function may need to be qualified as above with the "fn:", try with it and without.)
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39242157
I am not sure which 9PE version you are using, but collection() recently gained some parameters.
I believe collection('page.html;unparsed=yes') works in both XSLT and XQuery.

Try:

java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.html;unparsed=yes'))//*:div[@id='ps-content']"
 

Author Comment

by:lucavilla
ID: 39249047
Still no luck:

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(fn:unparsed-text('page.html'))//*:div[@id='ps-content']"
Error on line 1 column 17
  XPST0017 XQuery static error near #...:unparsed-text('page.html'))//#:
    System function unparsed-text#1 is not available with this host language/version
Static error(s) in query

___________

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"fn:parse-html(fn:unparsed-text('page.html'))//*:div[@id='ps-content']"
Error on line 1 column 14
  XPST0017 XQuery static error near #...:unparsed-text('page.html'))//#:
    System function unparsed-text#1 is not available with this host language/version
Static error(s) in query

___________

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.html;unparsed=yes'))//*:div[@id='ps-content']"
Error on line 1 of *module with no systemId*:
  FODC0002: The file or directory
  file:/C:/Users/diego/Downloads/SaxonPE9-4-0-7J/page.html;unparsed=yes does not exist
Query processing failed: Run-time errors were reported

___________

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.html';'unparsed=yes'))//*:div[@id='ps-content']"
Error on line 1 column 39
  XPST0003 XQuery syntax error near #...ion('page.html';'unparsed=yes'#:
    expected ")", found ";"
Static error(s) in query
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39249127
I should obviously not dump code snippets that I did not test

Here is how it works in XQuery

declare namespace saxon="http://saxon.sf.net/";

let $doc := saxon:parse-html(collection('file:///C:/Users/User/Desktop?select=test2.html;unparsed=yes'))

return $doc


I copied the Amazon file as test2.html to my desktop.

Put the file URI with the path, add a question mark, then put select with the file mask (which can be a single file, as in this example).

I have not tested this on the command line (sitting in a hotel room on a thin wire), but it should work.
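
The URI recipe in this post (directory URI, then `?select=` with the file mask, then `;unparsed=yes`) can be sketched in Java as plain string assembly. This is only an illustration: it does not call Saxon, and whether a given Saxon release honours the `unparsed=yes` parameter is not verified here. `CollectionUri` and `forFile` are hypothetical names.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: only assembles the URI string, does not call Saxon.
final class CollectionUri {

    // Builds "<directory URI>?select=<file mask>;unparsed=yes".
    static String forFile(Path directory, String fileMask) {
        String base = directory.toAbsolutePath().toUri().toString();
        // Path.toUri() appends "/" when the directory exists; drop it so the
        // result has the same shape as the URI shown in the post.
        if (base.endsWith("/")) {
            base = base.substring(0, base.length() - 1);
        }
        return base + "?select=" + fileMask + ";unparsed=yes";
    }

    public static void main(String[] args) {
        // Uses the current working directory and the file name from the post.
        System.out.println(forFile(Paths.get(""), "test2.html"));
    }
}
```

The resulting string would go inside `collection('...')` in the query, as in the XQuery above.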
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39266192
Ah, did it work?
Given I did not test, I am a bit curious
 

Author Comment

by:lucavilla
ID: 39266225
To tell the truth, I did not test it, because my goal was to find a minimal single command line for XPath and XQuery extraction from (even malformed) HTML pages... and I don't know how to reduce your last solution to a single line  :)
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39266290
That is the full XQuery I used to test it with.

This is the XPath that should work in your single line:

saxon:parse-html(collection('file:///C:/Users/User/Desktop?select=test2.html;unparsed=yes'))
 

Author Comment

by:lucavilla
ID: 39266676
The shortest formally accepted commandline is this:

java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query -qs:"saxon:parse-html(doc('test.xhtm'))//*:div[@id='ps-content']"


However it returns this empty result:

<?xml version="1.0" encoding="UTF-8"?>
 
LVL 36

Expert Comment

by:mccarl
ID: 39267129
Wow, I would have thought at least a split of the points would be appropriate here!
 

Author Comment

by:lucavilla
ID: 39267195
Ok right, in conclusion I found no solution yet...
I reposted the question here: http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/XML/Q_28164398.html

;)
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39269052
I have my Saxon-PE licenses hooked up in XML IDEs, so I must check whether I can launch one from the command line. I had a chat with the Saxonica people last week, however, and they claimed it should definitely work this way.

But I was wondering... do you want to do all of this on the command line, in one single command?
Does that mean you cannot have an XSLT or XQuery file next to it? You would just need to reference the XSLT file and could still call the actual process in one go. I would go for XSLT then, given that it supports the unparsed-text() function.
Note that you can still pass parameters to the XSLT, so the XSLT could be a stub file and the actual XPath could be passed to it as a parameter... that gives you all the flexibility you could possibly need.
 

Author Comment

by:lucavilla
ID: 39269124
Given the problem with XPath special characters on the command line, I decided to go the route of a single command line plus an XSLT or XQuery file.
Any idea which is the most efficient solution?
 
LVL 60

Assisted Solution

by:Geert Bormans
Geert Bormans earned 400 total points
ID: 39269126
Saxon uses the same underlying parser for both, so I don't think it matters.

For just getting a single value, XQuery sounds more natural to me.
(Note that you had better make a full file URI from the file reference.)

I think this is the XQuery you would need:

declare namespace saxon="http://saxon.sf.net/";

let $doc := saxon:parse-html(collection('file:///C:/Users/User/Desktop?select=test2.html;unparsed=yes'))//*:div[@id='ps-content']

return $doc 

 

Accepted Solution

by:lucavilla
lucavilla earned 0 total points
ID: 39282618
Solution:

____file query.xq____
declare default element namespace 'http://www.w3.org/1999/xhtml';
doc('page.html')//div[@id='ps-content']
______________________

Command-line using Nailgun:
ng.exe --nailgun-port 2114 net.sf.saxon.Query -x:"org.ccil.cowan.tagsoup.Parser" query.xq
 

Author Closing Comment

by:lucavilla
ID: 39292568
solved