Solved

How to extract an XPath from a malformed HTML page with the Saxon-PE command line

Posted on 2013-06-11
25
705 Views
Last Modified: 2013-07-02
I would like to extract the element matching the XPath //DIV[@id="ps-content"] from this web page: http://www.amazon.com/dp/1449319432 (saved as a local file).

I would like to do it with a single command line, using one of the best processors available, Saxon-PE.

So far, the shortest solution I seem to have found uses these two lines:

java -jar tagsoup-1.2.1.jar <page.html >page.xhtml
java -cp saxon9pe.jar net.sf.saxon.Query -s:"page.xhtml" -qs:"//*:div[@id='ps-content']"


The first line (TagSoup) is needed to turn the original malformed HTML into well-formed XML. However, I read that Saxon-PE has TagSoup support built in (see http://saxonica.com/documentation9.4-demo/html/extensions/functions/parse-html.html). How can I combine my two lines into a single line?
Question by:lucavilla
25 Comments
 
LVL 35

Expert Comment

by:mccarl
ID: 39240017
OK, this is a bit of a stab in the dark (I don't have Saxon-PE to test with), but I would think it would go along the lines of...
java -cp saxon9pe.jar net.sf.saxon.Query -qs:"saxon:parse-html(unparsed-text('page.html'))//*:div[@id='ps-content']"

 

Author Comment

by:lucavilla
ID: 39240438
Thanks mccarl, it seems to accept the syntax,
but it gives this error:
"Error on line 1 of *module with no systemId*:
  Failed to load org.ccil.cowan.tagsoup.Parser
Query processing failed: Run-time errors were reported"

Note that TagSoup is an external component: I simply downloaded the file "tagsoup-1.2.1.jar" and copied it to the same folder as "saxon9pe.jar". Maybe I have to do something more...
How can I tell Java and/or Saxon to use that file?
 
LVL 35

Expert Comment

by:mccarl
ID: 39240773
Ah yes, sorry that should have been...
java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query -qs:"saxon:parse-html(unparsed-text('page.html'))//*:div[@id='ps-content']"


 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39240847
Maybe you need to put the jar on the classpath.
If it is in the same directory, just list it after -cp together with the Saxon-PE jar, separated by a semicolon:

java -cp tagsoup-1.2.1.jar;saxon9pe.jar net.sf.saxon.Query

Have you managed to run a simpler XPath with PE? I ask because I assumed you needed a reference to the license file too.
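
A side note on the separator: the commands in this thread use ";" between classpath entries, which is the Windows convention; Linux and macOS use ":" instead. The Java sketch below (class and method names are made up for illustration) joins the two jar names with whatever separator the current platform expects and prints the start of the resulting command. It does not run Saxon itself.

```java
import java.io.File;

class ClasspathSketch {
    // Joins jar names with the platform's classpath separator:
    // ";" on Windows, ":" on Linux and macOS.
    public static String joinClasspath(String... jars) {
        return String.join(File.pathSeparator, jars);
    }

    public static void main(String[] args) {
        String cp = joinClasspath("saxon9pe.jar", "tagsoup-1.2.1.jar");
        // Only prints the beginning of the command; the query options are omitted.
        System.out.println("java -cp " + cp + " net.sf.saxon.Query");
    }
}
```

A space between the two jar names, by contrast, makes java treat the second jar name as the main class to run.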
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39240851
It is funny, by the way, that Saxon seems to assume the saxon prefix is bound to the right namespace without requiring a namespace declaration.
 

Author Comment

by:lucavilla
ID: 39240926
Thanks guys, I think you brought me one step closer to victory!
This command line works:

java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query -qs:"saxon:parse-html('<div id="ps-content">abc</div>')//*:div[@id='ps-content']"

Now I only need to replace the HTML string with my file name. I tried various syntaxes without success. Any ideas?

PS: Gertone, I already obtained the file "saxon-license.lic" and put it in the same directory, as you suggested :)
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39240956
Can't you get the file using unparsed-text()?
You can't use doc() since the file is not well-formed:
saxon:parse-html(unparsed-text('file:///c:/path/file.html'))//xpath
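
To illustrate why doc() is ruled out, here is a small Java sketch that uses the JDK's built-in XML parser rather than Saxon; the class and method names are made up for illustration. It only shows that a strict XML parser rejects typical tag soup, which is the kind of failure doc() runs into on a malformed page.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // Returns true if the JDK's XML parser accepts the text as well-formed XML.
    public static boolean isWellFormed(String text) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Well-formed: quoted attribute, every tag closed.
        System.out.println(isWellFormed("<div id=\"ps-content\">abc</div>"));
        // Tag soup: unquoted attribute, unclosed <br> and <div>.
        System.out.println(isWellFormed("<div id=ps-content><br>abc"));
    }
}
```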
 
LVL 35

Expert Comment

by:mccarl
ID: 39240992
The 'unparsed-text()' function that I gave in the previous posts should do what you want! What errors were you getting?
 

Author Comment

by:lucavilla
ID: 39241000
With   -qs:"saxon:parse-html(unparsed-text('test.htm'))//*:div[@id='ps-content']"  I get this error:

Error on line 1 column 17
  XPST0017 XQuery static error near #...l(unparsed-text('test.htm'))//#:
    System function unparsed-text#1 is not available with this host language/version
Static error(s) in query
 
LVL 35

Expert Comment

by:mccarl
ID: 39241089
Ahh, it appears that unparsed-text is only an XSLT function (not in XQuery 1.0). Try enabling XQuery 3.0 features (although I am finding mixed messages in the docs about whether Saxon-PE supports 3.0 or not)...
java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(fn:unparsed-text('page.html'))//*:div[@id='ps-content']"

(And the unparsed-text function may need to be qualified as above with the "fn:", try with it and without.)
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39242157
I am not sure which 9 PE release you are using,
but collection() recently gained some parameters.
I believe collection('page.html;unparsed=yes')
works in both XSLT and XQuery.

try:

java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.html;unparsed=yes'))//*:div[@id='ps-content']"
 

Author Comment

by:lucavilla
ID: 39249047
Still no luck:

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(fn:unparsed-text('page.html'))//*:div[@id='ps-content']"
Error on line 1 column 17
  XPST0017 XQuery static error near #...:unparsed-text('page.html'))//#:
    System function unparsed-text#1 is not available with this host language/version
Static error(s) in query

___________

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"fn:parse-html(fn:unparsed-text('page.html'))//*:div[@id='ps-content']"
Error on line 1 column 14
  XPST0017 XQuery static error near #...:unparsed-text('page.html'))//#:
    System function unparsed-text#1 is not available with this host language/version
Static error(s) in query

___________

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.html;unparsed=yes'))//*:div[@id='ps-content']"
Error on line 1 of *module with no systemId*:
  FODC0002: The file or directory
  file:/C:/Users/diego/Downloads/SaxonPE9-4-0-7J/page.html;unparsed=yes does not exist
Query processing failed: Run-time errors were reported

___________

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.html';'unparsed=yes'))//*:div[@id='ps-content']"
Error on line 1 column 39
  XPST0003 XQuery syntax error near #...ion('page.html';'unparsed=yes'#:
    expected ")", found ";"
Static error(s) in query
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39249127
I should obviously not dump code snippets that I did not test

Here is how it works in XQuery

declare namespace saxon="http://saxon.sf.net/";

let $doc := saxon:parse-html(collection('file:///C:/Users/User/Desktop?select=test2.html;unparsed=yes'))

return $doc


I copied the Amazon file to my desktop as test2.html.

Put the file URI of the directory, then add a question mark.
Then put select= with the file mask (which can be a single file, as in this example).

I have not tested this on the command line
(sitting in a hotel room on a thin wire),
but it should work.
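
The steps above amount to simple string assembly. The Java sketch below (hypothetical class and method names) composes such a collection URI from a directory URI and a file mask, following the pattern shown in the query; it only builds the string and does not call Saxon, so whether Saxon accepts the result on the command line is still untested here.

```java
class CollectionUriSketch {
    // Builds "<directory URI>?select=<file mask>;unparsed=yes",
    // the pattern used in the collection() call above.
    public static String unparsedCollectionUri(String directoryUri, String fileMask) {
        return directoryUri + "?select=" + fileMask + ";unparsed=yes";
    }

    public static void main(String[] args) {
        String uri = unparsedCollectionUri("file:///C:/Users/User/Desktop", "test2.html");
        // Prints the expression that would go into the query.
        System.out.println("saxon:parse-html(collection('" + uri + "'))");
    }
}
```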
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39266192
Ah, did it work?
Given I did not test, I am a bit curious
 

Author Comment

by:lucavilla
ID: 39266225
To tell the truth, I did not test it, because my goal was to find a minimal single command line for running XPath and XQuery extractions on (even malformed) HTML pages... and I don't know how to reduce your last solution to a single line :)
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39266290
That is the full XQuery I used to test it with.

This is the XPath that should work in your single line:

saxon:parse-html(collection('file:///C:/Users/User/Desktop?select=test2.html;unparsed=yes'))
 

Author Comment

by:lucavilla
ID: 39266676
The shortest command line that is accepted syntactically is this:

java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query -qs:"saxon:parse-html(doc('test.xhtm'))//*:div[@id='ps-content']"


However, it returns this empty result:

<?xml version="1.0" encoding="UTF-8"?>
 
LVL 35

Expert Comment

by:mccarl
ID: 39267129
Wow, I would have thought at least a split of the points would be appropriate here!
 

Author Comment

by:lucavilla
ID: 39267195
Ok right, in conclusion I found no solution yet...
I reposted the question here: http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/XML/Q_28164398.html

;)
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39269052
I have my Saxon-PE licenses hooked up in XML IDEs, so I must check whether I can launch one from the command line. I had a chat with the Saxonica people last week, however, and they claimed it should definitely work this way.

But I was wondering... you want to do all of this on the command line, in one single command?
Does that mean you cannot have an XSLT or XQuery file next to it? You would just need to reference that file and could still run the whole process in one go. I would go for XSLT then, given that it supports the unparsed-text() function.
Note that you can still pass parameters to the XSLT, so the XSLT could be a stub file and the actual XPath could be passed to it as a parameter... that gives you all the flexibility you could possibly need.
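
As a rough illustration of the stub-plus-parameter idea, here is a Java sketch using the JDK's built-in XSLT 1.0 processor instead of Saxon. It is a narrower variant: only the id value is passed as a parameter, not a whole XPath, because evaluating an XPath string at run time needs a processor-specific extension that is not shown here. The stylesheet, class name and parameter name are made up for illustration, and the input must already be well-formed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class StubStylesheetSketch {
    // Stub stylesheet: outputs the text of the div whose id equals the "id" parameter.
    static final String STUB_XSLT =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:param name=\"id\"/>"
        + "<xsl:template match=\"/\">"
        + "<xsl:value-of select=\"//*[local-name()='div'][@id=$id]\"/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    public static String extract(String xml, String id) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STUB_XSLT)));
            transformer.setParameter("id", id); // value supplied from outside the stylesheet
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<div id=\"ps-content\">abc</div></body></html>";
        System.out.println(extract(xml, "ps-content"));
    }
}
```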
 

Author Comment

by:lucavilla
ID: 39269124
Given the problem with XPath special characters on the command line, I decided to go the route of a single command line plus an XSLT or XQuery file.
Any idea which is the most efficient solution?
 
LVL 60

Assisted Solution

by:Geert Bormans
Geert Bormans earned 100 total points
ID: 39269126
Saxon uses the same underlying parser for both.
I don't think it matters.

For just getting a single value, XQuery sounds more natural to me.
(Note that you had better turn the file reference into a full file URI.)

I think this is the XQuery you would need:

declare namespace saxon="http://saxon.sf.net/";

let $doc := saxon:parse-html(collection('file:///C:/Users/User/Desktop?select=test2.html;unparsed=yes'))//*:div[@id='ps-content']

return $doc 

 

Accepted Solution

by:
lucavilla earned 0 total points
ID: 39282618
Solution:

____file query.xq____
declare default element namespace 'http://www.w3.org/1999/xhtml';
doc('page.html')//div[@id='ps-content']
______________________

Command-line using Nailgun:
ng.exe --nailgun-port 2114 net.sf.saxon.Query -x:"org.ccil.cowan.tagsoup.Parser" query.xq
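
A note on the namespace declaration in query.xq: as far as I know, TagSoup reports the elements it produces in the XHTML namespace, so an unqualified //div would not match them. The Java sketch below illustrates the effect with the JDK's built-in XPath 1.0 engine instead of Saxon, so it uses local-name() where the earlier posts use *:div. The class name and the sample document are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NamespaceMatchSketch {
    // A tiny well-formed document whose elements are in the XHTML namespace.
    static final String XHTML =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<div id=\"ps-content\">abc</div></body></html>";

    // Evaluates the XPath against the sample document and returns the string value
    // of the first match, or an empty string when nothing matches.
    public static String evaluate(String xpath) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // keep the XHTML namespace on the elements
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XHTML)));
            return XPathFactory.newInstance().newXPath().evaluate(xpath, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unqualified name: looks for a div in no namespace.
        System.out.println("[" + evaluate("//div[@id='ps-content']") + "]");
        // Namespace-agnostic test on the local name.
        System.out.println("[" + evaluate("//*[local-name()='div'][@id='ps-content']") + "]");
    }
}
```

Declaring the XHTML namespace as the default element namespace, as query.xq does, is the XQuery way to make the plain //div form match.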
 

Author Closing Comment

by:lucavilla
ID: 39292568
solved