Solved

How to extract an XPath from a malformed HTML page with the Saxon-PE command line

Posted on 2013-06-11
Last Modified: 2013-07-02
I would like to extract the result of the XPath //DIV[@id="ps-content"] from this web page: http://www.amazon.com/dp/1449319432 (saved as a local file).

I would like to do it with a single command line, using one of the best parsers, Saxon-PE.

So far the shortest solution that I (seem to) have found uses these two lines:

java -jar tagsoup-1.2.1.jar <page.html >page.xhtml
java -cp saxon9pe.jar net.sf.saxon.Query -s:"page.xhtml" -qs:"//*:div[@id='ps-content']"



The first line (TagSoup) is necessary to convert the original malformed HTML into well-formed XML. However, I read that Saxon-PE has embedded TagSoup support (see http://saxonica.com/documentation9.4-demo/html/extensions/functions/parse-html.html). How can I combine my two lines into a single line?
Question by:lucavilla
25 Comments
 
LVL 35

Expert Comment

by:mccarl
ID: 39240017
Ok, this is a bit of a stab in the dark (I don't have Saxon PE version to test) but I would think it would go along the lines of...
java -cp saxon9pe.jar net.sf.saxon.Query -qs:"saxon:parse-html(unparsed-text('page.html'))//*:div[@id='ps-content']"

 

Author Comment

by:lucavilla
ID: 39240438
Thanks mccarl, it seems that it formally accepts it!
But it gives this error:
"Error on line 1 of *module with no systemId*:
  Failed to load org.ccil.cowan.tagsoup.Parser
Query processing failed: Run-time errors were reported"

Note that TagSoup is an external component. I simply downloaded the file "tagsoup-1.2.1.jar" and copied it to the same folder as "saxon9pe.jar". Maybe I have to do something more...
How can I tell Java and/or Saxon to use that file?
 
LVL 35

Expert Comment

by:mccarl
ID: 39240773
Ah yes, sorry that should have been...
java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query -qs:"saxon:parse-html(unparsed-text('page.html'))//*:div[@id='ps-content']"

 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39240847
maybe you need to put the jar on the classpath
if it is in the same dir, just put it after the CP together with the saxon pe

java -cp tagsoup-1.2.1.jar;saxon9pe.jar net.sf.saxon.Query

have you managed to run a simpler XPath over PE? because I assumed you needed a reference to the license file too
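
A side note on the -cp argument: the entries are joined with ";" on Windows and ":" on Linux/macOS, which is easy to get wrong when copying commands between systems. Here is a minimal Java sketch that builds the classpath string with the platform's own separator. The class and method names are hypothetical and only for illustration, and both jars are assumed to sit in the current directory.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

public class ClasspathSketch {
    // Joins jar names with the platform's classpath separator
    // (";" on Windows, ":" on Linux/macOS).
    static String joinClasspath(List<String> jars) {
        return String.join(File.pathSeparator, jars);
    }

    public static void main(String[] args) {
        // Jar names as used in this thread (assumed to be in the current directory).
        String cp = joinClasspath(Arrays.asList("saxon9pe.jar", "tagsoup-1.2.1.jar"));
        System.out.println("-cp " + cp);
    }
}
```

On Windows this should print the same "saxon9pe.jar;tagsoup-1.2.1.jar" form used in the commands above.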
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39240851
it is funny by the way that saxon seems to assume the saxon prefix for the right namespace without requesting a namespace binding
 

Author Comment

by:lucavilla
ID: 39240926
Thanks guys, I think you brought me one step closer to victory!
This command-line works:

java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query -qs:"saxon:parse-html('<div id="ps-content">abc</div>')//*:div[@id='ps-content']"

Now I only need to replace the HTML string with my file name.  I tried various syntaxes without success.  Any ideas?

PS: Gertone I already obtained and put in the same dir the file "saxon-license.lic" as you suggested  :)
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39240956
Can't you get the file using unparsed-text()?
You can't use doc() since the file is not well-formed.
saxon:parse-html(unparsed-text('file:///c:/path/file.html'))//xpath
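
If it is unclear what the full file: URI for a local file should look like, a small Java sketch can print it. This is only an illustration: the class and method names are hypothetical, "page.html" is assumed to be in the current directory, and whether unparsed-text() is available at all depends on the host language and version in use.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class FileUriSketch {
    // Converts a (possibly relative) local path into an absolute file: URI string.
    // Characters such as spaces are percent-encoded by Path.toUri().
    static String toFileUri(String localPath) {
        Path absolute = Paths.get(localPath).toAbsolutePath().normalize();
        return absolute.toUri().toString();
    }

    public static void main(String[] args) {
        String uri = toFileUri("page.html");
        // Prints an expression of the same shape as the one above, with the real URI filled in.
        System.out.println("saxon:parse-html(unparsed-text('" + uri + "'))");
    }
}
```

The printed URI depends on the directory the program is run from, so the result has to be copied into the query by hand.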
 
LVL 35

Expert Comment

by:mccarl
ID: 39240992
The 'unparsed-text()' function that I gave in the previous posts should do what you want! What errors were you getting?
 

Author Comment

by:lucavilla
ID: 39241000
With   -qs:"saxon:parse-html(unparsed-text('test.htm'))//*:div[@id='ps-content']"  I get this error:

Error on line 1 column 17
  XPST0017 XQuery static error near #...l(unparsed-text('test.htm'))//#:
    System function unparsed-text#1 is not available with this host language/version
Static error(s) in query
 
LVL 35

Expert Comment

by:mccarl
ID: 39241089
Ahh, it appears that unparsed-text is only an XSLT function (not in XQuery 1.0). Try enabling XQuery 3.0 features (although I am finding mixed messages in the docs about whether Saxon-PE supports 3.0 or not)...
java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(fn:unparsed-text('page.html'))//*:div[@id='ps-content']"


(And the unparsed-text function may need to be qualified as above with the "fn:", try with it and without.)
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39242157
I am not sure which 9PE you are using
but collection() recently got some properties
I believe collection('page.html;unparsed=yes')
works in both XSLT and XQuery

try:

java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.html;unparsed=yes'))//*:div[@id='ps-content']"
 

Author Comment

by:lucavilla
ID: 39249047
Still no luck:

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(fn:unparsed-text('page.html'))//*:div[@id='ps-content']"
Error on line 1 column 17
  XPST0017 XQuery static error near #...:unparsed-text('page.html'))//#:
    System function unparsed-text#1 is not available with this host language/version
Static error(s) in query

___________

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"fn:parse-html(fn:unparsed-text('page.html'))//*:div[@id='ps-content']"
Error on line 1 column 14
  XPST0017 XQuery static error near #...:unparsed-text('page.html'))//#:
    System function unparsed-text#1 is not available with this host language/version
Static error(s) in query

___________

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.html;unparsed=yes'))//*:div[@id='ps-content']"
Error on line 1 of *module with no systemId*:
  FODC0002: The file or directory
  file:/C:/Users/diego/Downloads/SaxonPE9-4-0-7J/page.html;unparsed=yes does not exist
Query processing failed: Run-time errors were reported

___________

...\SaxonPE9-4-0-7J>java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query --xqueryVersion:3.0 -qs:"saxon:parse-html(collection('page.html';'unparsed=yes'))//*:div[@id='ps-content']"
Error on line 1 column 39
  XPST0003 XQuery syntax error near #...ion('page.html';'unparsed=yes'#:
    expected ")", found ";"
Static error(s) in query
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39249127
I should obviously not dump code snippets that I did not test

Here is how it works in XQuery

declare namespace saxon="http://saxon.sf.net/";

let $doc := saxon:parse-html(collection('file:///C:/Users/User/Desktop?select=test2.html;unparsed=yes'))

return $doc



I copied the Amazon file as test2.html to my desktop.

Put the file URI of the directory, then add a question mark.
Then put select= with the file mask (it can be a single file name, as in this example).

I have not tested this on the command line
(sitting in a hotel room on a thin wire),
but it should work.
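
To make the structure of that collection URI explicit (directory URI, then "?select=" with the file mask, then ";unparsed=yes"), here is a small Java sketch that assembles it from its parts. The class and method names are hypothetical, the directory and file name are the ones from the query above, and the sketch only builds the string; it does not confirm how any particular Saxon version treats the parameters.

```java
public class CollectionUriSketch {
    // Builds "<directoryUri>?select=<fileMask>;unparsed=yes".
    // A trailing slash on the directory URI is dropped so the result matches the form shown above.
    static String collectionUri(String directoryUri, String fileMask) {
        String base = directoryUri.endsWith("/")
                ? directoryUri.substring(0, directoryUri.length() - 1)
                : directoryUri;
        return base + "?select=" + fileMask + ";unparsed=yes";
    }

    public static void main(String[] args) {
        String uri = collectionUri("file:///C:/Users/User/Desktop", "test2.html");
        // prints: saxon:parse-html(collection('file:///C:/Users/User/Desktop?select=test2.html;unparsed=yes'))
        System.out.println("saxon:parse-html(collection('" + uri + "'))");
    }
}
```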
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39266192
Ah, did it work?
Given I did not test, I am a bit curious
 

Author Comment

by:lucavilla
ID: 39266225
To tell the truth, I did not test it, because my goal was to find a minimal single-line command to perform XPath and XQuery extractions from (even malformed) HTML pages...   and I don't know how to reduce your last solution to a single line  :)
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39266290
that is a full XQuery I used to test it with

this is the XPath that should work in your single line

saxon:parse-html(collection('file:///C:/Users/User/Desktop?select=test2.html;unparsed=yes'))
 

Author Comment

by:lucavilla
ID: 39266676
The shortest formally accepted command line is this:

java -cp saxon9pe.jar;tagsoup-1.2.1.jar net.sf.saxon.Query -qs:"saxon:parse-html(doc('test.xhtm'))//*:div[@id='ps-content']"



However, it returns this empty result:

<?xml version="1.0" encoding="UTF-8"?>
 
LVL 35

Expert Comment

by:mccarl
ID: 39267129
Wow, I would have thought at least a split of the points would be appropriate here!
 

Author Comment

by:lucavilla
ID: 39267195
Ok right, in conclusion I found no solution yet...
I reposted the question here: http://www.experts-exchange.com/Web_Development/Web_Languages-Standards/XML/Q_28164398.html

;)
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 39269052
I have my Saxon-PE licenses hooked up in XML IDEs, so I must check whether I can launch it from the command line. I had a chat with Saxonica people last week, however, and they claimed it should definitely work this way.

But I was wondering... you want to do all of this on the command line? In one single command?
Does that mean you cannot have an XSLT or XQuery file next to it? You would just need to reference the XSLT file and could still call the actual process in one go. I would go for XSLT then, given that it supports the unparsed-text function.
Note that you can still pass parameters to the XSLT, so the XSLT could be a stub file and the actual XPath could be passed as a parameter to the XSLT... it gives you all the dynamics you could possibly need.
 

Author Comment

by:lucavilla
ID: 39269124
Given the problem with XPath special characters on the command line, I decided to go the single command line + XSLT or XQuery file route.
Any idea about the most efficient solution?
 
LVL 60

Assisted Solution

by:Geert Bormans
Geert Bormans earned 100 total points
ID: 39269126
Saxon uses the same underlying parser for both.
I don't think it matters

Just getting a single value... then XQuery sounds more natural to me
(note that you had better make a full file URI from the file reference).

I think this is the XQuery you would need

declare namespace saxon="http://saxon.sf.net/";

let $doc := saxon:parse-html(collection('file:///C:/Users/User/Desktop?select=test2.html;unparsed=yes'))//*:div[@id='ps-content']

return $doc 

 

Accepted Solution

by:
lucavilla earned 0 total points
ID: 39282618
Solution:

____file query.xq____
declare default element namespace 'http://www.w3.org/1999/xhtml';
doc('page.html')//div[@id='ps-content']
______________________

Command-line using Nailgun:
ng.exe --nailgun-port 2114 net.sf.saxon.Query -x:"org.ccil.cowan.tagsoup.Parser" query.xq
 

Author Closing Comment

by:lucavilla
ID: 39292568
solved