Solved

XPath Syntax help in conjunction with perl XML::LibXML

Posted on 2016-08-08
Last Modified: 2016-08-19
I'm trying to extract some data from the georss.xml file below using the Perl script that follows. I'm struggling to get the following values:

<feed sc:countryCodes="US"  # need US

  <category term="Wanted" label="Wanted" url="http://otherurl.com/pathto/FINDME" />  # need value of term where value of url contains FINDME

The values of the two entry/title fields, while knowing which one has type="text" (value = fileName.pdf)

<point xmlns="http://www.opengis.net/gml" srsName="urn:ogc:crs:EPSG:4326">  # need value of srsName




perl script

 
#!C:\strawberry\perl\bin\perl.exe

use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;
use Data::Dump qw(dump);


my $filename = 'georss.xml';

my $dom = XML::LibXML->load_xml(location => $filename);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => "http://www.w3.org/2005/Atom");
$xpc->registerNs(georss => "http://www.georss.org/georss");

my $title =  $xpc->findnodes('//dft:feed/dft:title');
print "title $title\n"; # GOOD
#my $point = $xpc->findnodes('//dft:feed/georss:where/dft:Point/dft:pos'); ## this doesn't find anything
my $point = $xpc->findnodes('//dft:feed/georss:where');
   $point =~ s/^\s*//;  # clean white space unsure why but had loads
   $point =~ s/\s*$//;
   print "point $point\n"; # GOOD

foreach my $Etitle ($dom->findnodes('//dft:feed/dft:title')) {
    print "Etitle $Etitle\n";  # prints <title type="text">fileName.pdf</title>
    my $EtitleVal = $Etitle->findvalue('./title');
    if($Etilte =~ m/jpg/){
      print "Image $EtitleVal\n";  # prints Image
     } 
    elsif($Etilte =~ m/pdf/){
      print "PDF $EtitleVal\n";  # prints <title type="text">fileName.pdf</title>
     } 
    
}



georss.xml

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:georss="http://www.georss.org/georss" xmlns:gml="http://www.opengis.net/gml"sc:countryCodes="US">
  <title type="text">Earthquakes</title>
  <subtitle>International earthquake observation labs</subtitle>
  <link href="http://example.org/" />
  <updated>2005-12-13T18:30:02Z</updated>
  <category term="Note" label ="Note" url="http://exampleurl.com/pathto/" />
  <category term="Wanted" label ="Wanted" url="http://otherurl.com/pathto/FINDME" />
  <author>
    <name>Dr. Thaddeus Remor</name>
    <email>tremor@quakelab.edu</email>
  </author>
  <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
  <georss:where>
     <point xmlns="http://www.opengis.net/gml" srsName="urn:ogc:crs:EPSG:4326">
      <pos>45.256 -71.92</pos>
     </point>
  </georss:where>

  <entry>
    <title type="text">fileName.pdf</title>
   
  </entry>
  <entry>
    <title>fileName.jpg</title>
  </entry>
</feed>


Question by:trevor1940
18 Comments
 
LVL 60

Accepted Solution

by:
Geert Bormans earned 500 total points
ID: 41747479
#my $point = $xpc->findnodes('//dft:feed/georss:where/dft:Point/dft:pos'); ## this doesn't find anything

Correct: pos is inside a gml:point, not a dft:Point.
(Note that the default namespace changes at the point level, so both point and pos are in the GML namespace, and element names are case-sensitive.)

Add a
$xpc->registerNs(gml => "http://www.opengis.net/gml");

and change the XPath to

my $point = $xpc->findnodes('//dft:feed/georss:where/gml:point/gml:pos');
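
A minimal sketch of that lookup, assuming a cut-down, well-formed copy of the feed held in a string (the inline XML is illustrative, not the real file). Because pos inherits the default namespace redeclared on point, it is addressed here as gml:pos; the same path also reaches the srsName attribute asked about in the question:

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Illustrative, well-formed excerpt of the feed (an assumption, not the real file).
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:georss="http://www.georss.org/georss">
  <georss:where>
    <point xmlns="http://www.opengis.net/gml" srsName="urn:ogc:crs:EPSG:4326">
      <pos>45.256 -71.92</pos>
    </point>
  </georss:where>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft    => 'http://www.w3.org/2005/Atom');
$xpc->registerNs(georss => 'http://www.georss.org/georss');
$xpc->registerNs(gml    => 'http://www.opengis.net/gml');

# findvalue returns the string value of whatever the expression matches.
my $pos = $xpc->findvalue('/dft:feed/georss:where/gml:point/gml:pos');
my $srs = $xpc->findvalue('/dft:feed/georss:where/gml:point/@srsName');
print "pos: $pos\nsrsName: $srs\n";
```

The prefixes are local to the XPathContext; only the namespace URIs have to match the document.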
0
 
LVL 1

Author Comment

by:trevor1940
ID: 41747782
Thank you for that

Any idea about the  other  issues?
0
 
LVL 60

Assisted Solution

by:Geert Bormans
Geert Bormans earned 500 total points
ID: 41747879
/dft:feed/@sc:countryCodes
returns the "US" value,
but you then need to bind the sc prefix with registerNs
(the snippet you posted is invalid, since the sc namespace is not declared)
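
A short sketch of binding the sc prefix, under a loud assumption: the real namespace URI for sc is not given anywhere in the thread, so http://example.com/sc below is a placeholder that must be replaced with whatever the actual file declares in xmlns:sc.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Illustrative feed; the sc namespace URI is a placeholder (assumption).
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:sc="http://example.com/sc"
      sc:countryCodes="US">
  <title type="text">Earthquakes</title>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');
$xpc->registerNs(sc  => 'http://example.com/sc');   # must equal the URI in the document

# A prefixed attribute is selected with @prefix:name.
my $country = $xpc->findvalue('/dft:feed/@sc:countryCodes');
print "countryCodes: $country\n";
```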
0
 
LVL 60

Assisted Solution

by:Geert Bormans
Geert Bormans earned 500 total points
ID: 41747881
/dft:feed/dft:category[contains(@url, 'FINDME')]/@term
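
As a sketch of how that expression could be used from Perl (the inline XML is an illustrative excerpt of the posted feed), q{...} keeps Perl from interpolating the @url inside the XPath:

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Illustrative excerpt of the feed with the two category elements.
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <category term="Note" label="Note" url="http://exampleurl.com/pathto/" />
  <category term="Wanted" label="Wanted" url="http://otherurl.com/pathto/FINDME" />
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

# Select the term attribute of the category whose url contains FINDME.
my $term = $xpc->findvalue(
    q{/dft:feed/dft:category[contains(@url, 'FINDME')]/@term}
);
print "term: $term\n";
```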
0
 
LVL 1

Author Comment

by:trevor1940
ID: 41747963
Thanks.

(the snippet you posted is invalid, since the sc namespace is not declared)

I'll double-check when I'm back in the office, but I'm pretty sure this is how it is in the actual file. And no, before I'm asked: I cannot post it.
0
 
LVL 60

Assisted Solution

by:Geert Bormans
Geert Bormans earned 500 total points
ID: 41747967
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:georss="http://www.georss.org/georss" xmlns:gml="http://www.opengis.net/gml"sc:countryCodes="US">

is the first line of the XML you posted.
Two things are missing before it is well-formed:

  • there is no xmlns:sc="..." declaration
  • there should be a space between the gml namespace declaration and sc:countryCodes

As posted, this will not even parse.
0
 
LVL 1

Author Comment

by:trevor1940
ID: 41748365
Apologies for the errors in the XML; they were caused by fat fingers while retyping from a closed system to an internet PC. All of the above worked.
One last question, given this:

foreach my $Etitle ($dom->findnodes('//dft:feed/dft:entry/dft:title')) {
    print "Etitle $Etitle\n";  # prints <title type="text">fileName.pdf</title>
    my $EtitleVal = $Etitle->findvalue('./title');  ## Fails
    if($Etilte =~ m/jpg/){
      print "Image $EtitleVal\n";  # prints Image
     } 
    elsif($Etilte =~ m/pdf/){
      print "PDF $EtitleVal\n";  # prints <title type="text">fileName.pdf</title>
     } 
    
}



First, why is this printing the whole tag, e.g. "<title type="text">fileName.pdf</title>"?
Second, how do I tell the difference between the two?
0
 
LVL 60

Assisted Solution

by:Geert Bormans
Geert Bormans earned 500 total points
ID: 41748382
caveat: I am not a perl programmer at all, so there might be some rough edges to my suggestions

$dom->findnodes returns the actual node, and it seems that printing a node serializes it, attributes included. That is why you see the entire XML snippet rather than just "fileName.pdf".

I believe ->data is the solution
print $Etitle->data

not sure what you want to do here
my $EtitleVal = $Etitle->findvalue('./title');
but I guess you need
my $EtitleVal = $Etitle->data;

You realize that the variable name is wrong in the test?
if($Etilte =~ m/jpg/){
$Etilte instead of $Etitle?
but I guess that is a copy paste error too
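
A minimal sketch of the difference between serializing an element and reading its text, using an illustrative excerpt of the feed. One caveat on the suggestion above: in XML::LibXML, data is a method of text nodes, so for an element node this sketch uses textContent instead.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Illustrative excerpt with the two entries from the question.
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title type="text">fileName.pdf</title></entry>
  <entry><title>fileName.jpg</title></entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

my @texts;
for my $title ($xpc->findnodes('/dft:feed/dft:entry/dft:title')) {
    my $markup = $title->toString;      # the element serialized as XML
    my $text   = $title->textContent;   # only the text inside the element
    push @texts, $text;
    print "markup: $markup\n";
    print "text:   $text\n";
}
```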
1
 
LVL 1

Author Comment

by:trevor1940
ID: 41748389
Yes to copy / paste errors

So are you saying that getting my $EtitleVal = $Etitle->data; and then pattern matching it with m/jpg/ is the way to do this, rather than testing for <title type="text">?
0
 
LVL 60

Assisted Solution

by:Geert Bormans
Geert Bormans earned 500 total points
ID: 41748395
You are running a regex against a node; you should run it against the node's text content, yes
(I guess; as I said, I am not a Perl programmer).

Testing the attribute is something you should do in the XPath.
If I needed both sets, I would do something like this
$dom->findnodes("//dft:feed/dft:entry/dft:title[\@type='text']")
for the text-typed nodes
and
$dom->findnodes("//dft:feed/dft:entry/dft:title[not(\@type='text')]")
for the others.
Then you don't need any extra testing afterwards.
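
The two predicates can be sketched as follows, assuming an illustrative excerpt of the feed and an XPathContext with the Atom namespace registered as dft (as in the original script):

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Illustrative excerpt with one typed and one untyped title.
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title type="text">fileName.pdf</title></entry>
  <entry><title>fileName.jpg</title></entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

# Titles that carry type="text".
my @text_titles = map { $_->textContent }
    $xpc->findnodes(q{/dft:feed/dft:entry/dft:title[@type='text']});

# Titles without that attribute value (including titles with no type at all).
my @other_titles = map { $_->textContent }
    $xpc->findnodes(q{/dft:feed/dft:entry/dft:title[not(@type='text')]});

print "text:  @text_titles\n";
print "other: @other_titles\n";
```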
1
 
LVL 1

Author Comment

by:trevor1940
ID: 41748996
Yes, I was using a regex. I like your way better, thank you.

It seems that within a foreach loop I need to do something like this:

foreach my $Etitle ($dom->findnodes('//dft:feed/dft:entry/dft:title')) {
         my $EtitleVal = $Etitle->textContent;
}


whereas if you go after a single entry you don't:

  my $Etitle = $dom->findnodes('//dft:feed/dft:entry/dft:title');
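
A likely explanation for that difference is Perl context rather than the number of matches: in list context findnodes returns node objects, while in scalar context it returns an XML::LibXML::NodeList, which stringifies to the text of its nodes. A small sketch on an illustrative excerpt:

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Illustrative excerpt with the two entry titles.
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title type="text">fileName.pdf</title></entry>
  <entry><title>fileName.jpg</title></entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

# Scalar context: one NodeList object holding every match.
my $list = $xpc->findnodes('/dft:feed/dft:entry/dft:title');
print "class: ", ref($list), ", matches: ", $list->size, "\n";
print "as text: ", $list->to_literal, "\n";

# get_node is 1-based; this is the first matching element.
my $first = $list->get_node(1);
print "first: ", $first->textContent, "\n";

# List context: individual nodes, each needing textContent for its text.
my @nodes = $xpc->findnodes('/dft:feed/dft:entry/dft:title');
```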



Given
  <entry>
    <title>fileName.jpg</title>
     <link href="PathTo/fileName.jpg" />
  </entry>



Can I go after '<entry>' and get its children, in order to keep title and link together?
0
 
LVL 60

Assisted Solution

by:Geert Bormans
Geert Bormans earned 500 total points
ID: 41749230
my $Entry = $dom->findnodes('//dft:feed/dft:entry');
my $ETitle = $Entry->findnodes('dft:title');

should work
0
 
LVL 1

Author Comment

by:trevor1940
ID: 41750550
Thank you for your continued help

This seems to be how to get the child nodes

foreach my $Entry ($dom->findnodes("//dft:feed/dft:entry")) {

     foreach my $Images ($dom->findnodes("//dft:title[not(\@type='text')]", $Entry)) {
         my $ImageVal = $Images->textContent;
          ####  This finds all the Images

     }

}
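
One caveat on the inner loop: a path starting with // searches the whole document whatever the context node is, which would explain why it finds all the images for every entry. A hedged sketch that keeps each entry's title and link together by using relative paths and passing the entry as the context node to the XPathContext (the XML is an illustrative excerpt, and the record layout is only an example):

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Illustrative excerpt with title and link inside each entry.
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>fileName.jpg</title>
    <link href="PathTo/fileName.jpg" />
  </entry>
  <entry>
    <title type="text">fileName.pdf</title>
    <link type="application/pdf" href="PathTo/fileName.pdf" />
  </entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

my @records;
for my $entry ($xpc->findnodes('/dft:feed/dft:entry')) {
    # Relative paths are evaluated against $entry, the second argument.
    my ($typed) = $xpc->findnodes(q{dft:title[@type='text']}, $entry);
    push @records, {
        title => $xpc->findvalue('dft:title', $entry),
        href  => $xpc->findvalue('dft:link/@href', $entry),
        kind  => defined $typed ? 'text' : 'other',
    };
}

printf "%-5s %-14s %s\n", @{$_}{qw(kind title href)} for @records;
```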



Given this

  <entry xmlns:georss="http://www.georss.org/georss/10" xsi:schemaLocation ="http://www.url1.net/path/ http://www.url2.net/path/11  http://www.url3.net/path/23" >
    <title>fileName.jpg</title>
     <link href="PathTo/fileName.jpg" />
  </entry>
  <entry>
    <title type="text">fileName.pdf</title>
     <link type="application/pdf"  href="PathTo/fileName.pdf" />
  </entry>



Is there a way of testing whether <entry> carries a namespace declaration or an xsi:schemaLocation attribute? I searched Google but found nothing, possibly because I am not sure what to search for (e.g. "XPath node has namespace").
0
 
LVL 1

Author Closing Comment

by:trevor1940
ID: 41752000
Thank You
0
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 41752106
Welcome.

I just noticed that I apparently missed one follow-up question.
I am not sure how to test for namespace nodes.
For a parser it is only relevant which default namespace applies and which prefixes are bound to which namespaces at a specific location, regardless of the level at which the binding is declared.

Note that XPath allows you to look for all namespace nodes:
//namespace::*
That could help you get the namespace nodes on your current node.
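
On the Perl side, one possible approach (a sketch, not something confirmed in this thread) is to ask each entry element directly: getNamespaces lists only the namespaces declared on that element, and hasAttributeNS checks for xsi:schemaLocation. The XML below is illustrative and, unlike the posted snippet, declares the xsi prefix so that it parses.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

my $XSI = 'http://www.w3.org/2001/XMLSchema-instance';

# Illustrative feed; xmlns:xsi is declared on feed so the document is well-formed.
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <entry xmlns:georss="http://www.georss.org/georss/10"
         xsi:schemaLocation="http://www.url1.net/path/ http://www.url2.net/path/11">
    <title>fileName.jpg</title>
  </entry>
  <entry>
    <title type="text">fileName.pdf</title>
  </entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');
$xpc->registerNs(xsi => $XSI);

my @report;
for my $entry ($xpc->findnodes('/dft:feed/dft:entry')) {
    # Only declarations written on this element, not everything in scope.
    my @declared = map { $_->declaredURI } $entry->getNamespaces;
    my $has_schema = $entry->hasAttributeNS($XSI, 'schemaLocation') ? 1 : 0;
    push @report, { declared => \@declared, schema => $has_schema };
    print "declares: [@declared] schemaLocation: $has_schema\n";
}

# The attribute test can also be written as an XPath predicate.
my @with_schema = $xpc->findnodes('/dft:feed/dft:entry[@xsi:schemaLocation]');
```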
0
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 41752111
stackoverflow.com/questions/7388555/xmllibxml-find-and-register-namespaces-used-in-a-document

for inspiration
0
 
LVL 1

Author Comment

by:trevor1940
ID: 41752638
Thanks, that was one of the few links I had found.

I closed this because the task has been pulled; however, I may need to revisit it.

For my own interest:

If I can't test for namespace nodes directly, can I instead find the child via "[\@type='text']", get its parent <entry>, and then search back down for <link>, so that the title and its siblings are dealt with together?
0
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 41752903
yes, you can do that
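
A brief sketch of that route, on an illustrative excerpt: find the typed title, step up with parentNode, then look for link relative to the parent entry.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Illustrative excerpt with title and link inside each entry.
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>fileName.jpg</title>
    <link href="PathTo/fileName.jpg" />
  </entry>
  <entry>
    <title type="text">fileName.pdf</title>
    <link type="application/pdf" href="PathTo/fileName.pdf" />
  </entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

my %href_for;
for my $title ($xpc->findnodes(q{//dft:entry/dft:title[@type='text']})) {
    my $entry = $title->parentNode;                        # back up to <entry>
    my $href  = $xpc->findvalue('dft:link/@href', $entry); # down to the sibling link
    $href_for{ $title->textContent } = $href;
    print $title->textContent, " -> $href\n";
}
```

The same step up can be written inside the XPath itself with the parent axis (..), if a single expression is preferred.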
0
