Solved

XPath Syntax help in conjunction with perl XML::LibXML

Posted on 2016-08-08
Last Modified: 2016-08-19
I'm trying to extract some data from the georss.xml file below using the Perl script that follows. I'm struggling to get the values of the following:

<feed sc:countryCodes="US"  # need US

  <category term="Wanted" label="Wanted" url="http://otherurl.com/pathto/FINDME" />  # need the value of term where the value of url contains FINDME

The values of the two entry/title fields, while knowing which one has type="text" (value = fileName.pdf)
<point xmlns="http://www.opengis.net/gml" srsName="urn:ogc:crs:EPSG:4326">  # need the value of srsName




perl script

 
#!C:\strawberry\perl\bin\perl.exe

use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;
use Data::Dump qw(dump);


my $filename = 'georss.xml';

my $dom = XML::LibXML->load_xml(location => $filename);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => "http://www.w3.org/2005/Atom");
$xpc->registerNs(georss => "http://www.georss.org/georss");

my $title =  $xpc->findnodes('//dft:feed/dft:title');
print "title $title\n"; # GOOD
#my $point = $xpc->findnodes('//dft:feed/georss:where/dft:Point/dft:pos'); ## this doesn't find anything
my $point = $xpc->findnodes('//dft:feed/georss:where');
   $point =~ s/^\s*//;  # clean white space unsure why but had loads
   $point =~ s/\s*$//;
   print "point $point\n"; # GOOD

foreach my $Etitle ($xpc->findnodes('//dft:feed/dft:title')) {
    print "Etitle $Etitle\n";  # prints <title type="text">fileName.pdf</title>
    my $EtitleVal = $Etitle->findvalue('./title');
    if($Etilte =~ m/jpg/){
      print "Image $EtitleVal\n";  # prints Image
     } 
    elsif($Etilte =~ m/pdf/){
      print "PDF $EtitleVal\n";  # prints <title type="text">fileName.pdf</title>
     } 
    
}



georss.xml

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:georss="http://www.georss.org/georss" xmlns:gml="http://www.opengis.net/gml"sc:countryCodes="US">
  <title type="text">Earthquakes</title>
  <subtitle>International earthquake observation labs</subtitle>
  <link href="http://example.org/" />
  <updated>2005-12-13T18:30:02Z</updated>
  <category term="Note" label ="Note" url="http://exampleurl.com/pathto/" />
  <category term="Wanted" label ="Wanted" url="http://otherurl.com/pathto/FINDME" />
  <author>
    <name>Dr. Thaddeus Remor</name>
    <email>tremor@quakelab.edu</email>
  </author>
  <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
  <georss:where>
     <point xmlns="http://www.opengis.net/gml" srsName="urn:ogc:crs:EPSG:4326">
      <pos>45.256 -71.92</pos>
     </point>
  </georss:where>

  <entry>
    <title type="text">fileName.pdf</title>
   
  </entry>
  <entry>
    <title>fileName.jpg</title>
  </entry>
</feed>


Question by:trevor1940
 
LVL 60

Accepted Solution

by:
Geert Bormans earned 500 total points
ID: 41747479
#my $point = $xpc->findnodes('//dft:feed/georss:where/dft:Point/dft:pos'); ## this doesn't find anything

Correct: pos is inside a gml:point, not a dft:Point.
(Note that the default namespace changes at the point level, so pos is in the gml namespace as well.)

Add a
$xpc->registerNs(gml => "http://www.opengis.net/gml");

and change the XPath to

my $point = $xpc->findnodes('//dft:feed/georss:where/gml:point/gml:pos');
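
For reference, a minimal sketch of this lookup. The inline document is an assumption: a trimmed-down stand-in for georss.xml with the srsName quote closed. Because the default namespace declared on point is inherited by its children, pos is addressed through the gml prefix here too. The same context node also gives the srsName value asked about in the question.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Trimmed-down stand-in for georss.xml (assumption, not the real file)
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:georss="http://www.georss.org/georss">
  <georss:where>
    <point xmlns="http://www.opengis.net/gml" srsName="urn:ogc:crs:EPSG:4326">
      <pos>45.256 -71.92</pos>
    </point>
  </georss:where>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft    => 'http://www.w3.org/2005/Atom');
$xpc->registerNs(georss => 'http://www.georss.org/georss');
$xpc->registerNs(gml    => 'http://www.opengis.net/gml');

# point and pos both live in the gml namespace
my $pos = $xpc->findvalue('/dft:feed/georss:where/gml:point/gml:pos');

# srsName has no prefix, so it is in no namespace
my $srs = $xpc->findvalue('/dft:feed/georss:where/gml:point/@srsName');

print "pos $pos\n";
print "srsName $srs\n";
```

findvalue returns the string value directly, which avoids the whitespace clean-up needed when the whole georss:where element is stringified.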
0
 
LVL 1

Author Comment

by:trevor1940
ID: 41747782
Thank you for that.

Any idea about the other issues?
0
 
LVL 60

Assisted Solution

by:Geert Bormans
Geert Bormans earned 500 total points
ID: 41747879
/dft:feed/@sc:countryCodes
for the "US",
but you then need to bind the sc namespace
(the snippet you posted is invalid, since the sc namespace is not declared)
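
A minimal sketch of binding the sc prefix. The namespace URI http://example.com/sc is purely hypothetical, since the thread never shows the real one. The URI registered on the XPath context has to match whatever xmlns:sc="..." says in the actual file.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical namespace URI for the sc prefix (assumption)
my $SC_NS = 'http://example.com/sc';

# Stand-in feed element with the sc namespace declared
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:sc="http://example.com/sc"
      sc:countryCodes="US"/>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');
$xpc->registerNs(sc  => $SC_NS);

# Prefixed attributes need the prefix in the XPath as well
my $country = $xpc->findvalue('/dft:feed/@sc:countryCodes');
print "country $country\n";
```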
0

 
LVL 60

Assisted Solution

by:Geert Bormans
Geert Bormans earned 500 total points
ID: 41747881
/dft:feed/dft:category[contains(@url, 'FINDME')]/@term
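
As a minimal sketch, the expression can be run like this against the two category elements from the question (the inline document is a cut-down stand-in for georss.xml). The q{} quoting is only there to keep Perl from interpolating @url and @term.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Cut-down stand-in for georss.xml (assumption)
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <category term="Note" label="Note" url="http://exampleurl.com/pathto/" />
  <category term="Wanted" label="Wanted" url="http://otherurl.com/pathto/FINDME" />
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

# Select the term attribute of the category whose url contains FINDME
my $term = $xpc->findvalue(
    q{/dft:feed/dft:category[contains(@url, 'FINDME')]/@term}
);
print "term $term\n";
```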
0
 
LVL 1

Author Comment

by:trevor1940
ID: 41747963
Thanks

(the snippet you posted is invalid, since the sc namespace is not declared)

I'll double-check when I'm back in the office, but I'm pretty sure this is how it is in the actual file. No, I cannot post it, before I'm asked.
0
 
LVL 60

Assisted Solution

by:Geert Bormans
Geert Bormans earned 500 total points
ID: 41747967
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:georss="http://www.georss.org/georss" xmlns:gml="http://www.opengis.net/gml"sc:countryCodes="US">

is the first line of the XML you posted.
Two pieces are missing to make it valid:

  • there is no xmlns:sc="..."
  • there should be a space between the gml namespace declaration and the sc:countryCodes

This won't even parse.
0
 
LVL 1

Author Comment

by:trevor1940
ID: 41748365
Apologies for the errors in the XML; they were caused by fat fingers when retyping from a closed system onto an internet PC. All of the above worked.
One last question, given this:

foreach my $Etitle ($xpc->findnodes('//dft:feed/dft:entry/dft:title')) {
    print "Etitle $Etitle\n";  # prints <title type="text">fileName.pdf</title>
    my $EtitleVal = $Etitle->findvalue('./title');  ## Fails
    if($Etilte =~ m/jpg/){
      print "Image $EtitleVal\n";  # prints Image
     } 
    elsif($Etilte =~ m/pdf/){
      print "PDF $EtitleVal\n";  # prints <title type="text">fileName.pdf</title>
     } 
    
}



First, why is this printing the tag, e.g. "<title type="text">fileName.pdf</title>"?
Secondly, how do I tell the difference between the two?
0
 
LVL 60

Assisted Solution

by:Geert Bormans
Geert Bormans earned 500 total points
ID: 41748382
Caveat: I am not a Perl programmer at all, so there might be some rough edges to my suggestions.

findnodes finds the actual node, and it seems that the "print" operation on a node serializes the whole node, attributes included. That is why you don't see "fileName.pdf" but the entire XML snippet.

I believe ->data is the solution
print $Etitle->data

Not sure what you want to do here
my $EtitleVal = $Etitle->findvalue('./title');
but I guess you need
my $EtitleVal = $Etitle->data;

You realize that the variable name is wrong in the test?
if($Etilte =~ m/jpg/){
$Etilte instead of $Etitle?
But I guess that is a copy-paste error too.
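
A minimal sketch of the difference between the serialized node and its text, using a cut-down stand-in for georss.xml. One caveat on the method name: in XML::LibXML, data belongs to text nodes, and for an element node such as title the usual call is textContent, so that is what the sketch uses.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Cut-down stand-in for georss.xml (assumption)
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title type="text">fileName.pdf</title>
  </entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

# List assignment takes the first matching element node
my ($title) = $xpc->findnodes('/dft:feed/dft:entry/dft:title');

my $markup = $title->toString;     # the element serialized as XML
my $text   = $title->textContent;  # only the text inside the element

# Pattern matching belongs on the text, not on the node
my $kind = $text =~ /\.pdf$/ ? 'PDF' : 'other';

print "markup $markup\n";
print "text $text ($kind)\n";
```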
1
 
LVL 1

Author Comment

by:trevor1940
ID: 41748389
Yes to the copy/paste errors.

So are you saying that getting my $EtitleVal = $Etitle->data; and then pattern matching on it with m/jpg/ is the way to do this, rather than testing for <title type="text">?
0
 
LVL 60

Assisted Solution

by:Geert Bormans
Geert Bormans earned 500 total points
ID: 41748395
You are doing a regex on a node; you should do it on its textual data, yes
(I guess; as I said, I am not a Perl programmer).

Testing the attribute is something you should do in the XPath.
I think if I needed all of them, I would do something like this
$xpc->findnodes("//dft:feed/dft:entry/dft:title[\@type='text']")
for the text-typed nodes
and
$xpc->findnodes("//dft:feed/dft:entry/dft:title[not(\@type='text')]")
for the others.
Then you don't need extra testing afterwards.
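
A minimal sketch of the two predicate queries, run against a cut-down stand-in for georss.xml with the two entries from the question. The variable names are illustrative only.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Cut-down stand-in for georss.xml (assumption)
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <title type="text">Earthquakes</title>
  <entry>
    <title type="text">fileName.pdf</title>
  </entry>
  <entry>
    <title>fileName.jpg</title>
  </entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

# Entry titles that carry type="text"
my @text_titles = map { $_->textContent }
    $xpc->findnodes(q{/dft:feed/dft:entry/dft:title[@type='text']});

# Entry titles without type="text" (missing attribute or another value)
my @other_titles = map { $_->textContent }
    $xpc->findnodes(q{/dft:feed/dft:entry/dft:title[not(@type='text')]});

print "text-typed: @text_titles\n";
print "others: @other_titles\n";
```

Going through dft:entry keeps the feed-level title (Earthquakes) out of the first result set.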
1
 
LVL 1

Author Comment

by:trevor1940
ID: 41748996
Yes, I was doing a regex. I like your way better, thank you.

It seems that within a foreach loop I need to do something like this:

foreach my $Etitle ($xpc->findnodes('//dft:feed/dft:entry/dft:title')) {
         my $EtitleVal = $Etitle->textContent;
}


whereas if you go after a single entry you don't:

  my $Etitle = $xpc->findnodes('//dft:feed/dft:entry/dft:title');



Given
  <entry>
    <title>fileName.jpg</title>
     <link href="PathTo/fileName.jpg" />
  </entry>



Can I go after <entry> and get its children, in order to keep title and link together?
0
 
LVL 60

Assisted Solution

by:Geert Bormans
Geert Bormans earned 500 total points
ID: 41749230
my ($Entry) = $xpc->findnodes('//dft:feed/dft:entry');
my ($ETitle) = $xpc->findnodes('dft:title', $Entry);

should work
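
A minimal sketch of walking every entry and reading its title and link together, assuming the same dft prefix registration as in the question's script. Passing the entry as the second argument makes the relative XPath start from that entry. The inline document is a stand-in built from the snippets posted in this thread.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Stand-in built from the entries posted in the thread (assumption)
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title type="text">fileName.pdf</title>
    <link type="application/pdf" href="PathTo/fileName.pdf" />
  </entry>
  <entry>
    <title>fileName.jpg</title>
    <link href="PathTo/fileName.jpg" />
  </entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

my @pairs;
for my $entry ($xpc->findnodes('/dft:feed/dft:entry')) {
    # Relative paths are evaluated with $entry as the context node
    my $title = $xpc->findvalue('dft:title', $entry);
    my $href  = $xpc->findvalue('dft:link/@href', $entry);
    push @pairs, [$title, $href];
    print "$title => $href\n";
}
```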
0
 
LVL 1

Author Comment

by:trevor1940
ID: 41750550
Thank you for your continued help.

This seems to be how to get the child nodes:

foreach my $Entry ($xpc->findnodes("//dft:feed/dft:entry")) {

     # The relative path keeps the search inside the current entry
     foreach my $Images ($xpc->findnodes("dft:title[not(\@type='text')]", $Entry)) {
         my $ImageVal = $Images->textContent;
          ####  This finds the image title(s) of this entry

     }

}



Given this

  <entry xmlns:georss="http://www.georss.org/georss/10" xsi:schemaLocation ="http://www.url1.net/path/ http://www.url2.net/path/11  http://www.url3.net/path/23" >
    <title>fileName.jpg</title>
     <link href="PathTo/fileName.jpg" />
  </entry>
  <entry>
    <title type="text">fileName.pdf</title>
     <link type="application/pdf"  href="PathTo/fileName.pdf" />
  </entry>



Is there a way of testing whether <entry> carries a namespace declaration or an xsi:schemaLocation attribute? I searched Google but found nothing, possibly because I'm not sure what to search for (e.g. "XPath node has namespace").
0
 
LVL 1

Author Closing Comment

by:trevor1940
ID: 41752000
Thank You
0
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 41752106
Welcome,

I just noticed that I apparently missed one follow-up question.
I am not sure how to test for namespace nodes.
For a parser it is only relevant to know which is the default namespace and which prefixes are bound to which namespace at a specific location, regardless of the level at which the binding is declared.

Note that XPath allows you to look for all namespace nodes:
//namespace::*
That could help you to get the namespace nodes on your current node.
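
On the Perl side, one possible approach (a sketch, not something tested against the real file) is to skip XPath for this check and ask the element itself. XML::LibXML's getNamespaces lists only the declarations written on that element, and hasAttributeNS tests for xsi:schemaLocation. The inline document is a stand-in, and it assumes the xsi prefix is bound to the standard XMLSchema-instance URI on the feed element, which the posted snippet does not show.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

my $XSI = 'http://www.w3.org/2001/XMLSchema-instance';

# Stand-in document; the xmlns:xsi declaration on feed is an assumption
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <entry xmlns:georss="http://www.georss.org/georss/10"
         xsi:schemaLocation="http://www.url1.net/path/ http://www.url2.net/path/11">
    <title>fileName.jpg</title>
  </entry>
  <entry>
    <title type="text">fileName.pdf</title>
  </entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

my @report;
for my $entry ($xpc->findnodes('/dft:feed/dft:entry')) {
    # Only declarations written on this element, not inherited ones
    my @prefixes = map {
        my $p = $_->declaredPrefix;
        defined $p ? $p : '(default)';
    } $entry->getNamespaces;

    # Attribute lookup by namespace URI and local name
    my $has_schema = $entry->hasAttributeNS($XSI, 'schemaLocation') ? 1 : 0;

    push @report, { prefixes => \@prefixes, schema => $has_schema };
    printf "%s: declares [%s], schemaLocation %s\n",
        $xpc->findvalue('dft:title', $entry),
        join(', ', @prefixes),
        $has_schema ? 'yes' : 'no';
}
```

The XPath namespace axis reports every namespace in scope, inherited ones included, so it cannot tell by itself where a binding was declared. That is the reason for using getNamespaces here.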
0
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 41752111
stackoverflow.com/questions/7388555/xmllibxml-find-and-register-namespaces-used-in-a-document

for inspiration
0
 
LVL 1

Author Comment

by:trevor1940
ID: 41752638
Thanks, that was one of the few links I had found.

I closed this because the task has been pulled; however, I may need to revisit it.

For my own interest:

If you can't test for namespace nodes directly, could you find the child via "(\@type='text')", then get the parent <entry>, then search back down for <link>, thus ensuring <title> and its siblings are dealt with together?
0
 
LVL 60

Expert Comment

by:Geert Bormans
ID: 41752903
yes, you can do that
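
A minimal sketch of that round trip, assuming the same dft prefix registration as in the question's script: select the title by its attribute, step up with parentNode, then search back down for the link. The inline document is a stand-in built from the entries posted earlier.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Stand-in built from the entries posted in the thread (assumption)
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>fileName.jpg</title>
    <link href="PathTo/fileName.jpg" />
  </entry>
  <entry>
    <title type="text">fileName.pdf</title>
    <link type="application/pdf" href="PathTo/fileName.pdf" />
  </entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

my @found;
for my $title ($xpc->findnodes(q{//dft:entry/dft:title[@type='text']})) {
    my $entry = $title->parentNode;                          # up to <entry>
    my $href  = $xpc->findvalue('dft:link/@href', $entry);   # back down to <link>
    push @found, [$title->textContent, $href];
    print $title->textContent, " => $href\n";
}
```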
0
