Solved

XPath Syntax help in conjunction with perl XML::LibXML

Posted on 2016-08-08
Last Modified: 2016-08-19
I'm trying to extract some data from the georss.xml file below using the Perl script that follows. I'm struggling to get the following values:

<feed sc:countryCodes="US"  # need US

  <category term="Wanted" label="Wanted" url="http://otherurl.com/pathto/FINDME" />  # need the value of term where the value of url contains FINDME

The values of the two entry/title fields, while knowing which one has type="text" (value = fileName.pdf)
<point xmlns="http://www.opengis.net/gml" srsName="urn:ogc:crs:EPSG:4326">  # need the value of srsName




perl script

 
#!C:\strawberry\perl\bin\perl.exe

use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;
use Data::Dump qw(dump);


my $filename = 'georss.xml';

my $dom = XML::LibXML->load_xml(location => $filename);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => "http://www.w3.org/2005/Atom");
$xpc->registerNs(georss => "http://www.georss.org/georss");

my $title =  $xpc->findnodes('//dft:feed/dft:title');
print "title $title\n"; # GOOD
#my $point = $xpc->findnodes('//dft:feed/georss:where/dft:Point/dft:pos'); ## this doesn't find anything
my $point = $xpc->findnodes('//dft:feed/georss:where');
   $point =~ s/^\s*//;  # clean white space unsure why but had loads
   $point =~ s/\s*$//;
   print "point $point\n"; # GOOD

foreach my $Etitle ($xpc->findnodes('//dft:feed/dft:entry/dft:title')) {
    print "Etitle $Etitle\n";  # prints <title type="text">fileName.pdf</title>
    my $EtitleVal = $Etitle->findvalue('./title');
    if($Etilte =~ m/jpg/){
      print "Image $EtitleVal\n";  # prints Image
     } 
    elsif($Etilte =~ m/pdf/){
      print "PDF $EtitleVal\n";  # prints <title type="text">fileName.pdf</title>
     } 
    
}



georss.xml

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:georss="http://www.georss.org/georss" xmlns:gml="http://www.opengis.net/gml"sc:countryCodes=”US”>
  <title type="text">Earthquakes</title>
  <subtitle>International earthquake observation labs</subtitle>
  <link href="http://example.org/" />
  <updated>2005-12-13T18:30:02Z</updated>
  <category term="Note" label ="Note" url="http://exampleurl.com/pathto/" />
  <category term="Wanted" label ="Wanted" url="http://otherurl.com/pathto/FINDME" />
  <author>
    <name>Dr. Thaddeus Remor</name>
    <email>tremor@quakelab.edu</email>
  </author>
  <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
  <georss:where>
     <point xmlns="http://www.opengis.net/gml" srsName="urn:ogc:crs:EPSG:4326">
      <pos>45.256 -71.92</pos>
     </point>
  </georss:where>

  <entry>
    <title type="text">fileName.pdf</title>
   
  </entry>
  <entry>
    <title>fileName.jpg</title>
  </entry>
</feed>


Question by: trevor1940

18 Comments

Accepted Solution by: Geert Bormans

#my $point = $xpc->findnodes('//dft:feed/georss:where/dft:Point/dft:pos'); ## this doesn't find anything

Correct: pos is inside a gml:point, not a dft:Point, and pos itself is in the gml namespace as well
(note that the default namespace changes at the point level).

Add

$xpc->registerNs(gml => "http://www.opengis.net/gml");

and change the XPath:

my $point = $xpc->findnodes('//dft:feed/georss:where/gml:point/gml:pos');
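
A minimal sketch of that fix, assuming a trimmed, well-formed copy of the feed embedded as a string (the real georss.xml may differ). Because point redeclares the default namespace, its pos child inherits the gml namespace, while srsName is a plain attribute with no namespace.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Trimmed, well-formed stand-in for georss.xml (an assumption for this sketch)
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:georss="http://www.georss.org/georss">
  <georss:where>
    <point xmlns="http://www.opengis.net/gml" srsName="urn:ogc:crs:EPSG:4326">
      <pos>45.256 -71.92</pos>
    </point>
  </georss:where>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft    => 'http://www.w3.org/2005/Atom');
$xpc->registerNs(georss => 'http://www.georss.org/georss');
$xpc->registerNs(gml    => 'http://www.opengis.net/gml');

# findvalue returns the string value of the match, so there is no
# surrounding whitespace from georss:where to strip afterwards
my $pos = $xpc->findvalue('/dft:feed/georss:where/gml:point/gml:pos');
my $srs = $xpc->findvalue('/dft:feed/georss:where/gml:point/@srsName');

print "pos $pos\n";
print "srsName $srs\n";
```

Selecting pos directly, rather than georss:where, is also why the whitespace clean-up from the original script is not needed here.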

Author Comment by: trevor1940

Thank you for that.

Any idea about the other issues?

Assisted Solution by: Geert Bormans

/dft:feed/@sc:countryCodes
gives you the "US",
but you then need to bind the sc namespace with registerNs
(the snippet you posted is invalid, since the sc namespace is not declared).
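
A small sketch of reading that attribute. The namespace URI http://example.com/sc is a placeholder invented for this example, since the thread never shows the real one; whatever URI the actual file declares for sc is what has to be passed to registerNs.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# The sc namespace URI below is hypothetical; use the one declared in the real file
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:sc="http://example.com/sc"
      sc:countryCodes="US">
  <title type="text">Earthquakes</title>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');
$xpc->registerNs(sc  => 'http://example.com/sc');

# The prefix used in the XPath only has to map to the same URI as in the file
my $country = $xpc->findvalue('/dft:feed/@sc:countryCodes');
print "countryCodes $country\n";
```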

Assisted Solution by: Geert Bormans

/dft:feed/dft:category[contains(@url, 'FINDME')]/@term
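
As a sketch, here is how that expression might be used from Perl, assuming a cut-down feed that holds only the two category elements from the question:

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Only the category elements from the question, wrapped in a minimal feed
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <category term="Note" label="Note" url="http://exampleurl.com/pathto/" />
  <category term="Wanted" label="Wanted" url="http://otherurl.com/pathto/FINDME" />
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

# The predicate filters on the url attribute; the last step picks the term attribute
my $term = $xpc->findvalue(
    '/dft:feed/dft:category[contains(@url, "FINDME")]/@term'
);
print "term $term\n";
```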

Author Comment by: trevor1940

Thanks.

(the snippet you posted is invalid, since the sc namespace is not declared)

I'll double-check when I'm back in the office, but I'm pretty sure this is how it is in the actual file. And before I'm asked: no, I cannot post it.

Assisted Solution by: Geert Bormans

<feed xmlns="http://www.w3.org/2005/Atom" xmlns:georss="http://www.georss.org/georss" xmlns:gml="http://www.opengis.net/gml"sc:countryCodes=”US”>

That is the first line of the XML you posted.
Two pieces are missing to make it well-formed:

  • there is no xmlns:sc="..."
  • there should be a space between the gml namespace declaration and the sc:countryCodes

As posted, this won't even parse.

Author Comment by: trevor1940

Apologies for the errors in the XML; they were caused by fat fingers while retyping from a closed system to an internet PC. All of the above worked.
One last question, given this:

foreach my $Etitle ($xpc->findnodes('//dft:feed/dft:entry/dft:title')) {
    print "Etitle $Etitle\n";  # prints <title type="text">fileName.pdf</title>
    my $EtitleVal = $Etitle->findvalue('./title');  ## Fails
    if($Etilte =~ m/jpg/){
      print "Image $EtitleVal\n";  # prints Image
     } 
    elsif($Etilte =~ m/pdf/){
      print "PDF $EtitleVal\n";  # prints <title type="text">fileName.pdf</title>
     } 
    
}


First, why is this printing the whole tag, e.g. <title type="text">fileName.pdf</title>?
Second, how do I tell the difference between the two?

Assisted Solution by: Geert Bormans

Caveat: I am not a Perl programmer at all, so there might be some rough edges to my suggestions.

findnodes returns the actual node, and it seems that "print" on a node serializes the whole node, attributes included. That is why you don't see "fileName.pdf" but the entire XML snippet.

I believe ->textContent is the solution:
print $Etitle->textContent;

I'm not sure what you want to do here:
my $EtitleVal = $Etitle->findvalue('./title');
$Etitle already is the title element, so './title' looks for a title child that does not exist. I guess you need
my $EtitleVal = $Etitle->textContent;

You realize that the variable name is wrong in the test?
if($Etilte =~ m/jpg/){
$Etilte instead of $Etitle?
But I guess that is a copy-paste error too.
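
To illustrate the difference between the serialized node and its text, here is a hedged sketch using only the two entries from the question. It relies on toString and textContent from XML::LibXML; the file-type check on the extension is just one possible way to classify the names.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Only the two entries from the question, wrapped in a minimal feed
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title type="text">fileName.pdf</title>
  </entry>
  <entry>
    <title>fileName.jpg</title>
  </entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

my @texts;
for my $title ($xpc->findnodes('/dft:feed/dft:entry/dft:title')) {
    my $markup = $title->toString;      # the serialized element, tags included
    my $text   = $title->textContent;   # only the text inside the element
    push @texts, $text;

    # Match against the text, not against the node itself
    my $kind = $text =~ /\.pdf$/ ? 'PDF'
             : $text =~ /\.jpg$/ ? 'Image'
             :                     'Other';
    print "$kind $text (node was: $markup)\n";
}
```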

Author Comment by: trevor1940

Yes to the copy/paste errors.

So are you saying that I should get the text of the node into $EtitleVal and then pattern match it with m/jpg/, rather than test for <title type="text">?

Assisted Solution by: Geert Bormans

You are doing a regex on a node; you should do it on its textual data, yes
(I guess; as I said, I am not a Perl programmer).

Testing the attribute is something you should do in the XPath.
If I needed both sets, I would do something like this
$xpc->findnodes('//dft:feed/dft:entry/dft:title[@type="text"]')
for the text-typed nodes
and
$xpc->findnodes('//dft:feed/dft:entry/dft:title[not(@type="text")]')
for the others.
Then you don't need extra testing afterwards.
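
A minimal sketch of that split, again assuming just the two entries from the question. The array names @docs and @images are made up for the example.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title type="text">fileName.pdf</title>
  </entry>
  <entry>
    <title>fileName.jpg</title>
  </entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

# Titles that carry type="text"
my @docs = map { $_->textContent }
    $xpc->findnodes('/dft:feed/dft:entry/dft:title[@type="text"]');

# Titles without that attribute value (including titles with no type at all)
my @images = map { $_->textContent }
    $xpc->findnodes('/dft:feed/dft:entry/dft:title[not(@type="text")]');

print "text titles:  @docs\n";
print "other titles: @images\n";
```

Using double quotes inside the single-quoted Perl string avoids having to escape either the quotes or the @ sign.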

Author Comment by: trevor1940

Yes, I was doing a regex. I like your way better, thank you.

It seems that within a foreach loop I need to do something like this

foreach my $Etitle ($xpc->findnodes('//dft:feed/dft:entry/dft:title')) {
         my $EtitleVal = $Etitle->textContent;
}

whereas if you go after a single entry you don't.

  my $Etitle = $xpc->findnodes('//dft:feed/dft:entry/dft:title');

Given
  <entry>
    <title>fileName.jpg</title>
     <link href="PathTo/fileName.jpg" />
  </entry>

Can I go after <entry> and get its children, in order to keep title and link together?

Assisted Solution by: Geert Bormans

my ($Entry) = $xpc->findnodes('//dft:feed/dft:entry');
my ($ETitle) = $xpc->findnodes('dft:title', $Entry);

should work
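
A sketch of keeping title and link together per entry, assuming the entry layout from the question. Passing the entry as the second argument makes the relative XPath start from that entry; the @pairs array is only there to collect the output.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>fileName.jpg</title>
    <link href="PathTo/fileName.jpg" />
  </entry>
  <entry>
    <title type="text">fileName.pdf</title>
    <link type="application/pdf" href="PathTo/fileName.pdf" />
  </entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

my @pairs;
for my $entry ($xpc->findnodes('/dft:feed/dft:entry')) {
    # Relative paths (no leading slash) are evaluated against $entry
    my $title = $xpc->findvalue('dft:title', $entry);
    my $href  = $xpc->findvalue('dft:link/@href', $entry);
    push @pairs, "$title => $href";
}

print "$_\n" for @pairs;
```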

Author Comment by: trevor1940

Thank you for your continued help.

This seems to be how to get the child nodes:

foreach my $Entry ($xpc->findnodes("//dft:feed/dft:entry")) {

     # A relative path plus the context node keeps the search inside this entry
     foreach my $Images ($xpc->findnodes("./dft:title[not(\@type='text')]", $Entry)) {
         my $ImageVal = $Images->textContent;
          ####  This finds all the images

     }

}


Given this

  <entry xmlns:georss="http://www.georss.org/georss/10" xsi:schemaLocation ="http://www.url1.net/path/ http://www.url2.net/path/11  http://www.url3.net/path/23" >
    <title>fileName.jpg</title>
     <link href="PathTo/fileName.jpg" />
  </entry>
  <entry>
    <title type="text">fileName.pdf</title>
     <link type="application/pdf"  href="PathTo/fileName.pdf" />
  </entry>


Is there a way of testing whether <entry> carries a namespace declaration or an xsi:schemaLocation attribute? I searched Google but found nothing, possibly because I'm not sure what to search for, e.g. "XPath node has namespace".

Author Closing Comment by: trevor1940

Thank You

Expert Comment by: Geert Bormans

Welcome,

I just noticed that I apparently missed one follow-up question.
I am not sure how to test for namespace nodes.
For a parser it is only relevant to know which is the default namespace and which prefixes are bound to which namespace at a specific location, regardless of the level at which the binding is declared.

Note that XPath allows you to look for all namespace nodes:
//namespace::*
That could help you get the namespace nodes on your current node.
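
One possible approach from the Perl side, offered as a sketch rather than a definitive answer: XML::LibXML's getNamespaces returns only the declarations written on a given element, and hasAttributeNS can test for xsi:schemaLocation. The sample assumes that the xsi prefix is declared on feed (the posted entry snippet does not show where it is declared) and shortens the schemaLocation value.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

my $XSI = 'http://www.w3.org/2001/XMLSchema-instance';

# Assumption: xsi is declared on feed; the schemaLocation value is shortened
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <entry xmlns:georss="http://www.georss.org/georss/10"
         xsi:schemaLocation="http://www.url1.net/path/ http://www.url2.net/path/11">
    <title>fileName.jpg</title>
  </entry>
  <entry>
    <title type="text">fileName.pdf</title>
  </entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

my @report;
for my $entry ($xpc->findnodes('/dft:feed/dft:entry')) {
    # Only declarations written on this element, not everything in scope
    my @declared = $entry->getNamespaces;
    my @prefixes = map { $_->declaredPrefix // '(default)' } @declared;

    my $has_loc = $entry->hasAttributeNS($XSI, 'schemaLocation') ? 'yes' : 'no';

    push @report, sprintf '%s: declares [%s], schemaLocation %s',
        $xpc->findvalue('dft:title', $entry), join(',', @prefixes), $has_loc;
}

print "$_\n" for @report;
```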

Expert Comment by: Geert Bormans

stackoverflow.com/questions/7388555/xmllibxml-find-and-register-namespaces-used-in-a-document

for inspiration

Author Comment by: trevor1940

Thanks, that was one of the few links I had found.

I closed this because the task has been pulled; however, I may need to revisit it.

For my own interest:

If you can't test for namespace nodes directly, could you find the child via "(\@type='text')", then get the parent <entry>, then search back down for <link>, thus ensuring <title> and its siblings are dealt with together?

Expert Comment by: Geert Bormans

yes, you can do that
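
A rough sketch of that child-to-parent-to-sibling walk, assuming the two-entry sample from earlier in the thread. parentNode steps up from the matched title to its entry; the single XPath at the end is an alternative that does the same selection in one expression.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>fileName.jpg</title>
    <link href="PathTo/fileName.jpg" />
  </entry>
  <entry>
    <title type="text">fileName.pdf</title>
    <link type="application/pdf" href="PathTo/fileName.pdf" />
  </entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

my @found;
for my $title ($xpc->findnodes('//dft:entry/dft:title[@type="text"]')) {
    my $entry = $title->parentNode;                         # up to the entry
    my $href  = $xpc->findvalue('dft:link/@href', $entry);  # back down to link
    push @found, $title->textContent . " => $href";
}

# Alternative: let the predicate on entry do the filtering in one step
my $direct = $xpc->findvalue(
    '//dft:entry[dft:title[@type="text"]]/dft:link/@href'
);

print "$_\n" for @found;
print "direct $direct\n";
```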