XPath syntax help with Perl XML::LibXML

I’m trying to extract some data from the georss.xml file using the Perl script, both shown below. I’m struggling to get the following values:

<feed sc:countryCodes="US"  # need US

  <category term="Wanted" label="Wanted" url="http://otherurl.com/pathto/FINDME" />  # need the value of term where the value of url contains FINDME

The values of the two entry/title fields, while knowing which one has type="text" (value = fileName.pdf)

<point xmlns="http://www.opengis.net/gml" srsName="urn:ogc:crs:EPSG:4326">  # need the value of srsName




perl script

 
#!C:\strawberry\perl\bin\perl.exe

use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;
use Data::Dump qw(dump);


my $filename = 'georss.xml';

my $dom = XML::LibXML->load_xml(location => $filename);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => "http://www.w3.org/2005/Atom");
$xpc->registerNs(georss => "http://www.georss.org/georss");

my $title =  $xpc->findnodes('//dft:feed/dft:title');
print "title $title\n"; # GOOD
#my $point = $xpc->findnodes('//dft:feed/georss:where/dft:Point/dft:pos'); ## this doesn't find anything
my $point = $xpc->findnodes('//dft:feed/georss:where');
   $point =~ s/^\s*//;  # clean white space unsure why but had loads
   $point =~ s/\s*$//;
   print "point $point\n"; # GOOD

foreach my $Etitle ($xpc->findnodes('//dft:feed/dft:title')) {
    print "Etitle $Etitle\n";  # prints <title type="text">fileName.pdf</title>
    my $EtitleVal = $Etitle->findvalue('./title');
    if($Etilte =~ m/jpg/){
      print "Image $EtitleVal\n";  # prints Image
     } 
    elsif($Etilte =~ m/pdf/){
      print "PDF $EtitleVal\n";  # prints <title type="text">fileName.pdf</title>
     } 
    
}


georss.xml

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:georss="http://www.georss.org/georss" xmlns:gml="http://www.opengis.net/gml"sc:countryCodes="US">
  <title type="text">Earthquakes</title>
  <subtitle>International earthquake observation labs</subtitle>
  <link href="http://example.org/" />
  <updated>2005-12-13T18:30:02Z</updated>
  <category term="Note" label ="Note" url="http://exampleurl.com/pathto/" />
  <category term="Wanted" label ="Wanted" url="http://otherurl.com/pathto/FINDME" />
  <author>
    <name>Dr. Thaddeus Remor</name>
    <email>tremor@quakelab.edu</email>
  </author>
  <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
  <georss:where>
     <point xmlns="http://www.opengis.net/gml" srsName="urn:ogc:crs:EPSG:4326">
      <pos>45.256 -71.92</pos>
     </point>
  </georss:where>

  <entry>
    <title type="text">fileName.pdf</title>
   
  </entry>
  <entry>
    <title>fileName.jpg</title>
  </entry>
</feed>

trevor1940 asked:
Gertone (Geert Bormans), Information Architect, commented:
#my $point = $xpc->findnodes('//dft:feed/georss:where/dft:Point/dft:pos'); ## this doesn't find anything

Correct: pos is inside a gml:point, not a dft:Point.
(Note that the default namespace changes at the point level, so pos is in the gml namespace as well.)

Add
$xpc->registerNs(gml => "http://www.opengis.net/gml");

and change the XPath to

my $point = $xpc->findnodes('//dft:feed/georss:where/gml:point/gml:pos');
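To make the namespace point concrete, here is a minimal sketch against a trimmed, well-formed copy of the georss:where part of the feed. It also covers the srsName value asked for in the question. The inline XML string is only a stand-in for georss.xml; pos inherits the default namespace declared on point, which is why it is addressed with the gml prefix here.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Trimmed, well-formed stand-in for the georss:where part of georss.xml.
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:georss="http://www.georss.org/georss">
  <georss:where>
    <point xmlns="http://www.opengis.net/gml" srsName="urn:ogc:crs:EPSG:4326">
      <pos>45.256 -71.92</pos>
    </point>
  </georss:where>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft    => 'http://www.w3.org/2005/Atom');
$xpc->registerNs(georss => 'http://www.georss.org/georss');
$xpc->registerNs(gml    => 'http://www.opengis.net/gml');

# point and pos are both in the gml namespace. srsName is an unprefixed
# attribute, so it is in no namespace and takes no prefix in the XPath.
my $pos = $xpc->findvalue('/dft:feed/georss:where/gml:point/gml:pos');
my $srs = $xpc->findvalue('/dft:feed/georss:where/gml:point/@srsName');

print "pos $pos\n";
print "srsName $srs\n";
```

findvalue returns a plain string, so no whitespace clean-up is needed as long as the path ends at the pos element or the attribute itself.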
trevor1940 (Author) commented:
Thank you for that.

Any idea about the other issues?
Gertone (Geert Bormans), Information Architect, commented:
/dft:feed/@sc:countryCodes
returns the "US",
but you then need to bind the sc prefix with registerNs
(the snippet you posted is invalid, since the sc namespace is not declared).
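A small sketch of what binding the sc prefix could look like. The real namespace URI behind sc never appears in the thread, so http://example.com/sc below is purely a placeholder assumption; the URI passed to registerNs has to match whatever xmlns:sc declares in the actual file.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# ASSUMPTION: placeholder URI, the real sc namespace is not shown in the thread.
my $sc_ns = 'http://example.com/sc';

# Well-formed variant of the feed start tag, with xmlns:sc declared.
my $xml = <<"XML";
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:sc="$sc_ns"
      sc:countryCodes="US">
  <title type="text">Earthquakes</title>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');
$xpc->registerNs(sc  => $sc_ns);   # must equal the URI declared in the document

my $country = $xpc->findvalue('/dft:feed/@sc:countryCodes');
print "countryCodes $country\n";
```

The prefix used in the XPath does not have to be sc; only the namespace URI has to agree with the document.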

Gertone (Geert Bormans), Information Architect, commented:
/dft:feed/dft:category[contains(@url, 'FINDME')]/@term
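For illustration, a minimal sketch of that expression run through XML::LibXML::XPathContext, using only the two category elements from the posted feed as inline test data. q{...} is used so that Perl does not try to interpolate @url and @term.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Only the category elements of the posted feed, as a stand-in for georss.xml.
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <category term="Note" label="Note" url="http://exampleurl.com/pathto/" />
  <category term="Wanted" label="Wanted" url="http://otherurl.com/pathto/FINDME" />
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

# The predicate keeps only categories whose url attribute contains FINDME;
# the final step selects the term attribute of that category.
my $term = $xpc->findvalue(
    q{/dft:feed/dft:category[contains(@url, 'FINDME')]/@term}
);
print "term $term\n";
```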
trevor1940 (Author) commented:
Thanks.

(the snippet you posted is invalid, since the sc namespace is not declared)

I'll double-check when I'm back in the office, but I'm pretty sure this is how it is in the actual file. And no, before I'm asked, I cannot post it.
Gertone (Geert Bormans), Information Architect, commented:
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:georss="http://www.georss.org/georss" xmlns:gml="http://www.opengis.net/gml"sc:countryCodes="US">

is the first line of the XML you posted.
Two pieces are missing to make it valid:

  • there is no xmlns:sc="..."
  • there should be a space between the gml namespace declaration and the sc:countryCodes

As posted, this won't even parse.
trevor1940 (Author) commented:
Apologies for the errors in the XML; they were caused by fat fingers while retyping from a closed system to an internet PC. All of the above worked.
One last question, given this:

foreach my $Etitle ($xpc->findnodes('//dft:feed/dft:entry/dft:title')) {
    print "Etitle $Etitle\n";  # prints <title type="text">fileName.pdf</title>
    my $EtitleVal = $Etitle->findvalue('./title');  ## Fails
    if($Etilte =~ m/jpg/){
      print "Image $EtitleVal\n";  # prints Image
     } 
    elsif($Etilte =~ m/pdf/){
      print "PDF $EtitleVal\n";  # prints <title type="text">fileName.pdf</title>
     } 
    
}


First, why is this printing the whole tag, e.g. "<title type="text">fileName.pdf</title>"?
Second, how do I tell the difference between the two?
Gertone (Geert Bormans), Information Architect, commented:
Caveat: I am not a Perl programmer at all, so there might be some rough edges to my suggestions.

findnodes returns the actual node, and it seems that printing a node serializes it, attributes included. That is why you see the entire XML snippet rather than just "fileName.pdf".

I believe ->textContent is the solution (->data only exists on text-like nodes, not on elements):
print $Etitle->textContent

I am not sure what you want to do here:
my $EtitleVal = $Etitle->findvalue('./title');
$Etitle already is the title element, so './title' looks for a title child that does not exist. I guess you need
my $EtitleVal = $Etitle->textContent;

You realize that the variable name is wrong in the test?
if($Etilte =~ m/jpg/){
$Etilte instead of $Etitle?
But I guess that is a copy-paste error too.
trevor1940 (Author) commented:
Yes to the copy/paste errors.

So are you saying that getting the text into $EtitleVal and then pattern matching it against m/jpg/ is the way to do this, rather than testing for <title type="text">?
Gertone (Geert Bormans), Information Architect, commented:
You are doing a regex on a node; you should do it on its textual data, yes
(I guess; as I said, I am not a Perl programmer).

Testing the attribute is something you should do in the XPath.
If I needed both sets, I think I would do something like this:
$xpc->findnodes("//dft:feed/dft:entry/dft:title[\@type='text']")
for the text-typed nodes
and
$xpc->findnodes("//dft:feed/dft:entry/dft:title[not(\@type='text')]")
for the others.
Then you don't need any extra testing afterwards.
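As a rough sketch of the two predicates in practice, the snippet below splits the entry titles by their type attribute and reads the text with textContent. The inline XML holds just the two entries from the posted feed; the array names @text_titles and @other_titles are made up for the example.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# The two entries from the posted feed, as a stand-in for georss.xml.
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title type="text">fileName.pdf</title></entry>
  <entry><title>fileName.jpg</title></entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

# Titles that carry type="text"
my @text_titles = map { $_->textContent }
    $xpc->findnodes(q{/dft:feed/dft:entry/dft:title[@type='text']});

# Titles without that attribute value (including titles with no type at all)
my @other_titles = map { $_->textContent }
    $xpc->findnodes(q{/dft:feed/dft:entry/dft:title[not(@type='text')]});

print "text-typed: @text_titles\n";
print "others: @other_titles\n";
```

Letting the XPath do the selection means the Perl side never has to pattern match on serialized markup.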
trevor1940 (Author) commented:
Yes, I was doing a regex. I like your way better, thank you.

It seems that within a foreach loop I need to do something like this:

foreach my $Etitle ($xpc->findnodes('//dft:feed/dft:entry/dft:title')) {
         my $EtitleVal = $Etitle->textContent;
}


whereas if you go after a single entry you don't:

  my $Etitle = $xpc->findnodes('//dft:feed/dft:entry/dft:title');
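The difference seems to come from calling context rather than from the number of matches. As far as I understand XML::LibXML's overloading, findnodes in list context (as in a foreach) yields node objects, which stringify to their serialized XML, while in scalar context it yields an XML::LibXML::NodeList, which stringifies to the text of the nodes it holds. A small sketch with a single entry as test data:

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# A single entry, as a stand-in for georss.xml.
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title type="text">fileName.pdf</title></entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

# List context: a list of node objects (here the first one is kept).
my ($node) = $xpc->findnodes('/dft:feed/dft:entry/dft:title');

# Scalar context: one XML::LibXML::NodeList object wrapping the matches.
my $list = $xpc->findnodes('/dft:feed/dft:entry/dft:title');

my $as_xml  = "$node";             # node in string context: serialized markup
my $as_text = "$list";             # NodeList in string context: text values
my $text    = $node->textContent;  # explicit and independent of overloading

print "node: $as_xml\n";
print "list: $as_text\n";
print "text: $text\n";
```

Calling textContent (or findvalue) explicitly avoids relying on either stringification.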


Given
  <entry>
    <title>fileName.jpg</title>
     <link href="PathTo/fileName.jpg" />
  </entry>


Can I go after <entry> and get its children, in order to keep title and link together?
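Gertone (Geert Bormans), Information Architect, commented:
my ($Entry)  = $xpc->findnodes('//dft:feed/dft:entry');  # list context: first matching entry node
my ($ETitle) = $xpc->findnodes('dft:title', $Entry);     # relative path, evaluated from that entry

should work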
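A sketch of how this could be extended to every entry, keeping each title together with its link. It passes the entry as the context node (second argument) to the XPathContext methods and uses relative paths. The inline XML is a stand-in built from the entries shown in the thread; @pairs is just an illustrative name.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Entries with a title and a link each, as a stand-in for georss.xml.
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title type="text">fileName.pdf</title>
    <link type="application/pdf" href="PathTo/fileName.pdf" />
  </entry>
  <entry>
    <title>fileName.jpg</title>
    <link href="PathTo/fileName.jpg" />
  </entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

my @pairs;
foreach my $entry ($xpc->findnodes('/dft:feed/dft:entry')) {
    # Relative paths are evaluated against the entry given as context node,
    # so title and link always come from the same entry.
    my $title = $xpc->findvalue('dft:title', $entry);
    my $href  = $xpc->findvalue('dft:link/@href', $entry);
    push @pairs, "$title => $href";
    print "$title => $href\n";
}
```

A path starting with // would search the whole document again, regardless of the context node, so the relative form matters here.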
trevor1940 (Author) commented:
Thank you for your continued help.

This seems to be how to get the child nodes:

foreach my $Entry ($xpc->findnodes("//dft:feed/dft:entry")) {

     # relative path: only titles inside the current entry
     foreach my $Images ($xpc->findnodes("./dft:title[not(\@type='text')]", $Entry)) {
         my $ImageVal = $Images->textContent;
          ####  This finds the image titles

     }

}


Given this

  <entry xmlns:georss="http://www.georss.org/georss/10" xsi:schemaLocation ="http://www.url1.net/path/ http://www.url2.net/path/11  http://www.url3.net/path/23" >
    <title>fileName.jpg</title>
     <link href="PathTo/fileName.jpg" />
  </entry>
  <entry>
    <title type="text">fileName.pdf</title>
     <link type="application/pdf"  href="PathTo/fileName.pdf" />
  </entry>


Is there a way of testing whether <entry> carries a namespace declaration or an xsi:schemaLocation attribute? I searched Google but found nothing, possibly because I'm not sure what to search for (e.g. "XPath node has namespace").
trevor1940 (Author) commented:
Thank You
Gertone (Geert Bormans), Information Architect, commented:
You're welcome.

I just noticed that I apparently missed one follow-up question.
I am not sure how to test for namespace nodes.
For a parser, it is only relevant to know which namespace is the default and which prefixes are bound to which namespaces at a specific location, regardless of the level at which the binding is declared.

Note that XPath allows you to look for all namespace nodes:
//namespace::*
That could help you get the namespace nodes on your current node.
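One possible way to do the test on the Perl side instead of in XPath, sketched under some assumptions: getNamespaces on a node returns only the declarations written on that element itself, and hasAttributeNS checks for the schemaLocation attribute by namespace URI. The xsi prefix is assumed to be bound to the standard XMLSchema-instance namespace and is declared on feed here so that the sample parses; @report is an illustrative name.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# ASSUMPTION: xsi is bound to the standard XMLSchema-instance namespace.
my $xsi_ns = 'http://www.w3.org/2001/XMLSchema-instance';

my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <entry xmlns:georss="http://www.georss.org/georss/10"
         xsi:schemaLocation="http://www.url1.net/path/ http://www.url2.net/path/11">
    <title>fileName.jpg</title>
  </entry>
  <entry>
    <title type="text">fileName.pdf</title>
  </entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

my @report;
foreach my $entry ($xpc->findnodes('/dft:feed/dft:entry')) {
    # Prefixes declared directly on this element (not inherited ones).
    my @declared = map { $_->declaredPrefix() // '(default)' }
                   $entry->getNamespaces;
    # Attribute test by namespace URI and local name.
    my $has_loc = $entry->hasAttributeNS($xsi_ns, 'schemaLocation') ? 1 : 0;
    push @report, { declared => \@declared, has_schema_location => $has_loc };
    print "declares: [@declared] schemaLocation: $has_loc\n";
}
```

In pure XPath the attribute test could also be written as a predicate such as dft:entry[@xsi:schemaLocation], provided the xsi prefix is registered on the context.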
Gertone (Geert Bormans), Information Architect, commented:
stackoverflow.com/questions/7388555/xmllibxml-find-and-register-namespaces-used-in-a-document

for inspiration
trevor1940 (Author) commented:
Thanks, that was one of the few links I had found.

I closed this because the task has been pulled; however, I may need to revisit it.

For my own interest:

If you can't test for namespace nodes directly, can you find the child via "[\@type='text']", then get the parent <entry>, then search back down for <link>, thus ensuring <title> and its siblings are dealt with together?
Gertone (Geert Bormans), Information Architect, commented:
Yes, you can do that.
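A minimal sketch of that round trip, assuming the same entry layout as earlier in the thread: select the title by its type attribute, step up with parentNode, then query the link relative to that parent. @found is an illustrative name.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Entries with a title and a link each, as a stand-in for georss.xml.
my $xml = <<'XML';
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>fileName.jpg</title>
    <link href="PathTo/fileName.jpg" />
  </entry>
  <entry>
    <title type="text">fileName.pdf</title>
    <link type="application/pdf" href="PathTo/fileName.pdf" />
  </entry>
</feed>
XML

my $dom = XML::LibXML->load_xml(string => $xml);
my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs(dft => 'http://www.w3.org/2005/Atom');

my @found;
foreach my $title (
    $xpc->findnodes(q{/dft:feed/dft:entry/dft:title[@type='text']})
) {
    my $entry = $title->parentNode;                          # up to <entry>
    my $href  = $xpc->findvalue('dft:link/@href', $entry);   # down to <link>
    push @found, [ $title->textContent, $href ];
    print $title->textContent, " => $href\n";
}
```

The same pairing can be expressed in a single XPath with the parent axis, for example ../dft:link/@href evaluated from the title node.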