Solved

XML parsing with Perl

Posted on 2014-10-16
6
176 Views
Last Modified: 2014-10-21
hello, I begin to study XML parsing and I'm involved in 2 little problems.
my xml (a collection of bibliographic records)  has the following structure:

<?xml version="1.0" encoding="UTF-8"?>

<collection xmlns="http://www.loc.gov/MARC21/slim">
 <!-- FIRST INCREMENTAL -->
 <!-- INSTANCE:sfxudn -->
 <record>
  <leader>-----nas-a2200000z--4500</leader>
  <controlfield tag="008">140922uuuuuuuuuxx-uu-|------u|----|eng-d</controlfield>
  <datafield tag="010" ind1="" ind2="">
   <subfield code="a">01015589</subfield>
  </datafield>
  <datafield tag="245" ind1="" ind2="0">
   <subfield code="a">Publishers weekly</subfield>
  </datafield>
  <datafield tag="260" ind1="" ind2="">
   <subfield code="a">New York, NY</subfield>
   <subfield code="b">Reed Business Information</subfield>
  </datafield>
  <datafield tag="022" ind1="" ind2="">
   <subfield code="a">0000-0019</subfield>
  </datafield>
  <datafield tag="776" ind1="" ind2="">
   <subfield code="x">2150-4008</subfield>
  </datafield>
  <datafield tag="090" ind1="" ind2="">
   <subfield code="a">954921332001</subfield>
  </datafield>
  <datafield tag="866" ind1="" ind2="">
   <subfield code="a">Available from 1997. </subfield>
   <subfield code="s">1000000000001224</subfield>
   <subfield code="t">1000000000000630</subfield>
   <subfield code="x">EBSCOhost Business Source Complete:Full Text</subfield>
   <subfield code="z">1000000000125212</subfield>
  </datafield>
 </record>
 
 ....more records...
  </collection>
 
  and I'd like to make 2 manipulations:
 
  1) adding a single constant line (with constant content) in a precise position  in <datafield tag="866" ind1="" ind2="">".
  in one word, I.E.,  the
   <datafield tag="866" ind1="" ind2="">
   <subfield code="a">Available from 1997. </subfield>
   <subfield code="s">1000000000001224</subfield>
   <subfield code="t">1000000000000630</subfield>
   <subfield code="x">EBSCOhost Business Source Complete:Full Text</subfield>
   <subfield code="z">1000000000125212</subfield>
  </datafield>
 
  should be transformed to:
 
  <datafield tag="866" ind1="" ind2="">
   <subfield code="a">Available from 1997. </subfield>
   ****add the following line with "code" attribute in alphabetical order, after "a" and before "s"****
    <subfield code="i">DEFAULT</subfield>
   <subfield code="s">1000000000001224</subfield>
   <subfield code="t">1000000000000630</subfield>
   <subfield code="x">EBSCOhost Business Source Complete:Full Text</subfield>
   <subfield code="z">10000000value 00125212</subfield>
  </datafield>
 
  2) find ALL and ONLY the records titles (it's the content of /collection/record/datafield tag='245'/subfield code='a') that have:
 
  a) value of "/collection/record/datafield tag='866']/subfield code='x'" equal to "Elsevier SD Freedom Collection:Full Text"
  b) the "/collection/record/datafield @tag='866'/subfield @code='a'" is totally absent,  OR -if present- is empty.I.E.:
 
  <datafield tag="866" ind1="" ind2="">
   <subfield code="s">1000000000000992</subfield>
   <subfield code="t">1000000000000473</subfield>
   <subfield code="x">Elsevier SD Freedom Collection:Full Text</subfield>
   <subfield code="z">1000000000043233</subfield>
  </datafield>
 
  OR
 
   <datafield tag="866" ind1="" ind2="">
   <subfield code="a"></subfield>
   <subfield code="s">1000000000000992</subfield>
   <subfield code="t">1000000000000473</subfield>
   <subfield code="x">Elsevier SD Freedom Collection:Full Text</subfield>
   <subfield code="z">1000000000043233</subfield>
  </datafield>
 
  thanks a lot for your reply,
 
  fabianope
0
Comment
Question by:fabianope65
  • 3
  • 3
6 Comments
 
LVL 26

Expert Comment

by:wilcoxon
ID: 40384251
A few questions...
Why does it matter if the codes are in alphabetic order?  XML doesn't care.
Are you positive that the field could not also be <subfield code="a"/> which is also valid empty XML?  It won't matter to a parser but, if someone tried to do it with regex or something, it would.

Personally, I'd look at either XML::Simple (which will definitely not preserve order) or XML::SAX.  Both will write new files rather then editing the existing one in-place.
0
 

Author Comment

by:fabianope65
ID: 40384847
Hi again, wilcoxon
Thanks a lot also for the interest on this issue.
here are the answers:
1) the xml is also for my friends  librarians, and I'd like to give them a "ordered" code (for example, the various tags attribute corresponds to precise cataloging items, etc.).
2) you're right about this issue...the only thing I can say and that...I've never found <subfield code="a"/> in the file to process until now, but you're right...it can happens in the future so the point 2 should be re-formulated as follows:
*************************************************************************************************
2) find ALL and ONLY the records titles (it's the content of /collection/record/datafield tag='245'/subfield code='a') that have:
 
  a) value of "/collection/record/datafield tag='866']/subfield code='x'" equal to "Elsevier SD Freedom Collection:Full Text"
  b) the "/collection/record/datafield @tag='866'/subfield @code='a'" is totally absent,  OR -if present- is empty.

c) a <subfield code="a"/>


thanks a lot,
fabiano
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 40387523
Are ind1 and ind2 part of the "key" for datafield or is tag unique on its own?

Maintaining ordering is a lot more difficult so I'm going to ignore that requirement since it does not actually affect the XML.  Are the codes already in alphabetic order?  If not, it's even harder.

This should be the simplest code to accomplish the rest of it.  If this doesn't exactly do what you want (such as changing elements to attributes or vice-versa), it will require playing around with options to XMLin and XMLout.
use strict;
use warnings;
use XML::Simple;
my $file = shift; # get file name from command line
my %opt = (KeepRoot => 1, ForceArray => [qw(datafield subfield record)],
           KeyAttr => { datafield => 'tag', subfield => 'code' },
          );
my $xml = XMLin($file, %opt) or die "could not parse $file: $!";
foreach my $rec (@{$xml->{collection}{record}}) {
    my $data = $rec->{datafield}{866};
    if (not $data->{ind1} and not $data->{ind2}) {
        $data->{subfield}{i} = 'DEFAULT';
    }
    next if ($data->{subfield}{a} or $data->{subfield}{x} ne 'Elsevier SD Freedom Collection:Full Text');
    print 'Title: ', $rec->{datafield}{245}{subfield}{a}, "\n";
}
print XMLout($xml, %opt);

Open in new window

0
DevOps Toolchain Recommendations

Read this Gartner Research Note and discover how your IT organization can automate and optimize DevOps processes using a toolchain architecture.

 

Author Comment

by:fabianope65
ID: 40391731
Hi, there
just launched:
acnp2.pl e-collection.20140922145816.xml-marc >acnp2.txt

(acnp2.pl your script,  e-collection.20140922145816.xml-marc  the input file, acnp2.txt the output file)

I obtain the error:
Unrecognised option: ForceArray at C:\Perl64\eg\xml\acnp2.pl line 17.

at line 17 I've the following code:
print XMLout($xml, %opt);

surely my fault... where :=))

Thanks a lot,

Fabiano
0
 
LVL 26

Accepted Solution

by:
wilcoxon earned 500 total points
ID: 40392177
Oops.  Odd.  KeyAttr requires ForceArray for XMLin but it is not valid for XMLout.  Try this:
use strict;
use warnings;
use XML::Simple;
my $file = shift; # get file name from command line
my %opt = ( KeepRoot => 1,
            KeyAttr => { datafield => 'tag', subfield => 'code' },
          );
my %inopt = ( ForceArray => [qw(datafield subfield record)] );
my $xml = XMLin($file, %opt, %inopt) or die "could not parse $file: $!";
foreach my $rec (@{$xml->{collection}{record}}) {
    my $data = $rec->{datafield}{866};
    if (not $data->{ind1} and not $data->{ind2}) {
        $data->{subfield}{i} = 'DEFAULT';
    }
    next if ($data->{subfield}{a} or $data->{subfield}{x} ne 'Elsevier SD Freedom Collection:Full Text');
    print 'Title: ', $rec->{datafield}{245}{subfield}{a}, "\n";
}
print XMLout($xml, %opt);

Open in new window

0
 

Author Closing Comment

by:fabianope65
ID: 40394212
Hi there
tested & all OK!! :=))
Thanks a lot,
Fabiano
0

Featured Post

Announcing the Most Valuable Experts of 2016

MVEs are more concerned with the satisfaction of those they help than with the considerable points they can earn. They are the types of people you feel privileged to call colleagues. Join us in honoring this amazing group of Experts.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

On Microsoft Windows, if  when you click or type the name of a .pl file, you get an error "is not recognized as an internal or external command, operable program or batch file", then this means you do not have the .pl file extension associated with …
The Problem How to write an Xquery that works like a SQL outer join, providing placeholders for absent data on the outer side?  I give a bit more background at the end. The situation expressed as relational data Let’s work through this.  I’ve …
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…

825 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question