XML parsing with Perl

Hello, I'm starting to learn XML parsing and have run into 2 small problems.
My XML (a collection of bibliographic records) has the following structure:

<?xml version="1.0" encoding="UTF-8"?>

<collection xmlns="http://www.loc.gov/MARC21/slim">
 <!-- FIRST INCREMENTAL -->
 <!-- INSTANCE:sfxudn -->
 <record>
  <leader>-----nas-a2200000z--4500</leader>
  <controlfield tag="008">140922uuuuuuuuuxx-uu-|------u|----|eng-d</controlfield>
  <datafield tag="010" ind1="" ind2="">
   <subfield code="a">01015589</subfield>
  </datafield>
  <datafield tag="245" ind1="" ind2="0">
   <subfield code="a">Publishers weekly</subfield>
  </datafield>
  <datafield tag="260" ind1="" ind2="">
   <subfield code="a">New York, NY</subfield>
   <subfield code="b">Reed Business Information</subfield>
  </datafield>
  <datafield tag="022" ind1="" ind2="">
   <subfield code="a">0000-0019</subfield>
  </datafield>
  <datafield tag="776" ind1="" ind2="">
   <subfield code="x">2150-4008</subfield>
  </datafield>
  <datafield tag="090" ind1="" ind2="">
   <subfield code="a">954921332001</subfield>
  </datafield>
  <datafield tag="866" ind1="" ind2="">
   <subfield code="a">Available from 1997. </subfield>
   <subfield code="s">1000000000001224</subfield>
   <subfield code="t">1000000000000630</subfield>
   <subfield code="x">EBSCOhost Business Source Complete:Full Text</subfield>
   <subfield code="z">1000000000125212</subfield>
  </datafield>
 </record>
 
 ....more records...
  </collection>
 
  I'd like to make 2 changes:
 
  1) Add a single line with constant content at a precise position in <datafield tag="866" ind1="" ind2="">.
  In other words, this:
   <datafield tag="866" ind1="" ind2="">
   <subfield code="a">Available from 1997. </subfield>
   <subfield code="s">1000000000001224</subfield>
   <subfield code="t">1000000000000630</subfield>
   <subfield code="x">EBSCOhost Business Source Complete:Full Text</subfield>
   <subfield code="z">1000000000125212</subfield>
  </datafield>
 
  should be transformed to:
 
  <datafield tag="866" ind1="" ind2="">
   <subfield code="a">Available from 1997. </subfield>
   ****add the following line with "code" attribute in alphabetical order, after "a" and before "s"****
    <subfield code="i">DEFAULT</subfield>
   <subfield code="s">1000000000001224</subfield>
   <subfield code="t">1000000000000630</subfield>
   <subfield code="x">EBSCOhost Business Source Complete:Full Text</subfield>
   <subfield code="z">1000000000125212</subfield>
  </datafield>
 
  2) Find ALL and ONLY the record titles (the content of /collection/record/datafield[@tag='245']/subfield[@code='a']) that have:
 
  a) a value of /collection/record/datafield[@tag='866']/subfield[@code='x'] equal to "Elsevier SD Freedom Collection:Full Text"
  b) a /collection/record/datafield[@tag='866']/subfield[@code='a'] that is totally absent OR, if present, is empty, i.e.:
 
  <datafield tag="866" ind1="" ind2="">
   <subfield code="s">1000000000000992</subfield>
   <subfield code="t">1000000000000473</subfield>
   <subfield code="x">Elsevier SD Freedom Collection:Full Text</subfield>
   <subfield code="z">1000000000043233</subfield>
  </datafield>
 
  OR
 
   <datafield tag="866" ind1="" ind2="">
   <subfield code="a"></subfield>
   <subfield code="s">1000000000000992</subfield>
   <subfield code="t">1000000000000473</subfield>
   <subfield code="x">Elsevier SD Freedom Collection:Full Text</subfield>
   <subfield code="z">1000000000043233</subfield>
  </datafield>
 
  thanks a lot for your reply,
 
  fabianope
fabiano petrone asked:

wilcoxon commented:
A few questions...
Why does it matter if the codes are in alphabetic order?  XML doesn't care.
Are you positive that the field could not also be <subfield code="a"/> which is also valid empty XML?  It won't matter to a parser but, if someone tried to do it with regex or something, it would.

Personally, I'd look at either XML::Simple (which will definitely not preserve order) or XML::SAX.  Both will write new files rather than editing the existing one in-place.
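
To illustrate the parser-versus-regex point, here is a minimal sketch that is not part of the original reply. It uses XML::LibXML (a different module from the two named above) and assumes the record structure shown in the question. The helper name titles_missing_866a, the prefix "marc", and the sample titles are invented for the example, and treating whitespace-only content as empty is an assumption.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns the 245$a titles of records that have an 866
# field whose $x equals $provider and whose $a is absent, empty, or
# self-closed. Whitespace-only content counts as empty (an assumption).
sub titles_missing_866a {
    my ($xml, $provider) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    my $xpc = XML::LibXML::XPathContext->new($doc);
    # The elements live in the MARC21 slim default namespace, so XPath needs
    # a prefix bound to it; "marc" is an arbitrary choice.
    $xpc->registerNs('marc', 'http://www.loc.gov/MARC21/slim');

    my @titles;
    for my $rec ($xpc->findnodes('/marc:collection/marc:record')) {
        for my $field ($xpc->findnodes('marc:datafield[@tag="866"]', $rec)) {
            # Count $x subfields matching the provider and $a subfields with text.
            my $is_provider = grep { $_->textContent eq $provider }
                $xpc->findnodes('marc:subfield[@code="x"]', $field);
            my $has_a_text = grep { $_->textContent =~ /\S/ }
                $xpc->findnodes('marc:subfield[@code="a"]', $field);
            next unless $is_provider && !$has_a_text;
            push @titles, $xpc->findvalue(
                'marc:datafield[@tag="245"]/marc:subfield[@code="a"]', $rec);
            last;    # report each record at most once
        }
    }
    return @titles;
}

# Made-up sample: one self-closed $a, one $a with text.
my $xml = <<'XML';
<collection xmlns="http://www.loc.gov/MARC21/slim">
 <record>
  <datafield tag="245" ind1="" ind2="0"><subfield code="a">Sample journal A</subfield></datafield>
  <datafield tag="866" ind1="" ind2="">
   <subfield code="a"/>
   <subfield code="x">Elsevier SD Freedom Collection:Full Text</subfield>
  </datafield>
 </record>
 <record>
  <datafield tag="245" ind1="" ind2="0"><subfield code="a">Sample journal B</subfield></datafield>
  <datafield tag="866" ind1="" ind2="">
   <subfield code="a">Available from 1997. </subfield>
   <subfield code="x">Elsevier SD Freedom Collection:Full Text</subfield>
  </datafield>
 </record>
</collection>
XML

print "Title: $_\n"
    for titles_missing_866a($xml, 'Elsevier SD Freedom Collection:Full Text');
```

Because the test looks at the parsed text content, <subfield code="a"/> and <subfield code="a"></subfield> are handled identically, which a regex over the raw file would not guarantee.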
fabiano petrone (author) commented:
Hi again, wilcoxon
Thanks a lot for your interest in this issue.
Here are the answers:
1) The XML is also for my librarian friends, and I'd like to give them "ordered" codes (for example, the various tag attributes correspond to specific cataloging items, etc.).
2) You're right about this. All I can say is that I've never found <subfield code="a"/> in the files I've processed so far, but it could happen in the future, so point 2 should be reformulated as follows:
*************************************************************************************************
2) Find ALL and ONLY the record titles (the content of /collection/record/datafield[@tag='245']/subfield[@code='a']) that have:
 
  a) a value of /collection/record/datafield[@tag='866']/subfield[@code='x'] equal to "Elsevier SD Freedom Collection:Full Text"
  b) a /collection/record/datafield[@tag='866']/subfield[@code='a'] that is totally absent OR, if present, is empty,

c) OR is written as a self-closing <subfield code="a"/>


thanks a lot,
fabiano
wilcoxon commented:
Are ind1 and ind2 part of the "key" for datafield or is tag unique on its own?

Maintaining ordering is a lot more difficult so I'm going to ignore that requirement since it does not actually affect the XML.  Are the codes already in alphabetic order?  If not, it's even harder.

This should be the simplest code to accomplish the rest of it.  If this doesn't exactly do what you want (such as changing elements to attributes or vice-versa), it will require playing around with options to XMLin and XMLout.
use strict;
use warnings;
use XML::Simple;
my $file = shift; # get file name from command line
my %opt = (KeepRoot => 1, ForceArray => [qw(datafield subfield record)],
           KeyAttr => { datafield => 'tag', subfield => 'code' },
          );
my $xml = XMLin($file, %opt) or die "could not parse $file: $!";
foreach my $rec (@{$xml->{collection}{record}}) {
    my $data = $rec->{datafield}{866};
    if (not $data->{ind1} and not $data->{ind2}) {
        $data->{subfield}{i} = 'DEFAULT';
    }
    next if ($data->{subfield}{a} or $data->{subfield}{x} ne 'Elsevier SD Freedom Collection:Full Text');
    print 'Title: ', $rec->{datafield}{245}{subfield}{a}, "\n";
}
print XMLout($xml, %opt);

fabiano petrone (author) commented:
Hi there,
I just ran:
acnp2.pl e-collection.20140922145816.xml-marc >acnp2.txt

(acnp2.pl is your script, e-collection.20140922145816.xml-marc is the input file, acnp2.txt is the output file)

I get the error:
Unrecognised option: ForceArray at C:\Perl64\eg\xml\acnp2.pl line 17.

Line 17 contains the following code:
print XMLout($xml, %opt);

Surely my fault... but where? :=))

Thanks a lot,

Fabiano
wilcoxon commented:
Oops.  Odd.  KeyAttr requires ForceArray for XMLin but it is not valid for XMLout.  Try this:
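use strict;
use warnings;
use XML::Simple;
my $file = shift; # get file name from command line
# options shared by XMLin and XMLout
my %opt = ( KeepRoot => 1,
            KeyAttr => { datafield => 'tag', subfield => 'code' },
          );
# ForceArray is only valid for XMLin
my %inopt = ( ForceArray => [qw(datafield subfield record)] );
my $xml = XMLin($file, %opt, %inopt) or die "could not parse $file: $!";
foreach my $rec (@{$xml->{collection}{record}}) {
    my $data = $rec->{datafield}{866} or next; # skip records without an 866 field
    # after KeyAttr folding, each subfield is a hash with its text under 'content'
    my $sub = $data->{subfield} ||= {};
    if (not $data->{ind1} and not $data->{ind2}) {
        $sub->{i} = { content => 'DEFAULT' };
    }
    my $avail  = $sub->{a} ? $sub->{a}{content} : undef;
    my $source = $sub->{x} ? $sub->{x}{content} : undef;
    next if (defined $avail and length $avail);
    next unless defined $source and $source eq 'Elsevier SD Freedom Collection:Full Text';
    print 'Title: ', $rec->{datafield}{245}{subfield}{a}{content}, "\n";
}
print XMLout($xml, %opt);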
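
If the position of the new subfield matters, one alternative is to edit the DOM in place instead of rebuilding the document from hashes. The following is only a sketch, using XML::LibXML rather than XML::Simple. The helper name add_subfield_in_order and the prefix "marc" are invented for the example, and it assumes the existing codes in each 866 field are already in alphabetical order.

```perl
use strict;
use warnings;
use XML::LibXML;

my $MARC_NS = 'http://www.loc.gov/MARC21/slim';

# Hypothetical helper: add <subfield code="$code">$text</subfield> to every
# 866 datafield that lacks that code, placing it before the first existing
# subfield whose code sorts after it (or at the end if there is none).
sub add_subfield_in_order {
    my ($doc, $code, $text) = @_;
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs('marc', $MARC_NS);    # "marc" is an arbitrary prefix

    for my $field ($xpc->findnodes('//marc:datafield[@tag="866"]')) {
        my @subs = $xpc->findnodes('marc:subfield', $field);
        # Leave the field alone if it already has this code.
        next if grep { ($_->getAttribute('code') // '') eq $code } @subs;

        my $new = $doc->createElementNS($MARC_NS, 'subfield');
        $new->setAttribute('code', $code);
        $new->appendTextNode($text);

        my ($next) = grep { ($_->getAttribute('code') // '') gt $code } @subs;
        if ($next) { $field->insertBefore($new, $next) }
        else       { $field->appendChild($new) }
    }
    return $doc;
}

# no_blanks drops whitespace-only text nodes so toString(1) can re-indent.
my $doc = XML::LibXML->load_xml(string => <<'XML', no_blanks => 1);
<collection xmlns="http://www.loc.gov/MARC21/slim">
 <record>
  <datafield tag="866" ind1="" ind2="">
   <subfield code="a">Available from 1997. </subfield>
   <subfield code="s">1000000000001224</subfield>
  </datafield>
 </record>
</collection>
XML

add_subfield_in_order($doc, 'i', 'DEFAULT');
print $doc->toString(1);
```

For a real file, load_xml(location => $file, no_blanks => 1) would replace the inline string. Everything the helper does not touch (datafield order, comments, other attributes) stays where it was in the source.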

fabiano petrone (author) commented:
Hi there
tested & all OK!! :=))
Thanks a lot,
Fabiano
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.