Solved

XML parsing with Perl

Posted on 2014-10-16
Last Modified: 2014-10-21
Hello, I have started studying XML parsing and have run into two small problems.
My XML (a collection of bibliographic records) has the following structure:

<?xml version="1.0" encoding="UTF-8"?>

<collection xmlns="http://www.loc.gov/MARC21/slim">
 <!-- FIRST INCREMENTAL -->
 <!-- INSTANCE:sfxudn -->
 <record>
  <leader>-----nas-a2200000z--4500</leader>
  <controlfield tag="008">140922uuuuuuuuuxx-uu-|------u|----|eng-d</controlfield>
  <datafield tag="010" ind1="" ind2="">
   <subfield code="a">01015589</subfield>
  </datafield>
  <datafield tag="245" ind1="" ind2="0">
   <subfield code="a">Publishers weekly</subfield>
  </datafield>
  <datafield tag="260" ind1="" ind2="">
   <subfield code="a">New York, NY</subfield>
   <subfield code="b">Reed Business Information</subfield>
  </datafield>
  <datafield tag="022" ind1="" ind2="">
   <subfield code="a">0000-0019</subfield>
  </datafield>
  <datafield tag="776" ind1="" ind2="">
   <subfield code="x">2150-4008</subfield>
  </datafield>
  <datafield tag="090" ind1="" ind2="">
   <subfield code="a">954921332001</subfield>
  </datafield>
  <datafield tag="866" ind1="" ind2="">
   <subfield code="a">Available from 1997. </subfield>
   <subfield code="s">1000000000001224</subfield>
   <subfield code="t">1000000000000630</subfield>
   <subfield code="x">EBSCOhost Business Source Complete:Full Text</subfield>
   <subfield code="z">1000000000125212</subfield>
  </datafield>
 </record>
 
 ....more records...
  </collection>
 
I'd like to make two changes:

1) Add a single constant line (with constant content) at a precise position in <datafield tag="866" ind1="" ind2="">.
That is, the following
   <datafield tag="866" ind1="" ind2="">
   <subfield code="a">Available from 1997. </subfield>
   <subfield code="s">1000000000001224</subfield>
   <subfield code="t">1000000000000630</subfield>
   <subfield code="x">EBSCOhost Business Source Complete:Full Text</subfield>
   <subfield code="z">1000000000125212</subfield>
  </datafield>
 
  should be transformed to:
 
  <datafield tag="866" ind1="" ind2="">
   <subfield code="a">Available from 1997. </subfield>
   ****add the following line with "code" attribute in alphabetical order, after "a" and before "s"****
    <subfield code="i">DEFAULT</subfield>
   <subfield code="s">1000000000001224</subfield>
   <subfield code="t">1000000000000630</subfield>
   <subfield code="x">EBSCOhost Business Source Complete:Full Text</subfield>
   <subfield code="z">1000000000125212</subfield>
  </datafield>
 
2) Find ALL and ONLY the record titles (the content of /collection/record/datafield[@tag='245']/subfield[@code='a']) for records where:

a) the value of /collection/record/datafield[@tag='866']/subfield[@code='x'] is equal to "Elsevier SD Freedom Collection:Full Text", and
b) /collection/record/datafield[@tag='866']/subfield[@code='a'] is totally absent OR, if present, is empty. For example:
 
  <datafield tag="866" ind1="" ind2="">
   <subfield code="s">1000000000000992</subfield>
   <subfield code="t">1000000000000473</subfield>
   <subfield code="x">Elsevier SD Freedom Collection:Full Text</subfield>
   <subfield code="z">1000000000043233</subfield>
  </datafield>
 
  OR
 
   <datafield tag="866" ind1="" ind2="">
   <subfield code="a"></subfield>
   <subfield code="s">1000000000000992</subfield>
   <subfield code="t">1000000000000473</subfield>
   <subfield code="x">Elsevier SD Freedom Collection:Full Text</subfield>
   <subfield code="z">1000000000043233</subfield>
  </datafield>
 
  thanks a lot for your reply,
 
  fabianope
Question by:fabianope65
6 Comments
 
LVL 26

Expert Comment

by:wilcoxon
ID: 40384251
A few questions...
Why does it matter if the codes are in alphabetic order?  XML doesn't care.
Are you positive that the field could not also be <subfield code="a"/> which is also valid empty XML?  It won't matter to a parser but, if someone tried to do it with regex or something, it would.

Personally, I'd look at either XML::Simple (which will definitely not preserve order) or XML::SAX.  Both will write new files rather than editing the existing one in place.
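
To illustrate the point about the two empty forms: the following is a minimal sketch, assuming XML::LibXML is installed (it is not one of the modules suggested above, and subfield_is_empty is a hypothetical helper name). It shows that a parser reports the same empty text for <subfield code="a"></subfield> and <subfield code="a"/>, which is why only a regex-based approach would need to care about the difference.

```perl
use strict;
use warnings;
use XML::LibXML;

# Return 1 when the first <subfield> in the given XML snippet has no text
# content, 0 otherwise. (Hypothetical helper, for illustration only.)
sub subfield_is_empty {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    my ($sub) = $doc->findnodes('//subfield');
    return $sub->textContent eq '' ? 1 : 0;
}

# Both spellings of an empty element are checked through the same parser.
for my $form ('<subfield code="a"></subfield>', '<subfield code="a"/>') {
    my $empty = subfield_is_empty("<datafield>$form</datafield>");
    printf "%-32s empty: %d\n", $form, $empty;
}
```

The snippets here carry no namespace, so a plain //subfield path is enough; the real file declares the MARC21 namespace and would need a registered prefix.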
 

Author Comment

by:fabianope65
ID: 40384847
Hi again, wilcoxon
Thanks a lot also for the interest on this issue.
here are the answers:
1) The XML is also for my librarian friends, and I'd like to give them "ordered" code (for example, the various tag attributes correspond to precise cataloging items, etc.).
2) You're right about this issue. The only thing I can say is that I've never found <subfield code="a"/> in the files I've processed so far, but you're right, it could happen in the future, so point 2 should be reformulated as follows:
*************************************************************************************************
2) Find ALL and ONLY the record titles (the content of /collection/record/datafield[@tag='245']/subfield[@code='a']) for records where:

a) the value of /collection/record/datafield[@tag='866']/subfield[@code='x'] is equal to "Elsevier SD Freedom Collection:Full Text", and
b) /collection/record/datafield[@tag='866']/subfield[@code='a'] is totally absent OR, if present, is empty, OR

c) it is written as <subfield code="a"/>


thanks a lot,
fabiano
 
LVL 26

Expert Comment

by:wilcoxon
ID: 40387523
Are ind1 and ind2 part of the "key" for datafield or is tag unique on its own?

Maintaining ordering is a lot more difficult so I'm going to ignore that requirement since it does not actually affect the XML.  Are the codes already in alphabetic order?  If not, it's even harder.

This should be the simplest code to accomplish the rest of it.  If this doesn't exactly do what you want (such as changing elements to attributes or vice-versa), it will require playing around with options to XMLin and XMLout.
use strict;
use warnings;
use XML::Simple;
my $file = shift; # get file name from command line
my %opt = (KeepRoot => 1, ForceArray => [qw(datafield subfield record)],
           KeyAttr => { datafield => 'tag', subfield => 'code' },
          );
my $xml = XMLin($file, %opt) or die "could not parse $file: $!";
foreach my $rec (@{$xml->{collection}{record}}) {
    my $data = $rec->{datafield}{866};
    if (not $data->{ind1} and not $data->{ind2}) {
        $data->{subfield}{i} = 'DEFAULT';
    }
    next if ($data->{subfield}{a} or $data->{subfield}{x} ne 'Elsevier SD Freedom Collection:Full Text');
    print 'Title: ', $rec->{datafield}{245}{subfield}{a}, "\n";
}
print XMLout($xml, %opt);


Author Comment

by:fabianope65
ID: 40391731
Hi, there
just launched:
acnp2.pl e-collection.20140922145816.xml-marc >acnp2.txt

(acnp2.pl is your script, e-collection.20140922145816.xml-marc is the input file, and acnp2.txt is the output file)

I obtain the error:
Unrecognised option: ForceArray at C:\Perl64\eg\xml\acnp2.pl line 17.

at line 17 I've the following code:
print XMLout($xml, %opt);

surely my fault... where :=))

Thanks a lot,

Fabiano
 
LVL 26

Accepted Solution

by:
wilcoxon earned 500 total points
ID: 40392177
Oops.  Odd.  KeyAttr requires ForceArray for XMLin but it is not valid for XMLout.  Try this:
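use strict;
use warnings;
use XML::Simple;
my $file = shift; # get file name from command line
# options shared by XMLin and XMLout
my %opt = ( KeepRoot => 1,
            KeyAttr => { datafield => 'tag', subfield => 'code' },
          );
# ForceArray is accepted by XMLin only
my %inopt = ( ForceArray => [qw(datafield subfield record)] );
my $xml = XMLin($file, %opt, %inopt) or die "could not parse $file: $!";
foreach my $rec (@{$xml->{collection}{record}}) {
    my $data = $rec->{datafield}{866} or next; # skip records without an 866 field
    my $sub  = $data->{subfield} ||= {};
    if (not $data->{ind1} and not $data->{ind2}) {
        # folded subfields are hashes; the element text lives under 'content'
        $sub->{i} = { content => 'DEFAULT' };
    }
    # read the text without autovivifying missing subfields
    my $avail  = $sub->{a} ? $sub->{a}{content} : undef;
    my $source = ($sub->{x} && $sub->{x}{content}) // '';
    next if ((defined $avail and length $avail) or $source ne 'Elsevier SD Freedom Collection:Full Text');
    print 'Title: ', $rec->{datafield}{245}{subfield}{a}{content}, "\n";
}
print XMLout($xml, %opt);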
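
For readers who do need the subfields to stay in document order, here is a rough sketch of the same two tasks using XML::LibXML instead of XML::Simple. This is not the approach used in this thread. It assumes XML::LibXML is installed and that each 866 field carries at most one subfield per code. The helper names (marc_xpc, add_default_subfield, matching_titles), the "m" prefix, and the sample record with its "Hypothetical journal" title are all made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

my $MARC_NS = 'http://www.loc.gov/MARC21/slim';

# XPath 1.0 has no default namespace, so bind the MARC namespace to a
# prefix ("m" is an arbitrary choice).
sub marc_xpc {
    my ($doc) = @_;
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs('m', $MARC_NS);
    return $xpc;
}

# Add <subfield code="i">DEFAULT</subfield> to each 866 field, placed
# before the first existing subfield whose code sorts after "i".
sub add_default_subfield {
    my ($doc) = @_;
    my $xpc = marc_xpc($doc);
    for my $field ($xpc->findnodes('//m:datafield[@tag="866"]')) {
        my @subs = $xpc->findnodes('m:subfield', $field);
        next if grep { ($_->getAttribute('code') // '') eq 'i' } @subs;
        my $new = $doc->createElementNS($MARC_NS, 'subfield');
        $new->setAttribute(code => 'i');
        $new->appendTextNode('DEFAULT');
        my ($next) = grep { ($_->getAttribute('code') // '') gt 'i' } @subs;
        $field->insertBefore($new, $next);   # appends when $next is undef
    }
    return $doc;
}

# Titles of records having an 866 field whose code="x" text equals $wanted
# and whose code="a" subfield is absent, empty, or self-closed.
sub matching_titles {
    my ($doc, $wanted) = @_;
    my $xpc = marc_xpc($doc);
    my @titles;
    for my $rec ($xpc->findnodes('//m:record')) {
        for my $field ($xpc->findnodes('m:datafield[@tag="866"]', $rec)) {
            next unless $xpc->findvalue('m:subfield[@code="x"]', $field) eq $wanted;
            # all three "no value" cases yield an empty string here
            next if $xpc->findvalue('m:subfield[@code="a"]', $field) =~ /\S/;
            push @titles,
                $xpc->findvalue('m:datafield[@tag="245"]/m:subfield[@code="a"]', $rec);
            last;   # report each record at most once
        }
    }
    return @titles;
}

# Made-up sample record, shaped like the ones in the question.
my $sample = <<'XML';
<collection xmlns="http://www.loc.gov/MARC21/slim">
 <record>
  <datafield tag="245" ind1="" ind2="0">
   <subfield code="a">Hypothetical journal</subfield>
  </datafield>
  <datafield tag="866" ind1="" ind2="">
   <subfield code="a"/>
   <subfield code="s">1000000000000992</subfield>
   <subfield code="x">Elsevier SD Freedom Collection:Full Text</subfield>
  </datafield>
 </record>
</collection>
XML

my $doc = XML::LibXML->load_xml(string => $sample);
print "Title: $_\n" for matching_titles($doc, 'Elsevier SD Freedom Collection:Full Text');
add_default_subfield($doc);
print $doc->toString;
```

Because the tree is edited in place, the other datafields and subfields keep their original order. The new element is not re-indented, so it ends up on the same line as the subfield that follows it; the XML is still valid. To process a real file, load_xml(location => $file) could replace the inline sample.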

 

Author Closing Comment

by:fabianope65
ID: 40394212
Hi there
tested & all OK!! :=))
Thanks a lot,
Fabiano