?
Solved

XML parsing with Perl

Posted on 2014-10-16
6
Medium Priority
?
197 Views
Last Modified: 2014-10-21
hello, I begin to study XML parsing and I'm involved in 2 little problems.
my xml (a collection of bibliographic records)  has the following structure:

<?xml version="1.0" encoding="UTF-8"?>

<collection xmlns="http://www.loc.gov/MARC21/slim">
 <!-- FIRST INCREMENTAL -->
 <!-- INSTANCE:sfxudn -->
 <record>
  <leader>-----nas-a2200000z--4500</leader>
  <controlfield tag="008">140922uuuuuuuuuxx-uu-|------u|----|eng-d</controlfield>
  <datafield tag="010" ind1="" ind2="">
   <subfield code="a">01015589</subfield>
  </datafield>
  <datafield tag="245" ind1="" ind2="0">
   <subfield code="a">Publishers weekly</subfield>
  </datafield>
  <datafield tag="260" ind1="" ind2="">
   <subfield code="a">New York, NY</subfield>
   <subfield code="b">Reed Business Information</subfield>
  </datafield>
  <datafield tag="022" ind1="" ind2="">
   <subfield code="a">0000-0019</subfield>
  </datafield>
  <datafield tag="776" ind1="" ind2="">
   <subfield code="x">2150-4008</subfield>
  </datafield>
  <datafield tag="090" ind1="" ind2="">
   <subfield code="a">954921332001</subfield>
  </datafield>
  <datafield tag="866" ind1="" ind2="">
   <subfield code="a">Available from 1997. </subfield>
   <subfield code="s">1000000000001224</subfield>
   <subfield code="t">1000000000000630</subfield>
   <subfield code="x">EBSCOhost Business Source Complete:Full Text</subfield>
   <subfield code="z">1000000000125212</subfield>
  </datafield>
 </record>
 
 ....more records...
  </collection>
 
  and I'd like to make 2 manipulations:
 
  1) adding a single constant line (with constant content) in a precise position  in <datafield tag="866" ind1="" ind2="">".
  in one word, I.E.,  the
   <datafield tag="866" ind1="" ind2="">
   <subfield code="a">Available from 1997. </subfield>
   <subfield code="s">1000000000001224</subfield>
   <subfield code="t">1000000000000630</subfield>
   <subfield code="x">EBSCOhost Business Source Complete:Full Text</subfield>
   <subfield code="z">1000000000125212</subfield>
  </datafield>
 
  should be transformed to:
 
  <datafield tag="866" ind1="" ind2="">
   <subfield code="a">Available from 1997. </subfield>
   ****add the following line with "code" attribute in alphabetical order, after "a" and before "s"****
    <subfield code="i">DEFAULT</subfield>
   <subfield code="s">1000000000001224</subfield>
   <subfield code="t">1000000000000630</subfield>
   <subfield code="x">EBSCOhost Business Source Complete:Full Text</subfield>
   <subfield code="z">10000000value 00125212</subfield>
  </datafield>
 
  2) find ALL and ONLY the records titles (it's the content of /collection/record/datafield tag='245'/subfield code='a') that have:
 
  a) value of "/collection/record/datafield tag='866']/subfield code='x'" equal to "Elsevier SD Freedom Collection:Full Text"
  b) the "/collection/record/datafield @tag='866'/subfield @code='a'" is totally absent,  OR -if present- is empty.I.E.:
 
  <datafield tag="866" ind1="" ind2="">
   <subfield code="s">1000000000000992</subfield>
   <subfield code="t">1000000000000473</subfield>
   <subfield code="x">Elsevier SD Freedom Collection:Full Text</subfield>
   <subfield code="z">1000000000043233</subfield>
  </datafield>
 
  OR
 
   <datafield tag="866" ind1="" ind2="">
   <subfield code="a"></subfield>
   <subfield code="s">1000000000000992</subfield>
   <subfield code="t">1000000000000473</subfield>
   <subfield code="x">Elsevier SD Freedom Collection:Full Text</subfield>
   <subfield code="z">1000000000043233</subfield>
  </datafield>
 
  thanks a lot for your reply,
 
  fabianope
0
Comment
Question by:fabianope65
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 3
  • 3
6 Comments
 
LVL 26

Expert Comment

by:wilcoxon
ID: 40384251
A few questions...
Why does it matter if the codes are in alphabetic order?  XML doesn't care.
Are you positive that the field could not also be <subfield code="a"/> which is also valid empty XML?  It won't matter to a parser but, if someone tried to do it with regex or something, it would.

Personally, I'd look at either XML::Simple (which will definitely not preserve order) or XML::SAX.  Both will write new files rather then editing the existing one in-place.
0
 

Author Comment

by:fabianope65
ID: 40384847
Hi again, wilcoxon
Thanks a lot also for the interest on this issue.
here are the answers:
1) the xml is also for my friends  librarians, and I'd like to give them a "ordered" code (for example, the various tags attribute corresponds to precise cataloging items, etc.).
2) you're right about this issue...the only thing I can say and that...I've never found <subfield code="a"/> in the file to process until now, but you're right...it can happens in the future so the point 2 should be re-formulated as follows:
*************************************************************************************************
2) find ALL and ONLY the records titles (it's the content of /collection/record/datafield tag='245'/subfield code='a') that have:
 
  a) value of "/collection/record/datafield tag='866']/subfield code='x'" equal to "Elsevier SD Freedom Collection:Full Text"
  b) the "/collection/record/datafield @tag='866'/subfield @code='a'" is totally absent,  OR -if present- is empty.

c) a <subfield code="a"/>


thanks a lot,
fabiano
0
 
LVL 26

Expert Comment

by:wilcoxon
ID: 40387523
Are ind1 and ind2 part of the "key" for datafield or is tag unique on its own?

Maintaining ordering is a lot more difficult so I'm going to ignore that requirement since it does not actually affect the XML.  Are the codes already in alphabetic order?  If not, it's even harder.

This should be the simplest code to accomplish the rest of it.  If this doesn't exactly do what you want (such as changing elements to attributes or vice-versa), it will require playing around with options to XMLin and XMLout.
use strict;
use warnings;
use XML::Simple;
my $file = shift; # get file name from command line
my %opt = (KeepRoot => 1, ForceArray => [qw(datafield subfield record)],
           KeyAttr => { datafield => 'tag', subfield => 'code' },
          );
my $xml = XMLin($file, %opt) or die "could not parse $file: $!";
foreach my $rec (@{$xml->{collection}{record}}) {
    my $data = $rec->{datafield}{866};
    if (not $data->{ind1} and not $data->{ind2}) {
        $data->{subfield}{i} = 'DEFAULT';
    }
    next if ($data->{subfield}{a} or $data->{subfield}{x} ne 'Elsevier SD Freedom Collection:Full Text');
    print 'Title: ', $rec->{datafield}{245}{subfield}{a}, "\n";
}
print XMLout($xml, %opt);

Open in new window

0
VIDEO: THE CONCERTO CLOUD FOR HEALTHCARE

Modern healthcare requires a modern cloud. View this brief video to understand how the Concerto Cloud for Healthcare can help your organization.

 

Author Comment

by:fabianope65
ID: 40391731
Hi, there
just launched:
acnp2.pl e-collection.20140922145816.xml-marc >acnp2.txt

(acnp2.pl your script,  e-collection.20140922145816.xml-marc  the input file, acnp2.txt the output file)

I obtain the error:
Unrecognised option: ForceArray at C:\Perl64\eg\xml\acnp2.pl line 17.

at line 17 I've the following code:
print XMLout($xml, %opt);

surely my fault... where :=))

Thanks a lot,

Fabiano
0
 
LVL 26

Accepted Solution

by:
wilcoxon earned 2000 total points
ID: 40392177
Oops.  Odd.  KeyAttr requires ForceArray for XMLin but it is not valid for XMLout.  Try this:
use strict;
use warnings;
use XML::Simple;
my $file = shift; # get file name from command line
my %opt = ( KeepRoot => 1,
            KeyAttr => { datafield => 'tag', subfield => 'code' },
          );
my %inopt = ( ForceArray => [qw(datafield subfield record)] );
my $xml = XMLin($file, %opt, %inopt) or die "could not parse $file: $!";
foreach my $rec (@{$xml->{collection}{record}}) {
    my $data = $rec->{datafield}{866};
    if (not $data->{ind1} and not $data->{ind2}) {
        $data->{subfield}{i} = 'DEFAULT';
    }
    next if ($data->{subfield}{a} or $data->{subfield}{x} ne 'Elsevier SD Freedom Collection:Full Text');
    print 'Title: ', $rec->{datafield}{245}{subfield}{a}, "\n";
}
print XMLout($xml, %opt);

Open in new window

0
 

Author Closing Comment

by:fabianope65
ID: 40394212
Hi there
tested & all OK!! :=))
Thanks a lot,
Fabiano
0

Featured Post

Free Tool: SSL Checker

Scans your site and returns information about your SSL implementation and certificate. Helpful for debugging and validating your SSL configuration.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

On Microsoft Windows, if  when you click or type the name of a .pl file, you get an error "is not recognized as an internal or external command, operable program or batch file", then this means you do not have the .pl file extension associated with …
I was working on a PowerPoint add-in the other day and a client asked me "can you implement a feature which processes a chart when it's pasted into a slide from another deck?". It got me wondering how to hook into built-in ribbon events in Office.
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
Six Sigma Control Plans

770 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question