Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium

x
?
Solved

XML parsing with Perl

Posted on 2014-10-16
6
Medium Priority
?
233 Views
Last Modified: 2014-10-21
hello, I begin to study XML parsing and I'm involved in 2 little problems.
my xml (a collection of bibliographic records)  has the following structure:

<?xml version="1.0" encoding="UTF-8"?>

<collection xmlns="http://www.loc.gov/MARC21/slim">
 <!-- FIRST INCREMENTAL -->
 <!-- INSTANCE:sfxudn -->
 <record>
  <leader>-----nas-a2200000z--4500</leader>
  <controlfield tag="008">140922uuuuuuuuuxx-uu-|------u|----|eng-d</controlfield>
  <datafield tag="010" ind1="" ind2="">
   <subfield code="a">01015589</subfield>
  </datafield>
  <datafield tag="245" ind1="" ind2="0">
   <subfield code="a">Publishers weekly</subfield>
  </datafield>
  <datafield tag="260" ind1="" ind2="">
   <subfield code="a">New York, NY</subfield>
   <subfield code="b">Reed Business Information</subfield>
  </datafield>
  <datafield tag="022" ind1="" ind2="">
   <subfield code="a">0000-0019</subfield>
  </datafield>
  <datafield tag="776" ind1="" ind2="">
   <subfield code="x">2150-4008</subfield>
  </datafield>
  <datafield tag="090" ind1="" ind2="">
   <subfield code="a">954921332001</subfield>
  </datafield>
  <datafield tag="866" ind1="" ind2="">
   <subfield code="a">Available from 1997. </subfield>
   <subfield code="s">1000000000001224</subfield>
   <subfield code="t">1000000000000630</subfield>
   <subfield code="x">EBSCOhost Business Source Complete:Full Text</subfield>
   <subfield code="z">1000000000125212</subfield>
  </datafield>
 </record>
 
 ....more records...
  </collection>
 
  and I'd like to make 2 manipulations:
 
  1) adding a single constant line (with constant content) in a precise position  in <datafield tag="866" ind1="" ind2="">".
  in one word, I.E.,  the
   <datafield tag="866" ind1="" ind2="">
   <subfield code="a">Available from 1997. </subfield>
   <subfield code="s">1000000000001224</subfield>
   <subfield code="t">1000000000000630</subfield>
   <subfield code="x">EBSCOhost Business Source Complete:Full Text</subfield>
   <subfield code="z">1000000000125212</subfield>
  </datafield>
 
  should be transformed to:
 
  <datafield tag="866" ind1="" ind2="">
   <subfield code="a">Available from 1997. </subfield>
   ****add the following line with "code" attribute in alphabetical order, after "a" and before "s"****
    <subfield code="i">DEFAULT</subfield>
   <subfield code="s">1000000000001224</subfield>
   <subfield code="t">1000000000000630</subfield>
   <subfield code="x">EBSCOhost Business Source Complete:Full Text</subfield>
   <subfield code="z">10000000value 00125212</subfield>
  </datafield>
 
  2) find ALL and ONLY the records titles (it's the content of /collection/record/datafield tag='245'/subfield code='a') that have:
 
  a) value of "/collection/record/datafield tag='866']/subfield code='x'" equal to "Elsevier SD Freedom Collection:Full Text"
  b) the "/collection/record/datafield @tag='866'/subfield @code='a'" is totally absent,  OR -if present- is empty.I.E.:
 
  <datafield tag="866" ind1="" ind2="">
   <subfield code="s">1000000000000992</subfield>
   <subfield code="t">1000000000000473</subfield>
   <subfield code="x">Elsevier SD Freedom Collection:Full Text</subfield>
   <subfield code="z">1000000000043233</subfield>
  </datafield>
 
  OR
 
   <datafield tag="866" ind1="" ind2="">
   <subfield code="a"></subfield>
   <subfield code="s">1000000000000992</subfield>
   <subfield code="t">1000000000000473</subfield>
   <subfield code="x">Elsevier SD Freedom Collection:Full Text</subfield>
   <subfield code="z">1000000000043233</subfield>
  </datafield>
 
  thanks a lot for your reply,
 
  fabianope
0
Comment
Question by:fabiano petrone
  • 3
  • 3
6 Comments
 
LVL 27

Expert Comment

by:wilcoxon
ID: 40384251
A few questions...
Why does it matter if the codes are in alphabetic order?  XML doesn't care.
Are you positive that the field could not also be <subfield code="a"/> which is also valid empty XML?  It won't matter to a parser but, if someone tried to do it with regex or something, it would.

Personally, I'd look at either XML::Simple (which will definitely not preserve order) or XML::SAX.  Both will write new files rather then editing the existing one in-place.
0
 

Author Comment

by:fabiano petrone
ID: 40384847
Hi again, wilcoxon
Thanks a lot also for the interest on this issue.
here are the answers:
1) the xml is also for my friends  librarians, and I'd like to give them a "ordered" code (for example, the various tags attribute corresponds to precise cataloging items, etc.).
2) you're right about this issue...the only thing I can say and that...I've never found <subfield code="a"/> in the file to process until now, but you're right...it can happens in the future so the point 2 should be re-formulated as follows:
*************************************************************************************************
2) find ALL and ONLY the records titles (it's the content of /collection/record/datafield tag='245'/subfield code='a') that have:
 
  a) value of "/collection/record/datafield tag='866']/subfield code='x'" equal to "Elsevier SD Freedom Collection:Full Text"
  b) the "/collection/record/datafield @tag='866'/subfield @code='a'" is totally absent,  OR -if present- is empty.

c) a <subfield code="a"/>


thanks a lot,
fabiano
0
 
LVL 27

Expert Comment

by:wilcoxon
ID: 40387523
Are ind1 and ind2 part of the "key" for datafield or is tag unique on its own?

Maintaining ordering is a lot more difficult so I'm going to ignore that requirement since it does not actually affect the XML.  Are the codes already in alphabetic order?  If not, it's even harder.

This should be the simplest code to accomplish the rest of it.  If this doesn't exactly do what you want (such as changing elements to attributes or vice-versa), it will require playing around with options to XMLin and XMLout.
use strict;
use warnings;
use XML::Simple;
my $file = shift; # get file name from command line
my %opt = (KeepRoot => 1, ForceArray => [qw(datafield subfield record)],
           KeyAttr => { datafield => 'tag', subfield => 'code' },
          );
my $xml = XMLin($file, %opt) or die "could not parse $file: $!";
foreach my $rec (@{$xml->{collection}{record}}) {
    my $data = $rec->{datafield}{866};
    if (not $data->{ind1} and not $data->{ind2}) {
        $data->{subfield}{i} = 'DEFAULT';
    }
    next if ($data->{subfield}{a} or $data->{subfield}{x} ne 'Elsevier SD Freedom Collection:Full Text');
    print 'Title: ', $rec->{datafield}{245}{subfield}{a}, "\n";
}
print XMLout($xml, %opt);

Open in new window

0
VIDEO: THE CONCERTO CLOUD FOR HEALTHCARE

Modern healthcare requires a modern cloud. View this brief video to understand how the Concerto Cloud for Healthcare can help your organization.

 

Author Comment

by:fabiano petrone
ID: 40391731
Hi, there
just launched:
acnp2.pl e-collection.20140922145816.xml-marc >acnp2.txt

(acnp2.pl your script,  e-collection.20140922145816.xml-marc  the input file, acnp2.txt the output file)

I obtain the error:
Unrecognised option: ForceArray at C:\Perl64\eg\xml\acnp2.pl line 17.

at line 17 I've the following code:
print XMLout($xml, %opt);

surely my fault... where :=))

Thanks a lot,

Fabiano
0
 
LVL 27

Accepted Solution

by:
wilcoxon earned 2000 total points
ID: 40392177
Oops.  Odd.  KeyAttr requires ForceArray for XMLin but it is not valid for XMLout.  Try this:
use strict;
use warnings;
use XML::Simple;
my $file = shift; # get file name from command line
my %opt = ( KeepRoot => 1,
            KeyAttr => { datafield => 'tag', subfield => 'code' },
          );
my %inopt = ( ForceArray => [qw(datafield subfield record)] );
my $xml = XMLin($file, %opt, %inopt) or die "could not parse $file: $!";
foreach my $rec (@{$xml->{collection}{record}}) {
    my $data = $rec->{datafield}{866};
    if (not $data->{ind1} and not $data->{ind2}) {
        $data->{subfield}{i} = 'DEFAULT';
    }
    next if ($data->{subfield}{a} or $data->{subfield}{x} ne 'Elsevier SD Freedom Collection:Full Text');
    print 'Title: ', $rec->{datafield}{245}{subfield}{a}, "\n";
}
print XMLout($xml, %opt);

Open in new window

0
 

Author Closing Comment

by:fabiano petrone
ID: 40394212
Hi there
tested & all OK!! :=))
Thanks a lot,
Fabiano
0

Featured Post

Free Tool: Path Explorer

An intuitive utility to help find the CSS path to UI elements on a webpage. These paths are used frequently in a variety of front-end development and QA automation tasks.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Introduction In my previous article (http://www.experts-exchange.com/Microsoft/Development/MS-SQL-Server/SSIS/A_9150-Loading-XML-Using-SSIS.html) I showed you how the XML Source component can be used to load XML files into a SQL Server database, us…
Browsing the questions asked to the Experts of this forum, you will be amazed to see how many times people are headaching about monster regular expressions (regex) to select that specific part of some HTML or XML file they want to extract. The examp…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
Six Sigma Control Plans
Suggested Courses

580 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question