Deleting n number of instances of a pattern after x number of them

I have a file that contains data in this form (this is a shortened version):

<product>
<isbn>0000000001</isbn>
<contributor>
<role>A01</role>
<author>Lee, Stan</author>
</contributor>
<contributor>
<role>A01</role>
<author>Steranko, Jim</author>
</contributor>
<contributor>
<role>A01</role>
<author>Adams, Neal</author>
</contributor>
<contributor>
<role>A01</role>
<author>Smith, Barry</author>
</contributor>
</product>

The problem is that the number of author composites can sometimes be as high as 50, if not more, and I only need the first 5. Is there a regular expression (since I'm using Perl) to keep a set number of <contributor> ... </contributor> blocks (say 5) and delete the rest?

ASKER CERTIFIED SOLUTION

ozo

This solution is only available to members.

Derek Jensen

Might I suggest a small modification to the regex above?

s{<product>.*?((<contributor>(.+?)</contributor>){1,5}).*?(?<!</product>)}{$1}sig;

hadrons

ASKER

Hi, the regular expression works great when I use it in the form Ozo provided (I did include the modification provided by bigdogman), but I want to integrate this expression into a larger script I have (see below), and there the substitutions weren't made.

#!/usr/bin/perl

use strict;
use Encode qw(encode decode);
use File::Copy;

## start process 5
system ("echo processing part 5 of 10: Editing of parsed files \.\.\.");
my @files = glob("Edit19");
foreach my $file (@files) {
    system ("echo currently processing file: $file");
    open FILE, '<:encoding(UTF-8)', $file or warn "Can't open $file: $!";
    open PARSED, '>:encoding(UTF-8)', ($file . "edited.txt") or warn "Cannot open file for write: $!";

    while (<FILE>) {
        s{(<product>((?!</product>).)*?(<contributor>.*?</contributor>\s*){1,5})((?!</product>).)*}{$1}sg;
        print PARSED;
    }
}
close FILE;
close PARSED;

Derek Jensen

My apologies, I gave you the wrong group #. Try this one:

s{<product>(?!<\/product>).*?((<contributor>.*?<\/contributor>\s*){1,5})(?!<\/product>).*}{$1}sig;

hadrons

ASKER

I worked with some files and the pattern match made the substitutions perfectly; there were some problems with the newlines, but that could be due to the engine I'm using.

ozo

$/="";
while (<FILE>) {
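
For context on why these two lines matter: by default `while (<FILE>)` hands the substitution one line at a time, so a pattern that has to span from <product> to </product> never sees both ends at once. Setting `$/` to the empty string switches to paragraph mode, where each read returns everything up to the next blank line. The sketch below only illustrates the difference; the sample data and variable names are made up for the demonstration, and it assumes records are separated by blank lines (if they are not, undefining `$/` to read the whole file at once is the usual alternative).

```perl
use strict;
use warnings;

# Hypothetical sample input: two records separated by a blank line.
my $data = "<product>\n<isbn>1</isbn>\n</product>\n\n"
         . "<product>\n<isbn>2</isbn>\n</product>\n";

# Default behaviour: $/ is "\n", so each read returns a single line.
open my $by_line, '<', \$data or die "Cannot open in-memory file: $!";
my @lines = <$by_line>;
close $by_line;

# Paragraph mode: with $/ set to "", each read returns a blank-line-delimited chunk.
my @records;
{
    local $/ = "";    # localized so the change does not leak into other code
    open my $by_para, '<', \$data or die "Cannot open in-memory file: $!";
    @records = <$by_para>;
    close $by_para;
}

printf "%d lines, %d records\n", scalar @lines, scalar @records;
```

Using `local` inside a block keeps the change to `$/` from affecting any other file reads later in a larger script.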

hadrons

ASKER

Excellent

hadrons

ASKER

It wasn't in the example I gave, but the regular expression also deletes any additional data between the first 5 contributors and </product>, so for example:

<product>
<isbn>0000000001</isbn>
<contributor>
<role>A01</role>
<author>Lee, Stan</author>
</contributor>
<contributor>
<role>A01</role>
<author>Steranko, Jim</author>
</contributor>
<contributor>
<role>A01</role>
<author>Adams, Neal</author>
</contributor>
<contributor>
<role>A01</role>
<author>Smith, Barry</author>
</contributor>
<additional>
<additional data>1</additional data>
</additional>
</product>

Would leave just this:

<product>
<isbn>0000000001</isbn>
<contributor>
<role>A01</role>
<author>Lee, Stan</author>
</contributor>
<contributor>
<role>A01</role>
<author>Steranko, Jim</author>
</contributor>
<contributor>
<role>A01</role>
<author>Adams, Neal</author>
</contributor>
<contributor>
<role>A01</role>
<author>Smith, Barry</author>
</contributor>
</product>

Whatever was under the contributors is deleted as well; is there a way to correct this?
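
One possible way around this, offered as a sketch rather than a tested drop-in fix, is to stop deleting "everything after the fifth contributor" and instead count the <contributor> blocks inside each <product>, removing only the ones past the limit. The helper names `keep_first` and `trim_contributors`, the `<note>` element, and the limit of 2 are hypothetical and chosen just to keep the example short. It assumes the whole <product> block is in the string being processed and that <product> and <contributor> elements are never nested inside themselves.

```perl
use strict;
use warnings;

# Keep the first $max <contributor> blocks of one product; drop later ones.
sub keep_first {
    my ($product, $max) = @_;
    my $seen = 0;
    # Each contributor block is replaced by itself while under the limit,
    # and by the empty string once the limit has been reached.
    $product =~ s{(<contributor>.*?</contributor>\s*)}{ ++$seen <= $max ? $1 : '' }gse;
    return $product;
}

# Apply keep_first to every <product> ... </product> block in the text.
sub trim_contributors {
    my ($text, $max) = @_;
    $text =~ s{(<product>.*?</product>)}{ keep_first($1, $max) }gse;
    return $text;
}

# Hypothetical sample: three contributors followed by extra data.
my $sample = <<'XML';
<product>
<isbn>0000000001</isbn>
<contributor>
<author>Lee, Stan</author>
</contributor>
<contributor>
<author>Steranko, Jim</author>
</contributor>
<contributor>
<author>Adams, Neal</author>
</contributor>
<additional>
<note>1</note>
</additional>
</product>
XML

# Keeps the first two contributors and leaves <additional> untouched.
print trim_contributors($sample, 2);
```

Because only the surplus contributor blocks are matched and removed, anything else between the contributors and </product> is left in place. For input that is less regular than this, a real XML parser such as XML::LibXML or XML::Twig would likely be more robust than a regular expression.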