hadrons
asked on
Deleting n number of instances of a pattern after x number of them
I have a file that contains data in this form (this is a shortened version):
<product>
<isbn>0000000001</isbn>
<contributor>
<role>A01</role>
<author>Lee, Stan</author>
</contributor>
<contributor>
<role>A01</role>
<author>Steranko, Jim</author>
</contributor>
<contributor>
<role>A01</role>
<author>Adams, Neal</author>
</contributor>
<contributor>
<role>A01</role>
<author>Smith, Barry</author>
</contributor>
</product>
The problem is that sometimes the number of author composites can be as high as 50, if not more, and I only need the first 5. Is there a regular expression (since I'm using Perl) to keep a set number of <contributor> ... </contributor> blocks (say 5) and delete the rest?
ASKER
Hi, the regular expression worked great when I used it in the form Ozo provided (I did include the modification provided by bigdogman), but when I integrated it into a larger script I have (see below), the substitutions weren't made.
#!/usr/bin/perl
use strict;
use Encode qw(encode decode);
use File::Copy;
## start process 5
system ("echo processing part 5 of 10: Editing of parsed files \.\.\.");
my @files = glob("Edit19");
foreach my $file(@files) {
system ("echo currently processing file: $file");
open FILE, '<:encoding(UTF-8)', $file or warn "Can't open $file: $!";
open PARSED, '>:encoding(UTF-8)', ($file . "edited.txt") or warn "Cannot open file for write: $!";
while (<FILE>) {
s{(<product>((?!</product>).)*?(<contributor>.*?</contributor>\s*){1,5})((?!</product>).)*}{$1}sg;
print PARSED;
}
}
close FILE;
close PARSED;
My apologies, I gave you the wrong group #. Try this one:
s{<product>(?!<\/product>).*?((<contributor>.*?<\/contributor>\s*){1,5})(?!<\/product>).*}{$1}sig;
ASKER
I worked with some files and the pattern match did the substitutions perfectly; there were some problems with the newlines, but that could be due to the engine I'm using.
$/="";
while (<FILE>) {
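For anyone following along, here is a minimal sketch of why the `$/` change matters. With the default record separator, each read returns a single line, so a pattern that spans several lines never sees a whole block at once. Setting `$/ = ""` switches Perl to paragraph mode, where each read returns everything up to the next blank line. The sample data, the in-memory filehandle and the `count_matches` helper are made up for illustration, and the sketch assumes the records in the real file are separated by blank lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: two multi-line records separated by a blank line.
my $data = "<a>\n<b>1</b>\n</a>\n\n<a>\n<b>2</b>\n</a>\n";

# Count how many reads contain a complete <a>...</a> block
# when the input is split on the given record separator.
sub count_matches {
    my ($separator) = @_;
    local $/ = $separator;    # keep the change local to this sub
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    my $count = 0;
    while (my $chunk = <$fh>) {
        $count++ if $chunk =~ m{<a>.*?</a>}s;
    }
    close $fh;
    return $count;
}

my $line_mode      = count_matches("\n");   # default: one line per read
my $paragraph_mode = count_matches("");     # paragraph mode: one record per read
print "line mode: $line_mode, paragraph mode: $paragraph_mode\n";
```

If the records are not separated by blank lines, paragraph mode will not split them the way this sketch assumes, and reading the whole file into one string may be the simpler option.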
ASKER
Excellent
ASKER
It wasn't shown in the example I gave, but the regular expression is also deleting any additional data that sits between the first 5 contributors and the closing </product>, so for example:
<product>
<isbn>0000000001</isbn>
<contributor>
<role>A01</role>
<author>Lee, Stan</author>
</contributor>
<contributor>
<role>A01</role>
<author>Steranko, Jim</author>
</contributor>
<contributor>
<role>A01</role>
<author>Adams, Neal</author>
</contributor>
<contributor>
<role>A01</role>
<author>Smith, Barry</author>
</contributor>
<additional>
<additional data>1</additional data>
</additional>
</product>
Would leave just this:
<product>
<isbn>0000000001</isbn>
<contributor>
<role>A01</role>
<author>Lee, Stan</author>
</contributor>
<contributor>
<role>A01</role>
<author>Steranko, Jim</author>
</contributor>
<contributor>
<role>A01</role>
<author>Adams, Neal</author>
</contributor>
<contributor>
<role>A01</role>
<author>Smith, Barry</author>
</contributor>
</product>
Whatever came after the contributors is deleted as well; is there a way to correct this?
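One possible way around this, offered as a rough sketch rather than a tested fix for the real data: instead of matching everything up to </product> and throwing it away, count the <contributor> blocks inside each product and delete only the ones past the limit. Everything else in the record is left where it is. The helper names `trim_product` and `trim_contributors` and the sample record are invented for illustration, and the sketch assumes each complete <product> ... </product> record is available in a single string (for example via paragraph mode or by reading the whole file at once).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep the first $keep <contributor> blocks of one product; drop later ones.
sub trim_product {
    my ($product, $keep) = @_;
    my $seen = 0;
    # Each contributor block (plus trailing whitespace) is kept or replaced
    # with an empty string depending on how many have been seen so far.
    $product =~ s{(<contributor>.*?</contributor>\s*)}{ ++$seen <= $keep ? $1 : '' }gse;
    return $product;
}

# Apply trim_product to every <product>...</product> block in the input,
# so the contributor count restarts for each product.
sub trim_contributors {
    my ($xml, $keep) = @_;
    $xml =~ s{(<product>.*?</product>)}{ trim_product($1, $keep) }gse;
    return $xml;
}

# Hypothetical record: three contributors followed by extra data.
my $sample = join '',
    "<product>\n<isbn>0000000001</isbn>\n",
    (map { "<contributor>\n<author>Author $_</author>\n</contributor>\n" } 1 .. 3),
    "<additional>\n<note>1</note>\n</additional>\n</product>\n";

print trim_contributors($sample, 2);
```

This is still regex-based, so it relies on the tags appearing exactly as in the examples (no attributes, no nested <contributor> elements). For messier input, an XML parser such as XML::Twig or XML::LibXML would probably be more robust.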