Unique records in a text file

Hello

I have a 30MB text file (a printer spool file) that has a lot of duplicate information. It basically has the form:

B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx

B/M NUMBER: *PAU 00002
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx

B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx

. . .

for many different B/M numbers. Someone suggested Perl could help sort this out and gave me the following three lines:

$/ = "" ;
while (<> ) { $Bills{$_}++ };
foreach $Bill (sort keys %Bills) { print $Bill };

I haven't yet figured out how everything in the code works, but it does indeed sort the file very quickly and remove duplicate blocks. However, I'm still not getting the unique occurrences of the B/Ms themselves. Where a page break splits a B/M, another header is inserted, so I have:

B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx

B/M NUMBER: *PAU 00001
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx
00040 xxxxxxxxxxxx

These B/Ms need to be concatenated somehow and then the duplicates eliminated.

Any suggestions either in Perl or something else?

Patrick
PatrickLawrenceAsked:
 
shlomoyCommented:
Would you like to keep the PAU and the rest of the header line, or would you prefer just the number?

Here is the code in case you only want to get rid of the "B/M NUMBER: " prefix:


#!/usr/bin/perl -w
use strict;

my %blocks;          # header text => { detail line => times seen }
my $current_block;   # header of the block currently being read

while (my $line = <>) {
        chomp($line);
        next if $line =~ m/^\s*$/;      # skip blank lines
        $line =~ s/^\s+//;              # strip leading whitespace
        $line =~ s/\s+$//;              # strip trailing whitespace
        if ($line =~ m/^B\/M NUMBER:.*?\d+$/) {
                $current_block = $line;
                $current_block =~ s/^B\/M NUMBER:\s*//;  # drop the prefix, keep e.g. "*PAU 00001"
                $blocks{$current_block} ||= {};          # first time this B/M is seen
        } elsif (defined $current_block and $line =~ m/^\d{5}\s/) {
                $blocks{$current_block}{$line}++;        # repeated lines collapse into one key
        }
}

foreach my $block (sort keys %blocks) {
        print $block, "\n";
        foreach my $detail (sort keys %{ $blocks{$block} }) {
                print $detail, "\n";
        }
        print "\n";
}
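
If you would rather keep just the number, here is a rough sketch along the same lines. It assumes every header ends in the digits of the B/M number, so two headers that differ only in the text before the number (something other than *PAU, say) would be merged into one block. The names bm_key and merge_by_number and the small in-memory sample are made up for illustration, and I have not run this against a real spool file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the trailing number of a header line,
# or undef if the line is not a B/M header.
sub bm_key {
    my ($line) = @_;
    return $line =~ m/^\s*B\/M NUMBER:.*?(\d+)\s*$/ ? $1 : undef;
}

# Collect detail lines under the numeric key; hash keys absorb duplicates.
sub merge_by_number {
    my (@lines) = @_;
    my (%blocks, $current);
    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;           # trim both ends
        next if $line eq '';               # skip blank lines
        my $key = bm_key($line);
        if (defined $key) {
            $current = $key;
            $blocks{$current} ||= {};
        } elsif (defined $current and $line =~ m/^\d{5}\s/) {
            $blocks{$current}{$line}++;
        }
    }
    return \%blocks;
}

# Made-up sample in the shape shown in the question (a B/M split by a page break).
my @sample = (
    'B/M NUMBER: *PAU 00001', '00010 xxxxxxxxxxxx', '',
    'B/M NUMBER: *PAU 00001', '00020 xxxxxxxxxxxx', '00010 xxxxxxxxxxxx',
);
my $blocks = merge_by_number(@sample);
for my $num (sort keys %$blocks) {
    print "$num\n";
    print "$_\n" for sort keys %{ $blocks->{$num} };
    print "\n";
}
```

To use it on a real file you would feed the lines read from <> to merge_by_number instead of the sample list.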
 
shlomoyCommented:
Let me see if I understand you.

Suppose your input is like this:

               B/M NUMBER: *PAU 00001
               00010 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00001
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00003
               00010 xxxxxxxxxxxx
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00001
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

you want the output to be

               B/M NUMBER: *PAU 00001
               00010 xxxxxxxxxxxx
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00003
               00010 xxxxxxxxxxxx
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx


Is that right?
 
PatrickLawrenceAuthor Commented:
That is correct. Thanks for the interest shlomoy.
 
shlomoyCommented:
Sounds fairly easy.
Let me write a script to do that for you...
 
PatrickLawrenceAuthor Commented:
I'm glad it sounds easy to somebody! I wrote something in VBA to use Word methods but it takes days to plow through all the data. Thanks for the help!
 
shlomoyCommented:
Sure.
It might take me some time to post it - because I'm doing something else right now. So don't go away.. :=)
 
shlomoyCommented:
#!/usr/bin/perl -w
use strict;

my %blocks;          # header line => { detail line => times seen }
my $current_block;   # header of the block currently being read

while (my $line = <>) {
        chomp($line);
        next if $line =~ m/^\s*$/;      # skip blank lines
        $line =~ s/^\s+//;              # strip leading whitespace
        $line =~ s/\s+$//;              # strip trailing whitespace
        if ($line =~ m/^B\/M NUMBER:.*?\d+$/) {
                $current_block = $line;
                $blocks{$current_block} ||= {};     # first time this header is seen
        } elsif (defined $current_block and $line =~ m/^\d{5}\s/) {
                $blocks{$current_block}{$line}++;   # repeated lines collapse into one key
        }
}

foreach my $block (sort keys %blocks) {
        print $block, "\n";
        foreach my $detail (sort keys %{ $blocks{$block} }) {
                print $detail, "\n";
        }
        print "\n";
}
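
In case the nested hash is the unclear part: the script keeps one inner hash per header, keyed by the detail line, and Perl creates that inner hash the first time it is used. Because a hash can hold each key only once, a line that shows up again after a page break or in a repeated block just bumps a counter instead of being stored twice. A minimal sketch with made-up keys:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %blocks;

# The inner hash for '*PAU 00001' springs into existence on first use.
$blocks{'*PAU 00001'}{'00010 xxxxxxxxxxxx'}++;
$blocks{'*PAU 00001'}{'00010 xxxxxxxxxxxx'}++;   # same line again: the count goes up, no new key
$blocks{'*PAU 00001'}{'00020 xxxxxxxxxxxx'}++;

# Count the distinct detail lines stored under this header.
my $distinct = keys %{ $blocks{'*PAU 00001'} };
print "$distinct distinct lines\n";
```

The counts themselves are never printed by the script; only the keys matter for removing duplicates.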
 
shlomoyCommented:
If you save the above program (leaving out any line numbers) in a file named 'ur.pl', make it executable with chmod +x ur.pl, and have your input text in a file called "input.txt", you can see how it works:

cat input.txt | ./ur.pl > output.txt

and then you can see the results in output.txt


Let me know if anything is unclear or not working to your satisfaction.
 
PatrickLawrenceAuthor Commented:
It works like a charm!
One last question:

How and where can I remove the 'B/M NUMBER: ' prefix?

I tried playing around with the code and royally messed things up.

Thanks again.
 
PatrickLawrenceAuthor Commented:
Thanks for all the help and teaching me a little more about Perl.
Question has a verified solution.
