PatrickLawrence

Unique records in a text file

Hello

I have a 30MB text file (a printer spool file) that has a lot of duplicate information. It basically has the form:

B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx

B/M NUMBER: *PAU 00002
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx

B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx

. . .

for many different B/M numbers. Someone suggested Perl could help sort this out and gave me the following three lines:

$/ = "" ;
while (<> ) { $Bills{$_}++ };
foreach $Bill (sort keys %Bills) { print $Bill };

I haven't yet figured out how everything in the code works, but it does indeed sort the file very quickly and remove duplicates. However, I'm still not getting the unique occurrences of the B/Ms themselves. In the case where a page break splits a B/M, there is another header inserted and I have:

B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx

B/M NUMBER: *PAU 00001
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx
00040 xxxxxxxxxxxx

These B/Ms need to be concatenated somehow and then the duplicates eliminated.

Any suggestions either in Perl or something else?

Patrick
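The three-line snippet in the question relies on Perl's paragraph mode: setting $/ to the empty string makes each read return one whole blank-line-delimited record, and using the record text as a hash key collapses identical records. A minimal sketch of the same idea follows, wrapped in a sub and fed from an in-memory string so it can be run without the spool file. The sub name unique_paragraphs and the sample data are made up for illustration. It also shows why a B/M split by a page break is not merged: the two halves are different strings, so they land on different keys.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the distinct blank-line-separated
# records found in $text, in sorted order.
sub unique_paragraphs {
    my ($text) = @_;

    # Read the string through an in-memory filehandle.
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";

    local $/ = "";    # paragraph mode: one read returns one whole record

    my %seen;
    while (my $record = <$fh>) {
        $seen{$record}++;    # identical records share a single hash key
    }
    close $fh;

    my @unique = sort keys %seen;
    return @unique;
}

# Made-up sample: the 00002 record appears twice.
my $sample = "B/M NUMBER: *PAU 00002\n00010 xxxx\n\n"
           . "B/M NUMBER: *PAU 00001\n00010 xxxx\n\n"
           . "B/M NUMBER: *PAU 00002\n00010 xxxx\n\n";

print for unique_paragraphs($sample);    # each distinct record printed once
```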
shlomoy

let me see if I get you:

suppose your input is like this:

               B/M NUMBER: *PAU 00001
               00010 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00001
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00003
               00010 xxxxxxxxxxxx
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00001
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

you want the output to be

               B/M NUMBER: *PAU 00001
               00010 xxxxxxxxxxxx
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00003
               00010 xxxxxxxxxxxx
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx


is that right?
PatrickLawrence

ASKER

That is correct. Thanks for the interest shlomoy.
Sounds fairly easy.
Let me write a script to do that for you...
I'm glad it sounds easy to somebody! I wrote something in VBA to use Word methods but it takes days to plow through all the data. Thanks for the help!
Sure.
It might take me some time to post it - because I'm doing something else right now. So don't go away.. :=)
#!/usr/bin/perl
use strict;
use warnings;

my %blocks;          # header line => { detail line => times seen }
my $current_block;   # header of the block currently being read

while (my $line = <>) {
    chomp $line;
    $line =~ s/^\s+//;    # strip whitespace from beginning of line
    $line =~ s/\s+$//;    # strip whitespace from end of line
    next if $line eq '';  # skip blank lines

    if ($line =~ m{^B/M NUMBER:.*?\d+$}) {
        # Header line: start a new block, or resume one seen earlier
        $current_block = $line;
        $blocks{$current_block} ||= {};
    } elsif (defined $current_block && $line =~ /^\d{5}\s/) {
        # Detail line: identical lines collapse onto the same hash key
        $blocks{$current_block}{$line}++;
    }
    # Anything else is ignored
}

# Print each header once, followed by its unique detail lines, sorted
foreach my $header (sort keys %blocks) {
    print $header, "\n";
    foreach my $detail (sort keys %{ $blocks{$header} }) {
        print $detail, "\n";
    }
    print "\n";
}

If you save the above program in a file named 'ur.pl' and make it executable with chmod +x ur.pl, and your input text is in a file called "input.txt", you can see how it works:

cat input.txt | ./ur.pl > output.txt

and then you can see the results in output.txt


Let me know if something is not understood, or not working to your satisfaction.
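To check the behaviour on a small case before running the full 30MB file, the same merge logic can be wrapped in a sub and run against an in-memory string. This is only a sketch: the sub name merge_blocks and the shortened sample data are invented here, and it assumes headers start with 'B/M NUMBER:' and detail lines start with five digits, as in the examples earlier in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: merge blocks that share a header and drop
# duplicate detail lines.
# Returns a hash ref of the form: header => { detail line => count }.
sub merge_blocks {
    my ($text) = @_;
    my (%blocks, $current);

    for my $line (split /\n/, $text) {
        $line =~ s/^\s+|\s+$//g;    # trim both ends
        next if $line eq '';        # skip blank lines

        if ($line =~ m{^B/M NUMBER:}) {
            $current = $line;
            $blocks{$current} ||= {};
        } elsif (defined $current && $line =~ /^\d{5}\s/) {
            $blocks{$current}{$line}++;
        }
    }
    return \%blocks;
}

# Made-up sample: B/M 00001 is split by a page break and then repeated.
my $sample = <<'END';
B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx

B/M NUMBER: *PAU 00001
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx

B/M NUMBER: *PAU 00003
00010 xxxxxxxxxxxx

B/M NUMBER: *PAU 00001
00020 xxxxxxxxxxxx
END

my $merged = merge_blocks($sample);
for my $header (sort keys %$merged) {
    print "$header\n";
    print "$_\n" for sort keys %{ $merged->{$header} };
    print "\n";
}
```

With this sample, B/M 00001 should come out once with lines 00010, 00020 and 00030, followed by B/M 00003 with its single line.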
It works like a charm!
One last question:

How and where can I remove the 'B/M NUMBER: ' prefix?

I tried playing around with the code and royally messed things up.

Thanks again.
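One possible way to drop the prefix is a single substitution applied to the header just before it is printed, so the hash keys used for grouping stay untouched. The following is a sketch under that assumption, not the accepted answer's code; the helper name strip_bm_prefix is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: remove the literal 'B/M NUMBER: ' prefix
# (and any extra spaces after the colon) from a header line.
sub strip_bm_prefix {
    my ($header) = @_;
    (my $stripped = $header) =~ s{^B/M NUMBER:\s*}{};
    return $stripped;
}

print strip_bm_prefix('B/M NUMBER: *PAU 00001'), "\n";   # prints "*PAU 00001"
```

In the script above, the natural place to call such a helper would be the output loop, on the header that is printed at the top of each block. Changing the header inside the reading loop instead would also work, as long as it happens after the line has been recognised as a header.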
ASKER CERTIFIED SOLUTION
shlomoy

Thanks for all the help and teaching me a little more about Perl.