Unique records in a text file

Hello

I have a 30MB text file (a printer spool file) that has a lot of duplicate information. It basically has the form:

B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx

B/M NUMBER: *PAU 00002
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx

B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx

. . .

for many different B/M numbers. Someone suggested Perl could help sort this out and gave me the following three lines:

$/ = "" ;
while (<> ) { $Bills{$_}++ };
foreach $Bill (sort keys %Bills) { print $Bill };
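
For readers unfamiliar with the idiom, here is a hedged, commented sketch of what those three lines appear to do. Setting `$/` to the empty string puts Perl in "paragraph mode", so each read returns one blank-line-separated record, and using the whole record as a hash key collapses identical records. The sub name `unique_paragraphs` and the in-memory sample are illustrative assumptions, not part of the original snippet.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the unique blank-line-separated records from a filehandle, sorted.
sub unique_paragraphs {
    my ($fh) = @_;
    local $/ = "";           # paragraph mode: one read returns a whole record
    my %seen;
    while (my $record = <$fh>) {
        $seen{$record}++;    # identical records land on the same hash key
    }
    return sort keys %seen;
}

# Hypothetical sample: three records, the first and third identical.
my $sample = "B/M NUMBER: *PAU 00001\n00010 aaa\n\n"
           . "B/M NUMBER: *PAU 00002\n00010 bbb\n\n"
           . "B/M NUMBER: *PAU 00001\n00010 aaa\n\n";
open my $fh, '<', \$sample or die "cannot open in-memory sample: $!";
my @unique = unique_paragraphs($fh);
close $fh;
print @unique;
```

Because the comparison is on the entire record text, two pieces of the same B/M that carry different detail lines count as different records, which is why a B/M split by a page break survives as two entries.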

I haven't yet figured out how everything in the code works, but it does indeed sort the file very quickly and remove duplicate blocks. However, I'm still not getting the unique occurrences of the B/Ms themselves. In the case where a page break splits a B/M, there is another header inserted and I have:

B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx

B/M NUMBER: *PAU 00001
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx
00040 xxxxxxxxxxxx

These B/Ms need to be concatenated somehow and then the duplicates eliminated.

Any suggestions either in Perl or something else?

Patrick
 
shlomoy commented:
let me see if I get you:

suppose your input is like this:

               B/M NUMBER: *PAU 00001
               00010 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00001
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00003
               00010 xxxxxxxxxxxx
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00001
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

you want the output to be

               B/M NUMBER: *PAU 00001
               00010 xxxxxxxxxxxx
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00003
               00010 xxxxxxxxxxxx
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx


is that right?
 
PatrickLawrence (author) commented:
That is correct. Thanks for the interest shlomoy.
 
shlomoy commented:
sounds fairly easy.
Let me write a script to do that for you...
 
PatrickLawrence (author) commented:
I'm glad it sounds easy to somebody! I wrote something in VBA to use Word methods but it takes days to plow through all the data. Thanks for the help!
 
shlomoy commented:
Sure.
It might take me some time to post it - because I'm doing something else right now. So don't go away.. :=)
 
shlomoy commented:
#!/usr/bin/perl -w
use strict;
my %blocks = ();        # header line => { detail line => times seen }
my $current_block;      # header of the block currently being read
while (my $line = <>) {
        chomp($line);
        next if $line =~ m/^\s*$/;      ## skip blank lines
        $line =~ s/^\s*//;      ## strip whitespace from beginning of line
        $line =~ s/\s*$//;      ## strip whitespace from end of line
        if ($line =~ m/^B\/M NUMBER:.*?(\d+)$/) {
                $current_block = $line;
                if (not exists $blocks{$current_block}) {
                        $blocks{$current_block} = {};   ## empty hash ref, not ()
                }
        } elsif ($line =~ m/^\d{5}\s/) {
                next unless defined $current_block;     ## detail line before any header
                $blocks{$current_block}{$line} += 1;
        } else {
                next;
        }
}

foreach my $block (sort keys %blocks) {
        print $block, "\n";
        foreach my $detail (sort keys %{$blocks{$block}}) {
                print $detail, "\n";
        }
        print "\n";
}
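
To make the data structure in the script above easier to follow, here is a minimal sketch of the same hash-of-hashes idea run on a small in-memory list instead of a file. The sub name `merge_blocks` and the sample lines are assumptions for illustration only; the lines are taken to be already trimmed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group detail lines under their header and drop duplicates.
# Returns a reference to: header => { detail line => times seen }.
sub merge_blocks {
    my @lines = @_;
    my (%blocks, $current);
    for my $line (@lines) {
        if ($line =~ /^B\/M NUMBER:/) {
            $current = $line;
            $blocks{$current} ||= {};      # keep details from earlier pieces
        } elsif (defined $current && $line =~ /^\d{5}\s/) {
            $blocks{$current}{$line}++;    # a repeated detail reuses its key
        }
    }
    return \%blocks;
}

# Hypothetical input: one B/M split into three pieces, with one repeated line.
my $blocks = merge_blocks(
    'B/M NUMBER: *PAU 00001', '00010 xxxxxxxxxxxx',
    'B/M NUMBER: *PAU 00001', '00020 xxxxxxxxxxxx', '00030 xxxxxxxxxxxx',
    'B/M NUMBER: *PAU 00001', '00020 xxxxxxxxxxxx',
);

for my $header (sort keys %$blocks) {
    print "$header\n";
    print "$_\n" for sort keys %{ $blocks->{$header} };
}
```

The three pieces share one outer key, so the header is printed once, followed by the 00010, 00020 and 00030 lines. The counts stored as values are not used for output; only the keys matter.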
 
shlomoy commented:
if you save the above program (without any line numbers) in a file named 'ur.pl' and make it executable with chmod +x ur.pl, and your input text is in a file called "input.txt", you can see how it works:

cat input.txt | ./ur.pl > output.txt

and then you can see the results in output.txt


Let me know if something is not understood, or not working to your satisfaction.
 
PatrickLawrence (author) commented:
It works like a charm!
One last question:

How and where can I remove the 'B/M NUMBER: ' prefix?

I tried playing around with the code and royally messed things up.

Thanks again.
 
shlomoy commented:
would you like to leave just the PAU and the rest of the line?
Or would you prefer having just the number?

here is the code - in case you want to just get rid of the "B/M NUMBER: " prefix:


#!/usr/bin/perl -w
use strict;
my %blocks = ();        # header text => { detail line => times seen }
my $current_block;      # header of the block currently being read
while (my $line = <>) {
        chomp($line);
        next if $line =~ m/^\s*$/;      ## skip blank lines
        $line =~ s/^\s*//;      ## strip whitespace from beginning of line
        $line =~ s/\s*$//;      ## strip whitespace from end of line
        if ($line =~ m/^B\/M NUMBER:.*?(\d+)$/) {
                $current_block = $line;
                $current_block =~ s/^B\/M NUMBER:\s*//;  ## drop the prefix, keep e.g. "*PAU 00001"
                if (not exists $blocks{$current_block}) {
                        $blocks{$current_block} = {};   ## empty hash ref, not ()
                }
        } elsif ($line =~ m/^\d{5}\s/) {
                next unless defined $current_block;     ## detail line before any header
                $blocks{$current_block}{$line} += 1;
        } else {
                next;
        }
}

foreach my $block (sort keys %blocks) {
        print $block, "\n";
        foreach my $detail (sort keys %{$blocks{$block}}) {
                print $detail, "\n";
        }
        print "\n";
}
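
For the other option raised above (keeping just the number), a possible sketch is to use the digits that the header pattern already captures in `$1`. The helper name `bill_number` is hypothetical, and the approach assumes the trailing number alone identifies a B/M; if two different prefixes can share a number, their blocks would be merged.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Reduce a trimmed header line to its trailing number, e.g. "00001".
# Returns undef when the line is not a B/M header.
sub bill_number {
    my ($line) = @_;
    return $line =~ m{^B/M NUMBER:.*?(\d+)$} ? $1 : undef;
}

# Hypothetical sample lines: one header and one detail line.
for my $line ('B/M NUMBER: *PAU 00001', '00010 xxxxxxxxxxxx') {
    my $number = bill_number($line);
    print defined $number ? "header -> $number\n" : "not a header: $line\n";
}
```

In the script above, the equivalent change would be to set `$current_block = $1;` inside the header branch instead of stripping the prefix with a substitution.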
 
PatrickLawrence (author) commented:
Thanks for all the help and teaching me a little more about Perl.