Solved

Unique records in a text file

Posted on 2001-07-17
11
304 Views
Last Modified: 2006-11-17
Hello

I have a 30MB text file (a printer spool file) that has a lot of duplicate information. It basically has the form:

B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx

B/M NUMBER: *PAU 00002
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx

B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx

. . .

for many different B/M numbers. Someone suggested Perl could help sort this out and gave me the following three lines:

$/ = "" ;
while (<> ) { $Bills{$_}++ };
foreach $Bill (sort keys %Bills) { print $Bill };

I haven't yet figured out how everything in the code works, but it does indeed sort the file very quickly and remove duplicates. However, I'm still not getting the unique ocurrences of the B/Ms themselves. In the case where a page break splits a B/M, there is another header inserted and I have:

B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx

B/M NUMBER: *PAU 00001
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx
00040 xxxxxxxxxxxx

These B/Ms need to be concatenated somehow and then the duplicates eliminated.

Any suggestions either in Perl or something else?

Patrick
0
Comment
Question by:PatrickLawrence
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 6
  • 5
11 Comments
 
LVL 8

Expert Comment

by:shlomoy
ID: 6289784
let me see if I get you:

suppose your input is like this:

               B/M NUMBER: *PAU 00001
               00010 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00001
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00003
               00010 xxxxxxxxxxxx
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00001
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

you want the output to be

               B/M NUMBER: *PAU 00001
               00010 xxxxxxxxxxxx
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00003
               00010 xxxxxxxxxxxx
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx


is that right?
0
 

Author Comment

by:PatrickLawrence
ID: 6289830
That is correct. Thanks for the interest shlomoy.
0
 
LVL 8

Expert Comment

by:shlomoy
ID: 6289958
sounds faily easy.
Let me write a script to do that for you...
0
Technology Partners: We Want Your Opinion!

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!

 

Author Comment

by:PatrickLawrence
ID: 6290050
I'm glad it sounds easy to somebody! I wrote something in VBA to use Word methods but it takes days to plow through all the data. Thanks for the help!
0
 
LVL 8

Expert Comment

by:shlomoy
ID: 6290070
Sure.
It might take me some time to post it - because I'm doing something else right now. So don't go away.. :=)
0
 
LVL 8

Expert Comment

by:shlomoy
ID: 6290349
     1 #!/usr/bin/perl -w
      2 use strict;
      3 my %blocks=();
      4 my $current_block;
      5 while (my $line=<>) {
      6         chomp($line);
      7         next if $line=~m/^\s*$/;
      8         $line=~s/^\s*//; ## loose whitespaces from beginning of line
      9         $line=~s/\s*$//; ## loose whitespaces from end of line
     10         if ($line=~m/^B\/M NUMBER:.*?(\d+)$/) {
     11                 $current_block = $line;
     12                 if (not exists $blocks{$current_block}) {
     13                         $blocks{$current_block} = ();
     14                 }
     15         } elsif ($line=~m/^\d{5}\s/) {
     16                 my $h = $blocks{$current_block};
     17                 $h->{$line}+=1;
     18                 $blocks{$current_block}=$h;
     19         } else {
     20                 next;
     21         }
     22 }
     23
     24 foreach my $b (sort keys %blocks) {
     25         print $b,"\n";
     26         foreach my $l (sort keys %{$blocks{$b}}) {
     27                 print $l,"\n";
     28         }
     29         print "\n";
     30 }                                                
0
 
LVL 8

Expert Comment

by:shlomoy
ID: 6290369
if you save the above program (not including the line numbers, of course) in a filename 'ur.pl' and you chmod +x , and if your input text file is in a file called "input.txt" you can see how it works:

cat input.txt | ./ur.pl > output.txt

and then you can see the results in output.txt


Let me know if something is not understood, or not working to your satisfaction.
0
 

Author Comment

by:PatrickLawrence
ID: 6291316
It works like a charm!
One last question:

How and where can I remove the 'B/M NUMBER: ' prefix?

I tried playing around with the code and royally messed things up.

Thanks again.
0
 

Author Comment

by:PatrickLawrence
ID: 6291396
It works like a charm!
One last question:

How and where can I remove the 'B/M NUMBER: ' prefix?

I tried playing around with the code and royally messed things up.

Thanks again.
0
 
LVL 8

Accepted Solution

by:
shlomoy earned 75 total points
ID: 6292793
would you like to leave just the PAU and the rest of the line?
Or would you prefer having just the number?

here is the code - in case you want to just get rid of the "B/M NUMBER: " prefix:


      1 #!/usr/bin/perl -w
      2 use strict;
      3 my %blocks=();
      4 my $current_block;
      5 while (my $line=<>) {
      6         chomp($line);
      7         next if $line=~m/^\s*$/;
      8         $line=~s/^\s*//; ## loose whitespaces from beginning of line
      9         $line=~s/\s*$//; ## loose whitespaces from end of line
     10         if ($line=~m/^B\/M NUMBER:.*?(\d+)$/) {
     11                 $current_block = $line;
     12                 $current_block =~ s/B\/M NUMBER:\s*//;
     13                 if (not exists $blocks{$current_block}) {
     14                         $blocks{$current_block} = ();
     15                 }
     16         } elsif ($line=~m/^\d{5}\s/) {
     17                 my $h = $blocks{$current_block};
     18                 $h->{$line}+=1;
     19                 $blocks{$current_block}=$h;
     20         } else {
     21                 next;
     22         }
     23 }
     24
     25 foreach my $b (sort keys %blocks) {
     26         print $b,"\n";
     27         foreach my $l (sort keys %{$blocks{$b}}) {
     28                 print $l,"\n";
     29         }
     30         print "\n";
     31 }            
0
 

Author Comment

by:PatrickLawrence
ID: 6293619
Thanks for all the help and teaching me a little more about Perl.
0

Featured Post

On Demand Webinar: Networking for the Cloud Era

Did you know SD-WANs can improve network connectivity? Check out this webinar to learn how an SD-WAN simplified, one-click tool can help you migrate and manage data in the cloud.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

On Microsoft Windows, if  when you click or type the name of a .pl file, you get an error "is not recognized as an internal or external command, operable program or batch file", then this means you do not have the .pl file extension associated with …
In the distant past (last year) I hacked together a little toy that would allow a couple of Manager types to query, preview, and extract data from a number of MongoDB instances, to their tool of choice: Excel (http://dilbert.com/strips/comic/2007-08…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
Six Sigma Control Plans

691 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question