Solved

Unique records in a text file

Posted on 2001-07-17
Last Modified: 2006-11-17
Hello

I have a 30MB text file (a printer spool file) that has a lot of duplicate information. It basically has the form:

B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx

B/M NUMBER: *PAU 00002
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx

B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx

. . .

for many different B/M numbers. Someone suggested Perl could help sort this out and gave me the following three lines:

$/ = "" ;
while (<> ) { $Bills{$_}++ };
foreach $Bill (sort keys %Bills) { print $Bill };

I haven't yet figured out how everything in the code works, but it does indeed sort the file very quickly and remove duplicate blocks. However, I'm still not getting the unique occurrences of the B/Ms themselves. Where a page break splits a B/M, another header is inserted and I have:

B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx

B/M NUMBER: *PAU 00001
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx
00040 xxxxxxxxxxxx

These B/Ms need to be concatenated somehow and then the duplicates eliminated.

Any suggestions either in Perl or something else?

Patrick
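
For readers wondering how the three-line snippet in the question works, here is an annotated sketch of the same idea. Setting `$/` to the empty string puts Perl in paragraph mode, so each read returns one whole blank-line-separated block, and using each block as a hash key collapses identical blocks. The sub name `unique_records` and the sample data are illustrative assumptions, not part of the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the unique blank-line-separated records found in $text, sorted.
sub unique_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    local $/ = "";         # paragraph mode: one read returns a whole record
    my %seen;
    while (my $record = <$fh>) {
        $seen{$record}++;  # identical records collapse onto the same hash key
    }
    close $fh;
    return sort keys %seen;
}

# Hypothetical sample modelled on the question: the third block repeats the first.
my $sample = <<'END';
B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx

B/M NUMBER: *PAU 00002
00010 xxxxxxxxxxxx

B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx

END

print unique_records($sample);
```

Because whole blocks are compared as strings, two fragments of a B/M that was split by a page break are different keys, which is why this approach alone cannot merge them.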
Question by:PatrickLawrence
LVL 8

Expert Comment

by:shlomoy
ID: 6289784
let me see if I get you:

suppose your input is like this:

               B/M NUMBER: *PAU 00001
               00010 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00001
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00003
               00010 xxxxxxxxxxxx
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00001
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

you want the output to be

               B/M NUMBER: *PAU 00001
               00010 xxxxxxxxxxxx
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx

               B/M NUMBER: *PAU 00003
               00010 xxxxxxxxxxxx
               00020 xxxxxxxxxxxx
               00030 xxxxxxxxxxxx
               00040 xxxxxxxxxxxx


is that right?

Author Comment

by:PatrickLawrence
ID: 6289830
That is correct. Thanks for the interest shlomoy.
LVL 8

Expert Comment

by:shlomoy
ID: 6289958
Sounds fairly easy.
Let me write a script to do that for you...

Author Comment

by:PatrickLawrence
ID: 6290050
I'm glad it sounds easy to somebody! I wrote something in VBA to use Word methods but it takes days to plow through all the data. Thanks for the help!
LVL 8

Expert Comment

by:shlomoy
ID: 6290070
Sure.
It might take me some time to post it - because I'm doing something else right now. So don't go away.. :=)

 
LVL 8

Expert Comment

by:shlomoy
ID: 6290349
#!/usr/bin/perl -w
use strict;

# %blocks maps each B/M header line to a hash of its unique detail lines.
my %blocks;
my $current_block;

while (my $line = <>) {
    chomp $line;
    next if $line =~ m/^\s*$/;   # skip blank lines
    $line =~ s/^\s+//;           # strip leading whitespace
    $line =~ s/\s+$//;           # strip trailing whitespace
    if ($line =~ m{^B/M NUMBER:.*?\d+$}) {
        # Header line: start a block, or resume one split by a page break.
        $current_block = $line;
        $blocks{$current_block} ||= {};
    } elsif ($line =~ m/^\d{5}\s/ and defined $current_block) {
        # Detail line: the hash key removes duplicates within the block.
        $blocks{$current_block}{$line}++;
    }
    # Anything else (or a detail line before the first header) is ignored.
}

# Print each block once, with headers and detail lines in sorted order.
foreach my $block (sort keys %blocks) {
    print "$block\n";
    foreach my $detail (sort keys %{ $blocks{$block} }) {
        print "$detail\n";
    }
    print "\n";
}
LVL 8

Expert Comment

by:shlomoy
ID: 6290369
If you save the above program (without any line numbers) in a file named 'ur.pl' and make it executable with chmod +x ur.pl, and your input text is in a file called "input.txt", you can see how it works:

cat input.txt | ./ur.pl > output.txt

and then you can see the results in output.txt


Let me know if anything is unclear or not working to your satisfaction.

Author Comment

by:PatrickLawrence
ID: 6291316
It works like a charm!
One last question:

How and where can I remove the 'B/M NUMBER: ' prefix?

I tried playing around with the code and royally messed things up.

Thanks again.
 
LVL 8

Accepted Solution

by:
shlomoy earned 75 total points
ID: 6292793
Would you like to keep just the PAU and the rest of the line?
Or would you prefer having just the number?

Here is the code, in case you just want to get rid of the "B/M NUMBER: " prefix:


#!/usr/bin/perl -w
use strict;

# %blocks maps each B/M identifier to a hash of its unique detail lines.
my %blocks;
my $current_block;

while (my $line = <>) {
    chomp $line;
    next if $line =~ m/^\s*$/;   # skip blank lines
    $line =~ s/^\s+//;           # strip leading whitespace
    $line =~ s/\s+$//;           # strip trailing whitespace
    if ($line =~ m{^B/M NUMBER:.*?\d+$}) {
        # Header line: drop the "B/M NUMBER: " prefix and use the rest as the key.
        $current_block = $line;
        $current_block =~ s{^B/M NUMBER:\s*}{};
        $blocks{$current_block} ||= {};
    } elsif ($line =~ m/^\d{5}\s/ and defined $current_block) {
        # Detail line: the hash key removes duplicates within the block.
        $blocks{$current_block}{$line}++;
    }
    # Anything else (or a detail line before the first header) is ignored.
}

# Print each block once, with identifiers and detail lines in sorted order.
foreach my $block (sort keys %blocks) {
    print "$block\n";
    foreach my $detail (sort keys %{ $blocks{$block} }) {
        print "$detail\n";
    }
    print "\n";
}
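
For the other option mentioned above (keeping just the number), a minimal sketch is to capture the trailing digits of the header line and use them as the block key. The helper name `bm_number` is hypothetical and not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return just the trailing digits of a "B/M NUMBER:" header line,
# or undef if the line is not a header. (bm_number is an illustrative name.)
sub bm_number {
    my ($line) = @_;
    return $line =~ m{^\s*B/M NUMBER:.*?(\d+)\s*$} ? $1 : undef;
}

print bm_number("B/M NUMBER: *PAU 00001"), "\n";
```

In the script, `$current_block = $1;` inside the header branch would have the same effect, provided the header pattern keeps its capturing parentheses around `\d+`. Note that keying on the number alone would merge blocks that share a number but carry different prefixes; the sample data only shows *PAU, so whether that matters depends on the real file.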

Author Comment

by:PatrickLawrence
ID: 6293619
Thanks for all the help and teaching me a little more about Perl.