PatrickLawrence
asked on
Unique records in a text file
Hello
I have a 30MB text file (a printer spool file) that has a lot of duplicate information. It basically has the form:
B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx
B/M NUMBER: *PAU 00002
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx
B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx
. . .
for many different B/M numbers. Someone suggested Perl could help sort this out and gave me the following three lines:
$/ = "" ;
while (<> ) { $Bills{$_}++ };
foreach $Bill (sort keys %Bills) { print $Bill };
I haven't yet figured out how everything in the code works, but it does indeed sort the file very quickly and remove duplicates. However, I'm still not getting the unique occurrences of the B/Ms themselves. In the case where a page break splits a B/M, there is another header inserted and I have:
B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx
B/M NUMBER: *PAU 00001
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx
00040 xxxxxxxxxxxx
These B/Ms need to be concatenated somehow and then the duplicates eliminated.
Any suggestions either in Perl or something else?
Patrick
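For anyone puzzling over the three-liner, here is a minimal annotated sketch of what it does and why a bill split by a page break slips through. It assumes the records in the spool file are separated by blank lines, which is what paragraph mode ($/ = "") relies on. The sub name unique_paragraphs and the sample text are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the distinct blank-line-separated
# paragraphs of a string, sorted, the way the three-liner does.
sub unique_paragraphs {
    my ($text) = @_;
    local $/ = "";                 # paragraph mode: one read = one block
    open my $fh, '<', \$text or die "cannot read string: $!";
    my %seen;
    while (my $para = <$fh>) {
        $seen{$para}++;            # identical blocks collapse into one key
    }
    close $fh;
    my @paras = sort keys %seen;
    return @paras;
}

# Made-up sample: bill 00001 appears whole twice,
# bill 00002 is split in two by a repeated header.
my $sample = join "\n\n",
    "B/M NUMBER: *PAU 00001\n00010 aaa\n00020 bbb",
    "B/M NUMBER: *PAU 00001\n00010 aaa\n00020 bbb",
    "B/M NUMBER: *PAU 00002\n00010 ccc",
    "B/M NUMBER: *PAU 00002\n00020 ddd";
$sample .= "\n";

my @unique = unique_paragraphs($sample);
print scalar(@unique), " unique paragraphs\n";
print for @unique;
```

The two identical copies of bill 00001 collapse into one hash key, but the two halves of bill 00002 are different strings, so both survive. Removing duplicates at the paragraph level therefore cannot merge a split bill; the lines have to be grouped under their header first.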
ASKER
That is correct. Thanks for the interest, shlomoy.
Sounds fairly easy.
Let me write a script to do that for you...
ASKER
I'm glad it sounds easy to somebody! I wrote something in VBA to use Word methods but it takes days to plow through all the data. Thanks for the help!
Sure.
It might take me some time to post it - because I'm doing something else right now. So don't go away.. :=)
1 #!/usr/bin/perl -w
2 use strict;
3 my %blocks=();
4 my $current_block;
5 while (my $line=<>) {
6     chomp($line);
7     next if $line=~m/^\s*$/;
8     $line=~s/^\s*//; ## strip whitespace from the beginning of the line
9     $line=~s/\s*$//; ## strip whitespace from the end of the line
10     if ($line=~m/^B\/M NUMBER:.*?(\d+)$/) {
11         $current_block = $line;
12         if (not exists $blocks{$current_block}) {
13             $blocks{$current_block} = {}; ## empty hash ref for this header's lines
14         }
15     } elsif ($line=~m/^\d{5}\s/) {
16         my $h = $blocks{$current_block};
17         $h->{$line}+=1;
18         $blocks{$current_block}=$h ;
19     } else {
20         next;
21     }
22 }
23
24 foreach my $b (sort keys %blocks) {
25     print $b,"\n";
26     foreach my $l (sort keys %{$blocks{$b}}) {
27         print $l,"\n";
28     }
29     print "\n";
30 }
If you save the above program (not including the line numbers, of course) in a file named 'ur.pl', make it executable with chmod +x ur.pl, and your input text is in a file called "input.txt", you can see how it works:
cat input.txt | ./ur.pl > output.txt
and then you can see the results in output.txt
Let me know if something is not understood, or not working to your satisfaction.
ASKER
It works like a charm!
One last question:
How and where can I remove the 'B/M NUMBER: ' prefix?
I tried playing around with the code and royally messed things up.
Thanks again.
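One possible way to do this, offered as a sketch and not as the thread's accepted answer: capture the part of the header that follows the label and use that capture, instead of the whole line, as the block name. The helper name bill_id is invented here; in the script above its result would take the place of $line in the assignment to $current_block.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a header line without its 'B/M NUMBER: '
# label, or undef if the line is not a header at all.
sub bill_id {
    my ($line) = @_;
    # Keep everything after the label up to the trailing digits.
    return $line =~ m{^B/M NUMBER:\s*(.*\d)\s*$} ? $1 : undef;
}

print bill_id('B/M NUMBER: *PAU 00001'), "\n";   # prints "*PAU 00001"
```

Because the shortened name becomes the hash key, the output section needs no change: it prints whatever key was stored.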
ASKER
Thanks for all the help and teaching me a little more about Perl.
Suppose your input is like this:
B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx
B/M NUMBER: *PAU 00001
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx
00040 xxxxxxxxxxxx
B/M NUMBER: *PAU 00003
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx
00040 xxxxxxxxxxxx
B/M NUMBER: *PAU 00001
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx
00040 xxxxxxxxxxxx
You want the output to be:
B/M NUMBER: *PAU 00001
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx
00040 xxxxxxxxxxxx
B/M NUMBER: *PAU 00003
00010 xxxxxxxxxxxx
00020 xxxxxxxxxxxx
00030 xxxxxxxxxxxx
00040 xxxxxxxxxxxx
Is that right?