Perl code to remove duplicate lines from text file
Ok, I need a quick Perl script that can take a sorted text file and remove duplicate lines. My brain is too fried to write this myself, so I'm hoping someone can save me some time.
I have a text file (mailing list) which consists of two fields per line: <email> (space as delimiter) (person's name in parentheses). I need a routine to read the text file and remove lines that duplicate the first field.
I'll also award points if someone can give me a unix solution to this.
This needs to handle a 12,000+ line text file and be case insensitive (in determining dupes).
perldoc perlfaq4
How can I extract just the unique elements of an array?
There are several possible ways, depending on whether
the array is ordered and whether you wish to preserve
the ordering.
a) If @in is sorted, and you want @out to be sorted:
(this assumes all true values in the array)
$prev = 'nonesuch';
@out = grep($_ ne $prev && ($prev = $_), @in);
This is nice in that it doesn't use much extra
memory, simulating uniq(1)'s behavior of removing
only adjacent duplicates. It's less nice in that it
won't work with false values like undef, 0, or "";
"0 but true" is ok, though.
b) If you don't know whether @in is sorted:
undef %saw;
@out = grep(!$saw{$_}++, @in);
c) Like (b), but @in contains only small integers:
@out = grep(!$saw[$_]++, @in);
d) A way to do (b) without any loops or greps:
undef %saw;
@saw{@in} = ();
@out = sort keys %saw; # remove sort if undesired
e) Like (d), but @in contains only small positive integers:
undef @ary;
@ary[@in] = @in;
@out = @ary;
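Since the question also asks for a unix solution, here is a rough shell sketch of the idea behind (a): sort so that duplicates become adjacent, then remember only the previous key. The file names list.txt and unique.txt and the sample addresses are made up for illustration, and it assumes the email is the first whitespace-separated field on each line.

```shell
#!/bin/sh
# Hypothetical sample data standing in for the real mailing list.
cat > list.txt <<'EOF'
alice@example.com (Alice A.)
Alice@Example.com (Alice Again)
bob@example.com (Bob B.)
BOB@EXAMPLE.COM (Robert B.)
carol@example.com (Carol C.)
EOF

# Sort case-insensitively on field 1 so that duplicate addresses are adjacent,
# then print a line only when its lowercased address differs from the
# previous line's lowercased address.
LC_ALL=C sort -f -k1,1 list.txt |
  awk '{ key = tolower($1) } key != prev { print } { prev = key }' > unique.txt

cat unique.txt
```

Like the Perl in (a), this keeps only one value in memory at a time, so the size of the file should not matter much. Which line of a duplicate group survives depends on how sort orders the tied lines.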
How can I get the unique keys from two hashes?
First you extract the keys from the hashes into arrays,
and then solve the uniquifying the array problem
described above. For example:
As for the perl, here is a solution for either unix or Win32:
----
# Read in each line from STDIN or filename on cmd line
while (<>) {
    chomp;
    # Push it onto the text array
    push(@text, $_);
}
# Loop through the sorted text array
foreach $line (sort @text) {
    # Split the line into two fields: email, then the rest (the name)
    ($email, $name) = split(' ', $line, 2);
    # Skip blank lines and lines whose first field has no embedded '@'
    next if !defined($email) || $email !~ /\@/;
    # %seen records each email (lowercased) that has already been printed
    if (!$seen{lc $email}++) {
        print "$email $name\n";
    }
}
----
Note that when an email appears more than once, this prints only the first email/name pair for it (first in sorted order).
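For the unix side of the question, something along these lines might do, as a sketch rather than a tested production script. It assumes the email is the first whitespace-separated field; the file names mailing.txt and deduped.txt and the sample addresses are invented for the example.

```shell
#!/bin/sh
# Hypothetical sample data; the real input would be the 12,000+ line list.
cat > mailing.txt <<'EOF'
bob@example.com (Bob B.)
alice@example.com (Alice A.)
BOB@EXAMPLE.COM (Robert B.)
Alice@Example.com (Alice Again)
carol@example.com (Carol C.)
EOF

# Print a line only the first time its lowercased first field is seen.
# seen[] is an awk associative array keyed by the lowercased email.
awk '!seen[tolower($1)]++' mailing.txt > deduped.txt

cat deduped.txt
```

This does not need the input to be sorted and keeps the lines in their original order, at the cost of holding one array entry per distinct address, which should be fine for a file of this size.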