Solved

Perl code to remove duplicate lines from text file

Posted on 1998-10-10
Last Modified: 2008-03-17
Ok, I need a quick Perl script that can take a sorted text file and remove duplicate lines. My brain is too fried to write this myself, so I'm hoping someone can save me some time.

I have a text file (a mailing list) with two fields per line: <email>, a space as delimiter, then the person's name in parentheses. I need a routine to read the text file and remove lines whose first field is a duplicate.

I'll also award points if someone can give me a unix solution to this.

This needs to handle a 12,000+ line text file and be case insensitive (in determining dupes).
Question by:wisdom042597
3 Comments
 
LVL 1

Author Comment

by:wisdom042597
ID: 1205214
As a plus, it would be nice to reject any e-mail field which doesn't contain the '@' symbol.

 
LVL 84

Expert Comment

by:ozo
ID: 1205215
perldoc perlfaq4
  How can I extract just the unique elements of an array?

            There are several possible ways, depending on whether
            the array is ordered and whether you wish to preserve
            the ordering.

    a) If @in is sorted, and you want @out to be sorted:
    (this assumes all true values in the array)
                    $prev = 'nonesuch';
                    @out = grep($_ ne $prev && ($prev = $_), @in);

                This is nice in that it doesn't use much extra
                memory, simulating uniq(1)'s behavior of removing
                only adjacent duplicates. It's less nice in that it
                won't work with false values like undef, 0, or "";
                "0 but true" is ok, though.

    b) If you don't know whether @in is sorted:
                    undef %saw;
                    @out = grep(!$saw{$_}++, @in);

    c) Like (b), but @in contains only small integers:
                    @out = grep(!$saw[$_]++, @in);

    d) A way to do (b) without any loops or greps:
                    undef %saw;
                    @saw{@in} = ();
                    @out = sort keys %saw;  # remove sort if undesired

    e) Like (d), but @in contains only small positive integers:
                    undef @ary;
                    @ary[@in] = @in;
                    @out = @ary;
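
Applied to the mailing list in the question, approach (b) seems the safer fit: if the file was sorted case-sensitively, addresses that differ only in case need not be adjacent, so the $prev trick in (a) could miss them. A minimal sketch, assuming one "<email> (name)" entry per line; the helper name dedupe_by_email and the sample addresses are made up for illustration and are not part of perlfaq4:

```perl
use strict;
use warnings;

# Hypothetical helper (not from perlfaq4): keep the first line seen for each
# email address, comparing addresses case-insensitively.
sub dedupe_by_email {
    my @lines = @_;
    my %saw;
    my @out;
    for my $line (@lines) {
        # The first whitespace-separated field is the email address.
        my ($email) = split ' ', $line;
        # Skip blank lines and first fields that contain no '@'.
        next unless defined $email && $email =~ /\@/;
        # Same idea as grep(!$saw{$_}++, @in), keyed on the lowercased email.
        push @out, $line unless $saw{lc $email}++;
    }
    return @out;
}

# Invented sample data.
my @in = (
    'Bob@example.com (Bob Smith)',
    'alice@example.com (Alice Jones)',
    'bob@example.com (Robert Smith)',
    'not-an-address (Nobody)',
);
print "$_\n" for dedupe_by_email(@in);
# prints:
#   Bob@example.com (Bob Smith)
#   alice@example.com (Alice Jones)
```

The hash holds one key per distinct address, so a 12,000-line file should be no problem for memory, and the input order is preserved.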

  How can I get the unique keys from two hashes?

            First you extract the keys from the hashes into arrays,
            and then solve the uniquifying the array problem
            described above. For example:

                %seen = ();
                for $element (keys(%foo), keys(%bar)) {
                    $seen{$element}++;
                }
                @uniq = keys %seen;

            Or more succinctly:

                @uniq = keys %{{%foo,%bar}};

            Or if you really want to save space:

                %seen = ();
                while (defined ($key = each %foo)) {
                    $seen{$key}++;
                }
                while (defined ($key = each %bar)) {
                    $seen{$key}++;
                }
                @uniq = keys %seen;
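
For what it's worth, the three variants above can be checked against each other on a small example. This is only a sketch: the hash contents are invented, and the leading + in front of the anonymous hash is my addition to make sure Perl parses the inner braces as a hash constructor rather than a block.

```perl
use strict;
use warnings;

# Sample data, invented for illustration.
my %foo = (apple => 1, pear => 2);
my %bar = (pear => 3, plum => 4);

# Loop form: record every key from both hashes.
my %seen;
$seen{$_}++ for keys(%foo), keys(%bar);
my @uniq_loop = sort keys %seen;

# Succinct form: merge both hashes into an anonymous hash and take its keys.
my @uniq_short = sort keys %{ +{ %foo, %bar } };

# Space-saving form: walk each hash one key at a time with each().
my %seen_each;
while (defined(my $key = each %foo)) { $seen_each{$key}++ }
while (defined(my $key = each %bar)) { $seen_each{$key}++ }
my @uniq_each = sort keys %seen_each;

# The sort is only there to make the output order predictable.
print "@uniq_loop\n";    # apple pear plum
print "@uniq_short\n";   # apple pear plum
print "@uniq_each\n";    # apple pear plum
```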

 

Accepted Solution

by:
pcrutch earned 200 total points
ID: 1205216
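Here's the unix solution. -f folds case, -u keeps one line per distinct key, and -k1,1 limits the key to the email field; grep then drops lines whose first field has no '@'. Output goes to stdout, so redirect it to a new file:

 sort -f -u -t' ' -k1,1 <textfile> | \
   grep '^[^ ]*@'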

As for the perl, here is a solution for either unix or Win32:
----
# Read in each line from STDIN or filename on cmd line
while (<>) {
  chomp;
  # Push it onto the text array
  push(@text, $_);
}

# Loop through the sorted text array
foreach $line (sort @text) {
  # Split the line into two fields: the email, then the rest (the name)
  ($email, $name) = split(' ', $line, 2);

  # Skip blank lines and emails without an embedded '@'
  next if (!defined($email) || $email !~ /\@/);

  # %seen is keyed on the lowercased email, so dupes match
  # regardless of case
  if (!$seen{lc($email)}++) {
    $name = '' unless defined($name);
    print "$email $name\n";
  }
}
----
Note that for each set of duplicate emails this prints only the first email/name pair in sorted order.
