Solved

Perl - check duplicates in file and case

Posted on 2013-01-24
Last Modified: 2013-01-25
I'm using this script, which I grabbed from PerlMonks. It does what I need: it checks a file and prints out any duplicate lines.

my %duplicates;

while (<>) {
    chomp;
    $duplicates{$_}++;
}

foreach my $key (keys %duplicates) {
    if ($duplicates{$key} > 1) {
        delete $duplicates{$key};
        print "$key\n";
    }
}


Just one issue: I need to match lines in the file that are the same but may differ in case. I could lowercase the whole file and then run the script, but I need to keep the original case.

How can I lowercase the lines for the comparison but still keep the original case in my output?

Thanks
Question by:bt707
9 Comments
 
LVL 31

Expert Comment

by:farzanj
ID: 38817059
First you need to make a map.

while (<>) {
    chomp;
    $duplicates{lc($_)} = $_;
}
Then, if the lowercased line already exists in the map, print it; otherwise don't.
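One way to complete the idea above, as a minimal sketch: keep a map keyed on the lowercased line, and report a line (in its original case) whenever its lowercased form has already been seen. The sub name `later_duplicates` and the sample data are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every line whose lowercased form was
# already seen earlier in the input, keeping the original case.
sub later_duplicates {
    my @lines = @_;
    my %seen;    # lowercased line => first original spelling
    my @dups;
    for my $line (@lines) {
        my $key = lc $line;
        if (exists $seen{$key}) {
            push @dups, $line;      # seen before: report it as typed
        } else {
            $seen{$key} = $line;    # first occurrence: remember it
        }
    }
    return @dups;
}

# Prints APPLE and Banana, each on its own line.
print "$_\n" for later_duplicates('Apple', 'banana', 'APPLE', 'Banana', 'cherry');
```

To run this against a file instead, the list could be built with `chomp(my @lines = <>);` before calling the sub.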

Author Comment

by:bt707
ID: 38817071
I had already tried using lc, but I just get errors from it. Putting it in like this, I get the following error:

Useless use of lc in void context at ./dup_lines.pl line 10.

#! /usr/bin/perl

use strict;
use warnings;

my %duplicates;

while (<>) {
    chomp;
    lc;
    $duplicates{$_}++;
}

foreach my $key (keys %duplicates) {
    if ($duplicates{$key} > 1) {
        delete $duplicates{$key};
        print "$key\n";
    }
}

What am I missing?
LVL 26

Accepted Solution

by:
wilcoxon earned 500 total points
ID: 38817078
I would do something like this...

This will print out the lowercased key followed by the lines that matched it (indented by tabs).

my %duplicates;

while (<>) {
    chomp;
    my $key = lc $_;
    $duplicates{$key} = [] unless $duplicates{$key};
    push @{$duplicates{$key}}, $_;
}

foreach my $key (keys %duplicates) {
    if (@{$duplicates{$key}} > 1) {
        print "$key:\n\t", join("\n\t", @{$duplicates{$key}}), "\n";
        delete $duplicates{$key};
    }
}

 
LVL 26

Expert Comment

by:wilcoxon
ID: 38817084
Bare lc will not work.  You need to add:

$_ = lc;  # which should act the same as $_ = lc $_

This method will also not preserve case (one of the things you asked for).
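To illustrate the point above: `lc` returns a lowercased copy and leaves its argument unchanged, which is why a bare `lc;` statement does nothing except trigger the "Useless use of lc in void context" warning. A small sketch (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $line = 'Hello World';
my $key  = lc $line;          # lc returns a new string; $line is untouched

print "$line\n";              # Hello World
print "$key\n";               # hello world

for ('MiXeD') {
    my $copy = lc;            # with no argument, lc operates on $_
    print "$_ -> $copy\n";    # MiXeD -> mixed
}
```

Storing the result in a separate variable, rather than assigning it back to `$_`, is what lets the original case survive for output.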

Author Closing Comment

by:bt707
ID: 38817097
Thanks, that worked fine. I'm not sure why I couldn't get the other one to work; maybe it's me, but this one worked fine.

Thanks,
LVL 84

Expert Comment

by:ozo
ID: 38817103
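my %duplicates;

while (<>) {
    chomp;
    # {+lc}, not {lc}: a bare word inside {} is taken as the literal key "lc"
    push @{$duplicates{+lc}}, $_;
}

foreach my $key (keys %duplicates) {
    if (@{$duplicates{$key}} > 1) {
        # remove the first of the duplicates and print it in its original case
        print shift(@{$duplicates{$key}}), "\n";
    }
}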
LVL 26

Expert Comment

by:wilcoxon
ID: 38817165
Interesting.  I would have sworn "push @{$duplicates{$key}}" failed on an undefined $duplicates{$key} but I just re-tested and it works fine.

As such, you can omit the "... = [] unless $duplicates{$key}" line (and as ozo said, you can then combine the "$key = lc $_" and push lines).

As usual, ozo has provided a good concise answer (though I would not do his delete/print part like he did but that's just preference).
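The behaviour described above is Perl's autovivification: dereferencing an undefined hash value as an array in a modifying context creates the array reference on the fly. A minimal sketch, with an illustrative hash name and key:

```perl
use strict;
use warnings;

my %groups;

# No "$groups{fruit} = []" is needed first: push on the dereferenced,
# still-undefined value creates the array reference automatically.
push @{ $groups{fruit} }, 'Apple';
push @{ $groups{fruit} }, 'APPLE';

print ref($groups{fruit}), "\n";             # ARRAY
print scalar(@{ $groups{fruit} }), "\n";     # 2
```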
LVL 84

Expert Comment

by:ozo
ID: 38817912
Sorry, I didn't see http:#a38817078 when I posted.
I was trying to duplicate the behaviour of the routine in the original question,
which seemed to be deleting only one of each duplicate name, so I just deleted the first.
It could easily be changed to be the last, or all but the first/last, or all.

If the intent is to re-write the file with duplicates eliminated, that might be done with

$^I=".bak";
$duplicates{+lc}++ or print while <>;

(and I see I omitted the  + in my previous post, not to mention the w in while)
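The one-liner above works because `$duplicates{+lc}++` returns the previous count, which is 0 (false) the first time a line is seen, so only first occurrences get printed. The same logic applied to an in-memory list, as a sketch; the sub name `unique_ignoring_case` is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: keeps the first occurrence of each line,
# comparing case-insensitively but returning the original spelling.
sub unique_ignoring_case {
    my %seen;
    # Post-increment yields the previous count, so the test is true
    # only the first time a given lowercased line turns up.
    return grep { !$seen{ lc $_ }++ } @_;
}

# Prints Apple, banana and cherry, each on its own line.
print "$_\n" for unique_ignoring_case('Apple', 'banana', 'APPLE', 'Banana', 'cherry');
```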

Author Comment

by:bt707
ID: 38818623
Thanks for all the info. I got what I needed from the answer I accepted, changing a few things so that I got the output I needed to see, and I also learned several things from the comments, which is very much appreciated.

Thanks to all.