Solved

Perl - check duplicates in file and case

Posted on 2013-01-24
Last Modified: 2013-01-25
I'm using this script, which I grabbed from PerlMonks. It does what I need: it checks a file and prints out any duplicate lines.

my %duplicates;

while (<>) {
    chomp;
    $duplicates{$_}++;
}

foreach my $key (keys %duplicates) {
    if ($duplicates{$key} > 1) {
        delete $duplicates{$key};
        print "$key\n";
    }
}


Just one issue: I need to match lines in the file that are the same but may differ in case. I could lowercase the file and then run the script, but I need to keep the original case.

How can I lowercase the lines for the comparison but still keep the original case in my output?

Thanks
Question by:bt707
9 Comments
 
LVL 31

Expert Comment

by:farzanj
ID: 38817059
First you need to make a map.

while (<>) {
    chomp;
    $duplicates{lc($_)} = $_;
}
Then, if the lowercased key already exists in the map, print the line; otherwise don't.
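One way to finish this idea, as a minimal sketch: record the first spelling seen for each lowercased key, and report a line whenever its key is already in the map. The sub name find_case_insensitive_dups and the sample words are made up for illustration; the scripts in this thread read from <> instead of taking a list.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines whose lowercased form was already
# seen, in their original case and in input order.
sub find_case_insensitive_dups {
    my @lines = @_;
    my %seen;    # lowercased line => first original spelling
    my @dups;
    for my $line (@lines) {
        my $key = lc $line;
        if (exists $seen{$key}) {
            push @dups, $line;      # repeat: keep the case as written
        }
        else {
            $seen{$key} = $line;    # first sighting: just record it
        }
    }
    return @dups;
}

print "$_\n" for find_case_insensitive_dups(qw(Apple banana APPLE Cherry apple));
# prints:
#   APPLE
#   apple
```

Note that the check has to happen before the assignment; assigning unconditionally, as in the loop above, overwrites the earlier spelling and loses the information that the key was seen before.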
 

Author Comment

by:bt707
ID: 38817071
I had already tried using lc, but I just get errors from it. Putting it in as shown below, I get this error:

Useless use of lc in void context at ./dup_lines.pl line 10.

#! /usr/bin/perl

use strict;
use warnings;

my %duplicates;

while (<>) {
    chomp;
    lc;
    $duplicates{$_}++;
}

foreach my $key (keys %duplicates) {
    if ($duplicates{$key} > 1) {
        delete $duplicates{$key};
        print "$key\n";
    }
}

What am I missing?
 
LVL 26

Accepted Solution

by:
wilcoxon earned 500 total points
ID: 38817078
I would do something like this...

This will print out the lowercased key followed by the lines that matched it (indented by tabs).

my %duplicates;

while (<>) {
    chomp;
    my $key = lc $_;
    $duplicates{$key} = [] unless $duplicates{$key};
    push @{$duplicates{$key}}, $_;
}

foreach my $key (keys %duplicates) {
    if (@{$duplicates{$key}} > 1) {
        print "$key:\n\t", join("\n\t", @{$duplicates{$key}}), "\n";
        delete $duplicates{$key};
    }
}

 
LVL 26

Expert Comment

by:wilcoxon
ID: 38817084
Bare lc will not work.  You need to add:

$_ = lc;  # which should act the same as $_ = lc $_

This method will also not preserve case (one of the things you asked for).
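To illustrate the point about bare lc: lc returns a lowercased copy and leaves its argument untouched, which is why calling it without using the result triggers the "Useless use of lc in void context" warning. A minimal sketch follows; the variable names $line and $key are chosen for this example only.

```perl
use strict;
use warnings;

my $line = "Hello World";

# lc does not modify its argument; it returns a new string.
my $key = lc $line;

print "key:  $key\n";     # lowercased copy, used for comparisons
print "line: $line\n";    # original case, still available for output

{
    # With $_ as the implicit argument, the result must still be assigned.
    local $_ = "MiXeD";
    $_ = lc;              # same as $_ = lc $_
    print "$_\n";         # the original case of $_ is now gone
}
```

Storing the lowercased copy in a separate variable, rather than overwriting $_, is what lets the script compare case-insensitively and still print the line as it appeared in the file.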
 

Author Closing Comment

by:bt707
ID: 38817097
Thanks, that worked fine. I'm not sure why I couldn't get the other one to work (maybe it's me), but this one worked fine.

Thanks,
 
LVL 84

Expert Comment

by:ozo
ID: 38817103
hile (<>) {
    chomp;
    push @{$duplicates{lc}},$_;
}

foreach my $key (keys %duplicates) {
    if (@{$duplicates{$key}} > 1) {
        delete $duplicates{$key}->[0];
        print "$duplicates{$key}->[0]\n";
    }
}
 
LVL 26

Expert Comment

by:wilcoxon
ID: 38817165
Interesting.  I would have sworn "push @{$duplicates{$key}}" failed on an undefined $duplicates{$key} but I just re-tested and it works fine.

As such, you can omit the "... = [] unless $duplicates{$key}" line (and as ozo said, you can then combine the "$key = lc $_" and push lines).

As usual, ozo has provided a good concise answer (though I would not do his delete/print part like he did but that's just preference).
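The behaviour that was re-tested here is Perl's autovivification: using an undefined hash value as an array reference in push creates the array on the fly. A minimal sketch with made-up data (the %groups name is an assumption for this example):

```perl
use strict;
use warnings;

my %groups;    # lowercased line => array ref of original spellings

for my $line (qw(Foo foo BAR)) {
    # No "= [] unless ..." needed: push autovivifies $groups{...}
    # into an array reference the first time a key is used.
    push @{ $groups{ lc $line } }, $line;
}

for my $key (sort keys %groups) {
    print "$key: @{ $groups{$key} }\n";
}
# prints:
#   bar: BAR
#   foo: Foo foo
```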
 
LVL 84

Expert Comment

by:ozo
ID: 38817912
Sorry, I didn't see the accepted solution (ID: 38817078) when I posted.
I was trying to duplicate the behaviour of the routine in the original question,
which seemed to be deleting only one of each duplicate name, so I just deleted the first.
It could easily be changed to be the last, or all but the first/last, or all.
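A runnable sketch of that grouping approach, under a few assumptions: the sub name first_of_each_duplicate is hypothetical, it takes a list instead of reading <>, and it returns the first spelling of each duplicated line rather than deleting an array element. Note the {+lc} subscript: a bare {lc} would be read as the literal string key "lc".

```perl
use strict;
use warnings;

# Hypothetical wrapper so the example runs on in-memory data instead of <>.
sub first_of_each_duplicate {
    my %duplicates;
    for (@_) {
        # {+lc} forces a call to lc($_); {lc} would be the literal key "lc".
        push @{ $duplicates{+lc} }, $_;
    }
    # For every group with more than one member, return its first spelling.
    return map  { $duplicates{$_}[0] }
           grep { @{ $duplicates{$_} } > 1 }
           sort keys %duplicates;
}

print "$_\n" for first_of_each_duplicate(qw(Apple banana APPLE Cherry cherry));
# prints:
#   Apple
#   Cherry
```

Switching the [0] index to [-1] would report the last spelling instead, and returning the whole @{ $duplicates{$_} } list would report every spelling.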

If the intent is to re-write the file with duplicates eliminated, that might be done with

$^I=".bak";
$duplicates{+lc}++ or print while <>;

(and I see I omitted the  + in my previous post, not to mention the w in while)
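The core of the in-place rewrite above is the "$duplicates{+lc}++ or print" test, which succeeds only the first time a lowercased line is met. The same idiom can be sketched on an in-memory list; the sub name dedup_ignore_case is made up for this example, and it uses grep in place of the print loop.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the first occurrence of each line,
# compared case-insensitively, preserving input order and original case.
sub dedup_ignore_case {
    my %seen;
    # $seen{+lc}++ is false (0) only the first time a lowercased line is met.
    return grep { !$seen{+lc}++ } @_;
}

print "$_\n" for dedup_ignore_case(qw(Apple banana APPLE Banana cherry));
# prints:
#   Apple
#   banana
#   cherry
```

In the two-line version, setting $^I to ".bak" turns on in-place editing for the files named on the command line, so the printed lines replace each file's contents and the original is kept with a .bak extension.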
 

Author Comment

by:bt707
ID: 38818623
Thanks for all the info. I got what I needed from the answer I accepted, after changing a few things so that I got the output I needed to see, and I learned several things from the comments, which is very much appreciated.

Thanks to all.