Solved

Perl - check duplicates in file and case

Posted on 2013-01-24
Last Modified: 2013-01-25
I'm using this script, which I grabbed from PerlMonks. It does what I need: it checks a file and prints out any duplicate lines.

my %duplicates;

while (<>) {
    chomp;
    $duplicates{$_}++;
}

foreach my $key (keys %duplicates) {
    if ($duplicates{$key} > 1) {
        delete $duplicates{$key};
        print "$key\n";
    }
}


Just one issue: I need to match lines in the file that are the same but may differ in case. I could lowercase the file and then run the script, but I need to keep the original case.

How can I lowercase the lines for the comparison but still keep the original case in my output?

Thanks
Question by:bt707
9 Comments
 
LVL 31

Expert Comment

by:farzanj
ID: 38817059
First you need to make a map.

while (<>) {
    chomp;
    $duplicates{lc($_)} = $_;
}
Then, before storing each line, check whether its lowercased form already exists in the map: if it does, print the line; otherwise don't.
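One way to finish this idea, as a minimal sketch: test for the lowercased key before storing it, and report the line in its original case when the key is already there. The sub name `later_duplicates` and the sample words are hypothetical, not part of the original comment.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every line whose lowercased form was
# already seen earlier in the input, with its original case intact.
sub later_duplicates {
    my @lines = @_;
    my %seen;    # lowercased line => first original spelling
    my @dups;
    for my $line (@lines) {
        my $key = lc $line;
        if (exists $seen{$key}) {
            push @dups, $line;      # a repeat: report it as written
        }
        else {
            $seen{$key} = $line;    # first sighting: remember it
        }
    }
    return @dups;
}

print "$_\n" for later_duplicates(qw(Apple banana APPLE Banana cherry));
```

With the sample input this prints `APPLE` and `Banana`, the second spelling of each repeated word. To read from a file instead, the lines from `<>` could be chomped and passed in.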
 

Author Comment

by:bt707
ID: 38817071
I had already tried using lc, but I just get errors from it. Putting it in like this, I get the following error:

Useless use of lc in void context at ./dup_lines.pl line 10.

#! /usr/bin/perl

use strict;
use warnings;

my %duplicates;

while (<>) {
    chomp;
    lc;
    $duplicates{$_}++;
}

foreach my $key (keys %duplicates) {
    if ($duplicates{$key} > 1) {
        delete $duplicates{$key};
        print "$key\n";
    }
}

What am I missing?
 
LVL 26

Accepted Solution

by:
wilcoxon earned 500 total points
ID: 38817078
I would do something like this...

This will print out the lowercased key followed by the lines that matched it (indented by tabs).

my %duplicates;

while (<>) {
    chomp;
    my $key = lc $_;
    $duplicates{$key} = [] unless $duplicates{$key};
    push @{$duplicates{$key}}, $_;
}

foreach my $key (keys %duplicates) {
    if (@{$duplicates{$key}} > 1) {
        print "$key:\n\t", join("\n\t", @{$duplicates{$key}}), "\n";
        delete $duplicates{$key};
    }
}

 
LVL 26

Expert Comment

by:wilcoxon
ID: 38817084
Bare lc will not work.  You need to add:

$_ = lc;  # which should act the same as $_ = lc $_

This method will also not preserve case (one of the things you asked for).
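The reason, as a small sketch: `lc` returns the lowercased string and leaves its argument untouched, so a bare `lc;` computes a value and throws it away (hence the "void context" warning). The variable names below are illustrative only.

```perl
use strict;
use warnings;

my $line  = "Hello World";
my $lower = lc $line;    # lc returns a new, lowercased string

print "$line\n";         # the original is unchanged
print "$lower\n";

# With no argument, lc reads $_ but still only returns the result
for ("MiXeD") {
    my $copy = lc;
    print "$_ -> $copy\n";
}
```

Keeping the lowercased copy in its own variable is what allows the comparison to ignore case while the output keeps the original spelling.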
 

Author Closing Comment

by:bt707
ID: 38817097
Thanks, that worked fine. I'm not sure why I couldn't get the other approach to work; maybe it was me, but this one worked fine.

Thanks,
 
LVL 84

Expert Comment

by:ozo
ID: 38817103
hile (<>) {
    chomp;
    push @{$duplicates{lc}},$_;
}

foreach my $key (keys %duplicates) {
    if (@{$duplicates{$key}} > 1) {
        delete $duplicates{$key}->[0];
        print "$duplicates{$key}->[0]\n";
    }
}
 
LVL 26

Expert Comment

by:wilcoxon
ID: 38817165
Interesting.  I would have sworn "push @{$duplicates{$key}}" failed on an undefined $duplicates{$key} but I just re-tested and it works fine.

As such, you can omit the "... = [] unless $duplicates{$key}" line (and as ozo said, you can then combine the "$key = lc $_" and push lines).

As usual, ozo has provided a good, concise answer (though I would not do the delete/print part the way he did, but that's just preference).
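A sketch of the simplified version described above, relying on autovivification so that no explicit `= []` is needed. The sub name `group_by_lc` and the sample words are hypothetical; the output format follows the accepted solution.

```perl
use strict;
use warnings;

# Hypothetical helper: groups lines by their lowercased form and
# returns a reference to the resulting hash of array references.
sub group_by_lc {
    my %groups;
    # push creates the array reference on first use (autovivification)
    push @{ $groups{lc $_} }, $_ for @_;
    return \%groups;
}

my $groups = group_by_lc(qw(Foo bar FOO Bar baz));
for my $key (sort keys %$groups) {
    my @lines = @{ $groups->{$key} };
    next unless @lines > 1;    # only report keys seen more than once
    print "$key:\n\t", join("\n\t", @lines), "\n";
}
```

Sorting the keys is an addition here, made only so the report comes out in a stable order.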
 
LVL 84

Expert Comment

by:ozo
ID: 38817912
Sorry, I didn't see http:#a38817078 when I posted.
I was trying to duplicate the behaviour of the routine in the original question,
which seemed to be deleting only one of each duplicate name, so I just deleted the first.
It could easily be changed to be the last, or all but the first/last, or all.

If the intent is to re-write the file with duplicates eliminated, that might be done with

$^I=".bak";
$duplicates{+lc}++ or print while <>;

(and I see I omitted the  + in my previous post, not to mention the w in while)
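To illustrate why the `+` matters, here is a minimal sketch of the same keep-the-first-occurrence idea written as a filter over a list rather than an in-place file edit. The sub name `dedupe_ci` and the sample words are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: keeps the first occurrence of each line,
# comparing case-insensitively, and preserves the original case.
sub dedupe_ci {
    my %seen;
    # $seen{lc} would be read as $seen{'lc'} because a bare word inside
    # hash braces is quoted automatically; the unary plus makes Perl
    # treat it as a call to lc on $_.
    return grep { !$seen{+lc}++ } @_;
}

print "$_\n" for dedupe_ci(qw(Foo bar FOO Bar baz));
```

With the sample input this prints `Foo`, `bar` and `baz`. Writing the key as `lc $_` would work equally well and may be easier to read.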
 

Author Comment

by:bt707
ID: 38818623
Thanks for all the info. I got what I needed from the answer I accepted, after changing a few things to get the output I now need to see, and I learned several things from the other comments, which is very much appreciated.

Thanks to all.