Want to win a PS4? Go Premium and enter to win our High-Tech Treats giveaway. Enter to Win

x
?
Solved

Perl - check duplicates in file and case

Posted on 2013-01-24
9
Medium Priority
?
540 Views
Last Modified: 2013-01-25
I'm using this I grabbed from perlmonks that's does what I need, which checks a file and prints out any duplicate lines.

my %duplicates;

while (<>) {
    chomp;
    $duplicates{$_}++;
}

foreach my $key (keys %duplicates) {
    if ($duplicates{$key} > 1) {
        delete $duplicates{$key};
        print "$key\n";
    }
}

Open in new window

Just one issue, I need to match lines in the file that is the same but may have different case, I can do a lower case on the file then run it but I need to keep the case.

How can I do a lower case to do the checks with but still keep the same case for my output?

Thanks
0
Comment
Question by:bt707
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 3
  • 3
  • 2
  • +1
9 Comments
 
LVL 31

Expert Comment

by:farzanj
ID: 38817059
First you need to make a map.

while (<>) {
    chomp;
    $duplicates{lc($_)} = $_;
}
Then if it already exists in the map, you should print it otherwise not
0
 

Author Comment

by:bt707
ID: 38817071
I had already tried using a lc but I just get errors from that, putting it in like that I just get an error of:

Useless use of lc in void context at ./dup_lines.pl line 10.

#! /usr/bin/perl

use strict;
use warnings;

my %duplicates;

while (<>) {
    chomp;
    lc;
    $duplicates{$_}++;
}

foreach my $key (keys %duplicates) {
    if ($duplicates{$key} > 1) {
        delete $duplicates{$key};
        print "$key\n";
    }
}

What am I missing?
0
 
LVL 27

Accepted Solution

by:
wilcoxon earned 2000 total points
ID: 38817078
I would do something like this...

This will print out the lowercased key followed by the the lines that matched it (indented by tabs).

my %duplicates;

while (<>) {
    chomp;
    my $key = lc $_;
    $duplicates{$key} = [] unless $duplicates{$key};
    push @{$duplicates{$key}}, $_;
}

foreach my $key (keys %duplicates) {
    if (@{$duplicates{$key}} > 1) {
        print "$key:\n\t", join("\n\t", @{$duplicates{$key}}), "\n";
        delete $duplicates{$key};
    }
}

Open in new window

0
Concerto's Cloud Advisory Services

Want to avoid the missteps to gaining all the benefits of the cloud? Learn more about the different assessment options from our Cloud Advisory team.

 
LVL 27

Expert Comment

by:wilcoxon
ID: 38817084
Bare lc will not work.  You need to add:

$_ = lc;  # which should act the same as $_ = lc $_

This method will also not preserve case (one of the things you asked for).
0
 

Author Closing Comment

by:bt707
ID: 38817097
Thanks that worked fine, not sure why I did not get the other to work, maybe me but this one worked fine.

Thanks,
0
 
LVL 84

Expert Comment

by:ozo
ID: 38817103
hile (<>) {
    chomp;
    push @{$duplicates{lc}},$_;
}

foreach my $key (keys %duplicates) {
    if (@{$duplicates{$key}} > 1) {
        delete $duplicates{$key}->[0];
        print "$duplicates{$key}->[0]\n";
    }
}
0
 
LVL 27

Expert Comment

by:wilcoxon
ID: 38817165
Interesting.  I would have sworn "push @{$duplicates{$key}}" failed on an undefined $duplicates{$key} but I just re-tested and it works fine.

As such, you can omit the "... = [] unless $duplicates{$key}" line (and as ozo said, you can then combine the "$key = lc $_" and push lines).

As usual, ozo has provided a good concise answer (though I would not do his delete/print part like he did but that's just preference).
0
 
LVL 84

Expert Comment

by:ozo
ID: 38817912
Sorry, I didn't see http:#a38817078 when I posted.
I was trying to duplicate the behaviour of the routine in the original question,
which seemed to be deleting only one of each duplicate name, so I just deleted the first.
It could easily be changed to be the last, or all but the first/last, or all.

If the intent is to re-write the file with duplicates eliminated, that might be done with

$^I=".bak";
$duplicates{+lc}++ or print while <>;

(and I see I omitted the  + in my previous post, not to mention the w in while)
0
 

Author Comment

by:bt707
ID: 38818623
Thanks for all the info, I had got what I what I need from the one I accepted by changing a a few things just so I got the output I now needed to see, but just learned several things from the comments which is very much appreciated.

Thanks to all.
0

Featured Post

Important Lessons on Recovering from Petya

In their most recent webinar, Skyport Systems explores ways to isolate and protect critical databases to keep the core of your company safe from harm.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

I have been pestered over the years to produce and distribute regular data extracts, and often the request have explicitly requested the data be emailed as an Excel attachement; specifically Excel, as it appears: CSV files confuse (no Red or Green h…
There are many situations when we need to display the data in sorted order. For example: Student details by name or by rank or by total marks etc. If you are working on data driven based projects then you will use sorting techniques very frequently.…
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…
Six Sigma Control Plans

610 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question