Solved

find duplicates in pipe delimited file based on first value

Posted on 2009-07-14
Last Modified: 2012-05-07
Hi all,

I have a Perl script that looks through a flat file and finds duplicate lines. It writes the dupes it finds to a text file.
I am trying to change it so that it finds duplicates based on the first value in the pipe-delimited file rather than on the whole line.

FIND DUPES SCRIPT:

open(FILE,"test.txt") || die "$!";
%seen =();
$line=0 ;
while (<FILE>) {
  $seen{$_}++;
  $line++;
  ## output dupes to text file
  open (MYFILE, '>>dupes.txt');
  print MYFILE "line $line : $_" if $seen{$_} > 1 ;
  close (MYFILE);
}

So as an example if I have a flat file with the following data:

774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2007||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2006||
773234||Burpy|||n||||||n|0|05/30/2006||

Checking for dupe lines my result file will output the following which
is correct. Here we find the exact duplicate "lines".

line 2 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
line 3 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
line 6 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
line 7 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||


What I need to do is get dupe lines based on the very first value which
is the ID number. So my output file should instead show:

line 2 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
line 3 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
line 4 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2007||
line 6 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
line 7 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
line 8 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2006||

Lastly, I am new to Perl, but I would like to learn more about how Perl expressions work. I'm finding Perl very handy for setting up fast little utility scripts to process large files (on Windows, using ActivePerl), but I'm a complete newbie.
Question by:binovpd

Accepted Solution

by:
mrjoltcola earned 75 total points
Instead do:

open(FILE,"test.txt") || die "$!";
%seen =();
$line=0 ;
while (<FILE>) {
  /^(\d+)/;
  $seen{$1}++;
  $line++;
  ## output dupes to text file
  open (MYFILE, '>>dupes.txt');
  print MYFILE "line $line : $_" if $seen{$1} > 1 ;
  close (MYFILE);
}
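
One thing to watch with the script above: if a line does not start with digits, the match fails and `$1` still holds the ID captured from an earlier line. Below is a minimal sketch of one way to guard against that. It is a variation, not mrjoltcola's code: the helper name `report_dupes_by_id` is made up for illustration, and the in-memory filehandles are there only so the example runs without `test.txt` or `dupes.txt`.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): report every line whose
# leading numeric ID has already been seen earlier in the input.
sub report_dupes_by_id {
    my ($in, $out) = @_;
    my %seen;
    while (my $row = <$in>) {
        # Skip lines that do not start with digits, so a stale $1
        # from an earlier successful match is never reused.
        next unless $row =~ /^(\d+)/;
        my $id = $1;
        # $. holds the current line number of the filehandle being read.
        # Post-increment is false the first time an ID is seen.
        print {$out} "line $. : $row" if $seen{$id}++;
    }
    return;
}

# Small in-memory demo using three of the sample rows from the question.
my $data = join '',
    "774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||\n",
    "774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2007||\n",
    "773234||Burpy|||n||||||n|0|05/30/2006||\n";

open my $in, '<', \$data or die "Cannot open in-memory input: $!";
my $report = '';
open my $out, '>', \$report or die "Cannot open in-memory output: $!";

report_dupes_by_id($in, $out);

close $out;
close $in;
print $report;
```

With real files, the two `open` calls would point at `test.txt` and `dupes.txt` instead; the output file is then opened once rather than once per line.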


Assisted Solution

by:ozo
ozo earned 50 total points
open(FILE,"test.txt") || die "$!";
open (MYFILE, '>>dupes.txt') || die $!;
%seen =();
while (<FILE>) {
  ## output dupes to text file
  print MYFILE "line $. : $_" if $seen{(/(\d+)/)[0]}++ ;
}
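
Note that `/(\d+)/` here has no `^` anchor, so it picks up the first run of digits anywhere on the line; for the sample data that is always the ID. If the first field might not be purely numeric, one alternative is to key on the first pipe-delimited field itself. A rough sketch, assuming that requirement (the helper `first_field` is a hypothetical name, not part of ozo's answer):

```perl
use strict;
use warnings;

# Hypothetical helper: return the first pipe-delimited field of a line.
sub first_field {
    my ($row) = @_;
    chomp $row;
    # The pipe must be escaped: a bare | means alternation in a regex.
    # The limit of 2 stops splitting after the first delimiter.
    my ($key) = split /\|/, $row, 2;
    return defined $key ? $key : '';
}

my @rows = (
    "774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||\n",
    "774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2007||\n",
    "773234||Burpy|||n||||||n|0|05/30/2006||\n",
);

my %seen;
my @dupes;
my $line = 0;
for my $row (@rows) {
    $line++;
    # Same post-increment trick as above: true only on repeat sightings.
    push @dupes, "line $line : $row" if $seen{ first_field($row) }++;
}
print @dupes;
```

Only the second row is reported here, since it repeats the ID of the first.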

Author Closing Comment

by:binovpd
Thanks, mrjoltcola and ozo. Both solutions work fine, so I split the points between you. I gave mrjoltcola a bit more since he answered first.

ozo, your solution is interesting. You look through the original file, then reparse the output results and look through that.

If I may ask, I'm trying to understand Perl expressions. What does /^(\d+)/; do?
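
For readers with the same question: `/^(\d+)/` is a regular expression match against the current line in `$_`. The `^` anchors the match at the start of the line, `\d+` matches one or more digits, and the parentheses capture what was matched into `$1`. A small sketch of the same pattern, wrapped in a hypothetical helper `leading_id` so the captured value can be inspected directly:

```perl
use strict;
use warnings;

# Hypothetical helper: return the digits at the start of a string,
# or undef when the string does not begin with a digit.
sub leading_id {
    my ($text) = @_;
    # In list context a successful match returns its captures;
    # a failed match returns an empty list, leaving $id undefined.
    my ($id) = $text =~ /^(\d+)/;
    return $id;
}

my $id = leading_id("774143||Mahou Tsukai Ninaru Houhou|||n");
print "captured: $id\n";

# The digits here are not at the start, so ^ prevents a match.
my $none = leading_id("Burpy|773234");
print "no match\n" unless defined $none;
```

In the accepted solution the match is written without a variable on the left, so it applies to `$_`, and the captured ID is then read from `$1`.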