Solved

find duplicates in pipe delimited file based on first value

Posted on 2009-07-14
3
489 Views
Last Modified: 2012-05-07
Hi all,

I have a script in perl to look through a flat file and find duplicate lines. I output the dupes found to a text file.
I am trying to change this so it finds duplicates based on the first value in the pipe delmited file and not based on the whole line.

FIND DUPES SCRIPT:

open(FILE,"test.txt") || die "$!";
%seen =();
$line=0 ;
while (<FILE>) {
  $seen{$_}++;
  $line++;
  ## output dupes to text file
  open (MYFILE, '>>dupes.txt');
  print MYFILE "line $line : $_" if $seen{$_} > 1 ;
  close (MYFILE);
}

So as an example if I have a flat file with the following data:

774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2007||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2006||
773234||Burpy|||n||||||n|0|05/30/2006||

Checking for dupe lines my result file will output the following which
is correct. Here we find the exact duplicate "lines".

line 2 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
line 3 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
line 6 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
line 7 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||


What I need to do is get dupe lines based on the very first value which
is the ID number. So my output file should instead show:

line 2 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
line 3 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
line 4 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2007||
line 6 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
line 7 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
line 8 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2006||

Lastly. I am new to perl but I would like to learn more about how perl expression work. Im finding perl very handy for setting up fast little utility scripts to process large files (on windows using active perl) but I'm a complete newb to perl.
0
Comment
Question by:binovpd
3 Comments
 
LVL 40

Accepted Solution

by:
mrjoltcola earned 75 total points
ID: 24852533
Instead do:

open(FILE,"test.txt") || die "$!";
%seen =();
$line=0 ;
while (<FILE>) {
  /^(\d+)/;
  $seen{$1}++;
  $line++;
  ## output dupes to text file
  open (MYFILE, '>>dupes.txt');
  print MYFILE "line $line : $_" if $seen{$1} > 1 ;
  close (MYFILE);
}

Open in new window

0
 
LVL 84

Assisted Solution

by:ozo
ozo earned 50 total points
ID: 24852633
open(FILE,"test.txt") || die "$!";
open (MYFILE, '>>dupes.txt') || die $!;
%seen =();
while (<FILE>) {
  ## output dupes to text file
  print MYFILE "line $. : $_" if $seen{(/(\d+)/)[0]}++ ;
}
0
 

Author Closing Comment

by:binovpd
ID: 31603483
Thanks mrjoltcola and ozo. Both solutions work fine. I split the points between you both since both solutions will work. I gave mrjoltcola a bit more since he answered first.

ozo your solutions intersting you. you look though the orignal file then reparse through the output results and look through that.

If I may ask, Im trying to understand perl expressions. What does /^(\d+)/; do?
0

Featured Post

Free Tool: Port Scanner

Check which ports are open to the outside world. Helps make sure that your firewall rules are working as intended.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

I've just discovered very important differences between Windows an Unix formats in Perl,at least 5.xx.. MOST IMPORTANT: Use Unix file format while saving Your script. otherwise it will have ^M s or smth likely weird in the EOL, Then DO NOT use m…
On Microsoft Windows, if  when you click or type the name of a .pl file, you get an error "is not recognized as an internal or external command, operable program or batch file", then this means you do not have the .pl file extension associated with …
Explain concepts important to validation of email addresses with regular expressions. Applies to most languages/tools that uses regular expressions. Consider email address RFCs: Look at HTML5 form input element (with type=email) regex pattern: T…

860 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question