Solved

find duplicates in pipe delimited file based on first value

Posted on 2009-07-14
Last Modified: 2012-05-07
Hi all,

I have a Perl script that looks through a flat file and finds duplicate lines. It writes the dupes it finds to a text file.
I am trying to change it so that it finds duplicates based on the first value in the pipe-delimited file rather than on the whole line.

FIND DUPES SCRIPT:

open(FILE,"test.txt") || die "$!";
%seen =();
$line=0 ;
while (<FILE>) {
  $seen{$_}++;
  $line++;
  ## output dupes to text file
  open (MYFILE, '>>dupes.txt');
  print MYFILE "line $line : $_" if $seen{$_} > 1 ;
  close (MYFILE);
}

So as an example if I have a flat file with the following data:

774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2007||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2006||
773234||Burpy|||n||||||n|0|05/30/2006||

When checking for duplicate lines, my result file contains the following, which
is correct. It lists the exact duplicate "lines".

line 2 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
line 3 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
line 6 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
line 7 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||


What I need to do is get dupe lines based on the very first value which
is the ID number. So my output file should instead show:

line 2 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
line 3 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
line 4 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2007||
line 6 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
line 7 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
line 8 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2006||

Lastly, I am new to Perl, and I would like to learn more about how Perl expressions work. I'm finding Perl very handy for setting up fast little utility scripts to process large files (on Windows using ActivePerl), but I'm a complete newb.
Question by:binovpd
3 Comments
 
LVL 40

Accepted Solution

by:
mrjoltcola earned 75 total points
ID: 24852533
Instead do:

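open(FILE,"test.txt") || die "$!";
## open the output file once, before the loop
open (MYFILE, '>>dupes.txt') || die "$!";
%seen =();
$line=0 ;
while (<FILE>) {
  $line++;
  ## capture the leading digits (the ID) into $1; skip lines that have no ID
  next unless /^(\d+)/;
  $seen{$1}++;
  ## output dupes to text file
  print MYFILE "line $line : $_" if $seen{$1} > 1 ;
}
close (MYFILE);
close (FILE);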
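
For readers wondering what the /^(\d+)/ match in the script above does, here is a minimal sketch using one sample line from the question. The ^ anchors the match at the start of the line, \d+ matches one or more digits, and the parentheses capture those digits into $1. The helper name leading_id, the variable $record, and the second sample line are made up for illustration. Note that a failed match does not reset $1, so it is safer to check that the match succeeded before using the capture.

```perl
use strict;
use warnings;

# One sample record from the question, plus a made-up line with no leading ID.
my @records = (
    "774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||",
    "no id here|x|y",
);

# Return the leading run of digits, or undef if the line does not start with one.
# (leading_id is a hypothetical helper, not part of the original script.)
sub leading_id {
    my ($record) = @_;
    # ^      anchor at the start of the string
    # (\d+)  capture one or more digits into $1
    return $record =~ /^(\d+)/ ? $1 : undef;
}

for my $record (@records) {
    my $id = leading_id($record);
    print defined $id ? "ID: $id\n" : "no leading ID\n";
}
```

The first record yields the ID 774143; the second has no leading digits, so the helper returns undef instead of a stale capture.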

 
LVL 84

Assisted Solution

by:ozo
ozo earned 50 total points
ID: 24852633
open(FILE,"test.txt") || die "$!";
open (MYFILE, '>>dupes.txt') || die $!;
%seen =();
while (<FILE>) {
  ## output dupes to text file
  print MYFILE "line $. : $_" if $seen{(/(\d+)/)[0]}++ ;
}
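
The one-line test above packs several steps together, so a rough unpacked version may help; treat it as a sketch rather than ozo's exact code. The names find_dupes, $in, $out, $row, and $id are introduced here for illustration, and the sample data is shortened. (/(\d+)/)[0] runs the match in list context and takes the first captured group, $seen{...}++ returns the count from before the increment (so it is false only the first time an ID appears), and $. is Perl's built-in line number for the last filehandle read. The input is read in a single pass. The "next unless defined" guard is an addition that the one-liner does not have.

```perl
use strict;
use warnings;

# Print every line whose first run of digits has already been seen.
# $in and $out are already-open filehandles (hypothetical names).
sub find_dupes {
    my ($in, $out) = @_;
    my %seen;
    while (my $row = <$in>) {
        # A list-context match returns the captured groups; keep the first.
        my ($id) = $row =~ /(\d+)/;
        next unless defined $id;    # added guard: skip lines with no digits

        # Post-increment yields the old count: 0 (false) on first sight.
        my $seen_before = $seen{$id}++;

        # $. holds the line number of the last filehandle read.
        print {$out} "line $. : $row" if $seen_before;
    }
}

# Usage sketch with in-memory filehandles standing in for test.txt and dupes.txt.
my $data = "774143||A||\n774143||B||\n773234||Burpy||\n";
open(my $in,  '<', \$data)      or die $!;
open(my $out, '>', \my $report) or die $!;
find_dupes($in, $out);
close($out);
print $report;
```

Because the counter is checked before it is incremented, the first occurrence of each ID is never reported, which matches the output the question asks for.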
 

Author Closing Comment

by:binovpd
ID: 31603483
Thanks mrjoltcola and ozo. Both solutions work fine, so I split the points between you. I gave mrjoltcola a bit more since he answered first.

ozo, your solution is interesting: you look through the original file, then reparse the output results and look through that.

If I may ask, I'm trying to understand Perl expressions. What does /^(\d+)/; do?