Solved

Find duplicates in a pipe-delimited file based on the first value

Posted on 2009-07-14
Last Modified: 2012-05-07
Hi all,

I have a Perl script that reads through a flat file and finds duplicate lines. It writes the dupes it finds to a text file.
I am trying to change it so that it finds duplicates based on the first value in the pipe-delimited file rather than on the whole line.

FIND DUPES SCRIPT:

open(FILE,"test.txt") || die "$!";
%seen =();
$line=0 ;
while (<FILE>) {
  $seen{$_}++;
  $line++;
  ## output dupes to text file
  open (MYFILE, '>>dupes.txt');
  print MYFILE "line $line : $_" if $seen{$_} > 1 ;
  close (MYFILE);
}

So as an example if I have a flat file with the following data:

774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2007||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2006||
773234||Burpy|||n||||||n|0|05/30/2006||

When checking for dupe lines, my result file contains the following, which
is correct. Here we find the exact duplicate "lines".

line 2 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
line 3 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
line 6 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
line 7 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||


What I need to do is get dupe lines based on the very first value which
is the ID number. So my output file should instead show:

line 2 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
line 3 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
line 4 : 774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2007||
line 6 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
line 7 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
line 8 : 773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2006||

Lastly, I am new to Perl, but I would like to learn more about how Perl expressions work. I'm finding Perl very handy for setting up fast little utility scripts to process large files (on Windows, using ActivePerl), but I'm a complete newb.
Question by:binovpd
3 Comments
 

Accepted Solution

by:
mrjoltcola earned 300 total points
ID: 24852533
Instead do:

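open(FILE,"test.txt") || die "$!";
## open the output file once, outside the loop
open (MYFILE, '>>dupes.txt') || die "$!";
%seen =();
$line=0 ;
while (<FILE>) {
  $line++;
  ## capture the leading ID; skip lines without one so a stale $1 is never reused
  next unless /^(\d+)/;
  $seen{$1}++;
  ## output dupes to text file
  print MYFILE "line $line : $_" if $seen{$1} > 1 ;
}
close (MYFILE);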
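
For comparison, here is a minimal sketch of the same idea that takes the key from the first pipe-delimited field with split instead of a regex. It is not from the thread: the helper name dupes_by_first_field is hypothetical, and the sample rows are copied from the question so the result can be checked against the output the asker wants.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns "line N : text" entries
# for every line whose first pipe-delimited field has been seen before.
sub dupes_by_first_field {
    my @lines = @_;
    my (%seen, @dupes);
    my $line_no = 0;
    for my $line (@lines) {
        $line_no++;
        # A limit of 2 splits off only the first field from the rest.
        my ($id) = split /\|/, $line, 2;
        next unless defined $id && $id =~ /\S/;   # skip blank lines
        # Post-increment yields the old count: 0 (false) on first sight.
        push @dupes, "line $line_no : $line" if $seen{$id}++;
    }
    return @dupes;
}

# Sample rows taken from the question.
my @sample = map { "$_\n" } split /\n/, <<'END';
774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||
774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2007||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2009||
773752||Dream Generation: Koi Ka? Shigoto Ka?|||n||||||n|0|05/30/2006||
773234||Burpy|||n||||||n|0|05/30/2006||
END

my @found = dupes_by_first_field(@sample);
print @found;
```

With the sample above this reports lines 2, 3, 4, 6, 7 and 8. Unlike the regex, split does not require the first field to be numeric, which may or may not be what you want.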


Assisted Solution

by:ozo
ozo earned 200 total points
ID: 24852633
open(FILE,"test.txt") || die "$!";
open (MYFILE, '>>dupes.txt') || die $!;
%seen =();
while (<FILE>) {
  ## output dupes to text file
  print MYFILE "line $. : $_" if $seen{(/(\d+)/)[0]}++ ;
}
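
The compact version above relies on two Perl features: the special variable $. holds the line number of the last filehandle read, and the post-increment $seen{...}++ returns the count before incrementing, so it is false the first time an ID appears. The sketch below spells those steps out one at a time. It is illustrative only: the sample data is made up, and it reads from an in-memory filehandle rather than test.txt so it can run on its own.

```perl
use strict;
use warnings;

# Made-up sample data, read through an in-memory filehandle.
my $data = "774143|a\n774143|b\n773234|c\n774143|d\n";
open my $in, '<', \$data or die "Cannot open in-memory file: $!";

my (%seen, @report);
while (my $line = <$in>) {
    # In list context the match returns the captured digits (or nothing).
    my ($id) = $line =~ /^(\d+)/;
    next unless defined $id;

    # Post-increment returns the old count: 0 on first sight, 1+ afterwards.
    my $count_before = $seen{$id}++;

    # $. is the current line number of the filehandle being read.
    push @report, "line $. : $line" if $count_before;
}
close $in;

print @report;
```

Only one pass over the input is made; the output file is never read back.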

Author Closing Comment

by:binovpd
ID: 31603483
Thanks, mrjoltcola and ozo. Both solutions work fine, so I split the points between you. I gave mrjoltcola a bit more since he answered first.

ozo, your solution is interesting. You look through the original file, then reparse through the output results and look through that.

If I may ask, I'm trying to understand Perl expressions. What does /^(\d+)/; do?
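
The thread does not include a reply to this, so here is a short sketch of what the pattern does, based on standard Perl regex behavior: ^ anchors the match at the start of the line, \d+ matches one or more digits, and the parentheses capture those digits into $1. The variable names below are made up for illustration; the first sample line is taken from the question.

```perl
use strict;
use warnings;

my $line = "774143||Mahou Tsukai Ninaru Houhou|||n||||||n|0|05/30/2009||";

# ^    anchors the match at the start of the string
# \d+  matches one or more digits
# ( )  captures whatever the enclosed part matched into $1
if ($line =~ /^(\d+)/) {
    print "ID is $1\n";
}

# In list context the match returns the captures directly.
my ($id) = $line =~ /^(\d+)/;

# A line that does not start with a digit does not match at all,
# and $1 is left untouched in that case (it keeps its earlier value).
my $no_id   = "||no leading digits";
my $matched = ($no_id =~ /^(\d+)/) ? 1 : 0;
print "matched: $matched\n";
```

Because a failed match leaves $1 unchanged, it is safer to test whether the match succeeded before using $1.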