We help IT Professionals succeed at work.

remove duplicates from the csv file

228 Views
Last Modified: 2015-08-13
Hi - I wrote the below perl script to get emp Id and emp date of join.
I write the output of the script into a csv file in the below format

100000001,20150601
100000002,20150101
100000231,20150101
100002431,20150101
100023545,20150301
100000021,20150101
100000031,20150101
100000051,20150418
102044353,20150401
100000054,20150601
104560071,20150301
100045671,20150301
100045671,20150301
100045671,20150601
215645567,20150301


if you observe the emp id 100045671 was repeated 3 times, but i want it only once and only the first occurence
so from the below three
100045671,20150301
100045671,20150301
100045671,20150601
i just wanted the first one
100045671,20150301

in order to achieve that how can i modify the script.
I just don't want any duplicate employee id's in the file, if there are duplicates i want just the first occurence and remove the remaining from the csv.

use strict;
use warnings;
use Pod::Usage;
use Getopt::Long;
use Time::Piece;

my $helpme = 0;
my $man = 0;

my $outputFileName = 'C:\\temp\\test_v1.csv';
my $inputFileName =  'C:\\temp\\test.txt';

my $errorcode = 0;
my $DEBUG=0;

if(exists $ENV{DEBUG}) {
	$DEBUG = ($ENV{DEBUG} eq "") ? 0 : $ENV{DEBUG};
}

GetOptions('help' => \$helpme, 'man' => \$man, 'infile=s' => \$inputFileName, 'outfile=s' => \$outputFileName) or pod2usage(2);

pod2usage(1) if $helpme;
pod2usage(-verbose => 2) if $man;

die 'No input file name specified!' unless $inputFileName;
die 'No output file name specified!' unless $outputFileName;

open(INFILE, '<', $inputFileName) or die "Could not open input file: $!";
open(OUTFILE, '>', $outputFileName) or die "Could not open/create output file: $!";

while(<INFILE>) {
	chomp;
	
	if ((/^50\|([^|]+)\|/) || (/^51\|([^|]+)\|/)) {
		my $empID = $1;
		print OUTFILE $empID, ",";		
	}
	 if ((/^90\|([^|]+)\|/) || (/^91\|([^|]+)\|/)) {
		my $eDate = $1;
		my $dt = Time::Piece->strptime($eDate, '%m/%d/%Y');
		print OUTFILE $dt->strftime('%Y%m%d'), "\n";
	}
}

close INFILE;
close OUTFILE;

Open in new window


Thanks,
Comment
Watch Question

ozo
CERTIFIED EXPERT
Most Valuable Expert 2014
Top Expert 2015

Commented:
$seen{$empID}++ || print OUTFILE $empID, ",";

Author

Commented:
OZO - there are two problems with your solution

1) Error - Global symbol "%seen" requires explicit package name at line 38
2) if the error is fixed and let say it ran well, then it will remove my emp id's but the dates that i get from other if conditions are not gone... so if your code works then my output will be something like

100000001,20150601
100000002,20150101
100000231,20150101
100002431,20150101
100023545,20150301
100000021,20150101
100000031,20150101
100000051,20150418
102044353,20150401
100000054,20150601
104560071,20150301
100045671,20150301
 ,20150301
 ,20150601
215645567,20150301

Thanks
ozo
CERTIFIED EXPERT
Most Valuable Expert 2014
Top Expert 2015

Commented:
$seen{$empID}++ && next;

Author

Commented:
how to get rid of that error - any ideas ?
CERTIFIED EXPERT

Commented:
You just need to declare %seen somewhere before the loop:
my %seen;

Open in new window

Author

Commented:
I fixed that issue, but looks like i did not ask the question properly so I am not getting what i am expecting... so here is the input file that i am using

05|IDC|MEMBER|MAT EAGLE|5324 AKFHADLKFHSDAKLJFHDSLKF||NEWYORK|NY|402291375
10|526|Health Cooperation||534526|I
51|100000128|002|MAT||EAGLE|||NEWYORK|NY|4022434234291375|2|NEW|New Addition| ||||292343244890|||| |08/30/1986
91|06/01/2015|12/31/2015|M|HGFGLT|450045|1503450|FGADSFSDSUPNE|||ASDFD65465465||| |||||0.00|0|D|0|D|0|D|06/01/2015|||0.00|08/01/2015|0.00||||Y||0.00
36|PLNSUM|1|101808208|AF;DLKFJ AKLDFJD|4ADFDSF6546501393366|01/01/2015|    |M|20|40|150|60|0|294890|
40|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
42|||||||
44||||||||
05|IDC|MEMBER|STEVE AFJHADKJFH|547 DFLAKJHKAJSD||RTGYH|ERT|654651435
10|IDP|Individual Sub-Group||1|||O
50|10000001|001|STEVE|DFB|BEAMAFDDSF||RTGYH|ERT|40W23452450331435|2| |100132413243289|||294124869|ADFD|ASDF||||||  ||02/21/1958||654657988|
90|01/01/2015|12/31/2015|M|HLT|3000|1000|KYHC SUPNE|||72341344131000101||0.00|0|D|0|D|0|D|01/01/2014|ISDFND||0.00|08/01/2015|0.00||||YERE|CADFADFSR|0.00
40||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
42|||||||
44||||||||

so when i update my script with the solution provided by ozo it did not remove the date part
solution like below

100045671,20150301
 20150301
 20150601
215645567,20150301

#!perl

use strict;
use warnings;
use Pod::Usage;
use Getopt::Long;
use Time::Piece;

my $helpme = 0;
my $man = 0;

my $outputFileName = 'C:\\temp\\test_v1.csv';
my $inputFileName =  'C:\\temp\\test.txt';

my $errorcode = 0;
my $DEBUG=0;

if(exists $ENV{DEBUG}) {
	$DEBUG = ($ENV{DEBUG} eq "") ? 0 : $ENV{DEBUG};
}

GetOptions('help' => \$helpme, 'man' => \$man, 'infile=s' => \$inputFileName, 'outfile=s' => \$outputFileName) or pod2usage(2);

pod2usage(1) if $helpme;
pod2usage(-verbose => 2) if $man;

die 'No input file name specified!' unless $inputFileName;
die 'No output file name specified!' unless $outputFileName;

open(INFILE, '<', $inputFileName) or die "Could not open input file: $!";
open(OUTFILE, '>', $outputFileName) or die "Could not open/create output file: $!";

while(<INFILE>) {
	chomp;
	
	if ((/^50\|([^|]+)\|/) || (/^51\|([^|]+)\|/)) {
		my $empID = $1;
		$seen{$empID}++ && next; 	
	}
	 if ((/^90\|([^|]+)\|/) || (/^91\|([^|]+)\|/)) {
		my $eDate = $1;
		my $dt = Time::Piece->strptime($eDate, '%m/%d/%Y');
		print OUTFILE $dt->strftime('%Y%m%d'), "\n";
	}
}

close INFILE;
close OUTFILE;

Open in new window

ozo
CERTIFIED EXPERT
Most Valuable Expert 2014
Top Expert 2015

Commented:
my $empID = $1;
                $seen{$empID}++ && next;       
            print OUTFILE $empID, ",";

Author

Commented:
yes i did that ozo, no effect.
but if you check the script

let's say we are checking line that starts with 50 and lets say empid is seen so "&&next"
 will go to next line which is line that starts with 90 (second if condition in script)
as it passes second if condition it gets the date - so now i have no emp id but have date

Thanks,
ozo
CERTIFIED EXPERT
Most Valuable Expert 2014
Top Expert 2015
Commented:
This one is on us!
(Get your first solution completely free - no credit card required)
UNLOCK SOLUTION

Author

Commented:
OZO I just figured out another solution, it worked can you please check whether i made any mistakes I mean does it gonna fail ever.

#!perl

use strict;
use warnings;
use Pod::Usage;
use Getopt::Long;
use Time::Piece;

my $helpme = 0;
my $man = 0;

my $outputFileName = 'C:\\temp\\test_v1.csv';
my $inputFileName =  'C:\\temp\\test.txt';

my $errorcode = 0;
my $DEBUG=0;
my $tempflag = 0;

if(exists $ENV{DEBUG}) {
	$DEBUG = ($ENV{DEBUG} eq "") ? 0 : $ENV{DEBUG};
}

GetOptions('help' => \$helpme, 'man' => \$man, 'infile=s' => \$inputFileName, 'outfile=s' => \$outputFileName) or pod2usage(2);

pod2usage(1) if $helpme;
pod2usage(-verbose => 2) if $man;

die 'No input file name specified!' unless $inputFileName;
die 'No output file name specified!' unless $outputFileName;

open(INFILE, '<', $inputFileName) or die "Could not open input file: $!";
open(OUTFILE, '>', $outputFileName) or die "Could not open/create output file: $!";

while(<INFILE>) {
	chomp;
	
	if (((/^50\|([^|]+)\|/) || (/^51\|([^|]+)\|/)) && ($tempflag == 0)) {
		my $empID = $1;
		if($seen{$empID}++){
			$tempflag++ ; 	
			next;
		}
		print OUTFILE $empID, ",";
	}
	 if ((/^90\|([^|]+)\|/) || (/^91\|([^|]+)\|/)) {
		my $eDate = $1;
		my $dt = Time::Piece->strptime($eDate, '%m/%d/%Y');
		if($tempflag == 0){
			print OUTFILE $dt->strftime('%Y%m%d'), "\n";
		}		
		if($tempflag > 0){
			$tempflag = 0;
		}
	}
}

close INFILE;
close OUTFILE;

Open in new window

CERTIFIED EXPERT
Commented:
This one is on us!
(Get your first solution completely free - no credit card required)
UNLOCK SOLUTION
CERTIFIED EXPERT

Commented:
I think your way will work but it's certainly not as clean or efficient as ozo/my solution (I always hate when someone else responds with the same answer while I'm typing it out).

Author

Commented:
Thanks OZO and wilcoxon - both solution are same and it worked along with mine. I will use your solution as it is much cleaner than mine.

Thanks,
Unlock the solution to this question.
Join our community and discover your potential

Experts Exchange is the only place where you can interact directly with leading experts in the technology field. Become a member today and access the collective knowledge of thousands of technology experts.

*This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

OR

Please enter a first name

Please enter a last name

8+ characters (letters, numbers, and a symbol)

By clicking, you agree to the Terms of Use and Privacy Policy.