Link to home
Create AccountLog in
Avatar of shragi
shragiFlag for India

asked on

remove duplicates from the csv file

Hi - I wrote the below perl script to get emp Id and emp date of join.
I write the output of the script into a csv file in the below format

100000001,20150601
100000002,20150101
100000231,20150101
100002431,20150101
100023545,20150301
100000021,20150101
100000031,20150101
100000051,20150418
102044353,20150401
100000054,20150601
104560071,20150301
100045671,20150301
100045671,20150301
100045671,20150601
215645567,20150301


if you observe the emp id 100045671 was repeated 3 times, but i want it only once and only the first occurence
so from the below three
100045671,20150301
100045671,20150301
100045671,20150601
i just wanted the first one
100045671,20150301

in order to achieve that how can i modify the script.
I just don't want any duplicate employee id's in the file, if there are duplicates i want just the first occurence and remove the remaining from the csv.

use strict;
use warnings;
use Pod::Usage;
use Getopt::Long;
use Time::Piece;

my $helpme = 0;
my $man = 0;

my $outputFileName = 'C:\\temp\\test_v1.csv';
my $inputFileName =  'C:\\temp\\test.txt';

my $errorcode = 0;
my $DEBUG=0;

if(exists $ENV{DEBUG}) {
	$DEBUG = ($ENV{DEBUG} eq "") ? 0 : $ENV{DEBUG};
}

GetOptions('help' => \$helpme, 'man' => \$man, 'infile=s' => \$inputFileName, 'outfile=s' => \$outputFileName) or pod2usage(2);

pod2usage(1) if $helpme;
pod2usage(-verbose => 2) if $man;

die 'No input file name specified!' unless $inputFileName;
die 'No output file name specified!' unless $outputFileName;

open(INFILE, '<', $inputFileName) or die "Could not open input file: $!";
open(OUTFILE, '>', $outputFileName) or die "Could not open/create output file: $!";

while(<INFILE>) {
	chomp;
	
	if ((/^50\|([^|]+)\|/) || (/^51\|([^|]+)\|/)) {
		my $empID = $1;
		print OUTFILE $empID, ",";		
	}
	 if ((/^90\|([^|]+)\|/) || (/^91\|([^|]+)\|/)) {
		my $eDate = $1;
		my $dt = Time::Piece->strptime($eDate, '%m/%d/%Y');
		print OUTFILE $dt->strftime('%Y%m%d'), "\n";
	}
}

close INFILE;
close OUTFILE;

Open in new window


Thanks,
Avatar of ozo
ozo
Flag of United States of America image

$seen{$empID}++ || print OUTFILE $empID, ",";
Avatar of shragi

ASKER

OZO - there are two problems with your solution

1) Error - Global symbol "%seen" requires explicit package name at line 38
2) if the error is fixed and let say it ran well, then it will remove my emp id's but the dates that i get from other if conditions are not gone... so if your code works then my output will be something like

100000001,20150601
100000002,20150101
100000231,20150101
100002431,20150101
100023545,20150301
100000021,20150101
100000031,20150101
100000051,20150418
102044353,20150401
100000054,20150601
104560071,20150301
100045671,20150301
 ,20150301
 ,20150601
215645567,20150301

Thanks
$seen{$empID}++ && next;
Avatar of shragi

ASKER

how to get rid of that error - any ideas ?
You just need to declare %seen somewhere before the loop:
my %seen;

Open in new window

Avatar of shragi

ASKER

I fixed that issue, but looks like i did not ask the question properly so I am not getting what i am expecting... so here is the input file that i am using

05|IDC|MEMBER|MAT EAGLE|5324 AKFHADLKFHSDAKLJFHDSLKF||NEWYORK|NY|402291375
10|526|Health Cooperation||534526|I
51|100000128|002|MAT||EAGLE|||NEWYORK|NY|4022434234291375|2|NEW|New Addition| ||||292343244890|||| |08/30/1986
91|06/01/2015|12/31/2015|M|HGFGLT|450045|1503450|FGADSFSDSUPNE|||ASDFD65465465||| |||||0.00|0|D|0|D|0|D|06/01/2015|||0.00|08/01/2015|0.00||||Y||0.00
36|PLNSUM|1|101808208|AF;DLKFJ AKLDFJD|4ADFDSF6546501393366|01/01/2015|    |M|20|40|150|60|0|294890|
40|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
42|||||||
44||||||||
05|IDC|MEMBER|STEVE AFJHADKJFH|547 DFLAKJHKAJSD||RTGYH|ERT|654651435
10|IDP|Individual Sub-Group||1|||O
50|10000001|001|STEVE|DFB|BEAMAFDDSF||RTGYH|ERT|40W23452450331435|2| |100132413243289|||294124869|ADFD|ASDF||||||  ||02/21/1958||654657988|
90|01/01/2015|12/31/2015|M|HLT|3000|1000|KYHC SUPNE|||72341344131000101||0.00|0|D|0|D|0|D|01/01/2014|ISDFND||0.00|08/01/2015|0.00||||YERE|CADFADFSR|0.00
40||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
42|||||||
44||||||||

so when i update my script with the solution provided by ozo it did not remove the date part
solution like below

100045671,20150301
 20150301
 20150601
215645567,20150301

#!perl

use strict;
use warnings;
use Pod::Usage;
use Getopt::Long;
use Time::Piece;

my $helpme = 0;
my $man = 0;

my $outputFileName = 'C:\\temp\\test_v1.csv';
my $inputFileName =  'C:\\temp\\test.txt';

my $errorcode = 0;
my $DEBUG=0;

if(exists $ENV{DEBUG}) {
	$DEBUG = ($ENV{DEBUG} eq "") ? 0 : $ENV{DEBUG};
}

GetOptions('help' => \$helpme, 'man' => \$man, 'infile=s' => \$inputFileName, 'outfile=s' => \$outputFileName) or pod2usage(2);

pod2usage(1) if $helpme;
pod2usage(-verbose => 2) if $man;

die 'No input file name specified!' unless $inputFileName;
die 'No output file name specified!' unless $outputFileName;

open(INFILE, '<', $inputFileName) or die "Could not open input file: $!";
open(OUTFILE, '>', $outputFileName) or die "Could not open/create output file: $!";

while(<INFILE>) {
	chomp;
	
	if ((/^50\|([^|]+)\|/) || (/^51\|([^|]+)\|/)) {
		my $empID = $1;
		$seen{$empID}++ && next; 	
	}
	 if ((/^90\|([^|]+)\|/) || (/^91\|([^|]+)\|/)) {
		my $eDate = $1;
		my $dt = Time::Piece->strptime($eDate, '%m/%d/%Y');
		print OUTFILE $dt->strftime('%Y%m%d'), "\n";
	}
}

close INFILE;
close OUTFILE;

Open in new window

my $empID = $1;
                $seen{$empID}++ && next;       
            print OUTFILE $empID, ",";
Avatar of shragi

ASKER

yes i did that ozo, no effect.
but if you check the script

let's say we are checking line that starts with 50 and lets say empid is seen so "&&next"
 will go to next line which is line that starts with 90 (second if condition in script)
as it passes second if condition it gets the date - so now i have no emp id but have date

Thanks,
SOLUTION
Avatar of ozo
ozo
Flag of United States of America image

Link to home
membership
Create an account to see this answer
Signing up is free. No credit card required.
Create Account
Avatar of shragi

ASKER

OZO I just figured out another solution, it worked can you please check whether i made any mistakes I mean does it gonna fail ever.

#!perl

use strict;
use warnings;
use Pod::Usage;
use Getopt::Long;
use Time::Piece;

my $helpme = 0;
my $man = 0;

my $outputFileName = 'C:\\temp\\test_v1.csv';
my $inputFileName =  'C:\\temp\\test.txt';

my $errorcode = 0;
my $DEBUG=0;
my $tempflag = 0;

if(exists $ENV{DEBUG}) {
	$DEBUG = ($ENV{DEBUG} eq "") ? 0 : $ENV{DEBUG};
}

GetOptions('help' => \$helpme, 'man' => \$man, 'infile=s' => \$inputFileName, 'outfile=s' => \$outputFileName) or pod2usage(2);

pod2usage(1) if $helpme;
pod2usage(-verbose => 2) if $man;

die 'No input file name specified!' unless $inputFileName;
die 'No output file name specified!' unless $outputFileName;

open(INFILE, '<', $inputFileName) or die "Could not open input file: $!";
open(OUTFILE, '>', $outputFileName) or die "Could not open/create output file: $!";

while(<INFILE>) {
	chomp;
	
	if (((/^50\|([^|]+)\|/) || (/^51\|([^|]+)\|/)) && ($tempflag == 0)) {
		my $empID = $1;
		if($seen{$empID}++){
			$tempflag++ ; 	
			next;
		}
		print OUTFILE $empID, ",";
	}
	 if ((/^90\|([^|]+)\|/) || (/^91\|([^|]+)\|/)) {
		my $eDate = $1;
		my $dt = Time::Piece->strptime($eDate, '%m/%d/%Y');
		if($tempflag == 0){
			print OUTFILE $dt->strftime('%Y%m%d'), "\n";
		}		
		if($tempflag > 0){
			$tempflag = 0;
		}
	}
}

close INFILE;
close OUTFILE;

Open in new window

ASKER CERTIFIED SOLUTION
Link to home
membership
Create an account to see this answer
Signing up is free. No credit card required.
Create Account
I think your way will work but it's certainly not as clean or efficient as ozo/my solution (I always hate when someone else responds with the same answer while I'm typing it out).
Avatar of shragi

ASKER

Thanks OZO and wilcoxon - both solution are same and it worked along with mine. I will use your solution as it is much cleaner than mine.

Thanks,