pdadddino
asked on
Need to identify dups in a comma-delimited file
I have a daily job that receives a file created by another group. The file is dropped into a directory for me to pick up and process using a Unix Korn shell (ksh) script.
In the ksh script the file gets massaged (awk) and then bcp'd into a Sybase DB.
The file is supposed to have unique rows based on 4 fields. Sometimes we get "dup" recs, and I need to check for a dup in the ksh.
If a dup occurs I will email the contacts and tell them to create a dup-free file.
My main concern is how to identify the "dups"!
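For context, a minimal sketch of the kind of pre-load guard being asked for is below. The file name, the contact address, and the assumption that the 4 key fields are the first 4 comma-delimited columns are all illustrative, not details from the post:

#!/bin/ksh
FILE=/incoming/daily_feed.csv          # assumed drop location
CONTACTS="feed-owners@example.com"     # assumed contact list

# Assume the 4 key fields are the first 4 comma-delimited columns.
dups=`cut -d, -f1-4 "$FILE" | sort | uniq -d`

if [ -n "$dups" ]; then
    # Dups found: notify the contacts and stop before the bcp step.
    printf "Duplicate keys found:\n%s\n" "$dups" | mailx -s "dup recs in daily feed" "$CONTACTS"
    exit 1
fi

# ...no dups: massage with awk and bcp into Sybase as usual...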
rightmirem, your method will be *extremely* slow.
$ wc -l file.dat
235892 file.dat
$ tail -3 file.dat
ventconfd_event_channel/reg_door
a,b,c,d
a,b,c,d
$ time sort file.dat|uniq -d
a,b,c,d
real 0m4.909s
user 0m4.070s
sys 0m0.390s
$ cat dup.sh
#!/bin/sh
# Naive check: re-scan the entire file once per line -- O(n^2).
cat file.dat | while read line
do
    if [ `grep -c "^$line$" file.dat` -gt 1 ]
    then
        echo "Duplicate found: $line"
    fi
done
$ time ./dup.sh
I left dup.sh running for almost an hour and a half before I killed it; at that point it had processed only 7500 of the roughly 235000 lines. At that rate it would have taken almost 2 days to complete, versus 5 seconds for the sort/uniq solution!
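One caveat: sort file.dat | uniq -d compares whole lines, while the question only needs uniqueness on 4 fields. Assuming those keys are the first four comma-delimited columns (an assumption; adjust the -f list to the real positions), the same idea still works in a single pipeline:

cut -d, -f1-4 file.dat | sort | uniq -d

This prints each duplicated 4-field key once, and stays in the seconds range on a file this size rather than days.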
ASKER
Ended up using nawk, but both suggestions were right on!
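The asker's nawk isn't shown, but a typical one-pass version of this check (a sketch, assuming the 4 key fields are the first four columns) looks like:

# seen[key]++ is 0 (false) the first time a key appears,
# so only the 2nd and later occurrences are printed.
nawk -F, 'seen[$1 "," $2 "," $3 "," $4]++ { print "dup: " $0 }' file.dat

Because the associative array does the bookkeeping, no sort is needed and the file is read exactly once.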
<filename> contains...
a,d,c,d
1,2,3,4
w,x,y,z
1,2,3,4
...where "1,2,3,4" are duplicate lines, it might be as simple as reading the whole line into a variable, grepping for the "$line", and then parsing things out with awk.
I.E. IN the following, the var "line" will get the entire line from the file ("1,2,3,4"), check the file for additional instances of it, and then go on to use awk from the var "file" for the rest of the script.
cat <filename> | while read line
do
    # more than one occurrence of this line means a dup
    if [ `grep -c "$line" <filename>` -gt 1 ]
    then
        <email message>
    fi
    # pull out the first comma-delimited field for the rest of the script
    firstVar=`echo $line | awk -F, '{ print $1 }'`
    <etc>
    <etc>
    <etc>
done
exit 0
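One caution about the grep above: without anchoring, it also counts partial matches ("1,2,3,4" would match a line "11,2,3,45"). Where grep supports the POSIX -x (whole-line match) and -F (fixed string, no regex metacharacters) flags, a safer test is:

if [ `grep -x -F -c "$line" <filename>` -gt 1 ]

And as noted above, for large files the sort | uniq -d approach will be dramatically faster than grepping the file once per line.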