pdadddino (United States of America) asked:
Need to identify duplicate records in a comma-delimited file

I have a daily job that receives a file created by another group. The file is dropped into a directory for me to pick up and process with a Unix Korn shell (ksh) script.
In the ksh script the file gets massaged with awk and then bcp'd into a Sybase DB.
The file is supposed to have unique rows based on 4 fields, but sometimes we get duplicate records, and I need to check for them in the ksh script.
If a duplicate occurs I will email the contacts and tell them to produce a duplicate-free file.
My main concern is how to identify the duplicates!
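The check the asker describes can be sketched in portable shell. This is only a sketch: the key being columns 1-4, the filename `incoming.dat`, and the sample data are all assumptions, not details from the question.

```shell
#!/bin/ksh
# Sketch: assume the 4 key fields are columns 1-4 of the
# comma-delimited file (adjust -F and the printed fields to
# match the real layout).
file=incoming.dat

# Sample data for illustration only.
printf 'a,b,c,d,x\n1,2,3,4,y\na,b,c,d,z\n' > "$file"

# Project out the key fields, then let sort | uniq -d report
# any key that appears more than once.
dups=$(awk -F, '{print $1 FS $2 FS $3 FS $4}' "$file" | sort | uniq -d)

if [ -n "$dups" ]; then
    echo "duplicate keys found:"
    echo "$dups"
    # here the real script would email the contacts and stop
fi
```

Because the check keys only on the first four fields, two rows that differ in later columns (as `a,b,c,d,x` and `a,b,c,d,z` do above) are still flagged as duplicates.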
SOLUTION by Tintin (full text available to members only)
Mike R. commented:

I'm not sure I understand the file you are parsing (an example would be nice), but it may be as simple as the following.

Suppose <filename> contains:

a,d,c,d
1,2,3,4
w,x,y,z
1,2,3,4

where "1,2,3,4" is the duplicated line. It might be as simple as reading each whole line into a variable, grepping the file for "$line", and then parsing the fields out with awk.

That is, in the following, the variable "line" gets an entire line from the file (e.g. "1,2,3,4"), the script checks the file for additional instances of that line, and then goes on to use awk on "$line" to pull out the individual fields for the rest of the script.

cat <filename> | while read line
 do
  # Count exact-line matches; more than 1 means a duplicate.
  if [ `grep -c "^$line$" <filename>` -gt 1 ]
   then
    <email message>
  fi
  firstVar=`echo $line | awk -F, '{ print $1 }'`
  <etc>
  <etc>
  <etc>
 done
exit 0
rightmirem.

Your method will be *extremely* slow.  

$ wc -l file.dat
  235892 file.dat
$ tail -3 file.dat
ventconfd_event_channel/reg_door
a,b,c,d
a,b,c,d
$ time sort file.dat|uniq -d
a,b,c,d

real    0m4.909s
user    0m4.070s
sys     0m0.390s
$ cat dup.sh
#!/bin/sh
cat file.dat | while read line
do
  if [ `grep -c "^$line$" file.dat` -gt 1 ]
  then
     echo "Duplicate found: $line"
  fi
done
$ time ./dup.sh


I left dup.sh running for almost an hour and a half before I killed it; by then it had processed only 7,500 of the 235,000 lines. At that rate it would have taken almost 2 days to complete, versus 5 seconds for the sort/uniq solution!
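The per-line grep is slow because it rescans the whole file for every line, which is O(n^2). A single pass with awk's associative arrays avoids both the rescan and the sort. This is a sketch with made-up sample data:

```shell
# Sample data for illustration only.
printf 'a,b,c,d\n1,2,3,4\na,b,c,d\na,b,c,d\n' > file.dat

# seen[$0]++ counts occurrences of each whole line; the pattern is
# true only when the previous count was exactly 1, so each duplicate
# line is printed once, on its second occurrence.
dup_lines=$(awk 'seen[$0]++ == 1' file.dat)
echo "$dup_lines"

rm -f file.dat
```

Unlike `sort | uniq -d`, this preserves the original order of the file and needs only one pass, at the cost of holding each distinct line in memory.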
ASKER CERTIFIED SOLUTION (full text available to members only)
pdadddino (Asker) responded:

I ended up using nawk, but both suggestions were right on!
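The asker doesn't show the nawk version, but a duplicate check keyed on the first four fields might look like the sketch below (plain awk is used so it runs anywhere; the field positions, filename, and sample data are assumptions). Exiting nonzero lets the calling ksh script decide whether to send the email.

```shell
# Sample data for illustration only: rows 1 and 3 share the key 1,2,3,4.
printf '1,2,3,4,a\n5,6,7,8,b\n1,2,3,4,c\n' > daily.dat

# Report each repeated 4-field key and exit 1 if any was found.
rc=0
awk -F, '{
    key = $1 FS $2 FS $3 FS $4
    if (key in seen) { print "dup key: " key; status = 1 }
    seen[key] = 1
}
END { exit status }' daily.dat || rc=$?

echo "duplicate check exit status: $rc"
rm -f daily.dat
```

A ksh wrapper could then test `$rc` and run its mail command only when duplicates were reported.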