
  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 396

Need to identify duplicates in a comma-delimited file

I have a daily job that receives a file created by another group. The file is dropped into a directory for me to pick up and process using a Unix Korn shell (ksh) script.
In the ksh script the file gets massaged with awk and then bcp'd into a Sybase DB.
The file is supposed to have unique rows based on 4 fields. Sometimes we get "dup" records, so I need to check for a dup in the ksh script.
If a dup occurs I will email the contacts and tell them to create a dup-free file.
My main concern is how to identify the "dups"!
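
The flow I'm picturing in the ksh script is roughly the sketch below (the file name and address are placeholders) -- the dup check itself is the part I'm missing:

#!/bin/ksh
# Sketch of the intended flow; names are placeholders.
FILE=/path/to/incoming/daily.csv
CONTACTS=contact@example.com

dups=""        # <-- need a way to find rows that repeat on the 4 key fields

if [ -n "$dups" ]
then
    # Tell the sending group and skip the load for this run.
    echo "$dups" | mailx -s "Duplicate records in daily file" "$CONTACTS"
    exit 1
fi

# Otherwise massage the file with awk and bcp it into Sybase as usual.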
Asked by: pdadddino
2 Solutions
 
Tintin commented:
Assuming your data looks something like:

a,b,c
one,two,three
d,e,f
one,two,three

Then you can determine duplicates by doing:

[ -n "`sort file.dat | uniq -d`" ] &&  mailx -s "Duplicates found in file" email@address
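
If a duplicate should also stop the load for that run, the same test can set the exit status -- a sketch:

dups=`sort file.dat | uniq -d`
if [ -n "$dups" ]
then
    echo "$dups" | mailx -s "Duplicates found in file" email@address
    exit 1      # skip the awk massage and the bcp for this file
fi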
 
Mike R. commented:
I am not sure I understand the file you are parsing (an example might be nice)...but maybe it is as simple as the following...

<filename> contains...
a,d,c,d
1,2,3,4
w,x,y,z
1,2,3,4

...where "1,2,3,4" is the duplicated line, it might be as simple as reading each whole line into a variable, grepping for "$line", and then parsing things out with awk.

I.e., in the following, the variable "line" gets the entire line from the file (e.g. "1,2,3,4"), the script checks the file for additional instances of it, and then goes on to pull the individual fields out of "$line" with awk for the rest of the script.

cat <filename> | while read line
 do
  # If this exact line text appears more than once in the file, flag it.
  if [ `grep -c "$line" <filename>` -gt 1 ]
   then
    <email message>
  fi
  # Pull the individual fields out of the line with awk.
  firstVar=`echo "$line" | awk -F, '{ print $1 }'`
  <etc>
  <etc>
  <etc>
 done
exit 0
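
One caveat with the sketch above: grep -c "$line" also counts substring and regex matches, so a row that happens to be contained in another row could be miscounted. With a grep that supports -x and -F (fgrep -x on older systems), the test becomes a literal whole-line match -- a drop-in replacement for the if test:

  if [ `grep -c -x -F "$line" <filename>` -gt 1 ]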
 
Tintin commented:
rightmirem.

Your method will be *extremely* slow.  

$ wc -l file.dat
  235892 file.dat
$ tail -3 file.dat
ventconfd_event_channel/reg_door
a,b,c,d
a,b,c,d
$ time sort file.dat|uniq -d
a,b,c,d

real    0m4.909s
user    0m4.070s
sys     0m0.390s
$ cat dup.sh
#!/bin/sh
cat file.dat | while read line
do
  if [ `grep -c "^$line$" file.dat` -gt 1 ]
  then
     echo "Duplicate found: $line"
  fi
done
$ time ./dup.sh


I left dup.sh running for almost an hour and a half before I killed it, at which point it had processed only 7,500 lines out of the 235,000 total. At that rate it would have taken almost 2 days to complete, instead of 5 seconds for the sort/uniq solution!
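
For what it's worth, a single pass in nawk with an associative array avoids both the sort and the per-line grep -- a sketch, keyed on the whole line:

# Print each line on its second and later appearances; exit non-zero if any dup was seen.
nawk 'seen[$0]++ { print; dup = 1 } END { exit dup }' file.dat

The exit status can then drive the mailx step the same way as the sort|uniq test.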
 
glassd commented:
If you need to identify duplicate fields, you could always run it through nawk first:

nawk 'BEGIN{RS=","}{print}' filename | sort | uniq -d
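
Note that with RS="," this reports individual field values that repeat anywhere in the file, not duplicate rows. If the goal is rows that repeat on the 4 key fields (assuming they are the first four columns), a keyed nawk variant is another option -- a sketch:

# Print every row whose first-four-field key has already been seen.
nawk -F, 'seen[$1 FS $2 FS $3 FS $4]++' filename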
 
pdadddino (Author) commented:
Ended up using nawk, but both suggestions were right on!
