Data Cleansing Records based on Phone No
Posted on 2009-04-06
I have a 800K prospect records which i want to run a limited deduplication process on as described below.
It is a flat file format with each master record having up to 5 telephone fields and it is on these fields alone the de dupe process is based.
1.Were a master record has the same tel nos as another master record (any) *one* master record is to be kept
2.A master record should be removed were all of its tel nos are contained in another master record that also has additional tel nos.
3.If a master record has 1 number that is contained in another master record, and one number that is unique, even if the other master record has lots more additional numbers the first master record should be kept (ie likely both these records will be kept)
So, telephone numbers are never removed, only whole master records. Also, tel nos from 2 master records are never combined. Also, a unique telephone number to the db should never be discarded, its master record needs to be kept. A record with a unique telephone no may include dupe telephone nos from other records were they also have more unique tel nos.
I have a draft algorithm made up, can you check if it gives the results i expect:
1. I normalize the tels into a seperate table with fields
-id of original master record (i created this, the records originally had no id)
2. i check the master records were all tels match exactly and remove the dupes from the master tbl (being carefull not to allow numbers duplicated within the master record fields to distort process). Any *one* of the identicle records will be kept here.
3. i create an extra field in this tel nos table called 'no of tels'. This is populated with the total number of tels that the master record has
4. i compare each tel no. record with the others, if the tel no matches another tel no. the one(s) with the largest "no of tel nos" is marked with a "keep" (in a new field in the tel no. table), while the others are marked with a "dupe"
5.i pull the master records back together with its tel no records, those master records were all tel nos have been marked with "dupe" are removed from the master table, those records were any tel nos are marked keep are kept
I realize this strategy of dedupe is not very thorough, **i really just want comment on whether my algorithm acheives what i am trying to do** - am i sure not to lose any unique tels etc...