Solved

Data Cleansing Records based on Phone No

Posted on 2009-04-06
2
811 Views
Last Modified: 2013-11-15
I have a 800K prospect records which i want to run a limited deduplication process on as described below.

It is a flat file format with each master record having up to 5 telephone fields and it is on these fields alone the de dupe process is based.

Aims:

1.Were a master record has the same tel nos as another master record (any) *one* master record is to be kept
2.A master record should be removed were all of its tel nos are contained in another master record that also has additional tel nos.
3.If a master record has 1 number that is contained in another master record, and one number that is unique, even if the other master record has lots more additional numbers the first master record should be kept (ie likely both these records will be kept)

So, telephone numbers are never removed, only whole master records. Also, tel nos from 2 master records are never combined. Also, a unique telephone number to the db should never be discarded, its master record needs to be kept. A record with a unique telephone no may include dupe telephone nos from other records were they also have more unique tel nos.
 
I have a draft algorithm made up, can you check if it gives the results i expect:

here goes:

1. I normalize the tels into a seperate table with fields
-id of original master record (i created this, the records originally had no id)
-tel no
2. i check the master records were all tels match exactly and remove the dupes from the master tbl (being carefull not to allow numbers duplicated within the master record fields to distort process). Any *one* of the identicle records will be kept here.
3. i create an extra field in this tel nos table called 'no of tels'. This is populated with the total number of tels that the master record has
4. i compare each tel no. record with the others, if the tel no matches another tel no. the one(s) with the largest "no of tel nos" is marked with a "keep" (in a new field in the tel no. table), while the others are marked with a "dupe"
5.i pull the master records back together with its tel no records, those master records were all tel nos have been marked with "dupe" are removed from the master table, those records were any tel nos are marked keep are kept

I realize this strategy of dedupe is not very thorough, **i really just want comment on whether my algorithm acheives what i am trying to do** - am i sure not to lose any unique tels etc...
0
Comment
Question by:garyb2008
2 Comments
 
LVL 51

Accepted Solution

by:
Mark Wills earned 400 total points
ID: 24088391
Is the flat file the only source of data (ie an initial build or migrate) ? or is it being applied against an existing database (ie an ongoing update process) ?

Your process sounds about right... My initial thought / approach would be :

1) initial load of flat file into "staging" area, use of ID identity / auto_number column to uniquely identify rows.
2) validate / format phone numbers to consistant format (e.g. remove imbedded spaces, country codes, brakets, etc)
3) remove / flag rows where validation fails (and then have to decide what to do with those - for the remainder of this process they are ignored)
4) remove / flag 'lesser' rows where a phone number exists for a different ID and that ID has at least the same number of (different) phone number entries.
5) remaining are new ID's with any one phone number existing only once in the staging table
6) could then repeat step 3 by coparing the "live" table with the staging table
7) update live data table with staging data


The big challenge is deciding which is the best method for deciding if any one of my phone numbers exists in any other row. Could either create a new table as you have previously said, or simply use a subquery to effectively create a derived table. either which way, it is probably the same subquery to either build a temp table, or use in a query. That will depend pretty much on row counts and at potentially 800K rows * 5 numbers there is a reasonably significant number of entries to consider.

So, that subquery is probably something like :

select * from
(
  select id, phone1 as valid_phone from my_staging_table where phone1 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone2 from my_staging_table where phone2 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone3 from my_staging_table where phone3 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone4 from my_staging_table where phone4 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone5 from my_staging_table where phone5 is not NULL and valid_row_flag = 'Y'
) as P

0
 
LVL 69

Assisted Solution

by:Scott Pletcher
Scott Pletcher earned 100 total points
ID: 24089031
Add these steps:

0) When creating the work table, make sure flag is defined as CHAR(1) or some other *fixed* length type (tinyint, etc.). Don't make it varchar and start it out as empty and fill it in later -- this could lead to page splits.

1B) Create a CLUSTERED index on (tel#, id)

1K) Mark as "K" (keep) all rows for any master record (mr) that has a unique tel#:

UPDATE tn1
SET flag = 'K'
FROM tablename tn1
WHERE NOT EXISTS (
SELECT 1
FROM tablename tn2
WHERE tn2.tel# = tn1.tel#
AND tn2.id <> tn1.id
)

[With a clus index built, this query should run VERY quickly.)

For all your future steps, add a "WHERE flag <> 'K'" condition.
0

Featured Post

Efficient way to get backups off site to Azure

This user guide provides instructions on how to deploy and configure both a StoneFly Scale Out NAS Enterprise Cloud Drive virtual machine and Veeam Cloud Connect in the Microsoft Azure Cloud.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Why is this different from all of the other step by step guides?  Because I make a living as a DBA and not as a writer and I lived through this experience. Defining the name: When I talk to people they say different names on this subject stuff l…
When table data gets too large to manage or queries take too long to execute the solution is often to buy bigger hardware or assign more CPUs and memory resources to the machine to solve the problem. However, the best, cheapest and most effective so…
Video by: Steve
Using examples as well as descriptions, step through each of the common simple join types, explaining differences in syntax, differences in expected outputs and showing how the queries run along with the actual outputs based upon a simple set of dem…
Viewers will learn how to use the SELECT statement in SQL to return specific rows and columns, with various degrees of sorting and limits in place.

730 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question