Solved

Data Cleansing Records based on Phone No

Posted on 2009-04-06
2
801 Views
Last Modified: 2013-11-15
I have a 800K prospect records which i want to run a limited deduplication process on as described below.

It is a flat file format with each master record having up to 5 telephone fields and it is on these fields alone the de dupe process is based.

Aims:

1.Were a master record has the same tel nos as another master record (any) *one* master record is to be kept
2.A master record should be removed were all of its tel nos are contained in another master record that also has additional tel nos.
3.If a master record has 1 number that is contained in another master record, and one number that is unique, even if the other master record has lots more additional numbers the first master record should be kept (ie likely both these records will be kept)

So, telephone numbers are never removed, only whole master records. Also, tel nos from 2 master records are never combined. Also, a unique telephone number to the db should never be discarded, its master record needs to be kept. A record with a unique telephone no may include dupe telephone nos from other records were they also have more unique tel nos.
 
I have a draft algorithm made up, can you check if it gives the results i expect:

here goes:

1. I normalize the tels into a seperate table with fields
-id of original master record (i created this, the records originally had no id)
-tel no
2. i check the master records were all tels match exactly and remove the dupes from the master tbl (being carefull not to allow numbers duplicated within the master record fields to distort process). Any *one* of the identicle records will be kept here.
3. i create an extra field in this tel nos table called 'no of tels'. This is populated with the total number of tels that the master record has
4. i compare each tel no. record with the others, if the tel no matches another tel no. the one(s) with the largest "no of tel nos" is marked with a "keep" (in a new field in the tel no. table), while the others are marked with a "dupe"
5.i pull the master records back together with its tel no records, those master records were all tel nos have been marked with "dupe" are removed from the master table, those records were any tel nos are marked keep are kept

I realize this strategy of dedupe is not very thorough, **i really just want comment on whether my algorithm acheives what i am trying to do** - am i sure not to lose any unique tels etc...
0
Comment
Question by:garyb2008
2 Comments
 
LVL 51

Accepted Solution

by:
Mark Wills earned 400 total points
ID: 24088391
Is the flat file the only source of data (ie an initial build or migrate) ? or is it being applied against an existing database (ie an ongoing update process) ?

Your process sounds about right... My initial thought / approach would be :

1) initial load of flat file into "staging" area, use of ID identity / auto_number column to uniquely identify rows.
2) validate / format phone numbers to consistant format (e.g. remove imbedded spaces, country codes, brakets, etc)
3) remove / flag rows where validation fails (and then have to decide what to do with those - for the remainder of this process they are ignored)
4) remove / flag 'lesser' rows where a phone number exists for a different ID and that ID has at least the same number of (different) phone number entries.
5) remaining are new ID's with any one phone number existing only once in the staging table
6) could then repeat step 3 by coparing the "live" table with the staging table
7) update live data table with staging data


The big challenge is deciding which is the best method for deciding if any one of my phone numbers exists in any other row. Could either create a new table as you have previously said, or simply use a subquery to effectively create a derived table. either which way, it is probably the same subquery to either build a temp table, or use in a query. That will depend pretty much on row counts and at potentially 800K rows * 5 numbers there is a reasonably significant number of entries to consider.

So, that subquery is probably something like :

select * from
(
  select id, phone1 as valid_phone from my_staging_table where phone1 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone2 from my_staging_table where phone2 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone3 from my_staging_table where phone3 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone4 from my_staging_table where phone4 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone5 from my_staging_table where phone5 is not NULL and valid_row_flag = 'Y'
) as P

0
 
LVL 69

Assisted Solution

by:Scott Pletcher
Scott Pletcher earned 100 total points
ID: 24089031
Add these steps:

0) When creating the work table, make sure flag is defined as CHAR(1) or some other *fixed* length type (tinyint, etc.). Don't make it varchar and start it out as empty and fill it in later -- this could lead to page splits.

1B) Create a CLUSTERED index on (tel#, id)

1K) Mark as "K" (keep) all rows for any master record (mr) that has a unique tel#:

UPDATE tn1
SET flag = 'K'
FROM tablename tn1
WHERE NOT EXISTS (
SELECT 1
FROM tablename tn2
WHERE tn2.tel# = tn1.tel#
AND tn2.id <> tn1.id
)

[With a clus index built, this query should run VERY quickly.)

For all your future steps, add a "WHERE flag <> 'K'" condition.
0

Featured Post

3 Use Cases for Connected Systems

Our Dev teams are like yours. They’re continually cranking out code for new features/bugs fixes, testing, deploying, testing some more, responding to production monitoring events and more. It’s complex. So, we thought you’d like to see what’s working for us.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

These days, all we hear about hacktivists took down so and so websites and retrieved thousands of user’s data. One of the techniques to get unauthorized access to database is by performing SQL injection. This article is quite lengthy which gives bas…
As technology users and professionals, we’re always learning. Our universal interest in advancing our knowledge of the trade is unmatched by most industries. It’s a curiosity that makes sense, given the climate of change. Within that, there lies a…
Via a live example, show how to set up a backup for SQL Server using a Maintenance Plan and how to schedule the job into SQL Server Agent.
Viewers will learn how to use the UPDATE and DELETE statements to change or remove existing data from their tables. Make a table: Update a specific column given a specific row using the UPDATE statement: Remove a set of values using the DELETE s…

825 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question