Solved

Data Cleansing Records based on Phone No

Posted on 2009-04-06
2,817 Views
Last Modified: 2013-11-15
I have 800K prospect records on which I want to run a limited deduplication process, as described below.

It is a flat file format, with each master record having up to 5 telephone fields, and it is on these fields alone that the dedupe process is based.

Aims:

1. Where a master record has the same tel nos as another (any) master record, *one* master record is to be kept.
2. A master record should be removed where all of its tel nos are contained in another master record that also has additional tel nos.
3. If a master record has one number that is contained in another master record and one number that is unique, the first master record should be kept, even if the other master record has many more additional numbers (i.e. likely both of these records will be kept).

So, telephone numbers are never removed, only whole master records, and tel nos from two master records are never combined. Also, a telephone number unique within the db should never be discarded; its master record needs to be kept. A record with a unique telephone no may also contain telephone nos duplicated in other records, where those records have further unique tel nos of their own.
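The three rules above amount to set containment between records' phone-number sets: a record is dropped only when its numbers are a proper subset of another record's, or when two records' sets are identical (keep one). A minimal sketch of that decision, with hypothetical record IDs and numbers:

```python
def deduplicate(records):
    """records: dict mapping record id -> set of phone numbers.
    Keep a record unless all of its numbers are contained in another
    record (rule 2); among records with identical sets, keep the one
    with the lowest id (rule 1). A record with any unique number always
    survives (rule 3), since its set can't be a subset of anyone else's."""
    keep = set()
    for rid, phones in records.items():
        dominated = False
        for other, other_phones in records.items():
            if other == rid:
                continue
            if phones < other_phones:                       # proper subset: rule 2
                dominated = True
                break
            if phones == other_phones and other < rid:      # exact tie: rule 1
                dominated = True
                break
        if not dominated:
            keep.add(rid)
    return keep
```

For example, with records 1 and 2 holding {111, 222}, record 3 holding {111}, and record 4 holding {111, 333}: one of the identical pair is kept, record 3 is dropped as a subset of record 1, and record 4 survives because 333 is unique to it.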
 
I have a draft algorithm made up; can you check if it gives the results I expect:

here goes:

1. I normalize the tels into a separate table with fields:
-id of original master record (I created this; the records originally had no id)
-tel no
2. I check for master records where all tels match exactly and remove the dupes from the master tbl (being careful not to let numbers duplicated within a master record's own fields distort the process). Any *one* of the identical records is kept here.
3. I create an extra field in this tel nos table called 'no of tels', populated with the total number of tels that the master record has.
4. I compare each tel no record with the others; if the tel no matches another tel no, the one(s) with the largest 'no of tels' is marked "keep" (in a new field in the tel no table), while the others are marked "dupe".
5. I pull the master records back together with their tel no records; master records where all tel nos have been marked "dupe" are removed from the master table, while records where any tel no is marked "keep" are kept.
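The steps above can be sketched end to end in SQLite via Python's sqlite3 (table and column names hypothetical; step 2, removal of exactly identical records, is assumed to have already run):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE master (id INTEGER PRIMARY KEY,
                     tel1 TEXT, tel2 TEXT, tel3 TEXT, tel4 TEXT, tel5 TEXT);
CREATE TABLE tels (master_id INTEGER, tel TEXT, n_tels INTEGER, flag TEXT);
""")
con.executemany("INSERT INTO master VALUES (?,?,?,?,?,?)",
                [(1, '111', '222', None, None, None),
                 (2, '111', None,  None, None, None),
                 (3, '111', '333', None, None, None)])
# Step 1: normalize into (master_id, tel) rows. UNION (not UNION ALL)
# also collapses numbers duplicated within one record's fields.
con.execute("""
INSERT INTO tels (master_id, tel)
SELECT id, tel1 FROM master WHERE tel1 IS NOT NULL
UNION SELECT id, tel2 FROM master WHERE tel2 IS NOT NULL
UNION SELECT id, tel3 FROM master WHERE tel3 IS NOT NULL
UNION SELECT id, tel4 FROM master WHERE tel4 IS NOT NULL
UNION SELECT id, tel5 FROM master WHERE tel5 IS NOT NULL
""")
# Step 3: total number of tels per master record
con.execute("""
UPDATE tels SET n_tels = (SELECT COUNT(*) FROM tels t2
                          WHERE t2.master_id = tels.master_id)
""")
# Step 4: 'dupe' if some other record shares this number and has
# strictly more numbers; otherwise 'keep'
con.execute("""
UPDATE tels SET flag = CASE WHEN EXISTS (
    SELECT 1 FROM tels t2
    WHERE t2.tel = tels.tel AND t2.master_id <> tels.master_id
      AND t2.n_tels > tels.n_tels)
  THEN 'dupe' ELSE 'keep' END
""")
# Step 5: a master record survives if any of its tel rows is 'keep'
survivors = [r[0] for r in con.execute(
    "SELECT DISTINCT master_id FROM tels WHERE flag='keep' ORDER BY master_id")]
```

With the sample data, record 2 ({111}, a subset of record 1) is the only one removed; records 1 and 3 each keep a number the other lacks, so both survive.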

I realize this strategy of dedupe is not very thorough; **I really just want comment on whether my algorithm achieves what I am trying to do** - am I sure not to lose any unique tels, etc.?
Question by:garyb2008
2 Comments
 
LVL 51

Accepted Solution

by:
Mark Wills earned 400 total points
ID: 24088391
Is the flat file the only source of data (i.e. an initial build or migrate), or is it being applied against an existing database (i.e. an ongoing update process)?

Your process sounds about right... My initial thought / approach would be:

1) Initial load of the flat file into a "staging" area, using an ID identity / auto_number column to uniquely identify rows.
2) Validate / format phone numbers to a consistent format (e.g. remove embedded spaces, country codes, brackets, etc.).
3) Remove / flag rows where validation fails (and then decide what to do with those - for the remainder of this process they are ignored).
4) Remove / flag 'lesser' rows where a phone number exists for a different ID and that ID has at least the same number of (different) phone number entries.
5) Remaining are new IDs with any one phone number existing only once in the staging table.
6) Could then repeat step 4 by comparing the "live" table with the staging table.
7) Update the live data table with the staging data.
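Step 2's validation might look something like the following sketch (the country-code handling and length limits are illustrative assumptions, not part of the answer above; real rules depend on your market's numbering plan):

```python
import re

def normalize_phone(raw, default_country="44"):
    """Illustrative normalization: strip punctuation and spaces, fold a
    leading country code back to the domestic form, and return None
    (i.e. flag for step 3) when the result isn't a plausible length."""
    if raw is None:
        return None
    digits = re.sub(r"\D", "", raw)              # drop spaces, brackets, dashes, '+'
    if digits.startswith("00" + default_country):
        digits = "0" + digits[2 + len(default_country):]
    elif digits.startswith(default_country) and len(digits) > 10:
        digits = "0" + digits[len(default_country):]
    return digits if 9 <= len(digits) <= 11 else None
```

The point is that "(020) 7946-0018" and "+44 20 7946 0018" must normalize to the same string, or the duplicate comparison in step 4 will miss them.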


The big challenge is deciding the best method for checking whether any one of my phone numbers exists in any other row. You could either create a new table as you previously said, or simply use a subquery to effectively create a derived table. Either way, it is probably the same subquery, used either to build a temp table or directly in a query. That will depend pretty much on row counts, and at potentially 800K rows * 5 numbers there is a reasonably significant number of entries to consider.

So, that subquery is probably something like :

select * from
(
  select id, phone1 as valid_phone from my_staging_table where phone1 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone2 from my_staging_table where phone2 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone3 from my_staging_table where phone3 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone4 from my_staging_table where phone4 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone5 from my_staging_table where phone5 is not NULL and valid_row_flag = 'Y'
) as P
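A useful side effect of UNION here (as opposed to UNION ALL) is that it removes duplicate (id, phone) pairs, which covers the asker's concern about numbers duplicated within one master record's own fields. A minimal sqlite3 demonstration with a hypothetical two-column staging table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE my_staging_table "
            "(id INTEGER, phone1 TEXT, phone2 TEXT, valid_row_flag TEXT)")
# record 7 repeats the same number in two phone columns
con.execute("INSERT INTO my_staging_table VALUES (7, '555', '555', 'Y')")
rows = con.execute("""
SELECT id, phone1 AS valid_phone FROM my_staging_table
 WHERE phone1 IS NOT NULL AND valid_row_flag = 'Y'
UNION
SELECT id, phone2 FROM my_staging_table
 WHERE phone2 IS NOT NULL AND valid_row_flag = 'Y'
""").fetchall()
# UNION collapses the repeated (id, phone) pair into a single row
```

So the derived table P contains each (id, phone) pair exactly once, which keeps the later 'no of tels' counts honest.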

 
LVL 69

Assisted Solution

by:Scott Pletcher
Scott Pletcher earned 100 total points
ID: 24089031
Add these steps:

0) When creating the work table, make sure flag is defined as CHAR(1) or some other *fixed* length type (tinyint, etc.). Don't make it varchar and start it out as empty and fill it in later -- this could lead to page splits.

1B) Create a CLUSTERED index on (tel#, id)

1K) Mark as "K" (keep) all rows for any master record (mr) that has a unique tel#:

UPDATE tn1
SET flag = 'K'
FROM tablename tn1
WHERE NOT EXISTS (
    SELECT 1
    FROM tablename tn2
    WHERE tn2.tel# = tn1.tel#
      AND tn2.id <> tn1.id
)

(With a clustered index built, this query should run VERY quickly.)

For all your future steps, add a "WHERE flag <> 'K'" condition.
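Step 1K translates directly to SQLite via Python's sqlite3 (T-SQL's UPDATE ... FROM becomes a plain correlated UPDATE; table and column names hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tels "
            "(master_id INTEGER, tel TEXT, flag CHAR(1) DEFAULT ' ')")
con.executemany("INSERT INTO tels (master_id, tel) VALUES (?,?)",
                [(1, '111'), (1, '222'), (2, '111'), (3, '333')])
# Mark 'K' on every row whose number appears under no other master
# record; those master records own a unique number and must be kept.
con.execute("""
UPDATE tels SET flag = 'K'
WHERE NOT EXISTS (SELECT 1 FROM tels t2
                  WHERE t2.tel = tels.tel
                    AND t2.master_id <> tels.master_id)
""")
keepers = sorted({r[0] for r in
                  con.execute("SELECT master_id FROM tels WHERE flag='K'")})
```

Here 222 and 333 are unique, so records 1 and 3 are locked in as keepers before any subset comparison runs; only record 2's rows remain in play for the later steps.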
