Solved

Data Cleansing Records based on Phone No

Posted on 2009-04-06
2
792 Views
Last Modified: 2013-11-15
I have a 800K prospect records which i want to run a limited deduplication process on as described below.

It is a flat file format with each master record having up to 5 telephone fields and it is on these fields alone the de dupe process is based.

Aims:

1.Were a master record has the same tel nos as another master record (any) *one* master record is to be kept
2.A master record should be removed were all of its tel nos are contained in another master record that also has additional tel nos.
3.If a master record has 1 number that is contained in another master record, and one number that is unique, even if the other master record has lots more additional numbers the first master record should be kept (ie likely both these records will be kept)

So, telephone numbers are never removed, only whole master records. Also, tel nos from 2 master records are never combined. Also, a unique telephone number to the db should never be discarded, its master record needs to be kept. A record with a unique telephone no may include dupe telephone nos from other records were they also have more unique tel nos.
 
I have a draft algorithm made up, can you check if it gives the results i expect:

here goes:

1. I normalize the tels into a seperate table with fields
-id of original master record (i created this, the records originally had no id)
-tel no
2. i check the master records were all tels match exactly and remove the dupes from the master tbl (being carefull not to allow numbers duplicated within the master record fields to distort process). Any *one* of the identicle records will be kept here.
3. i create an extra field in this tel nos table called 'no of tels'. This is populated with the total number of tels that the master record has
4. i compare each tel no. record with the others, if the tel no matches another tel no. the one(s) with the largest "no of tel nos" is marked with a "keep" (in a new field in the tel no. table), while the others are marked with a "dupe"
5.i pull the master records back together with its tel no records, those master records were all tel nos have been marked with "dupe" are removed from the master table, those records were any tel nos are marked keep are kept

I realize this strategy of dedupe is not very thorough, **i really just want comment on whether my algorithm acheives what i am trying to do** - am i sure not to lose any unique tels etc...
0
Comment
Question by:garyb2008
2 Comments
 
LVL 51

Accepted Solution

by:
Mark Wills earned 400 total points
Comment Utility
Is the flat file the only source of data (ie an initial build or migrate) ? or is it being applied against an existing database (ie an ongoing update process) ?

Your process sounds about right... My initial thought / approach would be :

1) initial load of flat file into "staging" area, use of ID identity / auto_number column to uniquely identify rows.
2) validate / format phone numbers to consistant format (e.g. remove imbedded spaces, country codes, brakets, etc)
3) remove / flag rows where validation fails (and then have to decide what to do with those - for the remainder of this process they are ignored)
4) remove / flag 'lesser' rows where a phone number exists for a different ID and that ID has at least the same number of (different) phone number entries.
5) remaining are new ID's with any one phone number existing only once in the staging table
6) could then repeat step 3 by coparing the "live" table with the staging table
7) update live data table with staging data


The big challenge is deciding which is the best method for deciding if any one of my phone numbers exists in any other row. Could either create a new table as you have previously said, or simply use a subquery to effectively create a derived table. either which way, it is probably the same subquery to either build a temp table, or use in a query. That will depend pretty much on row counts and at potentially 800K rows * 5 numbers there is a reasonably significant number of entries to consider.

So, that subquery is probably something like :

select * from
(
  select id, phone1 as valid_phone from my_staging_table where phone1 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone2 from my_staging_table where phone2 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone3 from my_staging_table where phone3 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone4 from my_staging_table where phone4 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone5 from my_staging_table where phone5 is not NULL and valid_row_flag = 'Y'
) as P

0
 
LVL 69

Assisted Solution

by:ScottPletcher
ScottPletcher earned 100 total points
Comment Utility
Add these steps:

0) When creating the work table, make sure flag is defined as CHAR(1) or some other *fixed* length type (tinyint, etc.). Don't make it varchar and start it out as empty and fill it in later -- this could lead to page splits.

1B) Create a CLUSTERED index on (tel#, id)

1K) Mark as "K" (keep) all rows for any master record (mr) that has a unique tel#:

UPDATE tn1
SET flag = 'K'
FROM tablename tn1
WHERE NOT EXISTS (
SELECT 1
FROM tablename tn2
WHERE tn2.tel# = tn1.tel#
AND tn2.id <> tn1.id
)

[With a clus index built, this query should run VERY quickly.)

For all your future steps, add a "WHERE flag <> 'K'" condition.
0

Featured Post

Enabling OSINT in Activity Based Intelligence

Activity based intelligence (ABI) requires access to all available sources of data. Recorded Future allows analysts to observe structured data on the open, deep, and dark web.

Join & Write a Comment

Creating and Managing Databases with phpMyAdmin in cPanel.
Read about achieving the basic levels of HRIS security in the workplace.
Using examples as well as descriptions, and references to Books Online, show the different Recovery Models available in SQL Server and explain, as well as show how full, differential and transaction log backups are performed
Via a live example, show how to setup several different housekeeping processes for a SQL Server.

771 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

11 Experts available now in Live!

Get 1:1 Help Now