?
Solved

Data Cleansing Records based on Phone No

Posted on 2009-04-06
2
Medium Priority
?
821 Views
Last Modified: 2013-11-15
I have a 800K prospect records which i want to run a limited deduplication process on as described below.

It is a flat file format with each master record having up to 5 telephone fields and it is on these fields alone the de dupe process is based.

Aims:

1.Were a master record has the same tel nos as another master record (any) *one* master record is to be kept
2.A master record should be removed were all of its tel nos are contained in another master record that also has additional tel nos.
3.If a master record has 1 number that is contained in another master record, and one number that is unique, even if the other master record has lots more additional numbers the first master record should be kept (ie likely both these records will be kept)

So, telephone numbers are never removed, only whole master records. Also, tel nos from 2 master records are never combined. Also, a unique telephone number to the db should never be discarded, its master record needs to be kept. A record with a unique telephone no may include dupe telephone nos from other records were they also have more unique tel nos.
 
I have a draft algorithm made up, can you check if it gives the results i expect:

here goes:

1. I normalize the tels into a seperate table with fields
-id of original master record (i created this, the records originally had no id)
-tel no
2. i check the master records were all tels match exactly and remove the dupes from the master tbl (being carefull not to allow numbers duplicated within the master record fields to distort process). Any *one* of the identicle records will be kept here.
3. i create an extra field in this tel nos table called 'no of tels'. This is populated with the total number of tels that the master record has
4. i compare each tel no. record with the others, if the tel no matches another tel no. the one(s) with the largest "no of tel nos" is marked with a "keep" (in a new field in the tel no. table), while the others are marked with a "dupe"
5.i pull the master records back together with its tel no records, those master records were all tel nos have been marked with "dupe" are removed from the master table, those records were any tel nos are marked keep are kept

I realize this strategy of dedupe is not very thorough, **i really just want comment on whether my algorithm acheives what i am trying to do** - am i sure not to lose any unique tels etc...
0
Comment
Question by:garyb2008
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
2 Comments
 
LVL 51

Accepted Solution

by:
Mark Wills earned 1600 total points
ID: 24088391
Is the flat file the only source of data (ie an initial build or migrate) ? or is it being applied against an existing database (ie an ongoing update process) ?

Your process sounds about right... My initial thought / approach would be :

1) initial load of flat file into "staging" area, use of ID identity / auto_number column to uniquely identify rows.
2) validate / format phone numbers to consistant format (e.g. remove imbedded spaces, country codes, brakets, etc)
3) remove / flag rows where validation fails (and then have to decide what to do with those - for the remainder of this process they are ignored)
4) remove / flag 'lesser' rows where a phone number exists for a different ID and that ID has at least the same number of (different) phone number entries.
5) remaining are new ID's with any one phone number existing only once in the staging table
6) could then repeat step 3 by coparing the "live" table with the staging table
7) update live data table with staging data


The big challenge is deciding which is the best method for deciding if any one of my phone numbers exists in any other row. Could either create a new table as you have previously said, or simply use a subquery to effectively create a derived table. either which way, it is probably the same subquery to either build a temp table, or use in a query. That will depend pretty much on row counts and at potentially 800K rows * 5 numbers there is a reasonably significant number of entries to consider.

So, that subquery is probably something like :

select * from
(
  select id, phone1 as valid_phone from my_staging_table where phone1 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone2 from my_staging_table where phone2 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone3 from my_staging_table where phone3 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone4 from my_staging_table where phone4 is not NULL and valid_row_flag = 'Y'
  union
  select id, phone5 from my_staging_table where phone5 is not NULL and valid_row_flag = 'Y'
) as P

0
 
LVL 69

Assisted Solution

by:Scott Pletcher
Scott Pletcher earned 400 total points
ID: 24089031
Add these steps:

0) When creating the work table, make sure flag is defined as CHAR(1) or some other *fixed* length type (tinyint, etc.). Don't make it varchar and start it out as empty and fill it in later -- this could lead to page splits.

1B) Create a CLUSTERED index on (tel#, id)

1K) Mark as "K" (keep) all rows for any master record (mr) that has a unique tel#:

UPDATE tn1
SET flag = 'K'
FROM tablename tn1
WHERE NOT EXISTS (
SELECT 1
FROM tablename tn2
WHERE tn2.tel# = tn1.tel#
AND tn2.id <> tn1.id
)

[With a clus index built, this query should run VERY quickly.)

For all your future steps, add a "WHERE flag <> 'K'" condition.
0

Featured Post

Microsoft Certification Exam 74-409

Veeam® is happy to provide the Microsoft community with a study guide prepared by MVP and MCT, Orin Thomas. This guide will take you through each of the exam objectives, helping you to prepare for and pass the examination.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Microsoft Access is a place to store data within tables and represent this stored data using multiple database objects such as in form of macros, forms, reports, etc. After a MS Access database is created there is need to improve the performance and…
Ready to get certified? Check out some courses that help you prepare for third-party exams.
Polish reports in Access so they look terrific. Take yourself to another level. Equations, Back Color, Alternate Back Color. Write easy VBA Code. Tighten space to use less pages. Launch report from a menu, considering criteria only when it is filled…
In this video, Percona Solution Engineer Rick Golba discuss how (and why) you implement high availability in a database environment. To discuss how Percona Consulting can help with your design and architecture needs for your database and infrastr…
Suggested Courses

765 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question