Solved

Duplicate Record Identification and Deletion for MS Access

Posted on 2014-09-24
8
263 Views
Last Modified: 2014-11-06
I am responsible for a database of approximately 750,000 customer records. Using simple duplicate checking routines, I identify approximately 32,000 duplicate records. More if I use Soundex and possibly even more if I were to parse the address string into various elements. And still more when considering misspellings and middle initials appended to first names.

My question - Are any of you aware of any routines developed in Access VP that could assist me in this task. Not only would I like to identify duplicate records, but I would like to automatically delete/consolidate records that were found to be duplicates.

Thanks for the help.
0
Comment
Question by:dwcummings
8 Comments
 
LVL 57

Assisted Solution

by:Jim Dettman (Microsoft MVP/ EE MVE)
Jim Dettman (Microsoft MVP/ EE MVE) earned 100 total points
ID: 40342392
<<My question - Are any of you aware of any routines developed in Access VP that could assist me in this task. Not only would I like to identify duplicate records, but I would like to automatically delete/consolidate records that were found to be duplicates.>>

 There is plenty of code floating around to do:

1. Soundex

2. Collapse a string.

  On the second, what I mean is that you trim both ends, then remove all white space and punctuation.   This gives you a good string to match on.   I would use the last name, address, city, and state for the string.

 Once identified, Access has built-in a duplicates design wizard that you helps you construct a query to delete dups.

  I can't offer more specifics because you'll quickly discover that scrubbing data is not always as straight forward as one might think.

 I do have that code though and can post if you need it, but I thought I'd start with a general comment first to find out what direction(s) you'd like to go in.

Jim.
0
 
LVL 65

Assisted Solution

by:Jim Horn
Jim Horn earned 100 total points
ID: 40342396
Here's a SQL Server-based article on dealing with duplicates that was just published a couple of weeks ago.  The T-SQL won't necessarily translate to Access SQL, but if nothing else it'll give you some ideas.

As far as Access-specific routines, I'm sure the experts in the Access zone have some tricks up their sleeve.

>And still more when considering misspellings and middle initials appended to first names.
SQL Server has a dang deal called Master Data Services which deals specifically with how to orchestrate data in a company so that names are the same, and anyone's fat-finger mis-spellings of those names are ultimately given to a business SME who can make decisions on them, such as if Puerto Rico, Californeeah, Mexico, and Alberta should be considered states.  The price tag of implimenting this is likely higher than an current Access only process would warrant.
0
 
LVL 3

Author Comment

by:dwcummings
ID: 40342608
Thanks for the comments. My situation is a little more complicated than I mentioned. I can't use the built in duplication routines because I need to consolidate as many as 20 child tables from the eliminated record into the retained record. And I need to apply certain rules to determine which record to keep and which to eliminate. I am hoping to do this with Access VB. What I am looking for is some routines to help me resolve name spelling issues as well as non-standard address formats, i.e., 101 W Main Street vs. 101 West Main vs. 101 West Main ST # 102.

The data is stored in MS SQL 2008 R2 databases and the front end is Access - currently 2003, but soon to be 2010.

I wish we could get users to standardize on data input formats and I have built in various traps to catch errors. But it seems that I am not as adept at anticipating their mistakes as they are in making them.

Thanks again.
0
 
LVL 34

Accepted Solution

by:
PatHartman earned 100 total points
ID: 40342729
Cleaning client lists can be time consuming.  About 10 years ago I did a project for a large multi-national insurance company.  There were 8 subsidiaries and 7 in house systems that maintained customer lists.  The task was to take the data from all those systems, scrub it and eliminate duplicates.  The files came to me from Oracle, DB2, Sybase, IMS, LotusNotes, SQL Server, and Excel.  Access was actually an excellent tool for the job since it could interface with all these databases via ODBC except for IMS which I had to use a utility to load into SQL Server first.  I wrote a lot of code to clean up addresses to standardize the names.  So, I did some analysis and found the various ways people spelled Street or Road or Turnpike, etc and created tables to take all the variations and make them the same.  I removed all the punctuation and all the extraneous spaces.  I did similar things with the names.  I spent a couple of months at it.  Then we found some software we could buy (the name escapes me at the moment but they were located in Greenwich, CT)  It was customizable and it assigned a percentage value to the matches.  So after working with it for a while, we decided to accept anything it said was 90% or better as a match.  And then worked on refining the list manually from there.  The software came in two pieces.  One piece analyzed existing files and the second piece was called from your data entry app to look for a duplicate before the new record was saved to prevent future duplicates.

I did a search and came up with this - http://www.winpure.com/article-listcleaningsoftware.html  -  They will clean up the file if it is a one time event or you can purchase software if you have an ongoing need.  Trust me when I tell you that you really don't want to write this yourself if there is any other alternative.  Spend the money to buy the software.  It will be cheaper in the long run.
0
What Security Threats Are You Missing?

Enhance your security with threat intelligence from the web. Get trending threat insights on hackers, exploits, and suspicious IP addresses delivered to your inbox with our free Cyber Daily.

 
LVL 49

Assisted Solution

by:Gustav Brock
Gustav Brock earned 100 total points
ID: 40343350
Pat is right. This might be the least exciting wheel you may reinvent.
Go for the trial. It could be interesting hearing about your experience.

/gustav
0
 
LVL 74

Assisted Solution

by:Jeffrey Coachman
Jeffrey Coachman earned 100 total points
ID: 40343969
Other thoughts:
(no points wanted)

Consider a scenario where:
101 W Main Street vs. 101 West Main
...all potentially legitimate separate addresses...
101 W Main Street
101 West Main Avenue

Also consider that international address conventions will vary from country to country.
Also note the somewhat ambiguous designations like: Town, City, Province, Hamlet, Village

But this is all just for the Addresses, checking Names will be a totally separate/different process

but I would like to automatically delete/consolidate records that were found to be duplicates.
...With no process being 100% accurate, ...how would you know which were incorrectly "corrected"?
Are you OK with the loosing a few (potentially important) customers?

This topic has been covered many times here.
In other words, ..after you clean the data, ...you need a clear system to show you each customer record, and what it was ultimately changed to.
This way you could easily catch the 10%, or so, addresses that must be edited manually

JeffCoachman
0
 
LVL 3

Author Closing Comment

by:dwcummings
ID: 40427016
Sorry that I haven't revisited this since it was originally posted. Everyone's comments were helpful, but I decided to use the third party solution suggested by Pat Hartman. Thanks to everyone that responded to my issue.
0
 
LVL 34

Expert Comment

by:PatHartman
ID: 40427033
Let us know how it works for you so we know whether or not to recommend it in the future.
0

Featured Post

Complete VMware vSphere® ESX(i) & Hyper-V Backup

Capture your entire system, including the host, with patented disk imaging integrated with VMware VADP / Microsoft VSS and RCT. RTOs is as low as 15 seconds with Acronis Active Restore™. You can enjoy unlimited P2V/V2V migrations from any source (even from a different hypervisor)

Join & Write a Comment

Ever needed a SQL 2008 Database replicated/mirrored/log shipped on another server but you can't take the downtime inflicted by initial snapshot or disconnect while T-logs are restored or mirror applied? You can use SQL Server Initialize from Backup…
Describes a method of obtaining an object variable to an already running instance of Microsoft Access so that it can be controlled via automation.
Viewers will learn how to use the SELECT statement in SQL and will be exposed to the many uses the SELECT statement has.
Viewers will learn how to use the UPDATE and DELETE statements to change or remove existing data from their tables. Make a table: Update a specific column given a specific row using the UPDATE statement: Remove a set of values using the DELETE s…

747 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

12 Experts available now in Live!

Get 1:1 Help Now