Duplicate Record Identification and Deletion for MS Access

I am responsible for a database of approximately 750,000 customer records. Using simple duplicate checking routines, I identify approximately 32,000 duplicate records. More if I use Soundex and possibly even more if I were to parse the address string into various elements. And still more when considering misspellings and middle initials appended to first names.

My question - Are any of you aware of any routines developed in Access VP that could assist me in this task. Not only would I like to identify duplicate records, but I would like to automatically delete/consolidate records that were found to be duplicates.

Thanks for the help.
Douglas CummingsAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Jim Dettman (Microsoft MVP/ EE MVE)President / OwnerCommented:
<<My question - Are any of you aware of any routines developed in Access VP that could assist me in this task. Not only would I like to identify duplicate records, but I would like to automatically delete/consolidate records that were found to be duplicates.>>

 There is plenty of code floating around to do:

1. Soundex

2. Collapse a string.

  On the second, what I mean is that you trim both ends, then remove all white space and punctuation.   This gives you a good string to match on.   I would use the last name, address, city, and state for the string.

 Once identified, Access has built-in a duplicates design wizard that you helps you construct a query to delete dups.

  I can't offer more specifics because you'll quickly discover that scrubbing data is not always as straight forward as one might think.

 I do have that code though and can post if you need it, but I thought I'd start with a general comment first to find out what direction(s) you'd like to go in.

Jim HornSQL Server Data DudeCommented:
Here's a SQL Server-based article on dealing with duplicates that was just published a couple of weeks ago.  The T-SQL won't necessarily translate to Access SQL, but if nothing else it'll give you some ideas.

As far as Access-specific routines, I'm sure the experts in the Access zone have some tricks up their sleeve.

>And still more when considering misspellings and middle initials appended to first names.
SQL Server has a dang deal called Master Data Services which deals specifically with how to orchestrate data in a company so that names are the same, and anyone's fat-finger mis-spellings of those names are ultimately given to a business SME who can make decisions on them, such as if Puerto Rico, Californeeah, Mexico, and Alberta should be considered states.  The price tag of implimenting this is likely higher than an current Access only process would warrant.
Douglas CummingsAuthor Commented:
Thanks for the comments. My situation is a little more complicated than I mentioned. I can't use the built in duplication routines because I need to consolidate as many as 20 child tables from the eliminated record into the retained record. And I need to apply certain rules to determine which record to keep and which to eliminate. I am hoping to do this with Access VB. What I am looking for is some routines to help me resolve name spelling issues as well as non-standard address formats, i.e., 101 W Main Street vs. 101 West Main vs. 101 West Main ST # 102.

The data is stored in MS SQL 2008 R2 databases and the front end is Access - currently 2003, but soon to be 2010.

I wish we could get users to standardize on data input formats and I have built in various traps to catch errors. But it seems that I am not as adept at anticipating their mistakes as they are in making them.

Thanks again.
Determine the Perfect Price for Your IT Services

Do you wonder if your IT business is truly profitable or if you should raise your prices? Learn how to calculate your overhead burden with our free interactive tool and use it to determine the right price for your IT services. Download your free eBook now!

Cleaning client lists can be time consuming.  About 10 years ago I did a project for a large multi-national insurance company.  There were 8 subsidiaries and 7 in house systems that maintained customer lists.  The task was to take the data from all those systems, scrub it and eliminate duplicates.  The files came to me from Oracle, DB2, Sybase, IMS, LotusNotes, SQL Server, and Excel.  Access was actually an excellent tool for the job since it could interface with all these databases via ODBC except for IMS which I had to use a utility to load into SQL Server first.  I wrote a lot of code to clean up addresses to standardize the names.  So, I did some analysis and found the various ways people spelled Street or Road or Turnpike, etc and created tables to take all the variations and make them the same.  I removed all the punctuation and all the extraneous spaces.  I did similar things with the names.  I spent a couple of months at it.  Then we found some software we could buy (the name escapes me at the moment but they were located in Greenwich, CT)  It was customizable and it assigned a percentage value to the matches.  So after working with it for a while, we decided to accept anything it said was 90% or better as a match.  And then worked on refining the list manually from there.  The software came in two pieces.  One piece analyzed existing files and the second piece was called from your data entry app to look for a duplicate before the new record was saved to prevent future duplicates.

I did a search and came up with this - http://www.winpure.com/article-listcleaningsoftware.html  -  They will clean up the file if it is a one time event or you can purchase software if you have an ongoing need.  Trust me when I tell you that you really don't want to write this yourself if there is any other alternative.  Spend the money to buy the software.  It will be cheaper in the long run.

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
Gustav BrockCIOCommented:
Pat is right. This might be the least exciting wheel you may reinvent.
Go for the trial. It could be interesting hearing about your experience.

Jeffrey CoachmanMIS LiasonCommented:
Other thoughts:
(no points wanted)

Consider a scenario where:
101 W Main Street vs. 101 West Main
...all potentially legitimate separate addresses...
101 W Main Street
101 West Main Avenue

Also consider that international address conventions will vary from country to country.
Also note the somewhat ambiguous designations like: Town, City, Province, Hamlet, Village

But this is all just for the Addresses, checking Names will be a totally separate/different process

but I would like to automatically delete/consolidate records that were found to be duplicates.
...With no process being 100% accurate, ...how would you know which were incorrectly "corrected"?
Are you OK with the loosing a few (potentially important) customers?

This topic has been covered many times here.
In other words, ..after you clean the data, ...you need a clear system to show you each customer record, and what it was ultimately changed to.
This way you could easily catch the 10%, or so, addresses that must be edited manually

Douglas CummingsAuthor Commented:
Sorry that I haven't revisited this since it was originally posted. Everyone's comments were helpful, but I decided to use the third party solution suggested by Pat Hartman. Thanks to everyone that responded to my issue.
Let us know how it works for you so we know whether or not to recommend it in the future.
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Microsoft Access

From novice to tech pro — start learning today.