Duplicate Record Identification and Deletion for MS Access

Posted on 2014-09-24
Medium Priority
Last Modified: 2014-11-06
I am responsible for a database of approximately 750,000 customer records. Using simple duplicate checking routines, I identify approximately 32,000 duplicate records. More if I use Soundex and possibly even more if I were to parse the address string into various elements. And still more when considering misspellings and middle initials appended to first names.

My question - Are any of you aware of any routines developed in Access VP that could assist me in this task. Not only would I like to identify duplicate records, but I would like to automatically delete/consolidate records that were found to be duplicates.

Thanks for the help.
Question by:dwcummings
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
LVL 58

Assisted Solution

by:Jim Dettman (Microsoft MVP/ EE MVE)
Jim Dettman (Microsoft MVP/ EE MVE) earned 400 total points
ID: 40342392
<<My question - Are any of you aware of any routines developed in Access VP that could assist me in this task. Not only would I like to identify duplicate records, but I would like to automatically delete/consolidate records that were found to be duplicates.>>

 There is plenty of code floating around to do:

1. Soundex

2. Collapse a string.

  On the second, what I mean is that you trim both ends, then remove all white space and punctuation.   This gives you a good string to match on.   I would use the last name, address, city, and state for the string.

 Once identified, Access has built-in a duplicates design wizard that you helps you construct a query to delete dups.

  I can't offer more specifics because you'll quickly discover that scrubbing data is not always as straight forward as one might think.

 I do have that code though and can post if you need it, but I thought I'd start with a general comment first to find out what direction(s) you'd like to go in.

LVL 66

Assisted Solution

by:Jim Horn
Jim Horn earned 400 total points
ID: 40342396
Here's a SQL Server-based article on dealing with duplicates that was just published a couple of weeks ago.  The T-SQL won't necessarily translate to Access SQL, but if nothing else it'll give you some ideas.

As far as Access-specific routines, I'm sure the experts in the Access zone have some tricks up their sleeve.

>And still more when considering misspellings and middle initials appended to first names.
SQL Server has a dang deal called Master Data Services which deals specifically with how to orchestrate data in a company so that names are the same, and anyone's fat-finger mis-spellings of those names are ultimately given to a business SME who can make decisions on them, such as if Puerto Rico, Californeeah, Mexico, and Alberta should be considered states.  The price tag of implimenting this is likely higher than an current Access only process would warrant.

Author Comment

ID: 40342608
Thanks for the comments. My situation is a little more complicated than I mentioned. I can't use the built in duplication routines because I need to consolidate as many as 20 child tables from the eliminated record into the retained record. And I need to apply certain rules to determine which record to keep and which to eliminate. I am hoping to do this with Access VB. What I am looking for is some routines to help me resolve name spelling issues as well as non-standard address formats, i.e., 101 W Main Street vs. 101 West Main vs. 101 West Main ST # 102.

The data is stored in MS SQL 2008 R2 databases and the front end is Access - currently 2003, but soon to be 2010.

I wish we could get users to standardize on data input formats and I have built in various traps to catch errors. But it seems that I am not as adept at anticipating their mistakes as they are in making them.

Thanks again.

Modern healthcare requires a modern cloud. View this brief video to understand how the Concerto Cloud for Healthcare can help your organization.

LVL 38

Accepted Solution

PatHartman earned 400 total points
ID: 40342729
Cleaning client lists can be time consuming.  About 10 years ago I did a project for a large multi-national insurance company.  There were 8 subsidiaries and 7 in house systems that maintained customer lists.  The task was to take the data from all those systems, scrub it and eliminate duplicates.  The files came to me from Oracle, DB2, Sybase, IMS, LotusNotes, SQL Server, and Excel.  Access was actually an excellent tool for the job since it could interface with all these databases via ODBC except for IMS which I had to use a utility to load into SQL Server first.  I wrote a lot of code to clean up addresses to standardize the names.  So, I did some analysis and found the various ways people spelled Street or Road or Turnpike, etc and created tables to take all the variations and make them the same.  I removed all the punctuation and all the extraneous spaces.  I did similar things with the names.  I spent a couple of months at it.  Then we found some software we could buy (the name escapes me at the moment but they were located in Greenwich, CT)  It was customizable and it assigned a percentage value to the matches.  So after working with it for a while, we decided to accept anything it said was 90% or better as a match.  And then worked on refining the list manually from there.  The software came in two pieces.  One piece analyzed existing files and the second piece was called from your data entry app to look for a duplicate before the new record was saved to prevent future duplicates.

I did a search and came up with this - http://www.winpure.com/article-listcleaningsoftware.html  -  They will clean up the file if it is a one time event or you can purchase software if you have an ongoing need.  Trust me when I tell you that you really don't want to write this yourself if there is any other alternative.  Spend the money to buy the software.  It will be cheaper in the long run.
LVL 51

Assisted Solution

by:Gustav Brock
Gustav Brock earned 400 total points
ID: 40343350
Pat is right. This might be the least exciting wheel you may reinvent.
Go for the trial. It could be interesting hearing about your experience.

LVL 74

Assisted Solution

by:Jeffrey Coachman
Jeffrey Coachman earned 400 total points
ID: 40343969
Other thoughts:
(no points wanted)

Consider a scenario where:
101 W Main Street vs. 101 West Main
...all potentially legitimate separate addresses...
101 W Main Street
101 West Main Avenue

Also consider that international address conventions will vary from country to country.
Also note the somewhat ambiguous designations like: Town, City, Province, Hamlet, Village

But this is all just for the Addresses, checking Names will be a totally separate/different process

but I would like to automatically delete/consolidate records that were found to be duplicates.
...With no process being 100% accurate, ...how would you know which were incorrectly "corrected"?
Are you OK with the loosing a few (potentially important) customers?

This topic has been covered many times here.
In other words, ..after you clean the data, ...you need a clear system to show you each customer record, and what it was ultimately changed to.
This way you could easily catch the 10%, or so, addresses that must be edited manually


Author Closing Comment

ID: 40427016
Sorry that I haven't revisited this since it was originally posted. Everyone's comments were helpful, but I decided to use the third party solution suggested by Pat Hartman. Thanks to everyone that responded to my issue.
LVL 38

Expert Comment

ID: 40427033
Let us know how it works for you so we know whether or not to recommend it in the future.

Featured Post

Free Tool: Site Down Detector

Helpful to verify reports of your own downtime, or to double check a downed website you are trying to access.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

AutoNumbers should increment automatically, without duplicates.  But sometimes something goes wrong, and the next AutoNumber value is a duplicate.  This article shows how to recover from this problem.
What if you have to shut down the entire Citrix infrastructure for hardware maintenance, software upgrades or "the unknown"? I developed this plan for "the unknown" and hope that it helps you as well. This article explains how to properly shut down …
Access reports are powerful and flexible. Learn how to create a query and then a grouped report using the wizard. Modify the report design after the wizard is done to make it look better. There will be another video to explain how to put the final p…
Polish reports in Access so they look terrific. Take yourself to another level. Equations, Back Color, Alternate Back Color. Write easy VBA Code. Tighten space to use less pages. Launch report from a menu, considering criteria only when it is filled…
Suggested Courses

752 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question