Solved

Duplicate Record Identification and Deletion for MS Access

Posted on 2014-09-24
8
271 Views
Last Modified: 2014-11-06
I am responsible for a database of approximately 750,000 customer records. Using simple duplicate checking routines, I identify approximately 32,000 duplicate records. More if I use Soundex and possibly even more if I were to parse the address string into various elements. And still more when considering misspellings and middle initials appended to first names.

My question - Are any of you aware of any routines developed in Access VP that could assist me in this task. Not only would I like to identify duplicate records, but I would like to automatically delete/consolidate records that were found to be duplicates.

Thanks for the help.
0
Comment
Question by:dwcummings
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
8 Comments
 
LVL 58

Assisted Solution

by:Jim Dettman (Microsoft MVP/ EE MVE)
Jim Dettman (Microsoft MVP/ EE MVE) earned 100 total points
ID: 40342392
<<My question - Are any of you aware of any routines developed in Access VP that could assist me in this task. Not only would I like to identify duplicate records, but I would like to automatically delete/consolidate records that were found to be duplicates.>>

 There is plenty of code floating around to do:

1. Soundex

2. Collapse a string.

  On the second, what I mean is that you trim both ends, then remove all white space and punctuation.   This gives you a good string to match on.   I would use the last name, address, city, and state for the string.

 Once identified, Access has built-in a duplicates design wizard that you helps you construct a query to delete dups.

  I can't offer more specifics because you'll quickly discover that scrubbing data is not always as straight forward as one might think.

 I do have that code though and can post if you need it, but I thought I'd start with a general comment first to find out what direction(s) you'd like to go in.

Jim.
0
 
LVL 66

Assisted Solution

by:Jim Horn
Jim Horn earned 100 total points
ID: 40342396
Here's a SQL Server-based article on dealing with duplicates that was just published a couple of weeks ago.  The T-SQL won't necessarily translate to Access SQL, but if nothing else it'll give you some ideas.

As far as Access-specific routines, I'm sure the experts in the Access zone have some tricks up their sleeve.

>And still more when considering misspellings and middle initials appended to first names.
SQL Server has a dang deal called Master Data Services which deals specifically with how to orchestrate data in a company so that names are the same, and anyone's fat-finger mis-spellings of those names are ultimately given to a business SME who can make decisions on them, such as if Puerto Rico, Californeeah, Mexico, and Alberta should be considered states.  The price tag of implimenting this is likely higher than an current Access only process would warrant.
0
 
LVL 3

Author Comment

by:dwcummings
ID: 40342608
Thanks for the comments. My situation is a little more complicated than I mentioned. I can't use the built in duplication routines because I need to consolidate as many as 20 child tables from the eliminated record into the retained record. And I need to apply certain rules to determine which record to keep and which to eliminate. I am hoping to do this with Access VB. What I am looking for is some routines to help me resolve name spelling issues as well as non-standard address formats, i.e., 101 W Main Street vs. 101 West Main vs. 101 West Main ST # 102.

The data is stored in MS SQL 2008 R2 databases and the front end is Access - currently 2003, but soon to be 2010.

I wish we could get users to standardize on data input formats and I have built in various traps to catch errors. But it seems that I am not as adept at anticipating their mistakes as they are in making them.

Thanks again.
0
Edgartown IT Case Study

Learn about Edgartown's quest to ensure the safety and security of the entire town's employee and citizen data. Read the case study!

 
LVL 38

Accepted Solution

by:
PatHartman earned 100 total points
ID: 40342729
Cleaning client lists can be time consuming.  About 10 years ago I did a project for a large multi-national insurance company.  There were 8 subsidiaries and 7 in house systems that maintained customer lists.  The task was to take the data from all those systems, scrub it and eliminate duplicates.  The files came to me from Oracle, DB2, Sybase, IMS, LotusNotes, SQL Server, and Excel.  Access was actually an excellent tool for the job since it could interface with all these databases via ODBC except for IMS which I had to use a utility to load into SQL Server first.  I wrote a lot of code to clean up addresses to standardize the names.  So, I did some analysis and found the various ways people spelled Street or Road or Turnpike, etc and created tables to take all the variations and make them the same.  I removed all the punctuation and all the extraneous spaces.  I did similar things with the names.  I spent a couple of months at it.  Then we found some software we could buy (the name escapes me at the moment but they were located in Greenwich, CT)  It was customizable and it assigned a percentage value to the matches.  So after working with it for a while, we decided to accept anything it said was 90% or better as a match.  And then worked on refining the list manually from there.  The software came in two pieces.  One piece analyzed existing files and the second piece was called from your data entry app to look for a duplicate before the new record was saved to prevent future duplicates.

I did a search and came up with this - http://www.winpure.com/article-listcleaningsoftware.html  -  They will clean up the file if it is a one time event or you can purchase software if you have an ongoing need.  Trust me when I tell you that you really don't want to write this yourself if there is any other alternative.  Spend the money to buy the software.  It will be cheaper in the long run.
0
 
LVL 50

Assisted Solution

by:Gustav Brock
Gustav Brock earned 100 total points
ID: 40343350
Pat is right. This might be the least exciting wheel you may reinvent.
Go for the trial. It could be interesting hearing about your experience.

/gustav
0
 
LVL 74

Assisted Solution

by:Jeffrey Coachman
Jeffrey Coachman earned 100 total points
ID: 40343969
Other thoughts:
(no points wanted)

Consider a scenario where:
101 W Main Street vs. 101 West Main
...all potentially legitimate separate addresses...
101 W Main Street
101 West Main Avenue

Also consider that international address conventions will vary from country to country.
Also note the somewhat ambiguous designations like: Town, City, Province, Hamlet, Village

But this is all just for the Addresses, checking Names will be a totally separate/different process

but I would like to automatically delete/consolidate records that were found to be duplicates.
...With no process being 100% accurate, ...how would you know which were incorrectly "corrected"?
Are you OK with the loosing a few (potentially important) customers?

This topic has been covered many times here.
In other words, ..after you clean the data, ...you need a clear system to show you each customer record, and what it was ultimately changed to.
This way you could easily catch the 10%, or so, addresses that must be edited manually

JeffCoachman
0
 
LVL 3

Author Closing Comment

by:dwcummings
ID: 40427016
Sorry that I haven't revisited this since it was originally posted. Everyone's comments were helpful, but I decided to use the third party solution suggested by Pat Hartman. Thanks to everyone that responded to my issue.
0
 
LVL 38

Expert Comment

by:PatHartman
ID: 40427033
Let us know how it works for you so we know whether or not to recommend it in the future.
0

Featured Post

Enroll in June's Course of the Month

June's Course of the Month is now available! Every 10 seconds, a consumer gets hit with ransomware. Refresh your knowledge of ransomware best practices by enrolling in this month's complimentary course for Premium Members, Team Accounts, and Qualified Experts.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Recently we ran in to an issue while running some SQL jobs where we were trying to process the cubes.  We got an error saying failure stating 'NT SERVICE\SQLSERVERAGENT does not have access to Analysis Services. So this is a way to automate that wit…
Microsoft Access is a place to store data within tables and represent this stored data using multiple database objects such as in form of macros, forms, reports, etc. After a MS Access database is created there is need to improve the performance and…
Viewers will learn how to use the SELECT statement in SQL and will be exposed to the many uses the SELECT statement has.
Add bar graphs to Access queries using Unicode block characters. Graphs appear on every record in the color you want. Give life to numbers. Hopes this gives you ideas on visualizing your data in new ways ~ Create a calculated field in a query: …

719 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question