Fuzzy Matching

Posted on 2004-10-19
Last Modified: 2012-08-13
I have some data in my database which contains address data.  The problem with it is that it is of very terrible quality.  Say I have 3 orders placed by the same woman.  Those orders could be in the database as:

Carol Smith 101 State St Boston Masachusets
Carel Smythe One Hundred One State Street Boston Massachusetts
C. Smth One Zero One State Boston MA

I even have country data with some records.  I tried using soundex, but I get tons of false positives.  Is there any other way to query this data so that I get all of poor Carol's orders!!!  I am willing to purchase ANY third part tool which will help with this.

Question by:superfly18
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
LVL 19

Accepted Solution

grant300 earned 500 total points
ID: 12348651
Careful what you ask for, you may get it.  You are, indeed in need of a third party tool to do address cleansing and normalization.  The probelm is, they are not cheap.  Check out <> for a range of PC-based solutions.   (No, I have no affiliation.)  The very least you will spend for a PC-based solution is $4,000 (up to maybe $25,000) plus you will need a regular subscription service to keep the postal ZIP code and other information up to date.  There are other products out there as well; citation just popped up first in the google search I did.

Several notes:
1)  I can only assume that with the numbers spelled out that many of these addresses are entered over the web.  Having built database applications for use by professional data entry people, I can assure you they would never make the extra keystrokes.  Given that, you need to do two things to make certain that that kind of un-parsable crap doesn't get in the database:
  A) Add examples and/or instructions to the entry section of the WEB page so people get a visual image before they start typing.
  B) Incorporate address verification at the time of entry.  To do this, you will have to take the address info from the page they submit, run it though
      the address cleansing software API in real time, and reject anything you cannot successfully process.  You will either have to try and translate
      error messages from the API (assuming they are any use at all anyway) into something granny can understand and/or pop up a window with
      a list of suggested problems and the correct form to enter the information with.
2)  Make certain that third party software you pick has an API so you are not stuck doing things after the fact in batch
3)  Make certain that the software generate 11 digit zipcodes.  There are an additional 2 digits that can be encoded on the zip that are the
     Carrier Walk Sequence.  That is the actual order in which the addresses physically occur on a side of the street for one block.  The nice thing
     about the 11 digit zip is that it is unique for all deliverable addresses.  This gives you a unique serial number for every address and makes finding
     dupes about as easy as one can imagine.
4)  If you truley have international addresses in any volume, you will have to spend the extra green to get a fully internationalized version of the software.
     You will not get the 11 digit zipcode for international addresses; however, your address density for most other countries will be very low anyway.
     For duplicate elimination, you might try the soundex of the last name, the postal code, and the street number.  That gives you a pretty good (but not
     certain) shot at uniqueness.

One other approach you can take is to send your address list out to a service that will normalize it and clean it up for you.  They are usually not too unreasonably priced and can turn this sort of thing around fairly quickly.  This could give you some immediate relief while you are tracking down and implementing the third party software solution.

Hope that helps,

Featured Post

Windows Server 2016: All you need to know

Learn about Hyper-V features that increase functionality and usability of Microsoft Windows Server 2016. Also, throughout this eBook, you’ll find some basic PowerShell examples that will help you leverage the scripts in your environments!

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Everyone has problem when going to load data into Data warehouse (EDW). They all need to confirm that data quality is good but they don't no how to proceed. Microsoft has provided new task within SSIS 2008 called "Data Profiler Task". It solve th…
Recently we ran in to an issue while running some SQL jobs where we were trying to process the cubes.  We got an error saying failure stating 'NT SERVICE\SQLSERVERAGENT does not have access to Analysis Services. So this is a way to automate that wit…
Via a live example, show how to backup a database, simulate a failure backup the tail of the database transaction log and perform the restore.
Viewers will learn how the fundamental information of how to create a table.

687 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question