Fuzzy Matching

Posted on 2004-10-19
Last Modified: 2012-08-13
I have some data in my database which contains address data.  The problem with it is that it is of very terrible quality.  Say I have 3 orders placed by the same woman.  Those orders could be in the database as:

Carol Smith 101 State St Boston Masachusets
Carel Smythe One Hundred One State Street Boston Massachusetts
C. Smth One Zero One State Boston MA

I even have country data with some records.  I tried using soundex, but I get tons of false positives.  Is there any other way to query this data so that I get all of poor Carol's orders!!!  I am willing to purchase ANY third part tool which will help with this.

Question by:superfly18
LVL 19

Accepted Solution

grant300 earned 500 total points
ID: 12348651
Careful what you ask for, you may get it.  You are, indeed in need of a third party tool to do address cleansing and normalization.  The probelm is, they are not cheap.  Check out <> for a range of PC-based solutions.   (No, I have no affiliation.)  The very least you will spend for a PC-based solution is $4,000 (up to maybe $25,000) plus you will need a regular subscription service to keep the postal ZIP code and other information up to date.  There are other products out there as well; citation just popped up first in the google search I did.

Several notes:
1)  I can only assume that with the numbers spelled out that many of these addresses are entered over the web.  Having built database applications for use by professional data entry people, I can assure you they would never make the extra keystrokes.  Given that, you need to do two things to make certain that that kind of un-parsable crap doesn't get in the database:
  A) Add examples and/or instructions to the entry section of the WEB page so people get a visual image before they start typing.
  B) Incorporate address verification at the time of entry.  To do this, you will have to take the address info from the page they submit, run it though
      the address cleansing software API in real time, and reject anything you cannot successfully process.  You will either have to try and translate
      error messages from the API (assuming they are any use at all anyway) into something granny can understand and/or pop up a window with
      a list of suggested problems and the correct form to enter the information with.
2)  Make certain that third party software you pick has an API so you are not stuck doing things after the fact in batch
3)  Make certain that the software generate 11 digit zipcodes.  There are an additional 2 digits that can be encoded on the zip that are the
     Carrier Walk Sequence.  That is the actual order in which the addresses physically occur on a side of the street for one block.  The nice thing
     about the 11 digit zip is that it is unique for all deliverable addresses.  This gives you a unique serial number for every address and makes finding
     dupes about as easy as one can imagine.
4)  If you truley have international addresses in any volume, you will have to spend the extra green to get a fully internationalized version of the software.
     You will not get the 11 digit zipcode for international addresses; however, your address density for most other countries will be very low anyway.
     For duplicate elimination, you might try the soundex of the last name, the postal code, and the street number.  That gives you a pretty good (but not
     certain) shot at uniqueness.

One other approach you can take is to send your address list out to a service that will normalize it and clean it up for you.  They are usually not too unreasonably priced and can turn this sort of thing around fairly quickly.  This could give you some immediate relief while you are tracking down and implementing the third party software solution.

Hope that helps,

Featured Post

Complete VMware vSphere® ESX(i) & Hyper-V Backup

Capture your entire system, including the host, with patented disk imaging integrated with VMware VADP / Microsoft VSS and RCT. RTOs is as low as 15 seconds with Acronis Active Restore™. You can enjoy unlimited P2V/V2V migrations from any source (even from a different hypervisor)

Join & Write a Comment

In this article—a derivative of my blog post (—I will explore a few different perspectives on which week today's date falls within using Microsoft SQL Server. First, to frame this stu…
Performance is the key factor for any successful data integration project, knowing the type of transformation that you’re using is the first step on optimizing the SSIS flow performance, by utilizing the correct transformation or the design alternat…
Using examples as well as descriptions, and references to Books Online, show the documentation available for date manipulation functions and by using a select few of these functions, show how date based data can be manipulated with these functions.
Via a live example, show how to set up a backup for SQL Server using a Maintenance Plan and how to schedule the job into SQL Server Agent.

760 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

19 Experts available now in Live!

Get 1:1 Help Now