Fuzzy Matching

Posted on 2004-10-19
Last Modified: 2012-08-13
I have some data in my database which contains address data.  The problem with it is that it is of very terrible quality.  Say I have 3 orders placed by the same woman.  Those orders could be in the database as:

Carol Smith 101 State St Boston Masachusets
Carel Smythe One Hundred One State Street Boston Massachusetts
C. Smth One Zero One State Boston MA

I even have country data with some records.  I tried using soundex, but I get tons of false positives.  Is there any other way to query this data so that I get all of poor Carol's orders!!!  I am willing to purchase ANY third part tool which will help with this.

Question by:superfly18
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
LVL 19

Accepted Solution

grant300 earned 500 total points
ID: 12348651
Careful what you ask for, you may get it.  You are, indeed in need of a third party tool to do address cleansing and normalization.  The probelm is, they are not cheap.  Check out <> for a range of PC-based solutions.   (No, I have no affiliation.)  The very least you will spend for a PC-based solution is $4,000 (up to maybe $25,000) plus you will need a regular subscription service to keep the postal ZIP code and other information up to date.  There are other products out there as well; citation just popped up first in the google search I did.

Several notes:
1)  I can only assume that with the numbers spelled out that many of these addresses are entered over the web.  Having built database applications for use by professional data entry people, I can assure you they would never make the extra keystrokes.  Given that, you need to do two things to make certain that that kind of un-parsable crap doesn't get in the database:
  A) Add examples and/or instructions to the entry section of the WEB page so people get a visual image before they start typing.
  B) Incorporate address verification at the time of entry.  To do this, you will have to take the address info from the page they submit, run it though
      the address cleansing software API in real time, and reject anything you cannot successfully process.  You will either have to try and translate
      error messages from the API (assuming they are any use at all anyway) into something granny can understand and/or pop up a window with
      a list of suggested problems and the correct form to enter the information with.
2)  Make certain that third party software you pick has an API so you are not stuck doing things after the fact in batch
3)  Make certain that the software generate 11 digit zipcodes.  There are an additional 2 digits that can be encoded on the zip that are the
     Carrier Walk Sequence.  That is the actual order in which the addresses physically occur on a side of the street for one block.  The nice thing
     about the 11 digit zip is that it is unique for all deliverable addresses.  This gives you a unique serial number for every address and makes finding
     dupes about as easy as one can imagine.
4)  If you truley have international addresses in any volume, you will have to spend the extra green to get a fully internationalized version of the software.
     You will not get the 11 digit zipcode for international addresses; however, your address density for most other countries will be very low anyway.
     For duplicate elimination, you might try the soundex of the last name, the postal code, and the street number.  That gives you a pretty good (but not
     certain) shot at uniqueness.

One other approach you can take is to send your address list out to a service that will normalize it and clean it up for you.  They are usually not too unreasonably priced and can turn this sort of thing around fairly quickly.  This could give you some immediate relief while you are tracking down and implementing the third party software solution.

Hope that helps,

Featured Post

Free Tool: ZipGrep

ZipGrep is a utility that can list and search zip (.war, .ear, .jar, etc) archives for text patterns, without the need to extract the archive's contents.

One of a set of tools we're offering as a way to say thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
Can I skip a node in XML? 9 36
ODBC settings not showng in FileMaker External Data Sources 6 24
Related to SQL Query 5 21
Neglected Questions 3 14
When you hear the word proxy, you may become apprehensive. This article will help you to understand Proxy and when it is useful. Let's talk Proxy for SQL Server. (Not in terms of Internet access.) Typically, you'll run into this type of problem w…
International Data Corporation (IDC) prognosticates that before the current the year gets over disbursing on IT framework products to be sent in cloud environs will be $37.1B.
Via a live example, show how to setup several different housekeeping processes for a SQL Server.
Viewers will learn how to use the UPDATE and DELETE statements to change or remove existing data from their tables. Make a table: Update a specific column given a specific row using the UPDATE statement: Remove a set of values using the DELETE s…

730 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question