Solved

Fuzzy Matching

Posted on 2004-10-19
3
1,029 Views
Last Modified: 2012-08-13
I have some data in my database which contains address data.  The problem with it is that it is of very terrible quality.  Say I have 3 orders placed by the same woman.  Those orders could be in the database as:

Carol Smith 101 State St Boston Masachusets
Carel Smythe One Hundred One State Street Boston Massachusetts
C. Smth One Zero One State Boston MA

I even have country data with some records.  I tried using soundex, but I get tons of false positives.  Is there any other way to query this data so that I get all of poor Carol's orders!!!  I am willing to purchase ANY third part tool which will help with this.

Thanks,
Matt
0
Comment
Question by:superfly18
3 Comments
 
LVL 19

Accepted Solution

by:
grant300 earned 500 total points
ID: 12348651
Careful what you ask for, you may get it.  You are, indeed in need of a third party tool to do address cleansing and normalization.  The probelm is, they are not cheap.  Check out <www.citationsoftware.com> for a range of PC-based solutions.   (No, I have no affiliation.)  The very least you will spend for a PC-based solution is $4,000 (up to maybe $25,000) plus you will need a regular subscription service to keep the postal ZIP code and other information up to date.  There are other products out there as well; citation just popped up first in the google search I did.

Several notes:
1)  I can only assume that with the numbers spelled out that many of these addresses are entered over the web.  Having built database applications for use by professional data entry people, I can assure you they would never make the extra keystrokes.  Given that, you need to do two things to make certain that that kind of un-parsable crap doesn't get in the database:
  A) Add examples and/or instructions to the entry section of the WEB page so people get a visual image before they start typing.
  B) Incorporate address verification at the time of entry.  To do this, you will have to take the address info from the page they submit, run it though
      the address cleansing software API in real time, and reject anything you cannot successfully process.  You will either have to try and translate
      error messages from the API (assuming they are any use at all anyway) into something granny can understand and/or pop up a window with
      a list of suggested problems and the correct form to enter the information with.
2)  Make certain that third party software you pick has an API so you are not stuck doing things after the fact in batch
3)  Make certain that the software generate 11 digit zipcodes.  There are an additional 2 digits that can be encoded on the zip that are the
     Carrier Walk Sequence.  That is the actual order in which the addresses physically occur on a side of the street for one block.  The nice thing
     about the 11 digit zip is that it is unique for all deliverable addresses.  This gives you a unique serial number for every address and makes finding
     dupes about as easy as one can imagine.
4)  If you truley have international addresses in any volume, you will have to spend the extra green to get a fully internationalized version of the software.
     You will not get the 11 digit zipcode for international addresses; however, your address density for most other countries will be very low anyway.
     For duplicate elimination, you might try the soundex of the last name, the postal code, and the street number.  That gives you a pretty good (but not
     certain) shot at uniqueness.

One other approach you can take is to send your address list out to a service that will normalize it and clean it up for you.  They are usually not too unreasonably priced and can turn this sort of thing around fairly quickly.  This could give you some immediate relief while you are tracking down and implementing the third party software solution.

Hope that helps,
Bill
0

Featured Post

Efficient way to get backups off site to Azure

This user guide provides instructions on how to deploy and configure both a StoneFly Scale Out NAS Enterprise Cloud Drive virtual machine and Veeam Cloud Connect in the Microsoft Azure Cloud.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Slowly Changing Dimension Transformation component in data task flow is very useful for us to manage and control how data changes in SSIS.
For both online and offline retail, the cross-channel business is the most recent pattern in the B2C trade space.
Familiarize people with the process of retrieving data from SQL Server using an Access pass-thru query. Microsoft Access is a very powerful client/server development tool. One of the ways that you can retrieve data from a SQL Server is by using a pa…
This videos aims to give the viewer a basic demonstration of how a user can query current session information by using the SYS_CONTEXT function

803 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question