Improve company productivity with a Business Account.Sign Up

  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1039
  • Last Modified:

Fuzzy Matching

I have some data in my database which contains address data.  The problem with it is that it is of very terrible quality.  Say I have 3 orders placed by the same woman.  Those orders could be in the database as:

Carol Smith 101 State St Boston Masachusets
Carel Smythe One Hundred One State Street Boston Massachusetts
C. Smth One Zero One State Boston MA

I even have country data with some records.  I tried using soundex, but I get tons of false positives.  Is there any other way to query this data so that I get all of poor Carol's orders!!!  I am willing to purchase ANY third part tool which will help with this.

1 Solution
Careful what you ask for, you may get it.  You are, indeed in need of a third party tool to do address cleansing and normalization.  The probelm is, they are not cheap.  Check out <> for a range of PC-based solutions.   (No, I have no affiliation.)  The very least you will spend for a PC-based solution is $4,000 (up to maybe $25,000) plus you will need a regular subscription service to keep the postal ZIP code and other information up to date.  There are other products out there as well; citation just popped up first in the google search I did.

Several notes:
1)  I can only assume that with the numbers spelled out that many of these addresses are entered over the web.  Having built database applications for use by professional data entry people, I can assure you they would never make the extra keystrokes.  Given that, you need to do two things to make certain that that kind of un-parsable crap doesn't get in the database:
  A) Add examples and/or instructions to the entry section of the WEB page so people get a visual image before they start typing.
  B) Incorporate address verification at the time of entry.  To do this, you will have to take the address info from the page they submit, run it though
      the address cleansing software API in real time, and reject anything you cannot successfully process.  You will either have to try and translate
      error messages from the API (assuming they are any use at all anyway) into something granny can understand and/or pop up a window with
      a list of suggested problems and the correct form to enter the information with.
2)  Make certain that third party software you pick has an API so you are not stuck doing things after the fact in batch
3)  Make certain that the software generate 11 digit zipcodes.  There are an additional 2 digits that can be encoded on the zip that are the
     Carrier Walk Sequence.  That is the actual order in which the addresses physically occur on a side of the street for one block.  The nice thing
     about the 11 digit zip is that it is unique for all deliverable addresses.  This gives you a unique serial number for every address and makes finding
     dupes about as easy as one can imagine.
4)  If you truley have international addresses in any volume, you will have to spend the extra green to get a fully internationalized version of the software.
     You will not get the 11 digit zipcode for international addresses; however, your address density for most other countries will be very low anyway.
     For duplicate elimination, you might try the soundex of the last name, the postal code, and the street number.  That gives you a pretty good (but not
     certain) shot at uniqueness.

One other approach you can take is to send your address list out to a service that will normalize it and clean it up for you.  They are usually not too unreasonably priced and can turn this sort of thing around fairly quickly.  This could give you some immediate relief while you are tracking down and implementing the third party software solution.

Hope that helps,
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

Join & Write a Comment

Featured Post

What Kind of Coding Program is Right for You?

There are many ways to learn to code these days. From coding bootcamps like Flatiron School to online courses to totally free beginner resources. The best way to learn to code depends on many factors, but the most important one is you. See what course is best for you.

Tackle projects and never again get stuck behind a technical roadblock.
Join Now