Pattern Matching

I need to match inconsistant data in the sql server database.
when people are filling in their company names in a registration form, they all put in different names for the same company.

eg: Intel
      Intel Corp
      Intel Corporation
      Intel Corporation.
...... so many other different ways

So, the column saves so many different names for the same company, later when I want to search for total number of people who attended from Intel, my pattern matching string is not able to consolodate all the different ways, I am using Trim(substring(company_name,.....)) in the where clause with SELECT statement.I am hardly getting 40% of the people that need to be counted.

There is really no consistant pattern to the way people write their company names, I have 40,000 different company names in my list, and with this inconsistant names, the list goes up to 60,000. I cannot stop people from giving me inconsistant names, I will have to come up with a pattern matching query which can pull out company synonyms
80-90% of the time.

How can I solve this issue.


Who is Participating?
jrb1Connect With a Mentor Commented:
One thing that may help is getting rid of the extra stuff in the field.

update your table set messycompanyfield = replace(messycompanyfield, 'Incorporated','');

Continue with the things that cause you problems, but you don't really need in the field:
   replace(messycompanyfield, 'Inc','')
   replace(messycompanyfield, '.','')

Hopefully, if you can clean this up, it will help some.  After that, you can use a product that does this matching, or implement an algorith yourself:

Try this:

FROM YourTable
WHERE Company_Name LIKE 'Intel%'
If you just want to count the number of attendees from a given company, you can use the LIKE operator as I've shown above.  If you want to count even those were the company name you are looking for is in the middle of the company name, you can also use this:

FROM YourTable
WHERE Company_Name LIKE '%Intel%'

This will count all companies with Intel in their name, whether in front, middle or last.
Upgrade your Question Security!

Your question, your audience. Choose who sees your identity—and your question—with question security.

Jim P.Commented:
Another possibility - break out the company listing into a separate table. Then clean the data up.

Then on the data entry form do a combo box/lookup. And you would need to account for multiple company locations, Such as
CompIdxNum      CompRef                  Name            Addr
1                  Intel Corp HQ - AZ            Intel Corp      123 Pheonix
2                  Intel Corp - Silicon Valey Intel Corp      345 Pine
3                  IBM - NY                  IBM            456 Broad
4                  IBM - CA                  IBM            567 West

And store the foriegn key.  Make sure that the combo box shows enough information that the data entry clerk(s) can tell the matching address. If they don't show up put a button/form to enter the new address.

Remember GIGO - Garbage In, Garbage Out.
The only way to successfully make this happen is to create a seive to first standardize into a known pattern. For example: "ABC XYZ" will always be different than "ABC RST". However, if you can build a seive substitition dictionary, then you might stand a chance. Given the above example your seive dictionary would contain a transformation where all occurrences of "RST" should be substituted with "XYZ". Then, "ABC XYZ" will equal "ABC XYZ".

I have done some programming work for the U.S.Postal service. This technique is how they distill "like" things like street addresses, city names, state abbreviations, etc.

Here's a seive dictionary I used recently for "UNKNOWN".

Get the idea?
The first real step, is to burst any existing data into it's component parts and sort it. Then the human eye and mind can map out all seive substitutions. This is NOT a trivial task. Good luck.
One last thing. In my experience, trying to use SOUNDEX or DIFFERENCE never has acceptable results. Because "Petal Company" and "Peddle Company" have the same Soundex, but in reality, they're two distinctly different entities.
Jim P.Commented:
I agree that working through the 60,000 records is not going to be easy. But this should have been considered in design phase.  Now he is stuck cleaning up the mess. But by breaking out the companies now and making them there own subset as opposed to atomic to the employee, he can work through the problem.

If he holds off, and just does the pattern matching it will just get worse when he has a typo of Inte1 with a 1 or 1BM. Or MicrOs0ft.

Just my $0.02
AnuPutchaAuthor Commented:

  Are you telling me to mak a synonym directory for company names? I din't quite understand your example of UNKNOWN above.

Can you make it more clear? All those are different forms of the word UNKNOWN?


Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.