Link to home
Start Free TrialLog in
Avatar of AnuPutcha
AnuPutcha

asked on

Pattern Matching

I need to match inconsistant data in the sql server database.
when people are filling in their company names in a registration form, they all put in different names for the same company.

eg: Intel
      Intel Corp
      Intel Corporation
      Intel Corporation.
...... so many other different ways


So, the column saves so many different names for the same company, later when I want to search for total number of people who attended from Intel, my pattern matching string is not able to consolodate all the different ways, I am using Trim(substring(company_name,.....)) in the where clause with SELECT statement.I am hardly getting 40% of the people that need to be counted.

There is really no consistant pattern to the way people write their company names, I have 40,000 different company names in my list, and with this inconsistant names, the list goes up to 60,000. I cannot stop people from giving me inconsistant names, I will have to come up with a pattern matching query which can pull out company synonyms
80-90% of the time.

How can I solve this issue.

Anu.

     
Avatar of rafrancisco
rafrancisco

Try this:

SELECT COUNT(*)
FROM YourTable
WHERE Company_Name LIKE 'Intel%'
If you just want to count the number of attendees from a given company, you can use the LIKE operator as I've shown above.  If you want to count even those were the company name you are looking for is in the middle of the company name, you can also use this:

SELECT COUNT(*)
FROM YourTable
WHERE Company_Name LIKE '%Intel%'

This will count all companies with Intel in their name, whether in front, middle or last.
Another possibility - break out the company listing into a separate table. Then clean the data up.

Then on the data entry form do a combo box/lookup. And you would need to account for multiple company locations, Such as
CompIdxNum      CompRef                  Name            Addr
1                  Intel Corp HQ - AZ            Intel Corp      123 Pheonix
2                  Intel Corp - Silicon Valey Intel Corp      345 Pine
3                  IBM - NY                  IBM            456 Broad
4                  IBM - CA                  IBM            567 West

And store the foriegn key.  Make sure that the combo box shows enough information that the data entry clerk(s) can tell the matching address. If they don't show up put a button/form to enter the new address.

Remember GIGO - Garbage In, Garbage Out.
The only way to successfully make this happen is to create a seive to first standardize into a known pattern. For example: "ABC XYZ" will always be different than "ABC RST". However, if you can build a seive substitition dictionary, then you might stand a chance. Given the above example your seive dictionary would contain a transformation where all occurrences of "RST" should be substituted with "XYZ". Then, "ABC XYZ" will equal "ABC XYZ".

I have done some programming work for the U.S.Postal service. This technique is how they distill "like" things like street addresses, city names, state abbreviations, etc.

Here's a seive dictionary I used recently for "UNKNOWN".
|UKN|
|UNCERTAIN|
|UNK.|
|UNK|
|UNKN|
|UNKNONW|
|UNKNONWN|
|UNKNOW|
|UNKNOWN|
|UNKNOWNB|
|UNKNWON|
|UNKONWN|
|UNKOWN|
|UNSURE|

Get the idea?
The first real step, is to burst any existing data into it's component parts and sort it. Then the human eye and mind can map out all seive substitutions. This is NOT a trivial task. Good luck.
One last thing. In my experience, trying to use SOUNDEX or DIFFERENCE never has acceptable results. Because "Petal Company" and "Peddle Company" have the same Soundex, but in reality, they're two distinctly different entities.
I agree that working through the 60,000 records is not going to be easy. But this should have been considered in design phase.  Now he is stuck cleaning up the mess. But by breaking out the companies now and making them there own subset as opposed to atomic to the employee, he can work through the problem.

If he holds off, and just does the pattern matching it will just get worse when he has a typo of Inte1 with a 1 or 1BM. Or MicrOs0ft.

Just my $0.02
ASKER CERTIFIED SOLUTION
Avatar of jrb1
jrb1
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of AnuPutcha

ASKER

Yolking,

  Are you telling me to mak a synonym directory for company names? I din't quite understand your example of UNKNOWN above.

Can you make it more clear? All those are different forms of the word UNKNOWN?

Anu.