?
Solved

Fuzzy match of two table fields

Posted on 2012-08-29
5
Medium Priority
?
358 Views
Last Modified: 2012-09-27
I've got two tables, one that holds a company id, the name, and address, and another table that holds a company name and plant id.

The IDs are two different IDs and I need to write an update to the table to fill in the plant id to the main table.

The problem is, I really only have an address, phone number, and company name to work with.

I was able to match about 260 records by using the address and phone fields, but that leaves me with 3,000 records unmatched.

An example follows:

In one table, the company name is Weleetka Power Station
In the other table, the company is Weleetka Generating Station.

That is a pretty simple example

Then I run into an example where it's not always the first set of words that doesnt match.  

APCOMPOWER AT COMANCH is stored in one table, and Comanche Power Station.

Any help in the right direction is greatly appreciated.
0
Comment
Question by:t3chguy
  • 2
  • 2
5 Comments
 
LVL 11

Expert Comment

by:Slimshaneey
ID: 38345881
Well, I had a similar issue with deduplicating records on a table once. The solution I came up with that worked in 95% of cases for me was to create a hash, using The first 3-4 letters of the name, the postcode and domainname.com (these were fields that were present in all records). What that allowed me to do was ignore spelling differences, and make good guesses  on what were similar. You would also get the soundex values for the company name, and attach that to extra identifying data and match on that. There are some instances where you will need to manually intervene, as you higlighted one in your own example. It took me a few hours to do the remainders from the initially 1million or so records, which isnt that bad. Hopefully that has given you some idea of how to possibly approach this!
0
 
LVL 11

Expert Comment

by:Slimshaneey
ID: 38345890
You would also get the soundex values for the company name should read:
You COULD also get the soundex/metaphone (functions that give a phonetic value to words) values for the company name
0
 
LVL 111

Accepted Solution

by:
Ray Paseur earned 2000 total points
ID: 38350214
There are four PHP functions that can be helpful.  They are soundex(), metaphone, levenshtein() and similar_text().  You may want to read about these on PHP.net and consider how combinations of these can be useful.  You may also want to normalize any input strings to upper case.  MySQL is not case-sensitive by default, but the levenshtein() distance from 'abc' to 'ABC' is equal to strlen('abc') and when levenshtein() equals strlen() there is zero similarity.

Comanche Power Station will almost certainly be an outlier.

Here is a rough-cut at a comparison algorithm.  It uses equality of words, but you could add other tests, perhaps weighting the string equality highly and weighting the levenshtein distance of the metaphone strings at half of the string equality.

See http://www.laprbass.com/RAY_temp_t3chguy.php

<?php // RAY_temp_t3chguy.php
error_reporting(E_ALL);
echo '<pre>';

// A FUNCTION TO APPROXIMATE THE SIMILARITY
function similarity($a, $b)
{
    // NORMALIZE THE STRINGS
    $a = strtoupper($a);
    $b = strtoupper($b);

    $a = preg_replace('#[^A-Z0-9 ]#', NULL, $a);
    $b = preg_replace('#[^A-Z0-9 ]#', NULL, $b);

    $a = preg_replace('#\s\s+#', ' ', $a);
    $b = preg_replace('#\s\s+#', ' ', $b);

    // A MATCH RETURNS 100%
    if ($a == $b) return array('pct' => '1.0', 'mis' => '');

    // MAKE ARRAYS TO ISOLATE INDIVIDUAL WORDS
    $arr_a = explode(' ', $a);
    $arr_b = explode(' ', $b);

    // SCORE THE SIMILARITY
    $max = ( count($arr_a) + count($arr_b) ) / 2.0;
    $cnt = 0.0;
    foreach ($arr_a as $key_a => $a)
    {
        foreach ($arr_b as $key_b => $b)
        {
            if ($a == $b)
            {
                $cnt++;
                unset($arr_a[$key_a]);
                unset($arr_b[$key_b]);
            }
        }
    }

    // RETURN PERCENT AND MISMATCH
    $pct = number_format( ($cnt / $max), 2);
    $mis = array_merge($arr_a, $arr_b);
    $mis = implode(' ', $mis);
    return array('pct' => $pct, 'mis' => $mis);
}

// RUN SOME TESTS
$a = 'Weleetka Power Station';
$b = 'Weleetka Generating Station';

$x = similarity($a, $b);
echo PHP_EOL
. '<b>'
. $a
. '</b>'
. ' COMPARED TO '
. '<b>'
. $b
. '</b>'
. ' IS '
. $x['pct']
. '% SIMILAR, DIFFERING IN '
. '<i>'
. $x['mis']
. '</i>'
. PHP_EOL;
;

$x = similarity('Weleetka Power Station', 'Weleetka Station');
var_dump($x);

$x = similarity('Ray', 'RaY');
var_dump($x);

$x = similarity('Black Dog', 'Little white dog');
var_dump($x);

$x = similarity('my dog has fleas', 'chase the squirrels, Dog!');
var_dump($x);

$x = similarity('my dog has fleas', 'fleas has my dog!');
var_dump($x);

Open in new window

HTH, ~Ray
0
 
LVL 1

Author Comment

by:t3chguy
ID: 38351469
Thanks for that code sample, Ray.

I've modified it a bit to pull out of the database, but now even if there is a match, it either shows a 100% match or a 50% match, but I'm figuring this can't be fully accurate?

// A FUNCTION TO APPROXIMATE THE SIMILARITY
function similarity($a, $b)
{
    // NORMALIZE THE STRINGS
    $a = strtoupper($a);
    $b = strtoupper($b);

    $a = preg_replace('#[^A-Z0-9 ]#', NULL, $a);
    $b = preg_replace('#[^A-Z0-9 ]#', NULL, $b);

    $a = preg_replace('#\s\s+#', ' ', $a);
    $b = preg_replace('#\s\s+#', ' ', $b);

    // A MATCH RETURNS 100%
    if ($a == $b) return array('pct' => '1.0', 'mis' => '');

    // MAKE ARRAYS TO ISOLATE INDIVIDUAL WORDS
    $arr_a = explode(' ', $a);
    $arr_b = explode(' ', $b);

    // SCORE THE SIMILARITY
    $max = ( count($arr_a) + count($arr_b) ) / 2.0;
    $cnt = 0.0;
    foreach ($arr_a as $key_a => $a)
    {
        foreach ($arr_b as $key_b => $b)
        {
            if ($a == $b)
            {
                $cnt++;
                unset($arr_a[$key_a]);
                unset($arr_b[$key_b]);
            }
        }
    }

    // RETURN PERCENT AND MISMATCH
    $pct = number_format( ($cnt / $max), 2);
    $mis = array_merge($arr_a, $arr_b);
    $mis = implode(' ', $mis);
    return array('pct' => $pct, 'mis' => $mis);
}

$sql = "SELECT plant_name, plantid, phys_addr, phys_city, phys_zip FROM pr_plants_tempbuild WHERE nameid = '0'";
$res = mysql_query($sql) or die(mysql_error());
$num = mysql_num_rows($res);

echo '<table width="100%" cellspacing="0" cellpadding="0" border="0">';

for($z = 0; $z < $num; ++$z)
      {
      $plantname = mysql_result($res, $z, 0);
      $plantid = mysql_result($res, $z, 1);
      $physaddr = mysql_result($res, $z, 2);
      $physcity = mysql_result($res, $z, 3);
      $physzip = mysql_result($res, $z, 4);
      
      $plantname2 = $plantname;
      
      $plantname = explode(' ',trim($plantname));
        $a = $plantname[0];
        $a = $plantname[0] . ' ' . $plantname[1];
      
      $sql2 = "SELECT nameid, name, add1, town, postcode FROM dw_names WHERE name LIKE '%".mysql_real_escape_string($a)."%'";
      
      $res2 = mysql_query($sql2) or die(mysql_error());
      $row2 = mysql_fetch_row($res2);
      $num2 = mysql_num_rows($res2);

      
      $nameid = $row2[0];
      $coname = $row2[1];
      $dwadd = $row2[2];
      $dwcit = $row2[3];
      $dwzip = $row2[4];
      
      $coname2 = $coname;

      
      $coname = explode(' ',trim($coname));      
      $b = $coname[0];
      $b = $coname[0] . ' ' . $coname[1];
      
// RUN SOME TESTS
//$a = 'Weleetka Power Station';
//$b = 'Weleetka Generating Station';


$x = similarity($a, $b);
echo PHP_EOL

. '<td>' . $plantid . '</td><td><b>'
. $plantname2
. '</b></td>'
. ' <td>'.$physaddr . '</td><td>' . $physcity . '</td><td>' . $physzip . '</td>'
. '<td><b>'
. $coname2
. '</td></b>'
. '<td>' . $dwadd . '</td><td> ' . $dwcit . '</td><td> ' . $dw_zip . '</td><td> '
. $x['pct']
. '%</td><td>SIMILAR, DIFFERING IN </td>'
. '<td><i>'
. $x['mis']
. '</i></td><td>' . $nameid . '</td><td> ' . $num2 . '</td>'

. PHP_EOL;
;
echo '</tr>';
0
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 38352973
soundex(), metaphone(), levenshtein() and similar_text()
These can be used in some combination instead of a test for string equality.  There is a lot of programming to write if you want to use these function values, but my sense is that the programming is worth the effort.  I just don't have the time to write it all out right now.  But here is an offer:

Please post a CSV file containing two columns.  I will use your test data to generate a similarity value, and I'll post the code I used to do it.  It may take a day or two since I have out-of-town commitments over the holiday weekend, but I think we can come up with a good-enough algorithm when we replace string equality with other kinds of matching.

I would think a few dozen rows would be sufficient to make a good test.  Thanks and regards, ~Ray
0

Featured Post

Prep for the ITIL® Foundation Certification Exam

December’s Course of the Month is now available! Enroll to learn ITIL® Foundation best practices for delivering IT services effectively and efficiently.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Password hashing is better than message digests or encryption, and you should be using it instead of message digests or encryption.  Find out why and how in this article, which supplements the original article on PHP Client Registration, Login, Logo…
This holiday season, we’re giving away the gift of knowledge—tech knowledge, that is. Keep reading to see what hacks, tips, and trends we have wrapped and waiting for you under the tree.
The viewer will learn how to count occurrences of each item in an array.
The viewer will learn how to create a basic form using some HTML5 and PHP for later processing. Set up your basic HTML file. Open your form tag and set the method and action attributes.: (CODE) Set up your first few inputs one for the name and …
Suggested Courses
Course of the Month16 days, 8 hours left to enroll

864 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question