Link to home
Start Free TrialLog in
Avatar of t3chguy
t3chguyFlag for United States of America

asked on

Fuzzy match of two table fields

I've got two tables, one that holds a company id, the name, and address, and another table that holds a company name and plant id.

The IDs are two different IDs and I need to write an update to the table to fill in the plant id to the main table.

The problem is, I really only have an address, phone number, and company name to work with.

I was able to match about 260 records by using the address and phone fields, but that leaves me with 3,000 records unmatched.

An example follows:

In one table, the company name is Weleetka Power Station
In the other table, the company is Weleetka Generating Station.

That is a pretty simple example

Then I run into an example where it's not always the first set of words that doesnt match.  

APCOMPOWER AT COMANCH is stored in one table, and Comanche Power Station.

Any help in the right direction is greatly appreciated.
Avatar of Slimshaneey
Slimshaneey
Flag of United Kingdom of Great Britain and Northern Ireland image

Well, I had a similar issue with deduplicating records on a table once. The solution I came up with that worked in 95% of cases for me was to create a hash, using The first 3-4 letters of the name, the postcode and domainname.com (these were fields that were present in all records). What that allowed me to do was ignore spelling differences, and make good guesses  on what were similar. You would also get the soundex values for the company name, and attach that to extra identifying data and match on that. There are some instances where you will need to manually intervene, as you higlighted one in your own example. It took me a few hours to do the remainders from the initially 1million or so records, which isnt that bad. Hopefully that has given you some idea of how to possibly approach this!
You would also get the soundex values for the company name should read:
You COULD also get the soundex/metaphone (functions that give a phonetic value to words) values for the company name
ASKER CERTIFIED SOLUTION
Avatar of Ray Paseur
Ray Paseur
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of t3chguy

ASKER

Thanks for that code sample, Ray.

I've modified it a bit to pull out of the database, but now even if there is a match, it either shows a 100% match or a 50% match, but I'm figuring this can't be fully accurate?

// A FUNCTION TO APPROXIMATE THE SIMILARITY
function similarity($a, $b)
{
    // NORMALIZE THE STRINGS
    $a = strtoupper($a);
    $b = strtoupper($b);

    $a = preg_replace('#[^A-Z0-9 ]#', NULL, $a);
    $b = preg_replace('#[^A-Z0-9 ]#', NULL, $b);

    $a = preg_replace('#\s\s+#', ' ', $a);
    $b = preg_replace('#\s\s+#', ' ', $b);

    // A MATCH RETURNS 100%
    if ($a == $b) return array('pct' => '1.0', 'mis' => '');

    // MAKE ARRAYS TO ISOLATE INDIVIDUAL WORDS
    $arr_a = explode(' ', $a);
    $arr_b = explode(' ', $b);

    // SCORE THE SIMILARITY
    $max = ( count($arr_a) + count($arr_b) ) / 2.0;
    $cnt = 0.0;
    foreach ($arr_a as $key_a => $a)
    {
        foreach ($arr_b as $key_b => $b)
        {
            if ($a == $b)
            {
                $cnt++;
                unset($arr_a[$key_a]);
                unset($arr_b[$key_b]);
            }
        }
    }

    // RETURN PERCENT AND MISMATCH
    $pct = number_format( ($cnt / $max), 2);
    $mis = array_merge($arr_a, $arr_b);
    $mis = implode(' ', $mis);
    return array('pct' => $pct, 'mis' => $mis);
}

$sql = "SELECT plant_name, plantid, phys_addr, phys_city, phys_zip FROM pr_plants_tempbuild WHERE nameid = '0'";
$res = mysql_query($sql) or die(mysql_error());
$num = mysql_num_rows($res);

echo '<table width="100%" cellspacing="0" cellpadding="0" border="0">';

for($z = 0; $z < $num; ++$z)
      {
      $plantname = mysql_result($res, $z, 0);
      $plantid = mysql_result($res, $z, 1);
      $physaddr = mysql_result($res, $z, 2);
      $physcity = mysql_result($res, $z, 3);
      $physzip = mysql_result($res, $z, 4);
      
      $plantname2 = $plantname;
      
      $plantname = explode(' ',trim($plantname));
        $a = $plantname[0];
        $a = $plantname[0] . ' ' . $plantname[1];
      
      $sql2 = "SELECT nameid, name, add1, town, postcode FROM dw_names WHERE name LIKE '%".mysql_real_escape_string($a)."%'";
      
      $res2 = mysql_query($sql2) or die(mysql_error());
      $row2 = mysql_fetch_row($res2);
      $num2 = mysql_num_rows($res2);

      
      $nameid = $row2[0];
      $coname = $row2[1];
      $dwadd = $row2[2];
      $dwcit = $row2[3];
      $dwzip = $row2[4];
      
      $coname2 = $coname;

      
      $coname = explode(' ',trim($coname));      
      $b = $coname[0];
      $b = $coname[0] . ' ' . $coname[1];
      
// RUN SOME TESTS
//$a = 'Weleetka Power Station';
//$b = 'Weleetka Generating Station';


$x = similarity($a, $b);
echo PHP_EOL

. '<td>' . $plantid . '</td><td><b>'
. $plantname2
. '</b></td>'
. ' <td>'.$physaddr . '</td><td>' . $physcity . '</td><td>' . $physzip . '</td>'
. '<td><b>'
. $coname2
. '</td></b>'
. '<td>' . $dwadd . '</td><td> ' . $dwcit . '</td><td> ' . $dw_zip . '</td><td> '
. $x['pct']
. '%</td><td>SIMILAR, DIFFERING IN </td>'
. '<td><i>'
. $x['mis']
. '</i></td><td>' . $nameid . '</td><td> ' . $num2 . '</td>'

. PHP_EOL;
;
echo '</tr>';
soundex(), metaphone(), levenshtein() and similar_text()
These can be used in some combination instead of a test for string equality.  There is a lot of programming to write if you want to use these function values, but my sense is that the programming is worth the effort.  I just don't have the time to write it all out right now.  But here is an offer:

Please post a CSV file containing two columns.  I will use your test data to generate a similarity value, and I'll post the code I used to do it.  It may take a day or two since I have out-of-town commitments over the holiday weekend, but I think we can come up with a good-enough algorithm when we replace string equality with other kinds of matching.

I would think a few dozen rows would be sufficient to make a good test.  Thanks and regards, ~Ray