How can I use PHP to find the closest match?

Due to some naming convention inconsistencies, I'm in a situation where I have a list of company names that I pulled from a database and an array of [alleged] matching companies.

I'm aware of the levenshtein() function, but that compares the characters, not the words as a whole. I need to make sure that when a word matches, it is given more weight than a similar matching word that is not exact.

For example, ABC Technology should match ABC Technologies in whatever solution you experts suggest.
Using levenshtein(), however, CBAtechnolygo would be a better match than 'ABC Technologies' because it is closer in letters instead of words.

Hopefully this makes sense. Does PHP have any functions that can help me with this problem?
LVL 20
DVation191Asked:
Who is Participating?
 
Beverley PortlockCommented:
Here you go, here is a fragment of some code from a working script and it works a little differently from what I described, but you'll get the idea.

I have added your examples as test data.

Read the warnings in http://www.php.net/similar_text
<?php



function test( $string1, $string2 ) {
     $t1 = strtoupper( $string1 );
     $t2 = strtoupper( $string2 );
     
     $t1 = preg_replace('/[^0-9A-Z]+/', '', $t1 );
     $t2 = preg_replace('/[^0-9A-Z]+/', '', $t2 );
     
     // Chop both to match the shortest
     //
     $len = ( strlen($t1) < strlen($t2) ) ? strlen($t1) : strlen($t2);
     
     $t1 = substr( $t1, 0, $len );
     $t2 = substr( $t2, 0, $len );

     // This can get VERY expensive on large string and the larger
     // the string the worse it gets
     //
     $p = 0;
     similar_text( $t1, $t2, $p );

     // $p is a percentage, the bigger the levenshtein the worse the match
     // therefore dividing the percentage by the distance gives a factor
     //
     $factor = $p / levenshtein( $t1, $t2 );

     // Replace this with "return $factor". This is just for testing
     //
     echo "Factor is $factor for '$string1' and '$string2'<br/>";
}


test("ABC Technology", "ABC Technologies");
test("ABC Technology", "CBAtecnolygo");
test("ABC", "ZXY");

?>

Open in new window

0
 
Beverley PortlockCommented:
You need to accept that there is no perfect way of doing this, but you can improve the odds. For instance.

1. for comparison, convert all strings to the same case using strtoupper or strtolower - whichever you prefer

2. remove all non character data with preg_replace $a = preg_replace('/[^0-9A-Z]/', '', $data );  assuming you converted everything to uppercase

3. Chop them both to be the same length - whichever is shorter

4. Levenshtein the remaining strings and divide the distance by the length to get a number that allows more errors in a longer string.

Do this and run it against the dataset. Establish a cut-off value and print out the rejects. Adjust the cutoff value until you get a rejection/acceptance level you are happy with.
0
 
Beverley PortlockCommented:
Just realised I made a mistake when copying the code. Use this fragment, not the previous one.

Sorry...
<?php



function test( $string1, $string2 ) {
     $t1 = strtoupper( $string1 );
     $t2 = strtoupper( $string2 );
     
     $t1 = preg_replace('/[^0-9A-Z]+/', '', $t1 );
     $t2 = preg_replace('/[^0-9A-Z]+/', '', $t2 );
     
     // Chop both to match the shortest
     //
     $len = ( strlen($t1) < strlen($t2) ) ? strlen($t1) : strlen($t2);
     
     $t1 = substr( $t1, 0, $len );
     $t2 = substr( $t2, 0, $len );

     // This can get VERY expensive on large string and the larger
     // the string the worse it gets
     //
     $p = 0;
     similar_text( $t1, $t2, $p );

     // $p is a percentage, the bigger the levenshtein the worse the match
     // therefore dividing the percentage by the distance gives a factor
     //
     $factor = levenshtein( $t1, $t2 );
     if ( $factor == 0 )
          $factor = $p;
     else
          $factor = $p / $factor;     

     // Replace this with "return $factor". This is just for testing
     //
     echo "Factor is $factor for '$string1' and '$string2'<br/>";
}


test("ABC Technology", "ABC Technologies");
test("ABC Technology", "CBAtecnolygo");
test("ABC", "ZXY");

?>

Open in new window

0
Get your problem seen by more experts

Be seen. Boost your question’s priority for more expert views and faster solutions

 
Ray PaseurCommented:
Consider adding these three functions together:

soundex()
metaphone()
levenshtein()

I can't find my code snippet about this, but the concept is pretty simple.

Take each of the soundex and metaphone strings and run those into levenshtein() - you will be pleased to see how well the matches will work out.

Best, ~Ray
0
 
Ray PaseurCommented:
@DVation191, Can you please post some test data for us?  Thanks, ~Ray
0
 
Ray PaseurCommented:
There may be a strategy here...

If the levenshtein distance of both soundex and metaphone is zero, you likely have a good match.

If the levenshtein distance of both soundex and metaphone is positive, you likely have a mismatch.  

If levenshtein of soundex is zero, manual inspection makes sense.

Best wishes for the new year, ~Ray
<?php // RAY_soundalike_matching.php
error_reporting(E_ALL);
echo "<pre>\n";


// SOME ARE OBVIOUS
soundalike_matching("ABC Technology", "ABC Technologies");
soundalike_matching("ABC Technology", "CBAtecnolygo");
soundalike_matching("ABC", "ZXY");

// SOME LESS SO
soundalike_matching("VOLKSWAGEN AKTIENGESELLSCHAFT", "VOLKSWAGEN, A.G.");
soundalike_matching("Abbott Laboratories", "Abbott Labs");



// FUNCTION TO TRY TO MATCH NAMES
function soundalike_matching($str1, $str2)
{
    // MAKE ALL PLURALS
    $s1  = $str1 . 's';
    $s2  = $str2 . 's';

    // MAKE SOUNDEX
    $sx1 = soundex($s1);
    $sx2 = soundex($s2);

    // MAKE METAPHONE
    $mx1 = metaphone($s1);
    $mx2 = metaphone($s2);

    // USE LEVENSHTEIN
    $ls  = levenshtein($sx1, $sx2);
    $lm  = levenshtein($mx1, $mx2);

    // REPORT THE RESULTS
    echo "<table>\n";
    echo "<tr><td>WORDS: &nbsp;</td><td><strong>$str1 &nbsp;</strong></td><td><strong>$str2 &nbsp;</strong></td><td>LEVENSHTEIN</td></tr>\n";
    echo "<tr><td>SOUNDEX &nbsp;</td><td>$sx1 &nbsp;</td><td>$sx2 &nbsp;</td><td>$ls</td></tr>\n";
    echo "<tr><td>METAPHONE &nbsp;</td><td>$mx1 &nbsp;</td><td>$mx2 &nbsp;</td><td>$lm</td></tr>\n";
    echo "</table>\n";

}

Open in new window

0
 
DVation191Author Commented:
A great start - I'll take a look at building functions out of both suggestions and run it against my data and report back with the results.
@Ray_Paseur since I'm dealing with actual client company names, I need to keep the data private. I will post back with my findings however.

One additional note: In addition to my example in the original post, I'm also faced with scenarios where words are out of order. For example, I may be trying to find "Town of MyCity", but the data I'm searching has it listed as "MyCity, Town of". I look forward to seeing how soundex() and metaphone() will work in such cases.
0
 
Ray PaseurCommented:
Regarding this: "since I'm dealing with actual client company names" - how can you work without publicly available test data?  What are these companies that want to keep their names a secret? ;-)

How many such rows of data do you have?

Regarding this new line of inquiry, "Town of MyCity" vs "MyCity, Town of" - that sounds like a completely different question when compared to the original post which said, "example, ABC Technology should match ABC Technologies..."  You might want to post a different question about how to handle this new case.  It will not be with the algorithm I posted above, which is intended to be responsive to the original question.

For any of these ideas, I suggest you read the PHP man pages on the functions before you use them.  There is a wealth of information at your fingertips.

Best regards, ~Ray
0
 
Ray PaseurCommented:
@DVation191: Any updates for us here?  Have you tried the code sample I posted above? Thanks, ~Ray
0
 
DVation191Author Commented:
Sorry for the delay, I didn't use any specific solution but did take the ideas to build my own which seems to work well for the data set I have.

		$dir_result =  explode("\n", trim($dir_list));
		$highCount = 0;
		$qArr = explode(" ",$q); // company name into array
		foreach ($dir_result as $company) { 
			$companyArr = explode(" ",$company);
			$numIntersections = array_intersect($qArr,$companyArr);
			if ($highCount < count($numIntersections))
			{
				$highCount = count($numIntersections);
				$highArr = $companyArr;
			}			
		}		
		if(!isset($highArr)){ $highArr = array('No','matches','found'); }
		$acctMatch = implode(" ",$highArr);
		echo 'The best match I found was: ' . $acctMatch;	

Open in new window

0
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.