Want to protect your cyber security and still get fast solutions? Ask a secure question today.Go Premium

x
?
Solved

How can I use PHP to find the closest match?

Posted on 2009-12-30
10
Medium Priority
?
417 Views
Last Modified: 2012-06-27
Due to some naming convention inconsistencies, I'm in a situation where I have a list of company names that I pulled from a database and an array of [alleged] matching companies.

I'm aware of the levenshtein() function, but that compares the characters, not the words as a whole. I need to make sure that when a word matches, it is given more weight than a similar matching word that is not exact.

For example, ABC Technology should match ABC Technologies in whatever solution you experts suggest.
Using levenshtein(), however, CBAtechnolygo would be a better match than 'ABC Technologies' because it is closer in letters instead of words.

Hopefully this makes sense. Does PHP have any functions that can help me with this problem?
0
Comment
Question by:DVation191
  • 5
  • 3
  • 2
10 Comments
 
LVL 34

Expert Comment

by:Beverley Portlock
ID: 26145944
You need to accept that there is no perfect way of doing this, but you can improve the odds. For instance.

1. for comparison, convert all strings to the same case using strtoupper or strtolower - whichever you prefer

2. remove all non character data with preg_replace $a = preg_replace('/[^0-9A-Z]/', '', $data );  assuming you converted everything to uppercase

3. Chop them both to be the same length - whichever is shorter

4. Levenshtein the remaining strings and divide the distance by the length to get a number that allows more errors in a longer string.

Do this and run it against the dataset. Establish a cut-off value and print out the rejects. Adjust the cutoff value until you get a rejection/acceptance level you are happy with.
0
 
LVL 34

Accepted Solution

by:
Beverley Portlock earned 1000 total points
ID: 26146195
Here you go, here is a fragment of some code from a working script and it works a little differently from what I described, but you'll get the idea.

I have added your examples as test data.

Read the warnings in http://www.php.net/similar_text
<?php



function test( $string1, $string2 ) {
     $t1 = strtoupper( $string1 );
     $t2 = strtoupper( $string2 );
     
     $t1 = preg_replace('/[^0-9A-Z]+/', '', $t1 );
     $t2 = preg_replace('/[^0-9A-Z]+/', '', $t2 );
     
     // Chop both to match the shortest
     //
     $len = ( strlen($t1) < strlen($t2) ) ? strlen($t1) : strlen($t2);
     
     $t1 = substr( $t1, 0, $len );
     $t2 = substr( $t2, 0, $len );

     // This can get VERY expensive on large string and the larger
     // the string the worse it gets
     //
     $p = 0;
     similar_text( $t1, $t2, $p );

     // $p is a percentage, the bigger the levenshtein the worse the match
     // therefore dividing the percentage by the distance gives a factor
     //
     $factor = $p / levenshtein( $t1, $t2 );

     // Replace this with "return $factor". This is just for testing
     //
     echo "Factor is $factor for '$string1' and '$string2'<br/>";
}


test("ABC Technology", "ABC Technologies");
test("ABC Technology", "CBAtecnolygo");
test("ABC", "ZXY");

?>

Open in new window

0
 
LVL 34

Expert Comment

by:Beverley Portlock
ID: 26146239
Just realised I made a mistake when copying the code. Use this fragment, not the previous one.

Sorry...
<?php



function test( $string1, $string2 ) {
     $t1 = strtoupper( $string1 );
     $t2 = strtoupper( $string2 );
     
     $t1 = preg_replace('/[^0-9A-Z]+/', '', $t1 );
     $t2 = preg_replace('/[^0-9A-Z]+/', '', $t2 );
     
     // Chop both to match the shortest
     //
     $len = ( strlen($t1) < strlen($t2) ) ? strlen($t1) : strlen($t2);
     
     $t1 = substr( $t1, 0, $len );
     $t2 = substr( $t2, 0, $len );

     // This can get VERY expensive on large string and the larger
     // the string the worse it gets
     //
     $p = 0;
     similar_text( $t1, $t2, $p );

     // $p is a percentage, the bigger the levenshtein the worse the match
     // therefore dividing the percentage by the distance gives a factor
     //
     $factor = levenshtein( $t1, $t2 );
     if ( $factor == 0 )
          $factor = $p;
     else
          $factor = $p / $factor;     

     // Replace this with "return $factor". This is just for testing
     //
     echo "Factor is $factor for '$string1' and '$string2'<br/>";
}


test("ABC Technology", "ABC Technologies");
test("ABC Technology", "CBAtecnolygo");
test("ABC", "ZXY");

?>

Open in new window

0
Concerto's Cloud Advisory Services

Want to avoid the missteps to gaining all the benefits of the cloud? Learn more about the different assessment options from our Cloud Advisory team.

 
LVL 111

Expert Comment

by:Ray Paseur
ID: 26146338
Consider adding these three functions together:

soundex()
metaphone()
levenshtein()

I can't find my code snippet about this, but the concept is pretty simple.

Take each of the soundex and metaphone strings and run those into levenshtein() - you will be pleased to see how well the matches will work out.

Best, ~Ray
0
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 26146474
@DVation191, Can you please post some test data for us?  Thanks, ~Ray
0
 
LVL 111

Assisted Solution

by:Ray Paseur
Ray Paseur earned 1000 total points
ID: 26146705
There may be a strategy here...

If the levenshtein distance of both soundex and metaphone is zero, you likely have a good match.

If the levenshtein distance of both soundex and metaphone is positive, you likely have a mismatch.  

If levenshtein of soundex is zero, manual inspection makes sense.

Best wishes for the new year, ~Ray
<?php // RAY_soundalike_matching.php
error_reporting(E_ALL);
echo "<pre>\n";


// SOME ARE OBVIOUS
soundalike_matching("ABC Technology", "ABC Technologies");
soundalike_matching("ABC Technology", "CBAtecnolygo");
soundalike_matching("ABC", "ZXY");

// SOME LESS SO
soundalike_matching("VOLKSWAGEN AKTIENGESELLSCHAFT", "VOLKSWAGEN, A.G.");
soundalike_matching("Abbott Laboratories", "Abbott Labs");



// FUNCTION TO TRY TO MATCH NAMES
function soundalike_matching($str1, $str2)
{
    // MAKE ALL PLURALS
    $s1  = $str1 . 's';
    $s2  = $str2 . 's';

    // MAKE SOUNDEX
    $sx1 = soundex($s1);
    $sx2 = soundex($s2);

    // MAKE METAPHONE
    $mx1 = metaphone($s1);
    $mx2 = metaphone($s2);

    // USE LEVENSHTEIN
    $ls  = levenshtein($sx1, $sx2);
    $lm  = levenshtein($mx1, $mx2);

    // REPORT THE RESULTS
    echo "<table>\n";
    echo "<tr><td>WORDS: &nbsp;</td><td><strong>$str1 &nbsp;</strong></td><td><strong>$str2 &nbsp;</strong></td><td>LEVENSHTEIN</td></tr>\n";
    echo "<tr><td>SOUNDEX &nbsp;</td><td>$sx1 &nbsp;</td><td>$sx2 &nbsp;</td><td>$ls</td></tr>\n";
    echo "<tr><td>METAPHONE &nbsp;</td><td>$mx1 &nbsp;</td><td>$mx2 &nbsp;</td><td>$lm</td></tr>\n";
    echo "</table>\n";

}

Open in new window

0
 
LVL 20

Author Comment

by:DVation191
ID: 26147250
A great start - I'll take a look at building functions out of both suggestions and run it against my data and report back with the results.
@Ray_Paseur since I'm dealing with actual client company names, I need to keep the data private. I will post back with my findings however.

One additional note: In addition to my example in the original post, I'm also faced with scenarios where words are out of order. For example, I may be trying to find "Town of MyCity", but the data I'm searching has it listed as "MyCity, Town of". I look forward to seeing how soundex() and metaphone() will work in such cases.
0
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 26147467
Regarding this: "since I'm dealing with actual client company names" - how can you work without publicly available test data?  What are these companies that want to keep their names a secret? ;-)

How many such rows of data do you have?

Regarding this new line of inquiry, "Town of MyCity" vs "MyCity, Town of" - that sounds like a completely different question when compared to the original post which said, "example, ABC Technology should match ABC Technologies..."  You might want to post a different question about how to handle this new case.  It will not be with the algorithm I posted above, which is intended to be responsive to the original question.

For any of these ideas, I suggest you read the PHP man pages on the functions before you use them.  There is a wealth of information at your fingertips.

Best regards, ~Ray
0
 
LVL 111

Expert Comment

by:Ray Paseur
ID: 26167324
@DVation191: Any updates for us here?  Have you tried the code sample I posted above? Thanks, ~Ray
0
 
LVL 20

Author Comment

by:DVation191
ID: 26304698
Sorry for the delay, I didn't use any specific solution but did take the ideas to build my own which seems to work well for the data set I have.

		$dir_result =  explode("\n", trim($dir_list));
		$highCount = 0;
		$qArr = explode(" ",$q); // company name into array
		foreach ($dir_result as $company) { 
			$companyArr = explode(" ",$company);
			$numIntersections = array_intersect($qArr,$companyArr);
			if ($highCount < count($numIntersections))
			{
				$highCount = count($numIntersections);
				$highArr = $companyArr;
			}			
		}		
		if(!isset($highArr)){ $highArr = array('No','matches','found'); }
		$acctMatch = implode(" ",$highArr);
		echo 'The best match I found was: ' . $acctMatch;	

Open in new window

0

Featured Post

Free Tool: Path Explorer

An intuitive utility to help find the CSS path to UI elements on a webpage. These paths are used frequently in a variety of front-end development and QA automation tasks.

One of a set of tools we're offering as a way of saying thank you for being a part of the community.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Password hashing is better than message digests or encryption, and you should be using it instead of message digests or encryption.  Find out why and how in this article, which supplements the original article on PHP Client Registration, Login, Logo‚Ķ
Many old projects have bad code, but the budget doesn't exist to rewrite the codebase. You can update this code to be safer by introducing contemporary input validation, sanitation, and safer database queries.
The viewer will learn how to look for a specific file type in a local or remote server directory using PHP.
This tutorial will teach you the core code needed to finalize the addition of a watermark to your image. The viewer will use a small PHP class to learn and create a watermark.
Suggested Courses

564 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question