anamarif
asked on
Best way to find duplicate message in database
Hi,
I need to check if a similar message already exist in the database before inserting new record/message.
What's the fastest way to do so ?
also keep in mind I don't have to check exact same copy by similar record I mean two text which are over 90% identical.
Thanks in advance
I need to check if a similar message already exist in the database before inserting new record/message.
What's the fastest way to do so ?
also keep in mind I don't have to check exact same copy by similar record I mean two text which are over 90% identical.
Thanks in advance
ASKER
What I consider similar depends on what's lightweight to server, I am not very strict here to go through all complex algorithims to detect similarities.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
that's very nice, is it possible to include numbers as well,like in case of this :-
$msg1 = "Please deposit $10000 into my account.";
$msg2 = "Please deposit $1 into my account.";
these aren't considered similar due to different numbers, can we use metaphone approach with numbers as well?
$msg1 = "Please deposit $10000 into my account.";
$msg2 = "Please deposit $1 into my account.";
these aren't considered similar due to different numbers, can we use metaphone approach with numbers as well?
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Some time back I made up some thought exercises about this subject. Not all of this script will work, but I think you can see where the idea was going.
<?php // demo/similar_strings.php
error_reporting(E_ALL);
echo "<pre>" . PHP_EOL;
// SHOW SOME WAYS OF LOOKING AT IMPRECISELY MATCHED INFORMATION
// THINGS TO COMPARE
$strings = array
( array( '12345.1', '12345. 1' )
, array( 'kitten', 'kitty' )
, array( 'CATALOG', 'Catalog' )
, array( 'cell', 'sell' )
, array( 'super', 'souper' )
, array( 'mi niña', 'mi nina' )
, array( 'mi niña', 'mi ninia' )
, array( 'ça va!', 'ca va!' )
, array( 'Plaçe', 'Place' )
, array( 'ça va!', 'sa va!' )
, array( 'ca va!', 'sa va!' )
, array( 'Yeehah', 'Yee-hah' )
, array( 'toxic', 'poisonous' )
, array( 'glad', 'unglad' )
, array( 'McLean', 'MacLean' )
)
;
foreach ($strings as $string)
{
$x = $string[0];
$y = $string[1];
echo PHP_EOL;
echo "<b>COMPARING $x TO $y</b>";
echo PHP_EOL;
// http://php.net/manual/en/function.soundex.php
echo PHP_EOL;
echo 'SOUNDEX()';
echo PHP_EOL;
var_dump(soundex($x));
var_dump(soundex($y));
// http://php.net/manual/en/function.metaphone.php
echo PHP_EOL;
echo 'METAPHONE()';
echo PHP_EOL;
var_dump(metaphone($x));
var_dump(metaphone($y));
// http://php.net/manual/en/function.levenshtein.php
echo PHP_EOL;
echo 'LEVENSHTEIN()';
echo PHP_EOL;
var_dump(levenshtein($x, $y));
// http://php.net/manual/en/function.similar-text.php
echo PHP_EOL;
echo 'SIMILAR_TEXT()';
echo PHP_EOL;
echo "COMMON CHARACTERS: " . similar_text($x, $y);
echo PHP_EOL;
similar_text($x, $y, &$pct);
echo "PERCENT ALIKE: " . number_format($pct,1) . '%';
echo PHP_EOL;
}
// WORK IN PROGRESS -- INCOMPLETE ON 2012-06-14
// A FUNCTION TO COMPUTE A CONSOLIDATED SCORE FOR SIMILARITY
function sameThing($x, $y, $normal=TRUE)
{
// NORMALIZE = UPPER CASE LETTERS ONLY, SINGLE SPACES
if ($normal)
{
$x = strtoupper($x);
$y = strtoupper($y);
$r = '/[^A-Z ]/';
$x = preg_replace($r, ' ', $x);
$y = preg_replace($r, ' ', $y);
$r = '/\s\s+/';
$x = preg_replace($r, ' ', $x);
$y = preg_replace($r, ' ', $y);
$x = trim($x);
$y = trim($y);
}
$sxl = ( levenshtein(soundex($x), soundex($y)) );
$sxm = ( levenshtein(metaphone($x), metaphone($y)) );
echo PHP_EOL . "$x vs $y Soundex Levenshtein=$sxl Metaphone Levenshtein=$sxm";
}
foreach ($strings as $string)
{
sameThing($string[0], $string[1]);
}
HTH, ~Ray
please clarify this, as that is the key to whatever solution you want to apply.