Link to home
Start Free TrialLog in
Avatar of anamarif
anamarif

asked on

Best way to find duplicate message in database

Hi,

I need to check if a similar message already exist in the database before inserting new record/message.

What's the fastest way to do so ?

also keep in mind I don't have to check exact same copy by similar record I mean two text which are over 90% identical.

Thanks in advance
Avatar of Guy Hengel [angelIII / a3]
Guy Hengel [angelIII / a3]
Flag of Luxembourg image

it really depends on what you (want to) consider as "duplicate/similar"
please clarify this, as that is the key to whatever solution you want to apply.
Avatar of anamarif
anamarif

ASKER

What I consider similar depends on what's lightweight to server, I am not very strict here to go through all complex algorithims to detect similarities.
ASKER CERTIFIED SOLUTION
Avatar of gr8gonzo
gr8gonzo
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
that's very nice, is it possible to include numbers as well,like in case of this :-

$msg1 = "Please deposit $10000 into my account.";
$msg2 = "Please deposit $1 into my account.";

these aren't considered similar due to different numbers, can we use metaphone approach with numbers as well?
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Some time back I made up some thought exercises about this subject.  Not all of this script will work, but I think you can see where the idea was going.

<?php // demo/similar_strings.php
error_reporting(E_ALL);
echo "<pre>" . PHP_EOL;


// SHOW SOME WAYS OF LOOKING AT IMPRECISELY MATCHED INFORMATION


// THINGS TO COMPARE
$strings = array
( array( '12345.1', '12345. 1'  )
, array( 'kitten',  'kitty'     )
, array( 'CATALOG', 'Catalog'   )
, array( 'cell',    'sell'      )
, array( 'super',   'souper'    )
, array( 'mi niña', 'mi nina'   )
, array( 'mi niña', 'mi ninia'  )
, array( 'ça va!',  'ca va!'    )
, array( 'Plaçe',   'Place'     )
, array( 'ça va!',  'sa va!'    )
, array( 'ca va!',  'sa va!'    )
, array( 'Yeehah',  'Yee-hah'   )
, array( 'toxic',   'poisonous' )
, array( 'glad',    'unglad'    )
, array( 'McLean',  'MacLean'   )
)
;

foreach ($strings as $string)
{
    $x = $string[0];
    $y = $string[1];
    echo PHP_EOL;
    echo "<b>COMPARING $x TO $y</b>";
    echo PHP_EOL;


    // http://php.net/manual/en/function.soundex.php
    echo PHP_EOL;
    echo 'SOUNDEX()';
    echo PHP_EOL;
    var_dump(soundex($x));
    var_dump(soundex($y));

    // http://php.net/manual/en/function.metaphone.php
    echo PHP_EOL;
    echo 'METAPHONE()';
    echo PHP_EOL;
    var_dump(metaphone($x));
    var_dump(metaphone($y));

    // http://php.net/manual/en/function.levenshtein.php
    echo PHP_EOL;
    echo 'LEVENSHTEIN()';
    echo PHP_EOL;
    var_dump(levenshtein($x, $y));

    // http://php.net/manual/en/function.similar-text.php
    echo PHP_EOL;
    echo 'SIMILAR_TEXT()';
    echo PHP_EOL;
    echo "COMMON CHARACTERS: " . similar_text($x, $y);
    echo PHP_EOL;
    similar_text($x, $y, &$pct);
    echo "PERCENT ALIKE: " . number_format($pct,1) . '%';

    echo PHP_EOL;
}


// WORK IN PROGRESS -- INCOMPLETE ON 2012-06-14
// A FUNCTION TO COMPUTE A CONSOLIDATED SCORE FOR SIMILARITY
function sameThing($x, $y, $normal=TRUE)
{
    // NORMALIZE = UPPER CASE LETTERS ONLY, SINGLE SPACES
    if ($normal)
    {
        $x = strtoupper($x);
        $y = strtoupper($y);
        $r = '/[^A-Z ]/';
        $x = preg_replace($r, ' ', $x);
        $y = preg_replace($r, ' ', $y);
        $r = '/\s\s+/';
        $x = preg_replace($r, ' ', $x);
        $y = preg_replace($r, ' ', $y);
        $x = trim($x);
        $y = trim($y);
    }

    $sxl = ( levenshtein(soundex($x),   soundex($y))   );
    $sxm = ( levenshtein(metaphone($x), metaphone($y)) );

    echo PHP_EOL . "$x vs $y Soundex Levenshtein=$sxl Metaphone Levenshtein=$sxm";
}

foreach ($strings as $string)
{
    sameThing($string[0], $string[1]);
}

Open in new window

HTH, ~Ray