Solved

Best way to find duplicate message in database

Posted on 2014-03-10
7
35 Views
Last Modified: 2016-05-18
Hi,

I need to check if a similar message already exist in the database before inserting new record/message.

What's the fastest way to do so ?

also keep in mind I don't have to check exact same copy by similar record I mean two text which are over 90% identical.

Thanks in advance
0
Comment
Question by:anamarif
7 Comments
 
LVL 143

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 39917943
it really depends on what you (want to) consider as "duplicate/similar"
please clarify this, as that is the key to whatever solution you want to apply.
0
 

Author Comment

by:anamarif
ID: 39918019
What I consider similar depends on what's lightweight to server, I am not very strict here to go through all complex algorithims to detect similarities.
0
 
LVL 34

Accepted Solution

by:
gr8gonzo earned 500 total points
ID: 39918167
1. You shouldn't try to figure out an algorithm based on "what's lightweight" - you might find an algorithm that does something lightweight but it almost never works, so you just added another unnecessary straw to the camel's back and wasted coding time.

You should know upfront how similar two messages should be before they can be considered duplicates, and whether or not certain data can be excluded. For example, I could send two identical email messages to someone back to back and they would still be different because of the email headers (date/times, etc...).

2. You might also have two messages that are identical except for numbers ("Deposit $100000 into my account" and "Deposit $1" into my account"), so are those going to be duplicates or do numbers count?

3. When all you are analyzing is comprised of words, I typically approach it with a metaphone and a hash.

Metaphone is an algorithm that basically produces a code that represents what a string sounds like, which can help detect similarities between two strings, even if one is slightly different You could loop through a message and build a metaphone version of the message:

<?php

$msg1 = "Please deposit $10000 into my account.";
$msg2 = "Please deposit $1 into my account.";

echo metaphone($msg1) . "\n"; // Produces: PLSTPSTNTMKKNT
echo metaphone($msg2) . "\n"; // Produces: PLSTPSTNTMKKNT (identical)

?>

Open in new window


Of course, the metaphone could get REALLY long if you're processing a long message, so hashing it with SHA-1 will produce a 40-character string that -represents- the full metaphone:


<?php

$msg1 = "Please deposit $10000 into my account. Also, Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";

$msg2 = "Please deposit $1 into my account.. Also, lorem IPSUM dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";

echo sha1(metaphone($msg1)) . "\n"; // Produces: 8b49c819cbc1f4667fff40f758643f0d6b6e0f13
echo sha1(metaphone($msg2)) . "\n"; // Produces: 8b49c819cbc1f4667fff40f758643f0d6b6e0f13 (identical)

?>

Open in new window


You can store that hash in a separate column (indexed!!!) in the same record as the message in the database. Then, later, before you insert another message, just do a search against that column for the hash of your new message:

SELECT id FROM your_messages_table WHERE message_hash = '<SHA-1 hash of the new message's metaphone>';

If any IDs come back, you have a VERY high probability of having a duplicate message.
0
PRTG Network Monitor: Intuitive Network Monitoring

Network Monitoring is essential to ensure that computer systems and network devices are running. Use PRTG to monitor LANs, servers, websites, applications and devices, bandwidth, virtual environments, remote systems, IoT, and many more. PRTG is easy to set up & use.

 

Author Comment

by:anamarif
ID: 39918179
that's very nice, is it possible to include numbers as well,like in case of this :-

$msg1 = "Please deposit $10000 into my account.";
$msg2 = "Please deposit $1 into my account.";

these aren't considered similar due to different numbers, can we use metaphone approach with numbers as well?
0
 
LVL 34

Assisted Solution

by:gr8gonzo
gr8gonzo earned 500 total points
ID: 39918365
Sort of, but you also have to make sure you're not analyzing content that will ALWAYS be different (e.g. automatically-generated dates/times / timestamps) otherwise you will always have a different result, even if the actual message content is the same.

I would probably take one of two approaches:

APPROACH 1: LESS-FLEXIBLE, BUT SIMPLER
Convert all your numbers into words. This person created a function called "convert_number_to_words" for doing this:
http://www.karlrixon.co.uk/writing/convert-numbers-to-words-with-php/

Drop that into a script and then add this function:
function convert_all_numbers_to_words($str)
{
	return preg_replace_callback("/\d+/",function ($num) { return convert_number_to_words($num[0]); },$str);
}

Open in new window


Then you call the convert_all_numbers_to_words on the original string before you do the metaphone stuff. The result is that $10000 will change into "$ten thousand" and $1 will change into "$one". The resulting metaphone and hashes will be different:

<?php

$msg1 = "Please deposit $10000 into my account.";
$msg2 = "Please deposit $1 into my account.";

echo sha1(metaphone(convert_all_numbers_to_words($msg1))) . "\n"; // Produces: 890ba738be10fdc7e69192777b064ad2b685c030
echo sha1(metaphone(convert_all_numbers_to_words($msg2))) . "\n"; // Produces: 419f214dbedae8defccfa14dd1e067850ec924db (different)

function convert_all_numbers_to_words($str)
{
	return preg_replace_callback("/\d+/",function ($num) { return convert_number_to_words($num[0]); },$str);
}

function convert_number_to_words($number) {
    
    $hyphen      = '-';
    $conjunction = ' and ';
    $separator   = ', ';
    $negative    = 'negative ';
    $decimal     = ' point ';
    $dictionary  = array(
        0                   => 'zero',
        1                   => 'one',
        2                   => 'two',
        3                   => 'three',
        4                   => 'four',
        5                   => 'five',
        6                   => 'six',
        7                   => 'seven',
        8                   => 'eight',
        9                   => 'nine',
        10                  => 'ten',
        11                  => 'eleven',
        12                  => 'twelve',
        13                  => 'thirteen',
        14                  => 'fourteen',
        15                  => 'fifteen',
        16                  => 'sixteen',
        17                  => 'seventeen',
        18                  => 'eighteen',
        19                  => 'nineteen',
        20                  => 'twenty',
        30                  => 'thirty',
        40                  => 'fourty',
        50                  => 'fifty',
        60                  => 'sixty',
        70                  => 'seventy',
        80                  => 'eighty',
        90                  => 'ninety',
        100                 => 'hundred',
        1000                => 'thousand',
        1000000             => 'million',
        1000000000          => 'billion',
        1000000000000       => 'trillion',
        1000000000000000    => 'quadrillion',
        1000000000000000000 => 'quintillion'
    );
    
    if (!is_numeric($number)) {
        return false;
    }
    
    if (($number >= 0 && (int) $number < 0) || (int) $number < 0 - PHP_INT_MAX) {
        // overflow
        trigger_error(
            'convert_number_to_words only accepts numbers between -' . PHP_INT_MAX . ' and ' . PHP_INT_MAX,
            E_USER_WARNING
        );
        return false;
    }

    if ($number < 0) {
        return $negative . convert_number_to_words(abs($number));
    }
    
    $string = $fraction = null;
    
    if (strpos($number, '.') !== false) {
        list($number, $fraction) = explode('.', $number);
    }
    
    switch (true) {
        case $number < 21:
            $string = $dictionary[$number];
            break;
        case $number < 100:
            $tens   = ((int) ($number / 10)) * 10;
            $units  = $number % 10;
            $string = $dictionary[$tens];
            if ($units) {
                $string .= $hyphen . $dictionary[$units];
            }
            break;
        case $number < 1000:
            $hundreds  = $number / 100;
            $remainder = $number % 100;
            $string = $dictionary[$hundreds] . ' ' . $dictionary[100];
            if ($remainder) {
                $string .= $conjunction . convert_number_to_words($remainder);
            }
            break;
        default:
            $baseUnit = pow(1000, floor(log($number, 1000)));
            $numBaseUnits = (int) ($number / $baseUnit);
            $remainder = $number % $baseUnit;
            $string = convert_number_to_words($numBaseUnits) . ' ' . $dictionary[$baseUnit];
            if ($remainder) {
                $string .= $remainder < 100 ? $conjunction : $separator;
                $string .= convert_number_to_words($remainder);
            }
            break;
    }
    
    if (null !== $fraction && is_numeric($fraction)) {
        $string .= $decimal;
        $words = array();
        foreach (str_split((string) $fraction) as $number) {
            $words[] = $dictionary[$number];
        }
        $string .= implode(' ', $words);
    }
    
    return $string;
}

Open in new window


APPROACH 2: MORE FLEXIBLE, DEEPER PROCESSING
A Levenshtein distance is basically the number of changes that have to occur for one string to turn into another. So the Levenshtein difference between "apple" and "apples" is 1 (because one change is necessary - to add an "s"). By extracting all the numbers from a string and making them all into one giant string, you can compare the numbers to determine actual numeric similarity:

Please deposit $10000 and $20 into my account.
Please deposit $1000 and $50 into my account.

Based on this algorithm, the messages are identical, but the numeric values are only 75% similar:

10000,50
1000,20

...has a Levenshtein distance of 2 (divided by the length of the longest string, 8 = 2/8 = 1/4 = 25% different = 75% the same).

In some cases, it can help to identify the overall message WORD similarity first (so you can identify whether or not you're getting bulk messages that all follow the same pattern except for special characters and numbers and such), and then if you find any duplicates, pull those duplicates to check the numeric similarity (or even store the extracted numbers separately to save processing time).

For example:
<?php

$msg1 = "Please deposit $10000 and $20 into my account.";
$msg2 = "Please deposit $1000 and $50 into my account.";

echo sha1(metaphone($msg1)) . "\n"; // Produces: 518a8ac831932b655ca775ca14300df152d36ef7
echo sha1(metaphone($msg2)) . "\n"; // Produces: 518a8ac831932b655ca775ca14300df152d36ef7 (identical)

echo numeric_similarity($msg1,$msg2) . "% similar numeric values.\n"; // Produces: 75%  similar numeric values

function numeric_similarity($str1,$str2)
{
	$allNumbers1 = extract_all_numbers($str1);
	$allNumbers2 = extract_all_numbers($str2);
	$maxStrLen = max(strlen($allNumbers1),strlen($allNumbers2));
	return abs(1-(levenshtein($allNumbers1,$allNumbers2)/$maxStrLen))*100;
}

function extract_all_numbers($str)
{
	if(preg_match_all("/\d+/",$str,$matches))
	{
		return implode(",",$matches[0]);
	}
	return "";
}

Open in new window


Again, it depends on your situation. The second approach is a little heavier since there's a second process after initial duplicates are detected, but having the information separated can be useful later on.
0
 
LVL 109

Expert Comment

by:Ray Paseur
ID: 39918403
Some time back I made up some thought exercises about this subject.  Not all of this script will work, but I think you can see where the idea was going.

<?php // demo/similar_strings.php
error_reporting(E_ALL);
echo "<pre>" . PHP_EOL;


// SHOW SOME WAYS OF LOOKING AT IMPRECISELY MATCHED INFORMATION


// THINGS TO COMPARE
$strings = array
( array( '12345.1', '12345. 1'  )
, array( 'kitten',  'kitty'     )
, array( 'CATALOG', 'Catalog'   )
, array( 'cell',    'sell'      )
, array( 'super',   'souper'    )
, array( 'mi niña', 'mi nina'   )
, array( 'mi niña', 'mi ninia'  )
, array( 'ça va!',  'ca va!'    )
, array( 'Plaçe',   'Place'     )
, array( 'ça va!',  'sa va!'    )
, array( 'ca va!',  'sa va!'    )
, array( 'Yeehah',  'Yee-hah'   )
, array( 'toxic',   'poisonous' )
, array( 'glad',    'unglad'    )
, array( 'McLean',  'MacLean'   )
)
;

foreach ($strings as $string)
{
    $x = $string[0];
    $y = $string[1];
    echo PHP_EOL;
    echo "<b>COMPARING $x TO $y</b>";
    echo PHP_EOL;


    // http://php.net/manual/en/function.soundex.php
    echo PHP_EOL;
    echo 'SOUNDEX()';
    echo PHP_EOL;
    var_dump(soundex($x));
    var_dump(soundex($y));

    // http://php.net/manual/en/function.metaphone.php
    echo PHP_EOL;
    echo 'METAPHONE()';
    echo PHP_EOL;
    var_dump(metaphone($x));
    var_dump(metaphone($y));

    // http://php.net/manual/en/function.levenshtein.php
    echo PHP_EOL;
    echo 'LEVENSHTEIN()';
    echo PHP_EOL;
    var_dump(levenshtein($x, $y));

    // http://php.net/manual/en/function.similar-text.php
    echo PHP_EOL;
    echo 'SIMILAR_TEXT()';
    echo PHP_EOL;
    echo "COMMON CHARACTERS: " . similar_text($x, $y);
    echo PHP_EOL;
    similar_text($x, $y, &$pct);
    echo "PERCENT ALIKE: " . number_format($pct,1) . '%';

    echo PHP_EOL;
}


// WORK IN PROGRESS -- INCOMPLETE ON 2012-06-14
// A FUNCTION TO COMPUTE A CONSOLIDATED SCORE FOR SIMILARITY
function sameThing($x, $y, $normal=TRUE)
{
    // NORMALIZE = UPPER CASE LETTERS ONLY, SINGLE SPACES
    if ($normal)
    {
        $x = strtoupper($x);
        $y = strtoupper($y);
        $r = '/[^A-Z ]/';
        $x = preg_replace($r, ' ', $x);
        $y = preg_replace($r, ' ', $y);
        $r = '/\s\s+/';
        $x = preg_replace($r, ' ', $x);
        $y = preg_replace($r, ' ', $y);
        $x = trim($x);
        $y = trim($y);
    }

    $sxl = ( levenshtein(soundex($x),   soundex($y))   );
    $sxm = ( levenshtein(metaphone($x), metaphone($y)) );

    echo PHP_EOL . "$x vs $y Soundex Levenshtein=$sxl Metaphone Levenshtein=$sxm";
}

foreach ($strings as $string)
{
    sameThing($string[0], $string[1]);
}

Open in new window

HTH, ~Ray
0

Featured Post

PRTG Network Monitor: Intuitive Network Monitoring

Network Monitoring is essential to ensure that computer systems and network devices are running. Use PRTG to monitor LANs, servers, websites, applications and devices, bandwidth, virtual environments, remote systems, IoT, and many more. PRTG is easy to set up & use.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Preface This is the third article about the EE Collaborative Login Project. A Better Website Login System (http://www.experts-exchange.com/A_2902.html) introduces the Login System and shows how to implement a login page. The EE Collaborative Logi…
Styling your websites can become very complex. Here I'll show how SASS can help you better organize, maintain and reuse your CSS code.
Learn how to match and substitute tagged data using PHP regular expressions. Demonstrated on Windows 7, but also applies to other operating systems. Demonstrated technique applies to PHP (all versions) and Firefox, but very similar techniques will w…
The viewer will learn how to count occurrences of each item in an array.

856 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question