Solved

Best way to find duplicate message in database

Posted on 2014-03-10
7
27 Views
Last Modified: 2016-05-18
Hi,

I need to check if a similar message already exist in the database before inserting new record/message.

What's the fastest way to do so ?

also keep in mind I don't have to check exact same copy by similar record I mean two text which are over 90% identical.

Thanks in advance
0
Comment
Question by:anamarif
7 Comments
 
LVL 142

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 39917943
it really depends on what you (want to) consider as "duplicate/similar"
please clarify this, as that is the key to whatever solution you want to apply.
0
 

Author Comment

by:anamarif
ID: 39918019
What I consider similar depends on what's lightweight to server, I am not very strict here to go through all complex algorithims to detect similarities.
0
 
LVL 34

Accepted Solution

by:
gr8gonzo earned 500 total points
ID: 39918167
1. You shouldn't try to figure out an algorithm based on "what's lightweight" - you might find an algorithm that does something lightweight but it almost never works, so you just added another unnecessary straw to the camel's back and wasted coding time.

You should know upfront how similar two messages should be before they can be considered duplicates, and whether or not certain data can be excluded. For example, I could send two identical email messages to someone back to back and they would still be different because of the email headers (date/times, etc...).

2. You might also have two messages that are identical except for numbers ("Deposit $100000 into my account" and "Deposit $1" into my account"), so are those going to be duplicates or do numbers count?

3. When all you are analyzing is comprised of words, I typically approach it with a metaphone and a hash.

Metaphone is an algorithm that basically produces a code that represents what a string sounds like, which can help detect similarities between two strings, even if one is slightly different You could loop through a message and build a metaphone version of the message:

<?php

$msg1 = "Please deposit $10000 into my account.";
$msg2 = "Please deposit $1 into my account.";

echo metaphone($msg1) . "\n"; // Produces: PLSTPSTNTMKKNT
echo metaphone($msg2) . "\n"; // Produces: PLSTPSTNTMKKNT (identical)

?>

Open in new window


Of course, the metaphone could get REALLY long if you're processing a long message, so hashing it with SHA-1 will produce a 40-character string that -represents- the full metaphone:


<?php

$msg1 = "Please deposit $10000 into my account. Also, Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";

$msg2 = "Please deposit $1 into my account.. Also, lorem IPSUM dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";

echo sha1(metaphone($msg1)) . "\n"; // Produces: 8b49c819cbc1f4667fff40f758643f0d6b6e0f13
echo sha1(metaphone($msg2)) . "\n"; // Produces: 8b49c819cbc1f4667fff40f758643f0d6b6e0f13 (identical)

?>

Open in new window


You can store that hash in a separate column (indexed!!!) in the same record as the message in the database. Then, later, before you insert another message, just do a search against that column for the hash of your new message:

SELECT id FROM your_messages_table WHERE message_hash = '<SHA-1 hash of the new message's metaphone>';

If any IDs come back, you have a VERY high probability of having a duplicate message.
0
Maximize Your Threat Intelligence Reporting

Reporting is one of the most important and least talked about aspects of a world-class threat intelligence program. Here’s how to do it right.

 

Author Comment

by:anamarif
ID: 39918179
that's very nice, is it possible to include numbers as well,like in case of this :-

$msg1 = "Please deposit $10000 into my account.";
$msg2 = "Please deposit $1 into my account.";

these aren't considered similar due to different numbers, can we use metaphone approach with numbers as well?
0
 
LVL 34

Assisted Solution

by:gr8gonzo
gr8gonzo earned 500 total points
ID: 39918365
Sort of, but you also have to make sure you're not analyzing content that will ALWAYS be different (e.g. automatically-generated dates/times / timestamps) otherwise you will always have a different result, even if the actual message content is the same.

I would probably take one of two approaches:

APPROACH 1: LESS-FLEXIBLE, BUT SIMPLER
Convert all your numbers into words. This person created a function called "convert_number_to_words" for doing this:
http://www.karlrixon.co.uk/writing/convert-numbers-to-words-with-php/

Drop that into a script and then add this function:
function convert_all_numbers_to_words($str)
{
	return preg_replace_callback("/\d+/",function ($num) { return convert_number_to_words($num[0]); },$str);
}

Open in new window


Then you call the convert_all_numbers_to_words on the original string before you do the metaphone stuff. The result is that $10000 will change into "$ten thousand" and $1 will change into "$one". The resulting metaphone and hashes will be different:

<?php

$msg1 = "Please deposit $10000 into my account.";
$msg2 = "Please deposit $1 into my account.";

echo sha1(metaphone(convert_all_numbers_to_words($msg1))) . "\n"; // Produces: 890ba738be10fdc7e69192777b064ad2b685c030
echo sha1(metaphone(convert_all_numbers_to_words($msg2))) . "\n"; // Produces: 419f214dbedae8defccfa14dd1e067850ec924db (different)

function convert_all_numbers_to_words($str)
{
	return preg_replace_callback("/\d+/",function ($num) { return convert_number_to_words($num[0]); },$str);
}

function convert_number_to_words($number) {
    
    $hyphen      = '-';
    $conjunction = ' and ';
    $separator   = ', ';
    $negative    = 'negative ';
    $decimal     = ' point ';
    $dictionary  = array(
        0                   => 'zero',
        1                   => 'one',
        2                   => 'two',
        3                   => 'three',
        4                   => 'four',
        5                   => 'five',
        6                   => 'six',
        7                   => 'seven',
        8                   => 'eight',
        9                   => 'nine',
        10                  => 'ten',
        11                  => 'eleven',
        12                  => 'twelve',
        13                  => 'thirteen',
        14                  => 'fourteen',
        15                  => 'fifteen',
        16                  => 'sixteen',
        17                  => 'seventeen',
        18                  => 'eighteen',
        19                  => 'nineteen',
        20                  => 'twenty',
        30                  => 'thirty',
        40                  => 'fourty',
        50                  => 'fifty',
        60                  => 'sixty',
        70                  => 'seventy',
        80                  => 'eighty',
        90                  => 'ninety',
        100                 => 'hundred',
        1000                => 'thousand',
        1000000             => 'million',
        1000000000          => 'billion',
        1000000000000       => 'trillion',
        1000000000000000    => 'quadrillion',
        1000000000000000000 => 'quintillion'
    );
    
    if (!is_numeric($number)) {
        return false;
    }
    
    if (($number >= 0 && (int) $number < 0) || (int) $number < 0 - PHP_INT_MAX) {
        // overflow
        trigger_error(
            'convert_number_to_words only accepts numbers between -' . PHP_INT_MAX . ' and ' . PHP_INT_MAX,
            E_USER_WARNING
        );
        return false;
    }

    if ($number < 0) {
        return $negative . convert_number_to_words(abs($number));
    }
    
    $string = $fraction = null;
    
    if (strpos($number, '.') !== false) {
        list($number, $fraction) = explode('.', $number);
    }
    
    switch (true) {
        case $number < 21:
            $string = $dictionary[$number];
            break;
        case $number < 100:
            $tens   = ((int) ($number / 10)) * 10;
            $units  = $number % 10;
            $string = $dictionary[$tens];
            if ($units) {
                $string .= $hyphen . $dictionary[$units];
            }
            break;
        case $number < 1000:
            $hundreds  = $number / 100;
            $remainder = $number % 100;
            $string = $dictionary[$hundreds] . ' ' . $dictionary[100];
            if ($remainder) {
                $string .= $conjunction . convert_number_to_words($remainder);
            }
            break;
        default:
            $baseUnit = pow(1000, floor(log($number, 1000)));
            $numBaseUnits = (int) ($number / $baseUnit);
            $remainder = $number % $baseUnit;
            $string = convert_number_to_words($numBaseUnits) . ' ' . $dictionary[$baseUnit];
            if ($remainder) {
                $string .= $remainder < 100 ? $conjunction : $separator;
                $string .= convert_number_to_words($remainder);
            }
            break;
    }
    
    if (null !== $fraction && is_numeric($fraction)) {
        $string .= $decimal;
        $words = array();
        foreach (str_split((string) $fraction) as $number) {
            $words[] = $dictionary[$number];
        }
        $string .= implode(' ', $words);
    }
    
    return $string;
}

Open in new window


APPROACH 2: MORE FLEXIBLE, DEEPER PROCESSING
A Levenshtein distance is basically the number of changes that have to occur for one string to turn into another. So the Levenshtein difference between "apple" and "apples" is 1 (because one change is necessary - to add an "s"). By extracting all the numbers from a string and making them all into one giant string, you can compare the numbers to determine actual numeric similarity:

Please deposit $10000 and $20 into my account.
Please deposit $1000 and $50 into my account.

Based on this algorithm, the messages are identical, but the numeric values are only 75% similar:

10000,50
1000,20

...has a Levenshtein distance of 2 (divided by the length of the longest string, 8 = 2/8 = 1/4 = 25% different = 75% the same).

In some cases, it can help to identify the overall message WORD similarity first (so you can identify whether or not you're getting bulk messages that all follow the same pattern except for special characters and numbers and such), and then if you find any duplicates, pull those duplicates to check the numeric similarity (or even store the extracted numbers separately to save processing time).

For example:
<?php

$msg1 = "Please deposit $10000 and $20 into my account.";
$msg2 = "Please deposit $1000 and $50 into my account.";

echo sha1(metaphone($msg1)) . "\n"; // Produces: 518a8ac831932b655ca775ca14300df152d36ef7
echo sha1(metaphone($msg2)) . "\n"; // Produces: 518a8ac831932b655ca775ca14300df152d36ef7 (identical)

echo numeric_similarity($msg1,$msg2) . "% similar numeric values.\n"; // Produces: 75%  similar numeric values

function numeric_similarity($str1,$str2)
{
	$allNumbers1 = extract_all_numbers($str1);
	$allNumbers2 = extract_all_numbers($str2);
	$maxStrLen = max(strlen($allNumbers1),strlen($allNumbers2));
	return abs(1-(levenshtein($allNumbers1,$allNumbers2)/$maxStrLen))*100;
}

function extract_all_numbers($str)
{
	if(preg_match_all("/\d+/",$str,$matches))
	{
		return implode(",",$matches[0]);
	}
	return "";
}

Open in new window


Again, it depends on your situation. The second approach is a little heavier since there's a second process after initial duplicates are detected, but having the information separated can be useful later on.
0
 
LVL 108

Expert Comment

by:Ray Paseur
ID: 39918403
Some time back I made up some thought exercises about this subject.  Not all of this script will work, but I think you can see where the idea was going.

<?php // demo/similar_strings.php
error_reporting(E_ALL);
echo "<pre>" . PHP_EOL;


// SHOW SOME WAYS OF LOOKING AT IMPRECISELY MATCHED INFORMATION


// THINGS TO COMPARE
$strings = array
( array( '12345.1', '12345. 1'  )
, array( 'kitten',  'kitty'     )
, array( 'CATALOG', 'Catalog'   )
, array( 'cell',    'sell'      )
, array( 'super',   'souper'    )
, array( 'mi niña', 'mi nina'   )
, array( 'mi niña', 'mi ninia'  )
, array( 'ça va!',  'ca va!'    )
, array( 'Plaçe',   'Place'     )
, array( 'ça va!',  'sa va!'    )
, array( 'ca va!',  'sa va!'    )
, array( 'Yeehah',  'Yee-hah'   )
, array( 'toxic',   'poisonous' )
, array( 'glad',    'unglad'    )
, array( 'McLean',  'MacLean'   )
)
;

foreach ($strings as $string)
{
    $x = $string[0];
    $y = $string[1];
    echo PHP_EOL;
    echo "<b>COMPARING $x TO $y</b>";
    echo PHP_EOL;


    // http://php.net/manual/en/function.soundex.php
    echo PHP_EOL;
    echo 'SOUNDEX()';
    echo PHP_EOL;
    var_dump(soundex($x));
    var_dump(soundex($y));

    // http://php.net/manual/en/function.metaphone.php
    echo PHP_EOL;
    echo 'METAPHONE()';
    echo PHP_EOL;
    var_dump(metaphone($x));
    var_dump(metaphone($y));

    // http://php.net/manual/en/function.levenshtein.php
    echo PHP_EOL;
    echo 'LEVENSHTEIN()';
    echo PHP_EOL;
    var_dump(levenshtein($x, $y));

    // http://php.net/manual/en/function.similar-text.php
    echo PHP_EOL;
    echo 'SIMILAR_TEXT()';
    echo PHP_EOL;
    echo "COMMON CHARACTERS: " . similar_text($x, $y);
    echo PHP_EOL;
    similar_text($x, $y, &$pct);
    echo "PERCENT ALIKE: " . number_format($pct,1) . '%';

    echo PHP_EOL;
}


// WORK IN PROGRESS -- INCOMPLETE ON 2012-06-14
// A FUNCTION TO COMPUTE A CONSOLIDATED SCORE FOR SIMILARITY
function sameThing($x, $y, $normal=TRUE)
{
    // NORMALIZE = UPPER CASE LETTERS ONLY, SINGLE SPACES
    if ($normal)
    {
        $x = strtoupper($x);
        $y = strtoupper($y);
        $r = '/[^A-Z ]/';
        $x = preg_replace($r, ' ', $x);
        $y = preg_replace($r, ' ', $y);
        $r = '/\s\s+/';
        $x = preg_replace($r, ' ', $x);
        $y = preg_replace($r, ' ', $y);
        $x = trim($x);
        $y = trim($y);
    }

    $sxl = ( levenshtein(soundex($x),   soundex($y))   );
    $sxm = ( levenshtein(metaphone($x), metaphone($y)) );

    echo PHP_EOL . "$x vs $y Soundex Levenshtein=$sxl Metaphone Levenshtein=$sxm";
}

foreach ($strings as $string)
{
    sameThing($string[0], $string[1]);
}

Open in new window

HTH, ~Ray
0

Featured Post

Highfive Gives IT Their Time Back

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

Join & Write a Comment

I found this questions asking how to do this in many different forums, so I will describe here how to implement a solution using PHP and AJAX. The logical flow for the problem should be: Write an event handler for the first drop down box to get …
Nothing in an HTTP request can be trusted, including HTTP headers and form data.  A form token is a tool that can be used to guard against request forgeries (CSRF).  This article shows an improved approach to form tokens, making it more difficult to…
Viewers will learn about arithmetic and Boolean expressions in Java and the logical operators used to create Boolean expressions. We will cover the symbols used for arithmetic expressions and define each logical operator and how to use them in Boole…
HTML5 has deprecated a few of the older ways of showing media as well as offering up a new way to create games and animations. Audio, video, and canvas are just a few of the adjustments made between XHTML and HTML5. As we learned in our last micr…

744 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

11 Experts available now in Live!

Get 1:1 Help Now