Solved

Best way to find duplicate message in database

Posted on 2014-03-10
7
37 Views
Last Modified: 2016-05-18
Hi,

I need to check if a similar message already exist in the database before inserting new record/message.

What's the fastest way to do so ?

also keep in mind I don't have to check exact same copy by similar record I mean two text which are over 90% identical.

Thanks in advance
0
Comment
Question by:anamarif
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
7 Comments
 
LVL 143

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 39917943
it really depends on what you (want to) consider as "duplicate/similar"
please clarify this, as that is the key to whatever solution you want to apply.
0
 

Author Comment

by:anamarif
ID: 39918019
What I consider similar depends on what's lightweight to server, I am not very strict here to go through all complex algorithims to detect similarities.
0
 
LVL 35

Accepted Solution

by:
gr8gonzo earned 500 total points
ID: 39918167
1. You shouldn't try to figure out an algorithm based on "what's lightweight" - you might find an algorithm that does something lightweight but it almost never works, so you just added another unnecessary straw to the camel's back and wasted coding time.

You should know upfront how similar two messages should be before they can be considered duplicates, and whether or not certain data can be excluded. For example, I could send two identical email messages to someone back to back and they would still be different because of the email headers (date/times, etc...).

2. You might also have two messages that are identical except for numbers ("Deposit $100000 into my account" and "Deposit $1" into my account"), so are those going to be duplicates or do numbers count?

3. When all you are analyzing is comprised of words, I typically approach it with a metaphone and a hash.

Metaphone is an algorithm that basically produces a code that represents what a string sounds like, which can help detect similarities between two strings, even if one is slightly different You could loop through a message and build a metaphone version of the message:

<?php

$msg1 = "Please deposit $10000 into my account.";
$msg2 = "Please deposit $1 into my account.";

echo metaphone($msg1) . "\n"; // Produces: PLSTPSTNTMKKNT
echo metaphone($msg2) . "\n"; // Produces: PLSTPSTNTMKKNT (identical)

?>

Open in new window


Of course, the metaphone could get REALLY long if you're processing a long message, so hashing it with SHA-1 will produce a 40-character string that -represents- the full metaphone:


<?php

$msg1 = "Please deposit $10000 into my account. Also, Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";

$msg2 = "Please deposit $1 into my account.. Also, lorem IPSUM dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";

echo sha1(metaphone($msg1)) . "\n"; // Produces: 8b49c819cbc1f4667fff40f758643f0d6b6e0f13
echo sha1(metaphone($msg2)) . "\n"; // Produces: 8b49c819cbc1f4667fff40f758643f0d6b6e0f13 (identical)

?>

Open in new window


You can store that hash in a separate column (indexed!!!) in the same record as the message in the database. Then, later, before you insert another message, just do a search against that column for the hash of your new message:

SELECT id FROM your_messages_table WHERE message_hash = '<SHA-1 hash of the new message's metaphone>';

If any IDs come back, you have a VERY high probability of having a duplicate message.
0
NEW Veeam Agent for Microsoft Windows

Backup and recover physical and cloud-based servers and workstations, as well as endpoint devices that belong to remote users. Avoid downtime and data loss quickly and easily for Windows-based physical or public cloud-based workloads!

 

Author Comment

by:anamarif
ID: 39918179
that's very nice, is it possible to include numbers as well,like in case of this :-

$msg1 = "Please deposit $10000 into my account.";
$msg2 = "Please deposit $1 into my account.";

these aren't considered similar due to different numbers, can we use metaphone approach with numbers as well?
0
 
LVL 35

Assisted Solution

by:gr8gonzo
gr8gonzo earned 500 total points
ID: 39918365
Sort of, but you also have to make sure you're not analyzing content that will ALWAYS be different (e.g. automatically-generated dates/times / timestamps) otherwise you will always have a different result, even if the actual message content is the same.

I would probably take one of two approaches:

APPROACH 1: LESS-FLEXIBLE, BUT SIMPLER
Convert all your numbers into words. This person created a function called "convert_number_to_words" for doing this:
http://www.karlrixon.co.uk/writing/convert-numbers-to-words-with-php/

Drop that into a script and then add this function:
function convert_all_numbers_to_words($str)
{
	return preg_replace_callback("/\d+/",function ($num) { return convert_number_to_words($num[0]); },$str);
}

Open in new window


Then you call the convert_all_numbers_to_words on the original string before you do the metaphone stuff. The result is that $10000 will change into "$ten thousand" and $1 will change into "$one". The resulting metaphone and hashes will be different:

<?php

$msg1 = "Please deposit $10000 into my account.";
$msg2 = "Please deposit $1 into my account.";

echo sha1(metaphone(convert_all_numbers_to_words($msg1))) . "\n"; // Produces: 890ba738be10fdc7e69192777b064ad2b685c030
echo sha1(metaphone(convert_all_numbers_to_words($msg2))) . "\n"; // Produces: 419f214dbedae8defccfa14dd1e067850ec924db (different)

function convert_all_numbers_to_words($str)
{
	return preg_replace_callback("/\d+/",function ($num) { return convert_number_to_words($num[0]); },$str);
}

function convert_number_to_words($number) {
    
    $hyphen      = '-';
    $conjunction = ' and ';
    $separator   = ', ';
    $negative    = 'negative ';
    $decimal     = ' point ';
    $dictionary  = array(
        0                   => 'zero',
        1                   => 'one',
        2                   => 'two',
        3                   => 'three',
        4                   => 'four',
        5                   => 'five',
        6                   => 'six',
        7                   => 'seven',
        8                   => 'eight',
        9                   => 'nine',
        10                  => 'ten',
        11                  => 'eleven',
        12                  => 'twelve',
        13                  => 'thirteen',
        14                  => 'fourteen',
        15                  => 'fifteen',
        16                  => 'sixteen',
        17                  => 'seventeen',
        18                  => 'eighteen',
        19                  => 'nineteen',
        20                  => 'twenty',
        30                  => 'thirty',
        40                  => 'fourty',
        50                  => 'fifty',
        60                  => 'sixty',
        70                  => 'seventy',
        80                  => 'eighty',
        90                  => 'ninety',
        100                 => 'hundred',
        1000                => 'thousand',
        1000000             => 'million',
        1000000000          => 'billion',
        1000000000000       => 'trillion',
        1000000000000000    => 'quadrillion',
        1000000000000000000 => 'quintillion'
    );
    
    if (!is_numeric($number)) {
        return false;
    }
    
    if (($number >= 0 && (int) $number < 0) || (int) $number < 0 - PHP_INT_MAX) {
        // overflow
        trigger_error(
            'convert_number_to_words only accepts numbers between -' . PHP_INT_MAX . ' and ' . PHP_INT_MAX,
            E_USER_WARNING
        );
        return false;
    }

    if ($number < 0) {
        return $negative . convert_number_to_words(abs($number));
    }
    
    $string = $fraction = null;
    
    if (strpos($number, '.') !== false) {
        list($number, $fraction) = explode('.', $number);
    }
    
    switch (true) {
        case $number < 21:
            $string = $dictionary[$number];
            break;
        case $number < 100:
            $tens   = ((int) ($number / 10)) * 10;
            $units  = $number % 10;
            $string = $dictionary[$tens];
            if ($units) {
                $string .= $hyphen . $dictionary[$units];
            }
            break;
        case $number < 1000:
            $hundreds  = $number / 100;
            $remainder = $number % 100;
            $string = $dictionary[$hundreds] . ' ' . $dictionary[100];
            if ($remainder) {
                $string .= $conjunction . convert_number_to_words($remainder);
            }
            break;
        default:
            $baseUnit = pow(1000, floor(log($number, 1000)));
            $numBaseUnits = (int) ($number / $baseUnit);
            $remainder = $number % $baseUnit;
            $string = convert_number_to_words($numBaseUnits) . ' ' . $dictionary[$baseUnit];
            if ($remainder) {
                $string .= $remainder < 100 ? $conjunction : $separator;
                $string .= convert_number_to_words($remainder);
            }
            break;
    }
    
    if (null !== $fraction && is_numeric($fraction)) {
        $string .= $decimal;
        $words = array();
        foreach (str_split((string) $fraction) as $number) {
            $words[] = $dictionary[$number];
        }
        $string .= implode(' ', $words);
    }
    
    return $string;
}

Open in new window


APPROACH 2: MORE FLEXIBLE, DEEPER PROCESSING
A Levenshtein distance is basically the number of changes that have to occur for one string to turn into another. So the Levenshtein difference between "apple" and "apples" is 1 (because one change is necessary - to add an "s"). By extracting all the numbers from a string and making them all into one giant string, you can compare the numbers to determine actual numeric similarity:

Please deposit $10000 and $20 into my account.
Please deposit $1000 and $50 into my account.

Based on this algorithm, the messages are identical, but the numeric values are only 75% similar:

10000,50
1000,20

...has a Levenshtein distance of 2 (divided by the length of the longest string, 8 = 2/8 = 1/4 = 25% different = 75% the same).

In some cases, it can help to identify the overall message WORD similarity first (so you can identify whether or not you're getting bulk messages that all follow the same pattern except for special characters and numbers and such), and then if you find any duplicates, pull those duplicates to check the numeric similarity (or even store the extracted numbers separately to save processing time).

For example:
<?php

$msg1 = "Please deposit $10000 and $20 into my account.";
$msg2 = "Please deposit $1000 and $50 into my account.";

echo sha1(metaphone($msg1)) . "\n"; // Produces: 518a8ac831932b655ca775ca14300df152d36ef7
echo sha1(metaphone($msg2)) . "\n"; // Produces: 518a8ac831932b655ca775ca14300df152d36ef7 (identical)

echo numeric_similarity($msg1,$msg2) . "% similar numeric values.\n"; // Produces: 75%  similar numeric values

function numeric_similarity($str1,$str2)
{
	$allNumbers1 = extract_all_numbers($str1);
	$allNumbers2 = extract_all_numbers($str2);
	$maxStrLen = max(strlen($allNumbers1),strlen($allNumbers2));
	return abs(1-(levenshtein($allNumbers1,$allNumbers2)/$maxStrLen))*100;
}

function extract_all_numbers($str)
{
	if(preg_match_all("/\d+/",$str,$matches))
	{
		return implode(",",$matches[0]);
	}
	return "";
}

Open in new window


Again, it depends on your situation. The second approach is a little heavier since there's a second process after initial duplicates are detected, but having the information separated can be useful later on.
0
 
LVL 110

Expert Comment

by:Ray Paseur
ID: 39918403
Some time back I made up some thought exercises about this subject.  Not all of this script will work, but I think you can see where the idea was going.

<?php // demo/similar_strings.php
error_reporting(E_ALL);
echo "<pre>" . PHP_EOL;


// SHOW SOME WAYS OF LOOKING AT IMPRECISELY MATCHED INFORMATION


// THINGS TO COMPARE
$strings = array
( array( '12345.1', '12345. 1'  )
, array( 'kitten',  'kitty'     )
, array( 'CATALOG', 'Catalog'   )
, array( 'cell',    'sell'      )
, array( 'super',   'souper'    )
, array( 'mi niña', 'mi nina'   )
, array( 'mi niña', 'mi ninia'  )
, array( 'ça va!',  'ca va!'    )
, array( 'Plaçe',   'Place'     )
, array( 'ça va!',  'sa va!'    )
, array( 'ca va!',  'sa va!'    )
, array( 'Yeehah',  'Yee-hah'   )
, array( 'toxic',   'poisonous' )
, array( 'glad',    'unglad'    )
, array( 'McLean',  'MacLean'   )
)
;

foreach ($strings as $string)
{
    $x = $string[0];
    $y = $string[1];
    echo PHP_EOL;
    echo "<b>COMPARING $x TO $y</b>";
    echo PHP_EOL;


    // http://php.net/manual/en/function.soundex.php
    echo PHP_EOL;
    echo 'SOUNDEX()';
    echo PHP_EOL;
    var_dump(soundex($x));
    var_dump(soundex($y));

    // http://php.net/manual/en/function.metaphone.php
    echo PHP_EOL;
    echo 'METAPHONE()';
    echo PHP_EOL;
    var_dump(metaphone($x));
    var_dump(metaphone($y));

    // http://php.net/manual/en/function.levenshtein.php
    echo PHP_EOL;
    echo 'LEVENSHTEIN()';
    echo PHP_EOL;
    var_dump(levenshtein($x, $y));

    // http://php.net/manual/en/function.similar-text.php
    echo PHP_EOL;
    echo 'SIMILAR_TEXT()';
    echo PHP_EOL;
    echo "COMMON CHARACTERS: " . similar_text($x, $y);
    echo PHP_EOL;
    similar_text($x, $y, &$pct);
    echo "PERCENT ALIKE: " . number_format($pct,1) . '%';

    echo PHP_EOL;
}


// WORK IN PROGRESS -- INCOMPLETE ON 2012-06-14
// A FUNCTION TO COMPUTE A CONSOLIDATED SCORE FOR SIMILARITY
function sameThing($x, $y, $normal=TRUE)
{
    // NORMALIZE = UPPER CASE LETTERS ONLY, SINGLE SPACES
    if ($normal)
    {
        $x = strtoupper($x);
        $y = strtoupper($y);
        $r = '/[^A-Z ]/';
        $x = preg_replace($r, ' ', $x);
        $y = preg_replace($r, ' ', $y);
        $r = '/\s\s+/';
        $x = preg_replace($r, ' ', $x);
        $y = preg_replace($r, ' ', $y);
        $x = trim($x);
        $y = trim($y);
    }

    $sxl = ( levenshtein(soundex($x),   soundex($y))   );
    $sxm = ( levenshtein(metaphone($x), metaphone($y)) );

    echo PHP_EOL . "$x vs $y Soundex Levenshtein=$sxl Metaphone Levenshtein=$sxm";
}

foreach ($strings as $string)
{
    sameThing($string[0], $string[1]);
}

Open in new window

HTH, ~Ray
0

Featured Post

Migrating Your Company's PCs

To keep pace with competitors, businesses must keep employees productive, and that means providing them with the latest technology. This document provides the tips and tricks you need to help you migrate an outdated PC fleet to new desktops, laptops, and tablets.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Creating and Managing Databases with phpMyAdmin in cPanel.
Nothing in an HTTP request can be trusted, including HTTP headers and form data.  A form token is a tool that can be used to guard against request forgeries (CSRF).  This article shows an improved approach to form tokens, making it more difficult to…
The viewer will learn how to look for a specific file type in a local or remote server directory using PHP.
The viewer will the learn the benefit of plain text editors and code an HTML5 based template for use in further tutorials.

735 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question