[2 days left] What’s wrong with your cloud strategy? Learn why multicloud solutions matter with Nimble Storage.Register Now


Best way to find duplicate message in database

Posted on 2014-03-10
Medium Priority
Last Modified: 2016-05-18

I need to check if a similar message already exist in the database before inserting new record/message.

What's the fastest way to do so ?

also keep in mind I don't have to check exact same copy by similar record I mean two text which are over 90% identical.

Thanks in advance
Question by:anamarif
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
LVL 143

Expert Comment

by:Guy Hengel [angelIII / a3]
ID: 39917943
it really depends on what you (want to) consider as "duplicate/similar"
please clarify this, as that is the key to whatever solution you want to apply.

Author Comment

ID: 39918019
What I consider similar depends on what's lightweight to server, I am not very strict here to go through all complex algorithims to detect similarities.
LVL 35

Accepted Solution

gr8gonzo earned 2000 total points
ID: 39918167
1. You shouldn't try to figure out an algorithm based on "what's lightweight" - you might find an algorithm that does something lightweight but it almost never works, so you just added another unnecessary straw to the camel's back and wasted coding time.

You should know upfront how similar two messages should be before they can be considered duplicates, and whether or not certain data can be excluded. For example, I could send two identical email messages to someone back to back and they would still be different because of the email headers (date/times, etc...).

2. You might also have two messages that are identical except for numbers ("Deposit $100000 into my account" and "Deposit $1" into my account"), so are those going to be duplicates or do numbers count?

3. When all you are analyzing is comprised of words, I typically approach it with a metaphone and a hash.

Metaphone is an algorithm that basically produces a code that represents what a string sounds like, which can help detect similarities between two strings, even if one is slightly different You could loop through a message and build a metaphone version of the message:


$msg1 = "Please deposit $10000 into my account.";
$msg2 = "Please deposit $1 into my account.";

echo metaphone($msg1) . "\n"; // Produces: PLSTPSTNTMKKNT
echo metaphone($msg2) . "\n"; // Produces: PLSTPSTNTMKKNT (identical)


Open in new window

Of course, the metaphone could get REALLY long if you're processing a long message, so hashing it with SHA-1 will produce a 40-character string that -represents- the full metaphone:


$msg1 = "Please deposit $10000 into my account. Also, Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";

$msg2 = "Please deposit $1 into my account.. Also, lorem IPSUM dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";

echo sha1(metaphone($msg1)) . "\n"; // Produces: 8b49c819cbc1f4667fff40f758643f0d6b6e0f13
echo sha1(metaphone($msg2)) . "\n"; // Produces: 8b49c819cbc1f4667fff40f758643f0d6b6e0f13 (identical)


Open in new window

You can store that hash in a separate column (indexed!!!) in the same record as the message in the database. Then, later, before you insert another message, just do a search against that column for the hash of your new message:

SELECT id FROM your_messages_table WHERE message_hash = '<SHA-1 hash of the new message's metaphone>';

If any IDs come back, you have a VERY high probability of having a duplicate message.
Learn how to optimize MySQL for your business need

With the increasing importance of apps & networks in both business & personal interconnections, perfor. has become one of the key metrics of successful communication. This ebook is a hands-on business-case-driven guide to understanding MySQL query parameter tuning & database perf


Author Comment

ID: 39918179
that's very nice, is it possible to include numbers as well,like in case of this :-

$msg1 = "Please deposit $10000 into my account.";
$msg2 = "Please deposit $1 into my account.";

these aren't considered similar due to different numbers, can we use metaphone approach with numbers as well?
LVL 35

Assisted Solution

gr8gonzo earned 2000 total points
ID: 39918365
Sort of, but you also have to make sure you're not analyzing content that will ALWAYS be different (e.g. automatically-generated dates/times / timestamps) otherwise you will always have a different result, even if the actual message content is the same.

I would probably take one of two approaches:

Convert all your numbers into words. This person created a function called "convert_number_to_words" for doing this:

Drop that into a script and then add this function:
function convert_all_numbers_to_words($str)
	return preg_replace_callback("/\d+/",function ($num) { return convert_number_to_words($num[0]); },$str);

Open in new window

Then you call the convert_all_numbers_to_words on the original string before you do the metaphone stuff. The result is that $10000 will change into "$ten thousand" and $1 will change into "$one". The resulting metaphone and hashes will be different:


$msg1 = "Please deposit $10000 into my account.";
$msg2 = "Please deposit $1 into my account.";

echo sha1(metaphone(convert_all_numbers_to_words($msg1))) . "\n"; // Produces: 890ba738be10fdc7e69192777b064ad2b685c030
echo sha1(metaphone(convert_all_numbers_to_words($msg2))) . "\n"; // Produces: 419f214dbedae8defccfa14dd1e067850ec924db (different)

function convert_all_numbers_to_words($str)
	return preg_replace_callback("/\d+/",function ($num) { return convert_number_to_words($num[0]); },$str);

function convert_number_to_words($number) {
    $hyphen      = '-';
    $conjunction = ' and ';
    $separator   = ', ';
    $negative    = 'negative ';
    $decimal     = ' point ';
    $dictionary  = array(
        0                   => 'zero',
        1                   => 'one',
        2                   => 'two',
        3                   => 'three',
        4                   => 'four',
        5                   => 'five',
        6                   => 'six',
        7                   => 'seven',
        8                   => 'eight',
        9                   => 'nine',
        10                  => 'ten',
        11                  => 'eleven',
        12                  => 'twelve',
        13                  => 'thirteen',
        14                  => 'fourteen',
        15                  => 'fifteen',
        16                  => 'sixteen',
        17                  => 'seventeen',
        18                  => 'eighteen',
        19                  => 'nineteen',
        20                  => 'twenty',
        30                  => 'thirty',
        40                  => 'fourty',
        50                  => 'fifty',
        60                  => 'sixty',
        70                  => 'seventy',
        80                  => 'eighty',
        90                  => 'ninety',
        100                 => 'hundred',
        1000                => 'thousand',
        1000000             => 'million',
        1000000000          => 'billion',
        1000000000000       => 'trillion',
        1000000000000000    => 'quadrillion',
        1000000000000000000 => 'quintillion'
    if (!is_numeric($number)) {
        return false;
    if (($number >= 0 && (int) $number < 0) || (int) $number < 0 - PHP_INT_MAX) {
        // overflow
            'convert_number_to_words only accepts numbers between -' . PHP_INT_MAX . ' and ' . PHP_INT_MAX,
        return false;

    if ($number < 0) {
        return $negative . convert_number_to_words(abs($number));
    $string = $fraction = null;
    if (strpos($number, '.') !== false) {
        list($number, $fraction) = explode('.', $number);
    switch (true) {
        case $number < 21:
            $string = $dictionary[$number];
        case $number < 100:
            $tens   = ((int) ($number / 10)) * 10;
            $units  = $number % 10;
            $string = $dictionary[$tens];
            if ($units) {
                $string .= $hyphen . $dictionary[$units];
        case $number < 1000:
            $hundreds  = $number / 100;
            $remainder = $number % 100;
            $string = $dictionary[$hundreds] . ' ' . $dictionary[100];
            if ($remainder) {
                $string .= $conjunction . convert_number_to_words($remainder);
            $baseUnit = pow(1000, floor(log($number, 1000)));
            $numBaseUnits = (int) ($number / $baseUnit);
            $remainder = $number % $baseUnit;
            $string = convert_number_to_words($numBaseUnits) . ' ' . $dictionary[$baseUnit];
            if ($remainder) {
                $string .= $remainder < 100 ? $conjunction : $separator;
                $string .= convert_number_to_words($remainder);
    if (null !== $fraction && is_numeric($fraction)) {
        $string .= $decimal;
        $words = array();
        foreach (str_split((string) $fraction) as $number) {
            $words[] = $dictionary[$number];
        $string .= implode(' ', $words);
    return $string;

Open in new window

A Levenshtein distance is basically the number of changes that have to occur for one string to turn into another. So the Levenshtein difference between "apple" and "apples" is 1 (because one change is necessary - to add an "s"). By extracting all the numbers from a string and making them all into one giant string, you can compare the numbers to determine actual numeric similarity:

Please deposit $10000 and $20 into my account.
Please deposit $1000 and $50 into my account.

Based on this algorithm, the messages are identical, but the numeric values are only 75% similar:


...has a Levenshtein distance of 2 (divided by the length of the longest string, 8 = 2/8 = 1/4 = 25% different = 75% the same).

In some cases, it can help to identify the overall message WORD similarity first (so you can identify whether or not you're getting bulk messages that all follow the same pattern except for special characters and numbers and such), and then if you find any duplicates, pull those duplicates to check the numeric similarity (or even store the extracted numbers separately to save processing time).

For example:

$msg1 = "Please deposit $10000 and $20 into my account.";
$msg2 = "Please deposit $1000 and $50 into my account.";

echo sha1(metaphone($msg1)) . "\n"; // Produces: 518a8ac831932b655ca775ca14300df152d36ef7
echo sha1(metaphone($msg2)) . "\n"; // Produces: 518a8ac831932b655ca775ca14300df152d36ef7 (identical)

echo numeric_similarity($msg1,$msg2) . "% similar numeric values.\n"; // Produces: 75%  similar numeric values

function numeric_similarity($str1,$str2)
	$allNumbers1 = extract_all_numbers($str1);
	$allNumbers2 = extract_all_numbers($str2);
	$maxStrLen = max(strlen($allNumbers1),strlen($allNumbers2));
	return abs(1-(levenshtein($allNumbers1,$allNumbers2)/$maxStrLen))*100;

function extract_all_numbers($str)
		return implode(",",$matches[0]);
	return "";

Open in new window

Again, it depends on your situation. The second approach is a little heavier since there's a second process after initial duplicates are detected, but having the information separated can be useful later on.
LVL 111

Expert Comment

by:Ray Paseur
ID: 39918403
Some time back I made up some thought exercises about this subject.  Not all of this script will work, but I think you can see where the idea was going.

<?php // demo/similar_strings.php
echo "<pre>" . PHP_EOL;


$strings = array
( array( '12345.1', '12345. 1'  )
, array( 'kitten',  'kitty'     )
, array( 'CATALOG', 'Catalog'   )
, array( 'cell',    'sell'      )
, array( 'super',   'souper'    )
, array( 'mi niña', 'mi nina'   )
, array( 'mi niña', 'mi ninia'  )
, array( 'ça va!',  'ca va!'    )
, array( 'Plaçe',   'Place'     )
, array( 'ça va!',  'sa va!'    )
, array( 'ca va!',  'sa va!'    )
, array( 'Yeehah',  'Yee-hah'   )
, array( 'toxic',   'poisonous' )
, array( 'glad',    'unglad'    )
, array( 'McLean',  'MacLean'   )

foreach ($strings as $string)
    $x = $string[0];
    $y = $string[1];
    echo PHP_EOL;
    echo "<b>COMPARING $x TO $y</b>";
    echo PHP_EOL;

    // http://php.net/manual/en/function.soundex.php
    echo PHP_EOL;
    echo 'SOUNDEX()';
    echo PHP_EOL;

    // http://php.net/manual/en/function.metaphone.php
    echo PHP_EOL;
    echo 'METAPHONE()';
    echo PHP_EOL;

    // http://php.net/manual/en/function.levenshtein.php
    echo PHP_EOL;
    echo 'LEVENSHTEIN()';
    echo PHP_EOL;
    var_dump(levenshtein($x, $y));

    // http://php.net/manual/en/function.similar-text.php
    echo PHP_EOL;
    echo 'SIMILAR_TEXT()';
    echo PHP_EOL;
    echo "COMMON CHARACTERS: " . similar_text($x, $y);
    echo PHP_EOL;
    similar_text($x, $y, &$pct);
    echo "PERCENT ALIKE: " . number_format($pct,1) . '%';

    echo PHP_EOL;

function sameThing($x, $y, $normal=TRUE)
    if ($normal)
        $x = strtoupper($x);
        $y = strtoupper($y);
        $r = '/[^A-Z ]/';
        $x = preg_replace($r, ' ', $x);
        $y = preg_replace($r, ' ', $y);
        $r = '/\s\s+/';
        $x = preg_replace($r, ' ', $x);
        $y = preg_replace($r, ' ', $y);
        $x = trim($x);
        $y = trim($y);

    $sxl = ( levenshtein(soundex($x),   soundex($y))   );
    $sxm = ( levenshtein(metaphone($x), metaphone($y)) );

    echo PHP_EOL . "$x vs $y Soundex Levenshtein=$sxl Metaphone Levenshtein=$sxm";

foreach ($strings as $string)
    sameThing($string[0], $string[1]);

Open in new window

HTH, ~Ray

Featured Post

Moving data to the cloud? Find out if you’re ready

Before moving to the cloud, it is important to carefully define your db needs, plan for the migration & understand prod. environment. This wp explains how to define what you need from a cloud provider, plan for the migration & what putting a cloud solution into practice entails.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

In this blog, we’ll look at how improvements to Percona XtraDB Cluster improved IST performance.
Without even knowing it, most of us are using web applications on a daily basis.  In fact, Gmail and Yahoo email, Twitter, Facebook, and eBay are used by most of us daily—and they are web applications. We generally confuse these web applications to…
Viewers will learn about the regular for loop in Java and how to use it. Definition: Break the for loop down into 3 parts: Syntax when using for loops: Example using a for loop:
Learn how to match and substitute tagged data using PHP regular expressions. Demonstrated on Windows 7, but also applies to other operating systems. Demonstrated technique applies to PHP (all versions) and Firefox, but very similar techniques will w…
Suggested Courses

656 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question