Prevent Duplicate Articles


I am working on a project that involves storing articles (50 to 1500 words) in a MySQL database. Is setting a unique index on the column (a TEXT field) the best solution? I currently have around 80,000 rows in the table, with more to come.

I would perhaps rather consider using a hash. You could generate an MD5 hash of each article and index that instead. The MD5 algorithm is generally considered good enough for creating unique hashes for your case. You could create a persisted (stored) generated column that applies the MD5 function, and index that.
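As a sketch of that idea (the table and column names `articles` / `body` are hypothetical, and this assumes MySQL 5.7+ for generated-column support):

```sql
-- Hypothetical schema: add a stored MD5 fingerprint of the article body
-- and enforce uniqueness on it instead of on the TEXT column itself.
ALTER TABLE articles
  ADD COLUMN body_md5 CHAR(32)
    GENERATED ALWAYS AS (MD5(body)) STORED,
  ADD UNIQUE INDEX idx_body_md5 (body_md5);
```

On older MySQL versions without generated columns, the same effect can be had by computing the hash in application code and storing it in a plain, uniquely indexed CHAR(32) column.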

That said, it is worth mentioning that whether you use an index on the TEXT column (MySQL requires a prefix length to index a TEXT column, so a full unique index is not straightforward there anyway) or on the MD5 column, both schemes fall flat on their faces if someone makes even a subtle change to the article. You could have a 1500-word article, and if I just add a space character to the end of it, neither the MD5 nor the index on TEXT would detect that it is in fact a duplicate. For that, a more complex approach is required. This scheme will only detect EXACT duplicates; there is no scoring to find articles that are, say, 90% the same.
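To at least neutralize trivial variants like the trailing-space example above, one option is to normalize the text before hashing. A minimal sketch (the helper name is my own invention):

```php
<?php
// Hypothetical helper: canonicalize an article before hashing so that
// trivial differences (case, extra whitespace) do not defeat the MD5 check.
function article_fingerprint($text)
{
    $t = strtolower($text);               // case-insensitive comparison
    $t = preg_replace('/\s+/u', ' ', $t); // collapse runs of whitespace
    $t = trim($t);                        // drop leading/trailing space
    return md5($t);
}

// A trailing space no longer produces a different hash:
var_dump(article_fingerprint("A 1500 word article...") ===
         article_fingerprint("a 1500 word  article... "));  // bool(true)
```

This still only catches exact duplicates after normalization; a reworded copy slips through just the same.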

These articles might provide extra leads on how one might also solve the similar-articles issue. Some of them are SQL Server (T-SQL)... but they give ideas about algorithm names (Jaro-Winkler, Jaccard, Levenshtein, etc.) and approaches used elsewhere. Distill what you need from them.
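As a taste of the named measures, Jaccard similarity over word sets is easy to sketch in PHP (this helper is illustrative, not taken from any of those articles):

```php
<?php
// Jaccard similarity of two texts, treated as sets of lowercased words:
// |intersection| / |union| -- 1.0 for identical word sets, 0.0 for disjoint.
function jaccard($a, $b)
{
    $wa = array_unique(str_word_count(strtolower($a), 1));
    $wb = array_unique(str_word_count(strtolower($b), 1));
    $intersect = count(array_intersect($wa, $wb));
    $union     = count(array_unique(array_merge($wa, $wb)));
    return $union ? $intersect / $union : 1;
}

echo jaccard('the quick brown fox', 'the quick red fox');  // 0.6 (3 shared / 5 total)
```

A threshold on this score (or on Levenshtein distance, which PHP also ships built in) gives a crude "too similar" test, though comparing a new post against 80,000 stored ones this way is far too slow without a cheap pre-filter.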
Ray Paseur commented:
PHP has some functions that might be helpful. Here is a script that I started a few years ago. It shows some ways of looking at similarities and differences in content.

There are also plagiarism detectors that might identify unwanted similarities.
<?php // demo/similar_strings.php
/**
 * Show some ways of looking at imprecisely matched strings
 */
echo "<pre>" . PHP_EOL;

$strings = array
( array( '12345.1', '12345. 1'  )
, array( 'kitten',  'kitty'     )
, array( 'CATALOG', 'Catalog'   )
, array( 'cell',    'sell'      )
, array( 'super',   'souper'    )
, array( 'mi niña', 'mi nina'   )
, array( 'mi niña', 'mi ninia'  )
, array( 'ça va!',  'ca va!'    )
, array( 'Plaçe',   'Place'     )
, array( 'ça va!',  'sa va!'    )
, array( 'ca va!',  'sa va!'    )
, array( 'Yeehah',  'Yee-hah'   )
, array( 'toxic',   'poisonous' )
, array( 'glad',    'unglad'    )
, array( 'McLean',  'MacLean'   )
);

foreach ($strings as $string)
{
    $x = $string[0];
    $y = $string[1];
    echo PHP_EOL;
    echo "<b>COMPARING $x TO $y</b>";
    echo PHP_EOL;

    echo PHP_EOL;
    echo 'SOUNDEX()';
    echo PHP_EOL;
    var_dump(soundex($x), soundex($y));

    echo PHP_EOL;
    echo 'METAPHONE()';
    echo PHP_EOL;
    var_dump(metaphone($x), metaphone($y));

    echo PHP_EOL;
    echo 'LEVENSHTEIN()';
    echo PHP_EOL;
    var_dump(levenshtein($x, $y));

    echo PHP_EOL;
    echo 'SIMILAR_TEXT()';
    echo PHP_EOL;
    echo "COMMON CHARACTERS: " . similar_text($x, $y);
    echo PHP_EOL;

    // FAILS: similar_text($x, $y, &$pct); -- call-time pass-by-reference is no longer allowed
    similar_text($x, $y, $pct);
    echo "PERCENT ALIKE: " . number_format($pct,1) . '%';

    echo PHP_EOL;
}

// **************** BELOW THIS LINE: WORK IN PROGRESS -- INCOMPLETE ON 2012-06-14
function sameThing($x, $y, $normal=TRUE)
{
    // Optionally normalize: uppercase, strip non-letters, collapse whitespace
    if ($normal)
    {
        $x = strtoupper($x);
        $y = strtoupper($y);
        $r = '/[^A-Z ]/';
        $x = preg_replace($r, ' ', $x);
        $y = preg_replace($r, ' ', $y);
        $r = '/\s\s+/';
        $x = preg_replace($r, ' ', $x);
        $y = preg_replace($r, ' ', $y);
        $x = trim($x);
        $y = trim($y);
    }

    $sxl = ( levenshtein(soundex($x),   soundex($y))   );
    $sxm = ( levenshtein(metaphone($x), metaphone($y)) );

    echo PHP_EOL . "$x vs $y Soundex Levenshtein=$sxl Metaphone Levenshtein=$sxm";
}

foreach ($strings as $string)
{
    sameThing($string[0], $string[1]);
}


What is your goal?

If your idea is to prevent people from posting twice by accidentally reloading a page, you can quite easily assign IDs to posts before they are written, so accidental double-posting cannot occur.
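A minimal sketch of that idea, assuming PHP sessions and PHP 7+ for random_bytes() (the function names are mine):

```php
<?php
// Sketch of a one-time form token that stops accidental double-posting:
// issue a token when the form is rendered, consume it exactly once on submit.
session_start();

function issue_post_token()
{
    $token = bin2hex(random_bytes(16));      // unpredictable 32-char id
    $_SESSION['post_tokens'][$token] = true;
    return $token;                           // embed in a hidden form field
}

function consume_post_token($token)
{
    if (empty($_SESSION['post_tokens'][$token])) {
        return false;                        // unknown or already used: reject
    }
    unset($_SESSION['post_tokens'][$token]); // one-shot: a reload cannot reuse it
    return true;
}
```

The second submission of the same form (a reload) presents a token that has already been consumed, so the duplicate insert is rejected before any text comparison is needed.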

If you are trying to stop bots, it would help to make those prior IDs random and check them when the data is posted, use CAPTCHAs, and apply the other usual measures.

It is quite possible that looking for similarities in text is not the easiest way to achieve whatever you need to do.

If you really do need to work on similarities, database backends are quite bad at such tasks. Performance-wise, searching for a string across all your posts takes much longer in a database than in a flat text file, and efficiently looking for many strings at once across a great number of stored strings is worse still. There may be more efficient approaches based on word indexing that could more or less detect similar topics, but they will most definitely be a pain to code and still weak performance-wise (in my opinion).

One idea might be a two-level approach:

First, build a series of tokens based on recognizing infrequent words in specific spots. For example: concatenate the first five words not in a basic dictionary that appear in the first 2000 characters after the start of the second sentence; hash the concatenation of all words that have their first letter, and only that letter, capitalized; hash a string built from the lengths of all words in the longest paragraph; hash all URLs... (you will definitely find smarter tokens by looking at a batch of posts).

When a user posts, you calculate those tokens and compare them with the tokens associated with existing posts (using a fast index search). If you find a certain number of matching tokens, you run fuzzy-text comparisons on the matching posts, based on the techniques mentioned in the posts above, to determine whether things are too similar.
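A rough sketch of the first level, with purely illustrative token heuristics (none of these rules is a fixed recipe):

```php
<?php
// Derive a few cheap fingerprint tokens per article. Two articles sharing
// several tokens are candidates for the expensive fuzzy comparison.
function fingerprint_tokens($text)
{
    $tokens = array();

    // Token 1: hash of all URLs that appear in the article
    if (preg_match_all('#https?://\S+#', $text, $m) && $m[0]) {
        $tokens['urls'] = md5(implode(' ', $m[0]));
    }

    // Token 2: hash of all words capitalized on the first letter only
    if (preg_match_all('/\b[A-Z][a-z]+\b/', $text, $m) && $m[0]) {
        $tokens['caps'] = md5(implode(' ', $m[0]));
    }

    // Token 3: hash of the word-length profile of the longest paragraph
    $paras = preg_split('/\n{2,}/', trim($text));
    usort($paras, function ($a, $b) { return strlen($b) - strlen($a); });
    $lengths = array_map('strlen', str_word_count($paras[0], 1));
    $tokens['shape'] = md5(implode(',', $lengths));

    return $tokens;
}
```

Each token would be stored in an indexed table keyed by article ID; a new post sharing several tokens with an existing article would then trigger the fuzzy comparison only against those few candidates.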

Be aware that if your strings are "articles", it may be common for legitimate writers to quote one another, and this should be taken into account.
Rob, Owner (Aidellio), commented:
I've requested that this question be deleted for the following reason:

Not enough information to confirm an answer.
This is not the sort of question that gets a 1+1=2 kind of answer. I think the suggestions to use plagiarism detectors and/or fuzzy text comparisons are valid input towards resolving the author's problem.
Ray Paseur commented:
Agree with MlandaT: these are all excellent suggestions for a question that is more about design patterns than code examples.