Detecting Duplicate Image Data Types

Hi,

I am using an "image" data type column to store files inside a database. Its normalized to a document record which other records can reference as a foreign key. A document should not occur twice in that table

Therefore, once an image is in there, I dont want to upload a duplicate.

Is there a way of comparing the first 8000 bytes so I can say

"select DocumentID As PossibleDuplicate
From DocumentObject
Where left(myimagefield, 8000) = " & quote(strMyVBString)

then I can loop through these possible duplicates and compare the full image.

Without this I would have to loop through the entire table searching for duplicates which is gonna be heavy on memory and cpu.

Any other ideas or standard practice I have missed ?

thanks

LVL 8
plqAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

RiteshShahCommented:
question: why you want to compare only first 8000 character only?
0
plqAuthor Commented:
If we can compare the full thing thats fine but I dont think its possible.

First 8000 is fine because that gives me a list of candidates for being duplicates. then I can loop through that dataset and compare each document fully by comparing byte arrays in vb.net, instead of comparing every document in the database.

BTW this has to work on sql 2000 and later

thanks
0
Racim BOUDJAKDJIDatabase Architect - Dba - Data ScientistCommented:

Since it is rare that two files have the same size to the nearest byte, I'd suggest you use byte size during the upload process to check for dupplicates.  For instance, you could keep a track of the byte size of all files uploaded then do a compare *only* if the file size already exists.  Not a perfect solution but it may help you save resources.  The algorythm would be something like.

--> Extract filesize from client file
--> Run a query to check for existence of a similar size file
--> If exists --> check some random bytes in random position (if same then don't upload) if not same upload
--> if not exists --upload

Doiing a byte to byte compare will simply kill your concurrency performance.
HTH
0
Determine the Perfect Price for Your IT Services

Do you wonder if your IT business is truly profitable or if you should raise your prices? Learn how to calculate your overhead burden with our free interactive tool and use it to determine the right price for your IT services. Download your free eBook now!

Anthony PerkinsCommented:
I would go a step further and use your front-end app to save a check sum value in the same row.  That way you can compare that value rather then the raw binary data.  Unfortunately, since you are using image you will not be able to use the T-SQL CHECKSUM() function.
0
plqAuthor Commented:
Thanks folks,

.NET has a string to MD5 function so we'll use that in a separate column
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
Anthony PerkinsCommented:
>>.NET has a string to MD5 function so we'll use that in a separate column<<
But that is just an encrypted version of the data, right?  If so that would basically double the length of the row and would seem overkill.  What you really want is a check sum similar to the way that Winzip does it.
0
plqAuthor Commented:
No its an MD5 checksum, thats what i meant - sorry ! It would just be a 24 byte field i think

string hash = Convert.ToBase64String(new System.Security.Cryptography.MD5CryptoServiceProvider().ComputeHash(System.Text.Encoding.Default.GetBytes(SomeString)));

Seems to return 24 bytes

0
Anthony PerkinsCommented:
According to BOL it is 16 bytes, but who is counting :)

<quote>
Hash functions map binary strings of an arbitrary length to small binary strings of a fixed length. A cryptographic hash function has the property that it is computationally infeasible to find two distinct inputs that hash to the same value; that is, hashes of two sets of data should match if the corresponding data also matches. Small changes to the data result in large, unpredictable changes in the hash.

The hash size for the MD5CryptoServiceProvider class is 128 bits.

The ComputeHash methods of the MD5CryptoServiceProvider class return the hash as an array of 16 bytes. Note that some MD5 implementations produce a 32-character, hexadecimal-formatted hash. To interoperate with such implementations, format the return value of the ComputeHash methods as a hexadecimal value.
</quote>
0
Anthony PerkinsCommented:
Make sure to index that column as you will probably have it in some WHERE or OUTER JOIIN clause.
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Microsoft SQL Server

From novice to tech pro — start learning today.