• Status: Solved

Detecting Duplicate Image Data Types


I am using an "image" data type column to store files inside a database. It's normalized to a document record which other records can reference as a foreign key. A document should not occur twice in that table.

Therefore, once an image is in there, I don't want to upload a duplicate.

Is there a way of comparing the first 8000 bytes so I can say

"select DocumentID As PossibleDuplicate
From DocumentObject
Where left(myimagefield, 8000) = " & quote(strMyVBString)

Then I can loop through these possible duplicates and compare the full image.

Without this I would have to loop through the entire table searching for duplicates, which is going to be heavy on memory and CPU.

Any other ideas or standard practice I have missed?


3 Solutions
Question: why do you want to compare only the first 8000 characters?
plq (Author) Commented:
If we can compare the full thing that's fine, but I don't think it's possible.

The first 8000 is fine because that gives me a list of candidates for being duplicates. Then I can loop through that dataset and compare each candidate fully by comparing byte arrays in VB.NET, instead of comparing every document in the database.
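The two-phase idea described above (cheap prefix filter first, full comparison only on the candidates) can be sketched as follows. This is a minimal illustration, not the asker's code: the `stored` dictionary is a hypothetical in-memory stand-in for the DocumentObject table, and `find_duplicates` is an invented name. On the SQL Server side, SUBSTRING is documented as accepting image columns, so phase 1 could probably be pushed into the query rather than done in application code.

```python
from typing import Dict, List

PREFIX_LEN = 8000  # prefix length proposed in the question


def find_duplicates(new_doc: bytes, stored: Dict[int, bytes]) -> List[int]:
    """Return the IDs of stored documents whose bytes equal new_doc.

    `stored` maps DocumentID -> document bytes (hypothetical stand-in
    for the DocumentObject table).
    """
    prefix = new_doc[:PREFIX_LEN]
    # Phase 1: cheap filter on the first PREFIX_LEN bytes only.
    candidates = [doc_id for doc_id, data in stored.items()
                  if data[:PREFIX_LEN] == prefix]
    # Phase 2: full comparison, but only for the few candidates.
    return [doc_id for doc_id in candidates if stored[doc_id] == new_doc]
```

Documents that share a long common header (for example, files produced from the same template) will all pass phase 1, so the full comparison in phase 2 is still needed.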

BTW this has to work on SQL Server 2000 and later.

Racim BOUDJAKDJI (Database Architect - Dba - Data Scientist) Commented:

Since it is rare for two different files to have exactly the same size in bytes, I'd suggest you use the byte size during the upload process to check for duplicates. For instance, you could keep track of the byte size of every uploaded file, then do a compare *only* if that file size already exists. Not a perfect solution, but it may help you save resources. The algorithm would be something like:

--> Extract the file size from the client file
--> Run a query to check for an existing file of the same size
--> If one exists --> check some random bytes at random positions (if they match, don't upload; if not, upload)
--> If none exists --> upload

Doing a byte-by-byte compare will simply kill your concurrency performance.
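The steps above might look roughly like this. It is only a sketch under assumptions: `stored` is a hypothetical in-memory stand-in for the table, `is_probable_duplicate` and the `samples` parameter are invented for illustration, and in a real system the size lookup would be a query against a stored size column.

```python
import random
from typing import Dict


def is_probable_duplicate(new_doc: bytes, stored: Dict[int, bytes],
                          samples: int = 16) -> bool:
    """Size check first, then a spot check of bytes at random positions."""
    # Step 1: only documents with exactly the same byte size can match.
    same_size = [data for data in stored.values()
                 if len(data) == len(new_doc)]
    if not same_size:
        return False  # no file of this size exists, so upload
    # Step 2: compare a handful of randomly chosen byte positions.
    positions = random.sample(range(len(new_doc)),
                              min(samples, len(new_doc)))
    return any(all(data[p] == new_doc[p] for p in positions)
               for data in same_size)
```

As the comment says, this is not a perfect solution: two same-size files that differ only at positions the spot check happens to skip would be reported as duplicates, so a `True` result here means "probable", not "certain".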

Anthony Perkins Commented:
I would go a step further and use your front-end app to save a checksum value in the same row. That way you can compare that value rather than the raw binary data. Unfortunately, since you are using image you will not be able to use the T-SQL CHECKSUM() function.
plq (Author) Commented:
Thanks folks,

.NET has a string-to-MD5 function, so we'll use that in a separate column.
Anthony Perkins Commented:
>>.NET has a string to MD5 function so we'll use that in a separate column<<
But that is just an encrypted version of the data, right? If so, that would basically double the length of the row and would seem overkill. What you really want is a checksum similar to the way that WinZip does it.
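For reference, the kind of checksum ZIP archives store per entry is a CRC-32, which is a fixed 4 bytes regardless of file size. A minimal sketch, assuming the file's bytes are already in memory (`crc32_checksum` is an invented name):

```python
import zlib


def crc32_checksum(data: bytes) -> int:
    """Return the CRC-32 of data as an unsigned 32-bit integer."""
    # Mask to 32 bits so the result is unsigned on every platform.
    return zlib.crc32(data) & 0xFFFFFFFF
```

A 32-bit checksum collides far more often than a 128-bit hash, so matching values should still be confirmed with a full comparison.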
plq (Author) Commented:
No, it's an MD5 checksum, that's what I meant - sorry! It would just be a 24-byte field, I think.

string hash = Convert.ToBase64String(new System.Security.Cryptography.MD5CryptoServiceProvider().ComputeHash(System.Text.Encoding.Default.GetBytes(SomeString)));

Seems to return 24 bytes.

Anthony Perkins Commented:
According to BOL it is 16 bytes, but who is counting :)

Hash functions map binary strings of an arbitrary length to small binary strings of a fixed length. A cryptographic hash function has the property that it is computationally infeasible to find two distinct inputs that hash to the same value; that is, hashes of two sets of data should match if the corresponding data also matches. Small changes to the data result in large, unpredictable changes in the hash.

The hash size for the MD5CryptoServiceProvider class is 128 bits.

The ComputeHash methods of the MD5CryptoServiceProvider class return the hash as an array of 16 bytes. Note that some MD5 implementations produce a 32-character, hexadecimal-formatted hash. To interoperate with such implementations, format the return value of the ComputeHash methods as a hexadecimal value.
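The 16-versus-24 discrepancy comes down to encoding: the MD5 digest itself is 16 bytes, Base64 turns those 16 bytes into 24 characters, and hexadecimal turns them into 32. A small sketch to show the three forms (`md5_forms` is an invented helper, and it hashes the raw bytes directly rather than going through a text encoding, which is an assumption about what is wanted for binary files):

```python
import base64
import hashlib
from typing import Tuple


def md5_forms(data: bytes) -> Tuple[bytes, str, str]:
    """Return the MD5 of data as raw bytes, Base64 text, and hex text."""
    digest = hashlib.md5(data).digest()             # 16 raw bytes (128 bits)
    b64 = base64.b64encode(digest).decode("ascii")  # 24 characters
    hex_text = digest.hex()                         # 32 characters
    return digest, b64, hex_text
```

Storing the raw 16 bytes in a binary(16) column would be the most compact option; the Base64 or hex forms are only needed if the value has to travel as text.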
Anthony Perkins Commented:
Make sure to index that column as you will probably have it in some WHERE or OUTER JOIN clause.
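Putting the thread's conclusion together, here is a rough sketch of an indexed hash column used for duplicate detection on insert. It uses an in-memory SQLite database purely as a stand-in for SQL Server; DocumentObject and DocumentID come from the question, while the ContentHash and Content column names, the index name, and both function names are assumptions made for illustration.

```python
import hashlib
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory table with an indexed hash column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE DocumentObject ("
        " DocumentID INTEGER PRIMARY KEY,"
        " ContentHash BLOB NOT NULL,"
        " Content BLOB NOT NULL)"
    )
    # The index keeps the WHERE ContentHash = ? lookup cheap.
    conn.execute(
        "CREATE INDEX IX_DocumentObject_ContentHash"
        " ON DocumentObject (ContentHash)"
    )
    return conn


def add_document(conn: sqlite3.Connection, content: bytes) -> int:
    """Insert content unless an identical document exists; return its ID."""
    digest = hashlib.md5(content).digest()
    rows = conn.execute(
        "SELECT DocumentID, Content FROM DocumentObject"
        " WHERE ContentHash = ?",
        (digest,),
    ).fetchall()
    # Equal hashes are only candidates; confirm with a full comparison.
    for doc_id, existing in rows:
        if existing == content:
            return doc_id
    cur = conn.execute(
        "INSERT INTO DocumentObject (ContentHash, Content) VALUES (?, ?)",
        (digest, content),
    )
    return cur.lastrowid
```

Uploading the same bytes twice returns the original DocumentID instead of creating a second row, and only rows with a matching hash are ever read back for the full comparison.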