Solved

Detecting Duplicate Image Data Types

Posted on 2009-07-08
9
421 Views
Last Modified: 2012-05-07
Hi,

I am using an "image" data type column to store files inside a database. Its normalized to a document record which other records can reference as a foreign key. A document should not occur twice in that table

Therefore, once an image is in there, I dont want to upload a duplicate.

Is there a way of comparing the first 8000 bytes so I can say

"select DocumentID As PossibleDuplicate
From DocumentObject
Where left(myimagefield, 8000) = " & quote(strMyVBString)

then I can loop through these possible duplicates and compare the full image.

Without this I would have to loop through the entire table searching for duplicates which is gonna be heavy on memory and cpu.

Any other ideas or standard practice I have missed ?

thanks

0
Comment
Question by:plq
9 Comments
 
LVL 31

Expert Comment

by:RiteshShah
ID: 24802467
question: why you want to compare only first 8000 character only?
0
 
LVL 8

Author Comment

by:plq
ID: 24802478
If we can compare the full thing thats fine but I dont think its possible.

First 8000 is fine because that gives me a list of candidates for being duplicates. then I can loop through that dataset and compare each document fully by comparing byte arrays in vb.net, instead of comparing every document in the database.

BTW this has to work on sql 2000 and later

thanks
0
 
LVL 23

Assisted Solution

by:Racim BOUDJAKDJI
Racim BOUDJAKDJI earned 200 total points
ID: 24802794

Since it is rare that two files have the same size to the nearest byte, I'd suggest you use byte size during the upload process to check for dupplicates.  For instance, you could keep a track of the byte size of all files uploaded then do a compare *only* if the file size already exists.  Not a perfect solution but it may help you save resources.  The algorythm would be something like.

--> Extract filesize from client file
--> Run a query to check for existence of a similar size file
--> If exists --> check some random bytes in random position (if same then don't upload) if not same upload
--> if not exists --upload

Doiing a byte to byte compare will simply kill your concurrency performance.
HTH
0
 
LVL 75

Assisted Solution

by:Anthony Perkins
Anthony Perkins earned 300 total points
ID: 24810526
I would go a step further and use your front-end app to save a check sum value in the same row.  That way you can compare that value rather then the raw binary data.  Unfortunately, since you are using image you will not be able to use the T-SQL CHECKSUM() function.
0
6 Surprising Benefits of Threat Intelligence

All sorts of threat intelligence is available on the web. Intelligence you can learn from, and use to anticipate and prepare for future attacks.

 
LVL 8

Accepted Solution

by:
plq earned 0 total points
ID: 24811377
Thanks folks,

.NET has a string to MD5 function so we'll use that in a separate column
0
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 24814190
>>.NET has a string to MD5 function so we'll use that in a separate column<<
But that is just an encrypted version of the data, right?  If so that would basically double the length of the row and would seem overkill.  What you really want is a check sum similar to the way that Winzip does it.
0
 
LVL 8

Author Comment

by:plq
ID: 24814342
No its an MD5 checksum, thats what i meant - sorry ! It would just be a 24 byte field i think

string hash = Convert.ToBase64String(new System.Security.Cryptography.MD5CryptoServiceProvider().ComputeHash(System.Text.Encoding.Default.GetBytes(SomeString)));

Seems to return 24 bytes

0
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 24815650
According to BOL it is 16 bytes, but who is counting :)

<quote>
Hash functions map binary strings of an arbitrary length to small binary strings of a fixed length. A cryptographic hash function has the property that it is computationally infeasible to find two distinct inputs that hash to the same value; that is, hashes of two sets of data should match if the corresponding data also matches. Small changes to the data result in large, unpredictable changes in the hash.

The hash size for the MD5CryptoServiceProvider class is 128 bits.

The ComputeHash methods of the MD5CryptoServiceProvider class return the hash as an array of 16 bytes. Note that some MD5 implementations produce a 32-character, hexadecimal-formatted hash. To interoperate with such implementations, format the return value of the ComputeHash methods as a hexadecimal value.
</quote>
0
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 24815678
Make sure to index that column as you will probably have it in some WHERE or OUTER JOIIN clause.
0

Featured Post

IT, Stop Being Called Into Every Meeting

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

Join & Write a Comment

I wrote this interesting script that really help me find jobs or procedures when working in a huge environment. I could I have written it as a Procedure but then I would have to have it on each machine or have a link to a server-related search that …
When you hear the word proxy, you may become apprehensive. This article will help you to understand Proxy and when it is useful. Let's talk Proxy for SQL Server. (Not in terms of Internet access.) Typically, you'll run into this type of problem w…
Using examples as well as descriptions, and references to Books Online, show the documentation available for date manipulation functions and by using a select few of these functions, show how date based data can be manipulated with these functions.
Viewers will learn how to use the INSERT statement to insert data into their tables. It will also introduce the NULL statement, to show them what happens when no value is giving for any given column.

758 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

19 Experts available now in Live!

Get 1:1 Help Now