Solved

Detecting Duplicate Image Data Types

Posted on 2009-07-08
Last Modified: 2012-05-07
Hi,

I am using an "image" data type column to store files inside a database. It's normalized to a document record which other records can reference as a foreign key. A document should not occur twice in that table.

Therefore, once an image is in there, I don't want to upload a duplicate.

Is there a way of comparing the first 8000 bytes so I can say

"select DocumentID As PossibleDuplicate
From DocumentObject
Where left(myimagefield, 8000) = " & quote(strMyVBString)

then I can loop through these possible duplicates and compare the full image.

Without this I would have to loop through the entire table searching for duplicates which is gonna be heavy on memory and cpu.

Any other ideas or standard practice I have missed?

thanks

Question by:plq
9 Comments
 
LVL 31

Expert Comment

by:RiteshShah
ID: 24802467
Question: why do you want to compare only the first 8000 characters?
 
LVL 8

Author Comment

by:plq
ID: 24802478
If we can compare the full thing that's fine, but I don't think it's possible.

The first 8000 bytes are fine because that gives me a list of candidates for being duplicates. Then I can loop through that dataset and compare each document fully by comparing byte arrays in VB.NET, instead of comparing every document in the database.

BTW this has to work on SQL Server 2000 and later.

thanks
 
LVL 23

Assisted Solution

by:Racim BOUDJAKDJI
Racim BOUDJAKDJI earned 200 total points
ID: 24802794

Since it is rare that two files have the same size to the nearest byte, I'd suggest you use byte size during the upload process to check for duplicates. For instance, you could keep track of the byte size of every file uploaded, then do a compare *only* if a file of that size already exists. Not a perfect solution, but it may help you save resources. The algorithm would be something like:

--> Extract the file size from the client file
--> Run a query to check for an existing file of the same size
--> If one exists --> check some random bytes at random positions (if they match, don't upload; if they differ, upload)
--> If none exists --> upload
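
The steps above could be sketched roughly as follows. This is only an illustration in Python, not the poster's code: the function name `is_probable_duplicate`, the in-memory list standing in for the table, and the `samples` and `seed` parameters are all assumptions made for the example.

```python
import random


def is_probable_duplicate(candidate, stored_files, samples=16, seed=0):
    """Return True if some stored file has the same size as `candidate`
    and agrees with it at a handful of randomly chosen byte positions."""
    size = len(candidate)

    # Step 1: only files with exactly the same byte size are candidates.
    same_size = [f for f in stored_files if len(f) == size]
    if not same_size:
        return False
    if size == 0:
        return True  # two empty files are trivially identical

    # Step 2: spot-check a few positions instead of comparing every byte.
    rng = random.Random(seed)
    positions = [rng.randrange(size) for _ in range(samples)]
    return any(
        all(f[p] == candidate[p] for p in positions) for f in same_size
    )


stored = [b"abcdef", b"xyz"]
print(is_probable_duplicate(b"abcdef", stored))  # same size, same bytes
print(is_probable_duplicate(b"ZZZZZZ", stored))  # same size, different bytes
print(is_probable_duplicate(b"abcd", stored))    # no file of this size
```

A spot check like this can report a match for two files that differ only at positions that were not sampled, so a full comparison (or a stored hash) would still be needed before treating a hit as a confirmed duplicate.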

Doing a byte-to-byte compare will simply kill your concurrency performance.
HTH
 
LVL 75

Assisted Solution

by:Anthony Perkins
Anthony Perkins earned 300 total points
ID: 24810526
I would go a step further and use your front-end app to save a checksum value in the same row. That way you can compare that value rather than the raw binary data. Unfortunately, since you are using image, you will not be able to use the T-SQL CHECKSUM() function.
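
As a rough illustration of computing the checksum in the application and comparing that value instead of the raw bytes, here is a small Python sketch. The `DocumentStore` class is a hypothetical in-memory stand-in for the document table, and MD5 is used only because it is the digest that comes up later in this thread; any digest computed by the front end would follow the same pattern.

```python
import hashlib


class DocumentStore:
    """Hypothetical in-memory stand-in for the document table."""

    def __init__(self):
        self._by_hash = {}   # hex digest -> document id
        self._next_id = 1

    def add(self, data):
        """Store `data` unless a document with the same digest exists.

        Returns (document_id, was_inserted)."""
        digest = hashlib.md5(data).hexdigest()
        existing = self._by_hash.get(digest)
        if existing is not None:
            return existing, False   # reuse the existing document
        doc_id = self._next_id
        self._next_id += 1
        self._by_hash[digest] = doc_id
        return doc_id, True


store = DocumentStore()
print(store.add(b"hello"))  # (1, True)
print(store.add(b"hello"))  # (1, False)
print(store.add(b"world"))  # (2, True)
```

Two different files can in principle share a digest, so a cautious implementation might still compare the full bytes of any row whose checksum matches before rejecting the upload.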

 
LVL 8

Accepted Solution

by:
plq earned 0 total points
ID: 24811377
Thanks folks,

.NET has a string to MD5 function so we'll use that in a separate column
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 24814190
>>.NET has a string to MD5 function so we'll use that in a separate column<<
But that is just an encrypted version of the data, right?  If so, that would basically double the length of the row and would seem overkill.  What you really want is a checksum similar to the way that WinZip does it.
 
LVL 8

Author Comment

by:plq
ID: 24814342
No, it's an MD5 checksum, that's what I meant - sorry! It would just be a 24 byte field, I think.

string hash = Convert.ToBase64String(new System.Security.Cryptography.MD5CryptoServiceProvider().ComputeHash(System.Text.Encoding.Default.GetBytes(SomeString)));

Seems to return 24 bytes

 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 24815650
According to BOL it is 16 bytes, but who is counting :)

<quote>
Hash functions map binary strings of an arbitrary length to small binary strings of a fixed length. A cryptographic hash function has the property that it is computationally infeasible to find two distinct inputs that hash to the same value; that is, hashes of two sets of data should match if the corresponding data also matches. Small changes to the data result in large, unpredictable changes in the hash.

The hash size for the MD5CryptoServiceProvider class is 128 bits.

The ComputeHash methods of the MD5CryptoServiceProvider class return the hash as an array of 16 bytes. Note that some MD5 implementations produce a 32-character, hexadecimal-formatted hash. To interoperate with such implementations, format the return value of the ComputeHash methods as a hexadecimal value.
</quote>
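
The 16 versus 24 discrepancy comes down to encoding: the raw MD5 digest is 16 bytes, its Base64 text form is 24 characters, and its hexadecimal form is 32 characters. A short Python sketch (the helper name `md5_forms` is made up for this example) shows the three lengths side by side:

```python
import base64
import hashlib


def md5_forms(data):
    """Return the MD5 digest of `data` as raw bytes, Base64 text and hex text."""
    raw = hashlib.md5(data).digest()
    return raw, base64.b64encode(raw).decode("ascii"), raw.hex()


raw, b64, hexed = md5_forms(b"example document bytes")
print(len(raw), len(b64), len(hexed))  # 16 24 32
```

So a 16-byte binary column is enough for the raw digest, while a character column needs 24 characters for Base64 or 32 for hex.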
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 24815678
Make sure to index that column, as you will probably have it in some WHERE or OUTER JOIN clause.
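
A minimal sketch of an indexed hash column used in a WHERE clause follows. It uses Python with an in-memory SQLite database purely as a runnable stand-in for SQL Server; the `ContentHash` column, the index name, and the helper functions are assumptions for illustration, while `DocumentObject` and `DocumentID` are the names from the question.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DocumentObject ("
    " DocumentID INTEGER PRIMARY KEY,"
    " ContentHash TEXT NOT NULL,"   # hypothetical hash column
    " Content BLOB NOT NULL)"
)
# Index the hash column so the duplicate lookup does not scan every row.
conn.execute(
    "CREATE INDEX IX_DocumentObject_ContentHash"
    " ON DocumentObject (ContentHash)"
)


def find_duplicates(conn, data):
    """Return ids of rows whose hash matches and whose bytes are identical."""
    digest = hashlib.md5(data).hexdigest()
    rows = conn.execute(
        "SELECT DocumentID, Content FROM DocumentObject"
        " WHERE ContentHash = ?",
        (digest,),
    ).fetchall()
    # Confirm each hash match with a full byte comparison.
    return [doc_id for doc_id, content in rows if bytes(content) == data]


def insert_document(conn, data):
    """Insert `data` unless an identical document exists; return its id."""
    dupes = find_duplicates(conn, data)
    if dupes:
        return dupes[0]
    cur = conn.execute(
        "INSERT INTO DocumentObject (ContentHash, Content) VALUES (?, ?)",
        (hashlib.md5(data).hexdigest(), data),
    )
    return cur.lastrowid
```

Only the rows that share a hash are ever read in full, which is the point of storing and indexing the hash in the first place.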


Question has a verified solution.
