Solved

Detecting Duplicate Image Data Types

Posted on 2009-07-08
9
427 Views
Last Modified: 2012-05-07
Hi,

I am using an "image" data type column to store files inside a database. Its normalized to a document record which other records can reference as a foreign key. A document should not occur twice in that table

Therefore, once an image is in there, I dont want to upload a duplicate.

Is there a way of comparing the first 8000 bytes so I can say

"select DocumentID As PossibleDuplicate
From DocumentObject
Where left(myimagefield, 8000) = " & quote(strMyVBString)

then I can loop through these possible duplicates and compare the full image.

Without this I would have to loop through the entire table searching for duplicates which is gonna be heavy on memory and cpu.

Any other ideas or standard practice I have missed ?

thanks

0
Comment
Question by:plq
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
9 Comments
 
LVL 31

Expert Comment

by:RiteshShah
ID: 24802467
question: why you want to compare only first 8000 character only?
0
 
LVL 8

Author Comment

by:plq
ID: 24802478
If we can compare the full thing thats fine but I dont think its possible.

First 8000 is fine because that gives me a list of candidates for being duplicates. then I can loop through that dataset and compare each document fully by comparing byte arrays in vb.net, instead of comparing every document in the database.

BTW this has to work on sql 2000 and later

thanks
0
 
LVL 23

Assisted Solution

by:Racim BOUDJAKDJI
Racim BOUDJAKDJI earned 200 total points
ID: 24802794

Since it is rare that two files have the same size to the nearest byte, I'd suggest you use byte size during the upload process to check for dupplicates.  For instance, you could keep a track of the byte size of all files uploaded then do a compare *only* if the file size already exists.  Not a perfect solution but it may help you save resources.  The algorythm would be something like.

--> Extract filesize from client file
--> Run a query to check for existence of a similar size file
--> If exists --> check some random bytes in random position (if same then don't upload) if not same upload
--> if not exists --upload

Doiing a byte to byte compare will simply kill your concurrency performance.
HTH
0
NEW Veeam Agent for Microsoft Windows

Backup and recover physical and cloud-based servers and workstations, as well as endpoint devices that belong to remote users. Avoid downtime and data loss quickly and easily for Windows-based physical or public cloud-based workloads!

 
LVL 75

Assisted Solution

by:Anthony Perkins
Anthony Perkins earned 300 total points
ID: 24810526
I would go a step further and use your front-end app to save a check sum value in the same row.  That way you can compare that value rather then the raw binary data.  Unfortunately, since you are using image you will not be able to use the T-SQL CHECKSUM() function.
0
 
LVL 8

Accepted Solution

by:
plq earned 0 total points
ID: 24811377
Thanks folks,

.NET has a string to MD5 function so we'll use that in a separate column
0
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 24814190
>>.NET has a string to MD5 function so we'll use that in a separate column<<
But that is just an encrypted version of the data, right?  If so that would basically double the length of the row and would seem overkill.  What you really want is a check sum similar to the way that Winzip does it.
0
 
LVL 8

Author Comment

by:plq
ID: 24814342
No its an MD5 checksum, thats what i meant - sorry ! It would just be a 24 byte field i think

string hash = Convert.ToBase64String(new System.Security.Cryptography.MD5CryptoServiceProvider().ComputeHash(System.Text.Encoding.Default.GetBytes(SomeString)));

Seems to return 24 bytes

0
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 24815650
According to BOL it is 16 bytes, but who is counting :)

<quote>
Hash functions map binary strings of an arbitrary length to small binary strings of a fixed length. A cryptographic hash function has the property that it is computationally infeasible to find two distinct inputs that hash to the same value; that is, hashes of two sets of data should match if the corresponding data also matches. Small changes to the data result in large, unpredictable changes in the hash.

The hash size for the MD5CryptoServiceProvider class is 128 bits.

The ComputeHash methods of the MD5CryptoServiceProvider class return the hash as an array of 16 bytes. Note that some MD5 implementations produce a 32-character, hexadecimal-formatted hash. To interoperate with such implementations, format the return value of the ComputeHash methods as a hexadecimal value.
</quote>
0
 
LVL 75

Expert Comment

by:Anthony Perkins
ID: 24815678
Make sure to index that column as you will probably have it in some WHERE or OUTER JOIIN clause.
0

Featured Post

Best Practices: Disaster Recovery Testing

Besides backup, any IT division should have a disaster recovery plan. You will find a few tips below relating to the development of such a plan and to what issues one should pay special attention in the course of backup planning.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
Display Date and Time 7 48
Removing SCCM 2016 4 48
Enable TLS 1.2 for SQL 2012 Web Edition 1 23
SQL Server Pivot 5 39
Ever needed a SQL 2008 Database replicated/mirrored/log shipped on another server but you can't take the downtime inflicted by initial snapshot or disconnect while T-logs are restored or mirror applied? You can use SQL Server Initialize from Backup…
Ever wondered why sometimes your SQL Server is slow or unresponsive with connections spiking up but by the time you go in, all is well? The following article will show you how to install and configure a SQL job that will send you email alerts includ…
Via a live example, show how to extract insert data into a SQL Server database table using the Import/Export option and Bulk Insert.
Viewers will learn how to use the SELECT statement in SQL to return specific rows and columns, with various degrees of sorting and limits in place.

737 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question