HASH and CRC comparison and details

Posted on 2004-09-15
Last Modified: 2008-02-01
Hi, we are working on a project that requires de-duplication of files (identifying the duplicates) by comparing them. We learned that this can be done by creating a unique hexadecimal string for each file using a checksum or hash function (e.g. CRC32, MD5, SHA-1). I would like to know which is better to use, how CRC and MD5 differ in their algorithms and in the way they create the string, and how fast they are relative to each other.
Question by:cmatian

Accepted Solution

Validor earned 250 total points
ID: 12079161
I've done this very task before (and many variations of it).  In your case, I would recommend CRC32.  

If I may make a recommendation, it is faster to check 3 things when searching for duplicate files.  The first two are very fast and may avoid a CRC check.

1) If file sizes differ, they are not identical.
2) If file timestamps differ, they may not be identical (you decide).
3) If CRC32 checksums differ, they are not identical.
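
The checks above can be sketched in Python roughly as follows. This is only an illustration, not the code I used: the names file_crc32 and candidate_duplicates are made up for the example, the timestamp check (step 2) is left out because it is optional, and files that end up in the same group are only likely duplicates, so a byte-by-byte comparison is still needed if you must be certain.

```python
import os
import zlib
from collections import defaultdict


def file_crc32(path, chunk_size=65536):
    """Compute the CRC32 of a file, reading it in chunks to limit memory use."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)  # continue from the running CRC
    return crc & 0xFFFFFFFF


def candidate_duplicates(paths):
    """Return lists of paths that share both file size and CRC32."""
    # Step 1: group by size; a file with a unique size has no duplicate.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # no CRC needed for this file
        # Step 3: only files of equal size get the more expensive CRC check.
        by_crc = defaultdict(list)
        for p in same_size:
            by_crc[file_crc32(p)].append(p)
        groups.extend(g for g in by_crc.values() if len(g) > 1)
    return groups
```

Calling candidate_duplicates on a list of file paths returns one list per group of probable duplicates. Files whose size is unique are never read at all.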

Most of the time, a hash is used to "represent" data for security purposes.  A CRC (checksum) is used in the same way, but usually in situations where security is not an issue.

MD5 is supposed to be difficult to reverse.  CRC is very easy to reverse using brute force.  Both can be used to see if two pieces of data are identical.  Checksums are usually faster than a good hash, though CRC32 is not the fastest checksum: Adler32 is faster and easier to implement.  However, with a table-driven implementation, CRC32 is fast enough.  Most other checksums have a higher margin of error (the chance that two different inputs produce the same value).  CRC32 has a relatively small margin of error, but MD5 and most hashes have a MUCH smaller margin of error.
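
To put rough numbers on the margin of error, here is a small sketch using the standard birthday approximation. It assumes the checksum values are spread uniformly over all possible outputs, which is only an approximation for real files, and collision_probability is a name made up for this example.

```python
import math


def collision_probability(num_items, bits):
    """Approximate chance that at least two of num_items distinct inputs
    share the same value under an ideal bits-wide checksum or hash
    (birthday approximation: 1 - exp(-n(n-1) / 2^(bits+1)))."""
    exponent = num_items * (num_items - 1) / 2 ** (bits + 1)
    return -math.expm1(-exponent)  # expm1 stays accurate for tiny values


if __name__ == "__main__":
    n = 100_000  # hypothetical number of distinct files
    print("32-bit (CRC32 size):", collision_probability(n, 32))
    print("128-bit (MD5 size): ", collision_probability(n, 128))
```

Under that assumption, a 32-bit value is more likely than not to produce at least one accidental match somewhere in a set of 100,000 distinct files, while a 128-bit value essentially never will. That is why comparing file sizes first, and confirming a CRC32 match with a direct comparison, is worthwhile on large sets.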

MD5 is good for password verification (sometimes called an MD5 shared secret password) or as a proxy for private data, and CRC32 is good for comparing files or data blocks, where speed is more important.

CRC32 values are also much smaller (32 bits, versus 128 bits for MD5) and store better in a database.
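
For a feel of the string each one produces, and a rough idea of relative speed, here is a short sketch using Python's standard zlib and hashlib modules. The helper names digests and time_once are made up for this example, and the timings depend heavily on the machine and library build, so treat them as indicative only.

```python
import hashlib
import time
import zlib


def digests(data: bytes) -> dict:
    """Return the CRC32, Adler32 and MD5 of data as hexadecimal strings."""
    return {
        "crc32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),      # 8 hex chars
        "adler32": format(zlib.adler32(data) & 0xFFFFFFFF, "08x"),  # 8 hex chars
        "md5": hashlib.md5(data).hexdigest(),                       # 32 hex chars
    }


def time_once(func, data: bytes) -> float:
    """Time a single call of func(data) in seconds (a crude measurement)."""
    start = time.perf_counter()
    func(data)
    return time.perf_counter() - start


if __name__ == "__main__":
    print(digests(b"hello world"))
    sample = bytes(16 * 1024 * 1024)  # 16 MiB of zero bytes as test input
    for name, func in [
        ("crc32", zlib.crc32),
        ("adler32", zlib.adler32),
        ("md5", lambda d: hashlib.md5(d).digest()),
    ]:
        print(name, time_once(func, sample))
```

A CRC32 or Adler32 value fits in a 4-byte integer column, while an MD5 digest needs 16 bytes (or 32 characters if stored as hex text).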
