NTGuru705 asked:
Unique File Identifier (.Net)

I am working with some rather large files (> 250GB in some cases) and I need to determine how I can best make a signature of these files. My initial plan was to do something like this.

    ' specify the path to a file and this routine will calculate your hash
    Public Function calcFileHash(ByVal filepath As String) As String
        ' open file (as read-only)
        Using reader As New System.IO.FileStream(filepath, IO.FileMode.Open, IO.FileAccess.Read)
            Using md5 As New System.Security.Cryptography.MD5CryptoServiceProvider
                ' hash contents of this stream
                Dim hash() As Byte = md5.ComputeHash(reader)
                ' return formatted hash
                Return ByteArrayToString(hash)
            End Using
        End Using
    End Function
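
ByteArrayToString is referenced but not shown in the question; a minimal, assumed implementation that formats the hash bytes as hex might look like this:

    ' assumed helper: format a hash as a lowercase hex string
    Private Function ByteArrayToString(ByVal bytes() As Byte) As String
        Dim sb As New System.Text.StringBuilder(bytes.Length * 2)
        For Each b As Byte In bytes
            sb.Append(b.ToString("x2"))
        Next
        Return sb.ToString()
    End Function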



The problem is that this is VERY slow for files this large. I am using MD5 because, in my (limited) understanding, it carries much less collision risk than other methods... however the speed is a very real problem. Does anyone know how I might better generate a unique signature for a file of this size? The only real tools at my disposal are in the .NET realm, but I could use a third-party component if it fit the bill.

The goal: Generate a signature that is unique to the contents of a file that can be in excess of 250GB.
Thanks for your thoughts and direction here....


ASKER CERTIFIED SOLUTION
phoffric

I guess it depends what you're using this for.
The major methods are either a cryptographic hash as you're using (SHA or MD5), or a simple CRC32 checksum.

CRC32, though still fairly collision-resistant, is as you stated more likely to have collisions, but what is the signature for?

If you're actually using this to verify that the file was not maliciously tampered with, then unfortunately you're going to have to bite the bullet and accept the delay the MD5 causes.

If, on the other hand, you're using this for upload verification, for example, to make sure bits weren't lost in transfer, then the likelihood of a random false positive is virtually nil, and I'd use the CRC32.
NTGuru705 (Asker):

I am looking to say that this file is the same as another file.... so I need to be sure the file has a signature that is unique and can be compared to other signatures. I think MD5 is the only way to do this. It would be OK (though not favorable) for a signature to report that two files are NOT the same when they actually are, but it would not be OK for it to report they are the same when in truth they are not.... make sense?

So if I dump an 80GB file because I think I have another copy of it this is NOT ok.. what is OK is to keep the 80GB file that I don't think I have another copy of when in truth I do.

I think the random sample of chunks is OK, but I am not sure it is a safe comparison to make given my requirements.... thoughts?

Thanks for your help...
Ah, I think I understand.

"So if I dump an 80GB file because I think I have another copy of it this is NOT ok.. "
meaning if you think you have another exact duplicate somewhere, then you'd dump the file.

Depending on how sensitive your data is, it's pretty much a judgement call.
MD5 is definitely the safer route, but if you used CRC32 the chances are extremely low that you'd get a false positive.

The danger is more for secure scenarios, where you want to make sure no one has intentionally messed around with the file while trying to keep the same hash.
A random unintentional collision is very highly unlikely... but it is possible.
I'll speed-test the CRC against MD5 and report back.
Thanks
phoffric:

>> So if I dump an 80GB file because I think I have another copy of it this is NOT ok
     Then even if the files were identical at one point in time, it may still be true that they are not identical today due to hardware issues. It is not too uncommon for a file to be corrupted due to a head impact. So, if you have an older file whose signature was computed a while ago, and now you have a copy, and you just computed the signature and it matches, then it would be safer to delete the older file. Running a disk-check program to try to repair sectors should be done periodically on critical files (but, just like excessive virus scans, this activity in itself wears out the disk).

     Although this part is obvious, it doesn't hurt to say it. Obviously the files are different if the file sizes are different. So, prefacing the filesize to your computed signature adds a bit more uniqueness to the computed signature.

     If you think that 5-10 chunks of data may be insufficient to guarantee uniqueness due to the structure of the data being somewhat repetitive at times, then consider 128k chunks of 32 bytes each to spread out the samples. But, if the reason for a difference could be tampering of sections, or bit-errors in transmission, then this approach does not work - the signature must be based upon every bit in the file.
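
As a small illustration of the filesize idea, reusing the calcFileHash routine from the question (variable names are placeholders only), the length can simply be prefixed to whatever hash is computed:

    ' sketch: prefix the file length to whatever hash is computed
    Dim info As New System.IO.FileInfo(filepath)
    Dim signature As String = info.Length.ToString() & "-" & calcFileHash(filepath)
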
What I would suggest you do, without actually trying to understand what you are trying to do (lol), is look into using rsync, but output the rsync list of modified files into a Python script that does this job for you.
I was looking for a wider standard CRC check, and I see there is CRC-64-ECMA-182:
    http://en.wikipedia.org/wiki/Cyclic_redundancy_check
Some further research is needed to confirm that the collision risk is less than with CRC-32. Note that CRC-64-ISO is considered weak for hashing.
    http://www.cs.ucl.ac.uk/staff/d.jones/crcnote.pdf
I am sorry if I have not been clear about what I am trying to do...
What I am trying to do is compare files to see if they are identical.... these files can be large, 80 - 250GB.

There are a large number of files, so I am building a checksum of the files on disk and then looking for files that are duplicates of each other... the size is obviously something I could use, but I am not comparing A to B; I have N files and I need to say these files exist in more than one place. So I am storing this information and querying for hits.

Hope this helps...

MD5 on an 80GB file took 47 minutes... will check CRC and report back.

The rsync option sounds interesting.. but I am not sure I understand what you mean by outputting the list of modified files into a script... How will rsync help me get a signature of a file? Curious to investigate this.

Thanks for your support; I'll check out the links posted.
Just to clarify, I understand that the only files that need to be compared are the ones with the same size... without going into reasons that don't make any sense in this forum, I really need a value that says myfile.ext has a signature of XYZ and it also exists HERE.

Thanks for your continued help.
I just re-read the post to make sure I wasn't misunderstanding.

"Although this part is obvious, it doesn't hurt to say it. Obviously the files are different if the file sizes are different. So, prefacing the filesize to your computed signature adds a bit more uniqueness to the computed signature." - very true and good idea. Just need to see how much faster the CRC is.

Stand by..
Looks like .NET doesn't implement CRC? Seems strange, but I'll keep digging..
I see lots of hits from folks who have written methods to do it.
There is a good post here
https://www.experts-exchange.com/questions/23907878/vb-net-crc32-function-for-files.html

I am trying this implementation now for speed... same file... same disk I/O.
Will advise.

Thanks
I am a moron for asking this but I'll run that risk...
This is C++ and I have no idea how to make use of it in a .NET application.. any direction to a .NET implementation?
Thank you for your help.... the implementation from the link I sent you is still running, which means it is not really any faster so far.
I actually just realized that the other method may be much faster; I forgot a comment in one line... oops.
Will post when I have results on the time the CRC calc takes.
Thanks for your patience.
What kind of I/O are you using?
See, for example, http://en.wikipedia.org/wiki/Memory-mapped_file, to see if this approach speeds up I/O.
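
For reference, .NET 4.0 exposes memory-mapped files through System.IO.MemoryMappedFiles. A minimal sketch of hashing through a mapped view (assuming a 64-bit process, since the whole file is mapped, and reusing the ByteArrayToString helper) might look like the following; whether it is actually faster still depends on the disk, since every byte has to be read either way:

    ' sketch only: hash a file through a memory-mapped view stream
    Public Function calcFileHashMapped(ByVal filepath As String) As String
        Using mmf As System.IO.MemoryMappedFiles.MemoryMappedFile = _
                System.IO.MemoryMappedFiles.MemoryMappedFile.CreateFromFile(filepath, IO.FileMode.Open)
            ' CreateViewStream() with no arguments maps the entire file
            Using view As System.IO.MemoryMappedFiles.MemoryMappedViewStream = mmf.CreateViewStream()
                Using md5 As New System.Security.Cryptography.MD5CryptoServiceProvider
                    Return ByteArrayToString(md5.ComputeHash(view))
                End Using
            End Using
        End Using
    End Function
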
I am using a FileStream... I think .NET uses memory mapping by default, right?

    ' sw is a Stopwatch started earlier; CRC32 is the class from the linked post
    Using reader As New System.IO.FileStream(filepath, IO.FileMode.Open, IO.FileAccess.Read, FileShare.Read, 8192)
        Dim crc As New CRC32
        Dim crchash As String = crc.GetCrc32(reader).ToString

        sw.Stop()
        MsgBox((sw.ElapsedMilliseconds / 1000).ToString + " seconds.")
        'Return ByteArrayToString(crchash)   ' not needed; GetCrc32 already returned a String
        Return crchash
    End Using
well, those code tags didn't work... ; )
As I don't know that language, I can't tell.

BTW - for code, just hit the "Code" button below the post.
Did you use the table-driven CRC function to speed things up?
The table-based example is C++ and I am a .NET guy... I am not sure what I can do with it.. I would love to try it, but I'm sorry, I don't know how.
I think you're Ok. I see that you are first building a lookup table.
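
For reference, the table-driven approach ports to VB.NET directly. A rough sketch using the standard reflected CRC-32 polynomial &HEDB88320 (buffer size arbitrary; this is not the class from the linked post):

    ' minimal table-driven CRC32 sketch, illustrative only
    Public Class Crc32
        Private Shared Table(255) As UInteger

        Shared Sub New()
            ' build the lookup table once for the reflected polynomial &HEDB88320
            For i As Integer = 0 To 255
                Dim entry As UInteger = CUInt(i)
                For bit As Integer = 0 To 7
                    If (entry And 1UI) <> 0UI Then
                        entry = (entry >> 1) Xor &HEDB88320UI
                    Else
                        entry = entry >> 1
                    End If
                Next
                Table(i) = entry
            Next
        End Sub

        Public Shared Function Compute(ByVal stream As System.IO.Stream) As UInteger
            Dim crc As UInteger = &HFFFFFFFFUI
            Dim buffer(65535) As Byte                ' 64 KB read buffer
            Dim read As Integer = stream.Read(buffer, 0, buffer.Length)
            While read > 0
                For i As Integer = 0 To read - 1
                    crc = (crc >> 8) Xor Table(CInt((crc Xor buffer(i)) And &HFFUI))
                Next
                read = stream.Read(buffer, 0, buffer.Length)
            End While
            Return Not crc                           ' final bit inversion
        End Function
    End Class
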
OK, so I am leaning toward phoffric's idea now, because all the other methods are just taking way too long.

I am thinking of reading the first and last X MB of the file plus some samples in the middle that are X MB as well. I then build a string from the MD5 of each of those segments, concatenate them together (adding the filesize as well), and once that's all done I take an MD5 of that big concatenated string and use that as my unique value (a rough sketch of this follows below).

Couple questions...

1. Is this a dumb idea?
2. I am trying to figure out how many samples of the file I should take, but I am not really sure how to determine how much is enough.

Thanks
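
A rough sketch of that sampling plan, purely illustrative: chunk size and sample count are placeholders, it assumes at least two samples per file, and it reuses the ByteArrayToString helper from earlier. As discussed above, a sampled signature can only prove two files are different; the filesize prefix is the extra hedge when signatures match.

    ' sketch: hash evenly spaced chunks (including the first and last),
    ' concatenate the pieces with the file size, then hash that string
    Public Function calcSampledSignature(ByVal filepath As String, _
                                         ByVal chunkBytes As Integer, _
                                         ByVal sampleCount As Integer) As String
        Dim parts As New System.Text.StringBuilder()
        Using md5 As New System.Security.Cryptography.MD5CryptoServiceProvider
            Using reader As New System.IO.FileStream(filepath, IO.FileMode.Open, IO.FileAccess.Read)
                Dim length As Long = reader.Length
                parts.Append(length.ToString()).Append("-")
                Dim buffer(chunkBytes - 1) As Byte
                For i As Integer = 0 To sampleCount - 1
                    ' spread the samples evenly; first sample at offset 0, last ending at EOF
                    Dim offset As Long = CLng((length - chunkBytes) * (i / CDbl(sampleCount - 1)))
                    If offset < 0 Then offset = 0
                    reader.Seek(offset, IO.SeekOrigin.Begin)
                    Dim read As Integer = reader.Read(buffer, 0, buffer.Length)
                    parts.Append(ByteArrayToString(md5.ComputeHash(buffer, 0, read)))
                Next
            End Using
            ' hash the long concatenation down to one short value
            Return ByteArrayToString(md5.ComputeHash(System.Text.Encoding.ASCII.GetBytes(parts.ToString())))
        End Using
    End Function
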
Another idea is this.. I can roar through the file in 1 MB chunks and calculate an MD5 for each chunk at about 225 MB/s, which is pretty respectable. Would I be able to create an array of hashes and then hash those, and if so, would that get me a unique value for the file? This would give me a full pass through the file in 1 MB chunks (see the sketch below).

Thoughts?

Thanks
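
For comparison, if the goal is a single MD5 that still covers every byte while reading 1 MB at a time, the framework's incremental TransformBlock/TransformFinalBlock calls produce the same value as hashing the whole stream; a minimal sketch follows. This does not reduce the total I/O, but it lets you control the read size and report progress along the way.

    ' sketch: full-pass MD5 computed 1 MB at a time
    Public Function calcFileHashChunked(ByVal filepath As String) As String
        Using md5 As New System.Security.Cryptography.MD5CryptoServiceProvider
            Using reader As New System.IO.FileStream(filepath, IO.FileMode.Open, IO.FileAccess.Read)
                Dim buffer(1048575) As Byte          ' 1 MB read buffer
                Dim read As Integer = reader.Read(buffer, 0, buffer.Length)
                While read > 0
                    md5.TransformBlock(buffer, 0, read, Nothing, 0)
                    read = reader.Read(buffer, 0, buffer.Length)
                End While
                md5.TransformFinalBlock(buffer, 0, 0)
                Return ByteArrayToString(md5.Hash)
            End Using
        End Using
    End Function
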
With upfront timing analysis, it may be possible to add parallelism in your program. The issues to be considered are lost cache hits as a result of parallelism, and disk thrashing. So multi-threading with two cores is not likely to double the processing rate.

1. dumb - even though I suggested it, it may be dumb. I did have a number of caveats as to why you would need to examine every bit in the file. What is the mechanism for your file duplication? Are there any caveats that are a concern to you?

1a. Even if you didn't have the question about dumping a duplicate file, what do you do w.r.t. the practical matter of hard-disk failures - either bad sectors or complete disk failure?

2. How similar are the data from one file to another? How likely is it that 20 MB of data, evenly spaced out in small chunks across the file, would match and yet the files are different? That is your worst-case scenario. If, for example, you were recording streaming data from videos of live action, and there were a protocol within the file that was very repetitive throughout, then there is a small probability that you could pick up only the matching pieces. If there was no such protocol, and the data itself was fairly random, or at least fairly random from file to file, then the likelihood of finding all of these chunks matching is extremely small.

What is the rate at which you can read a file (e.g., in MB/sec) without any processing, just reading it into memory and throwing away the data?
I was in no way trying to say your idea was dumb; I meant my thoughts about how to implement it... i.e., not enough samples, samples too small, etc.

There are some files with video content and some that are database files.. large database files (duplicate copies). With the DB files there is a much better chance of stumbling upon the same info... so I need to make sure my number of samples is high enough - but that is what I am trying to think through at the moment.

Thanks again for your thoughts.. very helpful.
I knew you were not saying that my idea was dumb. :)
     - I was saying it could be dumb depending upon the nature of the data.
     - re: The database files that have large portions of duplication from one to another
        - yes, that is a concern.

One thing that I'm not clear on. Why is working on every 1 MB chunk to cover every byte in the file faster than your previous method of reading every byte in the file? What exactly was your previous slower method?
"One thing that I'm not clear on. Why is working on every 1 MB chunk to cover every byte in the file faster than your previous method of reading every byte in the file? What exactly was your previous slower method?"

I am not sure... I am testing to understand why it is slower now.

The slower method is to open the file with a FileStream and pass the stream to System.Security.Cryptography.MD5CryptoServiceProvider.ComputeHash().

The way I can move through it quickly is to read 1MB at a time into a Byte() and then pass that Byte() to the same provider listed above. I suppose there is more going on under the covers that I don't understand that makes this so.

I think the idea of taking random samples is still the most attractive to me - I think I can test this to satisfy my concerns.
One useful metric is to determine how long it takes to read in an entire file ignoring the buffers without any processing. Do you want to do that and get back to us?

What is the average length of the files processed per hour?
How long is "too long"?

What is the longest time per 100GB that you can tolerate? If you can consistently live with this number, then you can experiment with the number of chunks and chunk size so that processing takes about as long as this tolerable time. You want the largest number of bytes checked that falls within this maximum time.

My own guess is that taking many small chunks rather than large big chunks may improve the likelihood that you have good coverage. But, you know your data better than anyone. If you do tests on data exactness (in the same locations in files of the same size), you may find that there are very few spans where the files are the same.

The larger your signature (by concatenation of sub-signatures), the less likely it is that two different files will have the same signature. The purpose of smaller signatures is usually related to transmission speed, which is not a concern here; neither is space. You could have a signature which is 64KB, and that may not be a problem for you.
Yes, I will report back... one more question.
Is there any fault in concatenating these hashes together and then taking an MD5 of that long string, just for the purpose of shortening it?

Thanks
I can sample 10% of an 80GB file's data across 100 reads (one required from the front and one from the end, with the rest spread evenly across the remaining data) in 12 seconds.

The only remaining question here is whether I can take the very long concatenation of 100 hashes and hash it... I don't see why not, but I am just checking to see if I am reducing something here that I shouldn't be.

Thanks
You can hash a concatenation; the shorter the hash, the greater the chance of collision. I'm not sure what the advantage is of hashing the concatenation vs. just hashing whatever the chunk stream is. (Leaving now - back tomorrow.)
How do you end up with duplicate files?

I am guessing it is not part of a backup scheme. If it were, then
 --  the files would be in the same folder structure in different disks
 --  identical files would have the same filesize and modification date
I just tried this command-line program from Microsoft. The default is MD5.
        http://support.microsoft.com/default.aspx?scid=kb;en-us;841290

Try timing this against your large files and compare results. If this is substantially faster, then great.

Notice this quote:
"The reason it should not be used as a security measure is that it is possible for two different files to have the same MD5 checksum."
    http://www.thefreecountry.com/utilities/free-md5-sum-tools.shtml

But prefacing the filesize (and other file attributes, e.g., date/time) to the MD5 result further reduces the collision risk. If you are going to compute a checksum yourself, then consider concatenating two or more partial checksum results (from partial file sections) to reduce collisions.
My apologies for the delay.. I have been on vacation.

The last idea is what I decided to go with... sampling chunks of the big files and prefixing the filesize to the comparison. This works quickly and entirely inside my code. Thank you again for your time and assistance as I worked through this.
As for how I ended up with multiple files, it's a much bigger story involving multiple copies of the same database, but suffice it to say this has helped me in my venture.
You are very welcome. If there are any further questions/issues that arise, do not hesitate to post them here.
Great answer