Solved

Determine that 2 files are ABSOLUTELY the same.

Posted on 2007-11-08
44
Medium Priority
691 Views
Last Modified: 2013-11-17
This may look like an idiotic question from a person ranked first in the BCB area, but it is needed as proof.
Windows Operating System.
We have 2 files, same icon, same size reported, same interface etc...
How can we be ABSOLUTELY sure that those files are the same?
Is there any programmatic method or ready-made project to test that?

Thanks in advance,
George Tokas.
0
Comment
Question by:George Tokas
44 Comments
 
LVL 46

Accepted Solution

by:
Kent Olsen earned 400 total points
ID: 20245401
Hi gtokas,

Interesting question.  :)

Nothing in the file system or structure indicates a file's source.  That is, even if FILEB is created via 'COPY FILEA FILEB', the O/S won't record that, at its creation, FILEB is an exact copy of FILEA.

So given that the O/S never knows that two files are identical, it follows that the O/S can't know at random times, either.

It would seem that the only way to assert that two files are identical is to open them and read them, comparing every byte in the files.  Of course, one should also save and restore the "Last Access" timestamps for the files as comparing them for equality is an artificial access that most folks won't care about.
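
A minimal sketch of that byte-for-byte comparison, in Python. The function name `files_identical` and its parameters are made up for illustration, and restoring the access time through `os.utime` is only one possible way to undo the "artificial access" (whether the O/S updates that timestamp at all depends on its configuration):

```python
import os

def files_identical(path_a, path_b, chunk_size=1 << 16, restore_atime=False):
    """Return True only if both files contain exactly the same bytes."""
    # Remember the timestamps before reading, so they can be put back later.
    stats = [os.stat(p) for p in (path_a, path_b)]
    try:
        if stats[0].st_size != stats[1].st_size:
            return False  # different lengths can never be identical
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            while True:
                block_a = fa.read(chunk_size)
                block_b = fb.read(chunk_size)
                if block_a != block_b:
                    return False  # first differing chunk settles it
                if not block_a:
                    return True   # both files ended without a difference
    finally:
        if restore_atime:
            # Write back the original access and modification times.
            for path, st in zip((path_a, path_b), stats):
                os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))
```

Called as `files_identical("fileA.exe", "fileB.exe", restore_atime=True)`, it stops at the first differing chunk, so two large files that differ early are rejected quickly.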


Good Luck,
Kent
0
 
LVL 16

Author Comment

by:George Tokas
ID: 20245474
I'm sorry I forgot.
The reason for this question is:
We have an application's executable on a PC named A.
We have the same file (name, size reported, etc ) on a PC named B.
We run both instances and the interface and the look is the same.
HOW CAN WE BE SURE THAT the file on A is the same as the one on B?
Any "fingerprint"?

George Tokas.
0
 
LVL 46

Expert Comment

by:Kent Olsen
ID: 20245529
Hi gtokas,

Are you controlling the executable?  If so, make sure that version numbers and build numbers are current.



Kent
0

 
LVL 86

Assisted Solution

by:jkr
jkr earned 400 total points
ID: 20245544
>>We have 2 files, same icon, same size reported, same interface etc...
>>How can we be ABSOLUTELY sure that those files are the same?

Create an MD5 hash for each file and compare them. You will find a complete description and source code for the closely related SHA-1 hash at http://www.ietf.org/rfc/rfc3174.txt (MD5 itself is specified in RFC 1321), or a ready-made tool at http://unxutils.sourceforge.net/ or http://www.pc-tools.net/win32/md5sums/
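
For a programmatic version, here is a small sketch using Python's standard `hashlib` module. The helper name `md5_fingerprint` is hypothetical; it returns the real on-disk length together with the digest so that both can be compared between PC A and PC B:

```python
import hashlib
import os

def md5_fingerprint(path, chunk_size=1 << 16):
    """Return (size_in_bytes, md5_hex_digest) for the file at `path`."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Feed the file to the hash in chunks so large files need little memory.
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return os.path.getsize(path), digest.hexdigest()
```

Run it on each machine and compare the two tuples. Different tuples mean the files differ; equal tuples mean the files are the same unless a hash collision is involved.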
0
 
LVL 86

Expert Comment

by:jkr
ID: 20245549
BTW, just in case that was not clear - the hashes are identical if the files are identical, if not, the files are different.
0
 
LVL 4

Assisted Solution

by:Ichijo
Ichijo earned 400 total points
ID: 20245599
"the hashes are identical if the files are identical"

Or if an MD5 collision occurs. But that doesn't happen very often outside of hacking.
0
 
LVL 40

Assisted Solution

by:evilrix
evilrix earned 400 total points
ID: 20245625
Why can't you just diff them using a tool that supports binary diff?

http://winmerge.org/
0
 
LVL 16

Author Comment

by:George Tokas
ID: 20245641
>>"the hashes are identical if the files are identical"
Thanks jkr.
May I assume the above means the same size, not the reported one but the real one?
And also don't forget that the interface and the look of both files when run is the same....

George Tokas.
0
 
LVL 86

Expert Comment

by:jkr
ID: 20245678
Well, if the files are not of the same size, you can skip the MD5 test, since then they aren't identical for sure. What do you mean by "real size" in that context?
0
 
LVL 16

Author Comment

by:George Tokas
ID: 20245741
>>What do you mean by "real size" in that context?
Size reported can be faked in PE header.

George Tokas.
0
 
LVL 86

Expert Comment

by:jkr
ID: 20245766
Ah, I see. Well, if that size and the size on disk are different, you should assume that the file has been tampered with. Yet an MD5 check can give you more certainty about that.
0
 
LVL 28

Assisted Solution

by:2266180
2266180 earned 400 total points
ID: 20251298
depends what you understand by "absolutely the same". and what you understand by "file"

some people would consider that a hash will do it. but a hash has a fixed-size value domain (128 bits for md5, or whatever the size of the hash is), and a file can take any size. which means that there can exist 2 different files with the same hash.

and then we add the "same size". is this enough? math demonstrates, again, that this is not enough.

take for example 2 different files (by content) with the same size of 2 GB.  this means that you can have exactly 256^(2*1024*1024*1024) possible contents, which is a very big number. a hash, again, will only have at most 256^16 possible values for md5 (or 256^n for whatever number of bytes n a hash has).

which means that there can be 2 different files with the same size and the same hash.

and this can go on forever. no matter what restrictions you put in place, the number of possible files will always be bigger than the number of possible hashes.

so that mathematically cuts the hash out :)
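
The counting argument above can be written out as a short worked calculation. MD5's 128-bit (16-byte) output is taken here as the concrete case, which is an assumption since the post leaves the hash size open:

```latex
% Number of distinct files of exactly n bytes, and number of MD5 digests
N_{\text{files}}(n) = 256^{n} = 2^{8n}, \qquad N_{\text{hashes}} = 2^{128}

% For n = 2 \cdot 1024^{3} = 2^{31} bytes (2 GB):
N_{\text{files}}(2^{31}) = 2^{8 \cdot 2^{31}} = 2^{2^{34}} \gg 2^{128}

% Pigeonhole: at least one digest is shared by this many files of that size
\left\lceil \frac{N_{\text{files}}}{N_{\text{hashes}}} \right\rceil = 2^{\,2^{34} - 128}
```

The same reasoning applies to every fixed size above 16 bytes, since 2^(8n) already exceeds 2^128 once n is 17.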

if you consider that a file is represented by
- name
- content
then 2 files are absolutely the same if they have the same name and the same content (which implies the same size)
if you add to the file definition a time stamp or any other stuff, then those need to be the same as well.

so, if you say that a file is defined by:
- name
- content
- creation datetime

and you copy a file via copy c:\dir1\filea c:\dir2\filea
then the 2 files are NOT the same, even though they have the same name and same content, since the creation datetime differs.

so, the right answer is: "it depends" :)
0
 
LVL 16

Author Comment

by:George Tokas
ID: 20251415
I am sorry for any misunderstanding...
The files are application's executables...
One is located on a PC named A and the other one on a PC named B.
Windows operating system.
LOOKS like they are the same. Icon at least.
Maybe when (and if) executed present the same result.
NOW how can we say that those files located on those two PC's is the same one WITHOUT ANY doubt.

George Tokas.
0
 
LVL 86

Expert Comment

by:jkr
ID: 20251475
>>how can we say that those files located on those two PC's is the same one
>>WITHOUT ANY doubt.

Comparing the MD5 signatures will leave no doubt. See http://en.wikipedia.org/wiki/MD5
0
 
LVL 28

Expert Comment

by:2266180
ID: 20251512
quote evilrix:
"Why can't you just diff them using a tool that supports binary diff?

http://winmerge.org/"

for your particular case, a binary diff will do it without any doubt. any hashing will leave doubts. but binary comparison will tell you 100% if those 2 FILES (not applications) are the same or not.

you can also use total commander if you are familiar with it, select the 2 files and do ctrl+f,y

IF you want a realtime comparison, without copying one of the files to the other computer, you can write a small program that does binary diff via network. it's really only a 1-2 hour job.

however, it might be easier to just copy one of the files to the other PC.

I would like to state though (and I think somebody else did too) that you can have 2 different files but the same APPLICATION.

I take an executable, I modify a few things (icon colors (human eye will not notice), other resources (change a few resource strings, etc), blablabla). basically, I have the "same" application but different files.
0
 
LVL 28

Expert Comment

by:2266180
ID: 20251527
>> Comparing the MD5 signatures will leave no doubt

correction, comparing md5 signatures will leave MATHEMATICAL doubt. which stands in court :)
0
 
LVL 86

Expert Comment

by:jkr
ID: 20251546
>>correction, comparing md5 signatures will leave MATHEMATICAL doubt.

I think you might want to prove that with a link.
0
 
LVL 40

Expert Comment

by:evilrix
ID: 20251835
Since MD5 is a digest surely by its very definition there must be mathematical doubt since there is the probability of collision?
0
 
LVL 28

Expert Comment

by:2266180
ID: 20251902
>>I think you might want to prove that with a link

why? it's the math you learn in highschool for crying out loud.

md5sum is a function declared like:

md5:R->[0..max] where max is a 128 bit value

the domain, R is infinite. the codomain [0..max] is finite

what link do you want? it's highschool level math.
0
 
LVL 4

Expert Comment

by:Ichijo
ID: 20251967
In a hash function that produces a fixed size output from an arbitrarily long message (such as MD5, which produces 128-bit hashes, usually written as 32 hexadecimal digits), "there will always be collisions, because any given hash has to correspond to an infinite number of possible inputs." http://en.wikipedia.org/wiki/Hash_collision
0
 
LVL 16

Author Comment

by:George Tokas
ID: 20252030
Sorry for interrupt...
@jkr
Even though when I'm referencing to you as "the ultimate programmer entity" without any sense of a joke - because more than 75% of solutions I found here and helped me solve problems were your posts - I have to say that ciuly has a point.

I'm glad there is another sick mind like mine here, always trying to find weak spots...:-)
And also sticking to his opinion....

George Tokas.
btw...
@jkr
Is there ANY sector in windows API that you don't know??
I have had this question for more than 2 years.
0
 
LVL 46

Expert Comment

by:Kent Olsen
ID: 20252137
Hi George,

I've learned never to doubt a few folks on this board.  jkr is certainly one of them.

A layman (common person) won't understand the mathematics involved in the MD5 checksum and will likely have 1 of 2 reactions -- "I don't understand it so I don't trust it" or "Somebody a lot smarter than me figured that out so it must be ok".  Kind of like DNA, but being based on math instead of something physical, people's eyes will likely glaze over.

That said, if your audience is cooperative, you can find all kinds of documentation on MD5 that describes its use and effectiveness without explaining the math involved.  Will supporting documentation showing that the MD5 checksum fails less than 1 time in a billion (much better even than DNA testing) suffice?


Kent
0
 
LVL 16

Author Comment

by:George Tokas
ID: 20252296
@Kdo
Considering same size and same appearance I agree...

GENERALLY, on the other hand, having grown older and learned a few things on the way, I find myself doubting everything before I search for flaws....

First example:
1996. I buy a book on programming with Visual C++..
Inside there was the definition and explanation of the TCP/IP datagram.
FIRST TIME I SAW IT!!!
My eye drops to the flag describing the internet/intranet...
Found a way to modify it at TCP/IP packets...
In every connection to remote networks I was accepted as a local node and with a few more moves I had full access...
Do you remember when Microsoft made a patch for this?
Of course no hack or crack meant from my side...
Just helped a friend ( the creator of Netgammon and now creator & owner of Gammonsite ) secure his servers...

That is why I respect sick minds like mine...:-)

George Tokas.
0
 
LVL 86

Expert Comment

by:jkr
ID: 20252332
>>what link do you want? it's highschool level math.

A practical demonstration of two files of the same size that are different and produce the same MD5 hash.. That would e.g. prove wrong all these stupid people out there that publish MD5 checksums to ensure the integrity of their files, including all major OpenSource projects.
0
 
LVL 46

Expert Comment

by:Kent Olsen
ID: 20252399

>>what link do you want? it's highschool level math.

Yes, but until you've looked at the algorithm, you don't know WHAT math is involved.  

An MD5 checksum is not a simple XORing of all of the data.  It explodes a bit across several bits (using different math for each resultant bit), and collapses those exploded bits back using a different pattern.

It's pretty easy to see this.  Create a small test file with the letter 'B' somewhere in it.  Change the 'B' to a 'C'.  This sets one more bit in the file as the byte with the 'B' (0x42) becomes a 'C' (0x43).  The MD5 checksums of the two files are significantly different, even though the files vary by but a single bit.
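
The experiment is easy to try in a few lines of Python. This is only an illustrative sketch: the sample text and the helper name `differing_bits` are invented here, and in-memory byte strings stand in for the two test files:

```python
import hashlib

def differing_bits(data_a, data_b):
    """Count how many of the 128 MD5 output bits differ between two inputs."""
    a = int.from_bytes(hashlib.md5(data_a).digest(), "big")
    b = int.from_bytes(hashlib.md5(data_b).digest(), "big")
    return bin(a ^ b).count("1")

# Two inputs that differ in a single bit: 'B' (0x42) versus 'C' (0x43).
original = b"A small test file with the letter B somewhere in it.\n"
modified = original.replace(b"letter B", b"letter C")

print(hashlib.md5(original).hexdigest())
print(hashlib.md5(modified).hexdigest())
print(differing_bits(original, modified), "of 128 digest bits differ")
```

The two printed digests should look unrelated, with the count of differing bits typically somewhere near half of the 128.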


Can you create more than one file with the same MD5 checksum?  Of course.  But given a file with a particular MD5 checksum, can you create a second file with the same checksum AND specific characteristics?  Unlikely.  An awful lot of people are staying up late at night trying to do that very thing.


Kent
0
 
LVL 4

Expert Comment

by:Ichijo
ID: 20252496
At http://www.cits.rub.de/MD5Collisions/ the letter of recommendation (letter_of_rec.ps) and order (order.ps) files are both exactly 2,029 bytes long and contain different content but both produce the MD5 hash a25f7f0b29ee0b3968c860738533a4b9.
0
 
LVL 86

Expert Comment

by:jkr
ID: 20252537
Any files of a larger size? I mean 2029 bytes is not really the typical size for an executable. Don't get me wrong, I am aware that this is possible, but let me put it like that: I consider a mechanism that is the de-facto standard for identifying and marking files as safe enough for that purpose.
0
 
LVL 4

Expert Comment

by:Ichijo
ID: 20252922
hello.exe and erase.exe are both 6,144 bytes: http://www.mscs.dal.ca/~selinger/md5collision/
0
 
LVL 28

Expert Comment

by:2266180
ID: 20252943
PEOPLE. I am talking about a simple problem: a function with a finite codomain on an infinite domain. THAT is the math I am talking about. it doesn't matter what that function does; mathematically, it will return the same value for an infinite number of input values. you learn that in 11th grade I think. or 12th. irrelevant

you all keep thinking about what md5 does. it doesn't matter. all that matters is that given 2 random files, it CAN get the same hash. and that's all the judge will care about.

the number of possible files is infinite. the number of possible hashes is very much finite. that is the only math you need to understand. a good lawyer will break your case with reasonable doubt.

"Is there a possibility for 2 different files to produce the same hash?"
"yes".

and you're toast. it doesn't matter that in practice, md5sum is acceptable, because you take a file, you modify 1 BIT and the whole hash changes. that is irrelevant. what is relevant is that 2 different files CAN produce the same hash. and if you have an example (which you do) that is all the judge needs to dismiss your case. and that is what counts. and that is without any of the complicated math behind the md5 sum, which btw, only proves that you cannot modify a file and get the same hash AND have the file still do something nice. that's why md5 sum is used for data corruption prevention.

the idea is that a hash function cannot **guarantee** that 2 files are identical if the hash is the same, but if the hash is different, you can bet your life the files are different.
and our asker needs something guaranteeable 100%. mathematically proven, no hash function can do that.
0
 
LVL 46

Expert Comment

by:Kent Olsen
ID: 20252995

>> all it matters is that given 2 random files, it CAN get the same hash.

That's not what matters.  Given file 1, with particular characteristics, modify it so that the new file has a defined set of characteristics AND the same checksum.


THAT is the challenge.

0
 
LVL 86

Expert Comment

by:jkr
ID: 20253022
>>hello.exe and erase.exe are both 6,144 bytes

Um, a typical Win32 executable linked to a CRT will usually start at ~40KB

>>it doesn't matter that in practice, md5sum is acceptable

Yes, all the people out there using it are wrong and you aren't. Point taken.
0
 
LVL 86

Expert Comment

by:jkr
ID: 20253030
>>THAT is the challenge.

Plus, that file still has to be valid machine code ;o)
0
 
LVL 46

Expert Comment

by:Kent Olsen
ID: 20253085

That's the first characteristic.  :)
The second is that the change 'does nothing' so that the program behavior isn't altered.

It's considerably tougher than ciuly argues.  

0
 
LVL 28

Expert Comment

by:2266180
ID: 20253241
whatever. we're probably reading different questions. you're right, md5 can guarantee 100% that 2 files are the same if the hash is the same.

happy? there. I'm out of this question. unsubscribed. go learn some MATH.
0
 
LVL 16

Author Comment

by:George Tokas
ID: 20253243
Even though I feel GRATEFUL to all of you for your participation here, along with my respect to ALL of you as developers and fine minds, may I propose a way that MAYBE is valid, based on your posts.
First the scenario:
We have an application's executable file "file.exe" located somewhere in the storage area of a PC named PC1.
On another PC (PC2), in the same or another location in the storage area, we found a file with the same name, "file.exe".
We want to be ABSOLUTELY CERTAIN that the file is the same.
The size is greater than 300KB.
If we try to execute the file, MAYBE it presents something, maybe not. If it presents something on screen, most probably it presents the same.


1. Checking the icon. It is the same, so we go to the next step.
2. Checking the REAL size of the file. If it is NOT the same, we start having doubts about it.
3. Testing the MD5 hash of both files. Let's say the same hash was the result.
4. After all that, testing whether the binary content is more than 95% the same. If more than 98% is the same in this test outside the PE header of the file, then we may say that we are ALMOST absolutely certain that it is the same or a prior or later version.
 A 100% match outside the PE header leads us to conclude that IT IS the same file.

 What do you say?
 Any other thoughts or additions?

 George Tokas.
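
Steps 2 to 4 of this proposal could be sketched roughly as follows. Everything named here is hypothetical: `compare_report` is an invented helper, and its `skip` parameter merely stands in for "the length of the PE header", which this sketch does not parse. Bytes are compared position by position, so a single inserted byte in a later version shifts everything after it and drags the percentage down:

```python
import hashlib

def compare_report(path_a, path_b, skip=0):
    """Size check, MD5 check, then the share of identical bytes after `skip`."""
    # Reads both files fully into memory; acceptable for executables of a few MB.
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        data_a, data_b = fa.read(), fb.read()
    same_size = len(data_a) == len(data_b)
    same_md5 = hashlib.md5(data_a).digest() == hashlib.md5(data_b).digest()
    # Ignore the first `skip` bytes (assumed header length, supplied by the caller).
    body_a, body_b = data_a[skip:], data_b[skip:]
    longest = max(len(body_a), len(body_b))
    matching = sum(x == y for x, y in zip(body_a, body_b))
    percent = 100.0 if longest == 0 else 100.0 * matching / longest
    return {"same_size": same_size, "same_md5": same_md5,
            "percent_identical": percent}
```

A result of `percent_identical == 100.0` with `skip=0` and equal sizes is the byte-for-byte match discussed earlier in the thread; anything lower shows only that the files differ, not why.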
0
 
LVL 46

Expert Comment

by:Kent Olsen
ID: 20253338

Hi George,

The step 4 premise needs some adjusting.  If the test passes step 3 (md5 hashes equal and file length equal) and the executable runs, and it produces the same known effect and output, the likelihood of the files being identical is billions to 1.

Modifying a single bit (forget about the checksum) in an executable often has disastrous results.  


Kent
0
 
LVL 86

Expert Comment

by:jkr
ID: 20253357
Yup, agree.
0
 
LVL 16

Author Comment

by:George Tokas
ID: 20253403
THANKS...
Don't forget this is a REAL LIFE FORENSICS scenario...
ciuly and Ichijo, may I have your opinion too, PLEASE?

George Tokas.
0
 
LVL 4

Expert Comment

by:Ichijo
ID: 20253708
The "billions to 1" figure assumes that a knowledgeable hacker didn't intentionally alter one of the files to produce an MD5 identical to the other. (But I have no idea how often this happens in real life.)

"Modifying a single bit (forget about the checksum) in an executable often has disastrous results."

You could find an unused or rarely used text string or other resource in the executable and modify it without any discernible difference. But again, you'd have to know what you're doing, and you can't really do it by accident.
0
 
LVL 16

Author Comment

by:George Tokas
ID: 20253745
@Ichijo
Don't forget the conditions I added my friend.
Binary compatibility of more than 98% out of the PE Header.
In C++ just an addition of 2 integers needs more than 10 bytes of machine code...
Not to mention pushing and popping registers...

The 2% of difference may mean that it is a prior or later version...

George Tokas.
0
 
LVL 16

Author Comment

by:George Tokas
ID: 20253811
One addition for your information, now that we have covered most of it.
Here in Greece, as in most countries, we print invoices using software on a PC.
When we do it this way, a special piece of hardware has to validate the document.
In my case - meaning my company - it is a machine that communicates with the PC using the serial port and:
1. Calculates the SHA1 hash of the document (invoice) that is about to be sent to the printer.
2. Returns the hash and the identification of the machine, via the serial port, to be added to the document when it is printed.
3. Prints on paper the transaction date along with the SHA1 signature.

This is considered VALID verification for invoices here.
The conclusions are yours.

George Tokas.
0
 
LVL 16

Author Comment

by:George Tokas
ID: 20264768
Hello there again,
Sorry I'm not closing the Q yet.
I would like your opinion on my latest post....
By the way most part of the story is uploaded on http://www.gtokas.com and in English...
For the poor English blame (except me of course) babelfish...

The MOST important parts are NOT uploaded yet...
In the next parts you will see facts that will make you not believe your eyes...

George Tokas.
0
 
LVL 16

Author Comment

by:George Tokas
ID: 20283698
Time to close this Q...
Thanks everybody for the support.
A VERY SPECIAL THANKS:
1. To ciuly. He left his usual topics to support and DEMONSTRATED that ALL THINGS have flaws regardless of how failsafe they look. It was a nice addition and will help me a lot.
2. To Ichijo. Usually when MUCH higher ranked experts state an opinion, the lower rated accept it by default. THANK YOU for sticking to your opinion. Along with ciuly's, it will be a GREAT help for me.
3. To evilrix. Same as above. You also have my gratitude.

For Kdo & jkr:
 A simple thanks (or a very special one) will not be enough.
 You know where to reach me if you need ANYTHING.
 The same applies to the rest.

Best Regards,
George Tokas.

P.S. Stay tuned... On Friday we will have news.

0
 
LVL 40

Expert Comment

by:evilrix
ID: 20287504
Good luck to you George. Thanks.
0


Question has a verified solution.
