George Tokas asked:

Determine that 2 files are ABSOLUTELY the same.

This may look like an idiotic question from a person ranked first in the BCB area, but it is needed as proof.
Windows Operating System.
We have 2 files: same icon, same size reported, same interface, etc...
How can we be ABSOLUTELY sure that those files are the same?
Is there any programmatic method or ready-made project to test that?

Thanks in advance,
George Tokas.
ASKER CERTIFIED SOLUTION
Kent Olsen
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER

I'm sorry I forgot.
The reason for this question is:
We have an application's executable on a PC named A.
We have the same file (name, size reported, etc ) on a PC named B.
We run both instances and the interface and the look is the same.
HOW CAN WE BE SURE THAT the file on A is the same as the one on B?
Any "fingerprint"?

George Tokas.
Hi gtokas,

Are you controlling the executable?  If so, make sure that version numbers and build numbers are current.



Kent
BTW, just in case that was not clear - the hashes are identical if the files are identical, if not, the files are different.
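
For illustration, a minimal Python sketch of the hash comparison described above. The helper names `file_digest_hex` and `same_digest` are made up for this example, and the option to pick another algorithm such as SHA-256 is an addition that the thread does not mention.

```python
import hashlib


def file_digest_hex(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks so that
    large executables do not have to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def same_digest(path_a, path_b, algorithm="md5"):
    """True if both files produce the same digest.
    Different digests prove the files differ; equal digests are
    strong evidence of identity, not a mathematical proof."""
    return file_digest_hex(path_a, algorithm) == file_digest_hex(path_b, algorithm)
```

Because only the digest has to travel between PC A and PC B, this check can be run on each machine separately and the two hex strings compared by eye.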
>>"the hashes are identical if the files are identical"
Thanks jkr.
May I assume the above means the same size: not the reported one but the real one?
And also don't forget that the interface and the look of both files when run is the same....

George Tokas.
Well, if the files are not of the same size, you can skip the MD5 test, since then they aren't identical for sure. What do you mean by "real size" in that context?
>>What do you mean by "real size" in that context?
Size reported can be faked in PE header.

George Tokas.
Ah, I see. Well, if that size and the size on disk are different, you should assume that the file has been tampered with. Yet an MD5 check can give you more certainty about that.
I am sorry for any misunderstanding...
The files are application's executables...
One is located on a PC named A and the other one on a PC named B.
Windows operating system.
They LOOK like they are the same. The icon at least.
Maybe when (and if) executed they present the same result.
NOW how can we say that those files located on those two PCs are the same one WITHOUT ANY doubt?

George Tokas.
>>how can we say that those files located on those two PCs are the same one
>>WITHOUT ANY doubt.

Comparing the MD5 signatures will leave no doubt. See http://en.wikipedia.org/wiki/MD5
quote evilrix:
"Why can't you just diff them using a tool that supports binary diff?

http://winmerge.org/"

for your particular case, a binary diff will do it without any doubt. any hashing will leave doubts. but binary comparison will tell you 100% if those 2 FILES (not applications) are the same or not.

you can also use total commander if you are familiar with it: select the 2 files and do ctrl+f,y

IF you want a realtime comparison, without copying one of the files to the other computer, you can write a small program that does a binary diff via the network. it's really only a 1-2 hour job.

however, it might be easier to just copy one of the files on the other PC.
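
Once both files sit on the same machine, the byte-for-byte comparison mentioned above can be sketched in a few lines of Python. The function name `files_identical` is made up for this example; the standard library call `filecmp.cmp(path_a, path_b, shallow=False)` performs a comparable content check.

```python
import os


def files_identical(path_a, path_b, chunk_size=1 << 20):
    """True only if both files have the same length and the same bytes."""
    # Different sizes on disk: no need to read any content.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if a != b:
                return False  # first differing chunk settles it
            if not a:
                return True   # both files ended without a difference
```

Unlike a digest comparison, a `True` result here leaves no room for collisions, because every byte has been compared directly.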

I would like to state though (and I think somebody else did too) that you can have 2 different files but the same APPLICATION.

I take an executable and modify a few things (icon colors that the human eye will not notice, other resources, a few resource strings, etc.). basically, you have the "same" application but different files.
>> Comparing the MD5 signatures will leave no doubt

correction, comparing md5 signatures will leave MATHEMATICAL doubt. which stands in court :)
>>correction, comparing md5 signatures will leave MATHEMATICAL doubt.

I think you might want to prove that with a link.
Since MD5 is a digest surely by its very definition there must be mathematical doubt since there is the probability of collision?
>>I think you might want to prove that with a link

why? it's the math you learn in high school, for crying out loud.

md5sum is a function declared like:

md5: R -> [0..max] where max is a 128-bit value

the domain, R, is infinite. the codomain [0..max] is finite.

what link do you want? it's high school level math.
Ichijo

In a hash function that produces a fixed size output from an arbitrarily long message (such as MD5, which produces 128-bit hashes), "there will always be collisions, because any given hash has to correspond to an infinite number of possible inputs." http://en.wikipedia.org/wiki/Hash_collision
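
The counting argument can be written out explicitly. An MD5 digest is 128 bits long, so there are $2^{128}$ possible digests, while the number of distinct files of exactly $n$ bytes is $256^n = 2^{8n}$. The second formula is an idealization added here for illustration: it assumes the digest behaves like a uniformly random function, which says nothing about deliberately constructed collisions.

```latex
% Pigeonhole: more files than digests as soon as n >= 17 bytes,
% so at least two different files of that length share an MD5 digest.
\[
  \#\{\text{files of } n \text{ bytes}\} = 2^{8n}
  \;>\;
  2^{128} = \#\{\text{MD5 digests}\}
  \qquad \text{for every } n \ge 17 .
\]

% Idealized chance that k randomly chosen files contain a colliding pair
% (birthday approximation, valid while the result is much smaller than 1).
\[
  P(\text{collision among } k \text{ files})
  \;\approx\; \frac{k(k-1)}{2 \cdot 2^{128}},
  \qquad
  P(k = 2) \approx 2^{-128} \approx 2.9 \times 10^{-39} .
\]
```

So a collision is guaranteed to exist, yet for two files that were not crafted on purpose the chance of hitting one is negligible.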
Sorry for interrupt...
@jkr
Even though I refer to you as "the ultimate programmer entity" without any sense of a joke - because more than 75% of the solutions I found here that helped me solve problems were your posts - I have to say that ciuly has a point.

I'm glad there is another sick mind like mine here, always trying to find weak spots...:-)
And also sticking to his opinion....

George Tokas.
btw...
@jkr
Is there ANY area of the Windows API that you don't know??
I have had this question for more than 2 years.
Hi George,

I've learned never to doubt a few folks on this board.  jkr is certainly one of them.

A layman (common person) won't understand the mathematics involved in the MD5 checksum and will likely have 1 of 2 reactions -- "I don't understand it so I don't trust it" or "Somebody a lot smarter than me figured that out so it must be ok".  Kind of like DNA, but being based on math instead of something physical, people's eyes will likely glaze over.

That said, if your audience is cooperative, you can find all kinds of documentation on MD5 that describes its use and effectiveness without explaining the math involved.  Will supporting documentation that the MD5 checksum fails less than 1 time in a billion (much better even than DNA testing) suffice?


Kent
@Kdo
Considering same size and same appearance I agree...

GENERALLY, on the other hand, as I grew older and learned a few things on the way, I found myself doubting everything before I search for flaws....

First example:
1996. I buy a book on programming with Visual C++..
Inside there was the definition and explanation of the TCP/IP datagram.
FIRST TIME I SAW IT!!!
My eye drops to the flag describing the internet/intranet...
Found a way to modify it in TCP/IP packets...
In every connection to remote networks I was accepted as a local node and with a few more moves I had full access...
Do you remember when Microsoft made a patch for this?
Of course no hack or crack meant from my side...
Just helped a friend (the creator of Netgammon and now creator & owner of Gammonsite) secure his servers...

That is why I respect sick minds like mine...:-)

George Tokas.
>>what link do you want? it's high school level math.

A practical demonstration of two files of the same size that are different and produce the same MD5 hash. That would e.g. prove wrong all these stupid people out there that publish MD5 checksums to ensure the integrity of their files, including all major open source projects.

>>what link do you want? it's high school level math.

Yes, but until you've looked at the algorithm, you don't know WHAT math is involved.  

An MD5 checksum is not a simple XORing of all of the data.  It explodes a bit across several bits (using different math for each resultant bit), and collapses those exploded bits back using a different pattern.

It's pretty easy to see this.  Create a small test file with the letter 'B' somewhere in it.  Change the 'B' to a 'C'.  This sets one more bit in the file as the byte with the 'B' (0x42) becomes a 'C' (0x43).  The MD5 checksums of the two files are significantly different, even though the files vary by but a single bit.
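
The experiment above is easy to reproduce. A small Python sketch, in which the sample text and the helper `bit_difference` are made up for this example:

```python
import hashlib

# A tiny "test file" held in memory, containing exactly one 'B'.
original = b"A small test file with the letter B somewhere in it.\n"
modified = original.replace(b"B", b"C", 1)  # 0x42 -> 0x43


def bit_difference(x, y):
    """Count the differing bits between two equal-length byte strings."""
    return sum(bin(p ^ q).count("1") for p, q in zip(x, y))


input_bits_changed = bit_difference(original, modified)
digest_a = hashlib.md5(original).digest()
digest_b = hashlib.md5(modified).digest()
digest_bits_changed = bit_difference(digest_a, digest_b)

print("input bits changed: ", input_bits_changed)  # 1
print("digest bits changed:", digest_bits_changed, "of 128")
print(digest_a.hex())
print(digest_b.hex())
```

For a well-mixing hash, the number of changed digest bits is typically somewhere near half of the 128, which is why the two hex strings look unrelated.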


Can you create more than one file with the same MD5 checksum?  Of course.  But given a file with a particular MD5 checksum, can you create a second file with the same checksum AND specific characteristics?  Unlikely.  An awful lot of people are staying up late at night trying to do that very thing.


Kent
At http://www.cits.rub.de/MD5Collisions/ the letter of recommendation (letter_of_rec.ps) and order (order.ps) files are both exactly 2,029 bytes long and contain different content but both produce the MD5 hash a25f7f0b29ee0b3968c860738533a4b9.
Any files of a larger size? I mean 2029 bytes is not really the typical size for an executable. Don't get me wrong, I am aware that this is possible, but let me put it like that: I consider a mechanism that is the de-facto standard for identifying and marking files as safe enough for that purpose.
hello.exe and erase.exe are both 6,144 bytes: http://www.mscs.dal.ca/~selinger/md5collision/
PEOPLE. I am talking about a simple problem: a function with a finite range on an infinite domain. THAT is the math I am talking about. it doesn't matter what that function does; mathematically, it will return the same value for an infinite number of input values. you learn that in 11th grade I think. or 12th. irrelevant

you all keep thinking about what md5 does. it doesn't matter. all that matters is that given 2 random files, it CAN get the same hash. and that's all the judge will care about.

the number of possible files is infinite. the number of possible hashes is very much finite. that is the only math you need to understand. a good lawyer will break your case with reasonable doubt.

"Is there a possibility for 2 different files to produce the same hash?"
"yes".

and you're toast. it doesn't matter that in practice md5sum is acceptable because you take a file, you modify 1 BIT and the whole hash changes. that is irrelevant. what is relevant is that 2 different files CAN produce the same hash. and if you have an example (which you do), that is all the judge needs to dismiss your case. and that is what counts. and that is without any complicated math behind the md5 sum, which btw only proves that you cannot modify a file and get the same hash AND have the file still do something nice. that's why md5 sum is used for detecting data corruption.

the idea is that a hash function cannot **guarantee** that 2 files are identical if the hash is the same, but if the hash is different, you can bet your life the files are different.
and our asker needs something guaranteed 100%. mathematically proven, no hash function can do that.

>> all that matters is that given 2 random files, it CAN get the same hash.

That's not what matters.  Given file 1, with particular characteristics, modify it so that the new file has a defined set of characteristics AND the same checksum.


THAT is the challenge.

>>hello.exe and erase.exe are both 6,144 bytes

Um, a typical Win32 executable linked to a CRT will usually start at ~40KB

>>it doesn't matter that in practice, md5sum is acceptable

Yes, all the people out there using it are wrong and you aren't. Point taken.
>>THAT is the challenge.

Plus, that file still has to be valid machine code ;o)

That's the first characteristic.  :)
The second is that the change 'does nothing' so that the program behavior isn't altered.

It's considerably tougher than ciuly argues.  

whatever. we're probably reading different questions. you're right, md5 can guarantee 100% that 2 files are the same if the hash is the same.

happy? there. I'm out of this question. unsubscribed. go learn some MATH.
Even though I feel GRATEFUL to all of you for your participation here, along with my respect to ALL of you as developers and fine minds, may I propose a way that MAYBE is valid based on your posts.
First the scenario:
We have an application's executable file "file.exe" located somewhere in the storage area of a PC named PC1.
On another PC (PC2), in the same or another location in the storage area, we found a file with the same name, "file.exe".
We want to be ABSOLUTELY CERTAIN that the file is the same.
The size is greater than 300KB.
If we try to execute the file, MAYBE it presents something, maybe not. If it presents something on screen, most probably it presents the same thing.


1. Checking the icon. It is the same, so we go to the next step.
2. Checking the REAL size of the file. If it is NOT the same, we start having doubts about it.
3. Testing the MD5 hash of both files. Let's say the same hash was the result.
4. After all those tests, checking whether the binary content is more than 95% the same. If more than 98% is the same in this test outside the PE header of the file, then we may say that we are ALMOST absolutely certain that it is the same or a prior or later version.
 A 100% match outside the PE header leads us to conclude that IT IS the same file.
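
The percentage test in step 4 could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the names `matching_fraction` and `compare_files` are made up, the whole file is compared (skipping the PE header is not implemented), and bytes are matched offset by offset, so a later version with even one inserted byte may score far lower than the 95% or 98% thresholds suggest.

```python
def matching_fraction(data_a, data_b):
    """Fraction of offsets at which two byte strings hold the same byte.
    The longer input is the denominator, so surplus bytes count as mismatches."""
    longest = max(len(data_a), len(data_b))
    if longest == 0:
        return 1.0
    same = sum(1 for x, y in zip(data_a, data_b) if x == y)
    return same / longest


def compare_files(path_a, path_b):
    """Run the size, identity and similarity checks on two files."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    return {
        "same_size": len(a) == len(b),
        "identical": a == b,  # the only conclusive result
        "matching_fraction": matching_fraction(a, b),
    }
```

Only `identical == True` settles the question; any fraction below 1.0 means the files differ, however close they look.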

 What do you say?
 Any other thoughts or additions?

 George Tokas.

Hi George,

The step 4 premise needs some adjusting.  If the test passes step 3 (md5 hashes equal and file length equal) and the executable runs, and it produces the same known effect and output, the likelihood of the files being identical is billions to 1.

Modifying a single bit (forget about the checksum) in an executable often has disastrous results.  


Kent
Yup, agree.
THANKS...
Don't forget this is a REAL LIFE FORENSICS scenario...
ciuly and Ichijo, may I have your opinion as well, PLEASE?

George Tokas.
The "billions to 1" figure assumes that a knowledgeable hacker didn't intentionally alter one of the files to produce an MD5 identical to the other. (But I have no idea how often this happens in real life.)

"Modifying a single bit (forget about the checksum) in an executable often has disastrous results."

You could find an unused or rarely used text string or other resource in the executable and modify it without any discernible difference. But again, you'd have to know what you're doing, and you can't really do it by accident.
@Ichijo
Don't forget the conditions I added my friend.
A binary match of more than 98% outside the PE header.
In C++ just an addition of 2 integers needs more than 10 bytes of machine code...
Not to mention pushing and popping registers...

The 2% of difference may mean that it is a prior or later version...

George Tokas.
One addition for your information, now that we have covered most of it.
Here in Greece, as in most countries, we print invoices using software on a PC.
When we do it this way, a special piece of hardware has to validate the document.
In my case - meaning my company - it is a machine that communicates with the PC using the serial port and:
1. Calculates the SHA1 hash of the document (invoice) that is about to be sent to the printer.
2. Returns the hash and the identification of the machine, via the serial port, to be added to the document when it is printed.
3. Prints on paper the transaction date along with the SHA1 signature.
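
As a rough software analogue of those three steps, here is a Python sketch. Everything specific in it is hypothetical: the function name `invoice_stamp`, the device identifier, and the layout of the printed line are assumptions made for illustration, since the thread does not describe the real device's format or serial protocol.

```python
import hashlib
from datetime import datetime, timezone


def invoice_stamp(document_bytes, device_id, when=None):
    """Build a one-line stamp: transaction date, device id, SHA-1 of the document.
    The layout of this line is invented for the example."""
    when = when or datetime.now(timezone.utc)
    digest = hashlib.sha1(document_bytes).hexdigest()
    return f"{when:%Y-%m-%d %H:%M:%S} {device_id} {digest}"


# Usage: stamp the exact bytes that are about to be sent to the printer.
print(invoice_stamp(b"Invoice 0001: 3 items, total 120.00", "DEV-0001"))
```

Anyone holding the same document bytes can recompute the SHA-1 value and compare it with the printed one.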

This is considered VALID verification for invoices here.
The conclusions are yours.

George Tokas.
Hello there again,
Sorry I'm not closing the Q yet.
I would like your opinion on my latest post....
By the way, most of the story is uploaded on http://www.gtokas.com and in English...
For the poor English blame (except me of course) babelfish...

The MOST important parts are NOT uploaded yet...
In the next parts you will see facts you will not believe...

George Tokas.
Time to close this Q...
Thanks everybody for the support.
A VERY SPECIAL THANKS:
1. To ciuly. He left his usual topics to give support and to POINT OUT that ALL THINGS have flaws regardless of how failsafe they look. It was a nice addition and will help me a lot.
2. To Ichijo. Usually when VERY highly ranked experts state an opinion, the lower rated ones accept it by default. THANK YOU for sticking to your opinion. Along with ciuly you will be a GREAT help for me.
3. To evilrix. Same as above. You also have my gratitude.

For Kdo & jkr:
 A simple thanks (or very special) will not be enough.
 You know where to reach me if you need ANYTHING.
 The same applies to the rest.

Best Regards,
George Tokas.

P.S. Stay tuned... On Friday we will have news.

Good luck to you George. Thanks.