Determine that 2 files are ABSOLUTELY the same.

This may look like an idiotic question from a person ranked first in the BCB area, but it is needed as proof.
Windows operating system.
We have 2 files, same icon, same size reported, same interface etc...
How can we be ABSOLUTELY sure that those files are the same?
Is there any programmatic method or ready-made project to test that?

Thanks in advance,
George Tokas.
George TokasAsked:
Kent OlsenData Warehouse Architect / DBACommented:
Hi gtokas,

Interesting question.  :)

Nothing in the file system or structure indicates a file's source.  That is, even if FILEB is created via 'COPY FILEA FILEB', the O/S won't know that, at its creation, FILEB is an exact copy of FILEA.

So given that the O/S never knows that two files are identical, it follows that the O/S can't know at random times, either.

It would seem that the only way to assert that two files are identical is to open them and read them, comparing every byte in the files.  Of course, one should also save and restore the "Last Access" timestamps for the files as comparing them for equality is an artificial access that most folks won't care about.
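
A minimal sketch of such a byte-for-byte comparison, in Python for brevity. The function name and chunk size are illustrative choices, not something from this thread; it assumes both files can be opened from the same machine.

```python
import os

def files_identical(path_a, path_b, chunk_size=1 << 16):
    """Return True only if both files have the same length and the same bytes."""
    # Files of different length can never be identical, so skip the read.
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(chunk_size)
            block_b = fb.read(chunk_size)
            if block_a != block_b:
                return False
            if not block_a:  # both files reached end-of-file together
                return True
```

Python's standard library offers the same check as `filecmp.cmp(path_a, path_b, shallow=False)`. Saving and restoring the "Last Access" timestamps is not shown here; it could probably be done with `os.stat` before the read and `os.utime` after it.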


Good Luck,
Kent
0

George TokasAuthor Commented:
I'm sorry I forgot.
The reason for this question is:
We have an application's executable on a PC named A.
We have the same file (name, size reported, etc.) on a PC named B.
We run both instances and the interface and the look are the same.
HOW CAN WE BE SURE THAT the file on A is the same as the one on B?
Any "fingerprint"?

George Tokas.
0
Kent OlsenData Warehouse Architect / DBACommented:
Hi gtokas,

Are you controlling the executable?  If so, make sure that version numbers and build numbers are current.



Kent
0
jkrCommented:
>>We have 2 files, same icon, same size reported, same interface etc...
>>How can we be ABSOLUTELY sure that those files are the same?

Create an MD5 hash for each file and compare them. You will find the complete description and source code of MD5 in RFC 1321 at http://www.ietf.org/rfc/rfc1321.txt (http://www.ietf.org/rfc/rfc3174.txt covers SHA-1, which can be used the same way), or a ready-made tool at http://unxutils.sourceforge.net/ or http://www.pc-tools.net/win32/md5sums/
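
For illustration, a short sketch of computing such a hash with Python's standard `hashlib` module. The function name `file_md5` is just a label chosen here; the file is read in chunks so that large executables need not fit in memory.

```python
import hashlib

def file_md5(path, chunk_size=1 << 16):
    """Return the MD5 digest of a file as a 32-character hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Run it on each PC and compare the two strings; `file_md5(path_on_a) == file_md5(path_on_b)` is the test being described.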
0
jkrCommented:
BTW, just in case that was not clear - the hashes are identical if the files are identical; if the hashes differ, the files are different.
0
IchijoCommented:
"the hashes are identical if the files are identical"

Or if an MD5 collision occurs. But that doesn't happen very often outside of hacking.
0
evilrixSenior Software Engineer (Avast)Commented:
Why can't you just diff them using a tool that supports binary diff?

http://winmerge.org/
0
George TokasAuthor Commented:
>>"the hashes are identical if the files are identical"
Thanks jkr.
May I assume the above means the same real size, not just the reported one?
Also, don't forget that the interface and the look of both files when run are the same....

George Tokas.
0
jkrCommented:
Well, if the files are not of the same size, you can skip the MD5 test, since then they aren't identical for sure. What do you mean by "real size" in that context?
0
George TokasAuthor Commented:
>>What do you mean by "real size" in that context?
The reported size can be faked in the PE header.

George Tokas.
0
jkrCommented:
Ah, I see. Well, if that size and the size on disk are different, you should assume that the file has been tampered with. Yet an MD5 check can give you more certainty about that.
0
2266180Commented:
depends what you understand by "absolutely the same". and what you understand by "file"

some people would consider that a hash will do it. but a hash has a value domain of fixed size (128 bits for md5, 160 bits for sha1, or whatever the size of the hash is), and a file can take any size. which means that there can exist 2 different files with the same hash.

and then we add the "same size". is this enough? math demonstrates, again, that this is not enough.

take for example 2 different files (by content) with the same size of 2 GB.  this means that there are exactly 256^(2*1024*1024*1024) possible files, which is a very big number. a hash again can take at most 256^16 values for md5 (or 256^n for whatever number of bytes n a hash has).

which means that there can be 2 different files with the same size and the same hash.

and this can go on forever. no matter what restrictions you put in place, the number of possible files will always be bigger than the number of possible hashes.

so that mathematically cuts the hash out :)

if you consider that a file is represented by
- name
- content
then 2 files are absolutely the same if they have the same name and the same content (which implies the same size)
if you add to the file definition a time stamp or any other stuff, then those need to be the same as well.

so, if you say that a file is defined by:
- name
- content
- creation datetime

and you copy a file via copy c:\dir1\filea c:\dir2\filea
then the 2 files are NOT the same, even though they have the same name and same content, since the creation datetime differs.

so, the right answer is: "it depends" :)
0
George TokasAuthor Commented:
I am sorry for any misunderstanding...
The files are application's executables...
One is located on a PC named A and the other one on a PC named B.
Windows operating system.
LOOKS like they are the same. Icon at least.
Maybe when (and if) executed present the same result.
NOW how can we say that those files located on those two PC's is the same one WITHOUT ANY doubt.

George Tokas.
0
jkrCommented:
>>how can we say that those files located on those two PC's is the same one
>>WITHOUT ANY doubt.

Comparing the MD5 signatures will leave no doubt. See http://en.wikipedia.org/wiki/MD5
0
2266180Commented:
quote evilrix:
"Why can't you just diff them using a tool that supports binary diff?

http://winmerge.org/"

for your particular case, a binary diff will do it without any doubt. any hashing will leave doubts. but binary comparison will tell you 100% if those 2 FILES (not applications) are the same or not.

you can also use total commander if you are familiar with it, select the 2 files and do ctrl+f,y

IF you want a realtime comparison, without copying one of the files to the other computer, you can write a small program that does binary diff via network. it's really only a 1-2 hour job.
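
a rough sketch of what such a small program could look like, in Python. it assumes the remote file is reachable as an ordinary path (for example a Windows share such as `\\PC2\share\file.exe`, a hypothetical path); the function name is made up for this example. it reports where the files first differ rather than just yes/no.

```python
def first_difference(path_a, path_b, chunk_size=1 << 16):
    """Return the offset of the first differing byte, or None if identical."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(chunk_size)
            block_b = fb.read(chunk_size)
            if block_a != block_b:
                # Locate the exact position inside the mismatching chunk.
                for i, (x, y) in enumerate(zip(block_a, block_b)):
                    if x != y:
                        return offset + i
                # No byte differs in the common part, so one file is shorter:
                # the difference starts where the shorter one ends.
                return offset + min(len(block_a), len(block_b))
            if not block_a:  # both files ended at the same point
                return None
            offset += len(block_a)
```

a result of `None` means every byte matched; any number is the position of the first mismatch, which is useful to see whether the difference sits in the header or in the body.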

however, it might be easier to just copy one of the files on the other PC.

I would like to state though (and I think somebody else did too) that you can have 2 different files but the same APPLICATION.

I take an executable, I modify a few things (icon colors (human eye will not notice), other resources (change a few resource strings, etc), blablabla). basically, I have the "same" application but different files.
0
2266180Commented:
>> Comparing the MD5 signatures will leave no doubt

correction, comparing md5 signatures will leave MATHEMATICAL doubt. which stands in court :)
0
jkrCommented:
>>correction, comparing md5 signatures will leave MATHEMATICAL doubt.

I think you might want to prove that with a link.
0
evilrixSenior Software Engineer (Avast)Commented:
Since MD5 is a digest surely by its very definition there must be mathematical doubt since there is the probability of collision?
0
2266180Commented:
>>I think you might want to prove that with a link

why? it's the math you learn in highschool for crying out loud.

md5sum is a function declared like:

md5:R->[0..max] where max is a 128 bit value

the domain, R is infinite. the codomain [0..max] is finite

what link do you want? it's highschool level math.
0
IchijoCommented:
In a hash function that produces a fixed size output from an arbitrarily long message (such as MD5 which produces 128-bit hashes), "there will always be collisions, because any given hash has to correspond to an infinite number of possible inputs." http://en.wikipedia.org/wiki/Hash_collision
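
The counting argument can be made concrete with a deliberately shortened hash. The sketch below (Python; the function name and the choice of decimal counters as inputs are arbitrary) keeps only the first 2 bytes of each MD5 digest. With just 65,536 possible values, two different inputs must share a value within 65,537 tries. It says nothing about how hard a collision on the full 128 bits is; it only shows why collisions have to exist at all.

```python
import hashlib

def find_truncated_collision(n_bytes=2):
    """Find two different inputs whose MD5 digests share the first n_bytes bytes."""
    seen = {}  # digest prefix -> first message that produced it
    counter = 0
    while True:
        message = str(counter).encode("ascii")
        prefix = hashlib.md5(message).digest()[:n_bytes]
        if prefix in seen:
            # Pigeonhole: more inputs than possible prefixes forces a repeat.
            return seen[prefix], message
        seen[prefix] = message
        counter += 1

first, second = find_truncated_collision()
print(first, second)
```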
0
George TokasAuthor Commented:
Sorry for interrupting...
@jkr
Even though I refer to you as "the ultimate programmer entity", with no joke intended - because more than 75% of the solutions I found here that helped me solve problems were your posts - I have to say that ciuly has a point.

I'm glad there is another sick mind like mine here, always trying to find weak spots...:-)
And who also sticks to his opinion....

George Tokas.
btw...
@jkr
Is there ANY area of the Windows API that you don't know??
I have had this question for more than 2 years.
0
Kent OlsenData Warehouse Architect / DBACommented:
Hi George,

I've learned never to doubt a few folks on this board.  jkr is certainly one of them.

A layman (common person) won't understand the mathematics involved in the MD5 checksum and will likely have 1 of 2 reactions -- "I don't understand it so I don't trust it" or "Somebody a lot smarter than me figured that out so it must be ok".  Kind of like DNA, but being based on math instead of something physical, people's eyes will likely glaze over.

That said, if your audience is cooperative, you can find all kinds of documentation on MD5 that describes its use and effectiveness without explaining the math involved.  Will supporting documentation showing that the MD5 checksum fails less than 1 time in a billion (much better even than DNA testing) suffice?


Kent
0
George TokasAuthor Commented:
@Kdo
Considering same size and same appearance I agree...

GENERALLY, on the other hand, growing older and having learned a few things on the way, I find myself doubting everything until I have searched for flaws....

First example:
1996. I buy a book on programming with Visual C++..
Inside there was the definition and explanation of the TCP/IP datagram.
FIRST TIME I SAW IT!!!
My eye drops to the flag describing the internet/intranet...
Found a way to modify it at TCP/IP packets...
In every connection to remote networks I was accepted as a local node and with a few more moves I had full access...
Do you remember when Microsoft made a patch for this?
Of course no hack or crack meant from my side...
Just helped a friend ( the creator of Netgammon and now creator & owner of Gammonsite ) secure his servers...

That is why I respect sick minds like mine...:-)

George Tokas.
0
jkrCommented:
>>what link do you want? it's highschool level math.

A practical demonstration of two files of the same size that are different and produce the same MD5 hash.. That would e.g. prove wrong all these stupid people out there that publish MD5 checksums to ensure the integrity of their files, including all major OpenSource projects.
0
Kent OlsenData Warehouse Architect / DBACommented:

>>what link do you want? it's highschool level math.

Yes, but until you've looked at the algorithm, you don't know WHAT math is involved.  

An MD5 checksum is not a simple XORing of all of the data.  It explodes a bit across several bits (using different math for each resultant bit), and collapses the exploded bits back using a different pattern.

It's pretty easy to see this.  Create a small test file with the letter 'B' somewhere in it.  Change the 'B' to a 'C'.  This sets one more bit in the file as the byte with the 'B' (0x42) becomes a 'C' (0x43).  The MD5 checksums of the two files are significantly different, even though the files vary by but a single bit.
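
A quick way to try that experiment without creating files, sketched in Python with the standard `hashlib` module. The sample text and the helper name `differing_bits` are made up for this illustration; the input differs in exactly one bit, and the script counts how many of the 128 digest bits change.

```python
import hashlib

def differing_bits(data_a, data_b):
    """Count the bits that differ between the MD5 digests of two byte strings."""
    digest_a = hashlib.md5(data_a).digest()
    digest_b = hashlib.md5(data_b).digest()
    return sum(bin(x ^ y).count("1") for x, y in zip(digest_a, digest_b))

original = b"A small test file with the letter B in it."
modified = original.replace(b"B", b"C")  # 0x42 -> 0x43, a single-bit change

print(hashlib.md5(original).hexdigest())
print(hashlib.md5(modified).hexdigest())
print(differing_bits(original, modified), "of 128 digest bits differ")
```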


Can you create more than one file with the same MD5 checksum?  Of course.  But given a file with a particular MD5 checksum, can you create a second file with the same checksum AND specific characteristics?  Unlikely.  An awful lot of people are staying up late at night trying to do that very thing.


Kent
0
IchijoCommented:
At http://www.cits.rub.de/MD5Collisions/ the letter of recommendation (letter_of_rec.ps) and order (order.ps) files are both exactly 2,029 bytes long and contain different content but both produce the MD5 hash a25f7f0b29ee0b3968c860738533a4b9.
0
jkrCommented:
Any files of a larger size? I mean 2029 bytes is not really the typical size for an executable. Don't get me wrong, I am aware that this is possible, but let me put it like that: I consider a mechanism that is the de-facto standard for identifying and marking files as safe enough for that purpose.
0
IchijoCommented:
hello.exe and erase.exe are both 6,144 bytes: http://www.mscs.dal.ca/~selinger/md5collision/
0
2266180Commented:
PEOPLE. I am talking about the simple problem of a function with a finite codomain on an infinite domain. THAT is the math I am talking about. it doesn't matter what that function does; mathematically, it will return the same value for an infinite number of input values. you learn that in 11th grade I think. or 12th. irrelevant

you all keep thinking about what md5 does. it doesn't matter. all it matters is that given 2 random files, it CAN get the same hash. and that's all the judge will care about.

the number of possible files is infinite. the number of possible hashes is very much finite. that is the only math you need to understand. a good lawyer will break your case with reasonable doubt.

"Is there a possibility for 2 different files to produce the same hash?"
"yes".

and you're toast. it doesn't matter that in practice, md5sum is acceptable, because you take a file, you modify 1 BIT and the whole hash changes. that is irrelevant. what is relevant is that 2 different files CAN produce the same hash. and if you have an example (which you do) that is all the judge needs to dismiss your case. and that is what counts. and that is without any complicated math behind the md5 sum, which btw, only proves that you cannot modify a file and get the same hash AND have the file still do something nice. that's why md5 sum is used for data corruption prevention.

the idea is that a hash function cannot **guarantee** that 2 files are identical if the hash is the same, but if the hash is different, you can bet your life the files are different.
and our asker needs something that can be guaranteed 100%. mathematically proven, no hash function can do that.
0
Kent OlsenData Warehouse Architect / DBACommented:

>> all it matters is that given 2 random files, it CAN get the same hash.

That's not what matters.  Given file 1, with particular characteristics, modify it so that the new file has a defined set of characteristics AND the same checksum.


THAT is the challenge.

0
jkrCommented:
>>hello.exe and erase.exe are both 6,144 bytes

Um, a typical Win32 executable linked to a CRT will usually start at ~40KB

>>it doesn't matter that in practice, md5sum is acceptable

Yes, all the people out there using it are wrong and you aren't. Point taken.
0
jkrCommented:
>>THAT is the challenge.

Plus, that file still has to be valid machine code ;o)
0
Kent OlsenData Warehouse Architect / DBACommented:

That's the first characteristic.  :)
The second is that the change 'does nothing' so that the program behavior isn't altered.

It's considerably tougher than ciuly argues.  

0
2266180Commented:
whatever. we're probably reading different questions. you're right, md5 can guarantee 100% that 2 files are the same if the hash is the same.

happy? there. I'm out of this question. unsubscribed. go learn some MATH.
0
George TokasAuthor Commented:
Even though I feel GRATEFUL to all of you for your participation here, along with my respect to ALL of you as developers and fine minds, may I propose a way that MAYBE is valid based on your posts.
First the scenario:
We have an application's executable file "file.exe" located somewhere in the storage area of a PC named PC1.
On another PC (PC2), in the same or another location in the storage area, we found a file with the same name, "file.exe".
We want to be ABSOLUTELY CERTAIN that the file is the same.
The size is greater than 300KB.
If we try to execute the file, MAYBE it presents something, maybe not. If it presents something on screen, most probably it presents the same.


1. Checking the icon. It is the same, so we go to the next step.
2. Checking the REAL size of the file. If it is NOT the same we start having doubts about it.
3. Testing the MD5 hash of both files. Let's say the same hash was the result.
4. After all that, testing whether the binary content is more than 95% the same. If more than 98% in this test is the same outside the PE header of the file, then we may say that we are ALMOST absolutely certain that it is the same or a prior or later version.
 A 100% match outside the PE header leads us to conclude that IT IS the same file.
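
 The measurable steps above (size, MD5, share of matching bytes) could be sketched roughly like this in Python. The function name and the `skip` parameter are hypothetical: `skip` stands for however many leading bytes one decides to ignore as "header", which this sketch does not try to work out from the PE format. Bytes are compared position by position, so a version with inserted code would score low even if most of it is unchanged. Both files are read fully into memory, which is fine for a few hundred KB.

```python
import hashlib
import os

def compare_executables(path_a, path_b, skip=0):
    """Report size, MD5 and byte-level agreement for two files."""
    with open(path_a, "rb") as f:
        data_a = f.read()
    with open(path_b, "rb") as f:
        data_b = f.read()
    # Ignore the first `skip` bytes (caller's assumption about the header).
    body_a, body_b = data_a[skip:], data_b[skip:]
    longest = max(len(body_a), len(body_b))
    matching = sum(x == y for x, y in zip(body_a, body_b))
    return {
        "same_size": os.path.getsize(path_a) == os.path.getsize(path_b),
        "same_md5": hashlib.md5(data_a).digest() == hashlib.md5(data_b).digest(),
        "match_percent": 100.0 * matching / longest if longest else 100.0,
        "identical": data_a == data_b,  # full comparison, header included
    }
```

 Only `identical` being true says the two files are the same byte for byte; the percentage is at best a hint that one file is a near relative of the other.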

 What do you say?
 Any other thoughts or additions?

 George Tokas.
0
Kent OlsenData Warehouse Architect / DBACommented:

Hi George,

The step 4 premise needs some adjusting.  If the test passes step 3 (md5 hashes equal and file length equal) and the executable runs, and it produces the same known effect and output, the likelihood of the files being identical is billions to 1.

Modifying a single bit (forget about the checksum) in an executable often has disastrous results.  


Kent
0
jkrCommented:
Yup, agree.
0
George TokasAuthor Commented:
THANKS...
Don't forget this is a REAL LIFE FORENSICS scenario...
ciuly and Ichijo, may I have your opinion TOO, PLEASE?

George Tokas.
0
IchijoCommented:
The "billions to 1" figure assumes that a knowledgeable hacker didn't intentionally alter one of the files to produce an MD5 identical to the other. (But I have no idea how often this happens in real life.)

"Modifying a single bit (forget about the checksum) in an executable often has disastrous results."

You could find an unused or rarely used text string or other resource in the executable and modify it without any discernible difference. But again, you'd have to know what you're doing, and you can't really do it by accident.
0
George TokasAuthor Commented:
@Ichijo
Don't forget the conditions I added my friend.
Binary compatibility of more than 98% out of the PE Header.
In C++ just an addition of 2 integers needs more than 10 bytes of machine code...
Not to mention pushing and popping registers...

The 2% of difference may mean that it is a prior or later version...

George Tokas.
0
George TokasAuthor Commented:
One addition for your information, now that we have covered most of it.
Here in Greece, as in most countries, we print invoices using software on a PC.
When we do so, a special piece of hardware has to validate the document.
In my case - meaning my company - it is a machine that communicates with the PC through the serial port and:
1. Calculates the SHA1 hash of the document (invoice) that is about to be sent to the printer.
2. Returns the hash and the identification of the machine, via the serial port, to be added to the document when it is printed.
3. Prints on paper the transaction date along with the SHA1 signature.

This is considered VALID verification for invoices here.
The conclusions are yours.

George Tokas.
0
George TokasAuthor Commented:
Hello there again,
Sorry, I'm not closing the Q yet.
I would like your opinion on my latest post....
By the way, most of the story is uploaded at http://www.gtokas.com, in English...
For the poor English blame (except me of course) babelfish...

The MOST important parts are NOT uploaded yet...
In the next parts you will see facts that will make you not believe your eyes...

George Tokas.
0
George TokasAuthor Commented:
Time to close this Q...
Thanks everybody for the support.
A VERY SPECIAL THANKS:
1. To ciuly. He left his usual topics to offer support and DEMONSTRATED that ALL THINGS have flaws regardless of how failsafe they look. It was a nice addition and will help me a lot.
2. To Ichijo. Usually when a MUCH higher ranked expert states an opinion, the lower ranked ones accept it by default. THANK YOU for sticking to your opinion. Along with ciuly's, it will be a GREAT help for me.
3. To evilrix. Same as above. You also have my gratitude.

For Kdo & jkr:
 A simple thanks (or very special) will not be enough.
 You know where to reach me if you need ANYTHING.
 The same applies to the rest.

Best Regards,
George Tokas.

P.S. Stay tuned... On Friday we will have news.

0
evilrixSenior Software Engineer (Avast)Commented:
Good luck to you George. Thanks.
0
Question has a verified solution.
