Compare Source code for Legal Matter.

rye004
In short, one company is saying another company has stolen source code.  

I have been independently contracted to make this determination.
 
My first thought was to do a text compare via command line.  However, I wanted to reach out to the community to see if there are any suggestions.   I know I can write something myself to do this, but I was curious if there was anything “off the shelf” that I could use to assist.

Any suggestions would be greatly appreciated.

Are they alleging theft of the source code of a complete application, a library, or an algorithm?

Scripts or tools can probably help you identify sections that may have been stolen.

A human then has to check the highlighted sections.


Is it known whether the suspected party tried to conceal the copying, or whether they just copied and pasted code sections?

If such an accusation has been made, the person who made it might be able to give you a starting point.

What programming language is it?


For copy/paste stealing you could just have a script read in all lines, sort them, calculate a hash for each line, and see whether you find them in the other file.
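A minimal sketch of that line-hash idea, in Python. The function names and the `min_length` cutoff (used to skip trivial lines such as `else:` or `pass`) are my own assumptions, not part of any existing tool. It only catches lines that are identical after stripping leading and trailing whitespace, so every hit still needs a human look.

```python
import hashlib


def line_hashes(text, min_length=10):
    """Map SHA-256 digest -> line numbers for every non-trivial line."""
    hashes = {}
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        # Assumption: very short lines match by chance too often to be useful.
        if len(stripped) < min_length:
            continue
        digest = hashlib.sha256(stripped.encode("utf-8")).hexdigest()
        hashes.setdefault(digest, []).append(number)
    return hashes


def shared_lines(text_a, text_b, min_length=10):
    """Return (line numbers in A, line numbers in B) for lines found in both."""
    a = line_hashes(text_a, min_length)
    b = line_hashes(text_b, min_length)
    return sorted((a[digest], b[digest]) for digest in a.keys() & b.keys())
```

Because the lookup is by hash rather than by position, this also finds lines that were moved to a different place in the other file.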

If you suspect the code was reformatted, you might look for tools that standardize the source code formatting and do the comparison afterwards.

If you suspect that code was stolen and work was invested to obfuscate this, you might have to create syntax trees, but I'm not sure it will be easy to identify stolen code this way.
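To illustrate the syntax tree idea, here is a rough sketch that assumes the sources are Python (which the author confirms further down) and uses only the standard `ast` module. The helper names are hypothetical. Each function is reduced to its structure by replacing variable names, argument names, and literal values with placeholders, so reformatting, comments, and simple renaming no longer hide a match. Attribute and method names such as `.append` are deliberately left in place, and reordered or rewritten statements will still defeat it.

```python
import ast
import copy
import hashlib


class _Anonymize(ast.NodeTransformer):
    """Replace identifiers and literal values with placeholders."""

    def visit_FunctionDef(self, node):
        node.name = "_"
        self.generic_visit(node)
        return node

    visit_AsyncFunctionDef = visit_FunctionDef

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="_", ctx=node.ctx), node)

    def visit_arg(self, node):
        node.arg = "_"
        node.annotation = None
        return node

    def visit_Constant(self, node):
        return ast.copy_location(ast.Constant(value=None), node)


def function_fingerprints(source):
    """Return (name, line number, digest) for every function in the source."""
    result = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Work on a copy so the original tree keeps its real names.
            clone = _Anonymize().visit(copy.deepcopy(node))
            digest = hashlib.sha256(ast.dump(clone).encode("utf-8")).hexdigest()
            result.append((node.name, node.lineno, digest))
    return result


def matching_functions(source_a, source_b):
    """Pair functions from A with structurally identical functions from B."""
    by_digest = {}
    for name, lineno, digest in function_fingerprints(source_b):
        by_digest.setdefault(digest, []).append((name, lineno))
    return [((name, lineno), by_digest[digest])
            for name, lineno, digest in function_fingerprints(source_a)
            if digest in by_digest]
```

Very small functions (simple getters, for example) will match by coincidence, so treat the output as a list of candidates for manual review rather than as proof.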


Do you only have access to the code, or do you also have access to the code repository (like git)? If so, analyzing the history of the repository might help.
Duncan Roe, Software Developer
Commented:
You might find tkdiff helpful. Use View options Show Inline Comparison (recursive) and Ignore White Spaces. Use the bent green arrow to re-display diffs after making a change to one of the files (e.g. changing a variable name to match the other).
(I'm not a big GUI person myself but I find that diff refresh so much better than command-line re-diff).

Initially, diff programs may not find significant matches between files at all. You as a human may spot that one is some kind of transform of the other. Once you figure out how to undo the transform, you can start using tkdiff or whatever.

If the code was independently written, I would not expect any diff program to help you much.
Top Expert 2016
Commented:
If code is stolen, you should definitely get some hits if you search the source file directories for identical contents. There are free tools where you can give a root folder that contains both source trees and which find duplicates (by contents, not by name). If there are 100 percent duplicates of non-trivial files, you already know that there is a case (of course only if it is guaranteed that company A isn't the thief themselves). Then the tools could be run again to find duplicates with, say, a 90 percent match. With that you would see to what extent the theft was made.
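If no off-the-shelf duplicate finder is at hand, the same two passes can be sketched with the Python standard library. This is only an illustration: the function names, the `*.py` pattern, and the 0.9 default threshold are assumptions on my part. The first pass finds byte-identical files by content hash; the second compares every file in one tree against every file in the other with `difflib.SequenceMatcher`, which is quadratic and will be slow on large trees.

```python
import hashlib
from difflib import SequenceMatcher
from pathlib import Path


def _source_files(root, pattern):
    """All regular files under root matching the pattern, in stable order."""
    return [p for p in sorted(Path(root).rglob(pattern)) if p.is_file()]


def file_digests(root, pattern="*.py"):
    """Map SHA-256 of file contents -> list of paths with those contents."""
    digests = {}
    for path in _source_files(root, pattern):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        digests.setdefault(digest, []).append(path)
    return digests


def identical_files(root_a, root_b, pattern="*.py"):
    """Files whose contents are byte-for-byte identical in both trees."""
    a = file_digests(root_a, pattern)
    b = file_digests(root_b, pattern)
    return [(a[d], b[d]) for d in sorted(a.keys() & b.keys())]


def similar_files(root_a, root_b, threshold=0.9, pattern="*.py"):
    """Pairs of files whose line-level similarity ratio is >= threshold."""
    def read_lines(path):
        return path.read_text(encoding="utf-8", errors="replace").splitlines()

    files_b = [(p, read_lines(p)) for p in _source_files(root_b, pattern)]
    results = []
    for path_a in _source_files(root_a, pattern):
        lines_a = read_lines(path_a)
        for path_b, lines_b in files_b:
            matcher = SequenceMatcher(None, lines_a, lines_b, autojunk=False)
            ratio = matcher.ratio()
            if ratio >= threshold:
                results.append((path_a, path_b, ratio))
    return results
```

Empty or near-empty files (for example `__init__.py`) will show up as identical, so filter those out before drawing conclusions.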

Sara

Test your restores, not your backups...
Top Expert 2016
Commented:
What language is the application written in, and do you have both source codes?

My go-to tool for just comparing two files and looking for matches and differences between the two is BeyondCompare.  It allows some configuration of the matching constraints (ignore white space, etc.) and does a pretty good job when the files are reasonably similar.  If large amounts of the files have been moved around and placed in different locations, though, it can't really identify that, since its basic approach is a top-down "matching" approach, trying to see where sections of the files align.


There are other tools that can analyze relative to programming languages, often used more for finding redundant code during development, rather than stolen code.  I haven't used these much (we did explore some many years ago on a Code Review team I was a part of) but at the time the tools we looked at didn't work as well as we would have liked.  I did find these looking for more current tools, and depending on your programming language and specific needs might be worth a look.

I'm interested to see what you find...



»bp
Top Expert 2014
Commented:
Is there a source code repository in use?  You can look at the initial checked-in code.  If it wasn't stolen, you would expect it to be small and not resemble that of the other company.  If it was stolen, you would expect it to be large.

A friend/colleague created DLSuperC that is still my go-to comparison utility.
https://dlsuperc.com/

You might look at the source code before and after the lawsuit was filed.  Global changes might reflect an effort to cover up the theft.
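For the repository checks above, here is a hedged sketch that assumes the repository is git (the thread does not say which system is in use). It summarizes how many lines each commit added, oldest first, so an unusually large first check-in or a sweeping global change stands out. `read_history` and `parse_numstat` are hypothetical helper names; the only git features relied on are `git -C`, `log --reverse`, `--numstat`, and `--format`.

```python
import subprocess


def read_history(repo_path):
    """Run git and return the raw log text, oldest commit first."""
    completed = subprocess.run(
        ["git", "-C", repo_path, "log", "--reverse", "--numstat",
         "--format=commit %H"],
        capture_output=True, text=True, check=True,
    )
    return completed.stdout


def parse_numstat(log_text):
    """Turn the log text into a list of per-commit summaries."""
    commits = []
    for line in log_text.splitlines():
        if line.startswith("commit "):
            commits.append({"id": line.split()[1], "added": 0, "files": 0})
        elif line.strip() and commits:
            parts = line.split("\t")
            if len(parts) == 3:
                commits[-1]["files"] += 1
                # Binary files are reported as "-" and add no line count.
                if parts[0].isdigit():
                    commits[-1]["added"] += int(parts[0])
    return commits
```

Typical use would be `parse_numstat(read_history("/path/to/repo"))`, then comparing the first entry against later ones. A large first commit is not proof by itself, since projects are often imported from another system, so it is a lead to follow up rather than a finding.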

Author

Commented:
Thank you so much everyone for the responses.  I am going to spend time reviewing each suggestion.

The code was written in Python.  Both sets of code should be coming from a repository.

Author

Commented:
Thank you everyone for your suggestions.  I am still trying to determine how I am going to do this; however, I think everyone helped give me direction.  I will update this later with my approach.
Yes, an update will be very interesting.
Wishing you good progress.
Top Expert 2014

Commented:
You should look at the repository itself, not just a code snapshot from the repository.
