bash script - Part 2 - to compare (diff) files in different folders


The goal of the script is explained in the above reference. I integrated the suggested bash script into my driver script, which compares files in a BASE tree with a TEST tree. As noted in the above reference, it did not handle the edge cases pertaining to file sizes perfectly. For a while, those edge cases were not a problem, and the script has already saved me a good deal of tedious manual comparison today.

Now the file generating program has created a different number of files in one TEST folder than in the corresponding BASE folder. My driver routine reported that no diffs were performed at all.

Do you think we can beef up the script so that it can handle a different number of files? In one particular scenario, there were 3 files in the TEST folder and only 2 in the BASE folder. To make things more complicated, none of the file sizes matched, and a pure sort by file size would go out of sync immediately, since the new file had only a few bytes in it.

Now, the bulk of the folders showed minimal changes as desired (thanks again). And for the couple of folders where no diffs were performed, I could manually confirm why there was an extra file. (It may also be that a TEST folder could have fewer files than the BASE.)

It seems that the complication, as was discussed in the previous question, is that trial diffs have to be made to identify the closest-matching files. Certainly file size is a good way to reduce the number of comparisons. There are often just 2 or 3 files per folder; the most I have seen is 6 or 7.

PART 3 (future) ...
Another complication that I was not dealing with until now (and could be another question in Part 3 if it is doable), is to acknowledge that sometimes, I expect a chunk of differences, where a chunk might be 20-40 lines of diff. Maybe I have to give hints to the script as to whether I am expecting differences (or maybe it is just an input diff line count threshold that I can use). It may also be the case that some folders in the tree should have no changes (i.e., a threshold of maybe 6 lines from the diff output), and other folders might have some files.

Thanks again for your help. I brought up Part 3+ just to identify areas that we should not worry about for now. If we can modify the script to allow for a different number of files and to identify the file pairs having the closest match, that alone seems to be an interesting challenge.
Duncan Roe (Software Developer) commented:
This BFI (Brute Force and Ignorance:) script outputs diff commands for the closest matching files between the 2 supplied directories. It would be a lot shorter without the annotation.

#!/bin/bash
# By default, compare files in Base/ with some other dir ($2 changes)
A=${2:-Base}
# By default, compare with TestNeq/ ($1 changes)
B=${1:-TestNeq}

#set -x
for i in "$A"/*.txt
do
  # There always will be a best match for now.
  # Later, we might introduce a minimum acceptable diff length.
  # If we do that, there will not necessarily be a best match,
  # so initialise it here.
  best_match=
  # The arbitrary dummy start value must be bigger than
  # any conceivable diff length.
  shortest_diff=999999999  # any sufficiently large number will do
  for j in "$B"/*.txt
  do
    # This command generates our base data.
    # We could try experimenting with diff and wc options:-
    # I tried wc -c (count characters) but the best match was less obvious
    # (it was still the same file though).
    # I also tried diff -w (ignore white space)
    # but it didn't seem to make any significant difference.
    k=$(diff "$i" "$j" | wc -l)
    if ((k<shortest_diff))
    then
      # Remember the closest match seen so far.
      shortest_diff=$k
      best_match=$j
    fi
  done
  # Test that a matching file was found.
  # (As per above, this will always be true for now).
  [ -z "$best_match" ] ||
    if ((shortest_diff))
    then
      echo "diff $i $best_match # $shortest_diff lines"
    else
      echo "# $i and $best_match are identical"
    fi
done


And this is how the output looks
13:13:10$ ./ 
diff Base/b0001.txt TestNeq/t0004.txt # 5 lines
# Base/b0002.txt and TestNeq/t0009.txt are identical
diff Base/b0003.txt TestNeq/t0010.txt # 6 lines
diff Base/b0004.txt TestNeq/t0002.txt # 8 lines
diff Base/b0005.txt TestNeq/t0005.txt # 36 lines
diff Base/b0006.txt TestNeq/t0001.txt # 8 lines
diff Base/b0007.txt TestNeq/t0008.txt # 13 lines
diff Base/b0008.txt TestNeq/t0007.txt # 8 lines
diff Base/b0009.txt TestNeq/t0003.txt # 13 lines
diff Base/b0010.txt TestNeq/t0006.txt # 4 lines


There are lots of ways to go from here. You might like to have the script actually do the diffs; you might like to be able to specify a maximum acceptable diff length; you might like to have an option whether or not to do diffs; and so on. I'd suggest to use the getopt shell command for option parsing, then add options as you find they would be useful.
You might like some of the earlier iterations of this script too: you can check them out from the attached rcs file (remove .txt from the end of its name and place it in an RCS directory created in the folder where you want it to appear)
(The file should end up being called comma v
I just noticed that EE changed comma to minus sign - D.)
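As a rough illustration of the option-parsing suggestion above, here is a minimal sketch. It uses the bash builtin getopts rather than the external getopt command (a swap made here only to keep the example self-contained), and the option letters -m (maximum acceptable diff length) and -r (run the diffs instead of printing them) are hypothetical, as is the function name parse_options.

```shell
#!/bin/bash
# Sketch only: the option letters -m and -r are illustrative assumptions.
parse_options() {
  max_diff=0        # 0 means "no limit" in this sketch
  run_diffs=no
  local opt OPTIND=1
  while getopts "m:r" opt; do
    case $opt in
      m) max_diff=$OPTARG ;;
      r) run_diffs=yes ;;
      *) echo "usage: [-m max_lines] [-r] [test_dir [base_dir]]" >&2
         return 1 ;;
    esac
  done
  shift $((OPTIND - 1))
  # The remaining positional arguments keep the script's existing defaults.
  B=${1:-TestNeq}
  A=${2:-Base}
}

# Example call; a real script would pass "$@" instead.
parse_options -m 20 -r TestEqExtraFile
echo "base=$A test=$B max_diff=$max_diff run_diffs=$run_diffs"
```

With the example call shown, the echo line reports base=Base, test=TestEqExtraFile, max_diff=20 and run_diffs=yes. New options can be added by extending the getopts string and the case statement.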
Kent Olsen (Data Warehouse Architect / DBA) commented:
Hi phoffric,

What happens when the directories are wildly out of sync?


100  File1
102  File2
103  File3
120  File4
130  File5


125  File1A
300  File2A
310  File3A
320  File4A
330  File5A

In such a scenario, File1A would appear to closely match File4 or File5 equally by size, but nothing else compares at all.
phoffric (Author) commented:
>> What happens when the directories are wildly out of sync?
I guess I am not looking for a general solution (e.g., Part 3 - future).
A test scenario should have most of the files closely matched, with perhaps one or two files having a difference of say, 20-40 lines (out of 200).
In my scenarios, there are some small, medium, and large files, more like:


 1208  File1
31033  File2


  105  File1    - new file does not match anything in BASE
 1200  File2A   - matches File1
31021  File3A   - matches File2
Kent Olsen (Data Warehouse Architect / DBA) commented:
It would seem that the ideal solution is recursive.


1210   File1
30003 File2


1200   File1A
1220   File2A
30000 File3A

In this case, File1 is 10 characters different from both File1A and File2A, so either could be selected.  File2 nearly matches to File3A.  The file that isn't matched to File1 is "extra".  And it might be that File1 is 1 line different from File1A, but completely different from File2A.
phoffric (Author) commented:
Being new to bash scripting, I was hoping that I could get a script that solves this problem.
Duncan Roe (Software Developer) commented:
Perhaps you need to re-jig the script so as to do something more general than work from 2 lists. I don't have time to code anything right now but my approach would be:

    Have 2 lists of files, sorted by size (as now)

    Work through files in one of the lists individually

    If there's an equal size file in 2nd list, compare against it and you are done


    locate next-smaller file and count # lines in diff

    locate next-larger file and count # lines in diff

    if the above 2 steps only find one file (e.g. no smaller file), report comparison against that file

    Otherwise report the smaller diff (or maybe both, depending ...)
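The steps above could be sketched roughly as follows. This is only one possible reading of the outline, not a tested replacement for the existing script: the function names size_of, candidates and closest are hypothetical, file sizes come from wc -c, and file names are assumed to contain no whitespace.

```shell
#!/bin/bash
# Sketch of the size-neighbour idea; all function names are hypothetical.

size_of() { wc -c < "$1" | tr -d ' '; }

# Print the candidate(s) in directory $2 for file $1: an equal-size file
# if one exists, otherwise the next-smaller and next-larger files.
candidates() {
  local target smaller= larger= s_size=-1 l_size=-1 f sz
  target=$(size_of "$1")
  for f in "$2"/*.txt; do
    [ -f "$f" ] || continue
    sz=$(size_of "$f")
    if [ "$sz" -eq "$target" ]; then
      echo "$f"
      return
    elif [ "$sz" -lt "$target" ]; then
      if [ "$sz" -gt "$s_size" ]; then smaller=$f; s_size=$sz; fi
    elif [ "$l_size" -lt 0 ] || [ "$sz" -lt "$l_size" ]; then
      larger=$f; l_size=$sz
    fi
  done
  [ -z "$smaller" ] || echo "$smaller"
  [ -z "$larger" ] || echo "$larger"
}

# Among the candidates, report the one with the fewest diff output lines.
closest() {
  local best= best_len= c len
  for c in $(candidates "$1" "$2"); do   # assumes no whitespace in names
    len=$(( $(diff "$1" "$c" | wc -l) ))
    if [ -z "$best" ] || [ "$len" -lt "$best_len" ]; then
      best=$c; best_len=$len
    fi
  done
  [ -z "$best" ] || echo "diff $1 $best # $best_len lines"
}

# Example: closest Base/b0001.txt TestNeq
```

This performs at most two diffs per file instead of one per pair. Note that an equal-size file is accepted without further checks, so two files that happen to share a size can still be paired wrongly.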
phoffric (Author) commented:
Even in this part, I am still not looking for perfection. Anything that reduces further the number of manual steps is most appreciated.

kdo's first comment really shows the most complexity, except that kdo was kind enough to use an equal number of files. It is getting complicated enough that the question is whether this can or should be done in a bash script without pulling one's hair out, and whether this question should be moved to the algorithms zone.
Duncan Roe (Software Developer) commented:
Some years ago, a colleague introduced me to a methodology for accomplishing complex tasks in Bash. You split everything into functions. The main() function calls subsidiary functions with meaningful names so you can see almost at a glance how the script works. The very last line of the script invokes main, so the order in which functions are defined is immaterial - any can call any other.
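A minimal skeleton of that layout might look like the following. The function names (announce, count_files) and their bodies are placeholders invented for illustration; only the overall shape, with main defined first and invoked on the last line, reflects the methodology described.

```shell
#!/bin/bash
# Skeleton only: function names and bodies are placeholders.

main() {
  local base="${1:-Base}" test="${2:-TestNeq}"
  announce "$base" "$test"
  count_files "$base"
  count_files "$test"
}

announce() {
  echo "Comparing $1 with $2"
}

count_files() {
  local n=0 f
  for f in "$1"/*.txt; do
    [ -f "$f" ] && n=$((n + 1))
  done
  echo "$1: $n .txt file(s)"
}

# Invoked last, so every function above is already defined,
# even though main refers to functions that appear after it.
main "$@"
```

Because bash only looks a function up when it is called, main can sit at the top and read like a table of contents for the script.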

I've enjoyed too much red wine tonight to productively embark on coding a solution. Anyhow I'd rather wait for your test data - look forward to that.

Cheers ... Duncan.
phoffric (Author) commented:
I agree with the function approach; my script had about 6+ functions. They ran through an INPUT tree picking out its sub-folders and finding input files and, based on the sub-folder names, created a TEST folder with the same tree structure. It wasn't bad for my first bash script attempt.

@kdo, You mentioned recursion. I didn't think bash could handle recursion because I would set a variable and without passing in arguments, I found that a called function could use that variable. But maybe if I pass in variables it would work.
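On the recursion point: bash variables are visible to called functions unless they are declared with local, which is why a called function could see the variable without it being passed in. Declaring variables local gives each call its own copy, and that is what makes recursion safe. The following sketch walks a directory tree recursively; the function name walk is hypothetical and the example is unrelated to the matching logic itself.

```shell
#!/bin/bash
# Illustration only: 'walk' is a hypothetical helper.

# Print every regular file under $1, prefixed by its depth in the tree.
walk() {
  # 'local' gives each recursive call its own dir, depth and entry.
  # Without it, a nested call would overwrite the caller's depth.
  local dir="$1" depth="${2:-0}" entry
  for entry in "$dir"/*; do
    if [ -d "$entry" ]; then
      walk "$entry" $((depth + 1))
    elif [ -f "$entry" ]; then
      echo "$depth $entry"
    fi
  done
}

# Example: walk INPUT
```

If local were removed, files listed after a sub-folder would be reported with the sub-folder's depth, because the nested call would have changed the shared variable.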

Thought a little more about your first post. That could happen, but it should not because, I test with every change I make. So, at most I expect, say, an 8 line change - can be deleted, added, or even modified. After being satisfied with the TEST results, I make the TEST become the new BASE so as to keep diffs minimal.

One last piece of bad news. I found one output folder that had 6 files in it, all hovering between roughly 4800 and 5000 bytes, with two files being the same size. The Part1 solution broke down on this one folder (but worked great on all the other folders). I'll try to debug the problem with -x and echo's. When I did it manually, I had to make guesses as to which ones matched - they were that close. Oddly, I think that one file was being diff'd twice.

For this Part 2 solution, one brute force method for, say, 4 files and 3 files (it doesn't matter which tree has which number) is to take a file from the folder having the smaller number of files and diff it with all the files in its mirror folder. Then select the one having the minimal line or size change, and remove those two from the lists. Now there are 3x2, then 2x1, and then 1x0 files (leaving this one file out). Actually, it's important for me to know whether the number of files differs, so I will need to highlight that with lots of echo statements.
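The pair-and-remove idea described above could be sketched as follows. This is a hedged illustration, not a finished solution: the function name greedy_match is hypothetical, closeness is measured by the number of diff output lines, and file names are assumed to contain no colon characters (a colon-separated string tracks which files are already paired).

```shell
#!/bin/bash
# Sketch of greedy pairing; 'greedy_match' is a hypothetical name.
# Each file in $1 takes the closest still-unpaired file in $2.
greedy_match() {
  local dir_a="$1" dir_b="$2" used=: f g k best best_len
  for f in "$dir_a"/*.txt; do
    [ -f "$f" ] || continue
    best= best_len=
    for g in "$dir_b"/*.txt; do
      [ -f "$g" ] || continue
      case $used in *":$g:"*) continue ;; esac   # already paired
      k=$(( $(diff "$f" "$g" | wc -l) ))
      if [ -z "$best" ] || [ "$k" -lt "$best_len" ]; then
        best=$g best_len=$k
      fi
    done
    if [ -n "$best" ]; then
      used="$used$best:"
      echo "diff $f $best # $best_len lines"
    else
      echo "# $f has no partner left"
    fi
  done
  # Anything in the second folder that was never paired is an extra file.
  for g in "$dir_b"/*.txt; do
    [ -f "$g" ] || continue
    case $used in *":$g:"*) ;; *) echo "# $g is unmatched (extra file)" ;; esac
  done
}

# Example: greedy_match Base TestEqExtraFile
```

Because each file claims its best partner in turn, the result depends on processing order: an early file can take a partner that a later file matches even better. For folders where most files differ by only a few lines this is unlikely to matter, but it is a limitation of the greedy approach.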

Thanks again for sound-boarding. From Part 1, and a brief look at Part 1a, I think I am starting to get a feel of how to solve these problems in bash. If this were C or C++, I would know how to do it. I suspect that this problem would be easier in another scripting language such as Perl or Python.
Duncan Roe (Software Developer) commented:
4800 to 5000 bytes ... These files are tiny. Even the crudest brute-force approach is not going to tax a modern system. Not that I'm advocating such an approach - but if it works, what the heck.
Duncan Roe (Software Developer) commented:
Hi phoffric,

Earlier you said you would be posting some test data. Are you still planning to do that?

Cheers ... Duncan.
phoffric (Author) commented:
Sure, as soon as I get my head above water on my project, which should be this week unless I get another surprise.

About that piece of bad news that I posted: I think it may have been due to a cockpit error (that I didn't realize was a cockpit error until it occurred a few times). The problem, I think, was that sometimes I would run the script while I was still in the test folder. One of the things my script does is to destroy the test tree; but with me being in it, apparently it does not get destroyed. As a result, as the program generates new test files, they are added to a non-empty folder, causing the Part 1 script to produce bogus results.

The script I am currently using has helped me a good deal. Since, as you say, the files are tiny, I can live with a brute force approach. (I have 12 cores, so maybe I can figure out a way to use a few of them.)

I am planning another run without a tree, just a single base and test folder. Instead of just 2-6 files in a folder, there may be 21 files in the base folder. Do you think a brute force approach with that many files may take too long?

Instead of a brute force approach, you talked about an ideal recursive solution. I am now aware that bash allows recursion. I didn't understand the recursive approach you are recommending or why it is ideal. I could open a separate question in algorithms/scripting zones if you are up for explaining your concept.
Duncan Roe (Software Developer) commented:
21 files is still a pretty small number.
phoffric (Author) commented:
Hopefully, soon. I think I am almost done for this deadline.
If 21 files is not a problem, then I'll work on giving you a sample.

Is a recursive solution more ideal than a brute force solution? I would like to get a better gist of how that would work, and open a separate question if it is better. I'll then time the different approaches and post the results.
phoffric (Author) commented:
Hi Duncan Roe,
My work project worked, and the script assistance I received in Parts 1 and 1a helped facilitate my own end-to-end testing. I appreciate that help.
I have now temporarily stopped developing and am working on documentation, so this is low priority. If you are busy, just let me know - I fully understand.

-rw-rw-r-- 1 paulh paulh 1123 Aug 24 19:19 b0001.txt
-rw-rw-r-- 1 paulh paulh 1108 Aug 24 19:24 b0002.txt
-rw-rw-r-- 1 paulh paulh 1141 Aug 24 19:22 b0003.txt
-rw-rw-r-- 1 paulh paulh 3419 Aug 24 19:33 b0004.txt
-rw-rw-r-- 1 paulh paulh 3415 Aug 24 19:34 b0005.txt
-rw-rw-r-- 1 paulh paulh 3370 Aug 24 19:27 b0006.txt
-rw-rw-r-- 1 paulh paulh  776 Aug 24 19:41 b0007.txt
-rw-rw-r-- 1 paulh paulh 1120 Aug 24 19:40 b0008.txt
-rw-rw-r-- 1 paulh paulh  777 Aug 24 19:42 b0009.txt
-rw-rw-r-- 1 paulh paulh  778 Aug 24 19:43 b0010.txt

// TestEqExtraFile has an extra file but the other 10 files
// should have minor differences for matching up with the Base folder
-rw-rw-r-- 1 paulh paulh 3369 Aug 24 19:55 t0001.txt
-rw-rw-r-- 1 paulh paulh 3410 Aug 24 19:56 t0002.txt
-rw-rw-r-- 1 paulh paulh  778 Aug 24 19:56 t0003.txt
-rw-rw-r-- 1 paulh paulh 1117 Aug 24 19:57 t0004.txt
-rw-rw-r-- 1 paulh paulh 3414 Aug 24 19:57 t0005.txt
-rw-rw-r-- 1 paulh paulh  770 Aug 24 20:03 t0006.txt
-rw-rw-r-- 1 paulh paulh 1120 Aug 24 19:45 t0007.txt
-rw-rw-r-- 1 paulh paulh  774 Aug 24 19:57 t0008.txt
-rw-rw-r-- 1 paulh paulh 1105 Aug 24 19:57 t0009.txt
-rw-rw-r-- 1 paulh paulh 1130 Aug 24 19:57 t0010.txt
-rw-rw-r-- 1 paulh paulh  777 Aug 24 19:57 t0011.txt

// The following is possibly for the future if the script does not work
// TestNeq has some files that do not match the Base folder
// This is due to a functional change in the program.
-rw-rw-r-- 1 paulh paulh 3368 Aug 24 20:08 t0001.txt
-rw-rw-r-- 1 paulh paulh 3415 Aug 24 20:08 t0002.txt
-rw-rw-r-- 1 paulh paulh 1506 Aug 24 20:08 t0003.txt
-rw-rw-r-- 1 paulh paulh 1127 Aug 24 20:08 t0004.txt
-rw-rw-r-- 1 paulh paulh 3403 Aug 24 20:08 t0005.txt
-rw-rw-r-- 1 paulh paulh  777 Aug 24 20:08 t0006.txt
-rw-rw-r-- 1 paulh paulh 1114 Aug 24 20:08 t0007.txt
-rw-rw-r-- 1 paulh paulh  782 Aug 24 20:08 t0008.txt
-rw-rw-r-- 1 paulh paulh 1108 Aug 24 19:54 t0009.txt
-rw-rw-r-- 1 paulh paulh 1141 Aug 24 20:08 t0010.txt


phoffric (Author) commented:
Attached zip file corresponding to previous post is here.
phoffric (Author) commented:
I do not think that the actual words in the file should be taken into consideration when matching up files.
Duncan Roe (Software Developer) commented:
Hi Paul,

This sounds like a fun project but may have to wait for a bit. Will advise you of progress,

Cheers ... Duncan
phoffric (Author) commented:
Hi Duncan,

I actually enjoyed writing my first bash script and reaped the rewards immediately. Then I got stuck as the files started getting out of order. I added the extra folder in the zip file where some of the files are not the same just to see how well a new script works on them.

I consider files to be the same if the only differences are a couple of lines in the "garbage" section at the top of the file, but I did not try to take that into account in any scripts as I do not have a formal definition of the garbage section. So far, that hasn't been an issue in identifying good comparisons as I quickly go over the final comparison report and see that only the "garbage" header has changed. (It might actually have a changing timestamp from run to run, for example.)

The Base folder has 10 files and the Test folder has 11. Two possibilities for the result. Either the extra file is singularly identified, or the extra file is part of a set of files that did not compare well. These results can be implicit by a manual review of the diff report, or fancier, the mismatches are called out separately for manual review.

Perfection is not required and not attainable unless I provide a rigorous definition of what constitutes success. The goal is to reduce the amount of manual effort required to compare many combinations of files.
I had 21 files to compare. If the scripts reduced the number to 5, that is a huge savings.

phoffric (Author) commented:
@Duncan Roe,

Thanks for the script. I suspect that your program yields the right results. I didn't notice that you actually worked on the "PART 3 (future) ..." scenario that deals with some mismatched files, instead of the EQ folder. As I forgot to write down the answers to EQ manually, I manually compared files to get the EQ answers. I will have to do the same now for the NE scenario (but I will be going to work, so I won't have time for a while). Your results actually were consistent with the OP EQ question with one exception. Of course, since you didn't have the extra file to contend with in the EQ folder, it is not surprising that my results with respect to the mismatch differ from yours.

Before trying to apply your script to the OP question: "Do you think we can beef up the script so that it can handle a different number of files?", I ran it on the NE folder. But it doesn't work. I have attached an err.txt file to show you where the problem is. I ran this on my win7 laptop using Ubuntu 12.04 in VMware.

A quick search of the following syntax did not give me desired results:

Do you have a link that explains the above syntax? I know that Base and TestNeq are folder names, but I don't know what the numbers, the colon, and the dash mean. I am not even sure what the $ sign means here, as I use that to get the value of a variable.

If on the command line, I do:
Script$ A=${7:-Base}
Script$ echo $A

I get Base regardless of the number I use. From simple tests, if I do
Script$ A=Basic
$ echo $A

So, I don't see a difference between the two ways of setting A.
Duncan Roe (Software Developer) commented:
From err.txt, it looks very much like you are using a shell other than bash. My fault - please change the first line to #!/bin/bash and try it again. Either it will complain it can't find bash or it should work. If it doesn't find /bin/bash, maybe you have /usr/bin/bash (I'm not familiar with Ubuntu internals).
Your experiment A=${7:-Base} worked correctly, since you had no $7 (7th positional command line argument). It's documented under Parameter Expansion when you enter man bash.
The script should work for any number of files in either directory. It compares each file in the base dir with every file in the other dir, so does every possible diff.
Default action of script is as if you had typed compareFiles.x TestNeq Base so you can compare other dirs with Base by only giving the other dir as a parameter.
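To make the default-value expansion concrete, here is a small sketch. The function name show_dirs and the folder name OldBase are made up for the example; TestNeq, Base and TestEqExtraFile are the folder names used in this thread.

```shell
#!/bin/bash
# Demonstrates ${parameter:-word}; 'show_dirs' and 'OldBase' are hypothetical.
show_dirs() {
  local B="${1:-TestNeq}"   # first argument, or TestNeq if missing or empty
  local A="${2:-Base}"      # second argument, or Base if missing or empty
  echo "comparing $A with $B"
}

show_dirs                          # comparing Base with TestNeq
show_dirs TestEqExtraFile          # comparing Base with TestEqExtraFile
show_dirs TestEqExtraFile OldBase  # comparing OldBase with TestEqExtraFile
```

At an interactive prompt there are normally no positional parameters at all, so ${7:-Base}, or the same expansion with any other number, always falls back to Base. Inside a script or function the number selects which command-line argument is used when one is supplied.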
Have to go now - post back anything I haven't answered
phoffric (Author) commented:
I corrected your bash script (also in your post) to use bash instead of sh, and now the script gives the same results.

I now see what you did with setting A and B.
       ${parameter:-word}
              Use Default Values.  If parameter is unset or null, the expansion of word is substituted.  Otherwise, the value of parameter is substituted.
phoffric (Author) commented:
It sure seems that the BFI approach using wc -l works pretty well with the file set I gave you for both Part 2 and Part 3. Interestingly, in another test, I had missing files from the Test folders, and the script dutifully found the best matches by minimizing "wc -l". This resulted in one file being matched twice. But as I said that perfection comes in incremental steps, I think this solution will significantly reduce the number of manual steps I need to take to confirm a new test run. Thanks!

I will write this small script up at work and see how it does in my production environment sometime and drop a line to let you know how it fared. I will try to figure out how to identify missing or extra files in the Test folder (that may be another question).
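One possible way to flag files that were matched twice, or never matched at all, is to collect the best match chosen for each Base file and then post-process that list. The sketch below assumes such a list is available as a newline-separated string; the function name report_matches is hypothetical, and collecting the list would need a small change to the existing script.

```shell
#!/bin/bash
# Sketch only: 'report_matches' is a hypothetical helper.
# $1 = Test folder, $2 = newline-separated list of chosen best matches
report_matches() {
  local dir="$1" chosen="$2" f
  # uniq -d prints one copy of every line that occurs more than once.
  printf '%s\n' "$chosen" | sort | uniq -d | while read -r f; do
    echo "# $f was matched more than once"
  done
  for f in "$dir"/*.txt; do
    [ -f "$f" ] || continue
    # grep -Fxq: fixed string, whole-line match, no output.
    printf '%s\n' "$chosen" | grep -Fxq -- "$f" ||
      echo "# $f was never matched"
  done
}

# Example: report_matches TestEqExtraFile "$list_of_best_matches"
```

A file reported as matched more than once suggests a missing file in the Test folder, while a file that was never matched is a candidate for the extra file.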

Thanks again.
Question has a verified solution.
