bash script modifications to compare (diff) files in different folders

I wrote a bash script to diff files in $TEST with files in $BASE.
If there are 6 files, the names will match:
*_00.txt
*_01.txt
*_02.txt
...
*_05.txt
for y in `seq 0 $NUMBER_FILES` ;
do
  diff *${y}.txt ${BASE}/*${y}.txt >> ${TEST_COMPARE}
done

NUMBER_FILES is 5 in the above scenario.

This worked for a while. Then the producer program started mixing the files around for some scenarios. (I cannot change the program.) So now I compare manually based on file sizes. Often I will find 3 files having exactly the same size. (There is sometimes a slight "don't care" variation of about 4-10 chars in the header.) For files not having the same size, I look for the closest match (a difference of about 4-10 bytes). I then do the diff manually on these closest matches, and get good results that way.

Can you show me a bash script that will identify the files having the closest file size match and do the diff on those files?

All the files in a folder have the same timestamp when using "ls -l"

Thanks!
phoffric Asked:
Daniel McAllister, President, IT4SOHO, LLC, Commented:
I'm pretty good at bash programming, but I'm afraid I'm confused by your question.
So, please show me a SIMPLE sample of the 2 folders and the files you want to compare, then provide a COMPLEX example of the same, including the files (in the complex example) you think the script should find to compare.

Dan
Gerwin Jansen, EE MVE, Topic Advisor, Commented:
Hello phoffric, are you running diff based on file names or file sizes?

If you would sort the files in TEST and BASE on file size, would the diff work in that case?
phoffric, Author, Commented:
I hope this example will help clarify the question.
TEST and BASE are set to two different folders. Suppose for one of the test runs, they both have 5 files in them. In the following example, the prefixes in the filenames, such as "aaaaaaa" or "DDDDDDD", represent arbitrary sequences of random (hex) characters.
In the ${BASE} folder could be something like:

 size          filename
43507        aaaaaaa_00.txt
423          bbbbbbb_01.txt
429          ccccccc_02.txt
43476        ddddddd_03.txt
5308         eeeeeee_04.txt


In the ${TEST} folder could be something like:

 size          filename
429          AAAAAAA_00.txt
5301         BBBBBBB_01.txt
417          CCCCCCC_02.txt
43470        DDDDDDD_03.txt
43500        EEEEEEE_04.txt

For some scenarios the bash code in the OP worked fine, but here I obviously don't want to diff aaaaaaa_00.txt with AAAAAAA_00.txt. Here is what I do manually. I scan the file sizes and diff those having the same sizes. Sometimes I get lucky and two files have the same file size. Then I compare sizes of the remaining files and diff those with close file sizes. Here are the pairs I would diff for the above example:
${BASE}         ${TEST}
ccccccc_02.txt  AAAAAAA_00.txt  -- file sizes are the same (429)
aaaaaaa_00.txt  EEEEEEE_04.txt  -- 43507 close to 43500
bbbbbbb_01.txt  CCCCCCC_02.txt  -- 423   close to 417
ddddddd_03.txt  DDDDDDD_03.txt  -- 43476 close to 43470
eeeeeee_04.txt  BBBBBBB_01.txt  -- 5308  close to 5301

I think the above is about as COMPLEX as it gets. The sizes of properly paired files differ by 0 to 16 bytes.

I realize that it may not be possible to get perfect results if too many files have similar file sizes. That has happened, but luckily when the file sizes were the same, then the diff would give a match, and I could eliminate that file pair from the remaining list.

Thanks.
Kent Olsen, Data Warehouse Architect / DBA, Commented:
Hi phoffric,

Can you do something as simple as sorting the file lists and lining them up side by side?

In the ${BASE} folder could be something like:

 size          filename
423          bbbbbbb_01.txt
429          ccccccc_02.txt
5308         eeeeeee_04.txt
43476        ddddddd_03.txt
43507        aaaaaaa_00.txt


In the ${TEST} folder could be something like:

 size          filename
417          CCCCCCC_02.txt
429          AAAAAAA_00.txt
5301         BBBBBBB_01.txt
43470        DDDDDDD_03.txt
43500        EEEEEEE_04.txt

Then it's just a matter of comparing the first item of each list, then the second, etc.
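
A minimal sketch of that sort-and-pair idea, assuming both folders hold the same number of *.txt files and that the names contain no spaces. The helper names list_by_size and pair_by_size are made up for illustration; they are not part of any script posted in this thread.

```shell
# list_by_size DIR: print "size name" for every *.txt file in DIR, smallest first.
list_by_size() {
  for f in "$1"/*.txt; do
    [ -f "$f" ] || continue              # the glob matched nothing
    size=$(wc -c < "$f")
    echo "$((size)) ${f##*/}"            # $((...)) trims any padding from wc
  done | sort -n
}

# pair_by_size BASE_DIR TEST_DIR: diff the Nth smallest file of each folder.
pair_by_size() {
  base_list=$(mktemp)
  test_list=$(mktemp)
  list_by_size "$1" > "$base_list"
  list_by_size "$2" > "$test_list"
  # paste puts the two lists side by side: "size name<TAB>size name"
  paste "$base_list" "$test_list" | while read -r bsize bname tsize tname; do
    [ -n "$tname" ] || continue          # one folder has fewer files
    echo "== $bname ($bsize) vs $tname ($tsize)"
    diff "$1/$bname" "$2/$tname" || true # diff exits 1 when the files differ
  done
  rm -f "$base_list" "$test_list"
}
```

It could be called as pair_by_size "$BASE" "$TEST" >> "$TEST_COMPARE". Because sort -n compares the sizes as numbers, 420 sorts before 4118 regardless of width.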


Kent
phoffric, Author, Commented:
Hello Gerwin Jansen,
I didn't see your post until after I posted my example. In my script in the OP, the diff is based strictly on filenames comparing 00 with 00; then 01 with 01; and so on. This works for a number of scenarios.

>> If you would sort the files in TEST and BASE on file size, would the diff work in that case?
It would probably be a good improvement if those with equal file sizes were first taken out of the list. So far, my manual diff'ing has always had good matches with file pairs having the same file size. The benefit is this: with sorted lists, once one pair gets out of sync, all the remaining file pairs are out of sync too. So removing first the file pairs with the greatest likelihood of a good match will improve the odds.

As I said, I am not expecting perfection in this question. If the results get 4 out of 6 correct, I can manually diff the remaining two pairs.

I think a separate question might be to strive for perfection. It would take the form of selecting the best candidate pairs based on file sizes, doing the diff, and if the number of chars in the diff output exceeds a threshold (a command-line argument), then trying the next closest pair. Once a pair's diff result is below the threshold, take that pair out of the two folder lists and continue diff'ing.
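
A rough sketch of how that threshold idea might look, purely as an illustration. The function name match_by_threshold and its three arguments are hypothetical, and it assumes filenames without spaces. It walks the BASE files in name order, so it is greedy rather than globally optimal.

```shell
# match_by_threshold BASE_DIR TEST_DIR MAX_CHARS
# For each BASE file, try TEST files from closest size to farthest and accept
# the first one whose diff output is at most MAX_CHARS characters long.
match_by_threshold() {
  base_dir=$1 test_dir=$2 max=$3
  used=$(mktemp)                               # TEST names already paired
  for b in "$base_dir"/*.txt; do
    [ -f "$b" ] || continue
    bsize=$(( $(wc -c < "$b") ))
    match=$(
      for t in "$test_dir"/*.txt; do
        [ -f "$t" ] || continue
        grep -qxF -e "${t##*/}" "$used" && continue   # already taken
        tsize=$(( $(wc -c < "$t") ))
        gap=$(( bsize - tsize ))
        [ "$gap" -lt 0 ] && gap=$(( -gap ))
        echo "$gap ${t##*/}"
      done | sort -n | while read -r gap tname; do
        chars=$(( $(diff "$b" "$test_dir/$tname" | wc -c) ))
        if [ "$chars" -le "$max" ]; then
          echo "$tname"
          break
        fi
      done
    )
    if [ -n "$match" ]; then
      echo "$match" >> "$used"                 # take it out of the TEST pool
      echo "${b##*/} $match"
    else
      echo "${b##*/} (no match within $max chars)"
    fi
  done
  rm -f "$used"
}
```

Each output line names a BASE file and the TEST file it was paired with, so the real diff can then be run on exactly those pairs.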
Gerwin Jansen, EE MVE, Topic Advisor, Commented:
>> might be to strive for perfection
When comparing files that have similar content, this is what I'd try and go for ;)

When you look at the files in BASE, do they contain any strings or key/value pairs that we can lookup in the TEST file set? That way, we could just loop over every file in BASE and search (by content) for the corresponding file in TEST. File sizes that differ would not be an issue then.
phoffric, Author, Commented:
Hi Kdo,
I didn't see your comment as I was posting my reply to Gerwin Jansen. I think sorting would be good after first taking the pairs having the same file size out of the list.

BTW - I am just starting bash scripting so I am not sure how to implement the ability to:
- create the list of {size, filename}
- diff the pairs having the same file size
- remove these diff'd equal size files from the list
- sort the remaining list by size
- diff the results

Thanks all for your comments.
phoffric, Author, Commented:
>> When you look at the files in BASE, do they contain any strings or key/value pairs that we can lookup in the TEST file set? That way, we could just loop over every file in BASE and search (by content) for the corresponding file in TEST. File sizes that differ would not be an issue then.

It's a good question. Unfortunately, most of the files will likely have many keywords in common, so most will match. I think we have to stick with file sizes. For perfection, I think using a diff output threshold would prove effective. When I have to do this manually, I only see a few chars differing. When I make the wrong choice I see a huge output.
Gerwin Jansen, EE MVE, Topic Advisor, Commented:
Well, some basic scripting:

ls -l <folder> - get a list of files, including details like size

| - used to 'pipe' output from one command to input of the next command

awk '{print $2}' <file> - prints column #2 of text in the file

With these few basic commands, we could create the list with filenames with equal size. What remains could be sorted on size.
Gerwin Jansen, EE MVE, Topic Advisor, Commented:
A basic script would get you 2 lists to compare, one with the file names that have the same size, and one with file names that have different sizes (sorted).

Here's a basic script for you to try:

# set paths
BASE=/path/to/base
TEST=/path/to/test

# start clean: the loop below appends to these lists
rm -f same1.txt same2.txt

# get 2 file lists (size name); NF >= 9 skips the "total" line of ls -l
ls -l "$BASE" | awk 'NF >= 9 {print $5 " " $9}' > base.txt
ls -l "$TEST" | awk 'NF >= 9 {print $5 " " $9}' > test.txt

# loop through BASE, find same file sizes in TEST
while read -r line
do
  s1=$(echo "$line" | awk '{print $1}')
  # anchor the pattern so that size 429 does not also match 4290
  if grep -q "^$s1 " test.txt
  then
    echo "$line" >> same1.txt
    grep "^$s1 " test.txt >> same2.txt
  fi
done < base.txt

# create files with different sizes
grep -v -f same1.txt base.txt > diff1.txt
grep -v -f same2.txt test.txt > diff2.txt

# do the diffs (the lists hold bare names, so prefix the folder)
echo "Diffing files with same size"
paste same1.txt same2.txt | while read -r line
do
  s1=$(echo "$line" | awk '{print $2}')
  s2=$(echo "$line" | awk '{print $4}')
  diff "$BASE/$s1" "$TEST/$s2"
done

echo "Diffing files with different size"
paste diff1.txt diff2.txt | while read -r line
do
  s1=$(echo "$line" | awk '{print $2}')
  s2=$(echo "$line" | awk '{print $4}')
  diff "$BASE/$s1" "$TEST/$s2"
done

phoffric, Author, Commented:
@Gerwin Jansen,
Thanks!
Your technique looks pretty good. I guess in bash, we use files to hold our lists instead of variables. That is good to know.

When I get into work, I will give it a try. Most of the code looks good.
One question. Per kdo's comment about sorting, doesn't the script have to sort the lines by size for the unequal size cases (including where the sizes are of different widths; e.g., 4150, 420, 4118)?
Duncan Roe, Software Developer, Commented:
ls -S already sorts by size - no need to roll your own
(ls -lS to see what the sizes are)
phoffric, Author, Commented:
@Duncan Roe,
Just saw your comment after I did my own sort -g operation creating extra files. But I tweaked a line adding your ls -lS and then was able to remove the sort. Thanks.
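
The tweaked line itself was not posted, so the following is only a guess at its shape: the listing step with ls -lS in place of ls -l, wrapped in a hypothetical helper called size_list. It assumes the usual ls -l layout, where the size is field 5 and the name is field 9.

```shell
# size_list DIR: print "size name" lines for DIR, largest file first.
size_list() {
  # -S sorts by size numerically; NF >= 9 skips the "total" header line.
  ls -lS "$1" | awk 'NF >= 9 {print $5 " " $9}'
}
```

With sizes 4150, 420 and 4118 this lists 4150, 4118, 420, whereas a plain text sort would put 420 after both four-digit sizes in ascending order.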
phoffric, Author, Commented:
Thanks Gerwin Jansen for the script and your explanations. I appreciate it. This is my own way of personal testing that does full end-to-end testing. For some reason, I am not supposed to submit this into configuration management, but this is the way I feel comfortable testing my programs. (I do have to do gtest which is submitted.) That's why I said perfection wasn't required.

Thanks kdo and Duncan Roe for your suggestions and implementation for the numeric sorting on size.

This little script will save me countless hours. It's the tedium that really gets me. This automation really gives me more assurance that my changes are working as desired.

I tested this on one BASE/TEST pair that was especially troublesome, and it worked fine. Actually it's a tree structure of many pairs, but I have that stuff already working, and this script will fit in very nicely.

Thanks again!

p.s. -
>> When comparing files that have similar content, this is what I'd try and go for ;)
Probably it's my inexperience with bash that makes me think that using the diff threshold approach would take a while to get right. But if you think it's a walk in the park, then I may open another question, especially if the file sizes are close enough to fool this script in the tree structure.
Duncan Roe, Software Developer, Commented:
Re diff threshold - as a starting point try diff filea fileb | wc -l
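
As a small illustration of that starting point (diff_lines and close_enough are made-up helper names, and the threshold is whatever you choose to pass in):

```shell
# diff_lines A B: number of lines diff prints for the two files (0 = identical).
diff_lines() {
  echo $(( $(diff "$1" "$2" | wc -l) ))
}

# close_enough A B MAX: succeed when the diff is at most MAX lines long.
close_enough() {
  [ "$(diff_lines "$1" "$2")" -le "$3" ]
}
```

For example, close_enough filea fileb 10 && echo "probably a pair" treats any diff of ten lines or fewer as a match.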

(please post here if you do open another Q)
phoffric, Author, Commented:
>> diff filea fileb | wc -l
That certainly makes sense.

I have to get this integrated and then meet a deadline shortly. Then I'll have time to post and try the threshold approach. Will post here if I do open a new question. Thanks again!
Gerwin Jansen, EE MVE, Topic Advisor, Commented:
You're welcome. Thanks for your nice closing comment :-)

Gerwin.
phoffric, Author, Commented:
I was using the script all day today and it has prevented some brain cell loss, I am sure. I made a change to the feeding program, and I see the file generating folder sometimes produced an extra file. (I will not be surprised if it produces fewer files another time.)

This adds more complexity than was even discussed above. Here is the new question.
http://www.experts-exchange.com/Programming/Languages/Scripting/Shell/Q_28490563.html

Thanks again!

p.s. - I wrote this up because it seemed like a fun thing to do, and brings us closer to perfection. Actually, the current script is working like a champ in sports, where achieving your goals 88% of the time is considered superb.
phoffric, Author, Commented:
I just checked one folder where no diffs came out (i.e., the diff files were empty).
The BASE and TEST folders had one file in them that differed in size.

Maybe I copied something wrong. I'll try to look into this particular case maybe by this weekend, time permitting. I may open a new question on this one if I can repeat it in a simple case.

In any case, the new question for differing number of files still is valid.
phoffric, Author, Commented:
http://www.experts-exchange.com/Programming/Languages/Scripting/Shell/Q_28490915.html
This new question is for the case where no files in BASE and TEST have the same size. Then same1.txt and same2.txt do not exist, and no diff occurs.
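
One possible guard for that case, offered as a sketch and not as the answer given in the linked question: create both "same" lists empty before the matching loop, and fall back to plain copies when nothing matched. The helper name split_by_size is hypothetical; it takes the two "size name" list files as arguments and writes its four output files to the current directory.

```shell
# split_by_size BASE_LIST TEST_LIST
# Both inputs hold "size name" lines. Writes same1.txt/same2.txt (equal sizes)
# and diff1.txt/diff2.txt (everything else).
split_by_size() {
  : > same1.txt                       # create both lists empty up front
  : > same2.txt
  while read -r size name; do
    [ -n "$size" ] || continue        # ignore blank lines
    if grep -q "^$size " "$2"; then
      echo "$size $name" >> same1.txt
      grep "^$size " "$2" >> same2.txt
    fi
  done < "$1"
  if [ -s same1.txt ]; then
    # grep exits 1 when every line was filtered out, hence "|| true"
    grep -v -x -F -f same1.txt "$1" > diff1.txt || true
    grep -v -x -F -f same2.txt "$2" > diff2.txt || true
  else
    # no equal sizes at all: everything goes to the "different size" lists
    cp "$1" diff1.txt
    cp "$2" diff2.txt
  fi
}
```

After split_by_size base.txt test.txt, all four files exist, so the later paste loops have something to read even when no sizes match.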
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.