Solved

bash script modifications to compare (diff) files in different folders

Posted on 2014-08-01
Last Modified: 2014-08-05
I wrote a bash script to diff files in $TEST with files in $BASE.
If there are 6 files, the names will match:
*_00.txt
*_01.txt
*_02.txt
...
*_05.txt
for y in `seq 0 $NUMBER_FILES` ;
do
  diff *${y}.txt ${BASE}/*${y}.txt >> ${TEST_COMPARE}
done


NUMBER_FILES is 5 in the above scenario.

This worked for a while. Then the producer program started mixing the files around for some scenarios. (I cannot change the program.) I now compare manually based on file sizes. Often I will find 3 files having exactly the same size. (There is sometimes a slight "don't care" variation of about 4-10 chars in the header.) For files not having the same size, I look for the closest match (a difference of about 4-10 bytes). I then do the diff manually on these closest matches based on file size, and have good results that way.

Can you show me a bash script that will identify the files having the closest file size match and do the diff on those files?

All the files in a folder have the same timestamp when using "ls -l"

Thanks!
Question by:phoffric
20 Comments
 
LVL 21

Expert Comment

by:Daniel McAllister
ID: 40236364
I'm pretty good at bash programming - but I'm afraid I'm confused by your question.
so, please show me a SIMPLE sample of the 2 folders and the files you want to compare... then provide a COMPLEX example of the same -- including the files (in the complex example) you think the script should find to compare.

Dan
 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 40236458
Hello phoffric, are you running diff based on file names or file sizes?

If you would sort the files in TEST and BASE on file size, would the diff work in that case?
 
LVL 32

Author Comment

by:phoffric
ID: 40236468
I hope this example will help clarify the question.
TEST and BASE are set to two different folders. Suppose that for one of the test runs, they both have 5 files in them. In the following example, the prefix in the filenames, such as "aaaaaaa" or "DDDDDDD", represents an arbitrary sequence of random (hex) characters.
In the ${BASE} folder could be something like:

 size          filename
43507        aaaaaaa_00.txt
423          bbbbbbb_01.txt
429          ccccccc_02.txt
43476        ddddddd_03.txt
5308         eeeeeee_04.txt


In the ${TEST} folder could be something like:

 size          filename
429          AAAAAAA_00.txt
5301         BBBBBBB_01.txt
417          CCCCCCC_02.txt
43470        DDDDDDD_03.txt
43500        EEEEEEE_04.txt


For some scenarios the bash code in the OP worked fine, but here I obviously don't want to diff aaaaaaa_00.txt with AAAAAAA_00.txt. Here is what I do manually. I scan the file sizes and diff those having same sizes. Sometimes I get lucky and two files have the same file size. Then I compare sizes of the remaining files and compare those with close file sizes. Here are the following pairs I would diff for the above example:
${BASE}         ${TEST}
ccccccc_02.txt  AAAAAAA_00.txt  -- file sizes are the same (429)
aaaaaaa_00.txt  EEEEEEE_04.txt  -- 43507 close to 43500
bbbbbbb_01.txt  CCCCCCC_02.txt  -- 423   close to 417 
ddddddd_03.txt  DDDDDDD_03.txt  -- 43476 close to 43470 
eeeeeee_04.txt  BBBBBBB_01.txt  -- 5308  close to 5301


I think the above is about as COMPLEX as it gets. The file size differences of the properly paired files differ by 0 to 16 bytes.

I realize that it may not be possible to get perfect results if too many files have similar file sizes. That has happened, but luckily when the file sizes were the same, then the diff would give a match, and I could eliminate that file pair from the remaining list.
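
The manual procedure described above (exact size matches first, then the closest remaining sizes) can be sketched roughly as follows. This is only an illustration, not the script that was accepted in this thread: the function names `file_size` and `pair_by_size` are made up for the example, and it assumes the file names contain no whitespace. It lists every BASE/TEST combination with its size difference, sorts by that difference, and then takes pairs from the top whose files have not been used yet.

```bash
#!/usr/bin/env bash
# Sketch: pair BASE and TEST files by smallest size difference.
# Assumption: file names contain no whitespace (true for the *_NN.txt names here).

# file_size FILE: print the size of FILE in bytes.
file_size() {
  wc -c < "$1" | tr -d ' '
}

# pair_by_size BASE_DIR TEST_DIR: print "base_file test_file size_delta" per pair.
# The body runs in a subshell ( ... ) so its variables do not leak to the caller.
pair_by_size() (
  base_dir=$1
  test_dir=$2
  for b in "$base_dir"/*.txt; do
    [ -e "$b" ] || continue
    bs=$(file_size "$b")
    for t in "$test_dir"/*.txt; do
      [ -e "$t" ] || continue
      ts=$(file_size "$t")
      if [ "$bs" -ge "$ts" ]; then delta=$((bs - ts)); else delta=$((ts - bs)); fi
      echo "$delta $b $t"
    done
  done | sort -n | {
    used=" "   # space-separated list of files that are already paired
    while read -r delta b t; do
      case $used in
        *" $b "*|*" $t "*) continue ;;
      esac
      used="$used$b $t "
      echo "$b $t $delta"
    done
  }
)
```

Each output line could then be fed to diff, for example `pair_by_size "$BASE" "$TEST" | while read -r b t delta; do diff "$b" "$t"; done`. Because equal sizes have a difference of 0, they are paired before any "close" match is considered, which mirrors the manual approach. When several files have nearly the same size the pairing can still be wrong.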

Thanks.

 
LVL 46

Assisted Solution

by:Kent Olsen
Kent Olsen earned 300 total points
ID: 40236476
Hi phoffric,

Can you do something as simple as sorting the file lists and lining them up side by side?

In the ${BASE} folder could be something like:

 size          filename
423          bbbbbbb_01.txt
429          ccccccc_02.txt
5308         eeeeeee_04.txt
43476        ddddddd_03.txt
43507        aaaaaaa_00.txt


In the ${TEST} folder could be something like:

 size          filename
417          CCCCCCC_02.txt
429          AAAAAAA_00.txt
5301         BBBBBBB_01.txt
43470        DDDDDDD_03.txt
43500        EEEEEEE_04.txt



Then it's just a matter of comparing the first item of each list, then the second, etc.
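
A minimal sketch of this "sort both lists and walk them side by side" idea, assuming file names without whitespace. The function names `list_by_size` and `diff_sorted` are hypothetical, and sizes are taken with `wc -c` rather than parsed from `ls`.

```bash
#!/usr/bin/env bash
# Sketch: sort both folders by size, then diff the Nth file of one
# against the Nth file of the other.

# list_by_size DIR: print "size name" for every *.txt in DIR, smallest first.
list_by_size() (
  cd "$1" || exit 1
  for f in *.txt; do
    [ -e "$f" ] || continue
    echo "$(wc -c < "$f" | tr -d ' ') $f"
  done | sort -n
)

# diff_sorted BASE_DIR TEST_DIR: diff the files pairwise in size order.
diff_sorted() (
  base_list=$(mktemp)
  test_list=$(mktemp)
  list_by_size "$1" > "$base_list"
  list_by_size "$2" > "$test_list"
  paste -d ' ' "$base_list" "$test_list" |
  while read -r bsize bname tsize tname; do
    # Skip incomplete lines (one folder has more files than the other).
    [ -n "$tname" ] || continue
    echo "== $bname ($bsize) vs $tname ($tsize)"
    # diff exits non-zero when the files differ; that is expected here.
    diff "$1/$bname" "$2/$tname" || true
  done
  rm -f "$base_list" "$test_list"
)
```

Called as `diff_sorted "$BASE" "$TEST" >> "$TEST_COMPARE"`, this pairs the smallest file in one folder with the smallest in the other, and so on. As noted elsewhere in the thread, one misordered pair shifts every pair after it.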


Kent
 
LVL 32

Author Comment

by:phoffric
ID: 40236479
Hello Gerwin Jansen,
I didn't see your post until after I posted my example. In my script in the OP, the diff is based strictly on filenames comparing 00 with 00; then 01 with 01; and so on. This works for a number of scenarios.

>> If you would sort the files in TEST and BASE on file size, would the diff work in that case?
It would probably be a good improvement if first those with equal file sizes were taken out of the list. So far, my manual diff'ing has always had good matches with file pairs having the same file size. The benefit of doing this is that once we get out of sync, then all the remaining file pairs are out of sync. So, removing the file pairs having greatest likelihood of having a good match will improve the odds.

As I said, I am not expecting perfection in this question. If the results get 4 out of 6 correct, I can manually diff the remaining two pairs.

I think a separate question might be to strive for perfection. It would take the form of selecting the best candidate pairs based on file sizes, doing the diff, and if the number of chars in the diff output exceeds a threshold (a command-line argument), then trying the next closest pair. Once a pair's diff result is below the threshold, take that pair out of the two folder lists and continue diff'ing.
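
The threshold idea could look something like the sketch below. It is an untested-in-production illustration under several assumptions: file names have no whitespace, the threshold is a character count of the diff output, and the names `diff_chars` and `match_with_threshold` are invented for this example. Candidate pairs are tried in order of size closeness, and a pair is accepted only if its diff output is no larger than the threshold.

```bash
#!/usr/bin/env bash
# Sketch: accept a candidate pair only when its diff output is small.

# diff_chars A B: number of characters diff prints for the pair (0 = identical).
diff_chars() {
  { diff "$1" "$2" || true; } | wc -c | tr -d ' '
}

# match_with_threshold BASE_DIR TEST_DIR MAX_CHARS:
# print "base_file test_file diff_chars" for every accepted pair.
match_with_threshold() (
  base_dir=$1
  test_dir=$2
  threshold=$3
  for b in "$base_dir"/*.txt; do
    [ -e "$b" ] || continue
    bs=$(wc -c < "$b" | tr -d ' ')
    for t in "$test_dir"/*.txt; do
      [ -e "$t" ] || continue
      ts=$(wc -c < "$t" | tr -d ' ')
      if [ "$bs" -ge "$ts" ]; then delta=$((bs - ts)); else delta=$((ts - bs)); fi
      echo "$delta $b $t"
    done
  done | sort -n | {
    used=" "
    while read -r delta b t; do
      case $used in
        *" $b "*|*" $t "*) continue ;;
      esac
      n=$(diff_chars "$b" "$t")
      # Too different: leave both files available for a later candidate.
      [ "$n" -le "$threshold" ] || continue
      used="$used$b $t "
      echo "$b $t $n"
    done
  }
)
```

For example, `match_with_threshold "$BASE" "$TEST" 200` would only report pairs whose diff output is at most 200 characters. Files that never get under the threshold are simply not reported, so they would still need a manual check.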
 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 40236486
>> might be to strive for perfection
When comparing files that have similar content, this is what I'd try and go for ;)

When you look at the files in BASE, do they contain any strings or key/value pairs that we can lookup in the TEST file set? That way, we could just loop over every file in BASE and search (by content) for the corresponding file in TEST. File sizes that differ would not be an issue then.
 
LVL 32

Author Comment

by:phoffric
ID: 40236492
Hi Kdo,
I didn't see your comment as I was posting my reply to Gerwin Jansen. I think sorting would be good after first taking out of the list the pairs having the same file size.

BTW - I am just starting bash scripting so I am not sure how to implement the ability to:
- create the list of {size, filename}
- diff the pairs having the same file size
- remove these diff'd equal size files from the list
- sort the remaining list by size
- diff the results

Thanks all for your comments.
 
LVL 32

Author Comment

by:phoffric
ID: 40236494
>> When you look at the files in BASE, do they contain any strings or key/value pairs that we can lookup in the TEST file set? That way, we could just loop over every file in BASE and search (by content) for the corresponding file in TEST. File sizes that differ would not be an issue then.

It's a good question. Unfortunately, most of the files will all likely have many keywords in common, so most will match. I think we have to stick with file sizes. For perfection, I think using a diff output threshold would prove effective. When I have to do this manually, I only see a few chars differing. When I make the wrong choice I see a huge output.
 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 40236532
Well, some basic scripting:

ls -l <folder> - get a list of files, including details like size

| - used to 'pipe' output from one command to input of the next command

awk '{print $2}' <file> - prints column #2 of text in the file

With these few basic commands, we could create the list with filenames with equal size. What remains could be sorted on size.
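
As a small illustration of combining these commands, the sketch below finds the sizes that occur in both folders. It assumes the usual `ls -l` layout in which the size is the fifth column and the first line is a "total" line; the function names `sizes_in` and `common_sizes` are made up for the example.

```bash
#!/usr/bin/env bash
# Sketch: which file sizes appear in both folders?

# sizes_in DIR: print the size column of "ls -l DIR", sorted for comm.
sizes_in() {
  # NR > 1 skips the "total ..." line that ls -l prints first.
  ls -l "$1" | awk 'NR > 1 {print $5}' | sort
}

# common_sizes DIR_A DIR_B: print the sizes present in both folders.
common_sizes() (
  a=$(mktemp)
  b=$(mktemp)
  sizes_in "$1" > "$a"
  sizes_in "$2" > "$b"
  comm -12 "$a" "$b"   # keep only lines common to both sorted lists
  rm -f "$a" "$b"
)
```

With the example folders from earlier in this thread, `common_sizes "$BASE" "$TEST"` should print only 429.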
 
LVL 38

Accepted Solution

by:
Gerwin Jansen, EE MVE earned 1400 total points
ID: 40236601
A basic script would get you 2 lists to compare, one with the file names that have the same size, and one with file names that have different sizes (sorted).

Here's a basic script for you to try:

# set paths
BASE=/path/to/base
TEST=/path/to/test

# get 2 file lists (size name); NR > 1 skips the "total" line that ls -l prints first
ls -l "$BASE" | awk 'NR > 1 {print $5 " " $9}' > base.txt
ls -l "$TEST" | awk 'NR > 1 {print $5 " " $9}' > test.txt

# loop through BASE, find same file sizes in TEST
while read -r line
do
  s1=$(echo "$line" | awk '{print $1}')
  # anchor the pattern so that a size of 429 does not also match 4290
  if grep -q "^$s1 " test.txt
  then
    echo "$line" >> same1.txt
    grep "^$s1 " test.txt >> same2.txt
  fi
done < base.txt

# create files with different sizes
grep -v -f same1.txt base.txt > diff1.txt
grep -v -f same2.txt test.txt > diff2.txt

# do the diffs (the lists hold bare file names, so prefix the folder paths)
echo "Diffing files with same size"
paste same1.txt same2.txt | while read -r line
do
  s1=$(echo "$line" | awk '{print $2}')
  s2=$(echo "$line" | awk '{print $4}')
  diff "$BASE/$s1" "$TEST/$s2"
done

echo "Diffing files with different size"
paste diff1.txt diff2.txt | while read -r line
do
  s1=$(echo "$line" | awk '{print $2}')
  s2=$(echo "$line" | awk '{print $4}')
  diff "$BASE/$s1" "$TEST/$s2"
done

 
LVL 32

Author Comment

by:phoffric
ID: 40236812
@Gerwin Jansen,
Thanks!
Your technique looks pretty good. I guess in bash, we use files to hold our lists instead of variables. That is good to know.

When I get into work, I will give it a try. Most of the code looks good.
One question. Per kdo's comment about sorting, doesn't the script have to sort the lines by size for the unequal size cases (including where the sizes have different widths, e.g., 4150, 420, 4118)?
 
LVL 35

Assisted Solution

by:Duncan Roe
Duncan Roe earned 300 total points
ID: 40236923
ls -S already sorts by size - no need to roll your own
(ls -lS to see what the sizes are)
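
A short sketch of how `ls -lS` can produce an already sorted "size name" list, again assuming the usual `ls -l` column layout and file names without whitespace. The function name `size_list` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: "size name" list for a folder, largest file first.

# size_list DIR: print "size name" per file, ordered by ls -S (largest first).
size_list() {
  # NR > 1 skips the "total ..." line; $5 is the size, $9 the file name.
  ls -lS "$1" | awk 'NR > 1 {print $5 " " $9}'
}
```

For instance, `size_list "$BASE" > base.txt` and `size_list "$TEST" > test.txt` give two lists in the same size order, so no separate sort step is needed before pairing them line by line.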
 
LVL 32

Author Comment

by:phoffric
ID: 40236940
@Duncan Roe,
Just saw your comment after I did my own sort -g operation creating extra files. But I tweaked a line adding your ls -lS and then was able to remove the sort. Thanks.
 
LVL 32

Author Closing Comment

by:phoffric
ID: 40236954
Thanks Gerwin Jansen for the script and your explanations. I appreciate it. This is my own way of personal testing that does full end-to-end testing. For some reason, I am not supposed to submit this into configuration management, but this is the way I feel comfortable testing my programs. (I do have to do gtest which is submitted.) That's why I said perfection wasn't required.

Thanks kdo and Duncan Roe for your suggestions and implementation for the numeric sorting on size.

This little script will save me countless hours. It's the tedium that really gets me. This automation really gives me more assurance that my changes are working as desired.

I tested this on one BASE/TEST pair that was especially troublesome, and it worked fine. Actually it's a tree structure of many pairs, but I have that stuff already working, and this script will fit in very nicely.

Thanks again!

p.s. -
>> When comparing files that have similar content, this is what I'd try and go for ;)
Probably it's my inexperience with bashing that makes me think that using the diff threshold approach would take a while to get right. But if you think it's a walk in the park, then I may open another question, especially if the file sizes are close enough to fool this script in the tree structure.
 
LVL 35

Expert Comment

by:Duncan Roe
ID: 40236960
Re diff threshold - as a starting point try diff filea fileb | wc -l

(please post here if you do open another Q)
 
LVL 32

Author Comment

by:phoffric
ID: 40236969
>> diff filea fileb | wc -l
That certainly makes sense.

I have to get this integrated and then meet a deadline shortly. Then I'll have time to post and try the threshold approach. Will post here if I do open a new question. Thanks again!
 
LVL 38

Expert Comment

by:Gerwin Jansen, EE MVE
ID: 40237188
You're welcome. Thanks for your nice closing comment :-)

Gerwin.
 
LVL 32

Author Comment

by:phoffric
ID: 40240231
I was using the script all day today, and it has prevented some brain cell loss, I am sure. I made a change to the feeding program, and I see the file generating folder sometimes produced an extra file. (I will not be surprised if it produces fewer files another time.)

This adds more complexity than was even discussed above. Here is the new question.
http://www.experts-exchange.com/Programming/Languages/Scripting/Shell/Q_28490563.html

Thanks again!

p.s. - I wrote this up because it seemed like a fun thing to do, and brings us closer to perfection. Actually, the current script is working like a champ in sports, where achieving your goals 88% of the time is considered superb.
 
LVL 32

Author Comment

by:phoffric
ID: 40240427
I just checked one folder where no diffs came out (i.e., the diff files were empty).
The BASE and TEST folders had one file in them that differed in size.

Maybe I copied something wrong. I'll try to look into this particular case maybe by this weekend, time permitting. I may open a new question on this one if I can repeat it in a simple case.

In any case, the new question for differing number of files still is valid.
 
LVL 32

Author Comment

by:phoffric
ID: 40241334
http://www.experts-exchange.com/Programming/Languages/Scripting/Shell/Q_28490915.html
This new question is for the case where no files in BASE and TEST have the same size. Then same1.txt and same2.txt do not exist, and no diff occurs.
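
One possible guard for this case, offered as a sketch rather than as the answer given in the linked question: filter through a small helper that falls back to the full list when the exclusion file is missing or empty. The helper name `remove_listed` is invented for this example.

```bash
#!/usr/bin/env bash
# Sketch: tolerate a missing or empty same1.txt / same2.txt.

# remove_listed LIST EXCLUDE: print the lines of LIST that are not in EXCLUDE.
# If EXCLUDE does not exist or is empty, print LIST unchanged.
remove_listed() {
  if [ -s "$2" ]; then
    # -x whole-line match, -F fixed strings, -f patterns from file
    grep -v -x -F -f "$2" "$1"
  else
    cat "$1"
  fi
}
```

In the accepted script this could stand in for the two `grep -v -f` lines, for example `remove_listed base.txt same1.txt > diff1.txt` and `remove_listed test.txt same2.txt > diff2.txt`, so that the "different size" diffs still run when no sizes match. Removing any `same1.txt` and `same2.txt` left over from an earlier run before the loop starts would also be prudent, since the script appends to them.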
Question has a verified solution.
