Solved

How to compare two directories and remove the difference in files

Posted on 2004-04-20
Last Modified: 2010-04-20
What I want to do is this:

I have 2 directories, 'image' and 'thumbnail', and I want to compare them *recursively* to find the files that are still in 'thumbnail' but not in 'image'. Please note that I only want to compare the filenames (i.e. it doesn't matter if their sizes are different). Then I want to delete the files that are still in 'thumbnail' but not in 'image'.

At the moment I have experimented with diff command such that:

'diff -r image thumbnail'

this yields:

Binary files image/009.jpg and thumbnail/009.jpg differ
Only in thumbnail: 010.jpg
Binary files image/sunset.jpg and thumbnail/sunset.jpg differ

I only want to 'capture' the second line of the output (i.e. Only in thumbnail)  and remove (rm -rf) those file(s).

Thanx for your help in advance
Question by:bjai
13 Comments
 
LVL 45

Expert Comment

by:sunnycoder
ID: 10875484
> Then I want to delete the files that are still in 'thumbnail' but not in 'image'.

find <thumbnail dir> -type f | while IFS= read -r name
do
           # map the thumbnail path to the corresponding image path
           newname=`echo "$name" | sed 's/<thumbnail dir>/<image dir>/'`
           if [ ! -f "$newname" ]
           then
                      echo "$name"
           fi
done

This prints the names of all such files. If you wish to remove them, replace the echo with rm -f "$name"
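
A minimal sketch of the same loop that avoids sed altogether, so filenames containing spaces or a repeated directory name cannot be rewritten incorrectly. The function name list_orphans and its two arguments are made up for illustration; it only prints the candidates and deletes nothing.

```shell
# list_orphans THUMB_DIR IMAGE_DIR   (hypothetical helper, not from the thread)
# Print every regular file under THUMB_DIR that has no regular file with the
# same relative path under IMAGE_DIR.
list_orphans() {
    thumb_dir=${1%/}    # drop one trailing slash, if any
    image_dir=${2%/}
    find "$thumb_dir" -type f -exec sh -c '
        thumb_dir=$1; image_dir=$2; shift 2
        for path in "$@"; do
            rel=${path#"$thumb_dir"/}    # strip only the leading directory
            [ -f "$image_dir/$rel" ] || printf "%s\n" "$path"
        done
    ' sh "$thumb_dir" "$image_dir" {} +
}
```

Called as list_orphans thumbnail image, it prints one path per line. Once the list looks right, the printf could be swapped for rm -f -- "$path" to do the actual removal.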
 

Expert Comment

by:robipolli
ID: 10877866
A quicker one:

find image -type f|sed 's/^image/thumbnail/'|xargs rm  -i

rm -i asks for confirmation.

Peace, R.
 

Assisted Solution

by:robipolli
robipolli earned 200 total points
ID: 10878031
Sorry! I misunderstood the question.
This should work:

thumbnails# diff -r . ../images  |awk -F\: '/Only in \./ {print $2}'| xargs rm -i

The sunnycoder one is OK, but you should make sure that
<thumbnail dir> is matched only as the first part of the path. This avoids problems with filenames like
thumbnail/thumbnail.jpg, which would become image/image.jpg

Pax, R.
 
LVL 1

Author Comment

by:bjai
ID: 10879935
The one-liner concept is what I am looking for, but robipolli, please explain the awk part; the line doesn't seem to work. It seems that I need to use awk to extract the path names from the output of diff, so please explain how the awk options and arguments work.
 
LVL 1

Author Comment

by:bjai
ID: 10880049
robipolli, I also tried using your command with 'xargs -print0' at the end, just to check which files are being extracted, but the command just echoes '/bin/echo ?...'
 
LVL 3

Assisted Solution

by:tolgadalkilic
tolgadalkilic earned 100 total points
ID: 10884070
Let me explain :)
-F\: means the ":" character is the field delimiter.
 '/Only in \./ {print $2}' means: for every line that matches "Only in .", print the second field (fields are separated by the delimiter ":"). That output is then piped to the remove command. That is basically the idea, I guess. You can look at the manual for awk at:
http://www.tldp.org/LDP/abs/html/awk.html
I suggest using "awk" and "sed" together for searching and editing strings in files.
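
One thing the awk field split loses is the subdirectory: for a line such as "Only in ./sub: b.jpg" the second field is just " b.jpg". A hedged sketch of the same idea using sed instead, which rebuilds the relative path, is below. The function name orphans_from_diff is invented for illustration, and the sketch assumes diff prints its English "Only in DIR: NAME" messages (forced here with LC_ALL=C) and that no directory name contains ": ".

```shell
# orphans_from_diff THUMB_DIR IMAGE_DIR   (hypothetical helper)
# IMAGE_DIR must be absolute or relative to THUMB_DIR, as in the thread's
# "diff -r . ../images". Prints ./relative/path for entries found only
# under THUMB_DIR. Runs in a subshell so the caller's directory is unchanged.
orphans_from_diff() (
    cd "$1" || exit 1
    LC_ALL=C diff -r . "$2" |
        sed -n -e 's|^Only in \.: |./|p' \
               -e 's|^Only in \(\./[^:]*\): |\1/|p'
)
```

Run with no deletion step first to inspect the list. Note that when a whole directory exists only on the thumbnail side, diff reports the directory itself rather than the files inside it. Also, rm -i fed through xargs may not be able to read answers from the terminal, so printing the list and reviewing it is the safer check.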
 
LVL 1

Author Comment

by:bjai
ID: 10885284
I have realized another problem with the diff output. I wonder if there is an option in diff to change the output format, e.g. to use '=>' or a tab as the delimiter (instead of ': '), or to print the full path of the file. If a sub-directory name under thumbnail or image contains a ':', the regular expression will extract the wrong part. I would like to know whether diff can output the full path.

Thanx
 
LVL 45

Accepted Solution

by:sunnycoder
sunnycoder earned 200 total points
ID: 10885803
robipolli,

>This to avoid problems with filenames like
>thumbnail/thumbnail.jpg which becomes image/image.jpg
not unless you specify g in the sed command :o)


bjai,

>, I wonder if there is an option in diff to change the output format eg delimiter => fullpath of the
>file or a tab (instead of ': '
run it through sed

diff <> <> | sed 's/:/=>/'
 

Expert Comment

by:robipolli
ID: 10886096
Hi all!
I'm in Italy, so I've only just read your messages!
To sunnycoder:
 Yes! You're right! Sorry!

To bjai:
  I tested my script on Linux and it works for directories that have the same tree. It may not fit your problem, but it should be syntactically correct.
  You can do little to change the output format of diff, and that little depends on your operating system. [ man diff ]
  It would be easier if you explained the pattern you want to match; then we can try to write a regexp for it.

Peace, R.
 
LVL 1

Author Comment

by:bjai
ID: 10886147
I think the main problem with using diff is that diff is meant for comparing file contents. The safest way is to use the find command on <thumbnail dir> and loop through each file to see if the same file path occurs in <image dir> (similar to sunnycoder's solution). However, I wonder what the load on the system will be when there are lots of files to compare.

A final comment on the system load with this method would be appreciated (though I have already given out the points). :)

I think I may also look in a different direction: creating thumbnail and image index files that contain all the thumbnail/image file paths. Comparing the 2 index files (now it's time to use the diff command) for differing pathnames would then show which thumbnails are redundant.
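
A rough sketch of that index-file idea follows. It swaps in comm for diff, because comm on two sorted lists prints exactly the lines unique to one side with no extra markup to parse. The function name orphans_from_index and the index file names are invented for illustration, and the sketch assumes no filename contains a newline.

```shell
# orphans_from_index THUMB_DIR IMAGE_DIR   (hypothetical helper)
# Build a sorted index of relative file paths for each tree, then print the
# paths that appear only in the thumbnail index.
orphans_from_index() {
    thumb_dir=$1
    image_dir=$2
    work=$(mktemp -d) || return 1
    (cd "$thumb_dir" && find . -type f | LC_ALL=C sort) > "$work/thumbs.idx"
    (cd "$image_dir" && find . -type f | LC_ALL=C sort) > "$work/images.idx"
    # -2 hides lines only in images.idx, -3 hides lines common to both
    LC_ALL=C comm -23 "$work/thumbs.idx" "$work/images.idx"
    rm -rf "$work"
}
```

Each tree is walked only once and the comparison itself touches just the two index files, which may help when there are many files.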
 

Expert Comment

by:robipolli
ID: 10886318
Well,
the load on the system during diff depends on the diff options and on the size of the 'almost equal' files. find should not add much to the workload if you have a fast HD.
Here is another approach, but the sunnycoder one seems to perform better.

1) copy the thumbnails that still have a matching image into another directory (tar is needed to preserve the directory structure)
/home/test#  mkdir new_thumb
/home/test/images# find . -type f | tar -cf - -C ../thumb -T - | tar -xvf - -C ../new_thumb




Peace, R
 
LVL 45

Expert Comment

by:sunnycoder
ID: 10886493
from
http://www.experts-exchange.com/help.jsp#hi73

C: Because Experts' reliability are often judged by their grading records, many Experts would like the opportunity to clarify if you have questions about their solutions. If you have given the Expert(s) ample time to respond to your clarification posts and you have responded to each of their posts providing requested information; or if the answers, after clarification, lack finality or do not completely address the issue presented, then a "C" grade is an option. You also have the option here of just asking Community Support to delete the question.

Remember, the Expert helping you today is probably going to be helping you next time you post a question. Give them a fair chance to earn an 'Excellent!' grade and they'll provide you with some amazing support. It's also true that a "C" is the lowest grade you can give, and the Experts know that -- so use it judiciously.
 
LVL 45

Expert Comment

by:sunnycoder
ID: 10886521
>However I wonder what will be the load on the system when there are lots of files to compare.
When there are lots of files to compare, the system load will be high irrespective of the method you use. Efficient methods may keep the duration of high load short, but it cannot be avoided altogether (unless you decide to do it in very small pieces).
