Solved

Locating Duplicate filenames

Posted on 1998-03-10
Last Modified: 2008-03-17
On our Sun Solaris server, I need a Perl program to find all duplicate filenames on a file system. I want to compare only the characters up to the first "." (dot) of each name. We may have files with the same name where one of them is compressed (.Z) and sits in another directory. I also need to be able to exclude some directories from the search, because I already know they contain duplicate filenames.
Question by:j_k
LVL 2

Expert Comment

by:alexbik
ID: 1209895
Hi,

This doesn't seem like a 'real' question to me; it's more like 'could anybody write a script for me'. Please be more specific.

Alex
0
 

Author Comment

by:j_k
ID: 1209896
Honestly, yes, I was looking for a script to do this! If not, can you provide some assistance in developing it myself? Thanks.

0
 
LVL 2

Expert Comment

by:alexbik
ID: 1209897
Hi,

I think you should put the names of all files in a list, like:

@files=`find <arguments>`

With the find arguments, you can specify the directories you want to search in. @files will contain a list of all files under the specified directories. Now you can make a loop and process the files one by one:

for $path (@files) {
    .... code ...
}

In the 'code' part, you have to get the filename from the whole path:
$file=$path; $file=~/.*\/(.+?)/;

after this, $1 will contain the name of the file. With another regexp you can strip everything after the dot:
$beforedot=$file; $beforedot=~/.*(\.)/;

If you have the filename, you can create another loop, which tests the
name you found against all files in the @files variable.

Note that I didn't actually write this script and I didn't test the regexps, so they may need
some adjustments. It should point you in the right direction, however.


0

Author Comment

by:j_k
ID: 1209898
The expression /.*\/(.+?)/ does not pull the filename out of the whole path. I understand that:
. means any character
* matches zero or more times
but I am not sure what the rest is doing. Can you explain?

0
 
LVL 2

Expert Comment

by:alexbik
ID: 1209899
Hi,

I did make a mistake with the regexp. The following example works (at least on my Linux box):

#!/usr/bin/perl
@files=`find /`;
for $path (@files) {
        chomp $path;                      # strip the trailing newline left by find
        $file=$path ; $file=~/.*\/(.*)/;  # $1 is everything after the last "/"
        print "$1\n";
}

A "." in a regexp does mean "any character", and the "*" means "repeated as many times as necessary". The "\" escapes the "/", which cannot be used bare, since it is a special character. The following .* should be clear; the () are used to put that part of the matched string in $1.

Alex
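
To see what the corrected pattern captures, here is a minimal sketch. The helper name last_component and the sample paths are made up for illustration and do not come from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the last path component. The greedy .* runs up to the final "/",
# and the parentheses capture whatever follows it into $1.
# A path with no "/" at all is returned unchanged.
sub last_component {
    my ($path) = @_;
    return $path =~ m{.*/(.*)} ? $1 : $path;
}

# Hypothetical sample paths, for illustration only.
for my $path ('/usr/local/bin/perl', '/home/j_k/plan.dwg.Z', 'README') {
    print last_component($path), "\n";
}
```

Using m{...} as the delimiter means the "/" inside the pattern needs no backslash.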
0
 
LVL 84

Expert Comment

by:ozo
ID: 1209900
/ isn't special, you could have said
  $file=~m".*/(.*)";

then to check duplicates,
  print "$1\n" if( $seen{$1}++ );

Or if you wanted just the first characters up to the first "." (dot) in the search,
   $file=~m".*/([^\.]+)";
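
Put together, the two ideas above might look like the sketch below. It is an assumption-laden illustration, not the script from the thread: the sub names stem and repeated_stems are hypothetical, and the stem is taken in two steps (last path component first, then everything before its first dot) so that a dotfile such as .profile yields no stem instead of matching part of its directory name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the part of the filename before its first dot,
# or undef when the name is empty or starts with a dot.
sub stem {
    my ($path) = @_;
    my ($name) = $path =~ m{([^/]*)\z};    # last path component
    return $name =~ /^([^.]+)/ ? $1 : undef;
}

# Return each stem once, at the moment it is seen for the second time.
sub repeated_stems {
    my @paths = @_;
    my (%seen, @dups);
    for my $path (@paths) {
        my $stem = stem($path);
        next unless defined $stem;
        push @dups, $stem if $seen{$stem}++ == 1;
    }
    return @dups;
}

# Hypothetical paths (already chomped); prints "plan".
print "$_\n" for repeated_stems('/proj/a/plan.dwg', '/proj/b/plan.dwg.Z', '/proj/b/site.dwg');
```

Testing `$seen{$stem}++ == 1` reports a name once no matter how many copies exist; a plain `$seen{$stem}++` would report every copy after the first.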
0
 

Author Comment

by:j_k
ID: 1209901
ozo,

print "$1\n" if( $seen{$1}++ );  gives me a warning
Identifier "main::seen" used only once: possible typo

Is this a debugging message I can turn off?

0
 
LVL 84

Expert Comment

by:ozo
ID: 1209902
If you
  use diagnostics;
it should give you a more complete explanation of how to avoid the warning.

I'd suggest initializing %seen to empty with
  %seen = ();
at the beginning of the program.

Do you need help with excluding directories in perl, or do you just want to let `find` handle that?

you could also use pfind:
pfind / 'print "$1\n" if m"([^.]+)" && $seen{$1}++ == 1'

0
 

Author Comment

by:j_k
ID: 1209903
ozo,
That was going to be my next question! I am finding out that there are quite a few directories that I want to exclude, and I was attempting to use find to do that. But now I'm thinking I would want find to get all files, then remove entries from the list by some search patterns, and then check the modified list for duplicates. The searching and removing would have to be done on the list of whole path names. Is this where the perl function grep could come in handy?
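
A rough sketch of that grep idea follows. The sub name keep_paths and the exclusion patterns are hypothetical placeholders, and the path list is assumed to be already chomped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return only the paths that match none of the exclusion patterns.
# $exclude is a reference to an array of compiled regexps, each tested
# against the whole path.
sub keep_paths {
    my ($exclude, @paths) = @_;
    return grep {
        my $path = $_;
        !grep { $path =~ $_ } @$exclude;   # inner grep counts matching patterns
    } @paths;
}

# Hypothetical directories to leave out of the duplicate check.
my @exclude = (qr{/tmp/}, qr{/backup/});
my @kept = keep_paths(\@exclude, '/data/a.dwg', '/tmp/a.dwg', '/data/backup/b.dwg');
print "$_\n" for @kept;
```

Filtering in Perl still makes find walk the excluded directories; pruning them in find itself avoids that extra work on a large file system.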

0
 
LVL 84

Expert Comment

by:ozo
ID: 1209904
If you have a recent version of pfind, you might try something like
pfind / 'BEGIN{%xd=map{($_,1)}qw(/excludeme /exclude/me/too)}' '!$xd{$dir}' '/([^.]+)/ && $seen{$1}++ == 1'

or
find / \( -type d \( -name 'excludeme' -o -name 'metoo' \) -prune  \) -o -print | perl -ne 'push @{$seen{$1}},$_ if m".*/([^.]+)"; END{ for( values(%seen)){ print "@{$_}\n" if @{$_} > 1 } }'

which lists all instances of repeated names
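
The Perl half of that pipeline can be read as building a hash of arrays keyed by the name before the first dot. A minimal standalone sketch, assuming chomped paths and using the hypothetical sub name group_by_stem:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group paths by the part of the filename before its first dot and
# return only the groups that contain more than one path.
sub group_by_stem {
    my @paths = @_;
    my %seen;
    for my $path (@paths) {
        # Same pattern as the one-liner: capture after the last "/", up to the first dot.
        push @{ $seen{$1} }, $path if $path =~ m{.*/([^.]+)};
    }
    my %dups;
    for my $stem (keys %seen) {
        $dups{$stem} = $seen{$stem} if @{ $seen{$stem} } > 1;
    }
    return %dups;
}

# Hypothetical input, for illustration only.
my %dups = group_by_stem('/a/plan.dwg', '/b/plan.dwg.Z', '/b/site.dwg');
print "$_: @{ $dups{$_} }\n" for sort keys %dups;
```

Storing the full paths, rather than a counter, is what lets the report show every location of a repeated name.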

0
 

Author Comment

by:j_k
ID: 1209905
ozo,
I decided to use find to narrow my search, but I am having trouble with the syntax.
So far, the following works

find /dir/?????/{dir1,dir2,dir3} -type f -name "*.dwg*" -print

but I also want to exclude the directories "coord" and "area" from printing. I've tried

find /dir/?????/{dir1,dir2,dir3} -type f -name "*.dwg*" -type d \( -name 'coord' -o -name 'area' \) -prune -print

What am I doing wrong?

0
 
LVL 84

Expert Comment

by:ozo
ID: 1209906
Sorry, I forgot this question was still open.
This seems to be turning into a Unix Programming Topic Area question,
and I don't know if all versions of find handle -prune the same way, but

find /dir/?????/{dir1,dir2,dir3} \( -type d \( -name 'coord' -o -name 'area' \) -prune \) -o -type f -name "*.dwg*" -print

seems to work for me.
 
0
 

Author Comment

by:j_k
ID: 1209907
ozo, Back to perl stuff, In the following code:

while ( <FILES>){
      push @{$seen{$1}}, $_ if m".*/([^.]+)";
      for( values(%seen)){
            print "@{$_}\n" if @{$_} > 1;
      }
}

When there is a match and something to print,  it will print the previous match again until another match is found, then it prints that match again and again until another match is found or EOF.

0
 
LVL 84

Expert Comment

by:ozo
ID: 1209908
while ( <FILES> ){
    push @{$seen{$1}}, $_ if m".*/([^.]+)";
}
for( values(%seen)){
    print "@{$_}\n" if @{$_} > 1;
}

0
 
LVL 84

Expert Comment

by:ozo
ID: 1209909
And while we're back to perl stuff,
find2perl / \( -type d \( -name 'coord' -o -name 'area' \) -prune \) -o -type f -name "*.dwg*" -print
produces
  sub wanted {
    (
        (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($_)) &&
        -d _ &&
        (
            /^coord$/
            ||
            /^area$/
        ) &&
        ($prune = 1)
    )
    ||
    ($nlink || (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($_))) &&
    -f _ &&
    /^.*\.dwg.*$/ &&
    print("$name\n");
  }

see
perldoc File::Find
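
For comparison, a hand-written File::Find version of the same search might look like this. It is a sketch only: the sub name find_dwg is hypothetical, and the directory names to skip are passed in rather than hard-coded.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Return the plain files under $top whose names contain ".dwg",
# without descending into any directory named in @skip.
sub find_dwg {
    my ($top, @skip) = @_;
    my %skip = map { ($_, 1) } @skip;
    my @found;
    find(sub {
        # Inside the callback, $_ is the bare name and
        # $File::Find::name is the full path.
        if (-d $_ && $skip{$_}) {
            $File::Find::prune = 1;    # do not descend into this directory
            return;
        }
        push @found, $File::Find::name if -f $_ && /\.dwg/;
    }, $top);
    return @found;
}

# Usage: script.pl TOPDIR [DIRNAME_TO_SKIP ...]
if (@ARGV) {
    print "$_\n" for find_dwg(@ARGV);
}
```

Called as `find_dwg('/dir', 'coord', 'area')`, it should visit roughly what the find command above prints, with the list returned to Perl instead of written to standard output.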
0
 

Author Comment

by:j_k
ID: 1209910
OK, I'm getting close to the end. Here's what I have now. The last thing I'm stuck on is how to step through the array "@files" in a while loop!


@files=`find . -type f -print`;
@files = grep (!/00000|dwf|x2a/, @files);
@files = grep (/\d{5}\w{3}.*/, @files);

while ( ??????? ){
      push @{$seen{$1}}, $_ if m".*/([^.]+)";
}

for( values(%seen)){
      print "@{$_}\n" if @{$_} > 1;
}
0
 
LVL 84

Expert Comment

by:ozo
ID: 1209911
foreach( @files ){
   push @{$seen{$1}}, $_ if m".*/([^.]+)";
}
0
 

Author Comment

by:j_k
ID: 1209912
Within Perl, how would I mail the output to the user j_k? Or would it be best to redirect the output of the program by piping it through mail?

Such as:
# find_dup_files.pl | mail j_k
0
 
LVL 84

Expert Comment

by:ozo
ID: 1209913
That should work.
You could also redirect the output within perl,

open(OUTPUT,"|mail j_k") or die "couldn't start mail: $!";
print OUTPUT "@{$_}\n" or die "couldn't output $!";
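
A slightly fuller sketch of the piped approach, which also checks the result of close so that a failing mail command is noticed. The sub name send_report is hypothetical, and the command string is passed to the shell as given.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Write @lines to the standard input of $command and report any failure.
sub send_report {
    my ($command, @lines) = @_;
    open(my $out, '|-', $command)
        or die "couldn't start '$command': $!";
    print {$out} @lines
        or die "couldn't write to '$command': $!";
    # close on a pipe waits for the command and fails on a non-zero exit status.
    close($out)
        or die "'$command' failed: " . ($! || "exit status $?");
    return 1;
}
```

With the report lines collected in an array, the call would be something like `send_report('mail j_k', @report_lines);`.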



0
 

Author Comment

by:j_k
ID: 1209914
That's it! It's all working great. Thanks!
How do I close this thread?
0
 
LVL 84

Accepted Solution

by:
ozo earned 50 total points
ID: 1209915
You can grade the answer, or, if you're not happy with the answer,
you can reject it and request another.
0
 

Author Comment

by:j_k
ID: 1209916
Very Helpful,
This was a long drawn out question, but ozo was patient and responsive.
Thanks

0
