Solved

Locating Duplicate filenames

Posted on 1998-03-10
Last Modified: 2008-03-17
On our Sun Solaris server, I need a Perl program to find all duplicate filenames on a file system. I want to compare only the characters up to the first "." (dot). We may have files with the same name, one of which is compressed (.Z) in another directory. I also need to be able to exclude some directories from the search, because I know they contain duplicate filenames.
Question by:j_k
 
LVL 2

Expert Comment

by:alexbik
ID: 1209895
Hi,

This doesn't seem a 'real' question to me, it's more like 'could anybody write a script for me'. Be more specific please..

Alex
0
 

Author Comment

by:j_k
ID: 1209896
Honestly, yes, I was looking for a script to do this! If not, can you provide some assistance in developing it myself? Thanks.

0
 
LVL 2

Expert Comment

by:alexbik
ID: 1209897
Hi,

I think you should put the names of all files in a list, like:

@files=`find <arguments>`

With this, you can specify the directories you want to search. @files will contain a list of all files under the specified directories. Now you can loop over the list and process the files one by one:

for $path (@files) {
    .... code ...
}

In the 'code' part, you have to get the filename from the whole path:
$file=$path; $file=~/.*\/(.+?)/;

After this, $1 will contain the name of the file. With another regexp you can strip everything after the dot:
$beforedot=$file; $beforedot=~/.*(\.)/;

Once you have the filename, you can write another loop that tests the
name you found against all files in the @files variable.

Note that I didn't actually write this script or test the regexps, so they may need
some adjustments. It should point you in the right direction, however.
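
A minimal sketch of the two string steps described above (take the filename from the path, then keep the part before the first dot). The helper names file_part and stem and the sample path are made up for illustration; they are not from the original post.

```perl
use strict;
use warnings;

# Return the part of a path after the last "/".
sub file_part {
    my ($path) = @_;
    (my $file = $path) =~ s{.*/}{}s;    # greedy .* removes everything up to the last "/"
    return $file;
}

# Return the part of a filename before the first ".".
sub stem {
    my ($file) = @_;
    (my $stem = $file) =~ s{\..*}{}s;   # removes the first "." and everything after it
    return $stem;
}

my $path = "/export/home/j_k/plan.dwg.Z";    # hypothetical path
print stem(file_part($path)), "\n";          # prints "plan"
```

Note that a name starting with a dot (such as .profile) would give an empty stem with this approach.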


0
 

Author Comment

by:j_k
ID: 1209898
The expression /.*\/(.+?)/ does not pull out the filename from the whole path. I understand that:
. means any character
* matches zero or more times
but I am not sure what the rest is doing. Can you explain?

0
 
LVL 2

Expert Comment

by:alexbik
ID: 1209899
Hi,

I did indeed make a mistake with the regexp. The following example works (at least on my Linux box):

#!/usr/bin/perl
@files=`find /`;
for $path (@files) {
        chop $path;
        $file=$path ; $file=~/.*\/(.*)/;
        print "$1\n";
}

A "." in a regexp indeed means "any character", and the "*" means "repeated as many times as necessary". The "\" escapes the "/", which cannot be used bare, since it is a special character. The following .* should be clear; the () are used to put that part of the matched string in $1.

Alex
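
To see why the first pattern failed and the second works, here is a small sketch that runs both against one path. The path is a made-up example. The point is that the non-greedy (.+?) is satisfied by a single character, while the greedy (.*) takes everything after the last "/".

```perl
use strict;
use warnings;

my $path = "/export/home/j_k/plan.dwg.Z";   # hypothetical path

# First attempt: .*\/ still runs to the last "/", but the non-greedy
# (.+?) stops as soon as it has matched one character.
my ($lazy) = $path =~ /.*\/(.+?)/;

# Corrected pattern: the greedy (.*) captures the rest of the string.
my ($greedy) = $path =~ /.*\/(.*)/;

print "lazy:   $lazy\n";     # prints "lazy:   p"
print "greedy: $greedy\n";   # prints "greedy: plan.dwg.Z"
```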
0
 
LVL 84

Expert Comment

by:ozo
ID: 1209900
/ isn't special, you could have said
  $file=~m".*/(.*)";

then to check duplicates,
  print "$1\n" if( $seen{$1}++ );

Or if you wanted just the first characters up to the first "." (dot) in the search,
   $file=~m".*/([^\.]+)";
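
A short sketch of the %seen counting idea, wrapped in a sub so it can be tried on a fixed list. The sub name repeated_stems and the sample paths are assumptions for illustration. It reports each repeated name once, on its second sighting, rather than on every repeat.

```perl
use strict;
use warnings;

# Return the names (text after the last "/" up to the first ".")
# that occur more than once in the given paths.
sub repeated_stems {
    my @paths = @_;
    my (%seen, @dups);
    for my $path (@paths) {
        next unless $path =~ m{.*/([^.]+)};
        # Post-increment returns the old count, so this is true
        # only the second time a name is seen.
        push @dups, $1 if $seen{$1}++ == 1;
    }
    return @dups;
}

# Hypothetical sample data
my @dups = repeated_stems("/a/plan.dwg", "/b/plan.dwg.Z", "/c/other.txt", "/d/plan.bak");
print "$_\n" for @dups;    # prints "plan"
```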
0
 

Author Comment

by:j_k
ID: 1209901
ozo,

print "$1\n" if( $seen{$1}++ );  gives me a warning:
Identifier "main::seen" used only once: possible typo

Is this a debugging message I can turn off?

0
 
LVL 84

Expert Comment

by:ozo
ID: 1209902
If you
  use diagnostics;
it should give you a more complete explanation of how to avoid the warning.

I'd suggest initializing %seen to empty with
  %seen = ();
at the beginning of the program.

Do you need help with excluding directories in Perl, or do you just want to let `find` handle that?

you could also use pfind:
pfind / 'print "$1\n" if m"([^.]+)" && $seen{$1}++ == 1'

0
 

Author Comment

by:j_k
ID: 1209903
ozo,
That was going to be my next question! I am finding out that there are quite a few directories that I want to exclude, and I was attempting to use find to do that. But now I'm thinking I would want find to get all files, then remove entries from the list by some search patterns, and then check for duplicates in the modified list. The searching and removing would have to be done on the list of whole path names. Is this where the Perl function grep could come in handy?

0
 
LVL 84

Expert Comment

by:ozo
ID: 1209904
If you have a recent version of pfind, you might try something like
pfind / 'BEGIN{%xd=map{($_,1)}qw(/excludeme /exclude/me/too)}' '!$xd{$dir}' '/([^.]+)/ && $seen{$1}++ == 1'

or
find / \( -type d \( -name 'excludeme' -o -name 'metoo' \) -prune  \) -o -print | perl -ne 'push @{$seen{$1}},$_ if m".*/([^.]+)"; END{ for( values(%seen)){ print "@{$_}\n" if @{$_} > 1 } }'

which lists all instances of repeated names
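
The Perl half of that pipeline can be sketched as a standalone sub that groups full paths by name and keeps only groups with more than one member. The sub name duplicate_groups and the sample paths are hypothetical; the pattern is the one from the one-liner.

```perl
use strict;
use warnings;

# Group full paths by the text between the last "/" and the first ".",
# and return only the groups that contain more than one path.
sub duplicate_groups {
    my @paths = @_;
    my %by_stem;
    for my $path (@paths) {
        chomp(my $clean = $path);    # find output lines end in a newline
        push @{ $by_stem{$1} }, $clean if $clean =~ m{.*/([^.]+)};
    }
    return map  { ( $_ => $by_stem{$_} ) }
           grep { @{ $by_stem{$_} } > 1 }
           keys %by_stem;
}

# Hypothetical sample data standing in for find output
my %groups = duplicate_groups("/a/plan.dwg\n", "/b/plan.dwg.Z\n", "/c/solo.txt\n");
for my $stem (sort keys %groups) {
    print "$stem: @{ $groups{$stem} }\n";
}
```

Keeping the full paths in a hash of arrays, instead of only a counter, is what lets the report show every location of a repeated name.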

0
 

Author Comment

by:j_k
ID: 1209905
ozo,
I decided to use find to narrow my search, but I am having trouble with the syntax.
So far, the following works

find /dir/?????/{dir1,dir2,dir3} -type f -name "*.dwg*" -print

but I also want to exclude the directories "coord" and "area" from printing. I've tried

find /dir/?????/{dir1,dir2,dir3} -type f -name "*.dwg*" -type d \( -name 'coord' -o -name 'area' \) -prune -print

What am I doing wrong?

0
 
LVL 84

Expert Comment

by:ozo
ID: 1209906
Sorry, I forgot this question was still open.
This seems to be getting into a Unix Programming Topic Area question,
and I don't know if all versions of find handle -prune the same way, but

find /dir/?????/{dir1,dir2,dir3} \( -type d \( -name 'coord' -o -name 'area' \) -prune \) -o -type f -name "*.dwg*" -print

seems to work for me.
 
0
 

Author Comment

by:j_k
ID: 1209907
ozo, back to Perl stuff. In the following code:

while ( <FILES>){
      push @{$seen{$1}}, $_ if m".*/([^.]+)";
      for( values(%seen)){
            print "@{$_}\n" if @{$_} > 1;
      }
}

When there is a match and something to print,  it will print the previous match again until another match is found, then it prints that match again and again until another match is found or EOF.

0
 
LVL 84

Expert Comment

by:ozo
ID: 1209908
while ( <FILES> ){
    push @{$seen{$1}}, $_ if m".*/([^.]+)";
}
for( values(%seen)){
    print "@{$_}\n" if @{$_} > 1;
}

0
 
LVL 84

Expert Comment

by:ozo
ID: 1209909
And while we're back to perl stuff,
find2perl / \( -type d \( -name 'coord' -o -name 'area' \) -prune \) -o -type f -name "*.dwg*" -print
produces
  sub wanted {
    (
        (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($_)) &&
        -d _ &&
        (
            /^coord$/
            ||
            /^area$/
        ) &&
        ($prune = 1)
    )
    ||
    ($nlink || (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($_))) &&
    -f _ &&
    /^.*\.dwg.*$/ &&
    print("$name\n");
  }

see
perldoc File::Find
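
For comparison, a hand-written sketch of the same idea using File::Find directly. The sub name find_files and its argument layout are made up for illustration; $File::Find::prune and $File::Find::name are the module's own variables.

```perl
use strict;
use warnings;
use File::Find;

# Collect plain files under $top whose names match $name_re, without
# descending into any directory whose name is listed in @skip.
sub find_files {
    my ($top, $name_re, @skip) = @_;
    my %skip = map { ($_, 1) } @skip;
    my @found;
    find(sub {
        # Inside the callback, $_ is the entry's name within its directory.
        if (-d $_ && $skip{$_}) {
            $File::Find::prune = 1;    # do not descend into this directory
            return;
        }
        if (-f $_ && $_ =~ $name_re) {
            push @found, $File::Find::name;    # full path from $top
        }
    }, $top);
    return @found;
}

# Example call (paths depend on the local system):
# print "$_\n" for find_files(".", qr/\.dwg/, "coord", "area");
```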
0
 

Author Comment

by:j_k
ID: 1209910
OK, I'm getting close to the end. Here's what I have now. The last thing I'm stuck on is how to step through the array "@files" in a while loop!


@files=`find . -type f -print`;
@files = grep (!/00000|dwf|x2a/, @files);
@files = grep (/\d{5}\w{3}.*/, @files);

while ( ??????? ){
      push @{$seen{$1}}, $_ if m".*/([^.]+)";
}

for( values(%seen)){
      print "@{$_}\n" if @{$_} > 1;
}
0
 
LVL 84

Expert Comment

by:ozo
ID: 1209911
foreach( @files ){
   push @{$seen{$1}}, $_ if m".*/([^.]+)";
}
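
One detail worth checking before that loop: backticks return every line with its trailing newline attached, so the stored paths carry a "\n" unless it is removed. A small sketch of the filtering stage with the newlines stripped first; the sample list is a hypothetical stand-in for the find output, and the patterns are the ones from the question.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the list returned by `find . -type f -print`
my @files = (
    "./12345abc/plan.dwg\n",
    "./12345abc/old/plan.dwg.Z\n",
    "./00000xyz/plan.dwg\n",
    "./12345abc/plan.dwf\n",
);

chomp @files;                                 # strip the newline from every element
@files = grep { !/00000|dwf|x2a/ } @files;    # drop paths matching the exclusions
@files = grep { /\d{5}\w{3}/ } @files;        # keep paths with 5 digits then 3 word characters

print "$_\n" for @files;
# prints:
# ./12345abc/plan.dwg
# ./12345abc/old/plan.dwg.Z
```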
0
 

Author Comment

by:j_k
ID: 1209912
Within Perl, how would I mail the output to the user j_k? Or would it be best to redirect the output of the program (piping it through mail), such as:
# find_dup_files.pl | mail j_k
0
 
LVL 84

Expert Comment

by:ozo
ID: 1209913
That should work.
You could also redirect the output within perl,

open(OUTPUT,"|mail j_k") or die "couldn't start mail: $!";
print OUTPUT "@{$_}\n" or die "couldn't output $!";
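
A slightly fuller sketch of the piped-output approach, with a check on each step and an explicit close so a failing command is noticed. The sub name pipe_report is made up, and the command string is passed in as an argument; "mail j_k" is only the example from this thread.

```perl
use strict;
use warnings;

# Send the given lines to an external command through a pipe.
# $command is run by the shell, e.g. "mail j_k".
sub pipe_report {
    my ($command, @lines) = @_;
    open(my $out, '|-', $command)
        or die "couldn't start '$command': $!";
    for my $line (@lines) {
        print {$out} "$line\n"
            or die "couldn't write to '$command': $!";
    }
    # close waits for the command to finish and reports a failing exit status
    close($out)
        or die "'$command' failed: " . ($! || "exit status $?");
    return 1;
}

# Example call (needs a working mail command on the system):
# pipe_report("mail j_k", "/a/plan.dwg /b/plan.dwg.Z");
```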



0
 

Author Comment

by:j_k
ID: 1209914
That's it! It's all working great! Thanks.
How do I close this thread?
0
 
LVL 84

Accepted Solution

by:
ozo earned 50 total points
ID: 1209915
You can grade the answer, or, if you're not happy with the answer,
you can reject it and request another.
0
 

Author Comment

by:j_k
ID: 1209916
Very Helpful,
This was a long drawn out question, but ozo was patient and responsive.
Thanks

0
