Solved

Locating Duplicate filenames

Posted on 1998-03-10
Last Modified: 2008-03-17
On our Sun Solaris server, I need a Perl program to find all duplicate filenames on a file system. Only the characters up to the first "." (dot) should be compared, because we may have files with the same name where one copy is compressed (.Z) in another directory. I also need to be able to exclude some directories from the search, since I know they contain duplicate filenames.
Question by:j_k
22 Comments
 
LVL 2

Expert Comment

by:alexbik
ID: 1209895
Hi,

This doesn't seem like a 'real' question to me; it's more like 'could anybody write a script for me'. Please be more specific.

Alex
0
 

Author Comment

by:j_k
ID: 1209896
Honestly, yes, I was looking for a script to do this! If not, can you provide some assistance so I can develop it myself? Thanks.

0
 
LVL 2

Expert Comment

by:alexbik
ID: 1209897
Hi,

I think you should put the names of all files in a list, like:

@files=`find <arguments>`

With this, you can specify the directories you want to search. @files will contain a list of all files under the specified directory. Now you can loop over the list and process the files one by one:

for $path (@files) {
    .... code ...
}

In the 'code' part, you have to get the filename from the whole path:
$file=$path; $file=~/.*\/(.+?)/;

after this, $1 will contain the name of the file. With another regexp you can strip everything after the dot:
$beforedot=$file; $beforedot=~/.*(\.)/;

Once you have the filename, you can create another loop, which tests the
name you found against all files in the @files variable.

Note that I didn't actually write this script and I didn't test the regexps, so they may need
some adjustments. It should point you in the right direction, however.


0
 

Author Comment

by:j_k
ID: 1209898
The expression /.*\/(.+?)/ does not pull the filename out of the whole path. I understand that
. means any character
* matches zero or more times
but I am not sure what the rest is doing. Can you explain?

0
 
LVL 2

Expert Comment

by:alexbik
ID: 1209899
Hi,

I did indeed make a mistake with the regexp. The following example works (at least on my Linux box):

#!/usr/bin/perl
@files=`find /`;
for $path (@files) {
        chop $path;
        $file=$path ; $file=~/.*\/(.*)/;
        print "$1\n";
}

A "." in a regexp indeed means "any character", and the "*" means "repeated as many times as necessary". The "\" escapes the "/", which cannot be used bare, since it is a special character. The following .* should be clear; the () put the part of the string they match into $1.

Alex
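
For readers following along, here is a minimal sketch of why the earlier (.+?) capture returned only one character while (.*) returns the whole name. The sample path is hypothetical and chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample path, used only for illustration.
my $path = '/export/home/drawings/12345abc.dwg.Z';

# Non-greedy (.+?) with nothing after it is satisfied by a single character.
my ($lazy) = $path =~ m{.*/(.+?)};

# Greedy (.*) runs to the end of the string, so it captures the whole file name.
my ($greedy) = $path =~ m{.*/(.*)};

print "lazy:   $lazy\n";
print "greedy: $greedy\n";
```

In both patterns the leading .* is greedy, so the match settles on the last "/" in the path; only the capture group differs.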
0
 
LVL 84

Expert Comment

by:ozo
ID: 1209900
/ isn't special, you could have said
  $file=~m".*/(.*)";

then to check duplicates,
  print "$1\n" if( $seen{$1}++ );

Or if you wanted just the first characters up to the first "." (dot) in the search,
   $file=~m".*/([^\.]+)";
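
A minimal sketch of the %seen idea, assuming a list of paths is already in hand. The helper names stem_of and repeated_stems and the sample paths are hypothetical. The stem is extracted in two steps (last path component, then everything before the first dot), which is a slight variation on the single pattern above rather than the same expression.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the part of the file name before its first dot.
# Returns undef for names that start with a dot or for an empty name.
sub stem_of {
    my ($path) = @_;
    my ($name) = $path =~ m{([^/]+)\z};    # last path component
    return unless defined $name;
    my ($stem) = $name =~ /^([^.]+)/;      # characters before the first dot
    return $stem;
}

# Return a stem once for every occurrence after the first, in input order.
sub repeated_stems {
    my %seen;
    my @repeats;
    for my $path (@_) {
        my $stem = stem_of($path);
        next unless defined $stem;
        # Post-increment is false the first time a stem appears, true afterwards.
        push @repeats, $stem if $seen{$stem}++;
    }
    return @repeats;
}

# Hypothetical sample paths.
my @paths = qw(/a/plan.dwg /b/plan.dwg.Z /b/notes.txt /c/plan.bak);
print "$_\n" for repeated_stems(@paths);
```

Declaring %seen with my inside the sub is one way to avoid the "used only once" warning, since the hash is then a lexical rather than a package variable.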
0
 

Author Comment

by:j_k
ID: 1209901
ozo,

print "$1\n" if( $seen{$1}++ );  gives me a warning
Identifier "main::seen" used only once: possible typo

Is this a debugging message I can turn off?

0
 
LVL 84

Expert Comment

by:ozo
ID: 1209902
If you
  use diagnostics;
it should give you a more complete explanation of how to avoid the warning.

I'd suggest initializing %seen to empty with
  %seen = ();
at the beginning of the program.

Do you need help with excluding directories in Perl, or do you just want to let `find` handle that?

you could also use pfind:
pfind / 'print "$1\n" if m"([^.]+)" && $seen{$1}++ == 1'

0
 

Author Comment

by:j_k
ID: 1209903
ozo,
That was going to be my next question! I am finding that there are quite a few directories I want to exclude, and I was attempting to use find to do that. But now I'm thinking I would rather have find list all files, then remove entries from the list using some search patterns, and then check the modified list for duplicates. The searching and removing would have to be done on the full path names. Is this where the Perl function grep could come in handy?

0
 
LVL 84

Expert Comment

by:ozo
ID: 1209904
If you have a recent version of pfind, you might try something like
pfind / 'BEGIN{%xd=map{($_,1)}qw(/excludeme /exclude/me/too)}' '!$xd{$dir}' '/([^.]+)/ && $seen{$1}++ == 1'

or
find / \( -type d \( -name 'excludeme' -o -name 'metoo' \) -prune  \) -o -print | perl -ne 'push @{$seen{$1}},$_ if m".*/([^.]+)"; END{ for( values(%seen)){ print "@{$_}\n" if @{$_} > 1 } }'

which lists all instances of repeated names

0
 

Author Comment

by:j_k
ID: 1209905
ozo,
I decided to use find to narrow my search, but I am having trouble with the syntax.
So far, the following works:

find /dir/?????/{dir1,dir2,dir3} -type f -name "*.dwg*" -print

but I also want to exclude the directories "coord" and "area" from the output. I've tried

find /dir/?????/{dir1,dir2,dir3} -type f -name "*.dwg*" -type d \( -name 'coord' -o -name 'area' \) -prune -print

What am I doing wrong?

0
 
LVL 84

Expert Comment

by:ozo
ID: 1209906
Sorry, I forgot this question was still open.
This seems to be turning into a Unix Programming topic area question,
and I don't know whether all versions of find handle -prune the same way, but

find /dir/?????/{dir1,dir2,dir3} \( -type d \( -name 'coord' -o -name 'area' \) -prune \) -o -type f -name "*.dwg*" -print

seems to work for me.
 
0
 

Author Comment

by:j_k
ID: 1209907
ozo, back to the Perl side. In the following code:

while ( <FILES>){
      push @{$seen{$1}}, $_ if m".*/([^.]+)";
      for( values(%seen)){
            print "@{$_}\n" if @{$_} > 1;
      }
}

When there is a match and something to print, it prints the previous match again and again until another match is found, then prints that match repeatedly until the next match or EOF.

0
 
LVL 84

Expert Comment

by:ozo
ID: 1209908
while ( <FILES> ){
    push @{$seen{$1}}, $_ if m".*/([^.]+)";
}
for( values(%seen)){
    print "@{$_}\n" if @{$_} > 1;
}
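
A minimal sketch of the corrected structure, using the same pattern as the loop above: all paths are collected first, and the report runs once afterwards. The helper name group_by_stem and the sample input are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group paths by the characters between the last "/" and the first dot after it.
# Returns a reference to a hash of stem => [paths].
sub group_by_stem {
    my %seen;
    for my $path (@_) {
        chomp(my $clean = $path);    # tolerate lines read straight from find
        push @{ $seen{$1} }, $clean if $clean =~ m{.*/([^.]+)};
    }
    return \%seen;
}

# Hypothetical input, shaped like the output of find (one path per line).
my $groups = group_by_stem("/a/plan.dwg\n", "/b/plan.dwg.Z\n", "/b/notes.txt\n");

# Report only after every path has been recorded, so each group prints once.
for my $stem (sort keys %$groups) {
    my @paths = @{ $groups->{$stem} };
    print "$stem: @paths\n" if @paths > 1;
}
```

Printing inside the reading loop walks the whole hash on every input line, which is why earlier matches kept reappearing.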

0
 
LVL 84

Expert Comment

by:ozo
ID: 1209909
And while we're back to perl stuff,
find2perl / \( -type d \( -name 'coord' -o -name 'area' \) -prune \) -o -type f -name "*.dwg*" -print
produces
  sub wanted {
    (
        (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($_)) &&
        -d _ &&
        (
            /^coord$/
            ||
            /^area$/
        ) &&
        ($prune = 1)
    )
    ||
    ($nlink || (($dev,$ino,$mode,$nlink,$uid,$gid) = lstat($_))) &&
    -f _ &&
    /^.*\.dwg.*$/ &&
    print("$name\n");
  }

see
perldoc File::Find
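
A hand-written sketch of the same search using File::Find directly, as an alternative to the find2perl output. The helper name find_drawings is hypothetical; find, $File::Find::prune and $File::Find::name come from the File::Find module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

# Collect plain files under $top whose names contain ".dwg", skipping any
# directory whose name is listed in @$skip (hypothetical helper).
sub find_drawings {
    my ($top, $skip) = @_;
    my %skip = map { $_ => 1 } @$skip;
    my @found;
    find(sub {
        if (-d $_ && $skip{$_}) {
            $File::Find::prune = 1;    # do not descend into this directory
            return;
        }
        push @found, $File::Find::name if -f $_ && /\.dwg/;
    }, $top);
    my @sorted = sort @found;
    return @sorted;
}

# Search the directory given on the command line, or the current one.
my $top = @ARGV ? $ARGV[0] : '.';
print "$_\n" for find_drawings($top, [qw(coord area)]);
```

Inside the callback, $_ holds the bare file name because find changes into each directory by default, while $File::Find::name holds the full path.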
0
 

Author Comment

by:j_k
ID: 1209910
OK, I'm getting close to the end. Here's what I have now. The last thing I'm stuck on is how to step through the array @files in a loop!


@files=`find . -type f -print`;
@files = grep (!/00000|dwf|x2a/, @files);
@files = grep (/\d{5}\w{3}.*/, @files);

while ( ??????? ){
      push @{$seen{$1}}, $_ if m".*/([^.]+)";
}

for( values(%seen)){
      print "@{$_}\n" if @{$_} > 1;
}
0
 
LVL 84

Expert Comment

by:ozo
ID: 1209911
foreach( @files ){
   push @{$seen{$1}}, $_ if m".*/([^.]+)";
}
0
 

Author Comment

by:j_k
ID: 1209912
Within Perl, how would I mail the output to the user j_k? Or would it be best to redirect the output of the program by piping it through mail?

For example:
# find_dup_files.pl | mail j_k
0
 
LVL 84

Expert Comment

by:ozo
ID: 1209913
That should work.
You could also redirect the output within perl,

open(OUTPUT,"|mail j_k");
print OUTPUT "@{$_}\n" or die "couldn't output $!";
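
A minimal sketch of the pipe approach with error checks on both open and close. The helper name pipe_report is hypothetical, and the demonstration pipes to cat so that running it sends no mail; passing ['mail', 'j_k'] instead is assumed to behave like the open above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Write each line to the standard input of an external command.
# $command is an array reference: program name followed by its arguments.
sub pipe_report {
    my ($command, @lines) = @_;
    open(my $out, '|-', @$command)
        or die "cannot start @$command: $!";
    print {$out} "$_\n" for @lines;
    # close waits for the command and reports a non-zero exit status.
    close($out)
        or die "@$command failed: " . ($! || "exit status $?");
    return 1;
}

# Demonstration with cat; substitute ['mail', 'j_k'] to send the report.
pipe_report(['cat'], '/a/plan.dwg /b/plan.dwg.Z');
```

The list form of open runs the command without going through a shell, so the arguments need no quoting.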



0
 

Author Comment

by:j_k
ID: 1209914
That's it! It's all working great. Thanks!
How do I close this thread?
0
 
LVL 84

Accepted Solution

by:
ozo earned 200 total points
ID: 1209915
You can grade the answer, or, if you're not happy with the answer,
you can reject it and request another.
0
 

Author Comment

by:j_k
ID: 1209916
Very helpful.
This was a long, drawn-out question, but ozo was patient and responsive.
Thanks

0
