removing strange characters from filenames

charlatan01
Hey Everyone,

I have a bit of a problem and I'm on a serious deadline.  My file server houses about 70 GB of data in tens of thousands of files.  I discovered that hundreds of files have unusual characters in their filenames.  This wouldn't be a problem except that I'm migrating these files to a new file server and it won't accept them.

In order to fix this problem I need a script that can replace these invalid characters with a valid one.

The script needs to establish a base set of "OK" characters (a-z A-Z - _ , . @ etc...) and then iterate through every file and substitute any illegal character with an underscore.  So, "monthly^??report.xls" would change to "monthly___report.xls"

The script would also need to descend into subdirectories doing the same thing to them.  Ideally it would output its changes to a logfile for tracking too.

I haven't touched Perl in 4 years and have forgotten just about everything.  I'm okay at reverse engineering scripts, but creating them from scratch is beyond my current skill set.  Can someone help?

Thanks.

Commented:
What about this oneliner (ignore "invisible file") :

ls -R | grep -v '.' | xargs  perl -pi -e "s/[^\w_,.@]/_/g"

Commented:
I'd recommend for a pure perl script, with more features :

File::Searcher module.

Slightly modified from the man page :

use File::Searcher;

my $search = File::Searcher->new('*');
$search->add_expression(name=>'street',
   $search->add_expression(name=>'department',
   search=>'[^\w_,.@]',
   replace=>'_',
   options=>'g');
$search->start;

@files_matched = $search->files_matched;
print "Files Matched\n";
print "\t" . join("\n\t", @files_matched) . "\n";
print "Total Files:\t" . $search->file_cnt . "\n";
print "Directories:\t" . $search->dir_cnt . "\n";
my @files_replaced = $search->expression('[^\w_,.@]')->files_replaced;

Commented:
I'd recommend for a pure perl script, with more features :

File::Searcher module.

Slightly modified from the man page :

use File::Searcher;

my $search = File::Searcher->new('*');
$search->add_expression(name=>'weird',
   search=>'[^\w_,.@]',
   replace=>'_',
   options=>'g');
$search->start;

@files_matched = $search->files_matched;
print "Files Matched\n";
print "\t" . join("\n\t", @files_matched) . "\n";
print "Total Files:\t" . $search->file_cnt . "\n";
print "Directories:\t" . $search->dir_cnt . "\n";
my @files_replaced = $search->expression('weird')->files_replaced;


Commented:
Oops!
Sorry, forget my previous posts (especially the second).

I missed "in the filename"; my solutions change the file contents, not the names...

Commented:
Are all the file names that need to change in the same (sub) dir or root? Are there no files that don't need changing if they do contain illegal characters?

If not, pipe a find to a perl script which will rename them:

find | ./charl.pl

One charl.pl coming up, give it five minutes.

Commented:
#!/usr/bin/perl

my $sGoodChars="a-zA-Z0-9\-\.\_\/"; # Add acceptable characters here

while (<STDIN>)
{
  chomp $_;

  my $sFileName=$_;

  if ($sFileName!~m/^[$sGoodChars]+$/) # Add any other acceptable characters
  {
    my $sNewName = $sFileName;

    $sNewName=~s/[^$sGoodChars]/_/g;

    print "Renaming $sFileName to $sNewName\n";

    # rename $sFileName, $sNewName;

    # Might want to run it once, give the resulting list a
    # quick look-over to make sure it's okay
  }
}

The only thing I'd add to binkzz's solution is a check to prevent renaming onto an existing file. Put something like these lines before the print...

my $disambiguator = "_000";
while( -e $sNewName ) {
  ++$disambiguator;
  $sNewName =~ s/_\d{0,3}/$disambiguator/;
 }

Commented:
Good call!

Though shouldn't it be:

#

my $disambiguator="000";
while( -e $sNewName ) {
 ++$disambiguator;
 $sNewName =~ s/_\d{0,3}/_$disambiguator/;
}

#

as ++"000" eq "001",
but ++"_000" eq "1" ?
Or something. I was recently impressed by the "magical" properties of the increment operator when given strings that were not strictly numbers and got carried away without testing. At least, we don't appear to need an 'e' on the end of the substitution -- something I had thought about but forgot.
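
For anyone who wants to check the increment behaviour discussed above, here is a minimal sketch. The variable names are made up for the demo; the rule it relies on is the one in perlop: a string that matches /^[a-zA-Z]*[0-9]*\z/ and has only been used as a string is incremented as a string, anything else is incremented as a number.

```perl
use strict;
use warnings;

# A pure-digit string keeps its width: the increment is done as a string.
my $plain = "000";
$plain++;
print "$plain\n";

# The leading underscore breaks the pattern, so Perl falls back to a
# numeric increment: "_000" numifies to 0, and 0 + 1 is 1.
my $prefixed = "_000";
{
    no warnings 'numeric';    # silence the "isn't numeric" warning for the demo
    $prefixed++;
}
print "$prefixed\n";

# Letters carry over as well.
my $letters = "az";
$letters++;
print "$letters\n";
```

So the counter has to be kept free of the underscore, and the underscore added back when the name is built.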

Nice problem :) But you have forgotten some things:
1. Windows clients may use Unicode characters in filenames, and since Unix forbids only two characters in filenames (/ and \0), it's quite possible for them to do so.
2. A lot of existing products have filenames hardcoded into them, so if you have any software distribution package on that file server, better find a better OS option for the new one (users will not like it when the document database suddenly starts returning "file not found" errors; don't forget about linked XLS files, MDB databases, etc.).

NT is the worst choice for a file server, since it forbids many characters that are frequently used on Unix; e.g. EMPLOYEES_SNAPSHOT_207:19:47:55.ndx can't be handled by NT. Also, if NT's codepage does not match the one on the user's machine, all international characters get converted into NT's codepage, which turns filenames into total garbage (win2k has similar but less obvious bugs).

Commented:
Keeping TrickerThe1st's arguments in mind, I believe that this comes close to what you want:

use strict;
use File::Find;
my $okPat=q{a-zA-Z0-9\-\,\_@\.};
find({ wanted => \&cleaner, follow => 1 }, '.');

sub cleaner {
  my $old=$_;
  # only plain files
  return unless(-f $_);
  return unless s/[^$okPat]/_/g;
  if(-f $_) {
    # exists already?
    my $gen;
    1 while(-f $_.++$gen);
    $_.=$gen;
  }
  print qq{rename($old,$_);\n};
}

Adjust your $okPat and run it in a test-directory. It will print all the rename() calls without doing anything.

Once you feel confident change the last line to:

rename($old,$_) || warn "trouble renaming $File::Find::name - $!";

Hope this helps
  Tobias

Commented:
And for cleanliness' sake you may also want to replace the last two "-f" with "-e" :-)
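
To try out a whitelist and the collision check without touching any files, the two steps could be pulled into small functions, roughly as sketched below. This is not Tobias's code: clean_name, unique_name, $ok_chars and the "_1", "_2" suffix scheme are hypothetical choices for the sketch. unique_name takes an optional callback so it can be tested against a fake list of names; by default it uses -e, as suggested above.

```perl
use strict;
use warnings;

# Assumed whitelist; adjust to taste. The hyphen comes last so it is literal.
my $ok_chars = 'a-zA-Z0-9_,.@-';

# Replace every character outside the whitelist with an underscore.
sub clean_name {
    my ($name) = @_;
    (my $clean = $name) =~ s/[^$ok_chars]/_/g;
    return $clean;
}

# Return $name if it is free, otherwise $name with _1, _2, ... inserted
# before the extension until $exists->($candidate) is false.
sub unique_name {
    my ($name, $exists) = @_;
    $exists ||= sub { -e $_[0] };    # default: ask the real filesystem
    return $name unless $exists->($name);

    my ($base, $ext) = $name =~ /^(.*?)(\.[^.\/\\]*)?$/;
    $ext = '' unless defined $ext;
    my $n = 1;
    $n++ while $exists->("${base}_$n$ext");
    return "${base}_$n$ext";
}

print clean_name('monthly^??report.xls'), "\n";
```

Inside a File::Find callback one would call unique_name(clean_name($_)) and only rename when the result differs from the original name.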

Author

Commented:
In reply to binkzz's question " Are all the file names that need to change in the same (sub) dir or root? Are there no files that don't need changing if they do contain illegal characters?"

All the files are in one directory and dozens of descending subdirectories.  If *any file* has these characters it needs to be modified.  Files that have no special characters in the  filename shouldn't be touched.

In reply to trickerthe1st:  all the filenames that i am worrying about are data files--not applications, so the hard-coded names thing shouldn't be a problem.  Linked documents on the other hand might cause problems... Most of these files are the work of graphic designers... but that's a problem easily fixed most of the time.

Thanks to everybody for your help and insight.  I appreciate it!

Commented:
charl - the code I posted will be sufficient for what you need in that case. It prints out everything it renames, so you can run it and redirect the output to a file like this:

find | ./charl.pl > rename.log

where rename.log would keep a record of all the renamed files (and their original name)

Run it without un-commenting the rename command and have a look at the output to see if it's alright.



#!/usr/bin/perl

my $sGoodChars="a-zA-Z0-9\-\.\_\/"; # Add acceptable characters here

while (<STDIN>)
{
 chomp $_;

 my $sFileName=$_;

 if ($sFileName!~m/^[$sGoodChars]+$/) # Add any other acceptable characters
 {
   my $sNewName = $sFileName;

   $sNewName=~s/[^$sGoodChars]/_/g;

   while( -e $sNewName ) { $sNewName.="_"; }

   print "Renaming $sFileName to $sNewName\n";

   # rename $sFileName, $sNewName;

   # Might want to run it once, give the resulting list a
   # quick look-over to make sure it's okay
 }
}

Author

Commented:
Hey Binkzz,

I tried your script but I couldn't get it to work.

And I forgot to tell you one important thing (please don't throw rocks)...  I'm doing this with Active Perl on Windows (w2k server that i'm trying to get rid of).  I haven't had too much problem with interoperability in the past but this might be an issue.

When the script is invoked it will hang at the command line and appear to do nothing.  the convert.log file is created but empty, and no files are renamed.

??

Commented:
It would do that from a Windows prompt.

Would it be a problem for you to install CygWin?

This imitates a unix environment on a windows box, which is excellent for these kind of solutions

If that's a problem, let me know and I'll post an ActiveState dos prompt version

Author

Commented:
I don't know much about cygwin, and this is a production server where i have activestate perl already installed...  

Would you mind posting a version that works with activestate?

Also, would any changes need to be made to the code snippet that was suggested i add in to avoid renaming onto an existing file?

--Chris

I think it would be a good idea to write a Perl script that produces a list of files to rename (at least you will easily be able to verify everything).
The file contents can be something like this (Windows NT; note that the rename command takes a path only for the source, the target is a bare file name):

rename "c:\documents\monthly???report.xls" "monthly___report.xls"

and so on
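
A minimal sketch of how such a list could be generated is below. The helper name rename_command and the sample file name are made up for illustration. It assumes the cmd.exe rename syntax, where only the source may carry a drive and path and the target is a plain file name in the same directory.

```perl
use strict;
use warnings;

# Hypothetical helper: build one line for a batch file.
# $dir must end with a backslash; $old and $new are bare file names.
sub rename_command {
    my ($dir, $old, $new) = @_;
    return qq{rename "$dir$old" "$new"};
}

# Sample values only; a real script would loop over the names it found.
print rename_command('c:\\documents\\', 'monthly^report.xls', 'monthly_report.xls'), "\n";
```

Writing the lines to a file first lets you read through them before anything is renamed.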

Commented:
I've modified the code to work on Windows and to prevent duplicate file names. It's all working as it should, though I've deliberately not set it to rename (sub)directories. If you do want those renamed as well, let me know.

Change the first line of the script to where your perl is situated (or just #!perl with activestate I believe).

I'm running Windows 2000, and it let me pipe the output of a simple dir to this, the command I used was
"dir /b/s | perl charl.pl"
If your server doesn't support piping out, there are some easy ways around this. Let me know how you're getting on.

#!/usr/bin/perl

my $sGoodChars="a-zA-Z0-9\-\.\_\/:\\\\"; # Add acceptable characters here. The 4 backslashes (\\\\) effectively only count as one.

while (<STDIN>)
{
  chomp $_;

  my $sFileName=$_;

  $sFileName =~ s/^(.*\\)//;

  my $sDir=$1;

  if (  ($sFileName!~m/^[$sGoodChars]+$/) # Add any other acceptable characters
      &&(!-d "$sDir$sFileName") ) # Only files, no dirs
  {
    my $sNewName = $sFileName;

    $sNewName=~s/[^$sGoodChars]/_/g;

    while( -e "$sDir$sNewName" )
    {
      $sNewName .= "_";
    }

    print "Renaming $sDir$sFileName to $sDir$sNewName\n";

    # rename "$sDir$sFileName", "$sDir$sNewName";

    # Might want to run it once, give the resulting list a
    # quick look-over to make sure it's okay
  }
}

Author

Commented:
Well... it works and it doesn't.  If I modify the "good characters" to take out capital letters (for example) the script works fine.  But it won't take out the funky characters--even though it claims to.  here's a snip from the log:

Renaming F:\exp\Terry w?students to F:\exp\Terry w_students
Renaming F:\exp\woman?globe.out.d to F:\exp\woman_globe.out.d

The funky characters show up as ? in the dos window and in the result log.  When you look at the properties sheet the funky character shows up as a vertical line slightly thicker and shorter than the pipe character.  When looking at it in an explorer window you can only barely detect it in the filename.

The way it renames in the log is ***exactly*** what i want it to look like.

Any ideas?

One day my life will be easier.

Commented:
It is possible that Windows won't allow you to rename the file itself (in which case you shouldn't be able to rename it from Explorer either, or even delete it). If that's not the case, and it's a fairly small file, is there any chance you could zip and e-mail me one or two of the files that don't rename so I can add them to the folders I used for testing?

My mail is binkzz@powernet.co.uk

Author

Commented:
binkzz,  I just confirmed the cause of the funky characters.  I have a win2k server running svcs for macintosh.  My mac users have not been following the rules of naming files in a pc-friendly format.  The funny thing is that i can create a file named @#$%^&?(){}[] on a mac and copy it to the PC.  Of course, you can't create a file like that on a PC.

And here's the interesting thing.... when you run the script on a file named @#$%^&?(){}[], it returned
_#$%_&_()____.  So even though symbols aren't designated as good characters, they remain behind.  

As far as mailing some of the files to you, i honestly don't know how to do it.  B/c of the special characters, you can't really do anything with them in any windows program--even zip them into an archive.  Here's what I'll do.  go to http://130.207.150.176/funkychars/ and i'll dump some of these files on my mac web server to see if you can download them.  

--chris

Commented:
Thanks, I'll download them as soon as you set up read permissions ;)
Commented:
My browser won't allow me to download them as is. You could try to pack them with "sit", though I would expect Sit for Windows just unpacks them with windows acceptable names.

[quote]
And here's the interesting thing.... when you run the script on a file named @#$%^&?(){}[], it returned
_#$%_&_()____.  So even though symbols aren't designated as good characters, they remain behind.
[/quote]

This is what confuses me - it should either rename it properly, or not at all. I don't get how it renames just a few characters instead of the whole filename. Does the entry in the log reflect the change exactly, or does the log claim it renamed it properly?

Also - try changing:

   print "Renaming $sDir$sFileName to $sDir$sNewName\n";
   rename "$sDir$sFileName", "$sDir$sNewName";

to

  my $sResult = rename("$sDir$sFileName", "$sDir$sNewName");
  print "Renamed $sDir$sFileName to $sDir$sNewName ($sResult)\n";

to capture any errors it might give ($sResult is true on success and false on failure).
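
To get more detail than a true/false result, something along these lines might help. Both helpers, try_rename and hex_bytes, are hypothetical names for this sketch. try_rename reports the system error text from $!, and hex_bytes shows the bytes the script actually received for a name. If the odd character arrives as 3f (a literal question mark), that would suggest the name was already altered before Perl saw it, though that is only a guess and not something confirmed here.

```perl
use strict;
use warnings;

# Hypothetical helper: show each byte of a name as two hex digits.
sub hex_bytes {
    my ($name) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $name;
}

# Hypothetical helper: attempt the rename and report the OS error text.
sub try_rename {
    my ($old, $new) = @_;
    return rename($old, $new) ? "ok" : "failed: $!";
}

print hex_bytes('a?b'), "\n";
```

In the loop, printing hex_bytes($sFileName) next to the result of try_rename("$sDir$sFileName", "$sDir$sNewName") would put both pieces of information in the log.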

Author

Commented:
Binkzz gave me the answer I needed!  Awesome!
