• Status: Solved

complicated perl script - check / change name and archive

Hi

I've been fiddling with Perl for the last couple of days and am completely bamboozled by its intricacies. I'm finding this pretty hard as a Perl newbie, and unfortunately my boss has me on a deadline.

3 folders exist:
C:\Converted PDFs, C:\Approved, C:\Archive

I'm looking for a script which does the following:
Some users generate PDFs from PPTs and drop them in a folder, C:\Converted PDFs - that bit is fine :)

1. I need to check that these files are in the right format. For example, they may have a filename like
'Microsoft Excel - actual_filename.pdf'
'Microsoft Word - actual_filename.pdf'
'Microsoft Powerpoint - actual_filename.pdf'
where actual_filename can be random characters and underscores of no defined length (but probably not more than 25 characters),
and '.pdf' is the suffix.
For this part, strip off the 'Microsoft Excel - ' etc. prefix so that I'm left with the 'actual_filename.pdf' component.

2. Check if 'actual_filename.pdf' already exists in C:\Approved.

If NO MATCH is found, i.e. C:\Converted PDFs\FILE1.PDF does not exist in C:\Approved, copy FILE1.PDF directly to C:\Approved.

If a filename match occurs, i.e. (C:\Converted PDFs\FILE1.PDF = C:\Approved\FILE1.PDF),
then rename (the existing) C:\Approved\FILE1.PDF to include a time-date stamp so that it becomes
C:\Approved\FILE1_YYYY_MM_DD__HH-MM-SS.PDF

and move
C:\Approved\FILE1_YYYY_MM_DD__HH-MM-SS.PDF to C:\ARCHIVE\FILE1.PDF.YYYY_MM_DD__HH-MM-SS

then move the new FILE1.PDF to the C:\Approved folder.
(loop for each file found in C:\Converted PDFs)
---

That's it.
What I have so far probably isn't of much use; it is mainly gleaned from different postings and from the O'Reilly Perl Cookbook.


----
#!/usr/bin/perl -wall

use strict;
use Getopt::Std;
use File::Basename;
use File::Copy;

# static values - these can be parameterised for use elsewhere

############## GLOBAL VARIABLES ##############

my $src="C:\\Generated PDF" ;
my $dest="C:\\Approved";
my $arch="C:\\Archive" ;

#########################################

my @firstDirNew;
my @secondDirNew;
my @unmatched;
my $unmatcheditem;
my $i = 0;


    opendir(APPROVED, $dest);
    my @firstDir = readdir APPROVED;
    closedir APPROVED;
    opendir(SOURCE, $src);
    my @secondDir = readdir SOURCE;
    closedir SOURCE;

    #assign the directory lengths to a variable;
    my $length1 = @firstDir;
    my $length2 = @secondDir;

    #call the sub;
    compareDirs();

    #print the results;
    print "APPROVED Directory contains $length1 files.\n";
    print "SOURCE Directory contains $length2 files.\n";
    if (@unmatched) {   # true when @unmatched holds at least one item
    print "Items in the APPROVED Directory not in the SOURCE Directory:\n";
    foreach $unmatcheditem (@unmatched) {
    print $unmatcheditem . "\n";
    }
    }

    sub compareDirs {
    my $filename;
    my %seen = ();
    #build a lookup table
    foreach $filename (@secondDir) { $seen{$filename} = 1 }
    #find only elements in @firstDirNew not in @secondDirNew;
    foreach $filename (@firstDir) {
    unless ($seen{$filename}) {
    #it’s not in %seen, so add to @unmatched
    push(@unmatched, $filename);
    }
    }
    }




-------------

#             case where files do exist in the destination:
#            datestamp each file
#            move existing file from $dest to archive
#            copy file from $src to $dest

#!/usr/bin/perl -w

use strict ;
use POSIX ;
use File::Copy ;
use File::Basename ;
use File::stat ;

chdir "$arch"
   or die "Can't chdir to archive directory [ C:\\archive ] $!\n" ;

for my $file (<*.pdf>)
  { my ($name,$path,$suffix) = fileparse($file,"\.pdf") ;
   my $info = stat($file);
   #       This time is the time when the file was last *MODIFIED*
   #       because you use "$info->mtime".
   #      When you want to have the date the file was last accessed,
   #      you have to use "$info->atime"
    # colons are not allowed in Windows file names, so use hyphens in the time part
    my $datestamp = strftime(".%Y-%m-%d_%H-%M-%S", localtime($info->mtime));
    copy $file,"W:\\archive\\$name$datestamp$suffix"
      or warn "Cannot copy $file $!\n" ;
  }

-----------

#      Case where files do NOT exist in the destination - move the files from the array to the destination

#!/usr/bin/perl
use File::Copy;

my $movefile;
my (@dirfile, @movelist);

# DO I HAVE A MOVELIST, OR IS THIS CREATED FROM SOME OTHER LOOP?
open (MOVELIST, 'path_to_movelist') or die "Can't open movelist: $!\n";
my $srcpath = '$src';
my $movepath = '$dest';

while (<MOVELIST>) {
chomp $_;
push (@movelist, $_);
}
close MOVELIST;
foreach $movefile(@movelist){
$movefile .= ".pdf";
print "$srcpath$movefile"." -- ". "$movepath$movefile"."\n";
copy("$srcpath$movefile", "$movepath$movefile");
}


Ideally I'd like everything in one file, so I only need to poll one file.

That's it. Any help you can give is greatly appreciated :)))
Asked by: sr1xxon
 
Adam314 Commented:
For step 2, when you have to add a date/time stamp to the file, which date/time do you want to use?  The current date/time?  The file's creation date?  The file's modified date?

Adam314 Commented:
Here is a script you can use... you'll have to update the date/time stamp part......



#!/usr/bin/perl
use File::Copy;

############## GLOBAL VARIABLES ##############

my $src="C:\\temp_ee\\Generated PDF\\" ;
my $dest="C:\\temp_ee\\Approved\\";
my $arch="C:\\temp_ee\\Archive\\" ;


#Get list of source files... save the actual name, and the base name (without .pdf) to use in $dest
opendir(DIR,$src) or die "Couldn't open dir '$src': $!\n";
while($File=readdir(DIR)){
      next if !-f "$src$File";  #skip non-files (eg: directories)
      if($File =~ /Microsoft (\w+) - ([\w_]+)\.pdf/){
            push @SrcFiles, [$File, $2];  #$1 is the application name, $2 is the real file name
      }
      else {
            #What to do with invalid filenames?
            print "Invalid source filename: $File\n";
      }
}
closedir(DIR);


#Process each file
foreach $File (@SrcFiles){
      if(-e "$dest$$File[1].pdf"){
            #File exists, move it with timestamp
            $Timestamp="What_to_use_for_a_timestamp";
            rename "$dest$$File[1].pdf","$arch$$File[1]_$Timestamp.pdf";
      }
      #$$File[0] is the full source name and already ends in .pdf
      copy "$src$$File[0]", "$dest$$File[1].pdf";
}

GnarOlak Commented:
Here is a very stripped down and untested piece of code that does about what you need:

#!/usr/bin/perl -wall

use strict;

############## GLOBAL VARIABLES ##############

my $src="C:\\Generated PDF" ;
my $dest="C:\\Approved";
my $arch="C:\\Archive" ;

#########################################

my ($second, $minute, $hour, $day, $month, $year) = localtime(time);

while (my $file = <$src\\*.pdf>)
{
   # strip of the directory name and check for correct name
   next if ($file !~ s/$src\\Microsoft .* - //);
   # this line does three things.
   # First is tests that $file looks like "C:\Generated PDF\Microsoft <...> - "
   # If it doesn't then it skips to the next file.
   # If it does then it strips all of that from the filename and goes on

   if (-e $dest\\$file)  # check if the file exists in the destination dir
   {
      # file exists so move it adding time stamp
      system ("move $dest\\$file $arch\\$file.$year_$month_$day__$hour-$minute-$second");
   }
   # and move the new file into the destination dir
   system ("move $src\\$file $dest");
}

Note that for simplicity I've used the "system" call to do the actual move.  You are free to swap that out for something better.  Also, I've used the <> glob operator and just loaded the pdf files rather than opendir/readdir.
 
Adam314 Commented:
GnarOlak -

A few things about your code:

This line:
next if ($file !~ s/$src\\Microsoft .* - //);
This would allow spaces in the filename... maybe sr1 wanted to allow this or not, don't know.  But if there are spaces, the filenames need to be escaped in the system commands.

This line:
system ("move $src\\$file $dest");
$file has been changed, from "Microsoft <...> - file_name.pdf" to "file_name.pdf", so "$src\\$file" doesn't exist - it still has the original name on the file system.

Otherwise, I think what you've done will work.

sr1xxon (Author) Commented:
Hi Adam,

your script doesn't appear to work - perhaps I'm doing something wrong..

I have 2 test files -
C:\Generated_PDF\TEST1.PDF
C:\Generated_PDF\Microsoft Word - TEST2.PDF

your script fails with
'invalid source filename: Microsoft Word - TEST2.PDF'
'invalid source filename: TEST1.PDF'
 - any ideas what I'm doing wrong??

Adam, I think (hope) this will work with your script:
     $timestamp = strftime("_%Y_%m_%d__%H-%M-%S",localtime) ;

GnarOlak,

your script isn't working either..
--
Backslash found where operator expected at gnalorlak.pl line 24, near "$dest\"
        (Missing operator before \?)
syntax error at gnalorlak.pl line 24, near "$dest\"
Global symbol "$year_" requires explicit package name at gnalorlak.pl line 27.
Global symbol "$month_" requires explicit package name at gnalorlak.pl line 27.
Global symbol "$day__" requires explicit package name at gnalorlak.pl line 27.
syntax error at gnalorlak.pl line 31, near "}"
Execution of gnalorlak.pl aborted due to compilation errors.

Thanks for your prompt responses :) - I suppose it's easy to do when you know how -


FYI, I want to use the current time in the file, not the last modified time.

in terms of the originating filename, they will be like 'Microsoft Project - ALAN_Q1_04.PDF' - there WILL be spaces in the source filename, however the approved files will always be like 'ALAN_Q1_04.PDF'


GnarOlak Commented:
As I said, "untested".  Haste makes waste as the saying goes :)

This line:

if (-e $dest\\$file) # check if the file exists in the destination dir

needs quotes:

if (-e "$dest\\$file") # check if the file exists in the destination dir
 

This line:

system ("move $dest\\$file $arch\\$file.$year_$month_$day__$hour-$minute-$second");

won't work because the _ characters are valid in variable names.  
Instead it should look something like this:

system ("move $dest\\$file $arch\\$file.$year" . '_' . $month . '_' . $day . '__' . "$hour-$minute-$second");
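Another way to stop Perl from reading the underscore as part of the variable name is to wrap each name in braces, or to build the stamp with sprintf. A minimal sketch, using made-up sample values rather than the real localtime() results:

```perl
use strict;
use warnings;

# Fixed sample values (hypothetical, for illustration only).
my ($year, $month, $day, $hour, $minute, $second) = (2006, 3, 7, 14, 5, 9);

# Braces mark where each variable name ends, so "_" is treated as literal text.
my $stamp = "${year}_${month}_${day}__${hour}-${minute}-${second}";
print "$stamp\n";

# sprintf builds the same stamp and zero-pads each field.
my $padded = sprintf("%04d_%02d_%02d__%02d-%02d-%02d",
                     $year, $month, $day, $hour, $minute, $second);
print "$padded\n";
```

The zero-padded form keeps archived names the same length, which makes them sort in date order.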

Adam314 Commented:
with my script... make sure these end in a backslash:


my $src="C:\\Generated PDF\\" ;
my $dest="C:\\Approved\\";
my $arch="C:\\Archive\\" ;


sr1xxon (Author) Commented:
HI

I cut and pasted the scripts and ran them from the command line - I've got activestate perl v5.8.7 installed..

the errors posted previously are those output to the console.

the double backslashes are in the script as recommended - any ideas?

do I need any additional modules installed?





Adam314 Commented:
With my code, you need the File::Copy module - which you must have or you'd get an error about that.

These messages:
'invalid source filename: Microsoft Word - TEST2.PDF'
'invalid source filename: TEST1.PDF'

Tell you that you have files in the "Generated PDF" folder that have invalid filenames: meaning they are not in this format:
"Microsoft <app> - The_real_file_name.pdf"

"Microsoft Word - TEST2.PDF" - this fails because the extension is capitalized
"TEST1.PDF" - this fails because it doesn't start with "Microsoft <app> - ", and the extension is capitalized.

To ignore the case, change this line:
    if($File =~ /Microsoft (\w+) - ([\w_]+)\.pdf/){
To this (notice the additional "i"):
    if($File =~ /Microsoft (\w+) - ([\w_]+)\.pdf/i){

I'm not sure what you want to do with files that aren't in the right format, like "TEST1.PDF" for example.  Right now, it just prints a message.  What would you like to do with these?  What if the filename was "Test for adam.pdf" or "temp1.txt" (not a .pdf file) or anything else?
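To see the effect of the "i" modifier on the test file names mentioned in this thread, the two patterns can be tried side by side. This is only a small sketch; the helper name check is made up for the example:

```perl
use strict;
use warnings;

# The same pattern without and with the "i" (ignore case) modifier.
my $case_sensitive   = qr/Microsoft (\w+) - ([\w_]+)\.pdf/;
my $case_insensitive = qr/Microsoft (\w+) - ([\w_]+)\.pdf/i;

# Returns "match" or "no match" for a file name against a compiled pattern.
sub check {
    my ($name, $pattern) = @_;
    return $name =~ $pattern ? 'match' : 'no match';
}

for my $name ('Microsoft Word - test2.pdf', 'Microsoft Word - TEST2.PDF', 'TEST1.PDF') {
    printf "%-28s without i: %-8s  with i: %s\n",
        $name, check($name, $case_sensitive), check($name, $case_insensitive);
}
```

"TEST1.PDF" is rejected by both patterns because it has no "Microsoft <app> - " prefix; the modifier only changes how the capitalized extension is treated.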

GnarOlak Commented:
Adam314's comments about my code are correct.

Also, I can't seem to make the file glob operator work with a directory with a space in the name.

So I re-wrote the script using opendir and this time I tested it.

Here for your scrutiny is my offering:

#!/usr/bin/perl -wall

use strict;

############## GLOBAL VARIABLES ##############

my $src="C:\\Generated PDF";
my $dest="C:\\Approved";
my $arch="C:\\Archive";

#########################################

my ($second, $minute, $hour, $day, $month, $year) = localtime(time);
$year += 1900;
$month++;

opendir(SRC, $src);
my @src = readdir SRC;
closedir SRC;

foreach my $file (@src)
{
   next if ($file !~ /\.pdf$/);
   my $orig_file = $file;
   # strip of the directory name and check for correct name
   next if ($file !~ s/Microsoft .* - //);
   # this line does three things.
   # First is tests that $file looks like "C:\Generated PDF\Microsoft <...> - "
   # If it doesn't then it skips to the next file.
   # If it does then it strips all of that from the filename and goes on

   if (-e "$dest\\$file") # check if the file exists in the destination dir
   {
      # file exists so move it adding time stamp
      system ("move $dest\\$file $arch\\$file.$year" . "_$month" . "_$day" . "__$hour-$minute-$second");
   }
   # and move the new file into the destination dir
   system ("move \"$src\\$orig_file\" \"$dest\\$file\"");
}
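On the glob problem with a space in the directory name: Perl's built-in glob splits its pattern on whitespace, while bsd_glob from the core File::Glob module does not. The following is a rough sketch of that alternative, not something tested on the Windows setup in this thread. It builds a throwaway directory, the file names are invented, and it uses a forward slash in the pattern on the assumption that this is the safer separator for bsd_glob.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);
use File::Spec;
use File::Temp qw(tempdir);

# Build a temporary directory whose name contains a space.
my $root = tempdir(CLEANUP => 1);
my $src  = File::Spec->catdir($root, 'Generated PDF');
mkdir($src) or die "Can't create '$src': $!\n";

# Create a few empty sample files (names are made up).
for my $name ('a.pdf', 'Microsoft Word - b.pdf', 'notes.txt') {
    my $path = File::Spec->catfile($src, $name);
    open(my $fh, '>', $path) or die "Can't create '$path': $!\n";
    close $fh;
}

# bsd_glob treats the whole string as one pattern, spaces included.
my @pdfs = bsd_glob("$src/*.pdf");
print "$_\n" for @pdfs;
```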

sr1xxon (Author) Commented:
ok.. 90% there.

seemed to be a problem with the perl installation on my local pc.. tried on another pc and it worked fine..

GnarOlak, your code is working :)  nice one.
For the case where the source file doesn't contain the 'Microsoft <program_name> - ' prefix, (eg TEST1.PDF) the files aren't moved.
I want to handle the case where no prefix is present too: check against $dest, archive the existing file if there is one, and then move the new file to $dest, just like the working example for the 'Microsoft <program_name> - ' labelled files.

Adam314, this equally applies - where you have
          else{
          #What to do with invalid filenames?
          print "Invalid source filename: $File\n";}
I'd like to move non-prefixed .PDF files to the $dest instead of ignoring them.

GnarOlak Commented:
Are you saying that you want to check all pdf files and strip the MS program name if it exists as part of the file name?

Adam314 Commented:
Instead of
    print "Invalid source filename: $File\n";
put
    (my $Name = $File) =~ s/\.pdf$//i;  #the script adds .pdf to the second element itself
    push @SrcFiles, [$File, $Name];



GnarOlak Commented:
Do you want to archive non-pdf files with time stamps as well?

If so then the process becomes easier to deal with:
1.  Get list of all files in $src
2.  Strip the Microsoft stuff from the name creating destination file name.
3.  If a file with that name is in $dest then move the file from $dest to $archive adding timestamp.
4.  Move the original file from $src to $dest with the new name

This will result in all files moving through the three directories regardless of original name or extension.  Files with 'Microsoft ... - ' will have their name converted.

Is this what you are trying to accomplish?
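The four steps above can be sketched with File::Copy's move instead of system calls. This is a minimal sketch under stated assumptions, not a production script: the names approve_files and write_file are invented for the example, the directories are temporary ones created for the demo, and the prefix pattern is anchored and non-greedy, which is stricter than the pattern used elsewhere in this thread.

```perl
use strict;
use warnings;
use File::Copy qw(move);
use File::Spec;
use File::Temp qw(tempdir);
use POSIX qw(strftime);

# Moves every plain file from $src to $dest. A same-named file already in
# $dest is first moved to $arch as "<name>.<stamp>". Returns the new names.
sub approve_files {
    my ($src, $dest, $arch, $stamp) = @_;

    # Step 1: list the files in $src.
    opendir(my $dh, $src) or die "Can't open '$src': $!\n";
    my @files = sort grep { -f File::Spec->catfile($src, $_) } readdir $dh;
    closedir $dh;

    my @done;
    for my $orig (@files) {
        # Step 2: strip the "Microsoft <app> - " prefix, if present.
        (my $name = $orig) =~ s/^Microsoft .*? - //;

        # Step 3: archive an existing file of that name, adding the stamp.
        my $target = File::Spec->catfile($dest, $name);
        if (-e $target) {
            move($target, File::Spec->catfile($arch, "$name.$stamp"))
                or die "Can't archive '$target': $!\n";
        }

        # Step 4: move the original file into $dest under the new name.
        move(File::Spec->catfile($src, $orig), $target)
            or die "Can't move '$orig': $!\n";
        push @done, $name;
    }
    return @done;
}

# Hypothetical helper: create a small file with the given text.
sub write_file {
    my ($path, $text) = @_;
    open(my $fh, '>', $path) or die "Can't write '$path': $!\n";
    print {$fh} $text;
    close $fh;
}

# Demo in throwaway directories (one of them with a space in its name).
my $root = tempdir(CLEANUP => 1);
my ($src, $dest, $arch) =
    map { File::Spec->catdir($root, $_) } ('Generated PDF', 'Approved', 'Archive');
mkdir($_) or die "Can't create '$_': $!\n" for ($src, $dest, $arch);

write_file(File::Spec->catfile($dest, 'report.pdf'), 'old');
write_file(File::Spec->catfile($src, 'Microsoft Word - report.pdf'), 'new');

my $stamp = strftime("%Y_%m_%d__%H-%M-%S", localtime);
my @moved = approve_files($src, $dest, $arch, $stamp);
print "Approved: @moved\n";
```

Because move is called with the paths as separate arguments, no shell quoting is involved.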

sr1xxon (Author) Commented:
I only want to check for PDF's - but if there's a microsoft <whatever> - prefix, then strip that bit.

I'm not worried about anything that isn't a PDF - in that way, only PDF's will be approved for publication.

As there are only (read:should only be) PDF files in the approved ($dest) directory, only PDF's need to be timestamped and archived.

Presently there are about 90,000 files in the $dest directory, so hopefully searching will be fast.

I haven't tried either script running to production yet as I can't afford to get it wrong.



 

GnarOlak Commented:
Before testing any code in your production area I'd suggest copying all the files in your three directories into a set of three parallel directories just to be safe.

And this would be easy enough to test on any machine.  Just make the directories and copy a few test cases into it.

Here's a replacement snippet of code for my previous example:

foreach my $file (@src)
{
   next if ($file !~ /\.pdf$/);  # skip non-pdf files
   my $orig_file = $file;  # hold onto the original name
   $file =~ s/Microsoft .* - //;   # remove the MS stuff if it exists
   if (-e "$dest\\$file") # check if the file exists in the destination dir
   {
      # archive the file if it's in $dest
      system ("move $dest\\$file $arch\\$file.$year" . "_$month" . "_$day" . "__$hour-$minute-$second");
   }
   # and move the original file from $src to the new name in $dest
   system ("move \"$src\\$orig_file\" \"$dest\\$file\"");
}

Adam314 Commented:
I agree with GnarOlak that you should test in a backup - not just with this, but anytime your production area is too important to have things go wrong.  Even the best will occasionally have mistakes....

sr1xxon (Author) Commented:
I've been testing and things look ok
except if there is a space in the path, then the archive process fails.

I've tested with
$src="C:\pdfs"
$dest="C:\pdfs\approved"
$arch="C:\pdfs\archive"

works ok.

$src="C:\generated pdfs"
$dest="C:\generated pdfs\approved"
$arch="C:\generated pdfs\archive"
fails :(

this case moves files from $src to $dest, but doesn't archive. (ie move from $dest to $arch fails)
ie, if there is a space somewhere in the path, the files get moved to the destination OK, but they are not archived.
I think this is because the system call to move doesn't like spaces in the directory names.

The paths seem to be case sensitive, which is fine.
The path seems to accept an 8.3 filename eg. 'C:\Genera~1' and I thought that may have been a workaround, however this also fails for the archive step.

any ideas?

I'll copy the production directories and test against that.. so long as I can get archiving doing its thing :)

GnarOlak Commented:
I forgot to add the quotes there:

system ("move $dest\\$file $arch\\$file.$year" . "_$month" . "_$day" . "__$hour-$minute-$second");

should be:

system ("move \"$dest\\$file\" \"$arch\\$file.$year" . "_$month" . "_$day" . "__$hour-$minute-$second\"");

It's these tiny details that always trip you up :)
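One way to check this kind of quoting problem is to print the command string rather than run it. A small sketch with made-up sample values (the helper quoted and all the values are hypothetical):

```perl
use strict;
use warnings;

# Hypothetical sample values, chosen only for illustration.
my $dest  = 'C:\generated pdfs\approved';
my $arch  = 'C:\generated pdfs\archive';
my $file  = 'FILE1.pdf';
my $stamp = '2006_3_7__14-5-9';

# Wrap a path in double quotes so the command shell treats it as one argument.
sub quoted {
    my ($path) = @_;
    return qq{"$path"};
}

my $unquoted_cmd = "move $dest\\$file $arch\\$file.$stamp";
my $quoted_cmd   = join ' ', 'move', quoted("$dest\\$file"), quoted("$arch\\$file.$stamp");

# Print the commands instead of running them, to see what the shell would receive.
print "$unquoted_cmd\n";
print "$quoted_cmd\n";
```

In the first line the space inside "generated pdfs" splits each path in two, so move receives four arguments instead of two.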

sr1xxon (Author) Commented:
Sorted.

Finally have a working script, which runs for any tested scenario.
working script is as follows:

#!/usr/bin/perl -wall
use strict;
use POSIX;
use File::Copy;
use File::Basename;
########### GLOBAL VARIABLES ############
my $src="C:\\generated PDFS\\";
my $dest="C:\\generated PDFS\\approved\\";
my $arch="C:\\generated PDFS\\archive\\";
my $suffix=".pdf";
####################################

my $timestamp = strftime(".%Y-%m-%d-%H%M%S",localtime);

opendir(SRC, $src)
      or die "Can't open source directory $src $!\n";
my @src = readdir SRC;
closedir SRC;

foreach my $file (@src)
{
      next if ($file !~ /\.pdf$/);
      my $orig_file = $file;
      $file =~ s/Microsoft .* - //;
      my ($name) = fileparse($file,$suffix);
      if (-e "$dest$file")
      {
            # ARCHIVE PROCESS
            move ("$dest$file","$arch$name$timestamp$suffix")
                  or warn "ARCHIVE_ERROR moving approved file $file to archive : $!\n";
      }

      # APPROVE PROCESS
      move ("$src$orig_file","$dest$file")
            or warn "APPROVE_ERROR moving file $orig_file from $src to $dest : $!\n";
}

This works a treat. However, if you see any improvements (badly written routines which might affect runtime speed, etc.), I'd love to know your opinions. Thanks for all your input, it's been a learning experience.
