new_perl_user

Reading Files in perl

Hi,
 Can anyone please help me with a code snippet for the following in Perl?

Logic:

Recursively read the folders and subfolders under a path. From these folders and subfolders, pick out the .DOC documents, open them, and search for the keywords "custom cost" and "Financial Cost".

If a keyword is found, print that "filename" and "keyword" to a log file.
sshah254

Why bother with Perl?

Just do

grep -rl "Financial Cost" . > "Financial_Cost.txt"

This lists every file under the current directory that contains "Financial Cost".

Do the same for the other keyword (writing to either the same file or a different one).

Ss
Of course, you can do

grep -ril ....

if you want the search to be case-insensitive.

Ss
new_perl_user

ASKER

I need this as a script because it will be automated as a cron job and will run every other day against approximately 1000 documents.
The platform is Windows.
FishMonger
Use the File::Find or File::Find::Rule module.

Here's a skeleton script and as such would need to be extended to meet all of your needs, but it's all I have time to do at this point.
#!/usr/bin/perl

use strict;
use warnings;
use File::Find;

open my $log_fh, '>', 'file.log' or die "failed to open 'file.log' $!";

find(\&wanted, '.');

sub wanted {
    -f and /\.doc\z/i or return;
    
    open my $fh, '<', $_ or die "failed to open '$File::Find::name' $!";
    
    if ( grep( /(custom cost|Financial Cost)/, <$fh> ) ) {
        print $fh "$File::Find::name: $1\n";
    }
    
}


print $fh "$File::Find::name: $1\n";

should have been

print $log_fh "$File::Find::name: $1\n";
Tried the above code as below. It is generating an empty log file with no data.

#!/usr/local/bin/perl

use strict;
use warnings;
use File::Find;

open my $log_fh, '>', 'file.log' or die "failed to open 'file.log' $!";

find(\&wanted, 'C:\Documents and Settings\paul\Desktop\New Folder');

sub wanted {
    -f and /.doc\z/i or return;
   
    open my $fh, '<', $_ or die "failed to open '$File::Find::name' $!";
   
    if ( grep( /(custom cost)/, <$fh> ) ) {
        print $log_fh "$File::Find::name: $1\n";
    }
   
}

Any more suggestions or help, please?
Change the wanted sub to this, and try it again.
sub wanted {
    -f and /\.doc\z/i or return;
    
    open my $fh, '<', $_ or die "failed to open '$File::Find::name' $!";
    
    while ( my $line = <$fh> ) {
        if ( $line =~ /(custom cost|Financial Cost)/ ) {
            print $log_fh "$File::Find::name: $1\n";
        }
    }
    close $fh;
}


Hi,
 It is not working yet. I am attaching a test document I am trying to read by using the below code.

use strict;
use warnings;
use File::Find;

open my $log_fh, '>', 'file.log' or die "failed to open 'file.log' $!";

find(\&wanted, '/usr/local/Files');

sub wanted {
    -f and /.doc\z/i or return;
    
    open my $fh, '<', $_ or die "failed to open '$File::Find::name' $!";
    
    while ( my $line = <$fh> ) {
        if ( $line =~ /(Custom Cost|Financial Cost)/ ) {
            print $log_fh "$File::Find::name: $1\n";
        }
    }
    close $fh;
}


test.docx
That's not a plain text file, which is why it's not working.

Either save the files as plain text, or use one of the Win32 modules designed for parsing Windows Word documents.

The primary module most people use is Win32::OLE, but in this case Text::Extract::Word looks like it would be easier to use.  I have not used either of them.

Win32::OLE - http://search.cpan.org/~jdb/Win32-OLE-0.1709/lib/Win32/OLE.pm
Text::Extract::Word - http://search.cpan.org/~snkwatt/Text-Extract-Word-0.02/lib/Text/Extract/Word.pm
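
One way to see why the line-by-line read finds nothing is to look at the first bytes of the file. The sketch below is only an illustration: sniff_word_format is a hypothetical helper name, not part of either module above. It relies on two well-known signatures: legacy .doc files are OLE2 compound documents, and .docx files are zip archives.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module mentioned in this thread):
# guess the container format of a file from its first bytes.
sub sniff_word_format {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "failed to open '$path' $!";
    my $read = read $fh, my $magic, 8;
    close $fh;
    return 'unknown' unless defined $read && $read >= 4;

    # Legacy .doc files are OLE2 compound documents.
    return 'ole2' if $magic eq "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";
    # .docx files are zip archives, which start with "PK\x03\x04".
    return 'zip' if substr( $magic, 0, 4 ) eq "PK\x03\x04";
    return 'unknown';
}

# Report the guessed format of each file named on the command line.
print "$_: ", sniff_word_format($_), "\n" for @ARGV;
```

A result of 'ole2' or 'zip' means the file is a binary container, so matching against raw lines is unlikely to work and a parser is needed.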

So I need to install the above module and do something like this?

use strict;
use warnings;
use File::Find;

use Win32::OLE;
use Win32::OLE::Const 'Microsoft Word';
use warnings;


  my $word = new Win32::OLE 'Word.Application','' or die "Cannot start word!\n";

open my $log_fh, '>', 'file.log' or die "failed to open 'file.log' $!";

find(\&wanted, '/usr/local/Files');

sub wanted {
    -f and /.doc\z/i or return;
    
    open my $fh, '<', $_ or die "failed to open '$File::Find::name' $!";
    
    while ( my $line = <$fh> ) {
        if ( $line =~ /(Custom Cost|Financial Cost)/ ) {
            print $log_fh "$File::Find::name: $1\n";
        }
    }
    close $fh;
}


That's a start, but the wanted sub will need to be changed so that it utilizes the module for parsing the file.

This is the direction I'd look into, but it produces an error that needs to be fixed, and I don't have time right now to work on it.
#!/usr/bin/perl

use strict;
use warnings;
use File::Find;
use Text::Extract::Word;

open my $log_fh, '>', 'file.log' or die "failed to open 'file.log' $!";

find(\&wanted, '.');

sub wanted {
    /\.docx?\z/i or return;
    
    my $file = Text::Extract::Word->new($_);
    my $body = $file->get_body();

    # let's see if it got parsed correctly
    print "$File::Find::name:\n", $body;
    
    #while ( my $line = <$fh> ) {
    #    if ( $line =~ /(Custom Cost|Financial Cost)/i ) {
    #        print "$File::Find::name: $1\n";
    #    }
    #}
    #close $fh;
}


ASKER CERTIFIED SOLUTION
FishMonger
This solution is only available to members.
Hi,
 I tried the above suggestions, but it is still generating an empty log file.
Can you also specify where the documents should reside?
The doc files need to reside in/under the directory specified in the call to the find() sub.  In my example that would be in or a sub dir under the current working directory.
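
As a rough illustration of how the starting directory relates to what the callback sees, here is a minimal sketch using only the core File::Find module. The helper name find_word_docs is made up for this example.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: collect every .doc/.docx file in or under $top.
sub find_word_docs {
    my ($top) = @_;
    my @found;
    find(
        sub {
            # Inside the callback, $_ is the bare file name (find() has
            # chdir'ed into the directory being scanned), while
            # $File::Find::name is the path including the starting directory.
            return unless -f $_ && /\.docx?\z/i;
            push @found, $File::Find::name;
        },
        $top,
    );
    return sort @found;
}

# List the Word documents under each directory named on the command line.
print "$_\n" for map { find_word_docs($_) } @ARGV;
```

Running it with the same directory you pass to find() in your script should show whether the documents are being picked up at all.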

That script will work if your doc files are in Microsoft's standard MS Word .doc format.  If you're using a newer version of Word, which creates .docx files in a different, compressed format, then you'll need to make the proper additions to the process_docx() sub.

D:\perl\new_perl>dir
 Volume in drive D is DATA
 Volume Serial Number is B676-6CE5

 Directory of D:\perl\new_perl

05/08/2011  09:39 AM    <DIR>          .
05/08/2011  09:39 AM    <DIR>          ..
05/05/2011  02:52 PM            16,384 test.doc
05/05/2011  03:26 PM               761 test.pl
               2 File(s)         17,145 bytes
               2 Dir(s)  207,021,109,248 bytes free

D:\perl\new_perl>test.pl

D:\perl\new_perl>dir
 Volume in drive D is DATA
 Volume Serial Number is B676-6CE5

 Directory of D:\perl\new_perl

05/08/2011  09:39 AM    <DIR>          .
05/08/2011  09:39 AM    <DIR>          ..
05/08/2011  09:39 AM                53 file.log
05/05/2011  02:52 PM            16,384 test.doc
05/05/2011  03:26 PM               761 test.pl
               3 File(s)         17,198 bytes
               2 Dir(s)  207,021,109,248 bytes free

D:\perl\new_perl>type file.log
./test.doc: Custom Cost
./test.doc: Financial Cost
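
On the .docx point, here is a rough sketch of what such an addition might look like. It is not the code from my script, and docx_text is a name invented for this example. It assumes the usual .docx layout, where the file is a zip archive and the main text lives in the member word/document.xml, and it uses the core IO::Uncompress::Unzip module. Stripping tags with a regex is crude (tabs, footnotes, headers and so on are ignored), but it may be enough for a keyword search.

```perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw($UnzipError);

# Hypothetical helper: return roughly the visible body text of a .docx file.
# Assumes the main document part is stored as word/document.xml.
sub docx_text {
    my ($path) = @_;
    my $z = IO::Uncompress::Unzip->new( $path, Name => 'word/document.xml' )
        or die "failed to read '$path': $UnzipError";
    my $xml = do { local $/; <$z> };    # slurp the whole member
    $z->close;
    $xml = '' unless defined $xml;

    $xml =~ s{</w:p>}{\n}g;    # end of a paragraph becomes a newline
    $xml =~ s{<[^>]+>}{}g;     # drop all remaining tags

    # Decode the five predefined XML entities in a single pass.
    my %entity = ( lt => '<', gt => '>', quot => '"', apos => "'", amp => '&' );
    $xml =~ s/&(lt|gt|quot|apos|amp);/$entity{$1}/g;
    return $xml;
}

# Print the extracted text of each .docx file named on the command line.
print docx_text($_) for @ARGV;
```

The returned string is UTF-8 encoded bytes, which is fine for plain ASCII keywords like the ones in this thread.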
Hi,

Thanks a lot for all the suggestions. With the help of your code snippet I was able to tweak the script a little, and it is finally working. The current working version is below.

#!/usr/local/bin/perl

use strict;
use warnings;
use File::Find;
use Win32::OLE;
use Text::Extract::Word;


my $string1 = " Note";
my $string2 = " Loss";

my %dispatch_code = (
    doc  => \&process_doc,
    docx => \&process_docx,
);

open my $log_fh, '>', 'file.log' or die "failed to open 'file.log' $!";

find(\&wanted, 'C:\Perl\Scripts\Test Documents');


sub wanted {
    
    # Only .doc is dispatched: process_docx is not defined in this snippet.
    /\.(doc)\z/i or return;
    $dispatch_code{ lc $1 }->($File::Find::name);
    
}


sub process_doc {
    
    my $filename = shift;
    my $file = Text::Extract::Word->new( $filename );
    my $body = $file->get_body();
    
    # \Q...\E matches the search strings literally.
    if ( $body =~ m/\Q$string1\E/i ) {
        print $log_fh "$filename | $string1 \n";
    }
    elsif ( $body =~ m/\Q$string2\E/i ) {
        print $log_fh "$filename | $string2 \n";
    }
}
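
Note that the if/elsif above logs at most one keyword per file. If every matching keyword should be logged, one possible variation is sketched below. The helper name matching_keywords and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return every keyword that occurs in $body,
# matched literally (\Q...\E) and case-insensitively.
sub matching_keywords {
    my ( $body, @keywords ) = @_;
    return grep { $body =~ /\Q$_\E/i } @keywords;
}

# Example with made-up text: only the second keyword is present.
my @hits = matching_keywords( "Total financial cost: 10",
    'Custom Cost', 'Financial Cost' );
print "example.doc | $_\n" for @hits;    # prints "example.doc | Financial Cost"
```

Inside process_doc, the same loop could print to $log_fh once per keyword found in $body.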
