• Status: Solved

Reading Files in Perl

Hi,
Can anyone please help me with a code snippet for the following in Perl?

Logic:

Recursively read the folders and subfolders under a path. From these folders and subfolders, pick out the .DOC documents, open them, and search for the keywords "custom cost" and "Financial Cost".

If a keyword is found, print that file's name and the keyword to a log file.
Asked:
new_perl_user
1 Solution
 
sshah254 Commented:
Why bother with Perl?

Just do

grep -rl "Financial Cost" . > "Financial_Cost.txt"

This lists every file under the current directory that contains "Financial Cost".

Do the same for the other keyword (either into the same file or a different one).

Ss
 
sshah254 Commented:
Of course, you can do

grep -ril ....

if you want the search to be case-insensitive.

Ss
 
new_perl_user (Author) Commented:
I need this as a script because it will be automated as a cron job and will run every other day against approximately 1000 documents.
 
new_perl_user (Author) Commented:
And the platform is Windows.
 
FishMonger Commented:
Use the File::Find or File::Find::Rule module.

Here's a skeleton script. It will need to be extended to meet all of your needs, but it's all I have time to do at this point.
#!/usr/bin/perl

use strict;
use warnings;
use File::Find;

open my $log_fh, '>', 'file.log' or die "failed to open 'file.log' $!";

find(\&wanted, '.');

sub wanted {
    -f and /\.doc\z/i or return;
    
    open my $fh, '<', $_ or die "failed to open '$File::Find::name' $!";
    
    if ( grep( /(custom cost|Financial Cost)/, <$fh> ) ) {
        print $fh "$File::Find::name: $1\n";
    }
    
}

 
FishMonger Commented:
print $fh "$File::Find::name: $1\n";

should have been

print $log_fh "$File::Find::name: $1\n";
 
new_perl_user (Author) Commented:
I tried the code above, as shown below. It generates an empty log file.

#!/usr/local/bin/perl

use strict;
use warnings;
use File::Find;

open my $log_fh, '>', 'file.log' or die "failed to open 'file.log' $!";

find(\&wanted, 'C:\Documents and Settings\paul\Desktop\New Folder');

sub wanted {
    -f and /.doc\z/i or return;
   
    open my $fh, '<', $_ or die "failed to open '$File::Find::name' $!";
   
    if ( grep( /(custom cost)/, <$fh> ) ) {
        print $log_fh "$File::Find::name: $1\n";
    }
   
}

 
new_perl_user (Author) Commented:
Any more suggestions or help, please?
 
FishMonger Commented:
Change the wanted sub to this, and try it again.
sub wanted {
    -f and /\.doc\z/i or return;
    
    open my $fh, '<', $_ or die "failed to open '$File::Find::name' $!";
    
    while ( my $line = <$fh> ) {
        if ( $line =~ /(custom cost|Financial Cost)/ ) {
            print $log_fh "$File::Find::name: $1\n";
        }
    }
    close $fh;
}

 
new_perl_user (Author) Commented:
Hi,
It is still not working. I am attaching a test document that I am trying to read with the code below.

use strict;
use warnings;
use File::Find;

open my $log_fh, '>', 'file.log' or die "failed to open 'file.log' $!";

find(\&wanted, '/usr/local/Files');

sub wanted {
    -f and /.doc\z/i or return;
    
    open my $fh, '<', $_ or die "failed to open '$File::Find::name' $!";
    
    while ( my $line = <$fh> ) {
        if ( $line =~ /(Custom Cost|Financial Cost)/ ) {
            print $log_fh "$File::Find::name: $1\n";
        }
    }
    close $fh;
}

Attached file: test.docx
 
FishMonger Commented:
That's not a plain text file, which is why it's not working.

Either save the files as plain text, or use one of the modules designed for parsing Microsoft Word documents.

The primary module most people use is Win32::OLE, but in this case Text::Extract::Word looks like it would be easier to use.  I have not used either of them.

Win32::OLE - http://search.cpan.org/~jdb/Win32-OLE-0.1709/lib/Win32/OLE.pm
Text::Extract::Word - http://search.cpan.org/~snkwatt/Text-Extract-Word-0.02/lib/Text/Extract/Word.pm
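
To see why a line-by-line read cannot work, it may help to look at the first few bytes of a file. The sketch below is not from the thread: sniff_word_format() is a hypothetical helper name, and it relies only on the well-known file signatures (legacy .doc files are OLE2 compound files, .docx files are zip archives). Other Office formats share these signatures, so treat the result as a hint rather than proof.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of any module mentioned in this thread):
# classify a file by its leading bytes instead of trusting the extension.
sub sniff_word_format {
    my ($path) = @_;

    open my $fh, '<:raw', $path or die "failed to open '$path' $!";
    my $got = read $fh, my $magic, 8;
    defined $got or die "failed to read '$path' $!";
    close $fh;

    # OLE2 compound file signature: legacy binary .doc (also .xls, .ppt)
    return 'doc'  if $magic eq "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";

    # zip local file header: .docx (and any other zip archive)
    return 'docx' if substr( $magic, 0, 4 ) eq "PK\x03\x04";

    # anything else, including plain text and empty files
    return 'other';
}
```

A file that comes back as 'doc' or 'docx' is binary or compressed, so a plain regex over its lines will not see the document text.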
 
new_perl_user (Author) Commented:

So I need to install the above module and do something like this?

use strict;
use warnings;
use File::Find;

use Win32::OLE;
use Win32::OLE::Const 'Microsoft Word';

my $word = Win32::OLE->new('Word.Application') or die "Cannot start Word!\n";

open my $log_fh, '>', 'file.log' or die "failed to open 'file.log' $!";

find(\&wanted, '/usr/local/Files');

sub wanted {
    -f and /.doc\z/i or return;
    
    open my $fh, '<', $_ or die "failed to open '$File::Find::name' $!";
    
    while ( my $line = <$fh> ) {
        if ( $line =~ /(Custom Cost|Financial Cost)/ ) {
            print $log_fh "$File::Find::name: $1\n";
        }
    }
    close $fh;
}

 
FishMonger Commented:
That's a start, but the wanted sub will need to be changed so that it uses the module for parsing the file.

This is the direction I'd look into, but it produces an error that needs to be fixed, and I don't have time right now to work on it.
#!/usr/bin/perl

use strict;
use warnings;
use File::Find;
use Text::Extract::Word;

open my $log_fh, '>', 'file.log' or die "failed to open 'file.log' $!";

find(\&wanted, '.');

sub wanted {
    /\.docx?\z/i or return;
    
    my $file = Text::Extract::Word->new($_);
    my $body = $file->get_body();

    # let's see if it got parsed correctly
    print "$File::Find::name:\n", $body;
    
    #while ( my $line = <$fh> ) {
    #    if ( $line =~ /(Custom Cost|Financial Cost)/i ) {
    #        print "$File::Find::name: $1\n";
    #    }
    #}
    #close $fh;
}

 
FishMonger Commented:
Your opening post said that you wanted to process .doc files, but the example you attached is a .docx file, which is not the same file format.  The Text::Extract::Word module I suggested won't parse the docx format, so I used Open Office (since I don't have/use MS Office) and saved the file in standard MS .doc format.  At that point the script I posted was able to parse it.

Here's an updated version which handles the one format and includes a sub which needs to be completed to process the .docx format.

#!/usr/bin/perl

use strict;
use warnings;
use File::Find;
use Win32::OLE;
use Text::Extract::Word;

my %dispatch_code = (
    doc  => \&process_doc,
    docx => \&process_docx,
);

open my $log_fh, '>', 'file.log' or die "failed to open 'file.log' $!";

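# no_chdir keeps the working directory fixed, so the relative path in
# $File::Find::name stays valid for files found in subdirectories
find({ wanted => \&wanted, no_chdir => 1 }, '.');


sub wanted {

    # match .doc/.docx in any case and normalise it for the dispatch table
    /\.(docx?)\z/i or return;
    $dispatch_code{lc $1}->($File::Find::name);

}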


sub process_doc {
    
    my $filename = shift;
    my $file = Text::Extract::Word->new( $filename );
    my $body = $file->get_body();
    
    while ( $body =~ /(Custom Cost|Financial Cost)/ig ) {
        print $log_fh "$filename: $1\n";
    }
}


sub process_docx {
    
    my $filename = shift;
    # parse the file using Win32::OLE
}

You'll need to read through the Win32::OLE documentation as well as the MSDN library to find out how to access the data in the docx file format.
MSDN Library
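
One way the process_docx() stub could be filled in without Win32::OLE (this is a swapped-in technique, not what the thread proposes): a .docx file is a zip archive, and the main body is conventionally stored in the member word/document.xml. The sketch below reads that member with the core IO::Uncompress::Unzip module and strips the XML tags. The names docx_text() and docx_keywords() are hypothetical, and the tag stripping is deliberately crude, so headers, footers, and text boxes stored in other members are not searched.

```perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Hypothetical helper: return the body text of a .docx as one string.
sub docx_text {
    my ($filename) = @_;

    # assumption: the main document part lives at word/document.xml
    my $xml;
    unzip $filename => \$xml, Name => 'word/document.xml'
        or die "failed to read '$filename' $UnzipError";

    $xml =~ s{</w:p>}{\n}g;    # end of a paragraph becomes a newline
    $xml =~ s{<[^>]+>}{}g;     # drop every remaining tag, joining split runs

    # decode the five predefined XML entities in a single pass
    my %entity = ( lt => '<', gt => '>', quot => '"', apos => "'", amp => '&' );
    $xml =~ s/&(lt|gt|quot|apos|amp);/$entity{$1}/g;

    return $xml;
}

# Hypothetical helper: every keyword occurrence found in the document body.
sub docx_keywords {
    my ($filename) = @_;
    my @hits = docx_text($filename) =~ /(Custom Cost|Financial Cost)/ig;
    return @hits;
}
```

Inside process_docx(), each value returned by docx_keywords($filename) could then be printed to $log_fh in the same way process_doc() does.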
 
new_perl_user (Author) Commented:
Hi,
I tried the above suggestions, but it is still generating an empty log file.
 
new_perl_user (Author) Commented:
Can you also specify where the documents should reside?
 
FishMonger Commented:
The doc files need to reside in or under the directory specified in the call to the find() sub.  In my example, that is the current working directory or any subdirectory under it.

That script will work if your doc files are in Microsoft's standard MS Word .doc format.  If you're using a newer version of Word that creates .docx files, which use a different, compressed format, then you'll need to make the proper additions to the process_docx() sub.

D:\perl\new_perl>dir
 Volume in drive D is DATA
 Volume Serial Number is B676-6CE5

 Directory of D:\perl\new_perl

05/08/2011  09:39 AM    <DIR>          .
05/08/2011  09:39 AM    <DIR>          ..
05/05/2011  02:52 PM            16,384 test.doc
05/05/2011  03:26 PM               761 test.pl
               2 File(s)         17,145 bytes
               2 Dir(s)  207,021,109,248 bytes free

D:\perl\new_perl>test.pl

D:\perl\new_perl>dir
 Volume in drive D is DATA
 Volume Serial Number is B676-6CE5

 Directory of D:\perl\new_perl

05/08/2011  09:39 AM    <DIR>          .
05/08/2011  09:39 AM    <DIR>          ..
05/08/2011  09:39 AM                53 file.log
05/05/2011  02:52 PM            16,384 test.doc
05/05/2011  03:26 PM               761 test.pl
               3 File(s)         17,198 bytes
               2 Dir(s)  207,021,109,248 bytes free

D:\perl\new_perl>type file.log
./test.doc: Custom Cost
./test.doc: Financial Cost
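
If the log still comes out empty, a quick check is to list which files find() actually reaches before any parsing happens. This is only a diagnostic sketch: list_word_docs() is a hypothetical name, and it uses the no_chdir option of File::Find so that $File::Find::name can be used directly as a path.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: every .doc/.docx file that find() reaches under $top.
sub list_word_docs {
    my ($top) = @_;
    my @found;

    find(
        {
            no_chdir => 1,    # stay in the starting directory while walking
            wanted   => sub {
                # keep regular files whose extension is .doc or .docx, any case
                push @found, $File::Find::name
                    if -f $File::Find::name
                    && $File::Find::name =~ /\.docx?\z/i;
            },
        },
        $top,
    );

    my @sorted = sort @found;
    return @sorted;
}
```

Calling print "$_\n" for list_word_docs('.'); from the script's directory shows the candidates. An empty list means the path or the extensions are the problem, not the parsing.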
 
new_perl_user (Author) Commented:
Hi,

Thanks a lot for all the suggestions. With the help of your code snippet, I was able to tweak the script a little, and it is finally working. The current working version is below.

#!/usr/local/bin/perl

use strict;
use warnings;
use File::Find;
use Win32::OLE;
use Text::Extract::Word;


my $string1 = " Note";
my $string2 = " Loss";

my %dispatch_code = (
    doc  => \&process_doc,
    docx => \&process_docx,
);

open my $log_fh, '>', 'file.log' or die "failed to open 'file.log' $!";

find(\&wanted, 'C:\Perl\Scripts\Test Documents');


sub wanted {

    # only .doc is handled here; match the extension in any case
    /\.(doc)\z/i or return;
    $dispatch_code{lc $1}->($File::Find::name);

}


sub process_doc {

    my $filename = shift;
    my $file = Text::Extract::Word->new( $filename );
    my $body = $file->get_body();

    # check each keyword on its own so a file containing both is logged twice
    if ( $body =~ /\Q$string1\E/i ) {
        print $log_fh "$filename | $string1 \n";
    }

    if ( $body =~ /\Q$string2\E/i ) {
        print $log_fh "$filename | $string2 \n";
    }
}
