new_perl_user
asked on
Reading Files in perl
Hi,
Can anyone please help me with a code snippet for the following in Perl?
Logic:
Recursively read the folders and subfolders under a path. In these folders, find the .DOC documents, open them, and search for the keywords "custom cost" and "Financial Cost".
If a keyword is found, print that "filename" and "Keyword" to a log file.
Of course, you can do
grep -ril ....
If you want the search to be case insensitive.
Ss
ASKER
I need this as a script because it will be automated as a cron job and will run every other day against approximately 1000 documents.
ASKER
And the platform is Windows.
Use the File::Find or File::Find::Rule module.
Here's a skeleton script and as such would need to be extended to meet all of your needs, but it's all I have time to do at this point.
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
open my $log_fh, '>', 'file.log' or die "failed to open 'file.log' $!";
find(\&wanted, '.');
sub wanted {
-f and /\.doc\z/i or return;
open my $fh, '<', $_ or die "failed to open '$File::Find::name' $!";
if ( grep( /(custom cost|Financial Cost)/, <$fh> ) ) {
print $fh "$File::Find::name: $1\n";
}
}
print $fh "$File::Find::name: $1\n";
should have been
print $log_fh "$File::Find::name: $1\n";
ASKER
Tried the above code as below. It is generating an empty log file with no data.
#!/usr/local/bin/perl
use strict;
use warnings;
use File::Find;
open my $log_fh, '>', 'file.log' or die "failed to open 'file.log' $!";
find(\&wanted, 'C:\Documents and Settings\paul\Desktop\New Folder');
sub wanted {
-f and /.doc\z/i or return;
open my $fh, '<', $_ or die "failed to open '$File::Find::name' $!";
if ( grep( /(custom cost)/, <$fh> ) ) {
print $log_fh "$File::Find::name: $1\n";
}
}
ASKER
any more suggestions or help please....
Change the wanted sub to this, and try it again.
sub wanted {
-f and /\.doc\z/i or return;
open my $fh, '<', $_ or die "failed to open '$File::Find::name' $!";
while ( my $line = <$fh> ) {
if ( $line =~ /(custom cost|Financial Cost)/ ) {
print $log_fh "$File::Find::name: $1\n";
}
}
close $fh;
}
ASKER
Hi,
It is not working yet. I am attaching a test document I am trying to read by using the below code.
use strict;
use warnings;
use File::Find;
open my $log_fh, '>', 'file.log' or die "failed to open 'file.log' $!";
find(\&wanted, '/usr/local/Files');
sub wanted {
-f and /.doc\z/i or return;
open my $fh, '<', $_ or die "failed to open '$File::Find::name' $!";
while ( my $line = <$fh> ) {
if ( $line =~ /(Custom Cost|Financial Cost)/ ) {
print $log_fh "$File::Find::name: $1\n";
}
}
close $fh;
}
test.docx
That's not a plain text file, which is why it's not working.
Either save the files as plain text, or use one of the Win32 modules designed for parsing Windows Word documents.
The primary module most people use is Win32::OLE, but in this case Text::Extract::Word looks like it would be easier to use. I have not used either of them.
Win32::OLE - http://search.cpan.org/~jdb/Win32-OLE-0.1709/lib/Win32/OLE.pm
Text::Extract::Word - http://search.cpan.org/~snkwatt/Text-Extract-Word-0.02/lib/Text/Extract/Word.pm
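To illustrate why reading these files line by line finds nothing, here is a minimal sketch that classifies a file by its first bytes. The helper name sniff_word_format is hypothetical. It relies on two well-known signatures: legacy .doc files are OLE2 compound files, and .docx files are ZIP containers. Other formats share these signatures (for example .xls and .xlsx), so treat the result as a hint, not a guarantee.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: classify a file by its leading bytes.
# Returns 'doc' (OLE2 compound file), 'docx' (ZIP container), or 'other'.
sub sniff_word_format {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "failed to open '$path' $!";
    my $read = read $fh, my $magic, 8;
    close $fh;
    return 'other' unless defined $read;

    # OLE2 compound file signature used by legacy .doc files
    return 'doc' if $magic eq "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";

    # ZIP local file header signature used by .docx files
    return 'docx' if substr( $magic, 0, 4 ) eq "PK\x03\x04";

    return 'other';
}

# Report the detected format of each file named on the command line.
print "$_: ", sniff_word_format($_), "\n" for @ARGV;
```

Running it against the attached test.docx should show whether the file is really a ZIP-based document, in which case a parser for the legacy .doc format may not be able to read it either.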
ASKER
so I need to install the above module and do something like
use strict;
use warnings;
use File::Find;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Word';
use warnings;
my $word = new Win32::OLE 'Word.Application','' or die "Cannot start word!\n";
open my $log_fh, '>', 'file.log' or die "failed to open 'file.log' $!";
find(\&wanted, '/usr/local/Files');
sub wanted {
-f and /.doc\z/i or return;
open my $fh, '<', $_ or die "failed to open '$File::Find::name' $!";
while ( my $line = <$fh> ) {
if ( $line =~ /(Custom Cost|Financial Cost)/ ) {
print $log_fh "$File::Find::name: $1\n";
}
}
close $fh;
}
That's a start, but the wanted sub will need to be changed so that it utilizes the module for parsing the file.
This is the direction I'd look into, but it produces an error that needs to be fixed, and I don't have time right now to work on it.
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Text::Extract::Word;
open my $log_fh, '>', 'file.log' or die "failed to open 'file.log' $!";
find(\&wanted, '.');
sub wanted {
/\.docx?\z/i or return;
my $file = Text::Extract::Word->new($_);
my $body = $file->get_body();
# lets see if it got parsed correctly
print "$File::Find::name:\n", $body;
#while ( my $line = <$fh> ) {
# if ( $line =~ /(Custom Cost|Financial Cost)/i ) {
# print "$File::Find::name: $1\n";
# }
#}
#close $fh;
}
ASKER CERTIFIED SOLUTION
This solution is only available to members.
ASKER
Hi,
I tried the above suggestions, but it is still generating an empty log file.
ASKER
Can you also specify where the documents should reside?
The doc files need to reside in/under the directory specified in the call to the find() sub. In my example that would be in or a sub dir under the current working directory.
That script will work if your doc files are in Microsoft's standard MS Word .doc format. If you're using a newer version of Word that creates .docx files, which use a different, compressed format, then you'll need to make the proper additions to the process_docx() sub.
D:\perl\new_perl>dir
Volume in drive D is DATA
Volume Serial Number is B676-6CE5
Directory of D:\perl\new_perl
05/08/2011 09:39 AM <DIR> .
05/08/2011 09:39 AM <DIR> ..
05/05/2011 02:52 PM 16,384 test.doc
05/05/2011 03:26 PM 761 test.pl
2 File(s) 17,145 bytes
2 Dir(s) 207,021,109,248 bytes free
D:\perl\new_perl>test.pl
D:\perl\new_perl>dir
Volume in drive D is DATA
Volume Serial Number is B676-6CE5
Directory of D:\perl\new_perl
05/08/2011 09:39 AM <DIR> .
05/08/2011 09:39 AM <DIR> ..
05/08/2011 09:39 AM 53 file.log
05/05/2011 02:52 PM 16,384 test.doc
05/05/2011 03:26 PM 761 test.pl
3 File(s) 17,198 bytes
2 Dir(s) 207,021,109,248 bytes free
D:\perl\new_perl>type file.log
./test.doc: Custom Cost
./test.doc: Financial Cost
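For the .docx case mentioned above, one possible starting point is sketched below. It is not the script from this thread: the helper name docx_text is hypothetical, and it assumes the document text lives in the word/document.xml member of the ZIP container, which is the usual layout for Word .docx files. It uses IO::Uncompress::Unzip, which ships with recent Perl distributions, and strips the XML tags with simple regexes, so it is only good enough for keyword searching.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Hypothetical helper: return the visible text of a .docx file.
# Assumes the main document part is stored as word/document.xml.
sub docx_text {
    my ($path) = @_;
    my $xml;
    unzip $path => \$xml, Name => 'word/document.xml'
        or die "failed to read '$path': $UnzipError";

    $xml =~ s{</w:p>}{\n}g;    # paragraph ends become newlines
    $xml =~ s{<[^>]+>}{}g;     # drop the remaining XML tags

    # Decode the five predefined XML entities in a single pass.
    my %entity = ( lt => '<', gt => '>', quot => '"', apos => "'", amp => '&' );
    $xml =~ s/&(lt|gt|quot|apos|amp);/$entity{$1}/g;

    return $xml;
}

# Print each keyword found in each .docx file named on the command line.
for my $path (@ARGV) {
    my $text = docx_text($path);
    while ( $text =~ /(Custom Cost|Financial Cost)/ig ) {
        print "$path: $1\n";
    }
}
```

The text is returned as raw UTF-8 bytes, which is fine for ASCII keywords like the ones in this question. Tabs and line breaks inside a paragraph are dropped by the tag stripping, so a real XML parser would be a safer choice for anything beyond simple searches.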
ASKER
Hi,
Thanks a lot for all the suggestions. With the help of your code snippet, I was able to tweak the script a little, and it is finally working. The currently working version is below.
#!/usr/local/bin/perl
use strict;
use warnings;
use File::Find;
use Win32::OLE;
use Text::Extract::Word;
my $string1 = " Note";
my $string2 = " Loss";
my %dispatch_code = (
doc => \&process_doc,
docx => \&process_docx,
);
open my $log_fh, '>', 'file.log' or die "failed to open 'file.log' $!";
find(\&wanted, 'C:\Perl\Scripts\Test Documents');
sub wanted {
/\.(doc)\z/ or return;
$dispatch_code{$1}->($File::Find::name);
}
sub process_doc {
my $filename = shift;
my $file = Text::Extract::Word->new( $filename );
my $body = $file->get_body();
if( $body =~ m/$string1/ig ) {
#print "yes";
print $log_fh "$filename | $string1 \n";
}
elsif ($body =~ m/$string2/ig ) {
#print "yes";
print $log_fh "$filename | $string2 \n";
}
}
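Note that the if/elsif chain in process_doc logs only the first keyword that matches, so a document containing both strings is reported once. If every matching keyword should be logged, as the original question suggests, a small helper along these lines could be used. The name matched_keywords and the sample text are hypothetical and only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return every keyword that occurs in $text,
# matched case-insensitively and as literal text (\Q...\E).
sub matched_keywords {
    my ( $text, @keywords ) = @_;
    return grep { $text =~ /\Q$_\E/i } @keywords;
}

# Example: check one piece of text against a list of keywords
# and print each keyword that was found.
my $body = "Quarterly note: the loss was smaller than expected.";
print "$_\n" for matched_keywords( $body, 'Note', 'Loss', 'Custom Cost' );
```

Inside process_doc, the same call could replace the if/elsif chain, writing one log line per returned keyword. Using \Q...\E keeps characters such as "." or "(" in a keyword from being treated as regex syntax.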
Just do
grep -rl "Financial Cost" > "Financial_Cost.txt"
This will give entries for all files with "Financial Cost".
Do the same for the other (either in the same file, or a different file).
Ss