• Status: Solved

How to extract title & author info from all PDFs in a folder

Hi,
I have a zillion articles in PDF form that were downloaded from an academic database. All of them follow the same format: the first page is a title page with the title, author, and publication info. The rest varies depending on the publication it came from.

I want to generate a text file with those fields for all of the articles. For example:

Jones, Paul. "yada yada." THE WALL STREET JOURNAL, December 1, 2008.
Smith, John. "blah blah." THE ECONOMIST, June 1, 2006.
and so on.

My ultimate goal is to generate an XML file for each so that I can import this info into a reference manager (e.g., EndNote), but for now I just want to extract this info.

Any suggestions on tools or strategies for writing such a script? I'd prefer .NET, but am open to whatever works.

Thanks.
 

Asked by: Leprechaun

Adam314 commented:

#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;
 
my @Files = glob('*.pdf');
 
open(my $out, ">output.csv") or die "Output: $!\n";
foreach (@Files) {
	my $pdf = CAM::PDF->new($_);
	my $text = $pdf->getPageText(1);
	my @lines = split(/\n/, $text);
	chomp(@lines);
	
	#Assuming 1st line is title, 2nd line is author,...
	print $out "$lines[0],$lines[1]\n";
}
close($out);

Leprechaun (Author) commented:
This is great, thank you.

The title is preceded by an image (the logo of the db involved). Does that change anything?

For example, most of them have a format like this for the 1st page:
-----------------------------------------
[Graphic with ABC Publishing logo]

Article #1 Title
Author(s): Author #1, Author #2
Source: Journal of Whatever Studies, Vol. 1, Issue 1, 2008, pp. 201-221
Published by: ABC Publishing
Stable URL: http://www.jstor.org/stable/7987897897
Accessed: 09/11/2008 00:42
---------------------------------------------

Does the presence of the image make a difference? What about formatting like italics or boldface?

Thanks.

Adam314 commented:
The getPageText function will get all of the text from the page in the order in which it exists.  It will not return any images, nor will the presence of images or formatting affect it.  The getPageText function gets just the text (there are other functions to get other things...).


If you post one or a few of the PDFs, I can update the script to work with them.  Otherwise, the script as it is should get you pretty close.
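
Since the sample page labels its fields ("Author(s):", "Source:"), another option is to pick the fields out by label instead of by line position. This is only a minimal sketch: parse_first_page is a made-up helper name, not part of CAM::PDF, and it assumes the title is the first non-blank line and that the labels appear exactly as in the sample layout above. In a real script, $page would come from getPageText(1).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): pull labelled fields out of
# the text of a first page laid out like the sample in this thread.
sub parse_first_page {
    my ($text) = @_;
    my %info = (title => undef, author => undef, source => undef);
    return \%info unless defined $text;

    # Assumption: the title is the first non-blank line.
    ($info{title})  = $text =~ /^\s*(\S.*?)\s*$/m;
    # Assumption: author and source lines start with these exact labels.
    ($info{author}) = $text =~ /^Author\(s\):[ \t]*(.*?)\s*$/m;
    ($info{source}) = $text =~ /^Source:[ \t]*(.*?)\s*$/m;
    return \%info;
}

# Text copied from the sample layout; real text would come from getPageText.
my $page = join "\n",
    '',
    'Article #1 Title',
    'Author(s): Author #1, Author #2',
    'Source: Journal of Whatever Studies, Vol. 1, Issue 1, 2008, pp. 201-221',
    'Published by: ABC Publishing';

my $info = parse_first_page($page);
print "$info->{title} | $info->{author} | $info->{source}\n";
```

A field that is not found comes back undefined, so check for that before printing when running against real files.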

Leprechaun (Author) commented:
Thanks, that's really helpful.

I'd gladly take you up on that, but I don't want to get into trouble for posting a file on Experts Exchange that is copyrighted and requires a subscription for access.

Can I send it directly to you somehow?

Adam314 commented:
No, sending files is against EE policy.  I'm not sure what the policy on attaching files is; you could ask in Community Support.

Or, try running this script, and post the output from it here.


Note that this script requires the CAM::PDF module.  To install it:
    on unix as root: cpan CAM::PDF
    on windows with ActiveState:  ppm install CAM-PDF

#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;
 
my @Files = glob('*.pdf');
 
open(my $out, ">output.csv") or die "Output: $!\n";
my $count=0;
foreach (@Files) {
       last if $count++>3;
        my $pdf = CAM::PDF->new($_);
        my $text = $self->getPageText(1);
        print "********** File $count\n$text\n";
}
close($out);

Leprechaun (Author) commented:
Okay, it occurred to me that I could just create a dummy file in that format. Here it is.

You'll notice that it has a graphic at the beginning. They all do: each db has its own unique image and its own arrangement of fields. Ideally, I would be able to detect which image is at the beginning and then parse using the appropriate pattern.

Otherwise, I have to manually check each file and put it in a db-specific folder before running the script.

Thanks.

JSTOR-template-1.pdf

Leprechaun (Author) commented:
I'm hoping to have a case statement that determines which pattern to parse for, based on the image used.

Also, is there a way for me to view the source of these PDFs to see if there's something else I could search for?

Not all of them have images at the beginning, so it would be nice to find something else to look for in those cases.

Thanks!

Adam314 commented:
Are different files going to have the text in different formats?  If so, can you post files with the other formats?
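
If the formats do differ, one option is to choose the pattern from the text itself rather than from the image, since getPageText returns only text. A minimal sketch, assuming each database leaves some recognizable marker in its first-page text: the only marker used here is the jstor.org "Stable URL" line from the sample layout posted earlier, and detect_source and the 'unknown' fallback are names made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical marker table: source name => pattern expected somewhere in
# the first-page text. Only the JSTOR entry is based on the sample in this
# thread; entries for other databases would have to be added by hand.
my @markers = (
    [ jstor => qr{^Stable URL:\s*http://www\.jstor\.org/}m ],
);

# Return the name of the first marker that matches, or 'unknown'.
sub detect_source {
    my ($text) = @_;
    return 'unknown' unless defined $text;
    foreach my $marker (@markers) {
        my ($name, $pattern) = @$marker;
        return $name if $text =~ $pattern;
    }
    return 'unknown';
}

# Text copied from the sample layout; real text would come from getPageText.
my $sample = join "\n",
    'Article #1 Title',
    'Author(s): Author #1, Author #2',
    'Source: Journal of Whatever Studies, Vol. 1, Issue 1, 2008, pp. 201-221',
    'Published by: ABC Publishing',
    'Stable URL: http://www.jstor.org/stable/7987897897';

print detect_source($sample), "\n";
```

The returned name can then drive an if/elsif chain (or a hash of parsing subs) that applies the right pattern for each database, and it also covers files that have no image at all.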

Leprechaun (Author) commented:
Hey Adam314, sorry about the confusion with this thread. I didn't mean for it to get closed. Something came up and I had to put this little project aside for a while.

Anyway, I just tried to run your code and it's giving me a problem with the $self variable. Does it need to be declared?

Thanks.

Adam314 commented:
The $self (on line 13) is supposed to be $pdf.

Leprechaun (Author) commented:
Uh duh. So much for my troubleshooting skills!

I'm getting the output below, and the output.csv file is empty (i.e., created but without data).

The updated code that I'm using is below.

Thanks.


===================
C:\temp\test>perl script.pl
Use of uninitialized value $text in concatenation (.) or string at script.pl line 14.
********** File 1

Use of uninitialized value $text in concatenation (.) or string at script.pl line 14.
********** File 2

Use of uninitialized value $text in concatenation (.) or string at script.pl line 14.
********** File 3

Use of uninitialized value $text in concatenation (.) or string at script.pl line 14.
********** File 4


C:\temp\test>
================================



#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;
 
my @Files = glob('*.pdf');
 
open(my $out, ">output.csv") or die "Output: $!\n";
my $count=0;
foreach (@Files) {
       last if $count++>3;
        my $pdf = CAM::PDF->new($_);
        my $text = $pdf->getPageText(1);
        print "********** File $count\n$text\n";
}
close($out);

Leprechaun (Author) commented:
As I said, there's a problem with that script even after the variable name correction.  

The filenames aren't being parsed successfully and the .csv file is empty.  Kindly see error messages above.
Thanks.




Adam314 commented:
I'll take a look at the sample PDF you attached.  If you don't get a response in a day, remind me.

Leprechaun (Author) commented:
Great, thanks.

Adam314 commented:
Are you sure JSTOR-template-1.pdf is representative of your actual files?  When I run the above code with it, I get this.  Do you get the same using just the JSTOR-template-1.pdf file?
Parsing PDFs: The untold story
Author(s): John Doe; Jane Doe
Source: Journal of Computing, Vol. 22, No. 1 (Jan., 1986), pp. 138-139
Published by: Acme Publishing
 
Blah ablah balh blah

Leprechaun (Author) commented:
Hi Adam

Oops, you are right!  I didn't realize that the files I tested on were different from the sample. Sorry about that.

When I run the code against that sample PDF, I don't get any errors, but the .csv file I get is empty.

Thanks.

Adam314 commented:
Can you post a representative PDF file?  If not, what pages have the data you want?  How is it formatted?

Here is a script that will get the text from one of the pages, and save it to a file named output.txt.  If you post the file, I can help parse the text for what you want.
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;
 
#NOTE: Enter the file name of one of your PDF files
my $pdf = CAM::PDF->new('Enter_A_File_Name_Here.pdf');
 
open(my $out, ">output.txt") or die "Output: $!\n";
#NOTE: Enter the page number that contains the text you want
my $text = $pdf->getPageText(Enter_Your_Page_Number_Here);
print $out $text;
 
close($out);

Leprechaun (Author) commented:
I'm sorry, maybe I was unclear. I am just trying to put the info from the first 3 lines of each file (i.e., title, author & source) into 3 columns in the Excel file so it's sortable, etc.

Thanks.



Adam314 commented:
Are the first 3 lines on the first page, or some other page?  Or do you want to look through all the pages until you find 3 lines of text?

Leprechaun (Author) commented:
Yeah, they're on the first page.

If you get it to work for the sample PDF file I can probably tweak it for the other formats.

Thanks.

Adam314 commented:
This works on the attached sample PDF.  It creates a .csv file that can be opened in Excel.
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;
 
#NOTE: Enter the file name of one of your PDF files
my $pdf = CAM::PDF->new('JSTOR-template-1.pdf');
 
open(my $out, ">output.csv") or die "Output: $!\n";
my $text = $pdf->getPageText(1);
print "text:\n$text***\n";
my @lines = split/\n/, $text;
shift @lines while($#lines>2 and $lines[0] !~ /\S/);  #Remove blank lines
s/"/'/g foreach (@lines);  #Replace double-quotes with single quotes
print $out "$lines[0],$lines[1],\"$lines[2]\"\n";
 
close($out);
 
 
 
 
***************************************************
The output I get is:
ýParsing PDFs: The untold storyý,Author(s): John Doe; Jane Doe,"Source: Journal of Computing, Vol. 22, No. 1 (Jan., 1986), pp. 138-139"
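
One caveat with this output: the title and author fields are written without quotes, so a title that contains a comma would spill into an extra column in Excel. A minimal sketch of quoting every field, assuming the usual CSV convention of wrapping each field in double quotes and doubling any embedded double quote; csv_field and csv_row are made-up helper names, not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: wrap one value in double quotes, doubling any
# double quotes inside it. Undefined values become empty fields.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s/"/""/g;
    return qq{"$value"};
}

# Hypothetical helper: build one CSV line from a list of values.
sub csv_row {
    return join(',', map { csv_field($_) } @_) . "\n";
}

# The three lines taken from the sample PDF text shown earlier.
print csv_row(
    'Parsing PDFs: The untold story',
    'Author(s): John Doe; Jane Doe',
    'Source: Journal of Computing, Vol. 22, No. 1 (Jan., 1986), pp. 138-139',
);
```

With this, the earlier step that replaces double quotes with single quotes is no longer needed, because embedded quotes are escaped rather than removed.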

Leprechaun (Author) commented:
Thanks Adam314 and sorry for not responding for so long. This looks great.