Extracting references from PDF files and comparing with a specified list

Hi all,
I have a series of PDF files (95% electronically created, not scanned) containing scientific articles, and for a non-commercial bibliographic database we need:
1. to extract all references (i.e. the bibliography or works cited) from each one, with no fixed format, although most use the author's surname-comma-name format (sometimes with several authors), then the year (sometimes in parentheses), then the title;
2. to compare those extracted data with a list of previously selected articles and authors, so I can see whether the PDF articles cite any (one or more) of the titles in the list;
3. to present those citations in an easily readable way, such as an Excel table with a column for the citing document and more columns for the cited ones.
Doing all these steps manually would take years, so I was wondering if any expert here could help us.
I have a related question that I will post separately.
Regards,
Francisco
 
aikimark commented:
I would start with this question thread:
https://www.experts-exchange.com/questions/28254706/Need-help-with-UDF-user-defined-function.html

It is a very similar problem (parsing bibliographic citations). The results go into an Excel workbook.
 
Joe Winograd (Fellow & MVE, Developer) commented:
Hi Francisco,
First, I see that you joined Experts Exchange today, so let me say, "Welcome aboard!"

My gut feel is that this will require a fairly complex solution, well beyond the typical Q&A. My suggestion is to establish a budget for the project (what is a solution worth to you?) and then take advantage of EE's Gigs feature to post your project:
https://www.experts-exchange.com/gigs/

That said, some other experts might disagree with me and jump in with a solution here in the free Q&A. Regards, Joe
 
aikimark commented:
With what programming languages are you fluent, Francisco?
 
Francisco Pérez (Author) commented:
Thanks for your replies. I'm fluent in a few languages (I'm a linguist), but unfortunately I'm not fluent in any programming languages. I do know a bit about regular expressions, tagged files (HTML, XML) and how to use some formulas in Excel, but not much more, to be honest.
This is for my PhD and I don't have a budget.
 
Joe Winograd (Fellow & MVE, Developer) commented:
Francisco,
Do you want to learn a programming language and develop the solution yourself or do you want someone to build it for you? Regards, Joe
 
aikimark commented:
There are some free PDF utilities that allow you to export the contents of the text layer of a PDF.  Some of these utilities provide a command line interface that facilitates using wildcards (*.PDF  for example).

Even if you know regular expressions, you are still faced with the task of processing the 'extracted text' files.  You might be able to use Notepad++ for this.  Notepad++ is a free text editor with a lot of neat features.  It might also be possible to use grep if you are in a Linux/Unix environment.
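
For example, a rough VBA sketch along these lines (the folder path is only a placeholder, and it assumes Xpdf's pdftotext utility is installed and on the PATH) could batch-convert every PDF in a folder to plain text:

Sub ConvertPdfsToText()
    Dim sh As Object, folder As String, f As String
    Set sh = CreateObject("WScript.Shell")
    folder = "C:\papers\"               ' placeholder folder
    f = Dir(folder & "*.pdf")
    Do While Len(f) > 0
        ' -layout keeps the page layout; True = wait for each conversion to finish
        sh.Run "pdftotext -layout """ & folder & f & """", 0, True
        f = Dir()
    Loop
End Sub

Each PDF ends up as a .txt file with the same name in the same folder, which you can then search with Notepad++ or grep.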
 
Joe Winograd (Fellow & MVE, Developer) commented:
> There are some free PDF utilities that allow you to export the contents of the text layer of a PDF.

Following up on that comment from aikimark, these two EE 5-minute video Micro Tutorials show how to do that:
Xpdf - Command Line Utility for PDF Files
Xpdf - Convert PDF Files to Plain Text Files

Also, for the 5% of the files that are scanned (not electronic), you may need to OCR them if the scanning created image-only PDFs. If so, this other EE 5-minute video Micro Tutorial shows how to do that:
How to OCR pages in a PDF with free software

Both of those software packages are perfect for someone without a budget — they're free. :)

Regards, Joe
 
Francisco Pérez (Author) commented:
Thanks for the info.
I know Notepad++, but I prefer EmEditor, which seems to work better, though I don't know why.
I don't have any problem with converting to txt, doc, html or even with using OCR (I can do all that with Acrobat), but what I need is to identify the references section (usually by the end of the file, though not always) and extract it somehow. Sometimes it's titled "References", sometimes "Works cited", "Bibliography", or even in other languages.
 
Joe Winograd (Fellow & MVE, Developer) commented:
> what I need is to identify the references section (usually by the end of the file, though not always) and extract it somehow. Sometimes it's titled "References", sometimes "Works cited", "Bibliography", or even in other languages.

Sounds like a job for some AI code. :)  In all seriousness, this is tricky stuff. Even after finding the Bibliography/References section (difficult enough by itself), you then have to extract all the components — author's/authors' name(s), article title, date — which are not in a fixed format (even more difficult), and then compare the extracted components with a list of articles to determine if the extracted components match the components on the list (more difficult still). More AI code. :)

Maybe this should be your Ph.D. thesis. :)  Regards, Joe
 
Francisco Pérez (Author) commented:
I thought this would be easier for you guys, really.
In the end, I've managed to put together the following steps:
1) save PDFs as Excel using Acrobat (which works better with line breaks than any other approach I've tried, including PDF-extract);
2) save Excel as XML (Acrobat's conversion into XML adds too many line breaks);
3) isolate author-year-title (which is what I actually need) using regular expressions in EmEditor:
- replace (.*)\. ([1-2][0-9][0-9][0-9])\. (.*?)\. (.*) with \1\.¬\(\2\)\.¬\3\.
which obviously doesn't work when there are errors or different formats (no year, "(eds.)", more than one space, etc.); it could be fine-tuned, of course;
4) separate columns in Excel with ¬ as delimiter;
Now I'll do the same with my list of publications and then use an Excel formula (VLookup, I guess) to identify matching titles and years between the two lists. Risk: spelling errors in titles, different punctuation, etc. will mean no matching, even if it's the same title.
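
To soften that risk, one idea (only a rough sketch I would still have to test; it assumes I can paste a small VBA user-defined function into the workbook, and the names here are made up) is to normalize both titles before comparing them, rather than relying on an exact VLookup:

Function TitlesMatch(a As String, b As String) As Boolean
    ' True when the two titles are identical after normalization
    TitlesMatch = (NormalizeTitle(a) = NormalizeTitle(b))
End Function

Function NormalizeTitle(s As String) As String
    ' Lower-case, drop punctuation (and, crudely, any accented letters), collapse spaces
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.Pattern = "[^a-z0-9 ]"        ' keep only plain letters, digits and spaces
    s = re.Replace(LCase(s), "")
    re.Pattern = " +"                ' collapse repeated spaces
    NormalizeTitle = Trim(re.Replace(s, " "))
End Function

A helper column like =TitlesMatch(A2, List!B2) could then flag the matches; real typos would still need something fuzzier, such as edit distance.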
 
aikimark commented:
> Doing all these steps manually would take years
@Francisco
From your initial problem statement, it appeared that you needed to process hundreds or thousands of PDF files. The fact that you seem to have solved a significant portion of the work suggests there may be some disconnect between the problem description and the actual size of the problem.

In any event, congratulations on making such progress.

Without example data, it is difficult for experts to help you.
 
Francisco Pérez (Author) commented:
Thanks, but the truth is I do have hundreds of documents, and more still to come. I just solved a portion of the problem, but if I still need to open each document, save it, save it again, etc., it will take ages. That's why I came here for help.
I'm sorry I forgot to provide examples. Here is one: http://www.jows.pl/sites/default/files/Gajek.pdf. The format is a bit different and therefore it doesn't work with my approach. The first references look like this (copied and pasted from the PDF):
Bibliografia
- Brookes, S. (2003) Video Production in the Foreign Language Classroom:
Some Practical Ideas. W: The Internet TESL Journal, t. 9, nr 10 [online]
[dostęp 02.04.2008].
- Canning-Wilson, C. (2000) Practical Aspects of Using Video in the Foreign
Language Classroom. W: The Internet TESL Journal, t. 6, nr 11 [online]
[dostęp 02.04.2008].
- Díaz Cintas, J., Remael, A. (2007) Audiovisual Translation: Subtitling.
Manchester: St. Jerome Publishing.
- Dondis, D. (1972) A Primer of Visual Literacy. Cambridge: MIT Press.
 
aikimark commented:
You can reduce the regex parsing workload by first isolating the bibliography section of the document text, either with a regular expression or with simple substring operations.
Here is a regular expression to do that:
Bibliografia(?:(?:\S|\s)+)$

You can also use the VBA functions InStr() and Mid(). Example:
strBiblioText = Mid(strFileData, InStr(strFileData, "Bibliografia"))
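
Building on the InStr()/Mid() line above, a minimal sketch (the file names and the heading are only placeholders, and a real version would also have to allow for headings such as "References", "Works cited" or "Bibliography") could look like this:

Sub ExtractBibliography()
    ' Read one extracted-text file and save everything from the heading onward
    Dim fso As Object, strFileData As String, strBiblioText As String, p As Long
    Set fso = CreateObject("Scripting.FileSystemObject")
    strFileData = fso.OpenTextFile("C:\papers\Gajek.txt", 1).ReadAll    ' 1 = ForReading
    p = InStr(1, strFileData, "Bibliografia", vbTextCompare)            ' case-insensitive search
    If p > 0 Then
        strBiblioText = Mid(strFileData, p)
        fso.CreateTextFile("C:\papers\Gajek-biblio.txt", True).Write strBiblioText
    End If
End Sub

Loop it over all the .txt files (Dir() again) and you have the bibliography sections isolated before any finer regex work.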

 
Joe Winograd (Fellow & MVE, Developer) commented:
> I thought this would be easier for you guys, really.

Then you don't understand the complexity of the problem.

> save PDFs as Excel using Acrobat (which works better with line breaks than any other approach I've tried, including PDF-extract)

I found a random paper on the net:
A Wearable Face Recognition System on Google Glass for Assisting Social Interactions

Attached are the Excel file created by Adobe Acrobat XI Pro and the text files created by the Xpdf utility pdftotext: one run with the -layout parameter and one with -raw. Opinions will vary with respect to "which works better".

> which obviously doesn't work when there are errors or different formats (no year, "(eds.)", more than one space, etc.)

Obviously. Part of the complexity of a real solution.

> it could be fine-tuned, of course

Of course.

> Risk: spelling errors in titles, different punctuation, etc. will mean no matching, even if it's the same title.

Among many other "risks".

> I just solved a portion of the problem

Yes, a very small portion.

> The format is a bit different and therefore it doesn't work with my approach.

As will be the case with many papers, which is why a complete solution won't be easy. Regards, Joe
prosopagnosia-acrobat.xlsx
prosopagnosia-pdftotext-layout.txt
prosopagnosia-pdftotext-raw.txt