Member_2_8044579
 asked on

Extracting references from PDF files and comparing with a specified list

Hi all,
I have a series of PDF files (95% electronically created, not scanned) containing scientific articles. For a non-commercial bibliographic database, we need:
1. to extract all references (i.e. bibliography, works cited) from each one, with no fixed format, although most of them use the author's surname-comma-name format (sometimes with several authors), then year (sometimes in parentheses), then title;
2. to compare those extracted data with a list of previously selected articles and authors, so I can see whether the PDF articles cite any (one or more) of the titles in the list;
3. to present those citations in an easily readable way, such as an Excel table with a column for the citing document and more columns for the cited ones.
Doing all these steps manually would take years, so I was wondering if any expert here could help us.
I have a related question that I will post separately.
Regards,
Francisco
Databases, PDF, Microsoft Excel, extract

Last Comment
Joe Winograd

8/22/2022 - Mon
Joe Winograd

Hi Francisco,
First, I see that you joined Experts Exchange today, so let me say, "Welcome aboard!"

My gut feel is that this will require a fairly complex solution, well beyond the typical Q&A. My suggestion is to establish a budget for the project (what is a solution worth to you?) and then take advantage of EE's Gigs feature to post your project:
https://www.experts-exchange.com/gigs/

That said, some other experts might disagree with me and jump in with a solution here in the free Q&A. Regards, Joe
aikimark

With what programming languages are you fluent, Francisco?
Member_2_8044579

ASKER
Thanks for your replies. I'm fluent in a few languages (I'm a linguist), but unfortunately I'm not fluent in any programming languages. I do know a bit about regular expressions, tagged files (HTML, XML) and how to use some formulas in Excel, but not much more, to be honest.
This is for my PhD and I don't have a budget.
Joe Winograd

Francisco,
Do you want to learn a programming language and develop the solution yourself or do you want someone to build it for you? Regards, Joe
aikimark

There are some free PDF utilities that allow you to export the contents of the text layer of a PDF.  Some of these utilities provide a command line interface that facilitates using wildcards (*.PDF  for example).

Even if you know regular expressions, you are still faced with the task of processing the 'extracted text' files.  You might be able to use Notepad++ for this.  Notepad++ is a free text editor with a lot of neat features.  It might also be possible to use grep if you are in a Linux/Unix environment.
Joe Winograd

> There are some free PDF utilities that allow you to export the contents of the text layer of a PDF.

Following up on that comment from aikimark, these two EE 5-minute video Micro Tutorials show how to do that:
Xpdf - Command Line Utility for PDF Files
Xpdf - Convert PDF Files to Plain Text Files
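For a whole folder of PDFs, the conversion can be scripted rather than run file by file. The following is a minimal Python sketch, assuming the Xpdf `pdftotext` program is installed and on the PATH; the function names and folder layout are hypothetical, and `-layout` can be swapped for `-raw`.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir, mode="-layout"):
    """Return the pdftotext command line for one PDF (nothing is executed)."""
    pdf_path = Path(pdf_path)
    txt_path = Path(out_dir) / (pdf_path.stem + ".txt")
    return ["pdftotext", mode, str(pdf_path), str(txt_path)]


def convert_folder(pdf_dir, out_dir, mode="-layout"):
    """Run pdftotext on every *.pdf in pdf_dir; return names of PDFs that failed."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    # Note: "*.pdf" is case-sensitive on some systems, so "X.PDF" may be skipped.
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        result = subprocess.run(build_command(pdf_path, out_dir, mode))
        if result.returncode != 0:
            failed.append(pdf_path.name)
    return failed
```

Calling `convert_folder("papers", "text")` would write one `.txt` file per PDF into the `text` folder and return the list of files that could not be converted.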

Also, for the 5% of the files that are scanned (not electronic), you may need to OCR them if the scanning created image-only PDFs. If so, this other EE 5-minute video Micro Tutorial shows how to do that:
How to OCR pages in a PDF with free software

Both of those software packages are perfect for someone without a budget — they're free. :)

Regards, Joe
Member_2_8044579

ASKER
Thanks for the info.
I know Notepad++, but I prefer EmEditor, which seems to work better, though I don't know why.
I don't have any problem with converting to txt, doc, html or even with using OCR (I can do all that with Acrobat), but what I need is to identify the references section (usually by the end of the file, though not always) and extract it somehow. Sometimes it's titled "References", sometimes "Works cited", "Bibliography", or even in other languages.
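One rough way to locate the section in already-extracted text is to look for the last line that consists only of a known heading word and take everything after it. This is a heuristic sketch in Python, not a robust solution: the heading list is an assumption to be extended for other languages, and papers with appendices after the references would need extra handling.

```python
import re

# Hypothetical heading list; extend it for the languages in your corpus.
HEADINGS = ["references", "works cited", "bibliography", "bibliografia",
            "bibliografía", "literatur"]

# A heading line: optional section number, the heading word, optional colon.
HEADING_RE = re.compile(
    r"^\s*(?:\d+\.?\s*)?(?:" + "|".join(map(re.escape, HEADINGS)) + r")\s*:?\s*$",
    re.IGNORECASE,
)


def extract_references(text):
    """Return the text after the last heading-like line, or None if none is found."""
    lines = text.splitlines()
    last = None
    for i, line in enumerate(lines):
        if HEADING_RE.match(line):
            last = i  # keep the last match, since references are usually near the end
    if last is None:
        return None
    return "\n".join(lines[last + 1:]).strip()
```

A line such as "References to prior work are many." is not treated as a heading, because the pattern requires the heading word to stand alone on its line.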
Joe Winograd

> what I need is to identify the references section (usually by the end of the file, though not always) and extract it somehow. Sometimes it's titled "References", sometimes "Works cited", "Bibliography", or even in other languages.

Sounds like a job for some AI code. :)  In all seriousness, this is tricky stuff. Even after finding the Bibliography/References section (difficult enough by itself), you then have to extract all the components — author's/authors' name(s), article title, date — which are not in a fixed format (even more difficult), and then compare the extracted components with a list of articles to determine if the extracted components match the components on the list (more difficult still). More AI code. :)

Maybe this should be your Ph.D. thesis. :)  Regards, Joe
Member_2_8044579

ASKER
I thought this would be easier for you guys, really.
Finally I've managed to follow these steps:
1) save PDFs as Excel using Acrobat (which works better with line breaks than any other approach I've tried, including PDF-extract);
2) save Excel as XML (Acrobat's conversion into XML adds too many line breaks);
3) isolate author-year-title (which is what I actually need) using regular expressions in EmEditor:
- replace (.*)\. ([1-2][0-9][0-9][0-9])\. (.*?)\. (.*) with \1\.¬\(\2\)\.¬\3\.
which obviously doesn't work when there are errors or different formats (no year, "(eds.)", more than one space, etc.); it could be finetuned, of course;
4) separate columns in Excel with ¬ as delimiter;
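For comparison, step 3 can be reproduced outside the editor. This Python sketch uses the same pattern and the same ¬ delimiter as the EmEditor replacement above; the function name is hypothetical, and lines that do not fit the "Authors. Year. Title. Rest" shape return `None` so they can be collected and reviewed by hand instead of being silently mangled.

```python
import re

# Same shape as the editor pattern: authors, ". ", 4-digit year, ". ", title, ". ", rest.
PATTERN = re.compile(r"(.*)\. ([12][0-9]{3})\. (.*?)\. (.*)")


def split_reference(line, sep="¬"):
    """Return 'authors.<sep>(year).<sep>title.' or None when the line does not match."""
    m = PATTERN.match(line.strip())
    if m is None:
        return None
    authors, year, title, _rest = m.groups()
    return f"{authors}.{sep}({year}).{sep}{title}."
```

With a made-up input in that shape, `split_reference("Dondis, D. 1972. A Primer of Visual Literacy. Cambridge: MIT Press.")` gives `Dondis, D.¬(1972).¬A Primer of Visual Literacy.`, while a reference with the year in parentheses gives `None`.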
Now I'll do the same with my list of publications and then I will use some Excel formula (VLookup, I guess) to identify matching titles and years between both lists. Risk: spelling errors in titles, different punctuation, etc. will mean no matching, even if it's the same title.
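An exact lookup fails on small spelling or punctuation differences, so one possible alternative is approximate matching. The sketch below uses Python's standard `difflib.SequenceMatcher` on normalized titles; the function names and the 0.9 threshold are assumptions and would need tuning against real data.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    words = re.sub(r"[^a-z0-9]+", " ", no_accents.lower())
    return words.strip()


def best_match(title, known_titles, threshold=0.9):
    """Return (known_title, score) for the closest title at or above threshold, else None."""
    target = normalize(title)
    best = None
    for known in known_titles:
        score = SequenceMatcher(None, target, normalize(known)).ratio()
        if score >= threshold and (best is None or score > best[1]):
            best = (known, score)
    return best
```

Each extracted title would be passed to `best_match` together with the list of selected titles; anything that returns `None` is treated as not cited, and borderline scores are worth checking manually.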
aikimark

> Doing all these steps manually would take years
@Francisco
From your initial problem statement, it appeared that you needed to process hundreds or thousands of PDF files.  The fact that you seem to have solved a significant portion of the work indicates that there might be some disconnect between the problem description and the actual problem (size).

In any event, congratulations on making such progress.

Without example data, it is difficult for experts to help you.
Member_2_8044579

ASKER
Thanks, but the truth is I do have hundreds of documents, and more still to come. I just solved a portion of the problem, but if I still need to open each document, save it, save it again, etc., it will take ages. That's why I was coming here for some help.
I'm sorry I forgot to provide examples. Here is one: http://www.jows.pl/sites/default/files/Gajek.pdf. The format is a bit different, so it doesn't work with my approach. The first references look like this (copied and pasted from the PDF):
Bibliografia
- Brookes, S. (2003) Video Production in the Foreign Language Classroom:
Some Practical Ideas. W: The Internet TESL Journal, t. 9, nr 10 [online]
[dostęp 02.04.2008].
- Canning-Wilson, C. (2000) Practical Aspects of Using Video in the Foreign
Language Classroom. W: The Internet TESL Journal, t. 6, nr 11 [online]
[dostęp 02.04.2008].
- Díaz Cintas, J., Remael, A. (2007) Audiovisual Translation: Subtitling.
Manchester: St. Jerome Publishing.
- Dondis, D. (1972) A Primer of Visual Literacy. Cambridge: MIT Press.
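For a list in this particular shape (entries starting with "- ", wrapped over several lines, year in parentheses), a sketch like the following could rejoin the wrapped lines and split out authors, year and title. It is written for this one sample only; the pattern and function names are assumptions, and other papers will need other patterns.

```python
import re

# Authors, "(year)", title up to the first full stop, then the optional remainder.
ENTRY_RE = re.compile(
    r"^(?P<authors>.+?)\s*\((?P<year>[12][0-9]{3})\)\s*"
    r"(?P<title>.+?)\.(?:\s+(?P<rest>.*))?$"
)


def join_entries(block):
    """Group wrapped lines into one string per reference (entries start with '- ')."""
    entries = []
    for line in block.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("- "):
            entries.append(line[2:])       # a new reference begins
        elif entries:
            entries[-1] += " " + line      # continuation of the previous reference
        # lines before the first "- " (e.g. the heading) are skipped
    return entries


def parse_entry(entry):
    """Return a dict with authors, year, title, rest, or None if the entry does not fit."""
    m = ENTRY_RE.match(entry)
    return m.groupdict() if m else None
```

Applied to the lines above, `join_entries` yields one string per reference, and `parse_entry` on the first one gives the year `2003` and the title `Video Production in the Foreign Language Classroom: Some Practical Ideas`.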
ASKER CERTIFIED SOLUTION
aikimark

Joe Winograd

> I thought this would be easier for you guys, really.

Then you don't understand the complexity of the problem.

> save PDFs as Excel using Acrobat (which works better with line breaks than any other approach I've tried, including PDF-extract)

I found a random paper on the net:
A Wearable Face Recognition System on Google Glass for Assisting Social Interactions

Attached are the Excel file created by Adobe Acrobat XI Pro and the text files created by the Xpdf utility called PDFtoText — one run with the -layout parameter and one with the -raw param. Opinions will vary with respect to "which works better".

> which obviously doesn't work when there are errors or different formats (no year, "(eds.)", more than one space, etc.)

Obviously. Part of the complexity of a real solution.

> it could be finetuned, of course

Of course.

> Risk: spelling errors in titles, different punctuation, etc. will mean no matching, even if it's the same title.

Among many other "risks".

> I just solved a portion of the problem

Yes, a very small portion.

> The format is a bit different and therefore it doesn't work with my approach.

As will be the case with many papers, which is why a complete solution won't be easy. Regards, Joe
prosopagnosia-acrobat.xlsx
prosopagnosia-pdftotext-layout.txt
prosopagnosia-pdftotext-raw.txt