I have a series of PDF files (95% electronically created, not scanned) containing scientific articles, and for a non-commercial bibliographic database we need:
1. to extract all references (i.e. bibliography, works cited) from each one. There is no fixed format, although most use the author's surname-comma-name format (sometimes with several authors), then the year (sometimes in parentheses), then the title;
2. to compare the extracted data with a list of previously selected articles and authors, so I can see whether each PDF article cites one or more of the titles in the list;
3. to present those citations in an easily readable way, such as an Excel table with a column for the citing document and more columns for the cited ones.
Doing all these steps manually would take years, so I was wondering if any expert here could help us.
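To make the goal concrete, here is a rough Python sketch of the three steps as I imagine them. It is only an illustration under several assumptions: the text of each reference list has already been pulled out of the PDF by some other tool and split into one string per reference; the regular expression only covers the common "Surname, Name (Year). Title." layout; and all function names, the sample file names and the 0.85 similarity threshold are made up for this example. It writes CSV, which Excel can open.

```python
import csv
import re
from difflib import SequenceMatcher

# Hypothetical pattern: authors, then a four-digit year (optionally in
# parentheses), then the title up to the next full stop.
REF_PATTERN = re.compile(
    r"^(?P<authors>.+?)\s*\(?(?P<year>(?:19|20)\d{2})[a-z]?\)?[.,:]?\s*(?P<title>[^.]+)"
)


def normalize(text):
    """Lowercase and replace punctuation with spaces so titles compare loosely."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def parse_reference(line):
    """Step 1: split one reference string into authors, year and title."""
    match = REF_PATTERN.match(line.strip())
    if match is None:
        return None  # layout not covered by this simple pattern
    return {
        "authors": match["authors"].strip(" .,"),
        "year": match["year"],
        "title": match["title"].strip(),
    }


def best_match(title, watch_list, threshold=0.85):
    """Step 2: return the most similar watch-list title, or None if too different."""
    scored = [
        (SequenceMatcher(None, normalize(title), normalize(wanted)).ratio(), wanted)
        for wanted in watch_list
    ]
    score, winner = max(scored, default=(0.0, None))
    return winner if score >= threshold else None


def citation_rows(references_by_doc, watch_list):
    """Step 3: one row per citing document, followed by the cited watch-list titles."""
    rows = []
    for doc, ref_lines in references_by_doc.items():
        cited = []
        for line in ref_lines:
            ref = parse_reference(line)
            if ref is None:
                continue
            hit = best_match(ref["title"], watch_list)
            if hit is not None and hit not in cited:
                cited.append(hit)
        rows.append([doc] + cited)
    return rows


def write_csv(rows, handle):
    """Write the rows to an open text file handle as CSV."""
    csv.writer(handle).writerows(rows)
```

For example, calling `citation_rows({"paper_a.pdf": [...reference strings...]}, ["Deep Sea Ecology"])` and passing the result to `write_csv` with a file opened via `open("citations.csv", "w", newline="")` would give one line per PDF. The part I do not know how to do reliably is getting clean reference strings out of the PDFs in the first place, and handling the formats this pattern misses.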
I have a related question that I will post separately.