I have a zillion articles in PDF form that were downloaded from an academic database. In all of them, the first page is a title page with the title, author, and publication info. The rest varies depending on the publication it came from.
I want to generate a text file with those fields for all of the articles. For example:
Jones, Paul. "yada yada." THE WALL STREET JOURNAL, December 1, 2008.
Smith, John. "blah blah." THE ECONOMIST, June 1, 2006.
and so on.
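To pin down the output format, here is a minimal sketch in Python of how one line would be built once the fields are known. It covers only the formatting step, not the PDF extraction, which is the part I'm asking about. The `Article` class, its field names, and `format_citation` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Article:
    # Hypothetical container for the title-page fields.
    author: str       # already in "Last, First" form
    title: str
    publication: str
    date: str         # kept as plain text, e.g. "December 1, 2008"


def format_citation(article: Article) -> str:
    """Build one output line in the format of the examples above."""
    return (
        f'{article.author}. "{article.title}." '
        f'{article.publication.upper()}, {article.date}.'
    )


if __name__ == "__main__":
    sample = Article("Jones, Paul", "yada yada",
                     "The Wall Street Journal", "December 1, 2008")
    print(format_citation(sample))
```

Each article would contribute one such line to the text file.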
My ultimate goal is to generate an XML file for each so that I can import this info into a reference manager (e.g., EndNote), but for now I just want to extract this info.
Any suggestions on tools or strategies for writing such a script? I'd prefer .NET, but am open to whatever works.