Solved

Extracting information from a PDF

Posted on 2014-12-16
Last Modified: 2014-12-16
Here’s my dilemma…I have a .pdf file with 8 columns of info [name, phone, email address, etc]. I want to extract all the email addresses. I’m using Nitro to convert to Excel but every row ends up in one cell. I’ve tried saving as .txt and launching the Import text wizard thinking that I could insert column breaks but nothing is aligned properly. When it’s in .xls format the data is aligned pretty well…is there a formula I can use to segregate the info I need? Or another trick?
Question by:CTmountainbiker
 
LVL 51

Accepted Solution

by:
Joe Winograd, EE MVE earned 500 total points
ID: 40502653
I suggest trying the Xpdf utility called pdftotext. If you use the -layout parameter, it should keep the column alignment and then any decent text editor will allow you to copy/paste the email column. Here's an EE 5-minute video Micro Tutorial explaining how to download the Xpdf tools:

http://www.experts-exchange.com/VP_213.html

And another 5-minute one explaining pdftotext specifically:

http://www.experts-exchange.com/VP_217.html

If you have any problems, I'll be happy to help. Regards, Joe
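Once the text file exists, copying the column by hand is not the only option. As a minimal sketch (not part of the original answer), a short Python script can pull the addresses out with a regular expression. The pattern is deliberately simple and will not cover every valid address, and the sample rows are made up for illustration:

```python
import re

# Deliberately simple pattern: good enough for typical addresses,
# not a full implementation of the email address grammar.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return the unique email addresses in text, in first-seen order."""
    seen = set()
    result = []
    for match in EMAIL_RE.findall(text):
        key = match.lower()  # treat addresses as case-insensitive duplicates
        if key not in seen:
            seen.add(key)
            result.append(match)
    return result


# Hypothetical rows, shaped like column-aligned pdftotext -layout output.
sample = (
    "Jane Doe      555-0100   jane.doe@example.com     Hartford\n"
    "John Smith    555-0101   jsmith@example.org       Danbury\n"
)
print(extract_emails(sample))
```

To use it on a real file, read the text produced by pdftotext (for example with `open(path, encoding="utf-8", errors="replace").read()`) and pass it to `extract_emails`. Because the regex does not depend on column positions, it should still work if the alignment is imperfect.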
 

Author Comment

by:CTmountainbiker
ID: 40502894
Downloaded the files with no problem; however, I'm trying to get 'pdftotext.exe'. I extract the files but can't find the executable; something flashes quickly on screen, and I'm getting an I/O error when running it at the DOS prompt.
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
ID: 40502959
There's only one file to download — <xpdfbin-win-3.04.zip>. Unzip it and you'll see a folder called <bin32> (there's also a <bin64> folder, but you don't need it, not even on 64-bit systems). Inside the <bin32> folder you'll find <pdftotext.exe>, which is not an installer — it is simply a stand-alone, command line executable. Open up a command prompt, navigate to wherever <pdftotext.exe> is, and run the command:

pdftotext -layout c:\folder\pdfinput.pdf c:\folder\textoutput.txt

If you don't specify the output file name, it will default to the same name (and path) as the input PDF, but with a file type of TXT. Regards, Joe
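If the conversion needs to be repeated for several PDFs, the same command can be driven from a script. The following is only a sketch, assuming Python is installed and that `pdftotext` is on the PATH (otherwise pass the full path to the executable as `exe`). The function names are hypothetical; the only pdftotext option used is `-layout`, as in the command above:

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path=None, exe="pdftotext"):
    """Build the argument list for: pdftotext -layout input.pdf [output.txt]."""
    cmd = [exe, "-layout", str(pdf_path)]
    if txt_path is not None:
        cmd.append(str(txt_path))
    return cmd


def convert(pdf_path, txt_path=None, exe="pdftotext"):
    """Run pdftotext and return the path where the text is expected.

    With no txt_path, the tool is described above as writing next to the
    input PDF with a TXT file type, so that path is returned.
    """
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, exe), check=True)
    if txt_path is not None:
        return Path(txt_path)
    return Path(pdf_path).with_suffix(".txt")


print(build_pdftotext_command("pdfinput.pdf", "textoutput.txt"))
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces, and `check=True` raises an error if pdftotext exits with a failure code instead of failing silently.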
 

Author Comment

by:CTmountainbiker
ID: 40503110
Thanks very much!  It was my syntax that was messing it up.
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
ID: 40503136
You're very welcome! I'm glad it worked for you. If you already upvoted my video, thanks! If not, I'd really appreciate it if you click on the upvote arrow under Helpful Votes at the video. Thanks much, Joe
