Solved

Extracting information from a PDF

Posted on 2014-12-16
Last Modified: 2014-12-16
Here’s my dilemma: I have a .pdf file with 8 columns of info [name, phone, email address, etc.], and I want to extract all the email addresses. I’m using Nitro to convert it to Excel, but every row ends up in a single cell. I’ve tried saving as .txt and launching the Text Import Wizard, thinking that I could insert column breaks, but nothing is aligned properly. When it’s in .xls format the data is aligned pretty well. Is there a formula I can use to separate out the info I need? Or another trick?
Question by:CTmountainbiker

Accepted Solution

by:Joe Winograd, EE MVE (earned 500 total points)
ID: 40502653
I suggest trying the Xpdf utility called pdftotext. If you use the -layout option, it should keep the column alignment, and then any decent text editor with column (block) selection will let you copy/paste the email column. Here's an EE 5-minute video Micro Tutorial explaining how to download the Xpdf tools:

http://www.experts-exchange.com/VP_213.html

And another 5-minute one explaining pdftotext specifically:

http://www.experts-exchange.com/VP_217.html

If you have any problems, I'll be happy to help. Regards, Joe
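
As an alternative to copying the column by hand, the addresses can be pulled out of the pdftotext output with a short script. This is only a minimal sketch, not part of the original answer: the function names are hypothetical, the regular expression is deliberately simple rather than a full address validator, and it assumes the text file is read as Latin-1, which should be checked against your own output.

```python
import re

# Deliberately simple pattern; it is not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return addresses in order of first appearance, without duplicates."""
    seen = set()
    emails = []
    for match in EMAIL_RE.findall(text):
        key = match.lower()  # treat addresses as case-insensitive
        if key not in seen:
            seen.add(key)
            emails.append(match)
    return emails


def emails_from_file(path, encoding="latin-1"):
    """Read a text file (assumed to be pdftotext output) and extract addresses."""
    with open(path, encoding=encoding) as handle:
        return extract_emails(handle.read())
```

Calling emails_from_file on the .txt file written by pdftotext returns a list that can be printed or saved one address per line. Because the pattern matches anywhere on a line, it does not depend on the columns being perfectly aligned.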

Author Comment

by:CTmountainbiker
ID: 40502894
Downloaded the files, no problem; however, I'm trying to get 'pdftotext.exe'. I extracted the files but can't find the executable; something flashes quickly on screen, and I'm getting an I/O error when running it at the DOS prompt.

Expert Comment

by:Joe Winograd, EE MVE
ID: 40502959
There's only one file to download: <xpdfbin-win-3.04.zip>. Unzip it and you'll see a folder called <bin32> (there's also a <bin64> folder, but you don't need it, not even on 64-bit systems). Inside the <bin32> folder you'll find <pdftotext.exe>, which is not an installer; it is simply a stand-alone, command-line executable. Open a command prompt, navigate to wherever <pdftotext.exe> is, and run the command:

pdftotext -layout c:\folder\pdfinput.pdf c:\folder\textoutput.txt

If you don't specify the output file name, it will default to the same name (and path) as the input PDF, but with a .txt extension. Regards, Joe
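
For anyone who would rather script the conversion than type the command, the same call can be wrapped in Python. This is a rough sketch under stated assumptions: the helper names build_pdftotext_command and convert are hypothetical, and the code assumes pdftotext is on the PATH or that its full path is passed as exe. It uses only the -layout option and the input/output arguments shown in the command above.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path=None, exe="pdftotext"):
    """Return the argument list for a layout-preserving conversion."""
    pdf_path = Path(pdf_path)
    # Mirror the default described above: same name and folder, .txt extension.
    if txt_path is None:
        txt_path = pdf_path.with_suffix(".txt")
    return [exe, "-layout", str(pdf_path), str(txt_path)]


def convert(pdf_path, txt_path=None, exe="pdftotext"):
    """Run pdftotext and return the path of the text file it was asked to write."""
    if shutil.which(exe) is None:
        raise FileNotFoundError(f"{exe} not found; pass its full path via exe=")
    command = build_pdftotext_command(pdf_path, txt_path, exe)
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return Path(command[-1])
```

Passing the arguments as a list rather than one string avoids quoting problems when a folder name contains spaces, which is a common source of syntax errors at the command prompt.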

Author Comment

by:CTmountainbiker
ID: 40503110
Thanks very much!  It was my syntax that was messing it up.

Expert Comment

by:Joe Winograd, EE MVE
ID: 40503136
You're very welcome! I'm glad it worked for you. If you already upvoted my video, thanks! If not, I'd really appreciate it if you click on the upvote arrow under Helpful Votes at the video. Thanks much, Joe

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

The System Center Operations Manager 2012, known as SCOM, is a part of the Microsoft system center product that provides the user with infrastructure monitoring and application performance monitoring. SCOM monitors:   Windows or UNIX/LinuxNetwo…
Article by: Leon
Software Metering within our group of companies has always been an afterthought until auditing of software and licensing became a pain point. Orchestrator and SCCM metering gave us the answer and it was an exciting process.
Viewers will learn how to maximize accessibility options in an Excel workbook for users with accessibility issues.
The viewer will learn how to use a discrete random variable to simulate the return on an investment over a period of years, create a Monte Carlo simulation using the discrete random variable, and create a graph to represent the possible returns over…

816 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

11 Experts available now in Live!

Get 1:1 Help Now