  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 111

Extracting information from a PDF

Here’s my dilemma: I have a .pdf file with 8 columns of info (name, phone, email address, etc.), and I want to extract all the email addresses. I’m using Nitro to convert it to Excel, but every row ends up in a single cell. I’ve also tried saving as .txt and launching the Import Text Wizard, thinking I could insert column breaks, but nothing is aligned properly. In .xls format the data is aligned pretty well. Is there a formula I can use to separate out the info I need? Or another trick?
Asked by: CTmountainbiker
1 Solution
 
Joe Winograd, EE MVE 2015 & 2016, Developer, commented:
I suggest trying the Xpdf utility called pdftotext. If you use the -layout parameter, it should keep the column alignment and then any decent text editor will allow you to copy/paste the email column. Here's an EE 5-minute video Micro Tutorial explaining how to download the Xpdf tools:

http://www.experts-exchange.com/VP_213.html

And another 5-minute one explaining pdftotext specifically:

http://www.experts-exchange.com/VP_217.html

If you have any problems, I'll be happy to help. Regards, Joe
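
Once pdftotext has produced the text file, copying the email column by hand is one option; a short script is another. Below is a minimal Python sketch, not part of the original answer, that pulls anything shaped like an email address out of the text. The function name `extract_emails`, the sample rows, and the regular expression are all assumptions for illustration, and the pattern is deliberately simple, so it will not match every valid address.

```python
import re

# Deliberately simple pattern (an assumption); it will not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(layout_text):
    """Return email addresses in order of appearance, skipping case-insensitive duplicates."""
    seen = set()
    emails = []
    for match in EMAIL_RE.findall(layout_text):
        key = match.lower()
        if key not in seen:
            seen.add(key)
            emails.append(match)
    return emails


# Hypothetical rows resembling what "pdftotext -layout" might write.
sample = (
    "Name          Phone          Email\n"
    "Ann Lee       555-0100       ann.lee@example.com\n"
    "Bob Ray       555-0101       bob@example.org\n"
)
print(extract_emails(sample))
```

To use it on a real file, read the text output into a string first and pass that string to `extract_emails`. Because the pattern searches the whole text, it does not depend on the columns lining up perfectly.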
 
CTmountainbiker (Author) commented:
Downloaded the files with no problem; however, I'm trying to get to 'pdftotext.exe'. I extract the files but can't find the executable; something flashes quickly on screen, and I'm getting an I/O error when running it at the DOS prompt.
 
Joe Winograd, EE MVE 2015 & 2016, Developer, commented:
There's only one file to download — <xpdfbin-win-3.04.zip>. Unzip it and you'll see a folder called <bin32> (there's also a <bin64> folder, but you don't need it, not even on 64-bit systems). Inside the <bin32> folder you'll find <pdftotext.exe>, which is not an installer — it is simply a stand-alone, command line executable. Open up a command prompt, navigate to wherever <pdftotext.exe> is, and run the command:

pdftotext -layout c:\folder\pdfinput.pdf c:\folder\textoutput.txt

If you don't specify the output file name, it will default to the same name (and path) as the input PDF, but with a file type of TXT. Regards, Joe
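
For anyone who would rather script the conversion than type the command each time, here is a minimal Python sketch, not part of the original answer. It assumes pdftotext.exe is on the PATH (or that its full path is passed as `exe`); the helper names `default_output_path`, `build_command`, and `convert` are hypothetical. The default output name simply mirrors the behavior described above: same path and name as the PDF, with a .txt extension.

```python
import subprocess
from pathlib import PureWindowsPath


def default_output_path(pdf_path):
    """Mirror the default described above: same path and name, .txt type."""
    return str(PureWindowsPath(pdf_path).with_suffix(".txt"))


def build_command(pdf_path, txt_path=None, exe="pdftotext"):
    """Build the argument list for a -layout conversion."""
    if txt_path is None:
        txt_path = default_output_path(pdf_path)
    return [exe, "-layout", pdf_path, txt_path]


def convert(pdf_path, txt_path=None, exe="pdftotext"):
    """Run the conversion; raises CalledProcessError on a non-zero exit code."""
    cmd = build_command(pdf_path, txt_path, exe)
    subprocess.run(cmd, check=True)
    return cmd[-1]  # path of the text file that was requested


# Only prints the command; call convert(...) to actually run pdftotext.
print(build_command(r"c:\folder\pdfinput.pdf"))
```

Passing the arguments as a list avoids quoting problems when a folder name contains spaces, which is a common source of syntax errors at the command prompt.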
 
CTmountainbiker (Author) commented:
Thanks very much!  It was my syntax that was messing it up.
 
Joe Winograd, EE MVE 2015 & 2016, Developer, commented:
You're very welcome! I'm glad it worked for you. If you already upvoted my video, thanks! If not, I'd really appreciate it if you click on the upvote arrow under Helpful Votes at the video. Thanks much, Joe
