Xpdf - Convert PDF Files to Plain Text Files - Part 3

Experience Level: Beginner
Video length: 5:03
Joe Winograd, EE MVE 2015&2016
50+ yrs in computer industry. Everything from programming to sales. OS kernel dev on mainframes. CIO. Document imaging. EE MVE 2015 & 2016.
In this third video of the Xpdf series, we discuss and demonstrate the PDFtoText utility, which converts PDF files into plain text files.

Video Steps

1. Download and install the software.

You may have already downloaded and installed the Xpdf tools while watching the first or second video in the Xpdf series, but if you haven't, then visit the Xpdf website at:

http://www.foolabs.com/xpdf/

Click the Download link and then click the pre-compiled Windows binary ZIP archive to download the Xpdf utilities for Windows.

2. Locate the documentation folder for the Xpdf utilities.

Go to the folder where you unzipped the downloaded ZIP file and find the <doc> folder.

3. Read the documentation for the PDFtoText tool.

Go into the <doc> folder and find the plain text file called <pdftotext.txt>.

Open it with any text editor, such as Notepad, and read it. This is the documentation for the PDFtoText tool.

4. Set up a test folder.

Create a test folder.

Copy <pdftotext.exe> from the unzipped <bin32> folder into your test folder.

Copy a sample PDF file into your test folder (in the video and the screenshots below, the file is called <RMP.pdf>).

5. Set up a command prompt for testing.

Open a command prompt window.

Navigate to your test folder.

Issue a DIR command in the command prompt to be sure that only two files are in it - the PDFtoText executable and the sample PDF file.

6. Run the PDFtoText utility on the sample PDF file.

In the command prompt window, enter the following command:

pdftotext -layout samplefilename.pdf
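
For anyone who would rather script this step than type it, here is a minimal Python sketch of the same call. The function names are made up for illustration, and it assumes pdftotext is in the current folder or on the PATH. The only option used is -layout, as in the command above.

```python
import subprocess


def build_pdftotext_command(pdf_path, exe="pdftotext"):
    """Return the argument list for the command shown in step 6."""
    return [exe, "-layout", str(pdf_path)]


def run_pdftotext(pdf_path, exe="pdftotext"):
    """Run the command and raise CalledProcessError if pdftotext reports failure."""
    subprocess.run(build_pdftotext_command(pdf_path, exe), check=True)
```

Calling run_pdftotext("RMP.pdf") from the test folder would then do the same thing as the typed command.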

7. Verify that the text file was created.

Issue a DIR command in the command prompt to show that the text file was created. There should be one text file with the same file name as the PDF file, but with a file type of TXT.
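
The same check can be scripted. The sketch below is a hypothetical helper, not part of Xpdf. It relies only on the naming rule described above (same file name, file type TXT) and lists any text files that are still missing from a folder.

```python
from pathlib import Path


def missing_text_files(folder):
    """List the .txt files that should sit next to each .pdf but are absent."""
    folder = Path(folder)
    expected = (pdf.with_suffix(".txt") for pdf in folder.glob("*.pdf"))
    return sorted(txt.name for txt in expected if not txt.exists())
```

An empty list means every PDF in the folder has a matching text file.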

8. View the text file that was created.

Open the text file with whatever text editor you prefer, such as Notepad or WordPad.

That's it! If you find this video to be helpful, please click the thumbs-up icon below. Thank you for watching!
5 Comments
 
LVL 1

Expert Comment

by:James Powell
Awesome tool!  Thank you for posting this.  Very useful.
 
LVL 56

Author Comment

by:Joe Winograd, EE MVE 2015&2016
You're welcome, James. I'm glad you find it useful. And thanks to you for the comment — authors really appreciate hearing words like that! Regards, Joe
 

Expert Comment

by:Ed Woods
I have a file with bullets (see below). When I run 'pdftotext -layout -enc UTF-8' on this file, the bullets are turned into some weird character.

Is there a way to tell pdftotext to map these characters to '•'  ?

• Provide electronic monitoring of the network, supporting devices and services, including
network devices, peripherals, switches, routers, servers and applications.
• Provide for the management and security of all ONR legacy network accounts.
• Monitor computer and network activity including system load, response time, available disk
 

Expert Comment

by:Ed Woods
And I forgot to say this is a great program. I have been looking for something like this for years.
 
LVL 56

Author Comment

by:Joe Winograd, EE MVE 2015&2016
Hi Ed,
Thanks for joining Experts Exchange yesterday and watching my video — much appreciated! I'm very glad to hear that you think pdftotext is a great program — I'm in full agreement!

I'm guessing that the "weird character" that you're getting instead of the bullet ("•") is this:

â€¢

Right? If so, that's happening because of the -enc UTF-8 parameter on your command line. You probably don't want UTF-8 encoding, unless you have characters that aren't part of the Latin1 character set, such as Chinese, Cyrillic, Eastern European, etc. In any case, the way to fix mapping problems is to create a custom xpdfrc file, which is the configuration file used by all of the Xpdf tools, and have it point to a corrected Unicode mapping file. The documentation for it is in a file called xpdfrc.txt in the doc subfolder of the unpacked download (and there's an example of it called sample-xpdfrc in the same doc subfolder). Here's the relevant section from the documentation, which is copyright 1996-2017 Glyph & Cog, LLC (with this small portion of it being copied here under "Fair Use"):
unicodeMap encoding-name map-file

Specifies the file with mapping from Unicode to encoding-name. These encodings are used for text output (see below). Each line of a unicodeMap file represents a range of one or more Unicode characters which maps linearly to a range in the output encoding:

in-start-hex in-end-hex out-start-hex

Entries for single characters can be abbreviated to:

in-hex out-hex

The in-start-hex and in-end-hex fields (or the single in-hex field) specify the Unicode range. The out-start-hex field (or the out-hex field) specifies the start of the output encoding range. The length of the out-start-hex (or out-hex) string determines the length of the output characters (e.g., UTF-8 uses different numbers of bytes to represent characters in different ranges). Entries must be given in increasing Unicode order. Only one file is allowed per encoding; the last specified file is used. The Latin1, ASCII7, Symbol, ZapfDingbats, UTF-8, and UCS-2 encodings are predefined.
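
To make the line format in the quoted documentation concrete, here is a rough Python sketch that expands one unicodeMap line into individual mappings. It is one reading of the description above, not Xpdf's own parser. The function name is hypothetical, and treating the output value as a big-endian number that increases by one per character is an assumption.

```python
def expand_map_line(line):
    """Expand one unicodeMap line into {Unicode code point: output bytes}."""
    fields = line.split()
    if len(fields) == 2:        # abbreviated form: in-hex out-hex
        in_start, out_hex = fields
        in_end = in_start
    elif len(fields) == 3:      # range form: in-start-hex in-end-hex out-start-hex
        in_start, in_end, out_hex = fields
    else:
        raise ValueError(f"expected 2 or 3 hex fields, got {len(fields)}")

    first, last = int(in_start, 16), int(in_end, 16)
    # The length of the output hex string decides how many bytes are written.
    width = (len(out_hex) + 1) // 2
    out_first = int(out_hex, 16)
    # Assumption: "maps linearly" means the output value steps up with the input.
    return {
        cp: (out_first + (cp - first)).to_bytes(width, "big")
        for cp in range(first, last + 1)
    }
```

Under these assumptions, the single-character line "2022 95" maps U+2022 to the one byte x'95'.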
I've used this method to solve occasional problems, such as unusual Unicode hyphens. The technique is to create an xpdfrc file with the corrected mappings and then refer to that file in the pdftotext.exe call with the -cfg parameter.

In your case, I suspect the character that's being mapped wrong is U+2022, so I would map that to extended ASCII character 149/x'95' in the Latin1 encoding. Attached is a modified Latin1 file for you. It contains the built-in Latin1 mappings, with my hyphen fix (U+2010 mapped to x'2D', a hyphen) and your bullet fix (U+2022 mapped to x'95', a bullet). It is attached as a .txt file. After downloading it, rename it to:

Latin1.unicodeMap

Then create a text file called customxpdfrc.txt with these two lines in it:

unicodeMap Latin1fixed "c:\folder\Latin1.unicodeMap"
textEncoding Latin1fixed

Of course, c:\folder\ may be wherever you want it.

Then run pdftotext with the -cfg parameter, such as:

pdftotext.exe -layout -cfg customxpdfrc.txt input.pdf output.txt
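
If you need to repeat this setup on several machines, the two steps above can be scripted. This is a small Python sketch with hypothetical helper names. It writes the same two configuration lines and builds the same command line, using only the unicodeMap, textEncoding, -layout and -cfg items shown above.

```python
from pathlib import Path


def write_custom_xpdfrc(cfg_path, map_path, encoding_name="Latin1fixed"):
    """Write the two-line configuration file and return the lines written."""
    lines = [
        f'unicodeMap {encoding_name} "{map_path}"',
        f"textEncoding {encoding_name}",
    ]
    Path(cfg_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def build_command(cfg_path, pdf_path, txt_path, exe="pdftotext.exe"):
    """Return the argument list for the -cfg call shown above."""
    return [exe, "-layout", "-cfg", str(cfg_path), str(pdf_path), str(txt_path)]
```

The list returned by build_command can be passed to subprocess.run to perform the conversion.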

I tested it here on a PDF with a bullet and it worked fine, but, as always, YMMV, so please let me know if it does or doesn't fix it for you.

Btw, Xpdf's built-in Latin1 encoding maps U+2022 to x'B7', which the Windows Latin1 extended character set calls Middle Dot (aka Georgian Comma). I changed that to x'95' in the attached Latin1 file, but note that x'95' is not in the ISO-8859-1 (Latin-1) specification — it is in the Windows-1252 spec, i.e., code page 1252 (aka WinLatin1).
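
The byte values discussed in this comment can be checked with a few lines of Python, using only the standard codecs. The sketch assumes the "weird character" came from viewing UTF-8 output in a program that reads it as Windows-1252. That is a guess about the viewer, not something the thread confirms.

```python
import unicodedata

BULLET = "\u2022"  # the bullet character, U+2022

# With -enc UTF-8 the bullet is written as three bytes.
utf8_bytes = BULLET.encode("utf-8")

# Read as Windows-1252, those three bytes show up as three separate characters.
seen_as_cp1252 = utf8_bytes.decode("cp1252")

# x'95' is a bullet in Windows-1252 but a control character in ISO-8859-1.
cp1252_95 = b"\x95".decode("cp1252")
latin1_95 = b"\x95".decode("latin-1")

# x'B7' is the middle dot in both encodings.
latin1_b7 = b"\xb7".decode("latin-1")

for label, text in [("cp1252 x'95'", cp1252_95), ("latin-1 x'B7'", latin1_b7)]:
    print(label, unicodedata.name(text))
```

This is why the modified mapping only produces a visible bullet when the output is viewed as code page 1252.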

To give credit where credit is due, my thanks to Derek Noonburg of Glyph & Cog, who explained this method to me when I first ran into the hyphen issue.

Regards, Joe
Latin1.txt
