
Xpdf - Convert PDF Files to Plain Text Files - Part 3

Experience Level: Beginner
5:03
Joe Winograd, EE MVE 2015&2016
50+ yrs in computer industry. Everything from programming to sales. OS kernel dev on mainframes. CIO. Document imaging. EE MVE 2015 & 2016.
In this third video of the Xpdf series, we discuss and demonstrate the PDFtoText utility, which converts PDF files into plain text files.

Video Steps

1. Download and install the software.

You may have already downloaded and installed the Xpdf tools while watching the first or second video in the Xpdf series, but if you haven't, then visit the Xpdf website at:

http://www.foolabs.com/xpdf/

Click the Download link and then click the pre-compiled Windows binary ZIP archive to download the Xpdf utilities for Windows.

2. Locate the documentation folder for the Xpdf utilities.

Go to the folder where you unzipped the downloaded ZIP file and find the <doc> folder.

3. Read the documentation for the PDFtoText tool.

Go into the <doc> folder and find the plain text file called <pdftotext.txt>.

Open it with any text editor, such as Notepad, and read it. This is the documentation for the PDFtoText tool.

4. Set up a test folder.

Create a test folder.

Copy <pdftotext.exe> from the unzipped <bin32> folder into your test folder.

Copy a sample PDF file into your test folder (in the video and the screenshots below, the file is called <RMP.pdf>).

5. Set up a command prompt for testing.

Open a command prompt window.

Navigate to your test folder.

Issue a DIR command in the command prompt to be sure that only two files are in it - the PDFtoText executable and the sample PDF file.
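
If you would rather script the check than eyeball the DIR listing, here is a minimal Python sketch of the same idea. The function names (list_files, is_ready) are made up for this illustration and are not part of Xpdf; the sketch also assumes the executable is named pdftotext.exe, as in step 4.

```python
from pathlib import Path


def list_files(folder):
    """Return the sorted names of the files in the test folder."""
    return sorted(p.name for p in Path(folder).iterdir() if p.is_file())


def is_ready(names, exe_name="pdftotext.exe"):
    """True when the folder holds exactly the executable and one PDF."""
    pdfs = [n for n in names if n.lower().endswith(".pdf")]
    return len(names) == 2 and exe_name in names and len(pdfs) == 1
```

Calling is_ready(list_files(r"c:\test")) would perform the same check as the DIR command (c:\test is only a placeholder path).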

6. Run the PDFtoText utility on the sample PDF file.

In the command prompt window, enter the following command:

pdftotext -layout samplefilename.pdf
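
The same command can be issued from a script. The following is a minimal Python sketch, assuming pdftotext is in the current folder or on the PATH; the helper names (build_command, convert) are hypothetical and chosen only for this example.

```python
import subprocess


def build_command(pdf_path, exe="pdftotext"):
    """Return the argument list for the command shown in step 6."""
    return [exe, "-layout", str(pdf_path)]


def convert(pdf_path, exe="pdftotext"):
    """Run pdftotext on one PDF; raises CalledProcessError on failure."""
    subprocess.run(build_command(pdf_path, exe), check=True)
```

For example, convert("RMP.pdf") would run pdftotext -layout RMP.pdf. Passing the arguments as a list avoids quoting problems when the file name contains spaces.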

7. Verify that the text file was created.

Issue a DIR command in the command prompt to show that the text file was created. There should be one text file with the same file name as the PDF file, but with a file type of TXT.
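
A script can make the same check. This is a small Python sketch with hypothetical helper names; it relies only on the behavior described above, namely that the text file has the same name as the PDF with a TXT file type.

```python
from pathlib import Path


def expected_text_file(pdf_path):
    """Return the path pdftotext should have written: same name, .txt type."""
    return Path(pdf_path).with_suffix(".txt")


def text_file_created(pdf_path):
    """True if the expected text file exists next to the PDF."""
    return expected_text_file(pdf_path).is_file()
```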

8. View the text file that was created.

Open the text file with whatever text editor you prefer, such as Notepad or WordPad.

That's it! If you find this video to be helpful, please click the thumbs-up icon below. Thank you for watching!

7 Comments
 
LVL 1

Expert Comment

by:James Powell
Awesome tool!  Thank you for posting this.  Very useful.
 
LVL 57

Author Comment

by:Joe Winograd, EE MVE 2015&2016
You're welcome, James. I'm glad you find it useful. And thanks to you for the comment — authors really appreciate hearing words like that! Regards, Joe
 

Expert Comment

by:Ed Woods
I have a file with bullets (see below). When I run 'pdftotext -layout -enc UTF-8' on this file, the bullets are turned into some weird character.

Is there a way to tell pdftotext to map these characters to '•'  ?

• Provide electronic monitoring of the network, supporting devices and services, including
network devices, peripherals, switches, routers, servers and applications.
• Provide for the management and security of all ONR legacy network accounts.
• Monitor computer and network activity including system load, response time, available disk
 

Expert Comment

by:Ed Woods
And I forgot to say this is a great program. I have been looking for something like this for years.
 
LVL 57

Author Comment

by:Joe Winograd, EE MVE 2015&2016
Hi Ed,
Thanks for joining Experts Exchange yesterday and watching my video — much appreciated! I'm very glad to hear that you think pdftotext is a great program — I'm in full agreement!

I'm guessing that the "weird character" that you're getting instead of the bullet ("•") is this:

•

Right? If so, that's happening because of the -enc UTF-8 parameter on your command line. You probably don't want UTF-8 encoding, unless you have characters that aren't part of the Latin1 character set, such as Chinese, Cyrillic, Eastern European, etc. In any case, the way to fix mapping problems is to create a custom xpdfrc file, which is the configuration file used by all of the Xpdf tools, and have it point to a corrected Unicode mapping file. The documentation for it is in a file called xpdfrc.txt in the doc subfolder of the unpacked download (and there's an example of it called sample-xpdfrc in the same doc subfolder). Here's the relevant section from the documentation, which is copyright 1996-2017 Glyph & Cog, LLC (with this small portion of it being copied here under "Fair Use"):
unicodeMap encoding-name map-file

Specifies the file with mapping from Unicode to encoding-name. These encodings are used for text output (see below). Each line of a unicodeMap file represents a range of one or more Unicode characters which maps linearly to a range in the output encoding:

in-start-hex in-end-hex out-start-hex

Entries for single characters can be abbreviated to:

in-hex out-hex

The in-start-hex and in-end-hex fields (or the single in-hex field) specify the Unicode range. The out-start-hex field (or the out-hex field) specifies the start of the output encoding range. The length of the out-start-hex (or out-hex) string determines the length of the output characters (e.g., UTF-8 uses different numbers of bytes to represent characters in different ranges). Entries must be given in increasing Unicode order. Only one file is allowed per encoding; the last specified file is used. The Latin1, ASCII7, Symbol, ZapfDingbats, UTF-8, and UCS-2 encodings are predefined.
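
To make the line format in that excerpt concrete, here is a rough Python sketch of how such a map could be read and applied. It is an illustration under stated assumptions, not Xpdf's actual code: it assumes every output field has an even number of hex digits (two per output byte), it does not handle comment lines, and the function names are invented for the example.

```python
def parse_unicode_map(text):
    """Parse unicodeMap lines into (in_start, in_end, out_start, n_bytes)."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) == 2:
            # Abbreviated single-character entry: in-hex out-hex
            in_start_hex, out_hex = fields
            in_end_hex = in_start_hex
        elif len(fields) == 3:
            # Range entry: in-start-hex in-end-hex out-start-hex
            in_start_hex, in_end_hex, out_hex = fields
        else:
            raise ValueError("unexpected unicodeMap line: " + line)
        # The length of the output field sets the output character length.
        n_bytes = len(out_hex) // 2
        ranges.append(
            (int(in_start_hex, 16), int(in_end_hex, 16), int(out_hex, 16), n_bytes)
        )
    return ranges


def encode_char(ranges, ch):
    """Map one Unicode character linearly into the output range, or None."""
    cp = ord(ch)
    for in_start, in_end, out_start, n_bytes in ranges:
        if in_start <= cp <= in_end:
            return (out_start + cp - in_start).to_bytes(n_bytes, "big")
    return None
```

With a map containing the line "2022 95", encode_char would turn the bullet character into the single byte x'95'.
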
I've used this method to solve occasional problems, such as unusual Unicode hyphens. The technique is to create an xpdfrc file with the corrected mappings and then refer to that file in the pdftotext.exe call with the -cfg parameter.

In your case, I suspect the character that's being mapped wrong is U+2022, so I would map that to extended ASCII character 149/x'95' in the Latin1 encoding. Attached is a modified Latin1 file for you. It contains the built-in Latin1 mappings, with my hyphen fix (U+2010 mapped to x'2D', a hyphen) and your bullet fix (U+2022 mapped to x'95', a bullet). It is attached as a .txt file. After downloading it, rename it to:

Latin1.unicodeMap

Then create a text file called customxpdfrc.txt with these two lines in it:

unicodeMap Latin1fixed "c:\folder\Latin1.unicodeMap"
textEncoding Latin1fixed

Of course, c:\folder\ may be wherever you want it.

Then run pdftotext with the -cfg parameter, such as:

pdftotext.exe -layout -cfg customxpdfrc.txt input.pdf output.txt
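
If you need to repeat this setup on several machines, the two configuration lines and the command can be generated by a script. The following Python sketch uses the file names from the steps above; the function names are hypothetical and the default encoding name Latin1fixed is simply the one used in this comment.

```python
from pathlib import Path


def xpdfrc_lines(map_path, encoding_name="Latin1fixed"):
    """Return the two lines of the custom xpdfrc file."""
    return [
        f'unicodeMap {encoding_name} "{map_path}"',
        f"textEncoding {encoding_name}",
    ]


def write_xpdfrc(cfg_path, map_path, encoding_name="Latin1fixed"):
    """Write the custom xpdfrc file to cfg_path."""
    text = "\n".join(xpdfrc_lines(map_path, encoding_name)) + "\n"
    Path(cfg_path).write_text(text)


def build_cfg_command(cfg_path, pdf_path, txt_path, exe="pdftotext.exe"):
    """Return the argument list for the -cfg call shown above."""
    return [exe, "-layout", "-cfg", str(cfg_path), str(pdf_path), str(txt_path)]
```

For example, write_xpdfrc("customxpdfrc.txt", r"c:\folder\Latin1.unicodeMap") creates the configuration file, and build_cfg_command("customxpdfrc.txt", "input.pdf", "output.txt") gives the arguments to pass to subprocess.run.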

I tested it here on a PDF with a bullet and it worked fine, but, as always, YMMV, so please let me know if it does or doesn't fix it for you.

Btw, Xpdf's built-in Latin1 encoding maps U+2022 to x'B7', which the Windows Latin1 extended character set calls Middle Dot (aka Georgian Comma). I changed that to x'95' in the attached Latin1 file, but note that x'95' is not in the ISO-8859-1 (Latin-1) specification — it is in the Windows-1252 spec, i.e., code page 1252 (aka WinLatin1).
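
The byte values mentioned in this comment can be checked with Python's built-in codecs. This short sketch only demonstrates the code page facts (it does not call pdftotext), and it prints with ascii() so that the output is safe on any console.

```python
BULLET = "\u2022"  # the bullet character, U+2022

# UTF-8 stores the bullet as three bytes: E2 80 A2.
utf8_bytes = BULLET.encode("utf-8")

# A viewer that assumes Windows-1252 shows those three bytes as three
# separate characters, which is one common source of "weird characters".
misread = utf8_bytes.decode("cp1252")

# x'B7' is Middle Dot in Latin-1.
middle_dot = b"\xb7".decode("latin-1")

# x'95' is a bullet in Windows-1252 ...
win_bullet = b"\x95".decode("cp1252")

# ... but in ISO-8859-1 the same byte is a C1 control character.
iso_95 = b"\x95".decode("latin-1")

for name, value in [
    ("utf8_bytes", utf8_bytes),
    ("misread", misread),
    ("middle_dot", middle_dot),
    ("win_bullet", win_bullet),
    ("iso_95", iso_95),
]:
    print(name, ascii(value))
```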

To give credit where credit is due, my thanks to Derek Noonburg of Glyph & Cog, who explained this method to me when I first ran into the hyphen issue.

Regards, Joe
Latin1.txt
 

Expert Comment

by:fammi farendra
It works great for regular text.
But I have a problem when exporting a table inside a PDF,
especially when some cells in that table are empty (contain no number or text).
0
 
LVL 57

Author Comment

by:Joe Winograd, EE MVE 2015&2016
Hi Fammi,
Thanks for joining Experts Exchange today and watching my video Micro Tutorial — welcome aboard!

Two thoughts for you. First, my video talks about Version 3.04, which was the latest version at the time. There is now Version 4.00, which has some improvements and fixes in it. If you're on 3.04, I recommend upgrading to 4.00.

Second, have you tried the -table option? This is what it does (copied here from the doc file under "Fair Use"):
Table mode is similar to physical layout mode, but optimized for tabular data, with the goal of keeping rows and columns aligned (at the expense of inserting extra whitespace). If the -fixed option is given, character spacing within each line will be determined by the specified character pitch.

That doc refers to the -fixed option, which takes a number as its parameter and is described in the doc as follows ("Fair Use" again):
Specify the character pitch (character width), in points, for physical layout, table, or line printer mode. This is ignored in all other modes.
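
As a rough illustration of how the two options combine on the command line, here is a small Python sketch that builds the argument list. The helper name is hypothetical, and any pitch value you pass is something to experiment with, not a recommendation from the documentation.

```python
def build_table_command(pdf_path, txt_path, fixed_pitch=None, exe="pdftotext"):
    """Return the pdftotext arguments for table mode, optionally with -fixed."""
    args = [exe, "-table"]
    if fixed_pitch is not None:
        # -fixed takes the character pitch, in points, as its parameter.
        args += ["-fixed", str(fixed_pitch)]
    args += [str(pdf_path), str(txt_path)]
    return args
```

For example, build_table_command("input.pdf", "output.txt") corresponds to pdftotext -table input.pdf output.txt.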

If those params don't help, please attach a sample PDF with the problem and I'll see what I can do with it (make sure the sample PDF has no private/sensitive information in it). Regards, Joe
