Automate Opening PDFs and Saving Them to Text

Frank Bryant
I was handed an odd project and am looking for a solution that will open a PDF and save it as text, with no manipulation of the PDF, then open the next PDF and repeat the process until all PDFs within a given folder have been saved as text.

What are my options?

Joe Winograd, Developer
Fellow 2017
Most Valuable Expert 2018

Commented:
Hi Frank,
I suggest using Xpdf's PDFtoText to save the PDF as text. These five-minute EE video Micro Tutorials explain how to download the utilities and use PDFtoText:

Xpdf - Command Line Utilities for PDF Files
Xpdf - PDFtoText - Command Line Utility to Convert PDF Files to Plain Text Files

You would write a simple program/script that loops through all the PDFs in the specified folder. I would write it in AutoHotkey, but you may, of course, use whatever language you prefer.

If you wind up liking PDFtoText, you may also like the other tools in the Xpdf library. Here are other five-minute EE video Micro Tutorials for all of the utilities:

Xpdf - PDFimages - Command Line Utility to Extract Images from PDF Files
Xpdf - PDFinfo - Command Line Utility to Retrieve Page Count and Other Information from PDF Files
Xpdf - PDFdetach - Command Line Utility to Detach Attachments from PDF Files
Xpdf - PDFtoPNG - Command Line Utility to Convert a Multi-page PDF File into Separate PNG Files
Xpdf - PDFfonts - Command Line Utility to List Fonts Used in a PDF File
Xpdf - PDFtoHTML - Command Line Utility to Convert a PDF File to HTML
Xpdf - PDFtoPPM - Command Line Utility to Convert a PDF File to PPM, PGM, PBM
Xpdf - PDFtoPS - Command Line Utility to Convert a PDF File to PS (PostScript)
xpdfrc - Configuration File for All Xpdf Utilities

Regards, Joe

Edit: I'm leaving my office now for a few hours, and since I haven't heard back from you, I decided to send along an AutoHotkey script that does what you want. It processes all the PDF files in the specified folder, creating a TXT file with the same file name as each PDF file:

SourceFolder:="c:\MyPDFs\" ; set this to whatever folder you want
PDFtoTextEXE:="c:\Xpdf\bin32\pdftotext.exe" ; set this to wherever you put the tool
Errors:=""
NumConverted:=0
Loop,Files,%SourceFolder%*.pdf
{
  PDFfile:=A_LoopFilePath
  RunWait,"%PDFtoTextEXE%" -layout "%PDFfile%",,Hide
  If (ErrorLevel!=0)
    Errors:=Errors . PDFfile . "`n"
  Else
    NumConverted:=NumConverted+1
}
If (Errors="")
  Errors:="None"
MsgBox,4096,Done Converting,Number converted: %NumConverted%`n`nErrors converting:`n%Errors%
ExitApp

I trust from the descriptive variable names and the straightforward code that you can understand the script (I don't know if you've had prior AutoHotkey experience). Note that I used the -layout output format option, which generally works well, but you may need to experiment with different output formats depending on your particular PDFs. The other options are:

-lineprinter
-raw
-simple
-table

Their meanings are defined in the documentation (pdftotext.txt in the downloaded doc folder). You may also specify no formatting option, implying the default output, which I've found also works well on most PDFs.
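
For readers who would rather not use AutoHotkey, here is a minimal Python sketch of the same loop. It is an illustration, not a tested replacement for the script above: the function names `build_command` and `convert_folder` are hypothetical, the executable path is simply the one assumed in the AutoHotkey script, and the `run` parameter exists only so the loop can be exercised without the tool installed. It relies on pdftotext writing a TXT file next to the PDF when no output file is named.

```python
import subprocess
from pathlib import Path

# Assumed location of the Xpdf tool (same as in the AutoHotkey script); adjust as needed.
PDFTOTEXT = r"c:\Xpdf\bin32\pdftotext.exe"


def build_command(pdf_path, layout_option="-layout", exe=PDFTOTEXT):
    """Return the pdftotext argument list for one PDF.

    With no output file named, pdftotext writes <name>.txt next to the PDF.
    Pass layout_option=None to use the tool's default output format.
    """
    command = [exe]
    if layout_option:
        command.append(layout_option)
    command.append(str(pdf_path))
    return command


def convert_folder(folder, layout_option="-layout", run=subprocess.run):
    """Convert every *.pdf in folder; return (number converted, names that failed)."""
    converted, errors = 0, []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        result = run(build_command(pdf_path, layout_option))
        if result.returncode == 0:
            converted += 1
        else:
            # A non-zero exit code means pdftotext reported a problem with this file.
            errors.append(pdf_path.name)
    return converted, errors
```

A call such as `converted, errors = convert_folder(r"c:\MyPDFs", "-raw")` would try a different output format on the whole folder. If the executable path is wrong, `subprocess.run` raises `FileNotFoundError` rather than returning an exit code.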
David Favor, Fractional CTO
Distinguished Expert 2018

Commented:
Easy ways to do this...

1) Use LibreOffice in headless mode to convert PDFs to many different output formats.

2) Or, if you have many files, skip the open entirely, since opening a PDF is a very heavy operation, and do a direct conversion using the Poppler tools, which are available on all Linux distros; there's likely a Windows port as well.

The syntax I use for this...

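# The "-" output argument sends the text to stdout so the redirect can capture it;
# without it, pdftotext writes its own .txt file and the redirected file stays empty.
pdftotext -enc ASCII7 -nopgbrk -layout "$file" - > "$file.txt"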

Commented:
David Favor,

Thanks. We were looking into other software alternatives; see my comments below.



Joe Winograd,

Thanks for the info. The code you provided pointed me in the right direction; rewording my searches led me to the code below. All you need to do is install Acrobat Pro and then select the VBA reference in Access. We found a coworker who had Acrobat Pro and ran a couple of tests on PDFs they had, and it worked. So I have submitted a request for Acrobat Pro, and now I wait.

Function LoadPDFSaveToText(SomeUser As String, MyPDFsCanBeFoundHere As String, ThePDFFileNameIs As String, TheOutputTextFileNameIs As String)
    ' Open the PDF in Acrobat and save it as plain text
    
    Dim AcroXApp As Object
    Dim AcroXAVDoc As Object
    Dim AcroXPDDoc As Object
    
    Dim PDF_PATH As String
    Dim OUTPUT_PATH As String
    
    
    PDF_PATH = "C:\Users\" & SomeUser & "\" & MyPDFsCanBeFoundHere
    OUTPUT_PATH = "C:\Users\" & SomeUser & "\" & MyPDFsCanBeFoundHere
    
    Set AcroXApp = CreateObject("AcroExch.App")
    AcroXApp.Hide
    
    Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
    AcroXAVDoc.Open PDF_PATH & ThePDFFileNameIs, "Acrobat"
    
    AcroXAVDoc.BringToFront
    
    Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
    
    Dim jsObj As Object
    Set jsObj = AcroXPDDoc.GetJSObject
    
    jsObj.SaveAs OUTPUT_PATH & TheOutputTextFileNameIs, "com.adobe.acrobat.plain-text"
    
    AcroXAVDoc.Close False
    AcroXApp.Hide
    AcroXApp.Exit

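    ' Note: ThisTheCurrentProjectPath and TheTextFileNameToProcessIs are not declared in this
    ' function; they are assumed to be module-level String variables that are set elsewhere.
    Call LoadUltraEditAndRunMacros(ThisTheCurrentProjectPath, TheTextFileNameToProcessIs, SomeUser)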

End Function


Function LoadUltraEditAndRunMacros(ThisTheCurrentProjectPath As String, TheTextFileNameToProcessIs As String, SomeUser As String)
    ' This Loads the Text File(s) and runs the Ultraedit (UE) Housekeeping Macros on it
    
    Dim TheUEMacroPathIs As String
    Dim TheUECommandToExecuteIs As String
    Dim TheArgumentsToPassAre As String
    Dim UE_Loop As Integer
    
    
    TheUEMacroPathIs = "C:\Users\" & SomeUser & "\Documents\"
    
    TheUECommandToExecuteIs = "C:\Program Files\IDM Computer Solutions\UltraEdit\Uedit64.exe"
    
    For UE_Loop = 1 To 4
        Select Case UE_Loop
            Case Is = 1, 2, 4
                Select Case UE_Loop
                    Case Is = 4
                        TheArgumentsToPassAre = " /fni " & ThisTheCurrentProjectPath & "\" & TheTextFileNameToProcessIs & ".txt /m,e=" & Chr(34) & TheUEMacroPathIs & "UE_Macro_" & Format(UE_Loop - 1, "00") & ".mac" & Chr(34)
                    Case Else
                        TheArgumentsToPassAre = " /fni " & ThisTheCurrentProjectPath & "\" & TheTextFileNameToProcessIs & ".txt /m,e=" & Chr(34) & TheUEMacroPathIs & "UE_Macro_" & Format(UE_Loop, "00") & ".mac" & Chr(34)
                End Select
            Case Else
                TheArgumentsToPassAre = " /fni " & ThisTheCurrentProjectPath & "\" & TheTextFileNameToProcessIs & ".txt /m,e,5=" & Chr(34) & TheUEMacroPathIs & "UE_Spaces.mac" & Chr(34)
        End Select
    
        ' Show Errors - for testing purposes
        ' Call Shell(TheUECommandToExecuteIs & TheArgumentsToPassAre, vbMaximizedFocus)
        
        ' Launch UltraEdit in a normal window (1 = vbNormalFocus); Shell does not wait for it to finish
        Call Shell(TheUECommandToExecuteIs & TheArgumentsToPassAre, 1)
        
        DoEvents
    Next
    
End Function
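
The nested Select Case above is easier to follow when written out as a plain mapping from pass number to macro file. The Python sketch below only illustrates that mapping and the argument string it produces; it does not launch UltraEdit. The function names `macro_for_pass` and `build_arguments` are hypothetical, and the switches are copied verbatim from the VBA code without any claim about what they mean to UltraEdit.

```python
def macro_for_pass(pass_number):
    """Return (switch, macro file name) for passes 1 to 4, mirroring the VBA Select Case."""
    if pass_number == 3:
        # The third pass uses UE_Spaces.mac with the "/m,e,5=" switch.
        return "/m,e,5=", "UE_Spaces.mac"
    # Passes 1 and 2 use macros 01 and 02; pass 4 uses macro 03.
    macro_number = pass_number - 1 if pass_number == 4 else pass_number
    return "/m,e=", f"UE_Macro_{macro_number:02d}.mac"


def build_arguments(project_path, text_file_name, macro_folder, pass_number):
    """Build the argument string the same way the VBA code concatenates it."""
    switch, macro_file = macro_for_pass(pass_number)
    return f' /fni {project_path}\\{text_file_name}.txt {switch}"{macro_folder}{macro_file}"'


if __name__ == "__main__":
    # Example values are placeholders, not paths from the original post.
    for n in range(1, 5):
        print(build_arguments(r"C:\Proj", "report", "C:\\Docs\\", n))
```

Written this way, the run order is clear: UE_Macro_01, UE_Macro_02, UE_Spaces, then UE_Macro_03. As in the VBA code, the text file path is not quoted, so a project path containing spaces would likely need quoting before being passed to UltraEdit.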

Joe Winograd, Developer
Fellow 2017
Most Valuable Expert 2018

Commented:
You're welcome, Frank, I'm glad you found a method that works well for you. Regards, Joe
