gdunn59
asked on
Access VBA Code to Loop Through Directory and Open PDF Files One at a Time and Scrape Data from Fields
I have some PDF files that are saved to a particular directory. I need to loop through the directory and open the PDFs one at a time, scrape the data from the fields in the PDF, then close that PDF, move to a "Completed" directory, and then open the next one and do the same thing, until all PDFs in the directory have been opened, scraped and moved.
How can this be accomplished?
Thanks,
gdunn59
Here are the two directories:
Directory where all the PDFs are that need to loop through: L:\SharedData\SYSTEMS\_tem p_
Directory where once the PDF has been opened and the data scraped, need to move here: L:\SharedData\SYSTEMS\_tem p_\complet ed
How can this be accomplished?
Thanks,
gdunn59
Here are the two directories:
Directory where all the PDFs are that need to loop through: L:\SharedData\SYSTEMS\_tem
Directory where once the PDF has been opened and the data scraped, need to move here: L:\SharedData\SYSTEMS\_tem
I don't think you can do this directly with Access. PDF is not a file type that Access can read. I think you are going to need the full version of Adobe (on every PC that needs to use this code) to give you libraries that can pull out data from the pdf. The looping through the directory is done using the FSO (file system object) and there are lots of samples available although I can post one if you need it.
ASKER
PatHartman,
We do have the full version of Adobe.
Yes, can you post the loop through directory code for me.
Thanks,
gdunn59
We do have the full version of Adobe.
Yes, can you post the loop through directory code for me.
Thanks,
gdunn59
If you have office you can instance WORD to read the PDF , read data from it and do whatever you want with it, I played a while ago with something similar in this thread: VBA to convert PDF to EXCEL without using external software?
ASKER
I have this code, but I get a runtime error 91 on Line 61. I get the error because the original PDF in the email attachment is disabled. So prior to running the code, I do a File SaveAS and save the PDF with the "Enable All Features" turned off . I want to just be able to go directly to the directory where the PDFs are located that I did a SaveAS with the "Enable All Features" turned off, open one at a time, and then run this code.
I've attached a screen shot showing the "Enable All Features" at the top of the PDF. If I click on this it goes away and then I resave the PDF without this feature on, and the code is then able to scrape the data. Right now the way the code is, it wants to open the PDF with the "Enable All Features" turned on and it fails because it can't access any of the fields in the PDF.
H--MS-Access-Development-Projects-Sy.jpg
Function GetFileText(strFilePath As String) As String
'IAC objects
Dim gPdDoc As Acrobat.CAcroPDDoc
Dim gAvDoc As Acrobat.CAcroAVDoc
'variables
'Dim gPDFPath As String
Dim sName As String
Dim lNum As Integer
'Initialize Acrobat by creating App object
Set gApp = CreateObject("AcroExch.App")
'Set AVDoc object
'gAvDoc = gApp.GetActiveDoc
Set gAvDoc = CreateObject("AcroExch.AVDoc")
' open the PDF
If gAvDoc.Open(strFilePath, "") Then
gApp.Show
'Set PDDoc object and get some data
Set gPdDoc = gAvDoc.GetPDDoc
lNum = gPdDoc.GetNumPages
sName = gPdDoc.GetFileName
GetFileText = Acro_GetPageText(strFilePath, gPdDoc, gAvDoc, gApp)
End If
End Function
Function Acro_GetPageText(ByRef strFilePath As String, gPdDoc As Acrobat.CAcroPDDoc, gAvDoc As Acrobat.AcroAVDoc, gApp As AcroApp, Optional nPage As Long = -1) As String
Dim r As New Acrobat.AcroRect
Dim pdPage As Acrobat.CAcroPDPage
Dim ts As Acrobat.CAcroPDTextSelect
Dim pPoint As Acrobat.CAcroPoint
Dim nTextLoop As Long
Dim nPageLoop As Long
Dim strOut As String
Dim nPageStart As Long
Dim nPageEnd As Long
If nPage = -1 Then 'all pages
nPageStart = 0
nPageEnd = gPdDoc.GetNumPages - 1
Else
nPageStart = nPage
nPageEnd = nPage
End If
For nPageLoop = nPageStart To nPageEnd
Set pdPage = gPdDoc.AcquirePage(nPageLoop)
Set pPoint = pdPage.GetSize
r.Left = 0
r.Top = pPoint.y
r.Bottom = 0
r.Right = pPoint.x
Set ts = gPdDoc.CreateTextSelect(nPageLoop, r)
For nTextLoop = 0 To ts.GetNumText - 1
strOut = strOut & ts.GetText(nTextLoop)
Next
strOut = strOut & vbCrLf '!NOTE: we are putting a crlf after each page to simulate clipboard copy and paste
ts.Destroy
Next
If Len(strOut) > 3 Then strOut = Left(strOut, Len(strOut) - 2) 'ditch extr crlf from between pages
Acro_GetPageText = strOut
ErrHandler:
Set gPdDoc = Nothing
Set gAvDoc = Nothing
Set pdPage = Nothing
Set pPoint = Nothing
Set ts = Nothing
Set gApp = Nothing
Call ErrorHandler(err, "Acro_GetPageText")
End Function
I've attached a screen shot showing the "Enable All Features" at the top of the PDF. If I click on this it goes away and then I resave the PDF without this feature on, and the code is then able to scrape the data. Right now the way the code is, it wants to open the PDF with the "Enable All Features" turned on and it fails because it can't access any of the fields in the PDF.
H--MS-Access-Development-Projects-Sy.jpg
you can disable that option in your acrobat preferences
Choose Preferences.
From the Categories on the left, select Security (Enhanced).
Uncheck Enhanced Security option
Choose Preferences.
From the Categories on the left, select Security (Enhanced).
Uncheck Enhanced Security option
ASKER
I didn't write this code, so I apologize, but here is apparently the code that is actually doing the scrape of the data:
Function AddEmailInspection(rstInspections As ADODB.Recordset, rstExceptions As ADODB.Recordset, mi As Outlook.MailItem, Optional strManualPdfPath As String = "") As Long
' On Error GoTo ErrHandler
Dim strFilePath As String
Dim strFileName As String
Dim strDestFilePath As String
Dim strDestFileName As String
Dim nCADCount As Long
Dim strText As String
Dim strTemp As String
Dim cie As CIError
Dim strSQL As String
Dim strDocNumber As String
Dim strInspDate As String
Dim strOvInspection As String
Dim strInspectorName As String
Dim strLocId As String
Dim strFacId As String
Dim strFrom As String
Dim nPDFTextRetryCount As Long
Dim nPDFTextRetryMax As Long
cie = None
If strManualPdfPath <> "" Then 'manual addition
Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Begin Manual Process. User=" & GetUName)
Else
Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Begin Mail Item Process. Subject=" & mi.Subject)
End If
If strManualPdfPath = "" Then '--------------------------------------> Email Addition
'check the from
strFrom = LCase(Trim(mi.SenderEmailAddress))
' If strFrom <> "dnr_eforms.helpdesk@state.co.us" Then 'this email did not come from expected source
' If InStr(1, strFrom, "state.co.us") < 1 Then 'didnt come from the state
' cie = FromPersonOther
' Else 'came from the state
' cie = FromPersonState
' End If
' GoTo ExceptionCheck 'do NOT check anything else this will override anything else
' End If
' If strFrom <> "dnr_eforms.helpdesk@state.co.us" Then 'this email did not come from expected source
If InStr(1, strFrom, "state.co.us") < 1 Then 'didnt come from the state
cie = FromPersonOther
GoTo ExceptionCheck 'do NOT check anything else this will override anything else
' Else 'came from the state
' cie = FromPersonState
End If
' GoTo ExceptionCheck 'do NOT check anything else this will override anything else
' End If
'Check the subject
strTemp = mi.Subject
If InStr(1, UCase(strTemp), "INSPECTION REPORT [") < 1 Then cie = InvalidSubject
'Check for number of attachments ; there should only be 1
If mi.attachments.Count <> 1 Then cie = InvalidAttachmentCount
If mi.attachments.Count = 1 Then
'Check for the attachment filetype
strTemp = mi.attachments.Item(1).FileName
If UCase(Right(strTemp, 4)) <> ".PDF" Then cie = InvalidAttachmentType
End If
ExceptionCheck:
If cie <> None Then 'there was an error, log it, move it, and jump out
Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Exception Error (" & CIError_ToString(cie) & ") Subject=" & mi.Subject)
rstExceptions.AddNew
rstExceptions.Fields("MAIL_ID") = mi.EntryID
rstExceptions.Fields("MAIL_DATE") = mi.CreationTime
rstExceptions.Fields("MAIL_SUBJECT") = mi.Subject
rstExceptions.Fields("MAIL_FLAG") = cie
If cie = FromPersonState Then
rstExceptions.Fields("MAIL_ACTION") = "Moved to From CO State Folder (" & CIError_ToString(cie) & "[" & strFrom & "])"
ElseIf cie = FromPersonOther Then
rstExceptions.Fields("MAIL_ACTION") = "Moved to From Other Folder (" & CIError_ToString(cie) & "[" & strFrom & "])"
Else
rstExceptions.Fields("MAIL_ACTION") = "Moved to Exceptions Folder (" & CIError_ToString(cie) & ")"
End If
rstExceptions.Update
mi.Categories = "Red Category"
If cie = FromPersonState Then
' mi.Move fldFromState ' commented this out on 1-5-2017 because it was moving the emails to the From State email Folder and it shouldn't be
mi.Move fldOUT ' this is moving the emails from the InBox to the SYS_INSP_D0_NOT_TOUCH email Folder
ElseIf cie = FromPersonOther Then
mi.Move fldFromOther
Else
mi.Move fldERR
End If
GoTo ErrHandler
End If
End If 'end manual addition check '<-------------------------------------- Email Addition
'Assure that temp folder exists and is clear it out...we only gonna do these one at a time
If Dir(GetTempFolderDir, vbDirectory) = "" Then MkDir GetTempFolderDir
AddEmailInspection = KillFolderFiles(GetTempFolderDir)
If AddEmailInspection <> 0 Then Exit Function
If strManualPdfPath = "" Then 'Email Addition
'gather file information
strFilePath = GetTempFolderDir & mi.attachments.Item(1).FileName
strFileName = mi.attachments.Item(1).FileName
'save the attachment
Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Saving PDF(" & mi.attachments.Item(1).FileName & ") and extracting data. Subject=" & mi.Subject)
mi.attachments.Item(1).SaveAsFile strFilePath
Else 'Manual Addition
strFileName = Right(strManualPdfPath, Len(strManualPdfPath) - InStrRev(strManualPdfPath, "\"))
strFilePath = GetTempFolderDir & strFileName
Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Saving Manual PDF(" & strFileName & ") and extracting data. User=" & GetUName)
CopyFile strManualPdfPath, strFilePath, False
End If
'begin pdf data scrape and get api and anorm data first
nPDFTextRetryCount = 1
nPDFTextRetryMax = 10
RetryPDFText:
'looks like we have to assure that the pdf is fully saved or will get blank text on getpdftext
' Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Waiting for File Save (5 Seconds)")
' Call TempTimer(5)
Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Attempting PDF Text Save...")
strText = GetFileText(strFilePath)
If strText = "" Then
' err.Raise -666, , "No PDF File Text Available; Please Try Again."
Call AppendToLogFile("SYSTEM", "AddEmailInspection", "PDF Text Fail: Try " & nPDFTextRetryCount & " of " & nPDFTextRetryMax)
nPDFTextRetryCount = nPDFTextRetryCount + 1
If nPDFTextRetryCount > nPDFTextRetryMax Then
err.Raise -666, , "No PDF File Text Available; Please Try Again."
End If
Call TempTimer(1)
GoTo RetryPDFText
End If
'scrape the other pdf values
strDocNumber = GetDocumentNumber(strText)
strInspDate = GetInspectionDate(strText)
strOvInspection = GetOverallInspection(strText)
strInspectorName = GetInspectorName(strText)
strLocId = GetLocationId(strText)
strFacId = GetFacilityId(strText)
' MsgBox strDocNumber & vbCrLf & strInspDate & vbCrLf & strOvInspection & vbCrLf & strLocId & vbCrLf & strFacId & vbCrLf
Call GatherDocumentWells(strDocNumber, strText) 'NOTE: we are not checking for errors!!!!
AddEmailInspection = GetSysDateCAD(strDocNumber, CDate(strInspDate), strText)
If AddEmailInspection <> 0 Then Exit Function
'check for doc# existence in the current configuration of 3 tables
cie = None 'assure flagged as ok to add unless something changes it
'Check for docnumber existence in INSPECTIONS
If GetRecordCount("SELECT * FROM INSPECTIONS WHERE DOCUMENT_ID='" & strDocNumber & "'") > 0 Then
cie = DuplicateItem
End If
'It was a Valid Inspection
If cie = None Then
If strManualPdfPath = "" Then 'Email Addition
Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Adding data to DB Table Subject=" & mi.Subject)
mi.Categories = "Green Category"
Else 'Manual Addition
Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Adding data to DB Table User=" & GetUName)
End If
strDestFilePath = AssureSaveFileName(InspectionsPdfFolder, strFileName)
strDestFileName = Right(strDestFilePath, Len(strDestFilePath) - InStrRev(strDestFilePath, "\"))
nCADCount = GetRecordCount("TEMP_FILE_DATES")
If strManualPdfPath = "" Then 'Email Addtion (can not use mi since getting object not set error for unknown reason...heh)
'add the record to inspections table
rstInspections.AddNew
rstInspections.Fields("MAIL_ID") = mi.EntryID
rstInspections.Fields("MAIL_DATE") = mi.CreationTime
rstInspections.Fields("MAIL_SUBJECT") = mi.Subject
rstInspections.Fields("MAIL_PDF_NAME") = strDestFileName
rstInspections.Fields("DOCUMENT_ID") = strDocNumber
rstInspections.Fields("INSPECTION_DATE") = IIf(strInspDate = "0", Null, CDate(strInspDate))
rstInspections.Fields("INSPECTION_RATING") = IIf(strOvInspection = "", Null, strOvInspection)
rstInspections.Fields("COGCC_INSPECTOR") = strInspectorName
rstInspections.Fields("FACILITY_ID") = strFacId
rstInspections.Fields("LOCATION_ID") = strLocId
rstInspections.Fields("FOLLOWUP_REQUIRED") = IIf(strOvInspection = 0, 0, -1)
rstInspections.Fields("CREATED_BY") = "SYS"
rstInspections.Fields("IS_RESOLVED") = IIf(nCADCount = 0, -1, 0) 'no cads then its resolved
Else 'Manual Addition
'add the record to inspections table
rstInspections.AddNew
rstInspections.Fields("MAIL_ID") = Null
rstInspections.Fields("MAIL_DATE") = Null
rstInspections.Fields("MAIL_SUBJECT") = Null
rstInspections.Fields("MAIL_PDF_NAME") = strDestFileName
rstInspections.Fields("DOCUMENT_ID") = strDocNumber
rstInspections.Fields("INSPECTION_DATE") = IIf(strInspDate = "0", Null, CDate(strInspDate))
rstInspections.Fields("INSPECTION_RATING") = IIf(strOvInspection = "", Null, strOvInspection)
rstInspections.Fields("COGCC_INSPECTOR") = strInspectorName
rstInspections.Fields("FACILITY_ID") = strFacId
rstInspections.Fields("LOCATION_ID") = strLocId
rstInspections.Fields("FOLLOWUP_REQUIRED") = IIf(strOvInspection = 0, 0, -1)
rstInspections.Fields("CREATED_BY") = GetUName
rstInspections.Fields("IS_RESOLVED") = IIf(nCADCount = 0, -1, 0) 'no cads then its resolved
End If
rstInspections.Update
If strManualPdfPath = "" Then 'Email Addition
Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Moving Inspection Subject=" & mi.Subject) 'do a from and to?
mi.Move fldOUT
End If
CopyFile strFilePath, strDestFilePath, 0
DoCmd.SetWarnings False
'append the wells data that we gathered before
DoCmd.RunSQL "INSERT INTO INSPECTION_WELLS SELECT TEMP_INSPECTION_WELLS.* FROM TEMP_INSPECTION_WELLS;"
'It was a Dupe
Else
If strManualPdfPath = "" Then 'Email Addition
Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Disregarding Duplicate Doc Number(" & strDocNumber & ") Subject=" & mi.Subject)
mi.Categories = "Yellow Category"
rstExceptions.AddNew
rstExceptions.Fields("MAIL_ID") = mi.EntryID
rstExceptions.Fields("MAIL_DATE") = mi.CreationTime
rstExceptions.Fields("MAIL_SUBJECT") = mi.Subject
rstExceptions.Fields("DOCUMENT_ID") = strDocNumber
rstExceptions.Fields("MAIL_FLAG") = cie
rstExceptions.Fields("MAIL_ACTION") = "Moved to Duplicates Folder (" & CIError_ToString(cie) & ")"
rstExceptions.Update
Call AppendToLogFile("SYSTEM", "ProcessNewEmail", "Moving Duplicate Subject=" & mi.Subject) 'do a from and to?
mi.Move fldDUPE
Else 'Manual Addition
rstExceptions.AddNew
rstExceptions.Fields("MAIL_ID").Value = strFileName
rstExceptions.Fields("DOCUMENT_ID").Value = "BSDOCNO"
rstExceptions.Fields("MAIL_FLAG").Value = cie
rstExceptions.Fields("MAIL_ACTION").Value = "None - Manual Addition (" & CIError_ToString(cie) & ")"
rstExceptions.Update
Call AppendToLogFile("SYSTEM", "ProcessNewEmail", "Ignoring Duplicate User=" & GetUName) 'do a from and to?
End If
End If
If strManualPdfPath = "" Then 'Email Addition
Call AppendToLogFile("SYSTEM", "AddEmailInspection", "End Mail Item Process. Subject=" & mi.Subject)
Else 'Manual Addition
Call AppendToLogFile("SYSTEM", "AddEmailInspection", "End Manual Process. User=" & GetUName)
End If
ErrHandler:
DoCmd.SetWarnings True 'NOTE: silent error, but returning to caller
AddEmailInspection = ErrorHandler(err, "AddEmailInspection")
End Function
ASKER
More code.
Function AddInspFromPDF() As Long
On Error GoTo ErrHandler
Dim strPath As String
Dim rstInspections As ADODB.Recordset
Dim rstExceptions As ADODB.Recordset
Dim mi As Outlook.MailItem
Dim strAction As String
If GetRstStringValue("GLOBAL_SYS", "STATUS", "NULL") <> "NULL" Then
err.Raise -777, , "The System is Reporting that it is Currently doing a Mail Run. Please try again when the System is Done."
Exit Function
End If
strPath = ShowCdOpen(, , , , "*.pdf|*pdf")
If strPath = "" Then Exit Function
Set rstInspections = New ADODB.Recordset
Set rstExceptions = New ADODB.Recordset
'clear the temp run table
' DoCmd.SetWarnings False
' DoCmd.RunSQL "DELETE * FROM INSPECTIONS_EXP"
' DoCmd.SetWarnings True
'create the recordset object to use for this run
rstInspections.CursorLocation = adUseClient
rstInspections.CursorType = adOpenForwardOnly
rstInspections.LockType = adLockOptimistic
rstInspections.Open "INSPECTIONS", CurrentProject.Connection
rstExceptions.CursorLocation = adUseClient
rstExceptions.CursorType = adOpenForwardOnly
rstExceptions.LockType = adLockOptimistic
rstExceptions.Open "INSPECTIONS_EXP", CurrentProject.Connection
AddInspFromPDF = AddEmailInspection(rstInspections, rstExceptions, mi, strPath) 'process the current item
If AddInspFromPDF <> 0 Then
rstInspections.Close
rstExceptions.Close
Set rstInspections = Nothing
Set rstExceptions = Nothing
Set mi = Nothing
DoCmd.SetWarnings True
Exit Function
End If
rstInspections.Close
rstExceptions.Close
strAction = GetRstStringValue("INSPECTIONS_EXP", "MAIL_ACTION", "")
If strAction <> "" Then
MsgBox "This document# was not added; Action Descr: " & vbCrLf & vbCrLf & strAction, vbExclamation, "Uh oh..."
Else
DoCmd.OpenForm "frm_Inspection", , , , , , "frm_Inspections" & ";" & GetRstStringValue("SELECT DOCUMENT_ID FROM INSPECTIONS WHERE CREATED_BY = getuname() ORDER BY CREATED_DATE DESC;", "DOCUMENT_ID")
DoCmd.Close acForm, "frm_Main"
End If
ErrHandler:
Set rstInspections = Nothing
Set rstExceptions = Nothing
Set mi = Nothing
DoCmd.SetWarnings True
AddInspFromPDF = ErrorHandler(err, "AddInspFromPDF")
End Function
ASKER
Arana,
I wish it was that simply. The company has restricted users from changing this option.
Thanks,
gdunn59
I wish it was that simply. The company has restricted users from changing this option.
Thanks,
gdunn59
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.