Link to home
Start Free TrialLog in
Avatar of gdunn59
gdunn59

asked on

Access VBA Code to Loop Through Directory and Open PDF Files One at a Time and Scrape Data from Fields

I have some PDF files that are saved to a particular directory.  I need to loop through the directory and open the PDFs one at a time, scrape the data from the fields in the PDF, then close that PDF, move to a "Completed" directory, and then open the next one and do the same thing, until all PDFs in the directory have been opened, scraped and moved.

How can this be accomplished?

Thanks,
gdunn59

Here are the two directories:

Directory where all the PDFs are that need to loop through:   L:\SharedData\SYSTEMS\_temp_

Directory where once the PDF has been opened and the data scraped, need to move here:  L:\SharedData\SYSTEMS\_temp_\completed
Avatar of PatHartman
PatHartman
Flag of United States of America image

I don't think you can do this directly with Access.  PDF is not a file type that Access can read.  I think you are going to need the full version of Adobe (on every PC that needs to use this code) to give you libraries that can pull out data from the pdf.  The looping through the directory is done using the FSO (file system object) and there are lots of samples available although I can post one if you need it.
Avatar of gdunn59
gdunn59

ASKER

PatHartman,

We do have the full version of Adobe.

Yes, can you post the loop through directory code for me.

Thanks,
gdunn59
If you have office you can instance WORD to read the PDF , read data from it and do whatever you want with it, I played a while ago with something similar in this thread: VBA to convert PDF to EXCEL without using external software?
Avatar of gdunn59

ASKER

I have this code, but I get a runtime error 91 on Line 61. I get the error because the original PDF in the email attachment is disabled.  So prior to running the code, I do a File SaveAS and save the PDF with the "Enable All Features" turned off .  I want to just be able to go directly to the directory where the PDFs are located that I did a SaveAS with the "Enable All Features" turned off, open one at a time, and then run this code.

Function GetFileText(strFilePath As String) As String
'IAC objects
Dim gPdDoc As Acrobat.CAcroPDDoc
Dim gAvDoc As Acrobat.CAcroAVDoc

'variables
'Dim gPDFPath As String
Dim sName As String
Dim lNum As Integer

'Initialize Acrobat by creating App object
Set gApp = CreateObject("AcroExch.App")

'Set AVDoc object
'gAvDoc = gApp.GetActiveDoc
Set gAvDoc = CreateObject("AcroExch.AVDoc")

' open the PDF
If gAvDoc.Open(strFilePath, "") Then
    gApp.Show
    'Set PDDoc object and get some data
    Set gPdDoc = gAvDoc.GetPDDoc
    lNum = gPdDoc.GetNumPages
    sName = gPdDoc.GetFileName
    GetFileText = Acro_GetPageText(strFilePath, gPdDoc, gAvDoc, gApp)
End If

End Function

Function Acro_GetPageText(ByRef strFilePath As String, gPdDoc As Acrobat.CAcroPDDoc, gAvDoc As Acrobat.AcroAVDoc, gApp As AcroApp, Optional nPage As Long = -1) As String
  Dim r As New Acrobat.AcroRect
  Dim pdPage As Acrobat.CAcroPDPage
  Dim ts As Acrobat.CAcroPDTextSelect
  Dim pPoint As Acrobat.CAcroPoint
  Dim nTextLoop As Long
  Dim nPageLoop As Long
  Dim strOut As String
  Dim nPageStart As Long
  Dim nPageEnd As Long
  
  If nPage = -1 Then 'all pages
    nPageStart = 0
    nPageEnd = gPdDoc.GetNumPages - 1
  Else
    nPageStart = nPage
    nPageEnd = nPage
  End If

  For nPageLoop = nPageStart To nPageEnd

    Set pdPage = gPdDoc.AcquirePage(nPageLoop)
    Set pPoint = pdPage.GetSize

    r.Left = 0
    r.Top = pPoint.y
    r.Bottom = 0
    r.Right = pPoint.x

    Set ts = gPdDoc.CreateTextSelect(nPageLoop, r)
        
    For nTextLoop = 0 To ts.GetNumText - 1
      strOut = strOut & ts.GetText(nTextLoop)
    Next

    strOut = strOut & vbCrLf '!NOTE: we are putting a crlf after each page to simulate clipboard copy and paste
    ts.Destroy
  Next

  If Len(strOut) > 3 Then strOut = Left(strOut, Len(strOut) - 2) 'ditch extr crlf from between pages

  Acro_GetPageText = strOut
  
ErrHandler:
  Set gPdDoc = Nothing
  Set gAvDoc = Nothing
  Set pdPage = Nothing
  Set pPoint = Nothing
  Set ts = Nothing
  Set gApp = Nothing
  Call ErrorHandler(err, "Acro_GetPageText")
  
End Function

Open in new window


I've attached a screen shot showing the "Enable All Features" at the top of the PDF.  If I click on this it goes away and then I resave the PDF without this feature on, and the code is then able to scrape the data.  Right now the way the code is, it wants to open the PDF with the "Enable All Features" turned on and it fails because it can't access any of the fields in the PDF.
H--MS-Access-Development-Projects-Sy.jpg
you can disable that option in your acrobat preferences
Choose Preferences.
From the Categories on the left, select Security (Enhanced).
Uncheck Enhanced Security option
Avatar of gdunn59

ASKER

I didn't write this code, so I apologize, but here is apparently the code that is actually doing the scrape of the data:

Function AddEmailInspection(rstInspections As ADODB.Recordset, rstExceptions As ADODB.Recordset, mi As Outlook.MailItem, Optional strManualPdfPath As String = "") As Long
'  On Error GoTo ErrHandler
  Dim strFilePath As String
  Dim strFileName As String
  Dim strDestFilePath As String
  Dim strDestFileName As String
  Dim nCADCount As Long
  Dim strText As String
  Dim strTemp As String
  Dim cie As CIError
  Dim strSQL As String

  Dim strDocNumber As String
  Dim strInspDate As String
  Dim strOvInspection As String
  Dim strInspectorName As String
  Dim strLocId As String
  Dim strFacId As String
  Dim strFrom As String

  Dim nPDFTextRetryCount As Long
  Dim nPDFTextRetryMax As Long
  

  cie = None
  
  If strManualPdfPath <> "" Then 'manual addition
    Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Begin Manual Process. User=" & GetUName)
  Else
    Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Begin Mail Item Process. Subject=" & mi.Subject)
  End If
  
  If strManualPdfPath = "" Then '--------------------------------------> Email Addition
    'check the from
    strFrom = LCase(Trim(mi.SenderEmailAddress))
'    If strFrom <> "dnr_eforms.helpdesk@state.co.us" Then 'this email did not come from expected source
'      If InStr(1, strFrom, "state.co.us") < 1 Then 'didnt come from the state
'        cie = FromPersonOther
'      Else 'came from the state
'        cie = FromPersonState
'      End If
'      GoTo ExceptionCheck 'do NOT check anything else this will override anything else
'    End If

'    If strFrom <> "dnr_eforms.helpdesk@state.co.us" Then 'this email did not come from expected source
      If InStr(1, strFrom, "state.co.us") < 1 Then 'didnt come from the state
        cie = FromPersonOther
        GoTo ExceptionCheck 'do NOT check anything else this will override anything else
'      Else 'came from the state
'        cie = FromPersonState
      End If
'      GoTo ExceptionCheck 'do NOT check anything else this will override anything else
'    End If

  
    'Check the subject
    strTemp = mi.Subject
    If InStr(1, UCase(strTemp), "INSPECTION REPORT [") < 1 Then cie = InvalidSubject

    'Check for number of attachments ; there should only be 1
    If mi.attachments.Count <> 1 Then cie = InvalidAttachmentCount
  
    If mi.attachments.Count = 1 Then
      'Check for the attachment filetype
      strTemp = mi.attachments.Item(1).FileName
      If UCase(Right(strTemp, 4)) <> ".PDF" Then cie = InvalidAttachmentType
    End If
  
ExceptionCheck:
    If cie <> None Then  'there was an error, log it, move it, and jump out
      Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Exception Error (" & CIError_ToString(cie) & ") Subject=" & mi.Subject)
    
      rstExceptions.AddNew
      rstExceptions.Fields("MAIL_ID") = mi.EntryID
      rstExceptions.Fields("MAIL_DATE") = mi.CreationTime
      rstExceptions.Fields("MAIL_SUBJECT") = mi.Subject
      rstExceptions.Fields("MAIL_FLAG") = cie
    
      If cie = FromPersonState Then
        rstExceptions.Fields("MAIL_ACTION") = "Moved to From CO State Folder (" & CIError_ToString(cie) & "[" & strFrom & "])"
      ElseIf cie = FromPersonOther Then
        rstExceptions.Fields("MAIL_ACTION") = "Moved to From Other Folder (" & CIError_ToString(cie) & "[" & strFrom & "])"
      Else
        rstExceptions.Fields("MAIL_ACTION") = "Moved to Exceptions Folder (" & CIError_ToString(cie) & ")"
      End If
    
      rstExceptions.Update

      mi.Categories = "Red Category"
    
      If cie = FromPersonState Then
'        mi.Move fldFromState  ' commented this out on 1-5-2017 because it was moving the emails to the From State email Folder and it shouldn't be
        mi.Move fldOUT   ' this is moving the emails from the InBox to the SYS_INSP_D0_NOT_TOUCH email Folder
      ElseIf cie = FromPersonOther Then
        mi.Move fldFromOther
      Else
        mi.Move fldERR
      End If
    
      GoTo ErrHandler
    End If
  
  End If 'end manual addition check '<-------------------------------------- Email Addition
  
  'Assure that temp folder exists and is clear it out...we only gonna do these one at a time
  If Dir(GetTempFolderDir, vbDirectory) = "" Then MkDir GetTempFolderDir
   AddEmailInspection = KillFolderFiles(GetTempFolderDir)
  If AddEmailInspection <> 0 Then Exit Function
  
  If strManualPdfPath = "" Then 'Email Addition
  
    'gather file information
    strFilePath = GetTempFolderDir & mi.attachments.Item(1).FileName
    strFileName = mi.attachments.Item(1).FileName
  
    'save the attachment
    Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Saving PDF(" & mi.attachments.Item(1).FileName & ") and extracting data. Subject=" & mi.Subject)
    mi.attachments.Item(1).SaveAsFile strFilePath
  Else 'Manual Addition
  
    strFileName = Right(strManualPdfPath, Len(strManualPdfPath) - InStrRev(strManualPdfPath, "\"))
    strFilePath = GetTempFolderDir & strFileName
    Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Saving Manual PDF(" & strFileName & ") and extracting data. User=" & GetUName)
    CopyFile strManualPdfPath, strFilePath, False
  End If

  'begin pdf data scrape and get api and anorm data first


  nPDFTextRetryCount = 1
  nPDFTextRetryMax = 10
RetryPDFText:
  
  'looks like we have to assure that the pdf is fully saved or will get blank text on getpdftext
'  Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Waiting for File Save (5 Seconds)")
 ' Call TempTimer(5)
 
  Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Attempting PDF Text Save...")
  strText = GetFileText(strFilePath)
  If strText = "" Then 
   ' err.Raise -666, , "No PDF File Text Available; Please Try Again."
   
     Call AppendToLogFile("SYSTEM", "AddEmailInspection", "PDF Text Fail: Try " & nPDFTextRetryCount & " of " & nPDFTextRetryMax)
     nPDFTextRetryCount = nPDFTextRetryCount + 1
     If nPDFTextRetryCount > nPDFTextRetryMax Then
       err.Raise -666, , "No PDF File Text Available; Please Try Again."
     End If
     Call TempTimer(1)
     GoTo RetryPDFText
   
  End If

  'scrape the other pdf values
  strDocNumber = GetDocumentNumber(strText)
  strInspDate = GetInspectionDate(strText)
  strOvInspection = GetOverallInspection(strText)
  strInspectorName = GetInspectorName(strText)
  strLocId = GetLocationId(strText)
  strFacId = GetFacilityId(strText)
  
  
 ' MsgBox strDocNumber & vbCrLf & strInspDate & vbCrLf & strOvInspection & vbCrLf & strLocId & vbCrLf & strFacId & vbCrLf
  
  
 
  Call GatherDocumentWells(strDocNumber, strText)  'NOTE: we are not checking for errors!!!!

  AddEmailInspection = GetSysDateCAD(strDocNumber, CDate(strInspDate), strText)
  If AddEmailInspection <> 0 Then Exit Function

  
  'check for doc# existence in the current configuration of 3 tables
  
  cie = None  'assure flagged as ok to add unless something changes it
  'Check for docnumber existence in INSPECTIONS
  If GetRecordCount("SELECT * FROM INSPECTIONS WHERE DOCUMENT_ID='" & strDocNumber & "'") > 0 Then
  
    cie = DuplicateItem
  End If
  
  'It was a Valid Inspection
  If cie = None Then
  
    If strManualPdfPath = "" Then 'Email Addition
      Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Adding data to DB Table Subject=" & mi.Subject)
      mi.Categories = "Green Category"
    Else 'Manual Addition
      Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Adding data to DB Table User=" & GetUName)
    End If
    
    strDestFilePath = AssureSaveFileName(InspectionsPdfFolder, strFileName)
    strDestFileName = Right(strDestFilePath, Len(strDestFilePath) - InStrRev(strDestFilePath, "\"))
  
    
    nCADCount = GetRecordCount("TEMP_FILE_DATES")
    If strManualPdfPath = "" Then 'Email Addtion (can not use mi since getting object not set error for unknown reason...heh)
      'add the record to inspections table
      rstInspections.AddNew
      rstInspections.Fields("MAIL_ID") = mi.EntryID
      rstInspections.Fields("MAIL_DATE") = mi.CreationTime
      rstInspections.Fields("MAIL_SUBJECT") = mi.Subject
      rstInspections.Fields("MAIL_PDF_NAME") = strDestFileName
      rstInspections.Fields("DOCUMENT_ID") = strDocNumber
      rstInspections.Fields("INSPECTION_DATE") = IIf(strInspDate = "0", Null, CDate(strInspDate))
      rstInspections.Fields("INSPECTION_RATING") = IIf(strOvInspection = "", Null, strOvInspection)
      rstInspections.Fields("COGCC_INSPECTOR") = strInspectorName
      rstInspections.Fields("FACILITY_ID") = strFacId
      rstInspections.Fields("LOCATION_ID") = strLocId
      rstInspections.Fields("FOLLOWUP_REQUIRED") = IIf(strOvInspection = 0, 0, -1)
      rstInspections.Fields("CREATED_BY") = "SYS"
      rstInspections.Fields("IS_RESOLVED") = IIf(nCADCount = 0, -1, 0) 'no cads then its resolved
    
    Else 'Manual Addition
      'add the record to inspections table
      rstInspections.AddNew
      rstInspections.Fields("MAIL_ID") = Null
      rstInspections.Fields("MAIL_DATE") = Null
      rstInspections.Fields("MAIL_SUBJECT") = Null
      rstInspections.Fields("MAIL_PDF_NAME") = strDestFileName
      rstInspections.Fields("DOCUMENT_ID") = strDocNumber
      rstInspections.Fields("INSPECTION_DATE") = IIf(strInspDate = "0", Null, CDate(strInspDate))
      rstInspections.Fields("INSPECTION_RATING") = IIf(strOvInspection = "", Null, strOvInspection)
      rstInspections.Fields("COGCC_INSPECTOR") = strInspectorName
      rstInspections.Fields("FACILITY_ID") = strFacId
      rstInspections.Fields("LOCATION_ID") = strLocId
      rstInspections.Fields("FOLLOWUP_REQUIRED") = IIf(strOvInspection = 0, 0, -1)
      rstInspections.Fields("CREATED_BY") = GetUName
      rstInspections.Fields("IS_RESOLVED") = IIf(nCADCount = 0, -1, 0) 'no cads then its resolved
    
    End If
    
    rstInspections.Update
    If strManualPdfPath = "" Then 'Email Addition
      Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Moving Inspection Subject=" & mi.Subject) 'do a from and to?
      mi.Move fldOUT
    End If
    
    CopyFile strFilePath, strDestFilePath, 0
    DoCmd.SetWarnings False
    
    'append the wells data that we gathered before
    DoCmd.RunSQL "INSERT INTO INSPECTION_WELLS SELECT TEMP_INSPECTION_WELLS.* FROM TEMP_INSPECTION_WELLS;"


  'It was a Dupe
  Else
  
    If strManualPdfPath = "" Then 'Email Addition
      Call AppendToLogFile("SYSTEM", "AddEmailInspection", "Disregarding Duplicate Doc Number(" & strDocNumber & ") Subject=" & mi.Subject)
      mi.Categories = "Yellow Category"
      rstExceptions.AddNew
      rstExceptions.Fields("MAIL_ID") = mi.EntryID
      rstExceptions.Fields("MAIL_DATE") = mi.CreationTime
      rstExceptions.Fields("MAIL_SUBJECT") = mi.Subject
      rstExceptions.Fields("DOCUMENT_ID") = strDocNumber
      rstExceptions.Fields("MAIL_FLAG") = cie
      rstExceptions.Fields("MAIL_ACTION") = "Moved to Duplicates Folder (" & CIError_ToString(cie) & ")"
      rstExceptions.Update

      Call AppendToLogFile("SYSTEM", "ProcessNewEmail", "Moving Duplicate Subject=" & mi.Subject) 'do a from and to?
      mi.Move fldDUPE
    
    Else 'Manual Addition
      rstExceptions.AddNew
      rstExceptions.Fields("MAIL_ID").Value = strFileName   
      rstExceptions.Fields("DOCUMENT_ID").Value = "BSDOCNO" 
      rstExceptions.Fields("MAIL_FLAG").Value = cie
      rstExceptions.Fields("MAIL_ACTION").Value = "None - Manual Addition (" & CIError_ToString(cie) & ")"
      rstExceptions.Update

      Call AppendToLogFile("SYSTEM", "ProcessNewEmail", "Ignoring Duplicate User=" & GetUName) 'do a from and to?
    
    End If
    
  End If

  If strManualPdfPath = "" Then 'Email Addition
    Call AppendToLogFile("SYSTEM", "AddEmailInspection", "End Mail Item Process. Subject=" & mi.Subject)
  Else 'Manual Addition
    Call AppendToLogFile("SYSTEM", "AddEmailInspection", "End Manual Process. User=" & GetUName)
  End If

ErrHandler:
  DoCmd.SetWarnings True 'NOTE: silent error, but returning to caller
  AddEmailInspection = ErrorHandler(err, "AddEmailInspection")

End Function

Open in new window

Avatar of gdunn59

ASKER

More code.

Function AddInspFromPDF() As Long
  On Error GoTo ErrHandler
  Dim strPath As String
  Dim rstInspections As ADODB.Recordset
  Dim rstExceptions As ADODB.Recordset
  Dim mi As Outlook.MailItem
  Dim strAction As String
  
  If GetRstStringValue("GLOBAL_SYS", "STATUS", "NULL") <> "NULL" Then
    err.Raise -777, , "The System is Reporting that it is Currently doing a Mail Run. Please try again when the System is Done."
    Exit Function
  End If
  
  strPath = ShowCdOpen(, , , , "*.pdf|*pdf")
  If strPath = "" Then Exit Function
  
  Set rstInspections = New ADODB.Recordset
  Set rstExceptions = New ADODB.Recordset

 'clear the temp run table
'  DoCmd.SetWarnings False
'  DoCmd.RunSQL "DELETE * FROM INSPECTIONS_EXP"
'  DoCmd.SetWarnings True
  
  'create the recordset object to use for this run
  rstInspections.CursorLocation = adUseClient
  rstInspections.CursorType = adOpenForwardOnly
  rstInspections.LockType = adLockOptimistic
  rstInspections.Open "INSPECTIONS", CurrentProject.Connection
  
  rstExceptions.CursorLocation = adUseClient
  rstExceptions.CursorType = adOpenForwardOnly
  rstExceptions.LockType = adLockOptimistic
  rstExceptions.Open "INSPECTIONS_EXP", CurrentProject.Connection
  
  AddInspFromPDF = AddEmailInspection(rstInspections, rstExceptions, mi, strPath)  'process the current item
  If AddInspFromPDF <> 0 Then
    rstInspections.Close
    rstExceptions.Close
    Set rstInspections = Nothing
    Set rstExceptions = Nothing
    Set mi = Nothing
    DoCmd.SetWarnings True
    Exit Function
  End If
   
  rstInspections.Close
  rstExceptions.Close
  
  strAction = GetRstStringValue("INSPECTIONS_EXP", "MAIL_ACTION", "")
  If strAction <> "" Then
    MsgBox "This document# was not added; Action Descr: " & vbCrLf & vbCrLf & strAction, vbExclamation, "Uh oh..."
  Else
    DoCmd.OpenForm "frm_Inspection", , , , , , "frm_Inspections" & ";" & GetRstStringValue("SELECT DOCUMENT_ID FROM INSPECTIONS WHERE CREATED_BY = getuname() ORDER BY CREATED_DATE DESC;", "DOCUMENT_ID")
    DoCmd.Close acForm, "frm_Main"
  End If
  
ErrHandler:
  Set rstInspections = Nothing
  Set rstExceptions = Nothing
  Set mi = Nothing
  DoCmd.SetWarnings True
  AddInspFromPDF = ErrorHandler(err, "AddInspFromPDF")
End Function

Open in new window

Avatar of gdunn59

ASKER

Arana,

I wish it was that simply.  The company has restricted users from changing this option.

Thanks,
gdunn59
ASKER CERTIFIED SOLUTION
Avatar of PatHartman
PatHartman
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial