How do I use VBA to select first line of each page of a PDF document and create a bookmark?

Posted on 2011-02-23
Last Modified: 2013-11-10
I have a 1,500 page PDF document. I can open the PDF file with the script below from VBA (using Access 2007), see the document, and see that I"m highlighting the words 'Weekday','Route', and 'Block' in the document. I can also get the total number of pages in the document.

What I want to do is highlight the first line of text on each page in the document and create a bookmark for each page based on that first line of text, so at the end of this process I'll have 1,500 bookmarks. Any idea what object in Adobe can access a portion of a page (like the first line)?

Code that works is below. Try it yourself if you have Access and Adobe Pro, just substitute your document name for the Path and PDF file.

Function Add_Traincard_Bookmarks(PathAndPDF_File As String) As Boolean

Dim Exch As Object
Dim AVDocu As Object
Dim AVPageView As Object
Dim PDDocu As Object
Dim PDPage As Object
Dim PDText As Object

Dim numPages As Integer
Dim bFile As Boolean
Dim bShow As Boolean
Dim iPageNumber As Integer

Set Exch = CreateObject("AcroExch.App")
Set AVDocu = CreateObject("AcroExch.AVDoc")
Set PDDocu = CreateObject("AcroExch.PDDoc")

AVDocu.Open PathAndPDF_File, PathAndPDF_File

Debug.Print bShow
bShow = Exch.Show()
Debug.Print bShow

Set PDDocu = AVDocu.GetPDDoc
numPages = PDDocu.GetNumPages()
Debug.Print numPages

Set AVPageView = AVDocu.GetAVPageView

AVDocu.FindText "WEEKDAY", True, True, True

AVDocu.FindText "ROUTE", True, True, True
AVDocu.FindText "BLOCK", True, True, True

AVDocu.Close (0)


Set Exch = Nothing
Set PDDocu = Nothing
Set AVDocu = Nothing

End Function
Question by:AztecCyclocross
  • 3
  • 3
LVL 11

Expert Comment

ID: 34967124
have a look at the attached
Sub dasdf()
Dim Exch As Object
Dim AVDocu As Object
Dim AVPageView As Object
Dim PDDocu As Object
Dim PDPage As Object
Dim PDText As Object

Dim PDBookmark As Object

Dim numPages As Integer
Dim bFile As Boolean
Dim bShow As Boolean
Dim iPageNumber As Integer

Dim i As Long, j As Long

Set Exch = CreateObject("AcroExch.App")
Set AVDocu = CreateObject("AcroExch.AVDoc")
Set PDDocu = CreateObject("AcroExch.PDDoc")

AVDocu.Open PathAndPDF_File, PathAndPDF_File

Debug.Print bShow
bShow = Exch.Show()
Debug.Print bShow

Set PDDocu = AVDocu.GetPDDoc
numPages = PDDocu.GetNumPages()
Debug.Print numPages

Set AVPageView = AVDocu.GetAVPageView

Dim bookmarkstr As String

'AVDocu.FindText "WEEKDAY", True, True, True
'AVDocu.FindText "ROUTE", True, True, True
'AVDocu.FindText "BLOCK", True, True, True
Dim jso As Object
Set jso = PDDocu.GetJSObject

For i = 0 To numPages - 1
    AVDocu.GetAVPageView.GoTo i

    pageWordCount = jso.getPageNumWords(i)
    'for each word
    j = 0
    bookmarkstr = jso.getPageNthWord(i, j)
    j = 1
    Do While j < 10 Or j < pageWordCount - 1
        bookmarkstr = bookmarkstr & " " & jso.getPageNthWord(i, j)
        j = j + 1

    'Create BookMark Object
    Set PDBookmark = CreateObject("AcroExch.PDBookmark", "")
    'execute the menu item
    Exch.MenuItemExecute ("NewBookmark")
    'set bookmark title
    btitle = PDBookmark.GetByTitle(PDDocu, "Untitled")
    btitle = PDBookmark.SetTitle(bookmarkstr)

Next i

Exch.MenuItemExecute ("Save")
AVDocu.Close (0)


Set Exch = Nothing
Set PDDocu = Nothing
Set AVDocu = Nothing
End Sub

Open in new window


Author Comment

ID: 34981583
Thanks. The getPageNthWord worked well on my first test document. I was able to extract all the words contants of a 4 page version of my PDF document and read it into a database table in Access so I could figure out which words on page I want to use for bookmark. It seems the words I want to show are in a heaader and arent' the first words stored in the document, but I should be able to figure it out.

However, before I started writing the code to parse words and create bookmarks, I tried reading another PDF document, and found that the same code would not read a second document that was very similar and created by the same process as the first document. I was able to count the number of words, using the getPageNumWords but when I cycled through all i, j values, none returned any text strings with getPageNthWord method. I tested my code against the first document again, and was still able to read words, so I did more research.

I observed that on the first document I could search manaully for text in the document using the ctrl-F from within the application, but I coudn't do so on the second document. So I theorized that the second document was saving the words as graphics. So on the second document I ran the Document > OCR Text Recognition > Recoginize Text using OCR ... option from within the application. But I get a message "Acrobat could not perform recogntion (OCR) on the page because: this page contains renderable text."

So I found a link on the adobe website that tells me in Solution 2 how convert the document to a set of TIFF files, use OCR on the documents recombine these pages into a PDF file. This could work, but seems awfully convoluted. Solution 1 on this link says to "Obtain a version of the document that does not contain renderable (editable) text". I've talked to the developer of the PDF files and found that to him this process is a 'black box' and he doesn't think he can control whether PDF documents contain text or graphics. Any suggestions how to create a PDF file that is all text elements (so I don't have to do any OCR) or all graphics elements (so I can use the OCR on the PDF file directly without having to break my 1,500 page document into 1,500 tiff files then recombine) would be appreciated. Any ideas?

Or, maybe alternatively, is ther some other way to read the words from the second document that is different than the getPageNthWord method?

LVL 11

Expert Comment

ID: 34981959
AztecCyclocross, if you can't select the text or find the text, I don't think it would be possible without doing a OCR. and even scripting the OCR, it may not be always accurate.

how is the document being created? what is the source?
Independent Software Vendors: We Want Your Opinion

We value your feedback.

Take our survey and automatically be enter to win anyone of the following:
Yeti Cooler, Amazon eGift Card, and Movie eGift Card!


Author Comment

ID: 35013400

The PDF document is created via Business Objects Crystal Reports. I think version IX.

I've part-way written a program to implement Solution 2 in the following link I've been able to create a folder via VBA, then take a test file and break it into the tiff files and save in a sub folder.

Where I am now hung up is finding a way to execute the Adobe Acrobat Pro "Documents | OCR Text Recognition | Recognize Text in Multiple Files Using OCR ..." I can do this step manually using the interface, and it seems to give good results as it creates nice PDF files in my sub directory that can now be searched using cntl-F, but I'd like to do it directly from VBA using the application object or the jsobject or any other object, but I can't seem to find a reference on how to access this tool via a script of any sort.  Any ideas would be helpful. I'd like access the objects and pass them the file location where the tiff files are stored and the file location where I'd want to create the new searchable PDF files, and then the program could take over and do it's work on converting them.

Then I would need to recomine these PDF documents somehow using VBA code. I'm not sure how to do that yet either.

Or, if there is a simple switch somewhere in the Crystal Reports that will export these as 'searchable' PDF documents directly, that would save me the time of writing all of this code to break into Tiff files and then convert using OCR then recombine in one PDF so I can apply bookmarks. But neither I nor the Crystal Reports developer seem to know what that switch would be or if it even exists. Any thoughts on this would be helpful as well.

LVL 11

Expert Comment

ID: 35014159

have a look at

It doesn't look like it is possible, unless you use sendkeys.

Accepted Solution

AztecCyclocross earned 0 total points
ID: 35059698
Hi thydzik --

I was able to solve this problem with the code listed below. I decided I didn't want to write the 'send keys' portion to automate the production of the OCR readable PDF files. I'll do that manually. The code below creates bookmarks from a file and saves them to table in Access, and then applys parent and children bookmarks to the Acrobat document in step 4.

The code reads in all words in a pdf document and stored the words in a table called tblContainer. At the same the code developed 3 potential strings to use as a bookmark text. One is called the parent, the other two bookmark text and corrected bookmark text. The parent and corrected bookmark text are what I eventually apply as bookmarks. The only tricks here are I only need text from every other page (hence the mod 2), and that I need to find certain words like 'Route' and 'Block' and get the word that immediately follows these words. I find these by looping.

The only portion of the code that may be of interest here is how i apply the bookmarks and parent bookmarks. It uses the java script object of Acrobat accessable via VBA to apply children to the bookmark root object. code is below. Also, applying color to the bookmark may be of interest, though I haven't tested the color portion yet.The code below works except I haven't tested passing the color if there is a detected error in reading the bookmark object.

The color itself has to be passed as a variant array through the javascript object--it took me awhile to find documentation on how this is supposed to work..This code should make the bookmark yellow if the code things the bookmark might be in error.

Also, I anticipate modifying this script to do some more document handling, so there are some seeming dead ends where I get information about the path and file name, but those are irrelevant to the portion in Step 4.

Function acAdd_Bookmarks_To_File()

'Needed for Step 1, aquiring path and file names
Dim lngCount As Long 'length of path and file name
Dim lngI As Long 'integer counter variable
Dim strStore1 As String
Dim strStore2 As String
Dim strStore3 As String
Dim strStore4 As String
Dim strStore5 As String
Dim strPathOnly As String
Dim strFileOnly As String
Dim strFileOnlyWithoutExtension As String
Dim fDialog As Office.FileDialog
Dim varFile As Variant
Dim strPathAndFileName As String

'Needed for Step 2, Determining File Type
Dim bHeadway As Boolean
Dim bTraincard As Boolean
Dim strSubDirectory As String
Dim strNewPathAndFile As String
Dim strNewFilePath As String
Dim strNewPath As String
Dim lngPageNumFromObject As Long

'Needed for Step 3, opening PdF file

Dim AcrobatApp As Object
Dim AcrobatAVDocument As Object
Dim AcrobatPDDocument As Object
Dim AcrobatPageView As Object
Dim AcrobatJSObject As Object
Dim lngNumberPages As Long
Dim retVal As Double
Dim lngMessage As Long
Dim db As DAO.Database
Dim rec As DAO.Recordset
Dim rec2 As DAO.Recordset
Dim strSQL As String
Dim lngJ As String
Dim lngPageWordCount As Long
Dim lngPageWordMax As Long
Dim strPotentialBookmark As String
Dim strParent As String
Dim strWord As String
Dim strServiceName As String
Dim strRouteNum As String
Dim strBlockNum As String
Dim bolServiceTag As Boolean
Dim bolRouteTag As Boolean
Dim bolBlockTag As Boolean
Dim lngServiceWordNum As Long
Dim lngRouteWordNum As Long
Dim lngBlockWordNum As Long
Dim strCorrectedBookmark As String
Dim bolAllGood As Boolean
Dim bolGotNewData As Boolean

'Variable for Step 4 to apply bookmarks to PDF file

Dim rec3 As DAO.Recordset
Dim vBookmarkRoot As Variant
Dim BookmarkRootObject As Object
Dim vBookmarkChild As Variant
Dim BookmarkChildObject As Object
Dim strBookmarkParent As String
Dim strBookmarkText As String
Dim bTitle As Boolean
Dim lngPage As Long
Dim strOldParent As String
Dim lngBookmarkCounter As Long
Dim lngParentCounter As Long
Dim Color(0 To 4) As Variant
Dim strPageNumberText As String
Dim vBookmarkColor As Variant
Dim vBookmarkAction As Variant
Dim lngPageApparant As Long


bolGotNewData = False

'Step 1: Get the path where the file is stored

Set fDialog = Application.FileDialog(msoFileDialogFilePicker)

With fDialog
    .AllowMultiSelect = False
    .Title = "Please select one Acrobat file that has been made searchable"
    .Filters.Add "Acrobat Files", "*.PDF"

    If .Show = True Then
        For Each varFile In .SelectedItems
            strPathAndFileName = varFile
    End If

End With

lngCount = Len(strPathAndFileName)
For lngI = 0 To lngCount
    strStore1 = Left(strPathAndFileName, lngCount - lngI)
    strStore2 = Right(strStore1, 1)
    strStore3 = Right(strPathAndFileName, lngI)
    'Debug.Print strStore1, strStore2, strStore3
    If strStore2 = "\" Then
        strPathOnly = Left(strStore1, Len(strStore1))
        strFileOnly = strStore3
        Exit For
    End If

For lngI = 0 To lngCount
    strStore4 = Left(strFileOnly, lngCount - lngI)
    strStore5 = Right(strStore4, 1)
    'Debug.Print strStore4, strStore5
    If strStore5 = "." Then
        strFileOnlyWithoutExtension = Left(strStore4, Len(strStore4) - 1)
    Exit For
    End If

'Debug.Print strPathOnly
'Debug.Print strFileOnly
'Debug.Print strFileOnlyWithoutExtension

'Step 2 = Determine type of file, set sub directory string
bHeadway = False
bTraincard = False

If InStr(1, UCase(strFileOnly), "HEADWAY") Then
    bHeadway = True
    strSubDirectory = "Headway_TIFFs"
End If

If InStr(1, UCase(strFileOnly), "TRAINCARD") Then
    bTraincard = True
    strSubDirectory = "Traincard_TIFFs"
End If

strNewFilePath = strFileOnly & strSubDirectory
Dir (strPathOnly)

'Step 3 = Open file in Adobe Acrobat, read all contents into a database table called tblContainer, while reading the document
' find the words 'Route' and 'Block' and look for a service like ('Weekday, Saturday, Sunday, or Custom') and
' put together some strings that can potentially be used as bookmark text and save them to a table called
' bookmark text

Set db = CurrentDb
strSQL = "Delete from tblContainer"
db.Execute strSQL
strSQL = "select page_num, long_i, long_j, text_string from tblContainer"
Set rec = db.OpenRecordset(strSQL)
strSQL = "Delete from tblBookmarkText"
db.Execute strSQL
strSQL = "Select page_num, parent, corrected_bookmark_text, all_good, potential_bookmark_text from tblBookmarkText"
Set rec2 = db.OpenRecordset(strSQL)

If UCase(Right(strFileOnly, 3)) Like "PDF" Then
    Set AcrobatApp = CreateObject("AcroExch.App")
    Set AcrobatAVDocument = CreateObject("AcroExch.AVDoc")
    'Set AcrobatPageView = CreateObject("AcroExch.AVPageView")
    Set AcrobatPDDocument = CreateObject("AcroExch.PDdoc")
    AcrobatAVDocument.Open strPathAndFileName, strPathAndFileName
    Set AcrobatPageView = AcrobatAVDocument.GetAVPageView
    Set AcrobatPDDocument = AcrobatAVDocument.GetPDDoc
    Set AcrobatJSObject = AcrobatPDDocument.GetJSObject
    lngNumberPages = AcrobatPDDocument.GetNumPages()
    strParent = ""
    For lngI = 1 To lngNumberPages - 1
        'AcrobatJSObject.GoToPage (lngI)
        AcrobatAVDocument.GetAVPageView.Goto lngI
        lngPageWordCount = AcrobatJSObject.getpageNumWords(lngI)
        lngJ = 0
        strPotentialBookmark = ""
        If lngPageWordCount < 4000 Then
            lngPageWordMax = lngPageWordCount
            lngPageWordMax = 4000
        End If
        bolServiceTag = False
        bolRouteTag = False
        bolBlockTag = False
        bolAllGood = False
        lngRouteWordNum = 0
        lngBlockWordNum = 0
        strCorrectedBookmark = ""
        strServiceName = ""
        Do While lngJ < lngPageWordMax - 1
            lngPageNumFromObject = AcrobatJSObject.pageNum
            If lngPageNumFromObject Mod 2 = 1 Then
                rec("page_num") = AcrobatJSObject.pageNum
                rec("long_i") = lngI
                rec("long_j") = lngJ
                rec("text_string") = AcrobatJSObject.getpageNthWord(lngI, lngJ)
                strWord = UCase(AcrobatJSObject.getpageNthWord(lngI, lngJ))
                If lngJ = 0 And AcrobatJSObject.getpageNthWord(lngI, lngJ) <> UCase(strParent) Then
                    If Len(strWord) > 1 Then
                        strParent = UCase(Left(strWord, 1)) & LCase(Right(strWord, Len(strWord) - 1))
                        strParent = strWord
                    End If
                End If
                If bolServiceTag = False And (strWord Like "WEEKDAY" Or strWord Like "SATURDAY" Or strWord Like "SUNDAY" Or strWord Like "CUSTOM") Then
                    strServiceName = UCase(Left(strWord, 1)) & LCase(Right(strWord, Len(strWord) - 1))
                    bolServiceTag = True
                End If
                If bolRouteTag = False And strWord Like "ROUTE" Then
                    lngRouteWordNum = lngJ
                End If
                If bolRouteTag = False And lngRouteWordNum + 1 > 1 And lngRouteWordNum + 1 = lngJ And Len(Trim(strWord)) <= 4 Then
                    strRouteNum = strWord
                    bolRouteTag = True
                End If
                If bolBlockTag = False And strWord Like "BLOCK" Then
                    lngBlockWordNum = lngJ
                End If
                If bolBlockTag = False And lngBlockWordNum + 1 > 1 And lngBlockWordNum + 1 = lngJ And Len(Trim(strWord)) <= 2 Then
                    strBlockNum = strWord
                    bolBlockTag = True
                End If
                If lngJ > 0 And lngJ <= 30 Then
                    strPotentialBookmark = strPotentialBookmark & " " & AcrobatJSObject.getpageNthWord(lngI, lngJ)
                End If
                bolAllGood = bolServiceTag And bolRouteTag And bolBlockTag
                If bolAllGood = True Then
                    Exit Do
                End If
            End If
            lngJ = lngJ + 1
        If lngPageNumFromObject Mod 2 = 1 Then
            bolAllGood = bolServiceTag And bolRouteTag And bolBlockTag
            strCorrectedBookmark = strServiceName & " Route " & strRouteNum & " Block " & strBlockNum
            If lngPageNumFromObject Mod 7 = 0 Then Debug.Print bolAllGood, lngPageNumFromObject, strCorrectedBookmark
            rec2("page_num") = AcrobatJSObject.pageNum
            rec2("parent") = strParent
            rec2("corrected_bookmark_text") = strCorrectedBookmark
            rec2("all_good") = bolAllGood
            rec2("potential_bookmark_text") = strPotentialBookmark
         End If
bolGotNewData = True
strStorageLocation = strPathOnly & strSubDirectory & "_OCR"
Set rec2 = Nothing
End If

'Step 4 Apply the bookmarks to the Acrobat document by reading the corrected_bookmark_text from the tblBookmarkText.
'Assign each bookmark child to an appropriate parent bookmark.

strSQL = "Select page_num, parent, corrected_bookmark_text, all_good, potential_bookmark_text from tblBookmarkText order by clng(page_num)"
Set rec3 = db.OpenRecordset(strSQL)
strOldParent = ""
lngBookmarkCounter = 0
lngParentCounter = 0

Color(0) = "CMYK"
Color(1) = 0#
Color(2) = 0#
Color(3) = 1#
Color(4) = 0#

If rec3.BOF = False And rec3.EOF = False And bolGotNewData = True Then

    'Set AcrobatBookmark = CreateObject("AcroExch.PDBookmark", "")
    Do While rec3.EOF = False
       strBookmarkParent = rec3("parent")
       strBookmarkText = rec3("corrected_bookmark_text")
       bolAllGood = False
       bolAllGood = rec3("all_good")
       lngPage = rec3("page_num")
       Set BookmarkRootObject = AcrobatJSObject.BookmarkRoot
       If strBookmarkParent <> strOldParent Then
            lngPageApparant = lngPage
            AcrobatAVDocument.GetAVPageView.Goto lngPageApparant

            BookmarkRootObject.createChild strBookmarkParent, "pageNum=" & lngPage, lngParentCounter
            vBookmarkChild = BookmarkRootObject.Children
            Set BookmarkChildObject = vBookmarkChild(lngParentCounter)
            strPageNumberText = "pageNum =" & lngPageApparant
            BookmarkChildObject.createChild strBookmarkText, "pageNum=" & lngPage, lngBookmarkCounter
            If bolAllGood = False Then
                BookmarkChildObject.Children(lngBookmarkCounter).Color = Color()
            End If
            strOldParent = strBookmarkParent
            lngParentCounter = lngParentCounter + 1
            lngBookmarkCounter = lngBookmarkCounter + 1
            AcrobatAVDocument.GetAVPageView.Goto lngPageApparant
            strPageNumberText = "pageNum =" & lngPageApparant
            BookmarkChildObject.createChild strBookmarkText, "pageNum=" & lngPage, lngBookmarkCounter
            lngBookmarkCounter = lngBookmarkCounter + 1
        End If

End If

AcrobatAVDocument.Close (0)


Set rec = Nothing
Set rec3 = Nothing
Set db = Nothing
Set AcrobatAVDocument = Nothing
Set AcrobatPageView = Nothing
Set AcrobatPDDocument = Nothing
Set AcrobatJSObject = Nothing
Set AcrobatApp = Nothing

acAdd_Bookmarks_To_File = True

End Function

LVL 31

Expert Comment

by:James Murrell
ID: 36413474
This question has been classified as abandoned and is closed as part of the Cleanup Program. See the recommendation for more details.

Featured Post

Secure Your Active Directory - April 20, 2017

Active Directory plays a critical role in your company’s IT infrastructure and keeping it secure in today’s hacker-infested world is a must.
Microsoft published 300+ pages of guidance, but who has the time, money, and resources to implement? Register now to find an easier way.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Access custom database properties are useful for storing miscellaneous bits of information in a format that persists through database closing and reopening.  This article shows how to create and use them.
Did you know that more than 4 billion data records have been recorded as lost or stolen since 2013? It was a staggering number brought to our attention during last week’s ManageEngine webinar, where attendees received a comprehensive look at the ma…
In this seventh video of the Xpdf series, we discuss and demonstrate the PDFfonts utility, which lists all the fonts used in a PDF file. It does this via a command line interface, making it suitable for use in programs, scripts, batch files — any pl…
Access reports are powerful and flexible. Learn how to create a query and then a grouped report using the wizard. Modify the report design after the wizard is done to make it look better. There will be another video to explain how to put the final p…

733 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question