Solved

How do I use VBA to select first line of each page of a PDF document and create a bookmark?

Posted on 2011-02-23
8
4,355 Views
Last Modified: 2013-11-10
I have a 1,500 page PDF document. I can open the PDF file with the script below from VBA (using Access 2007), see the document, and see that I"m highlighting the words 'Weekday','Route', and 'Block' in the document. I can also get the total number of pages in the document.

What I want to do is highlight the first line of text on each page in the document and create a bookmark for each page based on that first line of text, so at the end of this process I'll have 1,500 bookmarks. Any idea what object in Adobe can access a portion of a page (like the first line)?

Code that works is below. Try it yourself if you have Access and Adobe Pro, just substitute your document name for the Path and PDF file.

Function Add_Traincard_Bookmarks(PathAndPDF_File As String) As Boolean

Dim Exch As Object
Dim AVDocu As Object
Dim AVPageView As Object
Dim PDDocu As Object
Dim PDPage As Object
Dim PDText As Object

Dim numPages As Integer
Dim bFile As Boolean
Dim bShow As Boolean
Dim iPageNumber As Integer


Set Exch = CreateObject("AcroExch.App")
Set AVDocu = CreateObject("AcroExch.AVDoc")
Set PDDocu = CreateObject("AcroExch.PDDoc")

AVDocu.Open PathAndPDF_File, PathAndPDF_File

Debug.Print bShow
bShow = Exch.Show()
Debug.Print bShow

Set PDDocu = AVDocu.GetPDDoc
numPages = PDDocu.GetNumPages()
Debug.Print numPages

Set AVPageView = AVDocu.GetAVPageView

AVDocu.FindText "WEEKDAY", True, True, True

AVDocu.FindText "ROUTE", True, True, True
AVDocu.FindText "BLOCK", True, True, True

PDDocu.Close
AVDocu.Close (0)

Exch.Exit

Set Exch = Nothing
Set PDDocu = Nothing
Set AVDocu = Nothing

End Function
0
Comment
Question by:AztecCyclocross
  • 3
  • 3
8 Comments
 
LVL 11

Expert Comment

by:thydzik
Comment Utility
have a look at the attached
Sub dasdf()
Dim Exch As Object
Dim AVDocu As Object
Dim AVPageView As Object
Dim PDDocu As Object
Dim PDPage As Object
Dim PDText As Object

Dim PDBookmark As Object

Dim numPages As Integer
Dim bFile As Boolean
Dim bShow As Boolean
Dim iPageNumber As Integer

Dim i As Long, j As Long

Set Exch = CreateObject("AcroExch.App")
Set AVDocu = CreateObject("AcroExch.AVDoc")
Set PDDocu = CreateObject("AcroExch.PDDoc")

AVDocu.Open PathAndPDF_File, PathAndPDF_File

Debug.Print bShow
bShow = Exch.Show()
Debug.Print bShow

Set PDDocu = AVDocu.GetPDDoc
numPages = PDDocu.GetNumPages()
Debug.Print numPages

Set AVPageView = AVDocu.GetAVPageView

Dim bookmarkstr As String

'AVDocu.FindText "WEEKDAY", True, True, True
'AVDocu.FindText "ROUTE", True, True, True
'AVDocu.FindText "BLOCK", True, True, True
Dim jso As Object
Set jso = PDDocu.GetJSObject

For i = 0 To numPages - 1
 
    AVDocu.GetAVPageView.GoTo i

    
    pageWordCount = jso.getPageNumWords(i)
    'for each word
    j = 0
    bookmarkstr = jso.getPageNthWord(i, j)
    j = 1
    Do While j < 10 Or j < pageWordCount - 1
        bookmarkstr = bookmarkstr & " " & jso.getPageNthWord(i, j)
        j = j + 1
    Loop

    'Create BookMark Object
    Set PDBookmark = CreateObject("AcroExch.PDBookmark", "")
    'execute the menu item
    Exch.MenuItemExecute ("NewBookmark")
    'set bookmark title
    btitle = PDBookmark.GetByTitle(PDDocu, "Untitled")
    btitle = PDBookmark.SetTitle(bookmarkstr)

Next i

Exch.MenuItemExecute ("Save")
PDDocu.Close
AVDocu.Close (0)

Exch.Exit

Set Exch = Nothing
Set PDDocu = Nothing
Set AVDocu = Nothing
End Sub

Open in new window

0
 

Author Comment

by:AztecCyclocross
Comment Utility
Thanks. The getPageNthWord worked well on my first test document. I was able to extract all the words contants of a 4 page version of my PDF document and read it into a database table in Access so I could figure out which words on page I want to use for bookmark. It seems the words I want to show are in a heaader and arent' the first words stored in the document, but I should be able to figure it out.

However, before I started writing the code to parse words and create bookmarks, I tried reading another PDF document, and found that the same code would not read a second document that was very similar and created by the same process as the first document. I was able to count the number of words, using the getPageNumWords but when I cycled through all i, j values, none returned any text strings with getPageNthWord method. I tested my code against the first document again, and was still able to read words, so I did more research.

I observed that on the first document I could search manaully for text in the document using the ctrl-F from within the application, but I coudn't do so on the second document. So I theorized that the second document was saving the words as graphics. So on the second document I ran the Document > OCR Text Recognition > Recoginize Text using OCR ... option from within the application. But I get a message "Acrobat could not perform recogntion (OCR) on the page because: this page contains renderable text."

So I found a link on the adobe website http://kb2.adobe.com/cps/333/333110.html that tells me in Solution 2 how convert the document to a set of TIFF files, use OCR on the documents recombine these pages into a PDF file. This could work, but seems awfully convoluted. Solution 1 on this link says to "Obtain a version of the document that does not contain renderable (editable) text". I've talked to the developer of the PDF files and found that to him this process is a 'black box' and he doesn't think he can control whether PDF documents contain text or graphics. Any suggestions how to create a PDF file that is all text elements (so I don't have to do any OCR) or all graphics elements (so I can use the OCR on the PDF file directly without having to break my 1,500 page document into 1,500 tiff files then recombine) would be appreciated. Any ideas?

Or, maybe alternatively, is ther some other way to read the words from the second document that is different than the getPageNthWord method?




0
 
LVL 11

Expert Comment

by:thydzik
Comment Utility
AztecCyclocross, if you can't select the text or find the text, I don't think it would be possible without doing a OCR. and even scripting the OCR, it may not be always accurate.

how is the document being created? what is the source?
0
Complete VMware vSphere® ESX(i) & Hyper-V Backup

Capture your entire system, including the host, with patented disk imaging integrated with VMware VADP / Microsoft VSS and RCT. RTOs is as low as 15 seconds with Acronis Active Restore™. You can enjoy unlimited P2V/V2V migrations from any source (even from a different hypervisor)

 

Author Comment

by:AztecCyclocross
Comment Utility
thydzik,

The PDF document is created via Business Objects Crystal Reports. I think version IX.

I've part-way written a program to implement Solution 2 in the following link http://kb2.adobe.com/cps/333/333110.html. I've been able to create a folder via VBA, then take a test file and break it into the tiff files and save in a sub folder.

Where I am now hung up is finding a way to execute the Adobe Acrobat Pro "Documents | OCR Text Recognition | Recognize Text in Multiple Files Using OCR ..." I can do this step manually using the interface, and it seems to give good results as it creates nice PDF files in my sub directory that can now be searched using cntl-F, but I'd like to do it directly from VBA using the application object or the jsobject or any other object, but I can't seem to find a reference on how to access this tool via a script of any sort.  Any ideas would be helpful. I'd like access the objects and pass them the file location where the tiff files are stored and the file location where I'd want to create the new searchable PDF files, and then the program could take over and do it's work on converting them.

Then I would need to recomine these PDF documents somehow using VBA code. I'm not sure how to do that yet either.

Or, if there is a simple switch somewhere in the Crystal Reports that will export these as 'searchable' PDF documents directly, that would save me the time of writing all of this code to break into Tiff files and then convert using OCR then recombine in one PDF so I can apply bookmarks. But neither I nor the Crystal Reports developer seem to know what that switch would be or if it even exists. Any thoughts on this would be helpful as well.









0
 
LVL 11

Expert Comment

by:thydzik
Comment Utility
AztecCyclocross,

have a look at
http://groups.google.com/group/adobe.acrobat.sdk/browse_thread/thread/c850ced121f06151?pli=1&safe=on

It doesn't look like it is possible, unless you use sendkeys.
0
 

Accepted Solution

by:
AztecCyclocross earned 0 total points
Comment Utility
Hi thydzik --

I was able to solve this problem with the code listed below. I decided I didn't want to write the 'send keys' portion to automate the production of the OCR readable PDF files. I'll do that manually. The code below creates bookmarks from a file and saves them to table in Access, and then applys parent and children bookmarks to the Acrobat document in step 4.

The code reads in all words in a pdf document and stored the words in a table called tblContainer. At the same the code developed 3 potential strings to use as a bookmark text. One is called the parent, the other two bookmark text and corrected bookmark text. The parent and corrected bookmark text are what I eventually apply as bookmarks. The only tricks here are I only need text from every other page (hence the mod 2), and that I need to find certain words like 'Route' and 'Block' and get the word that immediately follows these words. I find these by looping.

The only portion of the code that may be of interest here is how i apply the bookmarks and parent bookmarks. It uses the java script object of Acrobat accessable via VBA to apply children to the bookmark root object. code is below. Also, applying color to the bookmark may be of interest, though I haven't tested the color portion yet.The code below works except I haven't tested passing the color if there is a detected error in reading the bookmark object.

The color itself has to be passed as a variant array through the javascript object--it took me awhile to find documentation on how this is supposed to work..This code should make the bookmark yellow if the code things the bookmark might be in error.

Also, I anticipate modifying this script to do some more document handling, so there are some seeming dead ends where I get information about the path and file name, but those are irrelevant to the portion in Step 4.



Function acAdd_Bookmarks_To_File()


'Needed for Step 1, aquiring path and file names
Dim lngCount As Long 'length of path and file name
Dim lngI As Long 'integer counter variable
Dim strStore1 As String
Dim strStore2 As String
Dim strStore3 As String
Dim strStore4 As String
Dim strStore5 As String
Dim strPathOnly As String
Dim strFileOnly As String
Dim strFileOnlyWithoutExtension As String
Dim fDialog As Office.FileDialog
Dim varFile As Variant
Dim strPathAndFileName As String

'Needed for Step 2, Determining File Type
Dim bHeadway As Boolean
Dim bTraincard As Boolean
Dim strSubDirectory As String
Dim strNewPathAndFile As String
Dim strNewFilePath As String
Dim strNewPath As String
Dim lngPageNumFromObject As Long

'Needed for Step 3, opening PdF file

Dim AcrobatApp As Object
Dim AcrobatAVDocument As Object
Dim AcrobatPDDocument As Object
Dim AcrobatPageView As Object
Dim AcrobatJSObject As Object
Dim lngNumberPages As Long
Dim retVal As Double
Dim lngMessage As Long
Dim db As DAO.Database
Dim rec As DAO.Recordset
Dim rec2 As DAO.Recordset
Dim strSQL As String
Dim lngJ As String
Dim lngPageWordCount As Long
Dim lngPageWordMax As Long
Dim strPotentialBookmark As String
Dim strParent As String
Dim strWord As String
Dim strServiceName As String
Dim strRouteNum As String
Dim strBlockNum As String
Dim bolServiceTag As Boolean
Dim bolRouteTag As Boolean
Dim bolBlockTag As Boolean
Dim lngServiceWordNum As Long
Dim lngRouteWordNum As Long
Dim lngBlockWordNum As Long
Dim strCorrectedBookmark As String
Dim bolAllGood As Boolean
Dim bolGotNewData As Boolean

'Variable for Step 4 to apply bookmarks to PDF file

Dim rec3 As DAO.Recordset
Dim vBookmarkRoot As Variant
Dim BookmarkRootObject As Object
Dim vBookmarkChild As Variant
Dim BookmarkChildObject As Object
Dim strBookmarkParent As String
Dim strBookmarkText As String
Dim bTitle As Boolean
Dim lngPage As Long
Dim strOldParent As String
Dim lngBookmarkCounter As Long
Dim lngParentCounter As Long
Dim Color(0 To 4) As Variant
Dim strPageNumberText As String
Dim vBookmarkColor As Variant
Dim vBookmarkAction As Variant
Dim lngPageApparant As Long

'----------------------------

bolGotNewData = False

'Step 1: Get the path where the file is stored

Set fDialog = Application.FileDialog(msoFileDialogFilePicker)

With fDialog
    .AllowMultiSelect = False
    .Title = "Please select one Acrobat file that has been made searchable"
    .Filters.Clear
    .Filters.Add "Acrobat Files", "*.PDF"

    If .Show = True Then
        For Each varFile In .SelectedItems
 
            strPathAndFileName = varFile
        Next
    Else
    End If

End With



lngCount = Len(strPathAndFileName)
For lngI = 0 To lngCount
    strStore1 = Left(strPathAndFileName, lngCount - lngI)
    strStore2 = Right(strStore1, 1)
    strStore3 = Right(strPathAndFileName, lngI)
    'Debug.Print strStore1, strStore2, strStore3
    If strStore2 = "\" Then
        strPathOnly = Left(strStore1, Len(strStore1))
        strFileOnly = strStore3
        Exit For
    End If
Next

For lngI = 0 To lngCount
    strStore4 = Left(strFileOnly, lngCount - lngI)
    strStore5 = Right(strStore4, 1)
    'Debug.Print strStore4, strStore5
    If strStore5 = "." Then
        strFileOnlyWithoutExtension = Left(strStore4, Len(strStore4) - 1)
    Exit For
    End If
Next

'Debug.Print strPathOnly
'Debug.Print strFileOnly
'Debug.Print strFileOnlyWithoutExtension

'Step 2 = Determine type of file, set sub directory string
bHeadway = False
bTraincard = False

If InStr(1, UCase(strFileOnly), "HEADWAY") Then
    bHeadway = True
    strSubDirectory = "Headway_TIFFs"
End If

If InStr(1, UCase(strFileOnly), "TRAINCARD") Then
    bTraincard = True
    strSubDirectory = "Traincard_TIFFs"
End If

strNewFilePath = strFileOnly & strSubDirectory
Dir (strPathOnly)

'Step 3 = Open file in Adobe Acrobat, read all contents into a database table called tblContainer, while reading the document
' find the words 'Route' and 'Block' and look for a service like ('Weekday, Saturday, Sunday, or Custom') and
' put together some strings that can potentially be used as bookmark text and save them to a table called
' bookmark text

Set db = CurrentDb
strSQL = "Delete from tblContainer"
db.Execute strSQL
strSQL = "select page_num, long_i, long_j, text_string from tblContainer"
Set rec = db.OpenRecordset(strSQL)
strSQL = "Delete from tblBookmarkText"
db.Execute strSQL
strSQL = "Select page_num, parent, corrected_bookmark_text, all_good, potential_bookmark_text from tblBookmarkText"
Set rec2 = db.OpenRecordset(strSQL)

If UCase(Right(strFileOnly, 3)) Like "PDF" Then
   
    Set AcrobatApp = CreateObject("AcroExch.App")
    Set AcrobatAVDocument = CreateObject("AcroExch.AVDoc")
    'Set AcrobatPageView = CreateObject("AcroExch.AVPageView")
    Set AcrobatPDDocument = CreateObject("AcroExch.PDdoc")
 
    AcrobatAVDocument.Open strPathAndFileName, strPathAndFileName
     
    AcrobatApp.Show
       
    Set AcrobatPageView = AcrobatAVDocument.GetAVPageView
    Set AcrobatPDDocument = AcrobatAVDocument.GetPDDoc
    Set AcrobatJSObject = AcrobatPDDocument.GetJSObject
           
    lngNumberPages = AcrobatPDDocument.GetNumPages()
   
    strParent = ""
       
    For lngI = 1 To lngNumberPages - 1
       
        'AcrobatJSObject.GoToPage (lngI)
        AcrobatAVDocument.GetAVPageView.Goto lngI
        lngPageWordCount = AcrobatJSObject.getpageNumWords(lngI)
        lngJ = 0
        strPotentialBookmark = ""
       
        If lngPageWordCount < 4000 Then
            lngPageWordMax = lngPageWordCount
        Else
            lngPageWordMax = 4000
        End If
       
        bolServiceTag = False
        bolRouteTag = False
        bolBlockTag = False
        bolAllGood = False
        lngRouteWordNum = 0
        lngBlockWordNum = 0
        strCorrectedBookmark = ""
        strServiceName = ""
       
        Do While lngJ < lngPageWordMax - 1
           
            lngPageNumFromObject = AcrobatJSObject.pageNum
           
            If lngPageNumFromObject Mod 2 = 1 Then
           
                rec.AddNew
               
                rec("page_num") = AcrobatJSObject.pageNum
                rec("long_i") = lngI
                rec("long_j") = lngJ
                rec("text_string") = AcrobatJSObject.getpageNthWord(lngI, lngJ)
                rec.Update
                 
                strWord = UCase(AcrobatJSObject.getpageNthWord(lngI, lngJ))
               
                If lngJ = 0 And AcrobatJSObject.getpageNthWord(lngI, lngJ) <> UCase(strParent) Then
                    If Len(strWord) > 1 Then
                        strParent = UCase(Left(strWord, 1)) & LCase(Right(strWord, Len(strWord) - 1))
                    Else
                        strParent = strWord
                    End If
                End If
               
                If bolServiceTag = False And (strWord Like "WEEKDAY" Or strWord Like "SATURDAY" Or strWord Like "SUNDAY" Or strWord Like "CUSTOM") Then
                    strServiceName = UCase(Left(strWord, 1)) & LCase(Right(strWord, Len(strWord) - 1))
                    bolServiceTag = True
                End If
                   
                If bolRouteTag = False And strWord Like "ROUTE" Then
                    lngRouteWordNum = lngJ
                End If
               
                If bolRouteTag = False And lngRouteWordNum + 1 > 1 And lngRouteWordNum + 1 = lngJ And Len(Trim(strWord)) <= 4 Then
                    strRouteNum = strWord
                    bolRouteTag = True
                End If
               
                If bolBlockTag = False And strWord Like "BLOCK" Then
                    lngBlockWordNum = lngJ
                End If
                   
                If bolBlockTag = False And lngBlockWordNum + 1 > 1 And lngBlockWordNum + 1 = lngJ And Len(Trim(strWord)) <= 2 Then
                    strBlockNum = strWord
                    bolBlockTag = True
                End If
               
                If lngJ > 0 And lngJ <= 30 Then
                    strPotentialBookmark = strPotentialBookmark & " " & AcrobatJSObject.getpageNthWord(lngI, lngJ)
                End If
               
                bolAllGood = bolServiceTag And bolRouteTag And bolBlockTag
                If bolAllGood = True Then
                    Exit Do
                End If
           
            End If
           
            lngJ = lngJ + 1
                       
        Loop
       
   
        If lngPageNumFromObject Mod 2 = 1 Then
       
            bolAllGood = bolServiceTag And bolRouteTag And bolBlockTag
            strCorrectedBookmark = strServiceName & " Route " & strRouteNum & " Block " & strBlockNum
            If lngPageNumFromObject Mod 7 = 0 Then Debug.Print bolAllGood, lngPageNumFromObject, strCorrectedBookmark
       
            rec2.AddNew
            rec2("page_num") = AcrobatJSObject.pageNum
            rec2("parent") = strParent
            rec2("corrected_bookmark_text") = strCorrectedBookmark
            rec2("all_good") = bolAllGood
            rec2("potential_bookmark_text") = strPotentialBookmark
            rec2.Update
         
         End If
       
    Next
   
bolGotNewData = True
strStorageLocation = strPathOnly & strSubDirectory & "_OCR"
rec2.Close
Set rec2 = Nothing
         
End If

'Step 4 Apply the bookmarks to the Acrobat document by reading the corrected_bookmark_text from the tblBookmarkText.
'Assign each bookmark child to an appropriate parent bookmark.

strSQL = "Select page_num, parent, corrected_bookmark_text, all_good, potential_bookmark_text from tblBookmarkText order by clng(page_num)"
Set rec3 = db.OpenRecordset(strSQL)
strOldParent = ""
lngBookmarkCounter = 0
lngParentCounter = 0

Color(0) = "CMYK"
Color(1) = 0#
Color(2) = 0#
Color(3) = 1#
Color(4) = 0#


If rec3.BOF = False And rec3.EOF = False And bolGotNewData = True Then


    rec3.MoveFirst
    'Set AcrobatBookmark = CreateObject("AcroExch.PDBookmark", "")
   
    Do While rec3.EOF = False
   
       strBookmarkParent = rec3("parent")
       strBookmarkText = rec3("corrected_bookmark_text")
       bolAllGood = False
       bolAllGood = rec3("all_good")
       lngPage = rec3("page_num")
       
       Set BookmarkRootObject = AcrobatJSObject.BookmarkRoot
             
       If strBookmarkParent <> strOldParent Then
       
            lngPageApparant = lngPage
       
            AcrobatAVDocument.GetAVPageView.Goto lngPageApparant
           

            BookmarkRootObject.createChild strBookmarkParent, "pageNum=" & lngPage, lngParentCounter
            vBookmarkChild = BookmarkRootObject.Children
            Set BookmarkChildObject = vBookmarkChild(lngParentCounter)
            strPageNumberText = "pageNum =" & lngPageApparant
            BookmarkChildObject.createChild strBookmarkText, "pageNum=" & lngPage, lngBookmarkCounter
           
            If bolAllGood = False Then
                BookmarkChildObject.Children(lngBookmarkCounter).Color = Color()
            End If
           
            strOldParent = strBookmarkParent
           
            lngParentCounter = lngParentCounter + 1
            lngBookmarkCounter = lngBookmarkCounter + 1
           
        Else
            AcrobatAVDocument.GetAVPageView.Goto lngPageApparant
            strPageNumberText = "pageNum =" & lngPageApparant
            BookmarkChildObject.createChild strBookmarkText, "pageNum=" & lngPage, lngBookmarkCounter
            lngBookmarkCounter = lngBookmarkCounter + 1
           
        End If
   
        rec3.MoveNext
   
    Loop


End If


AcrobatAVDocument.Close (0)

AcrobatApp.Exit

rec.Close
rec3.Close
Set rec = Nothing
Set rec3 = Nothing
Set db = Nothing
Set AcrobatAVDocument = Nothing
Set AcrobatPageView = Nothing
Set AcrobatPDDocument = Nothing
Set AcrobatJSObject = Nothing
Set AcrobatApp = Nothing

acAdd_Bookmarks_To_File = True

End Function

0
 
LVL 31

Expert Comment

by:James Murrell
Comment Utility
This question has been classified as abandoned and is closed as part of the Cleanup Program. See the recommendation for more details.
0

Featured Post

How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

Join & Write a Comment

This is about my first experience with programming Arduino.
Although it can be difficult to imagine, someday your child will have a career of his or her own. He or she will likely start a family, buy a home and start having their own children. So, while being a kid is still extremely important, it’s also …
In Microsoft Access, learn how to “cascade” or have the displayed data of one combo control depend upon what’s entered in another. Base the dependent combo on a query for its row source: Add a reference to the first combo on the form as criteria i…
What’s inside an Access Desktop Database. Will look at the basic interface, Navigation Pane (Database Container), Tables, Queries, Forms, Report, Macro’s, and VBA code.

772 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

10 Experts available now in Live!

Get 1:1 Help Now