
How do I use VBA to select first line of each page of a PDF document and create a bookmark?

I have a 1,500 page PDF document. I can open the PDF file with the script below from VBA (using Access 2007), see the document, and see that I'm highlighting the words 'Weekday', 'Route', and 'Block' in the document. I can also get the total number of pages in the document.

What I want to do is highlight the first line of text on each page in the document and create a bookmark for each page based on that first line of text, so at the end of this process I'll have 1,500 bookmarks. Any idea what object in Adobe can access a portion of a page (like the first line)?

Code that works is below. Try it yourself if you have Access and Adobe Pro; just substitute your document name for the Path and PDF file.

Function Add_Traincard_Bookmarks(PathAndPDF_File As String) As Boolean

Dim Exch As Object
Dim AVDocu As Object
Dim AVPageView As Object
Dim PDDocu As Object
Dim PDPage As Object
Dim PDText As Object

Dim numPages As Integer
Dim bFile As Boolean
Dim bShow As Boolean
Dim iPageNumber As Integer

Set Exch = CreateObject("AcroExch.App")
Set AVDocu = CreateObject("AcroExch.AVDoc")
Set PDDocu = CreateObject("AcroExch.PDDoc")

AVDocu.Open PathAndPDF_File, PathAndPDF_File

Debug.Print bShow
bShow = Exch.Show()
Debug.Print bShow

Set PDDocu = AVDocu.GetPDDoc
numPages = PDDocu.GetNumPages()
Debug.Print numPages

Set AVPageView = AVDocu.GetAVPageView

AVDocu.FindText "WEEKDAY", True, True, True

AVDocu.FindText "ROUTE", True, True, True
AVDocu.FindText "BLOCK", True, True, True

PDDocu.Close
AVDocu.Close (0)

Exch.Exit

Set Exch = Nothing
Set PDDocu = Nothing
Set AVDocu = Nothing

End Function

Travis Hydzik

Have a look at the attached:

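'PathAndPDF_File is passed in; the original Sub used it without declaring it
Sub dasdf(PathAndPDF_File As String)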
Dim Exch As Object
Dim AVDocu As Object
Dim AVPageView As Object
Dim PDDocu As Object
Dim PDPage As Object
Dim PDText As Object

Dim PDBookmark As Object

Dim numPages As Integer
Dim bFile As Boolean
Dim bShow As Boolean
Dim iPageNumber As Integer

Dim i As Long, j As Long

Set Exch = CreateObject("AcroExch.App")
Set AVDocu = CreateObject("AcroExch.AVDoc")
Set PDDocu = CreateObject("AcroExch.PDDoc")

AVDocu.Open PathAndPDF_File, PathAndPDF_File

Debug.Print bShow
bShow = Exch.Show()
Debug.Print bShow

Set PDDocu = AVDocu.GetPDDoc
numPages = PDDocu.GetNumPages()
Debug.Print numPages

Set AVPageView = AVDocu.GetAVPageView

Dim bookmarkstr As String

'AVDocu.FindText "WEEKDAY", True, True, True
'AVDocu.FindText "ROUTE", True, True, True
'AVDocu.FindText "BLOCK", True, True, True
Dim jso As Object
Set jso = PDDocu.GetJSObject

For i = 0 To numPages - 1
 
    AVDocu.GetAVPageView.GoTo i

    
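    pageWordCount = jso.getPageNumWords(i)
    'build the title from at most the first 10 words on the page
    '(an approximation of the first line of text)
    j = 0
    bookmarkstr = jso.getPageNthWord(i, j)
    j = 1
    'And (not Or) stops at 10 words or at the end of the page, whichever comes first
    Do While j < 10 And j < pageWordCount
        bookmarkstr = bookmarkstr & " " & jso.getPageNthWord(i, j)
        j = j + 1
    Loop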

    'Create BookMark Object
    Set PDBookmark = CreateObject("AcroExch.PDBookmark", "")
    'execute the menu item
    Exch.MenuItemExecute ("NewBookmark")
    'set bookmark title
    btitle = PDBookmark.GetByTitle(PDDocu, "Untitled")
    btitle = PDBookmark.SetTitle(bookmarkstr)

Next i

Exch.MenuItemExecute ("Save")
PDDocu.Close
AVDocu.Close (0)

Exch.Exit

Set Exch = Nothing
Set PDDocu = Nothing
Set AVDocu = Nothing
End Sub
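
An alternative to driving the "NewBookmark" menu item is to create each bookmark through the JavaScript object that the code above already obtains with GetJSObject. The following is only a sketch, not tested against this document: it assumes the Acrobat JavaScript bookmarkRoot.createChild method is reachable through that object, that the document has no existing top-level bookmarks, and that pages are processed in order from the first page. The names PageJumpExpr and AddPageBookmark are made up for illustration.

```vba
Option Explicit

' Hypothetical helper: builds the JavaScript that runs when the bookmark
' is clicked. Page numbers are 0-based in Acrobat JavaScript.
Function PageJumpExpr(pageIndex As Long) As String
    PageJumpExpr = "this.pageNum = " & CStr(pageIndex) & ";"
End Function

' Hypothetical helper: jso is assumed to come from PDDocu.GetJSObject
' on a document that is already open.
Sub AddPageBookmark(jso As Object, title As String, pageIndex As Long)
    ' Assumption: bookmarks are added in page order to a document with no
    ' existing top-level bookmarks, so pageIndex is also the insert position.
    jso.bookmarkRoot.createChild title, PageJumpExpr(pageIndex), pageIndex
End Sub
```

Inside the page loop this would be called as AddPageBookmark jso, bookmarkstr, i, in place of the MenuItemExecute and SetTitle steps.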


AztecCyclocross

ASKER

Thanks. The getPageNthWord method worked well on my first test document. I was able to extract all the words from a 4-page version of my PDF document and read them into a database table in Access, so I could figure out which words on each page I want to use for the bookmark. It seems the words I want to show are in a header and aren't the first words stored in the document, but I should be able to figure it out.
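
Once the words of a page are in an array, picking the ones for the bookmark title could look something like the sketch below. BuildTitle is a made-up name, and the start position and word count are assumptions that would have to be worked out from the word table for the real document. Blank entries are skipped, so an empty result means no readable words were found in that range.

```vba
Option Explicit

' Hypothetical helper: joins up to maxWords non-empty entries of words,
' starting at index firstIdx, separated by single spaces.
Function BuildTitle(words As Variant, firstIdx As Long, maxWords As Long) As String
    Dim i As Long, used As Long
    Dim w As String, result As String

    For i = firstIdx To UBound(words)
        If used >= maxWords Then Exit For
        w = Trim$(CStr(words(i)))
        ' Skip entries that came back empty
        If Len(w) > 0 Then
            If Len(result) > 0 Then result = result & " "
            result = result & w
            used = used + 1
        End If
    Next i

    BuildTitle = result
End Function
```

For example, if the header words started at position 12 on every page, the title could be built with BuildTitle(pageWords, 12, 6), where pageWords holds the values returned by getPageNthWord for that page.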

However, before I started writing the code to parse words and create bookmarks, I tried reading another PDF document, and found that the same code would not read this second document, even though it was very similar and created by the same process as the first. I was able to count the number of words using the getPageNumWords method, but when I cycled through all i, j values, getPageNthWord returned no text strings. I tested my code against the first document again and was still able to read words, so I did more research.

I observed that on the first document I could search manually for text using Ctrl-F from within the application, but I couldn't do so on the second document. So I theorized that the second document was saving the words as graphics. On the second document I ran the Document > OCR Text Recognition > Recognize Text Using OCR... option from within the application, but I got the message "Acrobat could not perform recognition (OCR) on the page because: this page contains renderable text."

So I found a link on the Adobe website, http://kb2.adobe.com/cps/333/333110.html, that tells me in Solution 2 how to convert the document to a set of TIFF files, use OCR on those files, and recombine the pages into a PDF file. This could work, but seems awfully convoluted. Solution 1 on this link says to "Obtain a version of the document that does not contain renderable (editable) text". I've talked to the developer of the PDF files and found that to him this process is a 'black box', and he doesn't think he can control whether PDF documents contain text or graphics. Any suggestions on how to create a PDF file that is all text elements (so I don't have to do any OCR) or all graphics elements (so I can use OCR on the PDF file directly, without having to break my 1,500 page document into 1,500 TIFF files and then recombine them) would be appreciated. Any ideas?

Or, alternatively, is there some other way to read the words from the second document, different from the getPageNthWord method?

Travis Hydzik

AztecCyclocross, if you can't select the text or find the text, I don't think it would be possible without doing an OCR. And even with a scripted OCR, the result may not always be accurate.

How is the document being created? What is the source?

AztecCyclocross

ASKER

thydzik,

The PDF document is created via Business Objects Crystal Reports. I think version IX.

I've partly written a program to implement Solution 2 from the following link: http://kb2.adobe.com/cps/333/333110.html. I've been able to create a folder via VBA, then take a test file, break it into TIFF files, and save them in a subfolder.

Where I am now hung up is finding a way to execute the Adobe Acrobat Pro "Documents | OCR Text Recognition | Recognize Text in Multiple Files Using OCR ..." command. I can do this step manually using the interface, and it seems to give good results: it creates nice PDF files in my subdirectory that can now be searched using Ctrl-F. But I'd like to do it directly from VBA using the application object, the JSObject, or any other object, and I can't seem to find a reference on how to access this tool via a script of any sort. Any ideas would be helpful. I'd like to access the objects and pass them the location where the TIFF files are stored and the location where I want to create the new searchable PDF files, and then the program could take over and do its work of converting them.

Then I would need to recombine these PDF documents somehow using VBA code. I'm not sure how to do that yet either.
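
For the recombining step, one possible approach is the InsertPages method of the AcroExch.PDDoc object. The sketch below is untested and rests on assumptions: the single-page PDF files are named page_0001.pdf, page_0002.pdf and so on (a made-up scheme, zero-padded so the files stay in page order), and they all sit in one folder. PageFileName and MergePagePdfs are hypothetical names.

```vba
Option Explicit

' Hypothetical naming scheme: zero-padded so files sort in page order.
Function PageFileName(folder As String, pageIndex As Long) As String
    PageFileName = folder & "\page_" & Format$(pageIndex + 1, "0000") & ".pdf"
End Function

' Appends the per-page PDF files to the first one and saves the result.
' Returns True only if every file opened and the save succeeded.
Function MergePagePdfs(folder As String, pageCount As Long, outFile As String) As Boolean
    Const PDSaveFull As Long = 1   ' write the whole file
    Dim target As Object, part As Object
    Dim i As Long

    Set target = CreateObject("AcroExch.PDDoc")
    If Not CBool(target.Open(PageFileName(folder, 0))) Then Exit Function

    For i = 1 To pageCount - 1
        Set part = CreateObject("AcroExch.PDDoc")
        If Not CBool(part.Open(PageFileName(folder, i))) Then
            target.Close
            Exit Function
        End If
        ' Insert all pages of part after the current last page of target;
        ' the final 0 means bookmarks are not copied
        target.InsertPages target.GetNumPages() - 1, part, 0, part.GetNumPages(), 0
        part.Close
    Next i

    MergePagePdfs = CBool(target.Save(PDSaveFull, outFile))
    target.Close
End Function
```

The bookmarks would then be added to the merged file afterwards, since they are not carried over by this call.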

Or, if there is a simple switch somewhere in Crystal Reports that will export these as 'searchable' PDF documents directly, that would save me the time of writing all of this code to break the document into TIFF files, convert them using OCR, and then recombine them into one PDF so I can apply bookmarks. But neither I nor the Crystal Reports developer seem to know what that switch would be, or whether it even exists. Any thoughts on this would be helpful as well.

Travis Hydzik

AztecCyclocross,

have a look at
http://groups.google.com/group/adobe.acrobat.sdk/browse_thread/thread/c850ced121f06151?pli=1&safe=on

It doesn't look like it is possible, unless you use sendkeys.

ASKER CERTIFIED SOLUTION

AztecCyclocross


James Murrell

This question has been classified as abandoned and is closed as part of the Cleanup Program. See the recommendation for more details.