AztecCyclocross
How do I use VBA to select first line of each page of a PDF document and create a bookmark?
I have a 1,500-page PDF document. I can open the PDF file with the script below from VBA (using Access 2007), see the document, and see that I'm highlighting the words 'Weekday', 'Route', and 'Block' in the document. I can also get the total number of pages in the document.
What I want to do is highlight the first line of text on each page in the document and create a bookmark for each page based on that first line of text, so at the end of this process I'll have 1,500 bookmarks. Any idea what object in Adobe can access a portion of a page (like the first line)?
Code that works is below. Try it yourself if you have Access and Adobe Pro, just substitute your document name for the Path and PDF file.
Function Add_Traincard_Bookmarks(PathAndPDF_File As String) As Boolean
    Dim Exch As Object
    Dim AVDocu As Object
    Dim AVPageView As Object
    Dim PDDocu As Object
    Dim PDPage As Object
    Dim PDText As Object
    Dim numPages As Integer
    Dim bFile As Boolean
    Dim bShow As Boolean
    Dim iPageNumber As Integer

    ' Create the Acrobat application and document objects
    Set Exch = CreateObject("AcroExch.App")
    Set AVDocu = CreateObject("AcroExch.AVDoc")
    Set PDDocu = CreateObject("AcroExch.PDDoc")

    ' Open the file; the second argument is the window title
    AVDocu.Open PathAndPDF_File, PathAndPDF_File
    Debug.Print bShow
    bShow = Exch.Show()
    Debug.Print bShow

    ' Get the underlying PDDoc and its page count
    Set PDDocu = AVDocu.GetPDDoc
    numPages = PDDocu.GetNumPages()
    Debug.Print numPages
    Set AVPageView = AVDocu.GetAVPageView

    ' FindText arguments: text, case sensitive, whole words only, reset search
    AVDocu.FindText "WEEKDAY", True, True, True
    AVDocu.FindText "ROUTE", True, True, True
    AVDocu.FindText "BLOCK", True, True, True

    ' Close the documents without saving and release the objects
    PDDocu.Close
    AVDocu.Close (0)
    Exch.Exit
    Set Exch = Nothing
    Set PDDocu = Nothing
    Set AVDocu = Nothing
End Function
ASKER
Thanks. The getPageNthWord method worked well on my first test document. I was able to extract all the words from a 4-page version of my PDF document and read them into a database table in Access so I could figure out which words on each page I want to use for the bookmark. It seems the words I want are in a header and aren't the first words stored in the document, but I should be able to figure it out.
However, before I started writing the code to parse words and create bookmarks, I tried reading another PDF document, and found that the same code would not read a second document that was very similar and created by the same process as the first document. I was able to count the number of words using getPageNumWords, but when I cycled through all i, j values, none returned any text strings with the getPageNthWord method. I tested my code against the first document again, and was still able to read words, so I did more research.
I observed that on the first document I could search manually for text using Ctrl-F from within the application, but I couldn't do so on the second document. So I theorized that the second document was saving the words as graphics. On the second document I ran the Document > OCR Text Recognition > Recognize Text Using OCR ... option from within the application, but I got the message "Acrobat could not perform recognition (OCR) on the page because: this page contains renderable text."
So I found a link on the Adobe website, http://kb2.adobe.com/cps/333/333110.html, that tells me in Solution 2 how to convert the document to a set of TIFF files, run OCR on those files, and recombine the pages into a PDF file. This could work, but seems awfully convoluted. Solution 1 on this link says to "Obtain a version of the document that does not contain renderable (editable) text". I've talked to the developer of the PDF files and found that to him this process is a 'black box', and he doesn't think he can control whether PDF documents contain text or graphics. Any suggestions on how to create a PDF file that is all text elements (so I don't have to do any OCR) or all graphics elements (so I can use OCR on the PDF file directly without having to break my 1,500-page document into 1,500 TIFF files and then recombine them) would be appreciated. Any ideas?
Or, alternatively, is there some other way to read the words from the second document that is different from the getPageNthWord method?
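For readers following along, the word-reading and bookmarking steps described in this post could be sketched roughly as below. This is a minimal sketch, not code from the thread: it assumes Acrobat Pro is installed and that the Acrobat JavaScript methods getPageNumWords, getPageNthWord, and bookmarkRoot.createChild are reachable through the bridge returned by PDDoc.GetJSObject. The routine name BookmarkFirstWords and the choice of the first three stored words as the bookmark title are hypothetical.

```vba
' Sketch only: BookmarkFirstWords and WORDS_PER_TITLE are illustrative names.
Sub BookmarkFirstWords(ByVal PathAndPDF_File As String)
    Const WORDS_PER_TITLE As Long = 3   ' assumption: title = first 3 stored words
    Const PDSaveFull As Long = 1        ' Acrobat IAC flag: write the whole file
    Dim PDDocu As Object, jso As Object, root As Object
    Dim p As Long, i As Long, nPages As Long, nWords As Long
    Dim title As String

    Set PDDocu = CreateObject("AcroExch.PDDoc")
    If Not PDDocu.Open(PathAndPDF_File) Then Exit Sub

    Set jso = PDDocu.GetJSObject        ' bridge to the Acrobat JavaScript Doc object
    Set root = jso.bookmarkRoot
    nPages = PDDocu.GetNumPages()

    For p = 0 To nPages - 1             ' JavaScript page numbers are zero-based
        title = ""
        nWords = jso.getPageNumWords(p)
        For i = 0 To nWords - 1
            If i >= WORDS_PER_TITLE Then Exit For
            ' & treats Null/Empty as "", so unreadable words simply add nothing
            title = Trim$(title & " " & jso.getPageNthWord(p, i))
        Next i

        ' Fallback for pages where no text comes back (as with the second document)
        If Len(title) = 0 Then title = "Page " & (p + 1)

        ' createChild(name, JavaScript action, index): jump to page p when clicked
        root.createChild title, "this.pageNum=" & p, p
    Next p

    PDDocu.Save PDSaveFull, PathAndPDF_File
    PDDocu.Close
End Sub
```

Because the words are returned in stored order rather than visual order, the first stored words may not be the header line; getPageNthWordQuads, which returns each word's position on the page, is one possible way to pick out the topmost words instead.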
AztecCyclocross, if you can't select the text or find the text, I don't think it would be possible without doing OCR, and even with scripted OCR the results may not always be accurate.
How is the document being created? What is the source?
ASKER
thydzik,
The PDF document is created via Business Objects Crystal Reports. I think version IX.
I've partly written a program to implement Solution 2 in the following link: http://kb2.adobe.com/cps/333/333110.html. I've been able to create a folder via VBA, then take a test file, split it into TIFF files, and save them in a subfolder.
Where I am now hung up is finding a way to execute the Adobe Acrobat Pro "Documents | OCR Text Recognition | Recognize Text in Multiple Files Using OCR ..." command. I can do this step manually using the interface, and it seems to give good results: it creates nice PDF files in my subdirectory that can now be searched using Ctrl-F. I'd like to do it directly from VBA using the application object, the JSObject, or any other object, but I can't find a reference on how to access this tool via a script of any sort. Ideally I'd pass the objects the folder where the TIFF files are stored and the folder where I want the new searchable PDF files created, and then the program could take over and do its work of converting them. Any ideas would be helpful.
Then I would need to recombine these PDF documents somehow using VBA code. I'm not sure how to do that yet either.
Or, if there is a simple switch somewhere in Crystal Reports that will export these as 'searchable' PDF documents directly, that would save me the time of writing all of this code to break the document into TIFF files, convert them using OCR, and recombine them into one PDF so I can apply bookmarks. But neither I nor the Crystal Reports developer seems to know what that switch would be or whether it even exists. Any thoughts on this would be helpful as well.
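The recombining step mentioned above could look something like the following. It is a rough sketch under stated assumptions, not a tested solution from the thread: it relies on the Acrobat IAC PDDoc methods Open, GetNumPages, InsertPages, Save, and Close, and the function name MergePdfFolder and its two parameters are hypothetical. It also assumes the per-page PDF files are named so that Dir returns them in page order (for example zero-padded numbers), which Dir itself does not guarantee, and that OutputPath points outside the source folder.

```vba
' Sketch only: MergePdfFolder, FolderPath and OutputPath are illustrative names.
Function MergePdfFolder(ByVal FolderPath As String, ByVal OutputPath As String) As Boolean
    Const PDSaveFull As Long = 1            ' Acrobat IAC flag: write the whole file
    Dim target As Object, part As Object
    Dim fileName As String

    If Right$(FolderPath, 1) <> "\" Then FolderPath = FolderPath & "\"

    fileName = Dir$(FolderPath & "*.pdf")
    Do While Len(fileName) > 0
        Set part = CreateObject("AcroExch.PDDoc")
        If part.Open(FolderPath & fileName) Then
            If target Is Nothing Then
                Set target = part           ' first file becomes the base document
            Else
                ' Append every page of part after the current last page (zero-based);
                ' the final argument says not to copy bookmarks from part
                target.InsertPages target.GetNumPages() - 1, part, 0, part.GetNumPages(), False
                part.Close
            End If
        End If
        fileName = Dir$()                   ' next matching file
    Loop

    If Not target Is Nothing Then
        MergePdfFolder = target.Save(PDSaveFull, OutputPath)
        target.Close
    End If
End Function
```

A call such as MergePdfFolder("C:\Temp\OcrPages", "C:\Temp\Combined.pdf") (both paths are placeholders) would return True if Acrobat reports the save succeeded.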
AztecCyclocross,
Have a look at
http://groups.google.com/group/adobe.acrobat.sdk/browse_thread/thread/c850ced121f06151?pli=1&safe=on
It doesn't look like it is possible, unless you use SendKeys.
This question has been classified as abandoned and is closed as part of the Cleanup Program. See the recommendation for more details.