• Status: Solved

Using IFilter to examine a PDF downloaded in a TWebBrowser


There is an excellent discussion of using an IFilter to get the text out of a PDF document here:

http://www.experts-exchange.com/Programming/Programming_Languages/Delphi/Q_20293579.html

but my problem is a bit different.

I want to extract the text of a PDF that has been downloaded in a browser (IE, TWebBrowser), but I need to know:

a) when the download is complete. Can I use OnDocumentComplete, or does that only fire for HTML pages?

b) where the PDF is, so I can examine it. I suppose it is in a cache somewhere, but how can I find it, or establish the correspondence between the original PDF URL and the file name in the cache?

thanks
Mutley2003
 
Jacco commented:
I have tried the following:

I monitored all events generated by the TWebBrowser. The last OnDownloadComplete should mark the correct moment, but OnDownloadComplete fires three times. OnDocumentComplete also fires once. The last OnDownloadComplete is the one that comes after the OnDocumentComplete.

Then I started looking for the downloaded PDF, but it is nowhere to be found. IE probably streams it directly to Acrobat Reader and it is not on disk...

You could download the PDF using the IdHTTP component (of Indy) and save the PDF to a file and inspect it then. Let me know if you need a sample of that.
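
A rough sketch of that save-then-inspect idea, written in Python with the standard urllib module rather than Delphi and Indy's IdHTTP. The helper names download_to_file and looks_like_pdf are made up for illustration, and the check on the first bytes is only a sanity test, not a full validation:

```python
import shutil
import urllib.request
from pathlib import Path


def download_to_file(url: str, dest: Path) -> Path:
    """Stream the resource at `url` into `dest` and return `dest`."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def looks_like_pdf(path: Path) -> bool:
    """Cheap sanity check: a PDF file starts with the bytes %PDF-."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```

Once the file is saved under a name you chose yourself, there is no cache lookup to do: the path passed as dest is the one to hand to the IFilter.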

Regards Jacco
 
Mutley2003 (author) commented:
Hi Jacco

I also monitored the TWebBrowser events and got:
onBeforeNavigate2 not busy , loading
http://www.fia.com/resources/documents/1797101136__Appendix_L_a.pdf
onDownloadBegin busy , loading
onDownloadComplete not busy , loading
onDownloadBegin busy , loading
onNavigateComplete2 busy , loading
http://www.fia.com/resources/documents/1797101136__Appendix_L_a.pdf
onCommandStateChange busy , interactive
onDownloadComplete not busy , interactive
onDocumentComplete not busy , complete
http://www.fia.com/resources/documents/1797101136__Appendix_L_a.pdf
onDocumentComplete not busy , complete
http://www.fia.com/resources/documents/1797101136__Appendix_L_a.pdf
onDownloadBegin busy , complete
onDownloadComplete not busy , complete

As you say, a whole bunch of completion events.

This reminds me of what TWebBrowser does with frames.
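
To make the "last OnDownloadComplete after OnDocumentComplete" rule concrete, here is a small offline sketch in Python that applies it to the event names from the trace above. The function name final_download_complete is made up, and the rule is only what this one trace suggests, not documented TWebBrowser behaviour:

```python
# Event names copied from the trace above, in the order they fired.
TRACE = [
    "onBeforeNavigate2",
    "onDownloadBegin",
    "onDownloadComplete",
    "onDownloadBegin",
    "onNavigateComplete2",
    "onCommandStateChange",
    "onDownloadComplete",
    "onDocumentComplete",
    "onDocumentComplete",
    "onDownloadBegin",
    "onDownloadComplete",
]


def final_download_complete(events):
    """Return the index of the first onDownloadComplete that follows the
    last onDocumentComplete, or None if there is no such event."""
    try:
        # Search the reversed list to locate the last onDocumentComplete.
        last_doc = len(events) - 1 - events[::-1].index("onDocumentComplete")
    except ValueError:
        return None
    for i in range(last_doc + 1, len(events)):
        if events[i] == "onDownloadComplete":
            return i
    return None
```

For this trace the function picks the very last event (index 10, counting from 0), which is the third onDownloadComplete. In a live handler you cannot know in advance that no further events will follow, so treat this as a way to reason about the log rather than a guaranteed completion test.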



As for using Indy and a direct download, thanks for the idea, but that won't work for what I want.

So that leaves the problem:

b) where the PDF is, so I can examine it. I suppose it is in a cache somewhere, but how can I find it, or establish the correspondence between the original PDF URL and the file name in the cache?

and you suggest that
"IE probably streams it directly to Acrobat Reader and it is not on disk".

I guess that is possible, and I might believe it if I had a good utility app that watched for changes to the disk: some wrapper around FindFirstChangeNotification or something similar.
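
Short of a proper FindFirstChangeNotification wrapper, a crude before/after comparison of a folder would probably answer the same question. A minimal sketch in Python, which polls by taking two snapshots instead of using the notification API; snapshot and diff are made-up helper names, and which folder to watch is left open:

```python
import os


def snapshot(root):
    """Map every file path under `root` to its (size, mtime_ns)."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            state[path] = (st.st_size, st.st_mtime_ns)
    return state


def diff(before, after):
    """Return (created, deleted, modified) as sorted lists of paths."""
    created = sorted(set(after) - set(before))
    deleted = sorted(set(before) - set(after))
    modified = sorted(
        p for p in before.keys() & after.keys() if before[p] != after[p]
    )
    return created, deleted, modified
```

The idea would be to call snapshot on the suspected cache folder before navigating to the PDF, call it again after the completion events have fired, and pass both results to diff. An empty created list would support the "it never touches the disk" theory, at least for that folder.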


Also, I vaguely remember that there is a mechanism for telling IE how to handle certain file types. It is not plugins and not pluggable protocols; the name escapes me. If I knew how that worked, then maybe I would know what IE does with PDFs.


Any ideas?


 
 
Jacco commented:
I have searched my whole C drive and found nothing. I really think the PDF exists only in memory.

Regards Jacco
 
Mutley2003 (author) commented:
Well, when I get some time I will monitor disk changes with FindFirstChangeNotification and let you know what I find out.