  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 2336

Create Searchable PDF from tif Image and OCR Text with VB.net

We are running an OCR package that provides a tif image and an OCR text file, which includes all of the data from the document and the relative coordinates of that data.  What I want to do is create a PDF from the image file with this data in the background, which makes the PDF searchable.

I obviously want to do this in an automated fashion, preferably written in VB with some controls, etc.  I have Adobe Professional... are there any controls in that package I can leverage in VB.net to accomplish this?

Tks,
J
jimtxas
1 Solution
 
Karl Heinz Kremer Commented:
Acrobat Professional (or any Adobe package) does not allow you to create these files - at least not with the VB API. You probably can do it within a plug-in.

If you want to do this, you need a pretty good understanding of the PDF spec, because you will create a PDF file from scratch. For this, you need a PDF toolkit or a PDF library that allows you to do this. This will be a major project.

It would be much easier to just use an OCR package that already creates PDF files in the correct format (e.g. ABBYY's FineReader or ScanSoft's OmniPage).
 
jimtxas (Author) Commented:
Can you suggest any toolkits that would accomplish this?  We cannot use a different OCR package.  The system we are using is an extremely powerful data processing/extraction engine with an investment in excess of $1M.  The software outputs 3 components: the data requested for extraction, the tif image, and a full OCR 'map' of all the data OCR'd in the document...
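
Whatever toolkit ends up being used, the coordinates in the OCR 'map' will probably have to be converted into PDF units first. The thread does not say what that map looks like, so the sketch below is only an illustration under stated assumptions: the `WordBox` fields are hypothetical, the coordinates are assumed to be pixels measured from the top-left corner of the scan, and the scan resolution (DPI) is assumed to be known. It is written in Python rather than VB.net purely to keep it short; the arithmetic carries over unchanged. PDF user space uses points (1/72 inch) with the origin at the bottom-left of the page, so the y axis has to be flipped.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    """One OCR'd word. Field names are hypothetical, not the real OCR format."""
    text: str
    left: int    # pixels from the left edge of the scan
    top: int     # pixels from the top edge of the scan
    width: int   # box width in pixels
    height: int  # box height in pixels


def to_pdf_points(box: WordBox, dpi: float, image_height_px: int):
    """Return (x, y, w, h) in PDF points; (x, y) is the box's lower-left corner."""
    scale = 72.0 / dpi  # one PDF point is 1/72 inch
    x = box.left * scale
    # PDF's y axis grows upward from the bottom of the page, so flip it.
    y = (image_height_px - box.top - box.height) * scale
    return (x, y, box.width * scale, box.height * scale)


if __name__ == "__main__":
    # A 300 dpi letter-size scan is 2550 x 3300 pixels.
    word = WordBox("Invoice", left=300, top=150, width=600, height=90)
    print(to_pdf_points(word, dpi=300, image_height_px=3300))
```

If the OCR engine reports coordinates relative to something other than the full page image (for example a cropped zone), an extra offset would be needed before this step.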
 
Karl Heinz Kremer Commented:
I don't have any experience with tools for VB. I would do this either from scratch, without a PDF library (the PDF format is relatively simple to write), or with an Acrobat plug-in (this has to be C or C++), or with either the Adobe PDF Library (expensive - http://partners.adobe.com/public/developer/pdf/library/index.html) or the Appligent sPDF library (which is API-compatible with Adobe's library and the plug-in API - http://www.appligent.com/developers/developers.html).

If you want to create your own PDF creator, you need to read and understand the PDF Reference: http://partners.adobe.com/public/developer/pdf/index_reference.html
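
To give a feel for what "from scratch" involves, here is a minimal sketch of the file structure only: five objects, a cross-reference table, and a trailer, built with nothing but the Python standard library. It is not a complete solution. The scanned tif is left out entirely (in a real file it would be embedded as an image XObject and painted before the text), the function name `build_pdf` and the font resource name `/F1` are my own choices, and the sample content stream uses text rendering mode 3 (`3 Tr`), which the PDF Reference defines as "neither fill nor stroke", i.e. invisible text.

```python
def build_pdf(content: bytes, width_pt: float, height_pt: float) -> bytes:
    """Wrap one page content stream in a minimal single-page PDF."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        ("<< /Type /Page /Parent 2 0 R /MediaBox [0 0 %.2f %.2f] "
         "/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>"
         % (width_pt, height_pt)).encode("ascii"),
        # Courier is one of the standard fonts, so no font file is embedded.
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Courier >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj" for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objects) + 1, xref_pos))
    return bytes(out)


if __name__ == "__main__":
    # "3 Tr" makes the text invisible but still searchable and selectable.
    stream = b"BT /F1 12 Tf 3 Tr 72 720 Td (Hello hidden text) Tj ET"
    pdf = build_pdf(stream, 612, 792)
    print(len(pdf), "bytes")
```

Writing the returned bytes to a file in binary mode should give a one-page document that looks blank but can be searched for "hidden". The fiddly part is the bookkeeping: every offset in the xref table has to match the real byte position of its object, which is why the offsets are recorded while the file is being assembled.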

 
michaelbuddy Commented:
I know you have this searching database technology you want to take advantage of, but I want to offer an alternative suggestion if I may.

This may work for you.

If your images have unique names and other data, you could, in your source page document (PageMaker, InDesign, Quark or whatever), create a text box with the data you need for the image.

Drop that text box behind the image.  Then when you want to do a search based on that data, the text find will take you to the image, because the text box is hidden behind it.

That may be more work than you want to do, but it would let more people search it without special plug-ins.

Or you could just caption the image underneath it.

 
Karl Heinz Kremer Commented:
michaelbuddy, this would allow you to search and find the _page_ the search string is on, but not the exact position. You would see Acrobat highlight one or more areas on your page that have nothing to do with your search string. It's better than not having any search capabilities, but nowhere near what you get with the normal "hidden text" mode that Acrobat supports.
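
For the highlight to land on the right spot, each hidden word has to be placed over its own location in the scan and be roughly as wide as the printed word. The following is a rough sketch of one way to generate such page content, not a description of how Acrobat or any OCR product does it. It assumes the word boxes are already in PDF points with the origin at the bottom-left, that the page's font resources map the name `/F1` (my choice) to Courier, and that the text is plain ASCII. Courier is convenient here because every glyph advances 600/1000 of the font size, so the font size that makes a word span its box can be computed without a font metrics table.

```python
COURIER_ADVANCE = 0.6  # every Courier glyph is 600/1000 of the font size wide


def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def hidden_word_ops(words) -> bytes:
    """Build a content stream of invisible words.

    words: iterable of (text, x, y, width) tuples in PDF points, where
    (x, y) is the lower-left corner of the word's box on the page.
    """
    ops = ["BT", "3 Tr"]  # text rendering mode 3: neither fill nor stroke
    for text, x, y, width in words:
        if not text:
            continue
        # Pick the font size at which the word is exactly `width` points wide.
        size = width / (COURIER_ADVANCE * len(text))
        ops.append("/F1 %.2f Tf" % size)
        ops.append("1 0 0 1 %.2f %.2f Tm" % (x, y))  # absolute position
        ops.append("(%s) Tj" % escape_pdf_string(text))
    ops.append("ET")
    # Assumption: ASCII only; anything else is replaced with "?" in this sketch.
    return "\n".join(ops).encode("ascii", "replace")


if __name__ == "__main__":
    sample = [("Invoice", 72.0, 734.4, 144.0), ("Total (USD)", 72.0, 700.0, 66.0)]
    print(hidden_word_ops(sample).decode("ascii"))
```

Sizing the font from the box width means the hidden glyphs will not match the height of the printed word exactly, but the horizontal extent lines up with the scanned word, which is what matters most for where a search hit is highlighted.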
 
michaelbuddy Commented:
I see.

Have you looked at any of the products from Enfocus?  They might have something that works for you.  I know you can do a lot of PDF diagnostics with them.

Check out http://www.enfocus.com

We use that company for checking our PDFs for print, but I couldn't tell you all the products they have; it's quite a few.
 
Karl Heinz Kremer Commented:
The only SDK that Enfocus distributes is for preflighting PDFs; there is nothing in it to create PDFs.
 
jimtxas (Author) Commented:
khkremer, will you email me directly at jimtx@arn.net?

Tks,
J
 
Karl Heinz Kremer Commented:
The EE membership agreement does not allow any discussion outside of the EE forum, and it also does not allow any email addresses in EE comments. Please continue the discussion in this forum.
 
jimtxas (Author) Commented:
Sorry, I just wanted to make a proposition for some contracted assistance...
 
Karl Heinz Kremer Commented:
I'm not available for any contract work. Sorry.