Solved

VB6 code to programmatically convert pdf to excel

Posted on 2016-08-30
21
54 Views
Last Modified: 2016-09-02
Looking for VB6 code to programmatically convert PDF to Excel.
0
Comment
Question by:fbroccoli
  • 10
  • 8
  • 3
21 Comments
 
LVL 35

Expert Comment

by:Kimputer
Comment Utility
To convert PDF to Excel, you need EXTREME knowledge of PDF (and Excel for that matter). To have this knowledge, means you can earn money with it. Marketleader seems to be Nuance (for decades).
http://www.nuance.com/for-business/imaging-solutions/document-conversion/power-pdf-converter/how-to/convert-pdf-to-excel/index.htm
If they make money on it, it probably means you will need about 2000 man hours of VB6 coding to finish a successful PDF to Excel conversion. I don't think that's really worth it, and therefore I suggest you just buy the product (it's 100 bucks).
There are many free online and offline solutions, most are ad supported or adware, so try at your own risk.
0
 

Author Comment

by:fbroccoli
Comment Utility
Thank you Kimputer for your prompt response.  I will pursue the product you suggest if I cannot get this to work (below).  I started playing with this after submitting my question:

Public acroavdoc As Acrobat.CAcroAVDoc
Public acroPDDoc As Acrobat.CAcroPDDoc

Set acroavdoc = CreateObject("AcroExch.AVDoc")
acroavdoc.Open("c:\test.pdf", "Accessing PDF's")
Set acroPDDoc = acroavdoc.GetPDDoc()
acropddoc.Save(1,"c:\testconvert.pdf")


The parms for acropddoc.Save are (nType as integer, sFullPath as string)
The above code snippet will save to pdf so I'm thinking/hoping that if I can find the correct nType enumeration for an Excel file type, then it will create an Excel file.  And then, finding all the enumerations, I will also be able to save to MSWord, HTML, XML....etc...

Thank you again for your response,
fbroccoli
0
 
LVL 35

Expert Comment

by:Kimputer
Comment Utility
Acrobat only exposes the most simple routines to the user. The advanced coding, will never be made public (it's how they earn their money).
I've seen workarounds, but that involves the VB6 code opening Acrobat, and using SendKeys. Not a very nice solution.
0
 

Author Comment

by:fbroccoli
Comment Utility
I looked into Power PDF Converter, but it appears it requires user intervention.  For what I need, I can just have the user click on Save As=>Excel.  I do appreciate your help and sorry I said "prompt response", I meant "quick response".  I rarely use Experts Exchange, but have never received a response as quickly as I did yours.  I used to waiting much longer.  Thanks again.
0
 
LVL 35

Expert Comment

by:Kimputer
Comment Utility
Most software like these don't expose API to the public (especially if a lot of settings need to be adjusted in GUI environment). I suggest you try again with Power PDF Converter, IN COMBINATION with AutoHotKey (since the windows/buttons/screens will probably quite predictable, it means you can "click" quickly).
Depending on the speed of the PC, you might not even see the window open/close and flashing
https://autohotkey.com/download/
0
 
LVL 51

Accepted Solution

by:
Joe Winograd, EE MVE earned 500 total points
Comment Utility
> but it appears it requires user intervention

It does if you use its GUI, but not if you use its command line interface (CLI). Here's an EE article that shows how to use the CLI of Power PDF Advanced:
Batch Conversion of PDF, TIFF, and Other Image Formats via Command Line Interface to PDF, PDF Searchable, and TIFF

Following up on kimputer's excellent comments, PDF-to-Excel is tough stuff! I've had excellent results with this free online tool:
http://www.pdftoexcel.org/

It does a good (but not perfect) job of maintaining the formatting, which is always the trick with any PDF-to-Excel (or PDF-to-Word) conversion. Another nice feature of this tool is that it performs OCR if the file is an image-only PDF, thereby automatically creating the text. Of course, you want local (VB6) code, so an online tool won't do it for you, but there is a local install of it, available for purchase and download (not free, but it has a 7-day free trial):
http://www.investintech.com/prod_downloadsa2e.htm

Another local install (not free, but reasonably priced at $27) is Boxoft PDF to Excel:
http://www.boxoft.com/pdf-to-excel/

Yet another local install worth trying is A-PDF to Excel (also not free, but reasonably priced at $39, and there's a free trial):
http://www.a-pdf.com/to-excel/index.htm

Both Boxoft PDF to Excel and A-PDF to Excel require that the PDF have text (not just an image), e.g., if your docs are scanned images, the PDFs need to have been processed with OCR software. You may do that with any OCR tool that you have. Then the A-PDF and Boxoft products will be able to process the text generated by OCR and attempt to create a properly formatted Excel spreadsheet. If you don't have an OCR tool, Boxoft has a free one:
http://www.boxoft.com/free-ocr/

And A-PDF has a reasonably priced one ($27):
http://www.a-pdf.com/ocr/index.htm

And here's a 5-minute EE video Micro Tutorial that shows another free product to perform OCR on PDFs:
How to OCR pages in a PDF with free software

Here's another approach instead of getting two products (one for OCR and one for conversion to Excel). Take a look at Nuance's PaperPort. Nuance, of course, is the same company that makes Power PDF. The standard edition of PaperPort (i.e., non-Professional) can do what you want. Although the list price is $99, the street price is usually less than half that. Right now, the latest version (14) is just $25 at Amazon:
http://www.amazon.com/Nuance-Communications-Inc-6809A-G00-140-Paperport/dp/B005CELKLM/

It has been built-in OCR and a built-in feature to convert PDF docs to Excel. It can even scan directly to an Excel spreadsheet, as explained in these 5-minute EE video Micro Tutorials:
How to create custom scanning profiles in PaperPort - Part 1
How to create custom scanning profiles in PaperPort - Part 2

The Amazon purchase is for PaperPort 14.0, but this EE article explains how to upgrade for free to the latest point release, 14.5:
PaperPort 14 - Free Upgrade to Version 14.5

And this article shows how to apply the latest patch update (Patch 1):
How to install the Patch 1 update for PaperPort 14.5

Of course, if your PDF already has text (i.e., is not just a raster image/bitmap/graphic), then ignore all of my recommendations with respect to OCR.

As a disclaimer, I want to emphasize that I have no affiliation with any of the companies mentioned in this post and no financial interest in them whatsoever. I am simply a happy user/customer. Regards, Joe

Update: When I submitted this post, I see that kimputer commented on AutoHotkey. If you'd like to learn more about that, this EE article should help:
AutoHotkey - Getting Started
0
 

Author Comment

by:fbroccoli
Comment Utility
Thank you Kimputer and Mr. Joe Winograd,

I have parsed many a pdf in the past but these new ones appear to have been scanned (or at least contain no text I can extract).   So, which software, or combination of software, would you suggest I purchase to convert these pdf's to Excel using a CLI with zero end user intervention?  

Thanks again for all your time.
0
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
Comment Utility
If they contain no text, then they're surely image-only PDFs, containing only raster graphics/images, so the method you use will have to perform OCR in order for conversion to Excel to be meaningful.

If you have many files with varying layouts, I don't think you'll get anything to work well — it's going to be error-prone. But if all the files have the same exact layout (only the data values are different), then you may be able to get an automated method to be reasonably effective in creating an Excel spreadsheet. But it's still not going to be 100% accurate, and for two reasons — OCR errors and PDF-to-Excel conversion errors.

So, first question: do all the files have the same exact layout or are the layouts different? Can you post a sample file (being careful not to include private/sensitive info)? Regards, Joe
0
 

Author Comment

by:fbroccoli
Comment Utility
They all have the same layout, however, they do contain both private and sensitive information.
0
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
Comment Utility
Since it's a raster image, it should be easy to redact the private/sensitive info with an imaging tool. I'd like to see the table structure — i.e., what will become the columns and rows of the Excel spreadsheet.
0
IT, Stop Being Called Into Every Meeting

Highfive is so simple that setting up every meeting room takes just minutes and every employee will be able to start or join a call from any room with ease. Never be called into a meeting just to get it started again. This is how video conferencing should work!

 

Author Comment

by:fbroccoli
Comment Utility
Well, I'm not sure if this will be of any help.  With the few tools I have, I captured the screen, blacked out the sensitive info, converted back to pdf.  These two attachments actually appear on one page of the original pdf that I wish I could send to you.
Image01.pdf
Image02.pdf
0
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
Comment Utility
> appear on one page of the original pdf that I wish I could send to you

Attached are the two images on one page. Also, I OCRed the PDF, which was just a raster image, as we suspected. It now has text from the OCR process in addition to the raster image.
Image01-Image02-combined-and-OCR.pdf
0
 

Author Comment

by:fbroccoli
Comment Utility
That is great!  So what/which software do you recommend I purchase to either convert to pdf where I can parse text or convert to Excel using CLI?   Thank you so much!
0
 

Author Comment

by:fbroccoli
Comment Utility
Sorry, I should have included this in my last comment.  I'm kinda leaning toward ocr then conversion to Excel since these documents are apparently being scanned.  Grabbing the text via position may not be as consistent as reading cells from Excel.
0
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
Comment Utility
Yes, they're probably being scanned.

I really don't know what is going to work best for your particular docs. I don't even know if anything will work well. You should experiment with the software I mentioned earlier.

Here's a good first test. OCR the docs with the free software described in my 5-minute EE video Micro Tutorial:
How to OCR pages in a PDF with free software

You won't be able to call this with your VB6 code, but we'll worry about that later. For now, we just need the docs to have text in them via OCR. Then use the A-PDF To Excel product for the conversion (free trial available), which does have a command line interface that you could call from VB6 code. I emphasize that I have no idea how well it will do on your docs. Regards, Joe
0
 

Author Comment

by:fbroccoli
Comment Utility
Understood and thanks Joe.  The OCR'd doc you sent me did not do a great job, however, I did have to screen capture the pdf, black out the sensitive information, save as image and convert back to pdf for the test.  So, I'm not giving up hope.  Worst case, I have the users open the pdf, click Save as, and export to text, or Excel or Word, and then run my program.
0
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
Comment Utility
> Worst case, I have the users open the pdf, click Save as, and export to text, or Excel or Word, and then run my program.

No need for that! If the PDF has text, then you can easily grab it in your VB6 code by calling one of the (free!) command line Xpdf utilities. Here's a 5-minute EE video Micro Tutorial that shows how to download and install them:
Xpdf - Command Line Utility for PDF Files

And here's another one that shows specifically how to extract the text:
Xpdf - Convert PDF Files to Plain Text Files

Before calling this from your VB6 code, experiment with the five different settings for the output mode:

-layout
-lineprinter
-raw
-table
<null/default>

Regards, Joe
0
 

Author Comment

by:fbroccoli
Comment Utility
Joe, I retract my last statement that the ocr did not do a great job.  I'm an idiot.  I forgot that I had to black out the sensitive information in the document I sent you.  My bad.
0
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
Comment Utility
No problem, but I was surprised by the comment because the OCR looked reasonably accurate to me (not perfect, of course). Thanks for the correction. :)
0
 

Author Closing Comment

by:fbroccoli
Comment Utility
Thank you Joe for all your help and time!
0
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
Comment Utility
You're very welcome. Good luck on the project! Regards, Joe
0

Featured Post

How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

Join & Write a Comment

Introduction While answering a recent question (http://www.experts-exchange.com/Q_27402310.html) in the VB classic zone, I wrote some VB code in the (Office) VBA environment, rather than fire up my older PC.  I didn't post completely correct code o…
If you have ever used Microsoft Word then you know that it has a good spell checker and it may have occurred to you that the ability to check spelling might be a nice piece of functionality to add to certain applications of yours. Well the code that…
Get people started with the utilization of class modules. Class modules can be a powerful tool in Microsoft Access. They allow you to create self-contained objects that encapsulate functionality. They can easily hide the complexity of a process from…
This lesson covers basic error handling code in Microsoft Excel using VBA. This is the first lesson in a 3-part series that uses code to loop through an Excel spreadsheet in VBA and then fix errors, taking advantage of error handling code. This l…

763 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

9 Experts available now in Live!

Get 1:1 Help Now