• Status: Solved

Can anyone suggest a way of converting data from a PDF file into editable text?

I have a PDF file of several invoices for analysis. The PDF stores each invoice as a picture rather than as selectable text.
Is there a good way to convert this picture to editable data?

I cannot include my example, as it contains confidential information.

Many thanks
David Phelops
2 Solutions
 
Joe Winograd, EE MVE 2015 & 2016, Developer, commented:
Hi David,
You need to convert the image-only PDF to a searchable PDF with OCR. If you have Adobe Acrobat (not Reader), it has built-in OCR via Tools>Recognize Text in Version X (10) or Tools>Text Recognition in Version XI (11) or something similar in other versions. If you don't have Acrobat and want a free way to do it, look at this 5-minute EE video Micro Tutorial:
How to OCR pages in a PDF with free software

Regards, Joe
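
For readers who would rather script this step than use a GUI, here is a minimal sketch of the same idea: adding an OCR text layer to an image-only PDF. It assumes the open-source OCRmyPDF command-line tool, which is not mentioned anywhere in this thread, is installed and on the PATH. The helper names `build_ocr_command` and `make_searchable` and the file names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(src, dst, language="eng"):
    """Return the OCRmyPDF command line that adds a text layer to `src`."""
    return [
        "ocrmypdf",
        "--skip-text",   # leave pages that already contain text untouched
        "--deskew",      # straighten crooked scans before recognition
        "-l", language,  # Tesseract language code, e.g. "eng"
        str(src),
        str(dst),
    ]


def make_searchable(src, dst, language="eng"):
    """Run OCRmyPDF if it is installed; raise a clear error otherwise."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    # check=True raises CalledProcessError if the OCR run fails.
    subprocess.run(build_ocr_command(src, dst, language), check=True)
    return Path(dst)
```

Calling `make_searchable("invoices.pdf", "invoices-ocr.pdf")` should write a new PDF that keeps the scanned image and adds recognized text behind it. As with any OCR tool, the accuracy depends on the quality of the scan.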
 
David Phelops, Author, commented:
Thanks very much -  I will have a look at those options.  Help much appreciated.
 
Joe Winograd, EE MVE 2015 & 2016, Developer, commented:
You're welcome. If you have any problems, I can probably help you through it with either Acrobat X or XI, or the free PDF-XChange Editor demonstrated in the video mentioned above. If you have other OCR software on your computer, that would work, too. For example, here are other articles/videos about two other software packages that can do it (but neither is free):

Batch Conversion of PDF, TIFF, and Other Image Formats via Command Line Interface to PDF, PDF Searchable, and TIFF

PaperPort - How To Create Searchable PDF Files

Convert Scanned Image-Only PDF Files to PDF Searchable Image Files via OCR with Power PDF Advanced

Regards, Joe
 
David Phelops, Author, commented:
I have tried it with Acrobat 7, using the command: Document>>Recognize text using OCR, but nothing appears to be output.

I have attached a page with the least information, so you can see what I am working with.
Scanned-from-a-Xerox-Multifunction-D.pdf
 
Joe Winograd, EE MVE 2015 & 2016, Developer, commented:
> I have tried it with Acrobat 7, using the command: Document>>Recognize text using OCR, but nothing appears to be output

It doesn't output anything. It puts the text from the OCR process back into the same file along with the scanned image. You should be able to select/copy/paste the text after doing the OCR. However, that is a poor quality scan, so the OCR is not very accurate. Attached is the searchable PDF that I created from it using Acrobat XI. Regards, Joe
ocr-via-Acrobat-XI.pdf
 
pgm554 commented:
For $79 you can buy Wondershare PDF Converter Pro, which does a great job of conversion.

https://pdf.wondershare.com/pdf-converter-pro/
 
Joe Winograd, EE MVE 2015 & 2016, Developer, commented:
pgm554,
Could you please post the results from Wondershare? I doubt very much that it can do "a great job of conversion" on that document.

David,
Attached is another OCR result, this one from PaperPort 14.5 (just $25 at Amazon), which uses OmniPage 19 under the covers for its OCR. It is top caliber OCR, but without manual correction, no OCR is going to perform with high accuracy on that scanned document because of its low image quality.

Regards, Joe
ocr-via-PP14-OP19.pdf
 
David Phelops, Author, commented:
Hi Joe

Thanks very much for all your help. I downloaded the free PDF-XChange Editor, which worked to an extent, but, as you say, the quality of the original was very poor, so I have spent most of the weekend trying to turn a load of text into analyzable data tables. (I'm not doing that again in a hurry!)

I noticed, even on Acrobat 7, that there is a feature that will copy the data as a table, but even that is not wholly accurate.

Is there any further software that can perform a miracle of data conversion, or is it inevitable that there will be a heavy element of manual correction?

Your help is very much appreciated.  Thanks Joe.

David
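
One way to cut down the manual step is to let a script accept only the lines it can verify and set the rest aside for retyping. The sketch below assumes a purely hypothetical line layout of "description quantity unit-price total" (the real invoices were never posted in full), and `LINE_PATTERN` and `split_ocr_lines` are illustrative names. A line is accepted only if it matches the pattern and quantity times unit price equals the total. Anything else is flagged for a human to check.

```python
import re
from decimal import Decimal

# Hypothetical layout: free-text description, integer quantity,
# unit price and line total, each with two decimal places.
LINE_PATTERN = re.compile(
    r"^(?P<description>.+?)\s+(?P<quantity>\d+)\s+"
    r"(?P<unit_price>\d+\.\d{2})\s+(?P<total>\d+\.\d{2})$"
)


def split_ocr_lines(ocr_text):
    """Split OCR text into verified rows and lines needing manual review."""
    rows, needs_review = [], []
    for line in ocr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = LINE_PATTERN.match(line)
        if match is None:
            needs_review.append(line)  # layout not recognized
            continue
        quantity = int(match["quantity"])
        unit_price = Decimal(match["unit_price"])
        total = Decimal(match["total"])
        if quantity * unit_price != total:
            needs_review.append(line)  # arithmetic fails, digits likely misread
            continue
        rows.append((match["description"], quantity, unit_price, total))
    return rows, needs_review
```

The arithmetic check matters because OCR errors in digits often still look like valid numbers. A misread that happens to keep the arithmetic consistent would still slip through, so this reduces the checking rather than replacing it.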
 
Joe Winograd, EE MVE 2015 & 2016, Developer, commented:
> but even that is not wholly accurate.

No OCR is wholly accurate, but it can be very good when the source document is high quality.

> Is there any further software that can perform a miracle of data conversion

Not on a document of such poor image quality as the one you posted.

> or is it inevitable that there will be a heavy element of manual correction?

Yes, on a document of such poor image quality as the one you posted.

Regards, Joe
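
To put a rough number on "not wholly accurate", one option is to retype a single page by hand and compare it with the OCR output for that page. The sketch below uses Python's standard `difflib` module. The function name `character_accuracy` is hypothetical, and the score is an approximation based on matching blocks, not a formal character error rate.

```python
from difflib import SequenceMatcher


def character_accuracy(ocr_text, reference_text):
    """Return the fraction of reference characters also found, in order, in the OCR text."""
    if not reference_text:
        return 1.0 if not ocr_text else 0.0
    # autojunk=False stops difflib from ignoring frequent characters in long texts.
    matcher = SequenceMatcher(None, reference_text, ocr_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference_text)
```

If the score for a sample page is low, that is a reasonable sign that retyping the required fields will be quicker than correcting the OCR output.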
 
pgm554 commented:
Wondershare convert
ocr-via-PP14-OP19.docx
 
David Phelops, Author, commented:
Thank you both very much for your help. Ironically, in the end the conversion was so poor for a lot of the data that it worked out quicker and more accurate to retype the required fields!

My lesson, apart from finding recommended software - always better than trawling through hundreds of programmes - is never to accept such appalling quality documents to work on.

I ended up working through the night to get the information in on time.

I will investigate and try some of the software you have recommended.

Thanks from a bleary eyed David
 
Joe Winograd, EE MVE 2015 & 2016, Developer, commented:
> it worked out quicker and more accurate to retype the required fields!

Turns out that is often the case. I worked in the high-end document scanning/imaging/management arena for 20 years (million dollar systems) and most of the companies went with "heads-down" data entry instead of OCR. Unless you have a very clean document with relatively little formatting, manual data entry is often the better way to go — as you have sadly discovered. Get some rest. :)  Regards, Joe