Solved

Can anyone suggest a way of converting data from a PDF file into editable text?

Posted on 2016-09-08
Last Modified: 2016-09-12
I have a PDF file of several invoices for analysis. The PDF holds each invoice as a picture, rather than as selectable text.
Is there a good way to convert this picture to editable data?

I cannot include my example, as it contains confidential information.

Many thanks
David Phelops
Question by:David Phelops
12 Comments
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
ID: 41790445
Hi David,
You need to convert the image-only PDF to a searchable PDF with OCR. If you have Adobe Acrobat (not Reader), it has built-in OCR via Tools>Recognize Text in Version X (10) or Tools>Text Recognition in Version XI (11) or something similar in other versions. If you don't have Acrobat and want a free way to do it, look at this 5-minute EE video Micro Tutorial:
How to OCR pages in a PDF with free software

Regards, Joe
 

Author Comment

by:David Phelops
ID: 41790467
Thanks very much -  I will have a look at those options.  Help much appreciated.
 
LVL 51

Accepted Solution

by:
Joe Winograd, EE MVE earned 400 total points
ID: 41790483
You're welcome. If you have any problems, I can probably help you through it with either Acrobat X or XI, or the free PDF-XChange Editor demonstrated in the video mentioned above. If you have other OCR software on your computer, that would work, too. For example, here are other articles/videos about two other software packages that can do it (but neither is free):

Batch Conversion of PDF, TIFF, and Other Image Formats via Command Line Interface to PDF, PDF Searchable, and TIFF

PaperPort - How To Create Searchable PDF Files

Convert Scanned Image-Only PDF Files to PDF Searchable Image Files via OCR with Power PDF Advanced

Regards, Joe
 

Author Comment

by:David Phelops
ID: 41790494
I have tried it with Acrobat 7, using the command Document>>Recognize text using OCR, but nothing appears to be output.

I have attached a page with the least information, so you can see what I am working with.
Scanned-from-a-Xerox-Multifunction-D.pdf
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
ID: 41790532
> I have tried it with Acrobat 7, using the command: Document>>Recognize text using OCR, but nothing appears to be output

It doesn't output anything. It puts the text from the OCR process back into the same file along with the scanned image. You should be able to select/copy/paste the text after doing the OCR. However, that is a poor quality scan, so the OCR is not very accurate. Attached is the searchable PDF that I created from it using Acrobat XI. Regards, Joe
ocr-via-Acrobat-XI.pdf
 
LVL 30

Expert Comment

by:pgm554
ID: 41790609
For $79 you can buy Wondershare PDF Converter Pro, which does a great job of conversion.

https://pdf.wondershare.com/pdf-converter-pro/
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
ID: 41790705
pgm554,
Could you please post the results from Wondershare? I doubt very much that it can do "a great job of conversion" on that document.

David,
Attached is another OCR result, this one from PaperPort 14.5 (just $25 at Amazon), which uses OmniPage 19 under the covers for its OCR. It is top caliber OCR, but without manual correction, no OCR is going to perform with high accuracy on that scanned document because of its low image quality.

Regards, Joe
ocr-via-PP14-OP19.pdf
 

Author Comment

by:David Phelops
ID: 41793538
Hi Joe

Thanks very much for all your help. I downloaded the free PDF-XChange Editor, which worked to an extent, but, as you say, the quality of the original was very poor, so I have worked most of the weekend to try and turn a load of text into analyzable data tables. (I'm not doing that again in a hurry!)

I noticed, even on Acrobat 7, that there is a feature that will copy the data as a table, but even that is not wholly accurate.

Is there any further software that can perform a miracle of data conversion, or is it inevitable that there will be a heavy element of manual correction?

Your help is very much appreciated.  Thanks Joe.

David
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
ID: 41793615
> but even that is not wholly accurate.

No OCR is wholly accurate, but it can be very good when the source document is high quality.
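
If you want to put a number on "very good" for a given scan, one rough approach is to retype a small sample by hand and compare it with the OCR output. The sketch below is only an illustration, not something any of the products mentioned here provide: the function name `character_accuracy` and the sample strings are made up, and it uses Python's standard `difflib` module as a simple stand-in for a proper character error rate.

```python
from difflib import SequenceMatcher


def character_accuracy(ocr_text: str, reference_text: str) -> float:
    """Return the share of reference characters matched by the OCR text (0.0 to 1.0).

    This is a rough measure based on difflib's matching blocks,
    not a formal edit-distance character error rate.
    """
    if not reference_text:
        # Nothing to compare against: treat empty-vs-empty as a perfect match.
        return 1.0 if not ocr_text else 0.0
    matcher = SequenceMatcher(None, reference_text, ocr_text, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference_text)


if __name__ == "__main__":
    # Hypothetical sample: a hand-typed field and what the OCR returned for it.
    typed_by_hand = "Invoice 1024"
    from_ocr = "Invoice 1O24"  # letter O misread in place of the digit 0
    score = character_accuracy(from_ocr, typed_by_hand)
    print(f"Character accuracy: {score:.1%}")
```

On invoices, even a small shortfall matters, because a single misread digit in an amount or invoice number makes the whole field wrong. Checking a sample like this before committing to a full conversion can help decide whether correcting the OCR output is likely to be quicker than retyping.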

> Is there any further software that can perform a miracle of data conversion

Not on a document of such poor image quality as the one you posted.

> or is it inevitable that there will be a heavy element of manual correction?

Yes, on a document of such poor image quality as the one you posted.

Regards, Joe
 
LVL 30

Assisted Solution

by:pgm554
pgm554 earned 100 total points
ID: 41793629
Wondershare convert
ocr-via-PP14-OP19.docx
 

Author Comment

by:David Phelops
ID: 41793940
Thank you both very much for your help. Ironically, in the end the conversion was so poor for a lot of the data that it worked out quicker and more accurate to retype the required fields!

My lesson, apart from finding recommended software (always better than trawling through hundreds of programmes), is never to accept such appalling quality documents to work on.

I ended up working through the night to get the information in on time.

I will investigate and try some of the software you have recommended.

Thanks from a bleary eyed David
 
LVL 51

Expert Comment

by:Joe Winograd, EE MVE
ID: 41794275
> it worked out quicker and more accurate to retype the required fields!

Turns out that is often the case. I worked in the high-end document scanning/imaging/management arena for 20 years (million dollar systems) and most of the companies went with "heads-down" data entry instead of OCR. Unless you have a very clean document with relatively little formatting, manual data entry is often the better way to go — as you have sadly discovered. Get some rest. :)  Regards, Joe