Solved

Can anyone suggest a way of converting data from a PDF file into editable text?

Posted on 2016-09-08
Last Modified: 2016-09-12
I have a PDF file of several invoices for analysis.  Each invoice is stored in the PDF as a picture, rather than as selectable text.
Is there a good way to convert this picture to editable data?

I cannot include my example, as it contains confidential information.

Many thanks
David Phelops
Question by:David Phelops
 
LVL 54

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 41790445
Hi David,
You need to convert the image-only PDF to a searchable PDF with OCR. If you have Adobe Acrobat (not Reader), it has built-in OCR via Tools>Recognize Text in Version X (10) or Tools>Text Recognition in Version XI (11) or something similar in other versions. If you don't have Acrobat and want a free way to do it, look at this 5-minute EE video Micro Tutorial:
How to OCR pages in a PDF with free software

Regards, Joe
 

Author Comment

by:David Phelops
ID: 41790467
Thanks very much -  I will have a look at those options.  Help much appreciated.
 
LVL 54

Accepted Solution

by:
Joe Winograd, EE MVE 2015&2016 earned 400 total points
ID: 41790483
You're welcome. If you have any problems, I can probably help you through it with either Acrobat X or XI, or the free PDF-XChange Editor demonstrated in the video mentioned above. If you have other OCR software on your computer, that would work, too. For example, here are other articles/videos about two other software packages that can do it (but neither is free):

Batch Conversion of PDF, TIFF, and Other Image Formats via Command Line Interface to PDF, PDF Searchable, and TIFF

PaperPort - How To Create Searchable PDF Files

Convert Scanned Image-Only PDF Files to PDF Searchable Image Files via OCR with Power PDF Advanced

Regards, Joe

 

Author Comment

by:David Phelops
ID: 41790494
I have tried it with Acrobat 7, using the command: Document>>Recognize text using OCR, but nothing appears to be output

I have attached a page with the least information, so you can see what I am working with.
Scanned-from-a-Xerox-Multifunction-D.pdf
 
LVL 54

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 41790532
> I have tried it with Acrobat 7, using the command: Document>>Recognize text using OCR, but nothing appears to be output

It doesn't output anything. It puts the text from the OCR process back into the same file along with the scanned image. You should be able to select/copy/paste the text after doing the OCR. However, that is a poor quality scan, so the OCR is not very accurate. Attached is the searchable PDF that I created from it using Acrobat XI. Regards, Joe
ocr-via-Acrobat-XI.pdf
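
One way to confirm that the OCR step really did add a text layer to the same file is to try extracting text from it. Below is a minimal Python sketch, assuming the `pdftotext` command-line tool (from Xpdf or Poppler) is installed and on the PATH. The function names and the 20-character threshold are made up for illustration, not taken from any of the products mentioned in this thread.

```python
import shutil
import subprocess


def extract_text(pdf_path):
    """Return the text layer of a PDF via pdftotext, or None if the tool is missing."""
    if shutil.which("pdftotext") is None:
        return None
    # "-" as the output file tells pdftotext to write the text to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def looks_searchable(text, min_chars=20):
    """Heuristic: treat the PDF as searchable if it yields enough letters or digits.

    An image-only PDF usually yields nothing but whitespace and page breaks.
    The threshold is an arbitrary assumption.
    """
    return sum(1 for ch in text if ch.isalnum()) >= min_chars
```

Calling `extract_text` on the file before and after OCR and passing each result to `looks_searchable` should show the difference. It says nothing about how accurate the recognized text is.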
 
LVL 30

Expert Comment

by:pgm554
ID: 41790609
For $79 you can buy Wondershare PDF Converter Pro, which does a great job of conversion.

https://pdf.wondershare.com/pdf-converter-pro/
 
LVL 54

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 41790705
pgm554,
Could you please post the results from Wondershare? I doubt very much that it can do "a great job of conversion" on that document.

David,
Attached is another OCR result, this one from PaperPort 14.5 (just $25 at Amazon), which uses OmniPage 19 under the covers for its OCR. It is top caliber OCR, but without manual correction, no OCR is going to perform with high accuracy on that scanned document because of its low image quality.

Regards, Joe
ocr-via-PP14-OP19.pdf
 

Author Comment

by:David Phelops
ID: 41793538
Hi Joe

Thanks very much for all your help. I downloaded the free PDF-XChange Editor, which worked to an extent, but, as you say, the quality of the original was very poor, so I have worked most of the weekend trying to turn a load of text into analyzable data tables. (I'm not doing that again in a hurry!)

I noticed, even on Acrobat 7, that there is a feature that will copy the data as a table, but even that is not wholly accurate.

Is there any further software that can perform a miracle of data conversion, or is it inevitable that there will be a heavy element of manual correction?

Your help is very much appreciated.  Thanks Joe.

David
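
For the step of turning OCR text into analyzable numbers, part of the manual correction can sometimes be narrowed down by fixing the usual letter-for-digit mix-ups in amount fields and flagging whatever still fails to parse. This is a rough Python sketch only. The substitution table and the `clean_amount` name are assumptions for illustration, and it should only be applied to fields known to be numeric.

```python
import re

# Letters that OCR commonly confuses with digits (an assumed, non-exhaustive list).
DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)


def clean_amount(raw):
    """Return a float for an amount-like token, or None if it still does not parse.

    A None result marks the field for manual checking against the scan.
    """
    # Apply the substitutions, then drop thousands separators.
    candidate = raw.strip().translate(DIGIT_FIXES).replace(",", "")
    if re.fullmatch(r"\d+(\.\d{1,2})?", candidate):
        return float(candidate)
    return None
```

Values that come back as `None` still need a human to look at the original image, so this reduces the retyping rather than removing it.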
 
LVL 54

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 41793615
> but even that is not wholly accurate.

No OCR is wholly accurate, but it can be very good when the source document is high quality.

> Is there any further software that can perform a miracle of data conversion

Not on a document of such poor image quality as the one you posted.

> or is it inevitable that there will be a heavy element of manual correction?

Yes, on a document of such poor image quality as the one you posted.

Regards, Joe
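
To put a rough number on "not wholly accurate", one option is to retype a small sample by hand and compare it with the OCR output for the same lines. The sketch below uses only Python's standard `difflib` module. The `similarity` name and the sample strings are made up for illustration, and the ratio is a similarity score, not a formal OCR error rate.

```python
from difflib import SequenceMatcher


def similarity(ocr_text, reference_text):
    """Return a 0.0 to 1.0 similarity ratio between OCR output and a hand-typed reference."""
    # autojunk=False keeps difflib from ignoring frequent characters in longer texts.
    matcher = SequenceMatcher(None, ocr_text, reference_text, autojunk=False)
    return matcher.ratio()


# Hypothetical sample: two zeros were read as the letter O.
sample_ocr = "Total 1O0.5O"
sample_ref = "Total 100.50"
score = similarity(sample_ocr, sample_ref)
```

A score well below 1.0 on a representative sample suggests that correcting the OCR output may take longer than typing the fields directly.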
 
LVL 30

Assisted Solution

by:pgm554
pgm554 earned 100 total points
ID: 41793629
Wondershare convert
ocr-via-PP14-OP19.docx
 

Author Comment

by:David Phelops
ID: 41793940
Thank you both very much for your help. Ironically, in the end the conversion was so poor for a lot of the data that it worked out quicker and more accurate to retype the required fields!

My lesson, apart from finding recommended software - always better than trawling through hundreds of programmes - is never to accept such appalling quality documents to work on.

I ended up working through the night to get the information in on time.

I will investigate and try some of the software you have recommended.

Thanks from a bleary eyed David
 
LVL 54

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 41794275
> it worked out quicker and more accurate to retype the required fields!

Turns out that is often the case. I worked in the high-end document scanning/imaging/management arena for 20 years (million dollar systems) and most of the companies went with "heads-down" data entry instead of OCR. Unless you have a very clean document with relatively little formatting, manual data entry is often the better way to go — as you have sadly discovered. Get some rest. :)  Regards, Joe
