Solved

Can anyone suggest a way of converting data from a PDF file into editable text?

Posted on 2016-09-08
Last Modified: 2016-09-12
I have a PDF file of several invoices for analysis.  The PDF holds each invoice as a picture, rather than as selectable text.
Is there a good way to convert this picture to editable data?

I cannot include my example, as it contains confidential information.

Many thanks
David Phelops
Question by:David Phelops
12 Comments
 
LVL 56

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 41790445
Hi David,
You need to convert the image-only PDF to a searchable PDF with OCR. If you have Adobe Acrobat (not Reader), it has built-in OCR via Tools>Recognize Text in Version X (10) or Tools>Text Recognition in Version XI (11) or something similar in other versions. If you don't have Acrobat and want a free way to do it, look at this 5-minute EE video Micro Tutorial:
How to OCR pages in a PDF with free software

Regards, Joe
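
For readers who prefer a command line, the same image-only-to-searchable conversion can be sketched with the open-source OCRmyPDF tool. It is not one of the products discussed in this thread and is used here only as an illustrative substitute. The sketch assumes `ocrmypdf` is installed and on the PATH; the function names and file names are hypothetical.

```python
import shutil
import subprocess


def build_ocr_command(src, dst, language="eng", deskew=True):
    """Build an OCRmyPDF command line that adds a text layer to a scanned PDF."""
    cmd = ["ocrmypdf", "-l", language]
    if deskew:
        cmd.append("--deskew")  # straighten crooked scans before recognition
    cmd += [src, dst]
    return cmd


def ocr_pdf(src, dst, **options):
    """Run OCRmyPDF if it is installed; return False if the tool is missing."""
    if shutil.which("ocrmypdf") is None:
        return False
    subprocess.run(build_ocr_command(src, dst, **options), check=True)
    return True


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the resulting command.
    print(build_ocr_command("invoices.pdf", "invoices-searchable.pdf"))
```

As with Acrobat, the output is a PDF that keeps the scanned image and adds the recognized text behind it, so accuracy still depends on the quality of the scan.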
 

Author Comment

by:David Phelops
ID: 41790467
Thanks very much -  I will have a look at those options.  Help much appreciated.
 
LVL 56

Accepted Solution

by:Joe Winograd, EE MVE 2015&2016 (earned 1600 total points)
ID: 41790483
You're welcome. If you have any problems, I can probably help you through it with either Acrobat X or XI, or the free PDF-XChange Editor demonstrated in the video mentioned above. If you have other OCR software on your computer, that would work, too. For example, here are other articles/videos about two other software packages that can do it (but neither is free):

Batch Conversion of PDF, TIFF, and Other Image Formats via Command Line Interface to PDF, PDF Searchable, and TIFF

PaperPort - How To Create Searchable PDF Files

Convert Scanned Image-Only PDF Files to PDF Searchable Image Files via OCR with Power PDF Advanced

Regards, Joe

 

Author Comment

by:David Phelops
ID: 41790494
I have tried it with Acrobat 7, using the command: Document>>Recognize text using OCR, but nothing appears to be output

I have attached a page with the least information, so you can see what I am working with.
Scanned-from-a-Xerox-Multifunction-D.pdf
 
LVL 56

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 41790532
> I have tried it with Acrobat 7, using the command: Document>>Recognize text using OCR, but nothing appears to be output

It doesn't output anything. It puts the text from the OCR process back into the same file along with the scanned image. You should be able to select/copy/paste the text after doing the OCR. However, that is a poor-quality scan, so the OCR is not very accurate. Attached is the searchable PDF that I created from it using Acrobat XI.

Regards, Joe
ocr-via-Acrobat-XI.pdf
 
LVL 30

Expert Comment

by:pgm554
ID: 41790609
For $79 you can buy Wondershare PDF Converter Pro, which does a great job of conversion:

https://pdf.wondershare.com/pdf-converter-pro/
 
LVL 56

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 41790705
pgm554,
Could you please post the results from Wondershare? I doubt very much that it can do "a great job of conversion" on that document.

David,
Attached is another OCR result, this one from PaperPort 14.5 (just $25 at Amazon), which uses OmniPage 19 under the covers for its OCR. It is top caliber OCR, but without manual correction, no OCR is going to perform with high accuracy on that scanned document because of its low image quality.

Regards, Joe
ocr-via-PP14-OP19.pdf
 

Author Comment

by:David Phelops
ID: 41793538
Hi Joe

Thanks very much for all your help. I downloaded the free PDF-XChange Editor, which worked to an extent, but, as you say, the quality of the original was very poor, so I have spent most of the weekend trying to turn a load of text into analyzable data tables. (I'm not doing that again in a hurry!)

I noticed, even on Acrobat 7, that there is a feature that will copy the data as a table, but even that is not wholly accurate.

Is there any further software that can perform a miracle of data conversion, or is it inevitable that there will be a heavy element of manual correction?

Your help is very much appreciated.  Thanks Joe.

David
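
One way to cut down the manual checking when turning OCR text into tables is to let a script accept only the lines that parse cleanly and add up, and set the rest aside for retyping. This is a minimal sketch under stated assumptions: the line layout (description, quantity, unit price, line total) and the names `ROW` and `parse_rows` are hypothetical, not taken from the invoices in this question.

```python
import re
from decimal import Decimal

# Hypothetical line layout: description, quantity, unit price, line total.
ROW = re.compile(
    r"^(?P<desc>.+?)\s+(?P<qty>\d+)\s+(?P<unit>\d+\.\d{2})\s+(?P<total>\d+\.\d{2})$"
)


def parse_rows(ocr_text):
    """Split OCR text into (good_rows, lines_needing_review)."""
    good, review = [], []
    for line in ocr_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = ROW.match(line)
        if match is None:
            # Misread characters (e.g. letter O for zero) break the pattern.
            review.append(line)
            continue
        qty = int(match["qty"])
        unit = Decimal(match["unit"])
        total = Decimal(match["total"])
        if qty * unit != total:
            # The digits parsed, but the arithmetic does not add up.
            review.append(line)
        else:
            good.append((match["desc"], qty, unit, total))
    return good, review


if __name__ == "__main__":
    sample = "Widget A 3 12.50 37.50\nWidget B 2 1O.00 20.00\nGasket 4 2.25 9.80\n"
    rows, to_check = parse_rows(sample)
    print(rows)
    print(to_check)
```

The arithmetic check only catches errors that break the quantity-times-price relationship; a misread description would still pass, so spot checks remain necessary.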
 
LVL 56

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 41793615
> but even that is not wholly accurate.

No OCR is wholly accurate, but it can be very good when the source document is high quality.

> Is there any further software that can perform a miracle of data conversion

Not on a document of such poor image quality as the one you posted.

> or is it inevitable that there will be a heavy element of manual correction?

Yes, on a document of such poor image quality as the one you posted.

Regards, Joe
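
To put a rough number on "not wholly accurate" before committing to OCR for a whole batch, one option is to retype a single page by hand and compare it with the OCR output for that page. The sketch below uses Python's standard `difflib`; the function name `ocr_similarity` is hypothetical, and any cut-off for "good enough" is a judgment call rather than anything stated in this thread.

```python
from difflib import SequenceMatcher


def ocr_similarity(ocr_text, reference_text):
    """Return a 0..1 similarity ratio between OCR output and a hand-typed reference."""
    # Collapse whitespace so layout differences are not counted as errors.
    a = " ".join(ocr_text.split())
    b = " ".join(reference_text.split())
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


if __name__ == "__main__":
    # Hypothetical strings standing in for one OCR'd page and its retyped version.
    print(ocr_similarity("lnvoice No. 1O0 Tota1 37.50", "Invoice No. 100 Total 37.50"))
```

A low ratio on a sample page suggests that retyping the required fields may be quicker than correcting the OCR output.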
 
LVL 30

Assisted Solution

by:pgm554 (earned 400 total points)
ID: 41793629
Wondershare convert
ocr-via-PP14-OP19.docx
 

Author Comment

by:David Phelops
ID: 41793940
Thank you both very much for your help. Ironically, in the end, the conversion was so poor for a lot of the data that it worked out quicker and more accurate to retype the required fields!

My lesson, apart from finding recommended software - always better than trawling through hundreds of programmes - is never to accept such appalling quality documents to work on.

I ended up working through the night to get the information in on time.

I will investigate and try some of the software you have recommended.

Thanks from a bleary-eyed David
 
LVL 56

Expert Comment

by:Joe Winograd, EE MVE 2015&2016
ID: 41794275
> it worked out quicker and more accurate to retype the required fields!

Turns out that is often the case. I worked in the high-end document scanning/imaging/management arena for 20 years (million dollar systems) and most of the companies went with "heads-down" data entry instead of OCR. Unless you have a very clean document with relatively little formatting, manual data entry is often the better way to go — as you have sadly discovered. Get some rest. :)  Regards, Joe