Solved

Can anyone suggest a way of converting data from a PDF file into editable text?

Posted on 2016-09-08
Last Modified: 2016-09-12
I have a PDF file containing several invoices for analysis.  The PDF stores each invoice as a picture, rather than as selectable text.
Is there a good way to convert this picture to editable data?

I cannot include my example, as it contains confidential information.

Many thanks
David Phelops
Question by:David Phelops
12 Comments
 
LVL 53

Expert Comment

by:Joe Winograd, EE MVE
ID: 41790445
Hi David,
You need to convert the image-only PDF to a searchable PDF with OCR. If you have Adobe Acrobat (not Reader), it has built-in OCR via Tools>Recognize Text in Version X (10) or Tools>Text Recognition in Version XI (11) or something similar in other versions. If you don't have Acrobat and want a free way to do it, look at this 5-minute EE video Micro Tutorial:
How to OCR pages in a PDF with free software
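
For anyone who prefers a scriptable route, the same image-only to searchable conversion can be sketched with the open-source OCRmyPDF command-line tool. That tool is not mentioned elsewhere in this thread; it is assumed here to be installed separately, and the helper names and file names are hypothetical. This is a minimal sketch, not a tested workflow for these invoices:

```python
import shutil
import subprocess


def build_ocr_command(src, dst, language="eng"):
    """Build an OCRmyPDF command line (assumes the ocrmypdf CLI is installed)."""
    return [
        "ocrmypdf",
        "--language", language,  # OCR language code, e.g. "eng"
        "--deskew",              # straighten skewed scans before OCR
        src,                     # image-only input PDF
        dst,                     # searchable output PDF
    ]


def ocr_pdf(src, dst, language="eng"):
    """Run the OCR command if the tool is available; return True on success."""
    if shutil.which("ocrmypdf") is None:
        return False  # tool not found on PATH, so nothing is run
    result = subprocess.run(build_ocr_command(src, dst, language))
    return result.returncode == 0


# Hypothetical file names, for illustration only
print(build_ocr_command("invoices-scanned.pdf", "invoices-searchable.pdf"))
```

As with the Acrobat route, the result is a PDF that keeps the scanned image and adds a text layer; accuracy still depends on the quality of the scan.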

Regards, Joe
0
 

Author Comment

by:David Phelops
ID: 41790467
Thanks very much -  I will have a look at those options.  Help much appreciated.
0
 
LVL 53

Accepted Solution

by:
Joe Winograd, EE MVE earned 400 total points
ID: 41790483
You're welcome. If you have any problems, I can probably help you through it with either Acrobat X or XI, or the free PDF-XChange Editor demonstrated in the video mentioned above. If you have other OCR software on your computer, that would work, too. For example, here are other articles/videos about two other software packages that can do it (but neither is free):

Batch Conversion of PDF, TIFF, and Other Image Formats via Command Line Interface to PDF, PDF Searchable, and TIFF

PaperPort - How To Create Searchable PDF Files

Convert Scanned Image-Only PDF Files to PDF Searchable Image Files via OCR with Power PDF Advanced

Regards, Joe
1
 

Author Comment

by:David Phelops
ID: 41790494
I have tried it with Acrobat 7, using the command: Document>>Recognize text using OCR, but nothing appears to be output

I have attached a page with the least information, so you can see what I am working with.
Scanned-from-a-Xerox-Multifunction-D.pdf
0
 
LVL 53

Expert Comment

by:Joe Winograd, EE MVE
ID: 41790532
> I have tried it with Acrobat 7, using the command: Document>>Recognize text using OCR, but nothing appears to be output

It doesn't output anything. It puts the text from the OCR process back into the same file along with the scanned image. You should be able to select/copy/paste the text after doing the OCR. However, that is a poor-quality scan, so the OCR is not very accurate. Attached is the searchable PDF that I created from it using Acrobat XI.

Regards, Joe
ocr-via-Acrobat-XI.pdf
1
 
LVL 30

Expert Comment

by:pgm554
ID: 41790609
For $79 you can buy Wondershare PDF Converter Pro, which does a great job of conversion:

https://pdf.wondershare.com/pdf-converter-pro/
0
 
LVL 53

Expert Comment

by:Joe Winograd, EE MVE
ID: 41790705
pgm554,
Could you please post the results from Wondershare? I doubt very much that it can do "a great job of conversion" on that document.

David,
Attached is another OCR result, this one from PaperPort 14.5 (just $25 at Amazon), which uses OmniPage 19 under the covers for its OCR. It is top caliber OCR, but without manual correction, no OCR is going to perform with high accuracy on that scanned document because of its low image quality.

Regards, Joe
ocr-via-PP14-OP19.pdf
1
 

Author Comment

by:David Phelops
ID: 41793538
Hi Joe

Thanks very much for all your help. I downloaded the free PDF-XChange Editor, which worked to an extent, but, as you say, the quality of the original was very poor, so I have worked most of the weekend to try to turn a load of text into analyzable data tables. (I'm not doing that again in a hurry!)

I noticed, even on Acrobat 7, that there is a feature that will copy the data as a table, but even that is not wholly accurate.
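
One way to narrow down which copied rows need manual correction is an arithmetic cross-check. The sketch below assumes each invoice row has already been split into quantity, unit price, and line total as strings; the function names and sample rows are hypothetical, not taken from the actual invoices. It only flags rows for review and does not correct them:

```python
from decimal import Decimal, InvalidOperation


def check_row(quantity, unit_price, line_total):
    """Return True if quantity * unit_price equals line_total exactly."""
    try:
        return Decimal(quantity) * Decimal(unit_price) == Decimal(line_total)
    except InvalidOperation:
        # OCR produced something that is not a number (e.g. "1O.OO")
        return False


def rows_needing_review(rows):
    """Return the indexes of rows that fail the arithmetic check."""
    return [i for i, row in enumerate(rows) if not check_row(*row)]


# Hypothetical rows as (quantity, unit price, line total) strings
rows = [
    ("2", "10.50", "21.00"),
    ("3", "4.25", "12.15"),   # does not match: 3 * 4.25 is 12.75
    ("1", "1O.OO", "10.00"),  # letter O misread for zero
]
print(rows_needing_review(rows))  # prints [1, 2]
```

The exact comparison assumes line totals are not rounded; invoices with tax or rounding would need a tolerance instead.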

Is there any further software that can perform a miracle of data conversion, or is it inevitable that there will be a heavy element of manual correction?

Your help is very much appreciated.  Thanks Joe.

David
0
 
LVL 53

Expert Comment

by:Joe Winograd, EE MVE
ID: 41793615
> but even that is not wholly accurate.

No OCR is wholly accurate, but it can be very good when the source document is high quality.
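
To put a rough number on "how accurate", OCR output for a page can be compared against a manually typed reference of the same page. The sketch below uses Python's standard difflib for this; the similarity ratio is only a rough proxy and not a formal character error rate, and the function name and sample strings are made up for illustration:

```python
import difflib


def similarity(ocr_text, reference_text):
    """Return a similarity ratio from 0.0 to 1.0 (1.0 means identical)."""
    matcher = difflib.SequenceMatcher(
        None, ocr_text, reference_text, autojunk=False
    )
    return matcher.ratio()


# Hypothetical example: typical OCR confusions (I/l, 0/O, l/1)
reference = "Invoice 1001 Total 250.00"
ocr_output = "lnvoice 1OO1 Tota1 25O.OO"
print(round(similarity(ocr_output, reference), 2))
```

A score well below 1.0 on a sample page suggests that retyping may be faster than correcting the OCR output.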

> Is there any further software that can perform a miracle of data conversion

Not on a document of such poor image quality as the one you posted.

> or is it inevitable that there will be a heavy element of manual correction?

Yes, on a document of such poor image quality as the one you posted.

Regards, Joe
1
 
LVL 30

Assisted Solution

by:pgm554
pgm554 earned 100 total points
ID: 41793629
Wondershare convert
ocr-via-PP14-OP19.docx
1
 

Author Comment

by:David Phelops
ID: 41793940
Thank you both very much for your help. Ironically, in the end the conversion was so poor for a lot of the data that it worked out quicker and more accurate to retype the required fields!

My lesson, apart from finding recommended software (always better than trawling through hundreds of programmes), is never to accept documents of such appalling quality to work on.

I ended up working through the night to get the information in on time.

I will investigate and try some of the software you have recommended.

Thanks from a bleary-eyed David
0
 
LVL 53

Expert Comment

by:Joe Winograd, EE MVE
ID: 41794275
> it worked out quicker and more accurate to retype the required fields!

Turns out that is often the case. I worked in the high-end document scanning/imaging/management arena for 20 years (million dollar systems) and most of the companies went with "heads-down" data entry instead of OCR. Unless you have a very clean document with relatively little formatting, manual data entry is often the better way to go — as you have sadly discovered. Get some rest. :)  Regards, Joe
1


Question has a verified solution.
