OCR

547

Solutions

1K

Contributors

Optical character recognition (OCR) is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text. It is widely used as a form of data entry from printed paper data records, including passport documents, invoices, bank statements, computerized receipts, business cards, mail, printouts of static data, or any suitable documentation. It is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as machine translation, text-to-speech, key data extraction and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

I have a very large set of assorted PDF files. They contain searchable text, but the text is filled with errors. If I save a page from the PDF as an image, and use my OCR software on it, I get a much better result. So, I would like to re-OCR all of the files — but first I need to "flatten" all of the text objects in the PDFs, since none of my OCR tools will overwrite any existing text.

I do have Acrobat X and XI Pro, and I've tried using a batch action to strip the text and rerun OCR, but anytime the program encounters an error, it interrupts the process with a dialog box. I searched for a way to prevent this, but there does not appear to be one.

So, the way I see it, I need one of three things:

1. A way to force Acrobat to skip over errors in batch actions and process the remaining files. I could swear you used to be able to do this.

2. A batch OCR tool, free or paid, that will remove and replace all existing text objects.

3. A tool to batch-flatten all text objects in a large set of PDFs (so I can then run them through OCR). I've found software that looks related, but everything seems to either delete text, which I don't want; or else it flattens form fields, images, etc. but does not mention text.

Any of the three of these would solve my problem. I'm open to other suggestions too, of course -- any advice is greatly appreciated!
0
Hi,

Can EE recommend OCR software? We receive files as PDFs: some are editable, and some are scanned images. If we could capture the data into Excel, that would save us a lot of time. Thanks.
0
I have a bunch of scanned (and OCR'ed) PDF and Word files. I need freeware tools that can search them for strings of text (both case-sensitive and case-insensitive) using AND and OR operators.

I would appreciate suggestions for a few free tools.
0
My customer has been using Sharpdesk to do OCR conversion from PDF to Word so they can then edit the Word document. Sharpdesk is kind of ugly, especially since they no longer have Sharp copiers. Is there a simple way to convert PDFs to Word so that the Word document will be editable?

They have tried opening PDFs with Word, but some of the PDFs are pure graphics, so the OCR is important.
0
Hi Experts,

With a document scanning project, what is "Searchable PDF"?

I am using Brother Control Center. I believe that when I scan to PDF, the pages are treated as images, but when I use OCR, are they converted to a simple text format?
0
Need to find the closest match in an array of strings

I have a static list of about 500 strings containing things like:

VS Credit Voucher Proc-CR Trans 2
VS Credit Voucher Proc-OB Prepaid Trans 2

but am reading from OCR and get the strings from the faxed reports looking like:

VS Credit Voucher Proc-CR Trans 2
VS Crect Voucher Proc-OBPrepaid Trar 2

I need to do a lookup for the best match for each string as it appears in the static list.

And of course, there needs to be a threshold where NO MATCH is a possibility.

How shall I store the static list? How can I do a search in the list that is resource efficient?

I would sort that list of 500, clearly. But what are the mechanics of the lookup?

I am writing a C# Win Forms (64 bit) application and could include a database, if I could include that into my EXE, to avoid a distinct installation step.

What search algorithm?
 
Thanks.
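
One possible approach to the lookup described above is approximate string matching with a similarity cutoff. The sketch below is in Python rather than C#, purely to illustrate the idea. It uses the standard library's `difflib.get_close_matches`; the names `KNOWN_STRINGS` and `best_match` and the 0.8 cutoff are assumptions for illustration, not values from the question. For a list of about 500 strings, a plain linear scan like this is usually cheap, so sorting the list is not strictly needed.

```python
import difflib

# Hypothetical static list; in practice this would hold all ~500 strings.
KNOWN_STRINGS = [
    "VS Credit Voucher Proc-CR Trans 2",
    "VS Credit Voucher Proc-OB Prepaid Trans 2",
]


def best_match(ocr_text, candidates=KNOWN_STRINGS, cutoff=0.8):
    """Return the closest candidate, or None when nothing reaches the cutoff."""
    # n=1 keeps only the single most similar candidate above the cutoff.
    matches = difflib.get_close_matches(ocr_text, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

The cutoff would need tuning against real faxed samples: too low and unrelated strings match, too high and noisy reads fall through to NO MATCH. An edit-distance measure such as Levenshtein distance is a common alternative to the ratio used here.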
0
Baby steps with PDFtoText for OCR

What are the first steps for me to take as I create a proof of concept that:

- is a C# WinForms program
- uses the PDFtoText library for OCR

Are there any demo programs I can review? Should I just dive in?

Thanks
0
How to programmatically distinguish a PDF made from an image scan from an original PDF?

I have a folder filled with PDFs, most of which are scanned copies. But I need a way to pull out the original versions.

I do not want to deal with OCR software and need originals.

Is there a tool which can do this parsing to find originals?

Thanks
0
Hi

I have PDF files from a scanner, and I'd like to bulk rename all of them based on OCR data. Can someone point me to software that can bulk rename based on OCR entries?

Regards,

CK
0
We want to develop an inventory application for a client.  They use Surface tablets to take handwritten notes which they later transcribe manually into Excel sheets.  We have an old application in Access which is close to what they want, but what it lacks is OCR.  We'd like for the client to be able to write their notes directly into a field.

From all that I've read and researched so far, it seems that Access 2016 does not have OCR capabilities, nor could I find add-ins which provide it.  The best I've been able to find has either been the OneNote API (writing notes in OneNote, linking to them in Access), or libraries for WPF (which would entail writing the app from scratch).

Has anyone seen this done before?  Any suggestions for accomplishing our goal?
0
As part of a news research project I need to download a series of pages from a site to perform OCR on them.
The site uses PHP and JavaScript, neither of which I am really familiar with. I have tried to download the images in order to OCR them, but every document shows only page 1 and not the following pages.

The page has a button to move between pages, and its code in the browser inspector is:

<a id="pag_seguinte" class="muda_pag botaoLinha setaDireita" href="http://casacomum.org/cc/visualizador.php?pasta=06337.058.13733&pag=2" title="pg. +1" style="visibility: visible;"></a>

 
Can anyone help me navigate between the pages?
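
Since the button's `href` carries the page number in the `pag` query parameter, one option is to generate the URL for each page directly. The sketch below uses Python's standard `urllib.parse`; the helper name `page_url` is hypothetical, and it only builds the viewer URLs. Whether each viewer page exposes a directly downloadable image, and how many pages a given document has, cannot be told from the snippet above and would need to be checked in the page source.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qs, urlencode

# URL taken from the button's href in the question.
BASE_URL = "http://casacomum.org/cc/visualizador.php?pasta=06337.058.13733&pag=2"


def page_url(url, page):
    """Return a copy of url with its 'pag' query parameter set to page."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query["pag"] = [str(page)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


# Viewer URLs for pages 1 to 3 (the page count here is an assumption).
urls = [page_url(BASE_URL, n) for n in range(1, 4)]
```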
0
I need help creating traineddata for Tesseract.  We need to train it for car VINs.  We have it installed (V4) on a Linux server, and I created a set of 12 pictures to work with.
In the end, it's to be used in PHP on a website. Everything is completed, except that it's not accurate enough using only the English traineddata.

The existing documentation is too confusing for us.  

I am looking for more specific instruction or someone we could hire for a few hours to help on this.

Our traineddata is located in /usr/share/tesseract/tessdata/
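
Separately from training, OCR results for VINs can sometimes be filtered after recognition. The sketch below is an assumption about what might help, not part of the Tesseract training process: it verifies the check digit in position 9, which applies to North American VINs, so a misread can be rejected or retried. The function name `vin_check_digit_ok` is hypothetical. It is written in Python for brevity; the same arithmetic ports directly to PHP.

```python
# Letter values used by the VIN check-digit scheme (I, O and Q never appear in a VIN).
TRANSLITERATION = dict(zip("ABCDEFGH", range(1, 9)))
TRANSLITERATION.update(zip("JKLMN", range(1, 6)))
TRANSLITERATION.update({"P": 7, "R": 9})
TRANSLITERATION.update(zip("STUVWXYZ", range(2, 10)))

# Positional weights; position 9 (the check digit itself) has weight 0.
WEIGHTS = [8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2]


def vin_check_digit_ok(vin):
    """Return True if vin has 17 valid characters and a matching check digit."""
    vin = vin.strip().upper()
    if len(vin) != 17:
        return False
    total = 0
    for ch, weight in zip(vin, WEIGHTS):
        if ch in "0123456789":
            value = int(ch)
        elif ch in TRANSLITERATION:
            value = TRANSLITERATION[ch]
        else:
            return False  # I, O, Q or any other character is not allowed
        total += value * weight
    remainder = total % 11
    expected = "X" if remainder == 10 else str(remainder)
    return vin[8] == expected
```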
0
Which is the best OCR engine for accuracy, commercial or open source, in terms of very high quality output?
0
Does anyone have an MS Word/editable version of the CIS hardening guides?
If not, I would appreciate it if someone could OCR them to Word (I have a problem attaching the PDF files here), as a number of the free online converters limit the number of pages that can be converted and boxoft doesn't seem to work well on my PC.
0
We have a multitude of files we need to OCR, some are images, others PDF. I can find tools which can do these one at a time, but need something that could do hundreds if you point the software in the direction of a folder full. I do not trust online converters as the docs may contain sensitive information. Please let me know of anything that may meet our criteria.
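
For local batch processing, one option is to script a command-line OCR engine over the folder so nothing leaves the machine. The sketch below assumes the open-source Tesseract CLI is installed and on the PATH, and uses its basic `tesseract <image> <output base>` form, which writes `<output base>.txt`. The function names and the extension list are assumptions. It handles image files only; PDFs would first need to be rendered to images with a separate tool.

```python
import subprocess
from pathlib import Path

# Assumed set of image types to pick up; adjust to the files actually present.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def find_images(folder):
    """Return all image files under folder (recursively), sorted by path."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def ocr_command(image_path):
    """Build the Tesseract command that writes <image name>.txt next to the image."""
    image_path = Path(image_path)
    return ["tesseract", str(image_path), str(image_path.with_suffix(""))]


def ocr_folder(folder):
    """Run OCR on every image; collect failures instead of stopping the batch."""
    failures = []
    for image in find_images(folder):
        result = subprocess.run(ocr_command(image), capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((image, result.stderr.strip()))
    return failures
```

Calling `ocr_folder` on the top-level folder returns a list of the files that failed along with the error text, so one bad file does not interrupt the rest of the run.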
0
I am considering making a site that can auto-analyze a certain type of uploaded report, and instantly display the results as a PDF. There are various steps involved in the creation of the PDF and I want a feeling for the effort and technology needed for each step.

There are three different steps I will discuss here to see where I can use WordPress plugins and where I need to customize the functionality.

The uploaded report would be a merchant's monthly credit card statement, like the following snippet..

Statement Example
1) So, for the first of three steps, I need a WordPress OCR plug-in. Are there many options for that? Is the angle of the text a problem? I cannot guarantee neatness. (I added the underlining to make it easier for me to read.)

I imagine allowing an authorized user to upload a report, and I need this plug-in to convert images to some form of digital data, like a PDF or a CSV file.

2) I need a way to analyze that data, and wonder if there is a configurable WordPress plugin for this? It will query the items by the Description, then use the numeric values in the Number, Amount and Total columns for mathematical computations. There will be some mathematical steps performed on some of the data as it generates the output for the report.

The results should go into some format, like a CSV file

3) I need a report tool which can import the data results from Step #2 and apply them to various pre-designed fields in the final pre-designed …
0
I need a tool I can use to digitize a report, like the one attached here...

Report
Will this kind of report get a 100% successful conversion rate?

Eventually, I need the tool to be part of my website, but I have not, as yet, chosen my back-end technology. For now, a simple Mac based tool is fine, just so I can hand convert a report that I can start to use in my programming of the back-end.

Windows is okay if the free Mac options are limited.

I do have Office 365 (Mac) if there is a tool in there which I can use.

I am also interested in hearing what "plug-ins" can work when I deploy this to my website, for online OCR conversions.
0
Security/privacy-related question.  Can text be detected in, say, .jpg, .bmp, etc. file formats?  I know text within PDFs can be detected with OCR.  When I say "detected" I mean with the use of a SIEM, DLP or other event-driven software.  I am not referring to steganography or obfuscation of text in any way, just simple text detection in a jpg or bmp format.  Many thanks.
0
I had this question after viewing PDFTK - filling this PDF but got an error.

After talking with the others on this project, we decided it's ok to have the final PDF as read-only and it doesn't need to be editable.

I ran the commands below, and I can populate the PDF but not the checkbox. Joe (if you're reading this), is this because of the LiveCycle issue that the checkboxes don't get checked?  If it is, I got approval to buy LiveCycle. I'll get it and see what's going on.

I tried "No", "On", and "1" for the value, but I don't see the checkbox checked.

1. i-765 is the orig file

2. notsigned.pdf is the file I QPDF-ed to get rid of the password error message

3. Ran this pdftk.exe notsigned.pdf fill_form i-765.txt output OutputFilled.pdf

4. outputfilled.pdf is the populated PDF.
i-765.pdf
notsigned.pdf
OutputFilled.pdf
0
Hi, do you have any shortcut for avoiding special characters in a textbox?
I'm using this manual code:

Public Class Form1
    Private Sub TextBox1_TextChanged(sender As Object, e As EventArgs) Handles TextBox1.TextChanged
        ' Test the typed text (TextBox1.Text); TextBox1.ToString also includes the control's type name.
        If TextBox1.Text.Contains("`") Then
            MsgBox("Text contains invalid character(s)", vbInformation, "Invalid!")
        ElseIf TextBox1.Text.Contains("~") Then
            MsgBox("Text contains invalid character(s)", vbInformation, "Invalid!")
        ElseIf TextBox1.Text.Contains("!") Then
            MsgBox("Text contains invalid character(s)", vbInformation, "Invalid!")
        ElseIf TextBox1.Text.Contains("@") Then
            MsgBox("Text contains invalid character(s)", vbInformation, "Invalid!")
        ElseIf TextBox1.Text.Contains("#") Then
            MsgBox("Text contains invalid character(s)", vbInformation, "Invalid!")
        ElseIf TextBox1.Text.Contains("$") Then
            MsgBox("Text contains invalid character(s)", vbInformation, "Invalid!")
        ElseIf TextBox1.Text.Contains("%") Then
            MsgBox("Text contains invalid character(s)", vbInformation, "Invalid!")
        ElseIf TextBox1.Text.Contains("^") Then
            MsgBox("Text contains invalid character(s)", vbInformation, "Invalid!")
        ElseIf TextBox1.Text.Contains("*") Then
            MsgBox("Text contains invalid character(s)", vbInformation, "Invalid!")
        ElseIf TextBox1.Text.Contains("&") Then
            MsgBox("Text contains invalid character(s)", vbInformation, "Invalid!")
        End If
    End Sub
End Class

0
I'm hoping there's a solution for this problem.

1. I have the FDF that I want to use to populate a PDF. It's attached. It's i-765.txt

2. I have the PDF file. I got it from the INS site. It's attached here and called i-765.pdf

3. I ran this command
   pdftk.exe i-765.pdf fill_form i-765.txt output Output.pdf

but got this error

Error: Failed to open PDF file:
   i-765.pdf
   OWNER PASSWORD REQUIRED, but not given (or incorrect)
Errors encountered.  No output created.
Done.  Input errors, so no output created.


4. I googled the error and came across a solution saying I need to run qpdf to get rid of the error.

    a. I downloaded QPDF.
    b.  It got installed in folder:  C:\Users\bwa\Downloads\qpdf-7.0.0-bin-mingw32\qpdf-7.0.0\bin
    c. I copied i-765.pdf to that folder
    d. Ran this command
qpdf --decrypt i-765.pdf decrypted765.pdf

   e. Now I have decrypted765.pdf. I open it and get this message. I click OK on it, and the PDF is read-only.
        (screenshot: Message I get)
   f. I ran this command to get rid of the message:
pdftk decrypted765.pdf cat output i-765notsigned.pdf

   g. I open i-765notsigned.pdf and there's no error and the fields are editable.
    However,  I noticed this: The functionality of the …
0
I need OCR software that runs on a Mac, with the scanned images coming from my HP LaserJet MFP M127fw.

What options do I have?

Thanks
0
Hello Experts,

I need an OCR (Optical Character Recognition) app to scan business cards and add them directly to the Outlook/Exchange address book.

Thank you very much in advance.
0
I'm using ABBYY OCR (https://www.abbyy.com/en-us/) to read PDF files and parse some fields from them; then, using C# code, I store the data in the database.

This is what I want to do:

I have some PDF files, but they're forms. As it is now, we download the form and fill it in: for example, first name, last name, address, etc. Then we scan it or fax it.

I want to read the entire PDF file and display it in a web page. Along with the image (if it exists).  Then, user fills the form online. For example, read the PDF file using OCR and display the same exact file on a web page, then save the entire form, print, etc.

Is this doable?

Edit: Maybe I should have the same fields the PDF form has on a web page. User fills the fields and somehow I plug those field values into the PDF?

Edit: I've attached a sample PDF form I'd like to process.
i-765.pdf
0
Hello,

What factors determine the quality and accuracy of OCR (optical character recognition) and is there much variability among different OCR software?

If there is variability in software, what applications are best (both free and purchased)?

Thanks
0
