OCR

566

Solutions

1K

Contributors

Optical character recognition (OCR) is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text. It is widely used as a form of data entry from printed paper data records, including passport documents, invoices, bank statements, computerized receipts, business cards, mail, printouts of static data, or any suitable documentation. It is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as machine translation, text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

Hi tesseract OCR experts,

I’ve just installed tesseract on my Raspberry Pi running Linux (Raspbian) and I’m trying to extract text from PNG screenshots taken on my phone.  (I have hundreds of these screenshots, all in the same size & format, taken over the last year using the LeafSpy Lite app, for the Nissan LEAF EV, and I'll be extracting text from all of them.)

The problem I have is, some of the text is not being extracted.

When I run this command:
$ tesseract sample1.png sample1
It produces sample1.txt (attached), which includes plenty of useful figures, but it excludes:
-      “11.84V” near the bottom left (nice to have this voltage figure, but not vital), and
-      “32.0%” at the bottom (I really need this SOC figure).

I tried feeding tesseract a negative (created with IrfanView on Windows) of the image, in case it was a black/white issue, but that gave the same output.
I tried cropping the 11.84V and 32.0% figures out to TIF files (see sample1_voltage.tif & sample1_soc.tif attached, also created with IrfanView on Windows) then running them through tesseract, and that:
-      failed for the 11.84V (see empty sample1_voltage.txt attached), but
-      worked for the 32.0% (see sample1_soc.txt attached).

I know bash and Perl scripting.  I don’t know Python, but it is installed, so it could be used if necessary and if someone else writes the code. It's not my preference, though.
ImageMagick is also installed, in case I need to use it for cropping or whatever.…
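
Since the cropped 32.0% figure was recognized while the full screenshot missed it, one option is to crop each field and OCR it on its own. The following is only a sketch under stated assumptions: the region coordinates in the usage note are hypothetical and would need to be measured on a real screenshot, `ocrRegion` and the helper names are made up for illustration, and the `--psm 7` flag (treat the image as a single text line) is spelled `-psm 7` on older tesseract 3.x builds. Enlarging the crop is a common trick for small text, but it is not guaranteed to rescue the 11.84V field.

```javascript
const { execFileSync } = require('child_process');
const fs = require('fs');

// Arguments for ImageMagick `convert`: crop one region, reset the canvas
// offset with +repage, then enlarge the crop so small text has more pixels.
function cropArgs(input, output, region, scalePercent = 300) {
  const { width, height, x, y } = region;
  return [
    input,
    '-crop', `${width}x${height}+${x}+${y}`,
    '+repage',
    '-resize', `${scalePercent}%`,
    output,
  ];
}

// Arguments for the tesseract CLI. Page segmentation mode 7 tells
// tesseract to treat the image as a single line of text.
function tesseractArgs(image, outputBase) {
  return [image, outputBase, '--psm', '7'];
}

// Hypothetical helper: crop one region, OCR it, and return the text.
// Assumes `convert` and `tesseract` are on the PATH.
function ocrRegion(input, region, name) {
  const cropped = `${name}.png`;
  execFileSync('convert', cropArgs(input, cropped, region));
  execFileSync('tesseract', tesseractArgs(cropped, name));
  // tesseract writes its result to <outputBase>.txt
  return fs.readFileSync(`${name}.txt`, 'utf8').trim();
}
```

A call might look like `ocrRegion('sample1.png', { width: 200, height: 60, x: 400, y: 1800 }, 'sample1_soc')`, with the numbers replaced by the real position of the SOC field. Because all the screenshots share one size and format, the same regions should apply to every file. The two argument builders map directly onto `convert` and `tesseract` command lines, so the same idea carries over to a bash or Perl loop.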
0
I have a computer running Windows 10. The DVD player will not open or play a DVD or CD. I can burn a DVD or CD in the burner and the burn completes, but when I try to play the disc, it does not play. If I put that disc in another computer, it plays fine. Also, if I open the disc on the computer in question, I can see the files, but no matter which file I click on, the disc will not play. Any ideas?
1
Hi
Node.js
Calling the library https://github.com/naptha/tesseract.js#tesseractjs

We call the function worker.recognize(path2png, language) to OCR a PNG inside an async function.

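// Assumes the tesseract.js release that exports TesseractWorker.
const { TesseractWorker } = require('tesseract.js');

async function readPNG(path2png, language) {
  const worker = new TesseractWorker();
  try {
    const result = await worker.recognize(path2png, language);
    return result.text;
  } catch (error) {
    // Only reached if recognize() rejects; returns undefined afterwards.
    console.error("************************** error=", error);
  }
}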

Tesseract crashes, and we would expect the error to land in the catch (error) block, but it does not. Instead, we get the output below and no callback.

contains_unichar_id(unichar_id):Error:Assert failed:in file /src/src/ccutil/unicharset.h, line 502
trap!
trap!
abort("trap!"). Build with -s ASSERTIONS=1 for more info.
abort("trap!"). Build with -s ASSERTIONS=1 for more info.

/home/diego/NetBeansProjects/FromGitHub/tmp/localsearch_triage/node_modules/tesseract.js-core/tesseract-core.js:8
var Module=typeof TesseractCoreWASM!=="undefined"?TesseractCoreWASM:{};var moduleOverrides={};var key;for(key in Module){if(Module.hasOwnProperty(key)){moduleOverrides[key]=Module[key]}}Module["arguments"]=[];Module["thisProgram"]="./this.program";Module["quit"]=(function(status,toThrow){throw toThrow});Module["preRun"]=[];Module["postRun"]=[];var ENVIRONMENT_IS_WEB=false;var ENVIRONMENT_IS_WORKER=false;var ENVIRONMENT_IS_NODE=false;var ENVIRONMENT_IS_SHELL=false;ENVIRONMENT_IS_WEB=typeof window==="object";ENVIRONMENT_IS_WORKER=typeof 
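
One possible explanation, offered as an assumption rather than a diagnosis, is that the abort happens inside the WASM core, so the promise returned by worker.recognize() is never rejected and the try/catch never sees anything. A caller can at least avoid waiting forever by racing the call against a timer. The sketch below does not fix the crash itself; `withTimeout` and `readPNGWithTimeout` are hypothetical helper names, and the worker is passed in so that the pattern does not depend on a particular tesseract.js version.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms
    );
  });
  // Whichever settles first wins; always clear the timer afterwards
  // so it cannot keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Variant of readPNG that returns null instead of hanging when the
// worker never answers. `worker` is any object with a recognize() method
// that returns a promise resolving to an object with a `text` property.
async function readPNGWithTimeout(worker, path2png, language, ms = 60000) {
  try {
    const result = await withTimeout(
      worker.recognize(path2png, language),
      ms,
      'recognize'
    );
    return result.text;
  } catch (error) {
    console.error('OCR failed:', error.message);
    return null;
  }
}
```

After a timeout the worker is probably in an unusable state, so it would make sense to discard it and create a fresh one for the next image. If the crash takes down the whole Node process rather than only the worker, a timeout will not help, and running each recognition in a separate child process would be the safer design.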

0
Is there a scanning or OCR program that can convert signatures (first name, last name) into Excel?  I have a PDF that has first names and last names as signatures, and I want them in Excel so I can sort them.
0
I found some free online OCR converters, but they limit the number of pages that can be converted.
0
I have the attached document in a foreign language. Is there any free online translation service where I could upload the whole document and get back the equivalent document in English?

I guess some sort of OCR is needed; my OCR may not work well with a foreign language.
builData_classificationFrenchy2018.pdf
0
I want to scan a document and use OCR so I can edit it in Word 2016.
0
We have an order management system that is based on VB6, runs in Access 2003. It currently has code (we wrote all of the program) which sends outgoing faxes from the system to a Castell Faxpress box we have. For incoming faxes, we have a home brewed OCR system built which converts all incoming faxes to a copy/pastable PDF.

Any ideas what is out there nowadays that could replace either or both of these? The boxes this stuff runs on need to be replaced, and instead of doing that we want to at least step into the 2010s.
0
Hi,

Can someone suggest a free .NET OCR library, or a paid one at a reasonable price?

best regards
0
At my previous workplace, the Canon IR ADV 4251 offered OCR scanning to PDF, i.e. the resulting soft-copy PDF was text-searchable.

However, at my current workplace, the same model does not offer this feature.

Is this a plug-in, just an OS upgrade, or some sort of add-on we have to buy? Can anyone point me to articles or a manual that mention this?

If it's a software upgrade only, can you point me to the specific version and where to get the software?  If it's just a feature to turn on, I'd appreciate instructions on how to do this.
0
Hi Experts,

Can anyone recommend good scanning software that has OCR capability, maintains the format of the document, works with any scanner with a sheet feeder, and produces a PDF?

I have Brother Control Center, but its OCR produces a simple text file, and I need to preserve the formatting.
Thanks
0
Image Magick C# Library throwing exception

I suspect this is an easy one for you to help me solve.

I am trying to run a Visual Studio project that works on my friend's Windows PC, but it is throwing a path/library exception in Visual Studio Community 2015 on my machine, where Windows is running on my Mac via Parallels.

I verified that the file exists, but then I get the following exception...

Message = "PDFDelegateFailed `The system cannot find the file specified.\r\n' @ error/pdf.c/ReadPDFImage/793"

Exception
and here is the code that throws it:

Code that throws exception
0
Our Helpdesk installed Adobe Acrobat XI Pro and Adobe Forms Central on my laptop when I requested software that could do OCR and convert PDF to an editable MS Word document.

However, they don't know how to use it. Does anyone have a quick guide on how to do PDF to Word conversion (with OCR) with these tools?
0
I have some sample code that uses Tesseract-OCR.
Right now the code just opens an image file and extracts what text it can.

I have a picture box.
I have a text box and a button.
The OCR button loads the image chosen in the open dialog into the picture box and extracts what text it can.

I want to type text in the text box and, if Tesseract can find it in the image, highlight the text it finds on the image in the picture box.
The OCR button does a pretty good job of extracting what text it can.
I will include a sample image I have been working with.
Some sample code would be great.
Thanks for all comments and help.
123.tif
Tesseract-OCR.rar
0
Need a Windows Forms User Interface to enable OCR User Error Checking

I am in the process of writing a Windows Forms application (in C#) that will use an external OCR module to perform OCR on a PDF containing scanned financial documents. So, I expect there to be errors. But, I want to provide the scanned results to the user in a format where the results can be cross-checked by the user.

Clearly, typing over with the corrected value is key.

What is the easiest way to do this in a Windows Forms program?

Various forms will have different values, so I want this as generic as possible.

Shall I just display the whole block of data as a multi-line text input field?

Any other ideas?

Thanks
0
Adobe Acrobat Reader DC (free)

I click on one PDF and it allows me to add/edit text,

but on another PDF I click "edit" and it asks me to sign in (pay) for the paid version.

Could it be the type of PDF, OCR versus non-OCR?

I just want to write text over the lines.  I don't want to use Microsoft Paint with a screenshot.
0
I have a very large set of assorted PDF files. They contain searchable text, but the text is filled with errors. If I save a page from the PDF as an image, and use my OCR software on it, I get a much better result. So, I would like to re-OCR all of the files — but first I need to "flatten" all of the text objects in the PDFs, since none of my OCR tools will overwrite any existing text.

I do have Acrobat X and XI Pro, and I've tried using a batch action to strip the text and rerun OCR, but anytime the program encounters an error, it interrupts the process with a dialog box. I searched for a way to prevent this, but there does not appear to be one.

So, the way I see it, I need one of three things:

1. A way to force Acrobat to skip over errors in batch actions and process the remaining files. I could swear you used to be able to do this.

2. A batch OCR tool, free or paid, that will remove and replace all existing text objects.

3. A tool to batch-flatten all text objects in a large set of PDFs (so I can then run them through OCR). I've found software that looks related, but everything seems to either delete text, which I don't want; or else it flattens form fields, images, etc. but does not mention text.

Any of the three of these would solve my problem. I'm open to other suggestions too, of course -- any advice is greatly appreciated!
0
hi,

Can EE recommend OCR software? We receive files in PDF; some are editable PDFs and some are scanned images. We think that if we could capture the data into Excel, it would save us a lot of time. Thanks
0
I have a bunch of scanned (and OCR'ed) PDF and Word files. I need freeware tools that can search for strings of text (both case-sensitive and case-insensitive) using AND and OR operators.

I'd appreciate a few free tools.
0
My customer has been using Sharpdesk to do OCR conversion from PDF to Word so they can then edit the Word document. Sharpdesk is kind of ugly, especially since they no longer have Sharp copiers.  Is there a simple way to convert PDFs to Word so that the Word document will be editable?

They have tried opening PDFs with Word, but some of the PDFs are pure graphics, so the OCR is important...
0
Hi Experts,

With a document scanning project, what is "Searchable PDF"?

I am using Brother Control Center, and I believe that when I scan to PDF the pages are treated as images, but when I use OCR they are converted to a simple text format?
0
Need to find closest match in array of strings

I have a static list of about 500 strings containing things like:

VS Credit Voucher Proc-CR Trans 2
VS Credit Voucher Proc-OB Prepaid Trans 2

but am reading from OCR and get the strings from the faxed reports looking like:

VS Credit Voucher Proc-CR Trans 2
VS Crect Voucher Proc-OBPrepaid Trar 2

I need to do a lookup for the best match for each as it appears in the static list.

And of course, there needs to be a threshold where NO MATCH is a possibility.

How shall I store the static list? How can I do a search in the list that is resource efficient?

I would sort that list of 500, clearly. But what are the mechanics of the lookup?

I am writing a C# Win Forms (64 bit) application and could include a database, if I could include that into my EXE, to avoid a distinct installation step.

What search algorithm?
 
Thanks.
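
One common approach for a list of this size, sketched here in JavaScript only to illustrate the idea (it ports line for line to C#), is to compute the Levenshtein edit distance between the OCR string and every entry, and to accept the closest entry only if the distance is small relative to the string length. The names `levenshtein` and `bestMatch` and the 0.25 threshold are assumptions for illustration; the threshold in particular would need tuning against real fax output.

```javascript
// Levenshtein edit distance using a single rolling row.
function levenshtein(a, b) {
  const row = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let diag = row[0]; // value from the previous row, previous column
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // value from the previous row, same column
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diag + cost);
      diag = above;
    }
  }
  return row[b.length];
}

// Return the closest candidate, or null when even the best one differs
// by more than maxRatio of the longer string's length (NO MATCH).
function bestMatch(candidates, query, maxRatio = 0.25) {
  let best = null;
  let bestRatio = Infinity;
  for (const candidate of candidates) {
    const longest = Math.max(candidate.length, query.length) || 1;
    const ratio = levenshtein(candidate, query) / longest;
    if (ratio < bestRatio) {
      bestRatio = ratio;
      best = candidate;
    }
  }
  return bestRatio <= maxRatio ? best : null;
}
```

With about 500 short strings, a linear scan like this is a few hundred small distance computations per lookup, so the list can simply live in an in-memory array and no sorting or database is needed. A normalized threshold is used rather than a fixed edit count so that long and short entries are treated comparably.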
0
Baby steps with PDFtoText for OCR

What are the first steps for me to take as I create a proof of concept that will be:

- a C# Winforms program
- uses the PDFtoText library for OCR

Are there any demo programs I can review? Should I just dive in?

Thanks
0
How to programmatically distinguish a PDF made from an image scan from an original PDF?

I have a folder filled with PDFs, most of which are scanned copies. But I need a way to pull out the original versions.

I do not want to deal with OCR software and need originals.

Is there a tool which can do this parsing to find originals?

Thanks
0
Hi

I have PDF files from a scanner and I'd like to bulk rename all the files based on OCR data. Can someone suggest software that can bulk rename based on OCR entries?

Regards,

CK
0
