tel2

tesseract OCR problems scanning images

Hi tesseract OCR experts,

I’ve just installed tesseract on my Raspberry Pi running Linux (Raspbian) and I’m trying to extract text from PNG screen shots taken on my phone.  (I have hundreds of these screen shots, all in the same size & format, taken over the last year using the LeafSpy Lite app, for the Nissan LEAF EV, and I'll be extracting text from all of them.)

The problem I have is, some of the text is not being extracted.

When I run this command:
$ tesseract sample1.png sample1
It produces sample1.txt (attached), which includes plenty of useful figures, but it excludes:
-      “11.84V” near the bottom left (nice to have this voltage figure, but not vital), and
-      “32.0%” at the bottom (I really need this SOC figure).

I tried feeding tesseract a negative (created with IrfanView on Windows) of the image, in case it was a black/white issue, but that gave the same output.
I tried cropping the 11.84V and 32.0% figures out to TIF files (see sample1_voltage.tif & sample1_soc.tif attached, also created with IrfanView on Windows) then running them through tesseract, and that:
-      failed for the 11.84V (see empty sample1_voltage.txt attached), but
-      worked for the 32.0% (see sample1_soc.txt attached).

I know bash and Perl scripting.  I don’t know Python, but Python is installed so it could be used if necessary, if someone else writes the code, but it's not my preference.
ImageMagick is also installed, in case I need to use it for cropping or whatever.

I haven’t found anything useful in the tesseract documentation yet, but if I can get it to look at specific rectangles something like this setRectangle command, then maybe that would be simpler, but I don’t see how to use that from the command line (that link seems to be for the R language).

Any suggestions on how to get the 11.84V and 32.0% figures extracted from files like sample1.png in a fully automated way?

I guess I could crop the 32.0% with ImageMagick, or do a batch crop via IrfanView on Windows.  (I’d prefer to do it all from Linux so it’s all in one place.)  Then I could feed that plus the original file through tesseract and combine the contents of the .txt file outputs.  But cropping doesn’t seem to work for the 11.84V so I’m not sure how to get that.
Any better ideas?
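
A minimal sketch of the crop-then-OCR idea described above, staying in shell with ImageMagick. The two geometries (WIDTHxHEIGHT+X+Y) are made-up placeholders that were not measured from sample1.png, and the out/ directory and the ocr_regions function name are only illustrative.

```shell
#!/bin/sh
# Sketch only: crop two fixed regions, OCR them, and append the results
# to the OCR output of the whole screenshot.
# Both geometries are HYPOTHETICAL placeholders; measure the real ones.
VOLTAGE_GEOM="220x70+30+1480"   # assumed location of the "11.84V" text
SOC_GEOM="420x90+330+1780"      # assumed location of the "SOC= 32.0%" text
OUT_DIR="out"                   # keeps crops away from the source PNGs

ocr_regions() {
    png=$1
    base=$OUT_DIR/$(basename "$png" .png)
    mkdir -p "$OUT_DIR"
    # +repage discards the offset that -crop leaves in the image metadata
    convert "$png" -crop "$VOLTAGE_GEOM" +repage "${base}_voltage.png"
    convert "$png" -crop "$SOC_GEOM" +repage "${base}_soc.png"
    # tesseract appends ".txt" to the output base name itself
    tesseract "$png" "$base"
    tesseract "${base}_voltage.png" "${base}_voltage"
    tesseract "${base}_soc.png" "${base}_soc"
    cat "$base.txt" "${base}_voltage.txt" "${base}_soc.txt" > "${base}_all.txt"
}

# Usage: for f in *.png; do ocr_regions "$f"; done
```

Writing the crops to a separate directory means a later run of the *.png loop will not pick up the cropped files as new screenshots.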

Before anyone puts in a lot of effort with this, please pass your plan by me first, so you don't waste time going down a path that I'm not keen on using.


Here’s what happened when I ran the commands:

$ tesseract sample1.png sample1
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Detected 23 diacritics

$ tesseract sample1_soc.tif sample1_soc
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Page 1

$ tesseract sample1_voltage.tif sample1_voltage
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
Page 1
Empty page!!
Empty page!!



Here’s my version info:
$ tesseract -v
tesseract 3.04.01
 leptonica-1.74.1
  libgif 5.1.4 : libjpeg 6b (libjpeg-turbo 1.5.1) : libpng 1.6.28 : libtiff 4.0.8 : zlib 1.2.8 : libwebp 0.5.2 : libopenjp2 2.1.2

$ uname -a
Linux raspberrypi 4.19.66-v7+ #1253 SMP Thu Aug 15 11:49:46 BST 2019 armv7l GNU/Linux

Thanks.
tel2
sample1.png
sample1.txt
sample1_voltage.tif
sample1_voltage.txt
sample1_soc.tif
sample1_soc.txt
noci

Did you try tesseract v4 (or isn't it available in your distro)?
Tesseract is language sensitive: it also uses a spellchecker to validate what it reads.
(Tesseract 4 has more languages and uses a different recognition method.)
You can still use the v3 engine in tesseract v4 by adding --oem 0 to the command line.

I guess the trouble is that 11.84V doesn't look like a word or a sentence.
The original failure is probably caused by the dot through the S of SOC.

Maybe try to crop oc=11.84V..

You are right the 2nd link you mention is for using Tesseract from R.
Just ran all your images through tesseract v4...

imac> tesseract -v
tesseract 4.0.0
 leptonica-1.77.0
  libjpeg 9c : libpng 1.6.37 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.3 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found SSE

Forced test using old + new OCR engines...

tesseract --oem 0 $in $out

tesseract --oem 1 $in $out

Got the exact same results you're seeing with tesseract v3, with both engines.

Likely next step will be to visit https://github.com/tesseract-ocr/tesseract to open a dev ticket to fix this problem.

Developers will either fix the code or provide some work around.
Hmm. If all your screenshots have exactly the same layout and the text is at the same location (within the same range) in every screenshot, then you could simply crop a rectangle around that region and run tesseract on it.

Knowing the location where text will appear is a very big advantage and allows you to direct the OCR tool at the right section.

With Python you could use the Pillow library to crop the PNG and save the region containing the voltage text, create another crop for the percentage text, and even invert the image in case tesseract doesn't deal with it nicely.

It would be great if you could post .png files for the cropped areas, as most web browsers don't support .tif natively, meaning that some people reading your post have to save the file and open an image viewer by hand.
gelonida's suggestion is good. You could use ImageMagick to extract the region of the text.

Looking closely at your image files, it appears the problem may be the reversed colors.

The image begins with a white background with dark lettering, then switches to a black background with white lettering.

It might be that tesseract has no logic for switching between the two color schemes within one image.

This is why I suggested you open a ticket with tesseract development.

They'll be able to tell you your best course of action.

Another approach might be to modify the actual code producing the image. In other words, if you can remove all black background colors from all images, my guess is tesseract would likely work in this case.
ASKER CERTIFIED SOLUTION
noci

This solution is only available to members of Experts Exchange.
tel2 (ASKER)

Thank you all for your help!

Hi noci,
I haven't tried tesseract v4.  I simply did a "apt install tesseract" and it installed v3.04.01.  Based on David's first post, it looks as if v4 isn't going to give me better results in this case, though.
Thanks for the test you ran and the switches.  I think the "-l eng" is default so I don't need that, and I find that I don't even need the "-psm 8" now that I have the cropping correct, but it helps with some scenarios, thank you.

Hi David,
Thanks for trying my data in v4.  Too bad it didn't help.
And thanks for the link for opening a ticket, which I should do sometime.
Yes, I know I can use ImageMagick to crop areas, which is what I proposed in my original post.  I've tried it now and it works for those 2 numbers that weren't extracted from the original file.
Although, as originally mentioned, tesseract gave me the same results when I fed it a negative of the image, I think you may be right about it not handling the foreground/background swap near the bottom of the image.  After swapping the foreground/background colours for that lower strip (see attached), I ran it through tesseract and got output (see attached) which now includes:
a) The "11.84V".  Strange, because I didn't touch that region, but maybe the change of background was too close to the bottom of that text to allow it to be read in the original.
b) The "SOC= 32.0%" which is now coming out as "($0 C = 3 2 . 0 0/0".  Not so good, but if that's consistently like that, I should be able to get what I need from it with a simple regex.  Interesting how the "%" has become "0/0".  Using ImageMagick to crop regions would be slower but probably more accurate.
See my question below re ImageMagicK.
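
A rough sketch of the "simple regex" clean-up mentioned in (b), assuming the garbling stays consistent (digits separated by spaces and "%" read as "0/0"). The clean_soc name is made up for illustration, and a larger sample of screenshots would be needed to confirm the assumption.

```shell
#!/bin/sh
# Sketch only: recover the SOC percentage from a garbled OCR line such as
#   ($0 C = 3 2 . 0 0/0
# ASSUMES the garbling is consistent across screenshots.
clean_soc() {
    # 1. keep only what follows the "=" sign
    # 2. drop all whitespace
    # 3. turn a trailing "0/0" back into "%"
    # 4. print the first number of the form N.N
    sed -E -e 's/^.*=//' -e 's/[[:space:]]+//g' -e 's,0/0$,%,' |
        grep -Eo '[0-9]+\.[0-9]+' | head -n 1
}

printf '%s\n' '($0 C = 3 2 . 0 0/0' | clean_soc   # prints 32.0
```

The same function leaves a correctly read "SOC= 32.0%" line alone apart from stripping the label and the "%".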

Hi gelonida,
Thanks for your suggestions.

Hi everyone,
If I decide to reverse the colours on the lower strip of the image, what's the best way to do that in ImageMagick?  I know I can reverse the whole image like this:
    convert sample1.png -negate sample1neg.png
but how do I specify a region to reverse, without cropping it off first?
sample1wbg.png
sample1wbg.txt
Voltage CAN be scanned... it needs other options.
see: https://www.experts-exchange.com/questions/29159814/tesseract-OCR-problems-scanning-images.html?anchorAnswerId=42952833#a42952833

tesseract -l eng -psm 8 sample1_voltage.tif sample1_voltage

Just be sure to leave no 'specks'...   they might become apostrophes or commas etc.
tel2 (ASKER)

I know voltage can be scanned, noci.  I understood all your posts.  And as I indicated in my last post, I didn't even need to use those other switches after I cropped it again (using ImageMagick this time).
So, scanning cropped regions is a workable solution, but before I decide which way to go, I also want to explore David's idea of reversing the foreground/background colours of the lower part of the image, so all the text I want is black on white (this includes some text up the top of the image which is already scanning well), then doing a single scan of the whole (modified) image.
The question at the end of my last post is just "If I decide to" go with David's suggestion of making all the text black and background white.
Are you with me?
From your answer I got the impression you missed my last post ...
limiting to a specific language may help for recognizing numbers vs. query broken line (due to a .)

tesseract -psm 11
or
tesseract -psm 12

Either one will scan the original image, and you will get values of "11 84V" and "QOC 32".
There will be quite a lot of noise in the output file.

Also specifying -l eng makes the process slightly faster.
tel2 (ASKER)

Thanks again noci, I might try that later.
Did you see my response to your posts, at the top of this post?
I thought that "-l eng" was the default and that means that tesseract will be limited to recognising English if no "-l" switch is supplied.  Am I mistaken?
Yes, according to the man page it is the default language.
When I ran it with -l eng it seemed faster than when I ran it without, just an observation I had a few times.
I have now used time to measure the actual run times: they are comparable when running each variant 20 times in a row.
time tesseract sample1.png sample_test  (with or without -l eng)
So it was just a coincidence with n = 4 (a bit spaced out in time) instead of 20 in a row.

Running without parameters also provides the data one just needs a trigger and some conversion.
(a bigger sample is needed to see if the data is usable.)

By the way, adding a user-words file containing 11.84v kept the "." in 11.84v.
So you can help tesseract by specifying a --user-words file with valid words and values [the valid values of the numbers presented].

IMHO you're best off using cropped images, which should be easy to do when the text is always in the same place.
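
A small sketch of the --user-words idea, generating one candidate value per line. The 11.00V to 14.99V range is purely an assumption about what a 12 V battery might read, and the function and file names are invented for illustration. How much the word list actually influences recognition would need testing on real screenshots.

```shell
#!/bin/sh
# Sketch only: give tesseract a list of plausible values via --user-words.
# The 11.00V-14.99V range is an ASSUMPTION, not taken from the thread.
make_voltage_words() {
    # one candidate per line: 11.00V, 11.01V, ... 14.99V
    awk 'BEGIN { for (i = 1100; i <= 1499; i++) printf "%.2fV\n", i / 100 }'
}

ocr_with_words() {
    # $1 = input image, $2 = output base name (tesseract adds ".txt")
    make_voltage_words > voltage_words.txt
    tesseract "$1" "$2" --user-words voltage_words.txt
}

# Usage: ocr_with_words sample1.png sample1
```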
tel2 (ASKER)

Thanks again noci,

Yes, I've been thinking the cropping way is probably going to be more accurate, so I'm probably going to go that way, but that doesn't stop me from looking at both options, and seeing what the output looks like from a good amount of test data.

What do you mean by this:
 "Running without parameters also provides the data one just needs a trigger and some conversion."
a) In particular, what parameters?  Do you mean switches/options like "-l eng", "-psm ...", etc?
b) What kind of trigger are you referring to?  Example?
c) Re conversion, are you talking about scanning the entire image without cropping, then converting/massaging the output, or what?
My guess is if you crop the area out of the entire image, this might work with v4, without reversing colors.

After cropping, if you do require reversing colors, this is fairly straightforward in ImageMagick.
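
For the earlier question about reversing only the lower strip without cropping it off, one option in ImageMagick 6 is the -region setting, which limits the operators that follow it to a rectangle. This is a sketch only: the strip geometry is a made-up placeholder, and newer ImageMagick 7 releases handle -region somewhat differently, so check the result on one image first.

```shell
#!/bin/sh
# Sketch only: invert colours in one strip of an image without cropping it.
# STRIP_GEOM is a HYPOTHETICAL WIDTHxHEIGHT+X+Y for the black strip.
STRIP_GEOM="1080x200+0+1720"

negate_strip() {
    # $1 = input image, $2 = output image
    # -region restricts the following -negate to the given rectangle
    convert "$1" -region "$STRIP_GEOM" -negate "$2"
}

# Usage: negate_strip sample1.png sample1wbg.png
```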

An even better solution, if possible, is to just reverse the actual color of the code producing the initial image.
Yes I meant without extra options... (although --psm 11 does give more data...)
The trigger to pick up 11.84V is the line "/max = ... (17 mV)".
("Anchor" might be a better word.)
....
/max = 3.628 3.638 3.645 (17 mV)

11.84V

min/avg

emp C= 13.7 13.3 13.2 (0.5°

I have seen 11 84V also....
So that is what i meant with conversion...

By preselecting the crops from the image there is far less confusion.
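
A sketch of the anchor approach on whole-image output: find the "/max =" line, then take the first non-blank line after it and repair a missing decimal point. It assumes the voltage always directly follows that anchor line, as in the sample output quoted in this post, and the function name is made up.

```shell
#!/bin/sh
# Sketch only: use the "/max = ..." line as an anchor and print the first
# non-blank line after it, normalising "11 84V" to "11.84V".
# ASSUMES the voltage is always the next non-blank line after the anchor.
voltage_after_anchor() {
    awk '
        found && NF {
            line = $0
            sub(/ /, ".", line)            # "11 84V" becomes "11.84V"
            if (line ~ /^[0-9]+\.[0-9]+[Vv]$/) print line
            exit                           # only inspect that one line
        }
        /\/max =/ { found = 1 }            # anchor line seen
    '
}

# Usage: voltage_after_anchor < sample1.txt
```

If the line after the anchor does not look like a voltage, nothing is printed, which makes bad scans easy to spot in a batch run.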
tel2 (ASKER)

Hi David,

> "My guess is if you crop the area out of the entire image, this might work with v4, without reversing colors."
Cropped areas are already working in v3 without reversing colours, if you see my comments to noci.
My question about reversing colours was just for that lower region and just for if I want to scan the entire image in one go, as opposed to cropped parts.  As you suggested, it's the transition in text/background which seems to cause problems, not the fact that we're trying to scan white text on a black background.

> "An even better solution, if possible, is to just reverse the actual color of the code producing the initial image."
Are you suggesting I modify the LeafSpy Lite Android app that I mentioned in my original post?  I don't think I could do that.
tel2 (ASKER)

Thanks for your clarification, noci.  Understood now.

If I were to try to find the "11.84V" text, I'd probably use a regex which finds:
newline + 1 or 2 digits + space or period + 2 digits + "V".
I know how to code that regex, so no problem there.
But I'm probably going to end up going with the cropping method, as you suggested.
The problem was that I have seen (raw) scans with 11 84V,
so the regex might need to be something like  [0-9]{2}[. ][0-9]{2}[Vv]
which might be a tad too generic.
Or can there be a 9.99V?  In that case:  [0-9]{1,2}[. ][0-9]{2}[Vv]
which would probably be too generic.
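
One way to make that pattern less generic, sketched here with grep, is to require it to match a whole line (the -x option), since the voltage sat on a line of its own in the raw output shown earlier. That is an assumption about the other screenshots too, and the find_voltage name is invented.

```shell
#!/bin/sh
# Sketch only: accept the voltage only when it is alone on its line.
# ASSUMES the OCR output always puts the voltage on a line by itself.
find_voltage() {
    # -E extended regex, -x whole-line match; tr repairs "11 84V"
    grep -Ex '[0-9]{1,2}[. ][0-9]{2}[Vv]' "$@" | head -n 1 | tr ' ' '.'
}

# Usage: find_voltage sample1.txt
```

Lines such as "/max = 3.628 3.638 3.645 (17 mV)" contain digit groups that fit the bare pattern but are rejected by the whole-line requirement.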

OCR (even with the best tools) remains a rather messy business of educated guesses.
The fewer options, the better the results.
If you can (that is, if the layout is fixed), then cropping and selecting the areas of interest is always the safest solution, with the lowest probability of errors or badly recognized text.
tel2 (ASKER)

Thanks for all your answers!
I'll see how I go with cropping each section.
You're welcome!

Good luck!

When you have a working solution, add another comment to this question... to help someone else in the future.
tel2 (ASKER)

Yes David, I was thinking about posting my code here when I've "finished" it, which I might end up doing.  And then others can edit it and post updates.  We could call this: EE-Forge (the EE version of SourceForge).

Give me a few months.  8)

tel2