Scanning and reading PDF documents

Hi,
I am looking at scanning large numbers of handwritten multi-page documents into a SQL database, about 200 at a time.
I am planning to scan the documents into a folder and run a job to record the file name and location in a database table. This part is OK.
I am not sure yet whether I will just save the file name and location or use file storage. If you have any comments on which is the better solution, please share.

However, the real question is this: at the top right corner of each document is a number that I want to capture and save with the file record, as the file will be associated with other existing records in the database. The number is also handwritten; it has 5 digits, and each digit is in a separate box.

I am using SQL Server and ASP.NET (if that can help).

Please let me know if you know of any solutions that can help me.

Thank you in advance

Anne

Asked by: AnneSKS

Michael Fowler (Solutions Consultant) commented:
You will need to OCR the document to extract this information, but from what you are describing this will be very problematic. Have a look at these free options to see how well they work for your needs:
http://www.makeuseof.com/tag/3-free-ocr-tools-convert-files-editable-documents/
udaya kumar laligondla (Technical Lead) commented:
"I am planning to scan the documents into a folder and run a job to record the file name & location into a table in a database. This part is ok.
I am not sure yet if I will just save the file name and location or use file storage. If you have any comment on what you think is the best solution, please share"
If you are sure that your files will not be altered, then it is better to store them in the file system. You can find some useful information at:
http://stackoverflow.com/questions/8952/storing-a-file-in-a-database-as-opposed-to-the-file-system
http://stackoverflow.com/questions/13420305/storing-files-in-sql-server
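As a rough illustration of the "file system plus a table of names and locations" option, here is a minimal Python sketch. It uses the standard-library sqlite3 module purely as a stand-in for SQL Server, and the table name ScannedDocument, its columns, and the function name record_scans are hypothetical, not anything from the thread.

```python
import sqlite3
from pathlib import Path


def record_scans(folder, conn):
    """Insert one row per PDF found in `folder`; return how many rows were new."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ScannedDocument ("
        " FileName TEXT NOT NULL,"
        " Location TEXT NOT NULL,"
        " UNIQUE (FileName, Location))"
    )
    count_sql = "SELECT COUNT(*) FROM ScannedDocument"
    before = conn.execute(count_sql).fetchone()[0]
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # INSERT OR IGNORE (SQLite syntax) keeps the job safe to re-run
        # on a folder that has already been processed.
        conn.execute(
            "INSERT OR IGNORE INTO ScannedDocument (FileName, Location)"
            " VALUES (?, ?)",
            (pdf.name, str(pdf.parent.resolve())),
        )
    conn.commit()
    return conn.execute(count_sql).fetchone()[0] - before
```

Only the file name and folder are stored; the PDF itself stays on disk, which matches the advice above as long as the files are not moved or altered afterwards.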

"However the question is. At the top right corner of the document is a number that I want to capture and save with the file record, as the file will be associated with other existing records in the database. The number is handwritten as well, it has 5 digits and each digit is in a separate box."
As the document is handwritten, you will have very limited success in identifying the code automatically, so you will have to do it manually. No OCR will be able to guarantee 100% accuracy.
AnneSKS (Author) commented:
Hi,
Thanks for your answers.
Just to be more precise, I only need to read 5 digits, each one in a separate box, as shown in the sample file below.
I suppose we could even give an example of how the numbers should be written.
Do you still think that the results will be pretty poor?

What would be an alternative option?

Also, I suppose we will have to scan each document twice: once as a PDF linked in the database, and once as an OCR document so we can retrieve the number.
PDF-Upload.PNG
Michael Fowler (Solutions Consultant) commented:
You can OCR PDF documents, so that is not the issue. For this scenario to work, you would need to OCR the entire document and extract the numbers from the boxes in order.

Some issues that could arise are
1/ OCR fails to recognise the numbers due to the formatting with the boxes.
2/ OCR does not recognise all the numbers
3/ OCR misreads the numbers due to the handwriting
4/ Extracting the appropriate text from the document, e.g. what happens if the document is upside down?

These problems become really interesting if you get it working but then start getting inconsistent results.

My instinct says that the problems will be insurmountable. To be sure, though, before giving in and spending days doing it manually, I would try one of the OCR options from my post above and see what happens when you scan a couple of documents.
udaya kumar laligondla (Technical Lead) commented:
Based on your sample, I am sure that none of the existing OCR tools will give you 100% accuracy.

1. Scan the full document for storage.
2. Scan only the number for OCR; running OCR on the full document is not worthwhile and will produce unwanted text.
3. When inserting into the DB, use manual validation to verify that the OCR result is correct.
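The manual validation step can be narrowed down so that a person only looks at doubtful results. The following Python sketch is one possible way to do that, under stated assumptions: raw_text is whatever string an OCR engine returned for the number boxes, known_numbers is a set of 5-digit numbers already present in the database, and the function name check_ocr_number is hypothetical.

```python
import re

FIVE_DIGITS = re.compile(r"\d{5}")


def check_ocr_number(raw_text, known_numbers):
    """Return (number, needs_review) for one OCR result.

    raw_text      -- the string the OCR engine produced for the number boxes
    known_numbers -- set of numbers that already exist in the database
    """
    # Drop whitespace and box-border debris such as "|" or "_" between digits.
    cleaned = re.sub(r"[\s|_\[\]]", "", raw_text)
    if FIVE_DIGITS.fullmatch(cleaned) and cleaned in known_numbers:
        return cleaned, False
    # Anything else goes to a person, together with the raw OCR text.
    return raw_text, True
```

Passing both checks does not prove the reading is right: a misread digit can still produce another number that exists in the database, so spot checks remain worthwhile.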

You can try uploading the images to the top ten search results for "online OCR converter" to test whether existing OCR methods can help.
frankhelk commented:
Hmmm - I haven't used such things myself, but there is OCR software that is capable of reading even very lousy handwriting ... postal service companies use such solutions in letter sorting machines. The addresses on many letters are very badly written, and only a very small number get handed over to humans to decipher.

But there's another approach ... that number problem looks like it's crying out for something similar to the reCaptcha approach. In short, they use the captchas on web sites for text digitizing ... they show two images as a captcha. One is known (for verifying that the web site visitor is not a robot) and the second is the point of the exercise: it's something a human should decode, e.g. a text snippet (usually one word) from some old book.

With a little effort you could set up something similar and let the users of your web site decode the numbers with the most complex neural image decoding device on the planet ;-)
AnneSKS (Author) commented:
Looks like I am heading in the wrong direction.

However, I could use Frank's solution, the reCaptcha approach. I could ask the user to write the numbers once and scan them, then match the previously scanned image with the numbers on the document.

Or I could give the users a sticker with a QR code that they can stick on their document before handing it over.

Then the task is just to scan the document, and it should be easy to read the QR code. Is there software that can read the QR code and save the decoded value as the name of the document?

This way I could match the document with the user record.

Thank you all for your comments.
Michael Fowler (Solutions Consultant) commented:
You may want to have a look at ABBYY. It is not free but it could do what you need. Give the trial version a go.
frankhelk commented:
Hmmm - I mentioned the reCaptcha approach with already existing documents in mind.

If you are scanning documents that have not been written yet, I would really support the barcode approach.

You could hand out barcode stickers (with a number printed underneath the barcode for human needs) or produce pre-barcoded forms with a printer.

There are numerous ready-to-purchase solutions for extracting barcode data out of images ... the search term "scan barcode from images" will squeeze many out of your favorite search engine.

There are even some roll-your-own projects; I've caught a glimpse of one on CodeProject, for example.
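Whichever barcode reader is chosen, the "save the decoded value as the file name" step is small. Here is a minimal Python sketch of it, assuming the decoding itself is done elsewhere: decode is a hypothetical callable (wrapping whatever barcode library or product you pick) that takes a file path and returns the decoded text, or None when nothing could be read. The name rename_by_code is also hypothetical.

```python
from pathlib import Path


def rename_by_code(pdf_path, decode):
    """Rename a scanned file to "<code><ext>" using the code found in it.

    `decode` is a hypothetical callable supplied by the caller: it takes a
    path and returns the barcode text, or None when nothing was readable.
    """
    pdf_path = Path(pdf_path)
    code = decode(pdf_path)
    # Accept only plain letters and digits so the code is safe as a file name.
    if not code or not code.isalnum():
        return None  # leave the file alone so a person can handle it
    target = pdf_path.with_name(f"{code}{pdf_path.suffix}")
    if target.exists():
        return None  # never overwrite an earlier scan with the same code
    pdf_path.rename(target)
    return target
```

Files for which the function returns None keep their original names, so they are easy to pick out for manual handling.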
AnneSKS (Author) commented:
Looks like a QR code or barcode is the way to go.
Michael, I have emailed ABBYY to find out whether they can extract the QR code and use it in the file name.
I'll look at other scan software options.
If you know any, please let me know.
Thank you so much for your comments.
Anne
frankhelk commented:
Just a personal note ... you're not restricted to QR codes. To be honest, a QR code seems like overkill for encoding a five-digit number. I know of a similar application that uses Code39 stickers on documents to be scanned. Each sticker holds a single letter and 7 numeric digits within a space of approx. 3 x 1 cm (including the text representation as a 12 pt string underneath).

On the other hand, I admit that a QR code looks sooo much cooler than Code39 ;-)
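One detail that may matter when printing such stickers: Code 39 defines an optional modulo-43 check character that lets the reader reject many misreads. The application mentioned above is not known to use it, so treat the following Python sketch as an optional extra; the function names are hypothetical, while the 43-character alphabet and its ordering are those of the Code 39 symbology.

```python
# The 43 Code 39 data characters, in checksum-value order (0..42).
CODE39_CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ-. $/+%"


def code39_check_char(data):
    """Return the modulo-43 check character for a Code 39 payload."""
    # str.index raises ValueError for characters outside the Code 39 set.
    total = sum(CODE39_CHARS.index(ch) for ch in data)
    return CODE39_CHARS[total % 43]


def with_check_char(data):
    """Return the payload with its check character appended."""
    return data + code39_check_char(data)
```

For the five-digit number 12345 the digit values sum to 15, which maps to the letter F, so the sticker would encode 12345F.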
AnneSKS (Author) commented:
Thank you for your help, and for helping me find the right option to solve this problem.
I am now in contact with ABBYY to see how we can implement this solution.
I don't know at this stage whether we will go with a QR code or Code39. Either way, they are both good solutions.
Have a great day
Anne
Question has a verified solution.
