Reading code/number on page, then saving file using the code/number read

Hi,
I would like to be able to scan about 200 pages of documents and save them as separate files. Each page will carry an identifier, either a QRCode, or simply a number, and while scanning, the code/number should be read on each page and used as the file name.

For instance first document is 3 pages, 2nd is 6 pages, 3rd is 2 pages.

The 11 pages are all put together in the scanner. Each document has a specific code or number.
When the first 3 pages are scanned, the number/code is read on each page, and then (I suppose) when the fourth page is read with a different number/code, the software saves the first 3 pages as a PDF document, using their number/code as the file name. It then moves on to the 2nd document.

Alternatively, only the first page of each document has a code/number, and if there is no code/number on the next pages, it means that they are part of the same document.

This is a new project and I have no knowledge on how to set it up.

Any information on what software to use, how to set it up, where to find out more, and how complicated the setup would be, would be greatly appreciated.

Thank you for your help

aikimark

What is the file format of the documents?

sarabande

> while scanning, the code/number should be read on each page

i assume your scanner currently creates one pdf document containing all the pages.

if i am right, your task is not so easy to perform:

you might think of reading each page separately in a program after the scan. this could be done with the scan software that comes with your scanner or operating system. however, the data you get from the scanner is either pdf or an image format like jpeg or bmp. a pdf can't simply be analyzed programmatically without acrobat or some 3rd party library. raw image data is worse, since the number/code you are looking for is hidden in the image and again you need software tools to retrieve it. since you don't have knowledge of how to do that, i would not recommend going this way.

a probably simpler approach is to let the scanner put all the pages into one pdf file. then, start your program which would
- create one file for each page of the pdf
- for each single-page pdf file
  - try to retrieve the code information
- merge all pdf files with the same code into one pdf file
- delete all temporary files
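
The grouping step in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `page_codes` is a hypothetical list holding the code already read from each page (by whatever OCR or barcode tool is chosen), with `None` for a page where nothing was found, and such a page is treated as belonging to the current document. Splitting and merging the actual pdf pages is left to a pdf tool.

```python
from typing import Optional


def group_pages(page_codes: list[Optional[str]]) -> list[tuple[str, list[int]]]:
    """Group consecutive page indexes into documents keyed by their code.

    A page whose code is None (nothing readable) stays with the current
    document; a page with a new code starts the next document.
    """
    documents: list[tuple[str, list[int]]] = []
    for index, code in enumerate(page_codes):
        if code is not None and (not documents or documents[-1][0] != code):
            documents.append((code, [index]))   # new code: start a new document
        elif documents:
            documents[-1][1].append(index)      # same code or no code: continue
        else:
            raise ValueError("first page has no readable code")
    return documents


# The 3 + 6 + 2 page example from the question; the codes are made up.
codes = ["A100"] * 3 + ["B200"] * 6 + ["C300"] * 2
for code, pages in group_pages(codes):
    print(f"{code}.pdf <- pages {pages}")
```

The same function covers both variants from the question: a code on every page, or a code on the first page only.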

there are free tools which could be used to divide pdf or image files into pages and merge multiple input files into one pdf. perhaps you also could find tools which could extract numbers or qr code from an image or pdf if you know at which part the information is located. but surely it is not a simple job.

Sara

AnneSKS

ASKER

Sara,
Your second option looks like a solution that could work, and that makes sense.
I'll just have to make sure that the scanner can save as pdf, which is pretty standard.

Then I have to find a tool that can:
- divide a pdf into multiple pages
- read either a code or a number
- use this code as the file name when saving
- then merge the files

Do you (or anyone in the community) know any tools that can do that?

Joe Winograd

Hi Anne,
Please take a look at this EE article:
How To Split-Rename-Move a Batch of PDF Files Based on Contents of the Files

I pulled the code last year to make major changes to the program. I had planned to put the code back fairly quickly, but for a variety of reasons, did not, and I'm still not ready to do so. But in the meantime, I have modified it for numerous folks to do the split/rename based on different criteria. For example, one person didn't want any part of the existing file name to be used, but instead wanted an entire string that was found inside the file to be used as the file name. Another person wanted everything after a certain identifier in the file ("Customer Name:", which could be anywhere on the page) to be used as the file name.
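
As a rough illustration of the "everything after an identifier" rule described above (this is not Joe's program, whose code is not shown in this thread), the Python sketch below pulls the text that follows a label out of a page's OCR text and turns it into a safe file name. The function name, the sample text, and the set of characters being replaced are assumptions made for the example.

```python
import re
from typing import Optional


def name_from_text(page_text: str, identifier: str = "Customer Name:") -> Optional[str]:
    """Return a file name built from the text following `identifier`, or None."""
    # Capture the rest of the line after the identifier, wherever it appears.
    match = re.search(re.escape(identifier) + r"[ \t]*(.+)", page_text)
    if match is None:
        return None
    value = match.group(1).strip()
    if not value:
        return None
    # Replace characters that Windows does not allow in file names.
    safe = re.sub(r'[<>:"/\\|?*]', "_", value)
    return safe + ".pdf"


sample = "Invoice 2015\nCustomer Name: Smith & Sons / Perth\nTotal: 120.00"
print(name_from_text(sample))
```

Returning `None` when the label is missing lets the caller decide what to do with a page that has no usable name, for example keeping it with the previous document.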

My program does not currently do OCR, so you would have to use scanning software, such as Acrobat, ABBYY FineReader, OmniPage, PaperPort, PDF-XChange Editor, Power PDF, or many others, that has built-in OCR.

If you're looking for a commercial product that does both OCR and the file renaming, check out File by OCR. I've never used it, because I always use a combination of PaperPort 14.5 or Power PDF and my own program to achieve the same result, but I've heard good things about it from other folks. However, it is not cheap at $1,095.

Regards, Joe

AnneSKS

ASKER

Thank you Joe,
I will have a look at it and get back to you. It seems very promising.
Anne

Joe Winograd

Sounds good, Anne.

wyliecoyoteuk

Depending on the scanner, TIFF is usually the best format for this sort of thing, as most OCR programs can analyse it.
However most professional OCR will also work with Jpegs or even PDF images.
Professional software will divide and file documents using "positional OCR" i.e., you tell the software to look at a particular area on the document, read the contents and dispose of the file accordingly.
INVU, Laserfiche and Autostore are 3 products that can do this.
What type of scanner are you using?
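
To make "positional OCR" a little more concrete: the zone is usually described once, relative to the page, and converted to pixels for each scanned image. The Python sketch below only does that conversion. The `Zone` type and the example numbers are hypothetical, and the resulting box would still need to be passed to whatever OCR or barcode library is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A rectangular area given as fractions (0.0 to 1.0) of the page size."""
    left: float
    top: float
    right: float
    bottom: float

    def to_pixels(self, width: int, height: int) -> tuple[int, int, int, int]:
        """Return (left, top, right, bottom) in pixels for an image of this size."""
        return (
            round(self.left * width),
            round(self.top * height),
            round(self.right * width),
            round(self.bottom * height),
        )


# Assumed layout: the code sits in the top-right corner of every page.
code_zone = Zone(left=0.70, top=0.02, right=0.98, bottom=0.12)

# An A4 page scanned at 300 dpi is roughly 2480 x 3508 pixels.
print(code_zone.to_pixels(2480, 3508))
```

Keeping the zone in fractions rather than pixels means the same definition works when different scanners produce different resolutions.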

AnneSKS

ASKER

I don't know yet. It could be any scanner, as this application will be installed in different locations. But I think I will be able to push for a specific type of scanner if the one in place does not work.

Would you recommend to save the first file as a TIFF document, then use OCR to get the information, before dividing and saving in independent files as PDF?

Thank you for your answer

Joe Winograd

> Would you recommend to save the first file as a TIFF document, then use OCR to get the information, before dividing and saving in independent files as PDF?

Nothing wrong with TIFF, but my opinion on this is, No. There was a time years ago when I would have recommended TIFF as the format of choice for document scanning/imaging, but these days I think PDF is the better choice. Virtually all high quality scanning software can scan directly to PDF, and most can create a PDF searchable image file, which contains both the image and the text from running built-in OCR (all the commercial products mentioned in my previous post can do this). Even if you want to defer OCR until after scanning (so that scanning is faster), you can still scan to an image-only PDF (rather than TIFF) and perform batch mode OCR after scanning, such as the method described in this video:

Convert Scanned Image-Only PDF Files to PDF Searchable Image Files via OCR with Power PDF Advanced

and in this article:

Batch Conversion of PDF, TIFF, and Other Image Formats via Command Line Interface to PDF, PDF Searchable, and TIFF

Regards, Joe

wyliecoyoteuk

Most multi function printers can scan directly to tiff or pdf, as single pages or multipage documents.
We use networked mfps to scan to folders on a server, and the software takes the file from the folder and processes it.
Usually they scan to tiff, process it and save a copy as a searchable pdf.
The reason for using tiff is that it uses lossless compression, is faster to process than a pdf image, and the file size is usually smaller.
Once processed, you keep both the original image, and the OCR result, just in case there is a mistake in the OCR process.
Most professional software will handle either these days.
Many mfds can also have embedded applications installed to enable automated routing and operator control.

wyliecoyoteuk

If you want to scan documents of different lengths, a common method is to place a single preprinted sheet with a special barcode in between each document, so that the software can separate the page sets.
Or you can use on-page barcodes for the same thing.
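
A minimal Python sketch of the separator-sheet idea, assuming the barcode on each page has already been read by some other tool and is available as a string (or `None` when the page has no barcode). The separator value `"DOC-SEP"` and the function name are made up for the example. Separator sheets themselves are dropped from the output.

```python
from typing import Optional

SEPARATOR = "DOC-SEP"  # hypothetical value printed on every separator sheet


def split_on_separators(barcodes: list[Optional[str]]) -> list[list[int]]:
    """Return the page indexes of each document, excluding separator sheets."""
    documents: list[list[int]] = []
    current: list[int] = []
    for index, barcode in enumerate(barcodes):
        if barcode == SEPARATOR:
            if current:               # close the document that just ended
                documents.append(current)
            current = []
        else:
            current.append(index)
    if current:                       # last document has no trailing separator
        documents.append(current)
    return documents


# Separator, 3 pages, separator, 2 pages.
print(split_on_separators([SEPARATOR, None, None, None, SEPARATOR, None, None]))
```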

AnneSKS

ASKER

Hi Wyliecoyoteuk,
That would be the ideal solution, as every document will have a different number of pages.
Now, how do I put a code/number on the first page of the document only, and then save the following pages (without a code/number) together with that first page?
What kind of software do you recommend, and what methodology?
Thank you for your answer

AnneSKS

ASKER

Joe,
I had a look at your previous post, and that could work for me. However, I don't want to ask the user any questions. For all scanning sessions, the file name will have the same format, and the code/number will always be in the same location.
All they will have to do is to select the folder that would have been created by my program when they initiate the process.

Thank you for your help

wyliecoyoteuk

Kyocera Taskalfa MFDs integrate with Autostore, but most manufacturers have a solution of some sort. MFDs can offer scanning speeds up to 50 Pages a minute.
INVU is excellent software, but a lot more expensive. There are a lot of applications available.
The best solution depends on the volume and accuracy required.
I would advise talking to a company that can offer you guidance towards the correct solution.
The preprinted separation sheet solution is the simplest: you just photocopy a load of them and interleave them with the documents. That way the software reads the separation sheet and knows that the document has changed.

Joe Winograd

Anne,
Before we go any further, let's backtrack a bit to understand your requirements better. Some questions:

(1) You say that you have "about 200 pages of documents". Is this a one-time occurrence or will you be processing 200 pages of documents on a periodic basis? If the latter, what is your estimate of the daily (or weekly or monthly) volume?

(2) You talk about the docs having "either a QRCode, or simply a number". Do you have control over which one it will be?

(3) You say that "Alternatively, only the first page of each document has a code/number, and if there is no code/number on the next pages, it means that they are part of the same document." Do you have control over the approach that will be taken?

(4) You say that the app will be installed in different locations and that at each location "it could be any scanner", although you also say that you "will be able to push for a specific type of scanner, if the one in place does not work." If the latter, what is the budget for a scanner at each location? Also, do all locations have the same volume of paper to scan or does it vary?

(5) How many users will be doing the scanning/splitting? Will all locations have the same number of users or will it vary?

(6) Beyond just the scanner budget, what is the budget for the entire project? It doesn't have to be exact, but it's important to know approximately how much money your client/company is willing to spend on hardware, software, and services to implement the solution.

Regards, Joe

AnneSKS

ASKER

Hi Joe,
Thank you for all your questions.
I am in the process of developing a web platform that I am planning to commercialise in the next few months. So I have a pretty good deal of control over the process. My major concern is, I want it to be easy for the users, and very intuitive.

Now to answer your question:
1) That will happen on a regular basis. It could be 200, but it could be 50 or it could be 1000 pages, each document will have a different number of pages. Every time will be different. I think a regular user could be scanning once a week, but it could be more or less. It depends of course on how much they use my system.

2) Yes I will decide if I am using a QR code, a number or some text

3) Complete control of this option as well. As a matter of fact it would be my favourite option, and would solve other problems down the track.

4) Each location could potentially have different quantities to scan. As far as I understand, scanners are not that expensive, unless I have to use a specific type of scanner. If you have any information on the subject, that would be very valuable.

5) Number of users, and scanning pattern can be different at each location.

6) I am still developing this platform, so I don't know the users' budget yet. All I know is that I am planning to sell my product as a quarterly subscription. This section is just one crucial but small module, and I don't want to have to ask too much of the client upfront.

Thank you so much for spending your time helping me figure out this problem

Anne

ASKER CERTIFIED SOLUTION

Joe Winograd

This solution is only available to members.

wyliecoyoteuk

There are 2 main ways to scan.
Pull scanning - controlled by a PC. Requires a driver (TWAIN, ISIS, etc.) and software that is compatible with it.
Push scanning - controlled by the scanner. Destinations are programmed into the scanner; they can be SMB shared folders, FTP folders or email addresses. Needs software that can poll folders or receive email.
There is also a new type of bidirectional scanning called WDS (Web device service), built into Windows. It needs a WDS compatible device, and allows the device to display all available PCs with a WDS connection. Scanning can be controlled by the device or the PCs. Drivers are installed automatically from the device and can be set to store scans and launch an OCR programme etc.
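
For the push-scanning case, the "software that can poll folders" can be quite small. The Python sketch below is one possible shape, using only the standard library. The folder path, the polling interval, and the `handle` callback are placeholders, and a real version would also need to make sure the scanner has finished writing a file before picking it up.

```python
import time
from pathlib import Path
from typing import Callable


def find_new_scans(folder: Path, seen: set[Path]) -> list[Path]:
    """Return PDF/TIFF files in `folder` not seen before, and remember them."""
    new_files = sorted(
        path for path in folder.iterdir()
        if path.suffix.lower() in {".pdf", ".tif", ".tiff"} and path not in seen
    )
    seen.update(new_files)
    return new_files


def poll_forever(folder: Path, handle: Callable[[Path], None], interval: float = 5.0) -> None:
    """Check the scan folder every `interval` seconds and process new files."""
    seen: set[Path] = set()
    while True:
        for path in find_new_scans(folder, seen):
            handle(path)
        time.sleep(interval)


# Example (runs until interrupted); the share path is hypothetical:
# poll_forever(Path(r"\\server\scans"), handle=print)
```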

Joe Winograd

> scanning called WDS (Web device service)

I presume you mean Web Services on Devices (WSD):
https://technet.microsoft.com/en-us/library/dd871131.aspx
https://msdn.microsoft.com/en-us/library/windows/desktop/aa826001%28v=vs.85%29.aspx

CyberLex

You probably want to look at Nuance Autostore.

http://www.nuance.com/for-business/imaging-solutions/autostore/index.htm

wyliecoyoteuk

Yes WSD, although I have seen it called WDS and WPS.
A WSD capable device will auto-configure scanning and printing on Windows Vista and newer, and supports profiles for managing scans in different ways.
WSD devices announce themselves through WS-Discovery, a multicast protocol that works over IPv4 or IPv6.
As I mentioned above, Autostore integrates well with Kyocera MFDs via a HyPas application.
Kyocera also have their own package, Scannervision:
https://www.kyoceradocumentsolutions.co.uk/index/document_solutions/capturedistribution/scannervision.html