Electronic Documents storage and database in SQL server

We have a big project coming on which is about storing physical documents, papers, could be blueprints, typed documents with hand add notations, etc, in electronic format(they will be scanned) and made available trough a database and then application on top of it, web services, etc. The format will be originally in TIFF but from there we will have to come up with a more suitable format like PDF, PNG or something else. We are a Microsoft shop and we use SQL Server 2008 and 2012 and in the end the files will reside on a network location but they will be available through database tables that will point to each document. We are talking about tens of millions of documents.

The question is, actually kind of more, what would be the best way of doing this? What format to use and if the new file stream feature introduced in SQL2008 is recommendable? What are the benefits of file streaming, if any?

I thought that some of the experts may have already experience with similar situations.

Thank you in advance.

LVL 27
Who is Participating?

[Webinar] Streamline your web hosting managementRegister Today

Ryan McCauleyConnect With a Mentor Data and Analytics ManagerCommented:
It's fine on a small scale - even something like a website where images are stored in the database is okay, though I'd probably shy away from even that. Agreed, though, that it's a terrible choice as a large-scale document management system. I'm not clear on the value added by Filestream, actually - even after implementing it for something because it seemed like a good idea at the time that would address the same types of concerns the OP has here, we ran into issues that seemed exactly as if I'd just stored them in tables as an inline VARBINARY(MAX). Maybe there's some special use case that I'm not aware of.

Agreed on the PDF - it might be helpful for consistency's sake, but you won't see any file size savings. Where PDF shines is where it's a container for actual content, like drawings or text - if you're just using it to hold images, you won't see much (if any) savings.
Aaron TomoskyTechnology ConsultantCommented:
I've always used SQL to store the path of the file In a varchar. With tens of millions of records there is no reason to make your db also serve up files.
Jim P.Connect With a Mentor Commented:
Ok, you have the most important part specced out already: NEVER store binary data in a database.

My last company, a bank, used Hyland's SW called OnBase to do it.

The data files were in the 600GB range and the DB in the 80GB range a few years ago. Access times weren't bad. The VAR we worked with was good about getting us the metatag forms quickly along with the OCR for our custom forms.

I'm just commenting from experience of the one company that I worked with.

And just to reinforce the most important thing: You should NOT store binary data in a database. Our CIO built an Access database to just store the IT images and scans in the DB. It took 2 months of steady work to unload all the crap and rebuild his system after he retired.
Free Tool: Port Scanner

Check which ports are open to the outside world. Helps make sure that your firewall rules are working as intended.

One of a set of tools we are providing to everyone as a way of saying thank you for being a part of the community.

Rich WeisslerConnect With a Mentor Professional Troublemaker^h^h^h^h^hshooterCommented:
I have to concur with the other experts who have commented so far.  Even Sharepoint, which stores the binary files in the database by default runs into some technical roadblocks fairly quickly, and there are third party vendors who provide external BLOB storage.  Folks are strongly discouraged from Filestreams, and I've not heard anyone indicate that Filetables are better.

The benefits of Filestreams or internal BLOB storage (simplification and ease of keeping components in sync) are far outweighed by problems in scalability.
Ryan McCauleyData and Analytics ManagerCommented:
I've not seen the benefits of using File Stream when we attempted to roll it out - yes, the files are stored outside the database, but they're still part of the backups and still appear inline in the table when you query it, so there didn't really seem to be any advantage.

That said, I'd put them on a file share on your network and then store the network path. I've seen folder browsing get problematic when there are more than about 10,000 files in a single folder, so you may want to consider some kind of sub-folder situation. If you're never browsing it manually, then don't worry about it, but even something like "26 sub folders, one for each letter, and then use the first character of the filename to put it in the relevant folder" might help. Though if you're talking about 10's of millions of documents anyways, then I don't know that this extra level matters much at all.

I'd shy away from saying NEVER store binary files in the database - I've seen it done without issue a number of times - but the other experts are right that it can lead to scalability concerns pretty quickly, and when other options are available, it should generally be avoided.
Jim P.Commented:
I'd shy away from saying NEVER store binary files in the database - I've seen it done without issue a number of times - but the other experts are right that it can lead to scalability concerns pretty quickly, and when other options are available, it should generally be avoided.

Ok, so I may have overstated it a bit. ;-) I can see it for something like company logos and letterheads. The pictures of just the CxOs for the annual report. The standard instruction PDF for forms, maybe. Doing it for declared document imaging system -- No. Really bad idea.
ZberteocAuthor Commented:
Thank all of you for your responses. I agree that documents should be kept outside the database. I was mainly curios if the new FILESTREAM feature in SQL 2008, which keeps the files in the system files but links them to the table and row,  is worth using. I can't see the immediate benefits in order to switch from folder and file name storing in a varchar column in a table.  

Also a word about the format would be appreciated. Wen it comes to blueprints, such as for buildings, photos plus regular written and/or typed documents, what format would be preferable? We thought of making all PDF after scanned.
Aaron TomoskyConnect With a Mentor Technology ConsultantCommented:
pdf is better than tiff only if it's turned into vector (ocr). If it's just a tiff embedded in a pdf that still helps it display correctly but the filesize stays large.
ZberteocAuthor Commented:
Thank you all for your valuable input. I realize this is a general question with no really wrong answer so I will split the points.
All Courses

From novice to tech pro — start learning today.