Best approach for document management system

Posted on 2011-10-20
Last Modified: 2012-05-12
We are developing a web-based application (3.5 with SQL Server 2005) in which we need to store and search inside documents uploaded by users.
We have the following flow, which is working fine:
The user uploads a document (doc / xls / pdf, etc.).
It is stored in a particular folder on the server.
Our code reads it, extracts all the text inside it, and stores that text in a database field.
Later, whenever a user searches for a string, we look for that string in the database field and display the search results. (For example, some documents are ~5 MB in size and contain 450+ pages of text, and all of it gets extracted and stored in the db field.)
Our main question is: is this the right approach? If not, what's the best approach in this scenario?
If there are too many documents with too much text inside them, will that adversely affect our database performance?
Since it's a web-based app running on shared hosting, we may not be able to use any standard library like Lucene.
Question by:ExpertHelp79
    LVL 10

    Accepted Solution

    The size of the documents will eventually affect the database performance.

    The right approach would depend on the performance you expect from the application. Extracting the text from the files and storing it in a db would produce the best performance when searching with the application.

    If search times are not important, you could just search through the documents themselves when a search is made (I imagine that would be very slow, though).

    My first suggestion would be to make the code that reads the documents and extracts the text a little more 'clever': have it remove duplicated words so that you store only the unique words in each document. That should cut the database size down considerably.
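A minimal sketch of the unique-words idea, in Python purely for illustration (the application in the question is not written in Python). The function names `unique_words` and `to_index_field` are hypothetical, and treating a "word" as a run of letters, digits, or underscores is an assumption:

```python
import re

def unique_words(text):
    """Return the distinct words of `text`, lowercased, in first-seen order."""
    seen = set()
    words = []
    # Assumption: a "word" is any run of letters, digits, or underscores.
    for word in re.findall(r"\w+", text.lower()):
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words

def to_index_field(text):
    """Join the unique words with spaces, ready to store in the db field."""
    return " ".join(unique_words(text))
```

Note that this discards word order and repetition, so an exact-phrase search against the stored field would no longer work; only word-by-word matching would.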

    If you currently search for the entire string, you would need to modify the search to split the string on spaces and pass each word into the WHERE clause. You can then let the user decide whether they need accurate results, 'where (includes word 1) AND (includes word 2) etc.', or a wider search, 'where (includes word 1) OR (includes word 2) etc.'. From there you have the possibility of breaking it down further and ranking results by how many unique words from the search string are matched.
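The split-and-combine search could be sketched roughly as follows. This is an illustration in Python, with an in-memory SQLite database standing in for SQL Server; the table `documents`, its columns `name` and `doc_text`, and the function names are all hypothetical:

```python
import sqlite3

def build_search(query, match_all=True):
    """Build a parameterized query with one LIKE test per search word."""
    words = query.split()
    if not words:
        raise ValueError("empty search string")
    joiner = " AND " if match_all else " OR "
    where = joiner.join("doc_text LIKE ?" for _ in words)
    # Words are passed as parameters, never concatenated into the SQL.
    # '%' and '_' typed by the user are not escaped in this sketch.
    params = ["%" + w + "%" for w in words]
    sql = "SELECT name FROM documents WHERE " + where + " ORDER BY name"
    return sql, params

def search(conn, query, match_all=True):
    sql, params = build_search(query, match_all)
    return [row[0] for row in conn.execute(sql, params)]

def match_count(query, doc_text):
    """Count how many distinct search words occur as whole words in doc_text."""
    doc_words = set(doc_text.lower().split())
    return sum(1 for w in set(query.lower().split()) if w in doc_words)

# Demo data: SQLite here is only a stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (name TEXT, doc_text TEXT)")
conn.executemany("INSERT INTO documents VALUES (?, ?)", [
    ("a.doc", "annual budget report"),
    ("b.pdf", "budget meeting notes"),
    ("c.xls", "holiday schedule"),
])
```

With this data, `search(conn, "budget report")` matches only `a.doc`, while `search(conn, "budget report", match_all=False)` also returns `b.pdf`. `match_count` could then be used to sort the wider OR results so that documents matching more words come first.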

    Hope that helps
    LVL 4

    Expert Comment

    It would be advantageous to use the Index facility of SQL Server and query the index to speed up the results. Based on what the user selects, you can then pull the data out of the tables.
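If the "Index facility" mentioned here means SQL Server full-text indexing (an assumption; the comment does not say so, and it may not be enabled on shared hosting), the search words could be turned into a condition for the T-SQL `CONTAINS` predicate. A rough Python sketch, where the function name, the table and column names, and the `?` placeholder style are hypothetical:

```python
def contains_condition(query, match_all=True):
    """Turn a search string into a CONTAINS search condition.

    Each word is wrapped in double quotes and the words are joined
    with AND or OR, e.g. '"budget" AND "report"'.
    """
    # Drop any double quotes typed by the user so they cannot break the syntax.
    words = [w.replace('"', "") for w in query.split()]
    words = [w for w in words if w]
    if not words:
        raise ValueError("empty search string")
    joiner = " AND " if match_all else " OR "
    return joiner.join('"' + w + '"' for w in words)

# Hypothetical query: assumes a full-text index already exists on
# documents.doc_text; the condition string is passed as a parameter.
SEARCH_SQL = "SELECT name FROM documents WHERE CONTAINS(doc_text, ?)"
```

The condition string is sent as a query parameter rather than pasted into the SQL text, and the rows that come back can then be used to fetch the full document records.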
