• Status: Solved

Best approach for document management system

We are developing a web-based application (ASP.NET 3.5 with SQL Server 2005) where we need to handle and search inside documents uploaded by users.
We have the following flow, which is working fine:
1. The user uploads a document (doc, xls, pdf, etc.).
2. It is stored in a particular folder on the server.
3. Our code reads it, extracts all the text inside it, and stores that text in a database field.
4. Later, whenever a user searches for a string, we look for that string in the database field and display the search results.
For example, some documents are ~5 MB in size and contain 450+ pages of text, and all of it gets extracted and stored in the db field.
Our main question is: is this the right approach? If not, what is the best approach in this scenario?
If there are too many documents with too much text inside them, will it adversely affect our database performance?
Since it is a web-based app running on shared hosting, we may not be able to use a standard library like Lucene.
1 Solution
The size of the documents will eventually affect the database performance.

The right approach would depend on the performance you expect from the application. Extracting the text from the files and storing it in a db would produce the best performance when searching with the application.

If search times are not important, you could just search through the documents themselves whenever a search is made (I imagine it would be very slow, though).

My first suggestion would be to make the code that reads the documents and extracts the text a little more 'clever': remove duplicated words so that you store only the unique words in each document. That should cut the database size down considerably.
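As a rough illustration of the unique-word idea, here is a minimal sketch in Python (used only as a neutral stand-in; the real application would do this in its ASP.NET code). The function name `unique_words` and the choice to lowercase and strip punctuation are assumptions, not something from the original setup.

```python
import re

def unique_words(text):
    """Return the distinct words in `text`, lowercased, in first-seen order."""
    seen = set()
    result = []
    # \w+ matches runs of letters, digits, and underscores; punctuation is dropped.
    for word in re.findall(r"\w+", text.lower()):
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result

extracted = "The cat sat on the mat. The mat was red."
# Store this reduced text in the database field instead of the full extract.
index_text = " ".join(unique_words(extracted))
print(index_text)  # the cat sat on mat was red
```

Note that once duplicates are removed the original word order is lost, so an exact-phrase match against the stored field no longer works; the search has to be done word by word.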

If you currently search for the entire string, you would need to modify the search to split the string on spaces and pass each word into the WHERE clause. You can then let the user decide whether they need accurate results, 'where (includes word 1) AND (includes word 2) etc.', or a wider search, 'where (includes word 1) OR (includes word 2) etc.'. From there you can break it down further and rank results by how many unique words from the search string are found in each document.
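The split-and-combine search could look something like the sketch below. It uses Python with an in-memory SQLite database purely as a stand-in for ASP.NET and SQL Server 2005; the table `documents`, its columns `name` and `body`, and the functions `build_search` and `search` are all hypothetical. The words are passed as query parameters rather than concatenated into the SQL text.

```python
import sqlite3

def build_search(search_string, match_all=True):
    """Split the search string on whitespace and build a parameterized query."""
    words = search_string.lower().split()
    if not words:
        return None, []
    joiner = " AND " if match_all else " OR "
    where = joiner.join(["body LIKE ?"] * len(words))
    # Relevance: count how many of the search words the document contains.
    hits = " + ".join(["CASE WHEN body LIKE ? THEN 1 ELSE 0 END"] * len(words))
    sql = (
        f"SELECT name, {hits} AS hits FROM documents "
        f"WHERE {where} ORDER BY hits DESC, name"
    )
    # LIKE wildcards (% and _) inside a word are not escaped in this sketch.
    patterns = [f"%{w}%" for w in words]
    # Parameters are bound in order: first the hit count, then the WHERE clause.
    return sql, patterns + patterns

# Stand-in data: one row per uploaded document with its extracted text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (name TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [
        ("a.doc", "annual budget report finance"),
        ("b.pdf", "budget meeting notes"),
        ("c.xls", "holiday schedule"),
    ],
)

def search(search_string, match_all=True):
    sql, params = build_search(search_string, match_all)
    if sql is None:
        return []
    return conn.execute(sql, params).fetchall()

print(search("budget report", match_all=True))   # [('a.doc', 2)]
print(search("budget report", match_all=False))  # [('a.doc', 2), ('b.pdf', 1)]
```

The `LIKE '%word%'` form matches substrings, so "cat" would also match "category"; whether that is acceptable depends on how accurate the results need to be.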

Hope that helps
It would be advantageous to use the indexing facility of SQL Server (for example, a full-text index) and query the index to speed up the results. Based on the user's selection, you can then pull the data out of the tables.
