How to Import Word document into SQL Server

I have a client who wants to import resumes which are in MS Word documents into SQL Server, and be able to do text searching within the docs.

What is the best way to a) import the data, and b) retrieve a document based on a search of it's contents?

Thanks for any assistance.....
force3incAsked:
Who is Participating?

[Webinar] Streamline your web hosting managementRegister Today

x
 
KirillMuellerConnect With a Mentor Commented:
force3inc: Yes, I've done this before. However, I have no experience beyond the 30 GB border that Racimo mentioned. Performance is no matter for small databases. There are several benefits of storing docs inside the SQL database:

- Replication
- "Everything is in one place"
- Retrieval is more comfortable

Undoubtedly, performance is better when documents are not in the database, as Racimo suggests. It must be, because SQL server is just one more layer between the document handler and the hard disk. You should evaluate if it matters in your case.

In fact, MS SQL Fulltext engine uses the same Indexing Service as Windows for full-text-indexing its files.

Racimo: It seems that you can store the table with the documents and the fulltext index on distinct filegroups. Wouldn't that reduce the impact on performance if the filegroups reside on different drives? Have you tried that? Is the performance of the native Windows Indexing Service better for such a huge amount of data?
0
 
mcmonapCommented:


Just a thought, but it might be worth considering Windows SharePoint Services depending upon what else the app needs to do; you can use it to organise and index documents and it's free, details here:
http://technet.microsoft.com/en-gb/windowsserver/sharepoint/bb684453.aspx
0
 
KirillMuellerCommented:
a) Insert the file in its disk format into a column of type IMAGE using WRITETEXT or UPDATETEXT. If you use ADO, pass an array (SAFEARRAY) of bytes with the file contents. You also need to provide a column that holds the extension of the file.

b) Create a full-text on this image column. Don't forget to turn on full-text indexing service. Use the CONTAINS predicate for retrieval.

The magic that is responsible of extracting the text from the Word document is called "IFilter". The IFilter for Word documents should be included with the OS or SQL server or at least with Word itself, I'm not sure about that. Look at http://www.ifilter.org/ for other implementations.

Of course, only if SharePoint is not an option... :-)
0
Take Control of Web Hosting For Your Clients

As a web developer or IT admin, successfully managing multiple client accounts can be challenging. In this webinar we will look at the tools provided by Media Temple and Plesk to make managing your clients’ hosting easier.

 
Racim BOUDJAKDJIDatabase Architect - Dba - Data ScientistCommented:
My best advice to you is to build an indexing system for retrieving the docs faster...Storing docs directly into the db is *not* a good idea, your database performance will degrade quick as the db grows and your system will require constant care and powerful machines...Been there done that...Definitely not a good idea...Oh yeah and do not use XML, ever...

Hope this helps...
0
 
force3incAuthor Commented:
Racimo, can you give me an idea at what point you saw performance degrade? This is a small shop, and shouldn't have too many (depends on what "too many" means) word docs.....

You mention you done this, what platform (SQL Server, etc.) and version?

Thanks
0
 
force3incAuthor Commented:
KirillMueller, Sharepoint isn't going to be an option in this case, have you done this before? Is performance going to be a problem?

Thanks
0
 
force3incAuthor Commented:
mcmonap, Sharepoint is not going to be an option in this case. Would that have provided search wtihin the docs?

Thanks
0
 
Racim BOUDJAKDJIDatabase Architect - Dba - Data ScientistCommented:
<<Racimo, can you give me an idea at what point you saw performance degrade>>
As the db size grows the response time increases dramatically and requires constant tuning (indexes)...Not to mention that regular admin operations (defrag, reindexing etc) take much more time to run...

<<This is a small shop, and shouldn't have too many (depends on what "too many" means) word docs.....>>
If you are convinced that your database will never go beyond 20/30Gb you can do it, but above forget it...

<<You mention you done this, what platform (SQL Server, etc.) and version?>>
2000/2005-32bits/2005-64bits.

Hope this helps...
0
 
Racim BOUDJAKDJIDatabase Architect - Dba - Data ScientistCommented:
<<Performance is no matter for small databases>>
I respectfully disagree with that statement.  I have seen critical small database running poorly (CPU high, RAM IO' bottlenecks).

<<Wouldn't that reduce the impact on performance if the filegroups reside on different drives?>>
Performance improvement by moving files to a different logical drive is very limited.  Creating new filegroups makes significant sense *if and only if*  the filegroups access different RAID systems.

<<Have you tried that?>>
Yes.

<< Is the performance of the native Windows Indexing Service better for such a huge amount of data?>>
Nope.  I stopped wasting time with that functionnality (requires too much troubleshooting and tuning).  Works poorly period.  find it more practical and performant  to store pathes.

Regard...
0
 
KirillMuellerCommented:
Racimo: Alright, but if you store paths to the documents, you still need to fulltext-index the documents. What would you prefer for that task?
0
 
Racim BOUDJAKDJIDatabase Architect - Dba - Data ScientistCommented:
<<What would you prefer for that task?>>
using (full-text + blob) is worse than using full-text only.

Regards...
0
 
mcmonapCommented:
Hi force3inc,

I agree with Racimo (documents in a database = bad place) however this is exactly what microsoft are doing with SharePoint and unfortunately it works quite well - at least every man and his dog seem to love it (not sure about the average DBA though!!).

I have numerous <SIGH> WSS deployments which have multiple content databases in excess 50GB,  and yes SharePoint will full text index these documents; as with SQL FT indexing you must install the appropriate iFilters like KirillMueller mentions above.

The issue with documents in a database for me is that SQL when SQL returns the data it may well be doing so for a long time - you can store a 2GB object (document) in a blob datatype (if you build a solution consider limiting the size of files that can be uploaded).  SQL is more about data manipulation and storage than it is about firing big files around the network (hence FILE server / DATAbase!!)

I suggested SharePoint (WSS) because it is free and does what you want out of the box, you can definitely do what you need in a customised way it just often takes longer - I'm a big fan of the 80/20 rule in the buy/build debate, if you can find a product that does 80% of what you want then buy it otherwise build it, generally it saves a lot of pain.
0
 
Racim BOUDJAKDJIDatabase Architect - Dba - Data ScientistCommented:
<<however this is exactly what microsoft are doing with SharePoint and unfortunately it works quite well>>
as long as the data store is below 50Go and does not have more than 10 proc/sec.  Passed that once has to get ready to buy new machines (even machines specifically for indexing/catalogs)., better get prepared to buy lots of new expensive machines...Take the following recommandations, multiply by 2 (hey that's MS)
to get a clue...

http://technet2.microsoft.com/Office/en-us/library/6a13cd9f-4b44-40d6-85aa-c70a8e5c34fe1033.mspx?mfr=true

<<if you can find a product that does 80% of what you want then buy it otherwise build it, generally it saves a lot of pain.>>
I kind of agree if cha,ces are, the db won't grow dramatically in the first year as well as concurrency.  If not I stick to store references...

Regards...
0
 
Racim BOUDJAKDJIDatabase Architect - Dba - Data ScientistCommented:
You may want to check out the following product/company.  I have found it much more powerful (and cheaper) than SQL Server.

http://armadillo.fr/english/armamailpro.html#doc

Hope this helps...
0
 
FLETCHMCommented:
Imports multiple word docs individually or from a folder. It parses the address information into seperate fields.

http://www.egrabber.com/resumegrabber.html
0
 
force3incAuthor Commented:
I've set this up and it works just fine - thanks
0
All Courses

From novice to tech pro — start learning today.