Link to home
Start Free TrialLog in
Avatar of force3inc
force3inc

asked on

How to Import Word document into SQL Server

I have a client who wants to import resumes which are in MS Word documents into SQL Server, and be able to do text searching within the docs.

What is the best way to a) import the data, and b) retrieve a document based on a search of it's contents?

Thanks for any assistance.....
Avatar of mcmonap
mcmonap
Flag of United Kingdom of Great Britain and Northern Ireland image



Just a thought, but it might be worth considering Windows SharePoint Services depending upon what else the app needs to do; you can use it to organise and index documents and it's free, details here:
http://technet.microsoft.com/en-gb/windowsserver/sharepoint/bb684453.aspx
Avatar of KirillMueller
KirillMueller

a) Insert the file in its disk format into a column of type IMAGE using WRITETEXT or UPDATETEXT. If you use ADO, pass an array (SAFEARRAY) of bytes with the file contents. You also need to provide a column that holds the extension of the file.

b) Create a full-text on this image column. Don't forget to turn on full-text indexing service. Use the CONTAINS predicate for retrieval.

The magic that is responsible of extracting the text from the Word document is called "IFilter". The IFilter for Word documents should be included with the OS or SQL server or at least with Word itself, I'm not sure about that. Look at http://www.ifilter.org/ for other implementations.

Of course, only if SharePoint is not an option... :-)
Avatar of Racim BOUDJAKDJI
My best advice to you is to build an indexing system for retrieving the docs faster...Storing docs directly into the db is *not* a good idea, your database performance will degrade quick as the db grows and your system will require constant care and powerful machines...Been there done that...Definitely not a good idea...Oh yeah and do not use XML, ever...

Hope this helps...
Avatar of force3inc

ASKER

Racimo, can you give me an idea at what point you saw performance degrade? This is a small shop, and shouldn't have too many (depends on what "too many" means) word docs.....

You mention you done this, what platform (SQL Server, etc.) and version?

Thanks
KirillMueller, Sharepoint isn't going to be an option in this case, have you done this before? Is performance going to be a problem?

Thanks
mcmonap, Sharepoint is not going to be an option in this case. Would that have provided search wtihin the docs?

Thanks
<<Racimo, can you give me an idea at what point you saw performance degrade>>
As the db size grows the response time increases dramatically and requires constant tuning (indexes)...Not to mention that regular admin operations (defrag, reindexing etc) take much more time to run...

<<This is a small shop, and shouldn't have too many (depends on what "too many" means) word docs.....>>
If you are convinced that your database will never go beyond 20/30Gb you can do it, but above forget it...

<<You mention you done this, what platform (SQL Server, etc.) and version?>>
2000/2005-32bits/2005-64bits.

Hope this helps...
ASKER CERTIFIED SOLUTION
Avatar of KirillMueller
KirillMueller

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
<<Performance is no matter for small databases>>
I respectfully disagree with that statement.  I have seen critical small database running poorly (CPU high, RAM IO' bottlenecks).

<<Wouldn't that reduce the impact on performance if the filegroups reside on different drives?>>
Performance improvement by moving files to a different logical drive is very limited.  Creating new filegroups makes significant sense *if and only if*  the filegroups access different RAID systems.

<<Have you tried that?>>
Yes.

<< Is the performance of the native Windows Indexing Service better for such a huge amount of data?>>
Nope.  I stopped wasting time with that functionnality (requires too much troubleshooting and tuning).  Works poorly period.  find it more practical and performant  to store pathes.

Regard...
Racimo: Alright, but if you store paths to the documents, you still need to fulltext-index the documents. What would you prefer for that task?
<<What would you prefer for that task?>>
using (full-text + blob) is worse than using full-text only.

Regards...
Hi force3inc,

I agree with Racimo (documents in a database = bad place) however this is exactly what microsoft are doing with SharePoint and unfortunately it works quite well - at least every man and his dog seem to love it (not sure about the average DBA though!!).

I have numerous <SIGH> WSS deployments which have multiple content databases in excess 50GB,  and yes SharePoint will full text index these documents; as with SQL FT indexing you must install the appropriate iFilters like KirillMueller mentions above.

The issue with documents in a database for me is that SQL when SQL returns the data it may well be doing so for a long time - you can store a 2GB object (document) in a blob datatype (if you build a solution consider limiting the size of files that can be uploaded).  SQL is more about data manipulation and storage than it is about firing big files around the network (hence FILE server / DATAbase!!)

I suggested SharePoint (WSS) because it is free and does what you want out of the box, you can definitely do what you need in a customised way it just often takes longer - I'm a big fan of the 80/20 rule in the buy/build debate, if you can find a product that does 80% of what you want then buy it otherwise build it, generally it saves a lot of pain.
<<however this is exactly what microsoft are doing with SharePoint and unfortunately it works quite well>>
as long as the data store is below 50Go and does not have more than 10 proc/sec.  Passed that once has to get ready to buy new machines (even machines specifically for indexing/catalogs)., better get prepared to buy lots of new expensive machines...Take the following recommandations, multiply by 2 (hey that's MS)
to get a clue...

http://technet2.microsoft.com/Office/en-us/library/6a13cd9f-4b44-40d6-85aa-c70a8e5c34fe1033.mspx?mfr=true

<<if you can find a product that does 80% of what you want then buy it otherwise build it, generally it saves a lot of pain.>>
I kind of agree if cha,ces are, the db won't grow dramatically in the first year as well as concurrency.  If not I stick to store references...

Regards...
You may want to check out the following product/company.  I have found it much more powerful (and cheaper) than SQL Server.

http://armadillo.fr/english/armamailpro.html#doc

Hope this helps...
Imports multiple word docs individually or from a folder. It parses the address information into seperate fields.

http://www.egrabber.com/resumegrabber.html
I've set this up and it works just fine - thanks