Solved

How to Import Word document into SQL Server

Posted on 2007-11-22
16
861 Views
Last Modified: 2010-04-21
I have a client who wants to import resumes which are in MS Word documents into SQL Server, and be able to do text searching within the docs.

What is the best way to a) import the data, and b) retrieve a document based on a search of it's contents?

Thanks for any assistance.....
0
Comment
Question by:force3inc
  • 6
  • 4
  • 3
  • +2
16 Comments
 
LVL 15

Expert Comment

by:mcmonap
ID: 20335808


Just a thought, but it might be worth considering Windows SharePoint Services depending upon what else the app needs to do; you can use it to organise and index documents and it's free, details here:
http://technet.microsoft.com/en-gb/windowsserver/sharepoint/bb684453.aspx
0
 
LVL 5

Expert Comment

by:KirillMueller
ID: 20336181
a) Insert the file in its disk format into a column of type IMAGE using WRITETEXT or UPDATETEXT. If you use ADO, pass an array (SAFEARRAY) of bytes with the file contents. You also need to provide a column that holds the extension of the file.

b) Create a full-text on this image column. Don't forget to turn on full-text indexing service. Use the CONTAINS predicate for retrieval.

The magic that is responsible of extracting the text from the Word document is called "IFilter". The IFilter for Word documents should be included with the OS or SQL server or at least with Word itself, I'm not sure about that. Look at http://www.ifilter.org/ for other implementations.

Of course, only if SharePoint is not an option... :-)
0
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 20336569
My best advice to you is to build an indexing system for retrieving the docs faster...Storing docs directly into the db is *not* a good idea, your database performance will degrade quick as the db grows and your system will require constant care and powerful machines...Been there done that...Definitely not a good idea...Oh yeah and do not use XML, ever...

Hope this helps...
0
 

Author Comment

by:force3inc
ID: 20336623
Racimo, can you give me an idea at what point you saw performance degrade? This is a small shop, and shouldn't have too many (depends on what "too many" means) word docs.....

You mention you done this, what platform (SQL Server, etc.) and version?

Thanks
0
 

Author Comment

by:force3inc
ID: 20336629
KirillMueller, Sharepoint isn't going to be an option in this case, have you done this before? Is performance going to be a problem?

Thanks
0
 

Author Comment

by:force3inc
ID: 20336632
mcmonap, Sharepoint is not going to be an option in this case. Would that have provided search wtihin the docs?

Thanks
0
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 20336952
<<Racimo, can you give me an idea at what point you saw performance degrade>>
As the db size grows the response time increases dramatically and requires constant tuning (indexes)...Not to mention that regular admin operations (defrag, reindexing etc) take much more time to run...

<<This is a small shop, and shouldn't have too many (depends on what "too many" means) word docs.....>>
If you are convinced that your database will never go beyond 20/30Gb you can do it, but above forget it...

<<You mention you done this, what platform (SQL Server, etc.) and version?>>
2000/2005-32bits/2005-64bits.

Hope this helps...
0
 
LVL 5

Accepted Solution

by:
KirillMueller earned 500 total points
ID: 20337314
force3inc: Yes, I've done this before. However, I have no experience beyond the 30 GB border that Racimo mentioned. Performance is no matter for small databases. There are several benefits of storing docs inside the SQL database:

- Replication
- "Everything is in one place"
- Retrieval is more comfortable

Undoubtedly, performance is better when documents are not in the database, as Racimo suggests. It must be, because SQL server is just one more layer between the document handler and the hard disk. You should evaluate if it matters in your case.

In fact, MS SQL Fulltext engine uses the same Indexing Service as Windows for full-text-indexing its files.

Racimo: It seems that you can store the table with the documents and the fulltext index on distinct filegroups. Wouldn't that reduce the impact on performance if the filegroups reside on different drives? Have you tried that? Is the performance of the native Windows Indexing Service better for such a huge amount of data?
0
Zoho SalesIQ

Hassle-free live chat software re-imagined for business growth. 2 users, always free.

 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 20338712
<<Performance is no matter for small databases>>
I respectfully disagree with that statement.  I have seen critical small database running poorly (CPU high, RAM IO' bottlenecks).

<<Wouldn't that reduce the impact on performance if the filegroups reside on different drives?>>
Performance improvement by moving files to a different logical drive is very limited.  Creating new filegroups makes significant sense *if and only if*  the filegroups access different RAID systems.

<<Have you tried that?>>
Yes.

<< Is the performance of the native Windows Indexing Service better for such a huge amount of data?>>
Nope.  I stopped wasting time with that functionnality (requires too much troubleshooting and tuning).  Works poorly period.  find it more practical and performant  to store pathes.

Regard...
0
 
LVL 5

Expert Comment

by:KirillMueller
ID: 20338804
Racimo: Alright, but if you store paths to the documents, you still need to fulltext-index the documents. What would you prefer for that task?
0
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 20339280
<<What would you prefer for that task?>>
using (full-text + blob) is worse than using full-text only.

Regards...
0
 
LVL 15

Expert Comment

by:mcmonap
ID: 20342179
Hi force3inc,

I agree with Racimo (documents in a database = bad place) however this is exactly what microsoft are doing with SharePoint and unfortunately it works quite well - at least every man and his dog seem to love it (not sure about the average DBA though!!).

I have numerous <SIGH> WSS deployments which have multiple content databases in excess 50GB,  and yes SharePoint will full text index these documents; as with SQL FT indexing you must install the appropriate iFilters like KirillMueller mentions above.

The issue with documents in a database for me is that SQL when SQL returns the data it may well be doing so for a long time - you can store a 2GB object (document) in a blob datatype (if you build a solution consider limiting the size of files that can be uploaded).  SQL is more about data manipulation and storage than it is about firing big files around the network (hence FILE server / DATAbase!!)

I suggested SharePoint (WSS) because it is free and does what you want out of the box, you can definitely do what you need in a customised way it just often takes longer - I'm a big fan of the 80/20 rule in the buy/build debate, if you can find a product that does 80% of what you want then buy it otherwise build it, generally it saves a lot of pain.
0
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 20342301
<<however this is exactly what microsoft are doing with SharePoint and unfortunately it works quite well>>
as long as the data store is below 50Go and does not have more than 10 proc/sec.  Passed that once has to get ready to buy new machines (even machines specifically for indexing/catalogs)., better get prepared to buy lots of new expensive machines...Take the following recommandations, multiply by 2 (hey that's MS)
to get a clue...

http://technet2.microsoft.com/Office/en-us/library/6a13cd9f-4b44-40d6-85aa-c70a8e5c34fe1033.mspx?mfr=true

<<if you can find a product that does 80% of what you want then buy it otherwise build it, generally it saves a lot of pain.>>
I kind of agree if cha,ces are, the db won't grow dramatically in the first year as well as concurrency.  If not I stick to store references...

Regards...
0
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 20342306
You may want to check out the following product/company.  I have found it much more powerful (and cheaper) than SQL Server.

http://armadillo.fr/english/armamailpro.html#doc

Hope this helps...
0
 
LVL 1

Expert Comment

by:FLETCHM
ID: 20815361
Imports multiple word docs individually or from a folder. It parses the address information into seperate fields.

http://www.egrabber.com/resumegrabber.html
0
 

Author Closing Comment

by:force3inc
ID: 31410587
I've set this up and it works just fine - thanks
0

Featured Post

How to run any project with ease

Manage projects of all sizes how you want. Great for personal to-do lists, project milestones, team priorities and launch plans.
- Combine task lists, docs, spreadsheets, and chat in one
- View and edit from mobile/offline
- Cut down on emails

Join & Write a Comment

Nowadays, some of developer are too much worried about data. Who is using data, who is updating it etc. etc. Because, data is more costlier in term of money and information. So security of data is focusing concern in days. Lets' understand the Au…
Ever needed a SQL 2008 Database replicated/mirrored/log shipped on another server but you can't take the downtime inflicted by initial snapshot or disconnect while T-logs are restored or mirror applied? You can use SQL Server Initialize from Backup…
Via a live example, show how to shrink a transaction log file down to a reasonable size.
Viewers will learn how to use the UPDATE and DELETE statements to change or remove existing data from their tables. Make a table: Update a specific column given a specific row using the UPDATE statement: Remove a set of values using the DELETE s…

707 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

11 Experts available now in Live!

Get 1:1 Help Now