Solved

How to Import Word document into SQL Server

Posted on 2007-11-22
16
881 Views
Last Modified: 2010-04-21
I have a client who wants to import resumes which are in MS Word documents into SQL Server, and be able to do text searching within the docs.

What is the best way to a) import the data, and b) retrieve a document based on a search of it's contents?

Thanks for any assistance.....
0
Comment
Question by:force3inc
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 6
  • 4
  • 3
  • +2
16 Comments
 
LVL 15

Expert Comment

by:mcmonap
ID: 20335808


Just a thought, but it might be worth considering Windows SharePoint Services depending upon what else the app needs to do; you can use it to organise and index documents and it's free, details here:
http://technet.microsoft.com/en-gb/windowsserver/sharepoint/bb684453.aspx
0
 
LVL 5

Expert Comment

by:KirillMueller
ID: 20336181
a) Insert the file in its disk format into a column of type IMAGE using WRITETEXT or UPDATETEXT. If you use ADO, pass an array (SAFEARRAY) of bytes with the file contents. You also need to provide a column that holds the extension of the file.

b) Create a full-text on this image column. Don't forget to turn on full-text indexing service. Use the CONTAINS predicate for retrieval.

The magic that is responsible of extracting the text from the Word document is called "IFilter". The IFilter for Word documents should be included with the OS or SQL server or at least with Word itself, I'm not sure about that. Look at http://www.ifilter.org/ for other implementations.

Of course, only if SharePoint is not an option... :-)
0
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 20336569
My best advice to you is to build an indexing system for retrieving the docs faster...Storing docs directly into the db is *not* a good idea, your database performance will degrade quick as the db grows and your system will require constant care and powerful machines...Been there done that...Definitely not a good idea...Oh yeah and do not use XML, ever...

Hope this helps...
0
NFR key for Veeam Backup for Microsoft Office 365

Veeam is happy to provide a free NFR license (for 1 year, up to 10 users). This license allows for the non‑production use of Veeam Backup for Microsoft Office 365 in your home lab without any feature limitations.

 

Author Comment

by:force3inc
ID: 20336623
Racimo, can you give me an idea at what point you saw performance degrade? This is a small shop, and shouldn't have too many (depends on what "too many" means) word docs.....

You mention you done this, what platform (SQL Server, etc.) and version?

Thanks
0
 

Author Comment

by:force3inc
ID: 20336629
KirillMueller, Sharepoint isn't going to be an option in this case, have you done this before? Is performance going to be a problem?

Thanks
0
 

Author Comment

by:force3inc
ID: 20336632
mcmonap, Sharepoint is not going to be an option in this case. Would that have provided search wtihin the docs?

Thanks
0
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 20336952
<<Racimo, can you give me an idea at what point you saw performance degrade>>
As the db size grows the response time increases dramatically and requires constant tuning (indexes)...Not to mention that regular admin operations (defrag, reindexing etc) take much more time to run...

<<This is a small shop, and shouldn't have too many (depends on what "too many" means) word docs.....>>
If you are convinced that your database will never go beyond 20/30Gb you can do it, but above forget it...

<<You mention you done this, what platform (SQL Server, etc.) and version?>>
2000/2005-32bits/2005-64bits.

Hope this helps...
0
 
LVL 5

Accepted Solution

by:
KirillMueller earned 500 total points
ID: 20337314
force3inc: Yes, I've done this before. However, I have no experience beyond the 30 GB border that Racimo mentioned. Performance is no matter for small databases. There are several benefits of storing docs inside the SQL database:

- Replication
- "Everything is in one place"
- Retrieval is more comfortable

Undoubtedly, performance is better when documents are not in the database, as Racimo suggests. It must be, because SQL server is just one more layer between the document handler and the hard disk. You should evaluate if it matters in your case.

In fact, MS SQL Fulltext engine uses the same Indexing Service as Windows for full-text-indexing its files.

Racimo: It seems that you can store the table with the documents and the fulltext index on distinct filegroups. Wouldn't that reduce the impact on performance if the filegroups reside on different drives? Have you tried that? Is the performance of the native Windows Indexing Service better for such a huge amount of data?
0
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 20338712
<<Performance is no matter for small databases>>
I respectfully disagree with that statement.  I have seen critical small database running poorly (CPU high, RAM IO' bottlenecks).

<<Wouldn't that reduce the impact on performance if the filegroups reside on different drives?>>
Performance improvement by moving files to a different logical drive is very limited.  Creating new filegroups makes significant sense *if and only if*  the filegroups access different RAID systems.

<<Have you tried that?>>
Yes.

<< Is the performance of the native Windows Indexing Service better for such a huge amount of data?>>
Nope.  I stopped wasting time with that functionnality (requires too much troubleshooting and tuning).  Works poorly period.  find it more practical and performant  to store pathes.

Regard...
0
 
LVL 5

Expert Comment

by:KirillMueller
ID: 20338804
Racimo: Alright, but if you store paths to the documents, you still need to fulltext-index the documents. What would you prefer for that task?
0
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 20339280
<<What would you prefer for that task?>>
using (full-text + blob) is worse than using full-text only.

Regards...
0
 
LVL 15

Expert Comment

by:mcmonap
ID: 20342179
Hi force3inc,

I agree with Racimo (documents in a database = bad place) however this is exactly what microsoft are doing with SharePoint and unfortunately it works quite well - at least every man and his dog seem to love it (not sure about the average DBA though!!).

I have numerous <SIGH> WSS deployments which have multiple content databases in excess 50GB,  and yes SharePoint will full text index these documents; as with SQL FT indexing you must install the appropriate iFilters like KirillMueller mentions above.

The issue with documents in a database for me is that SQL when SQL returns the data it may well be doing so for a long time - you can store a 2GB object (document) in a blob datatype (if you build a solution consider limiting the size of files that can be uploaded).  SQL is more about data manipulation and storage than it is about firing big files around the network (hence FILE server / DATAbase!!)

I suggested SharePoint (WSS) because it is free and does what you want out of the box, you can definitely do what you need in a customised way it just often takes longer - I'm a big fan of the 80/20 rule in the buy/build debate, if you can find a product that does 80% of what you want then buy it otherwise build it, generally it saves a lot of pain.
0
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 20342301
<<however this is exactly what microsoft are doing with SharePoint and unfortunately it works quite well>>
as long as the data store is below 50Go and does not have more than 10 proc/sec.  Passed that once has to get ready to buy new machines (even machines specifically for indexing/catalogs)., better get prepared to buy lots of new expensive machines...Take the following recommandations, multiply by 2 (hey that's MS)
to get a clue...

http://technet2.microsoft.com/Office/en-us/library/6a13cd9f-4b44-40d6-85aa-c70a8e5c34fe1033.mspx?mfr=true

<<if you can find a product that does 80% of what you want then buy it otherwise build it, generally it saves a lot of pain.>>
I kind of agree if cha,ces are, the db won't grow dramatically in the first year as well as concurrency.  If not I stick to store references...

Regards...
0
 
LVL 23

Expert Comment

by:Racim BOUDJAKDJI
ID: 20342306
You may want to check out the following product/company.  I have found it much more powerful (and cheaper) than SQL Server.

http://armadillo.fr/english/armamailpro.html#doc

Hope this helps...
0
 
LVL 1

Expert Comment

by:FLETCHM
ID: 20815361
Imports multiple word docs individually or from a folder. It parses the address information into seperate fields.

http://www.egrabber.com/resumegrabber.html
0
 

Author Closing Comment

by:force3inc
ID: 31410587
I've set this up and it works just fine - thanks
0

Featured Post

Salesforce Has Never Been Easier

Improve and reinforce salesforce training & adoption using WalkMe's digital adoption platform. Start saving on costly employee training by creating fast intuitive Walk-Thrus for Salesforce. Claim your Free Account Now

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Why is this different from all of the other step by step guides?  Because I make a living as a DBA and not as a writer and I lived through this experience. Defining the name: When I talk to people they say different names on this subject stuff l…
For both online and offline retail, the cross-channel business is the most recent pattern in the B2C trade space.
Familiarize people with the process of retrieving data from SQL Server using an Access pass-thru query. Microsoft Access is a very powerful client/server development tool. One of the ways that you can retrieve data from a SQL Server is by using a pa…
This videos aims to give the viewer a basic demonstration of how a user can query current session information by using the SYS_CONTEXT function

728 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question