Link to home
Start Free TrialLog in
Avatar of Stefan Lennerbrant
Stefan LennerbrantFlag for Sweden

asked on

Free text search in Word/Excel etc stored in BLOB

I'm storing Word and Excel (and other) documents in BLOBs in SQL Server 2000, alongside with various other simple varchar fields.
Today I search for text data in the varchars with simple
 ...where field like '%text%' or field2 like '%text%'
It has not yet been interesting to start using the real free text search functions in SQL.

However, now I'd like to perform searches in at least Word documents, perhaps also Excel and PDF and perhaps also other types.
We are in full control of the application performing the searches (in-house developed in C and C++) and thus I look for some product that may perform these searches and return indications of which rows match.

Either a module that may compiled or linked into our program, or alternatively an external EXE program that takes some input and gives some output that may be written/read by our application.

Also, I may of course use the freetext capabilities and transfer the documents into those tables when storing them initially.
Then I'd need some module to extract the data from the documents.

Any ideas or experiences of document free text search?
Thanks, /Stefan Lennerbrant
Avatar of Aneesh
Flag of Canada image

I am not sure, is this what you are looking for
Avatar of Stefan Lennerbrant


Thanks, I'll read it but it seems to address only the fulltext search system itself, not how to do searching in the binary BLOB objects containing Word/Excel. The BLOB contains Word/Excel-specific data (binary) that cannot be searched "just like that"

for Searching data in a full text indexed column , use 'CONTAINS', refer BOL for more info

A. Use CONTAINS with <simple_term>
This example finds all products with a price of $15.00 that contain the word "bottles."

USE Northwind
SELECT ProductName
FROM Products
WHERE UnitPrice = 15.00
   AND CONTAINS(QuantityPerUnit, 'bottles')

B. Use CONTAINS and phrase in <simple_term>
This example returns all products that contain either the phrase "sasquatch ale" or "steeleye stout."

USE Northwind
SELECT ProductName
FROM Products
WHERE CONTAINS(ProductName, ' "sasquatch ale" OR "steeleye stout" ')

C. Use CONTAINS with <prefix_term>
This example returns all product names with at least one word starting with the prefix choc in the ProductName column.

USE Northwind
SELECT ProductName
FROM Products
WHERE CONTAINS(ProductName, ' "choc*" ')

D. Use CONTAINS and OR with <prefix_term>
This example returns all category descriptions containing the strings "sea" or "bread."

USE Northwind
SELECT CategoryName
FROM Categories
WHERE CONTAINS(Description, '"sea*" OR "bread*"')

E. Use CONTAINS with <proximity_term>
This example returns all product names that have the word "Boysenberry" near the word "spread."

USE Northwind
SELECT ProductName
FROM Products
WHERE CONTAINS(ProductName, 'spread NEAR Boysenberry')

F. Use CONTAINS with <generation_term>
This example searches for all products with words of the form dry: dried, drying, and so on.

USE Northwind
SELECT ProductName
FROM Products

G. Use CONTAINS with <weighted_term>
This example searches for all product names containing the words spread, sauces, or relishes, and different weightings are given to each word.

USE Northwind
SELECT CategoryName, Description
FROM Categories
WHERE CONTAINS(Description, 'ISABOUT (spread weight (.8),
   sauces weight (.4), relishes weight (.2) )' )

H. Use CONTAINS with variables
This example uses a variable instead of a specific search term.

USE pubs
DECLARE @SearchWord varchar(30)
SET @SearchWord ='Moon'
SELECT pr_info FROM pub_info WHERE CONTAINS(pr_info, @SearchWord)

Yes, but still that example only search text data. Word and Excel are binary data in BLOBs and need to be parsed.

Either I need to scan through each BLOB with a third party module/program (non-fulltext search), or I need to extract the document contents from the BLOB and populate into the fulltext search tables (extraction using third party module).

I suppose there must exist commercial products to do this, at least extraction of Word/Excel documents into fulltext. Thus, perhaps one wouldn't even need to actually integrate the products into the application, as SQL Server would fix all that by itself. Right?
Avatar of deroby

I think that you've found your own answer, namely, store the texts in char (or text) fields alongside the BLOBs. It's going to be much more performant than installing a system where each file is loaded from the database, read & searched and added to the result-list where appropriate. Duplicating the the contents (well, only the text, forget about the layout etc) of the files will make maintenance of the database quite a bit more costly (extract of textual contents, storage of textual contents), but that's just an initial cost, you'll probably win it back many times once you started doing searches on it.

(as for the extraction, I would try to (programmatically) do a "Save as text file" of the file to be saved as blob where-ever possible)

My 2 cents.
The question is then, how to "save as text file" from within the application itself:-)

The Word/Excel documents shall be stored in the database as a BLOB, such that it can be retrieved and returned to anyone who wants to see it. It could of course be stored as a file in the file system instead, but we put it in the database.
Searching may be performed by either scanning each BLOB at every search (not very efficient:-) or else by using fulltext search capabilities.
A third option might be to really store a duplicate of the BLOB/binary, as a BLOB/text -- just to be able to search with a normal "like". I haven't thought of that but it seems inefficient.
Storing in varchar is not an option, as the documents may be large.

Still, the primary question is to find products that do all this for me. Extracting data from Word/Excel (and other), either into text files etc or directly into fulltext search tables in SQL.
Perhaps I've been unclear about this, but the actual question is to find "extraction programs" that convert Word/Excel/PDF etc into something that SQL may search in some way.
The 'how' is not so difficult and can be easily done using the MS Office API's (as for the Office document files), it can't be that hard to write a little loop around it that looks in the database for non-extracted blobs, saves the file, extracts the file into another file and then reads & uploads the new file to relevant table(s).

You're probably right that there are programs around that do this for you, I would google around for so called KIX systems... (knowledge & Information eXchange systems).
The only one I personally know about is as it's a spin-off from the previous company I worked for (feel free to mention me =),
although they might offer more than what you're looking for, oh, well, never hurts asking off course.
Anyway, I'm sure there are other companies around that do similar things and might sell those tools seperately. Heck, google seems to be quite able to do it as they seem to index html, .doc, .pff, ...
as for efficiency & storage : you can store the text divided into 8000 char 'parts' and have your front-end have them reconstructed again. Agreed that LIKE operations on it won't be the fastest thing around.
(you could start building indexed dictionaries and store the links to those instead of the actual text.. somehow seems quite the challenge though =)

AFAIK this is why FULL TEXT SEARCH was added to sql-server, but I have no working experience with it, sorry.
Yep, I'll look at icmsgrp and any other suggestions that are mentioned.
However, using Office API's is not applicable, as Office tools are not installed on a databas server...

Other types of utilities (such as, perhaps, icmsgrp products) may be of interest.
The more file formats, the better, but I suppose that Word/Excel/PDF covers most cases.

Anyone with real experience of using such tools to extract data into SQL databases, or into other formats that may be used for database storage?

Well well, I just seem to have focused to much on my recollection of SQL7 functionality

In SQL2000 docs I find the following:
When an image column participates in a full-text index, the full-text service looks at the extensions of the documents in the image column and applies a corresponding filter to interpret the binary data and extract the textual information needed for indexing and querying.

However, there are constraints:
The following types are supported: .doc, .xls, .ppt, .txt, and .htm (if not using custom filters)
A document must be less than 16 megabytes (MB) in size
It must not contain more than 256 kilobytes (KB) of filtered text

This seems to solve all of the problems? Does anyone have any experience with these methods?
Should I still look for other products / implementations?

Also, to use fulltext indexes I suppose I need to add a separate server for this, such that the normal server is not affected by index rebuilding and searching (there is a lot of activity on the "production servers"). I suppose such a setup works well even though I haven't seen any reference to "server-splitting" while reading through the first MSDN descriptions.
Avatar of Computer101
Flag of United States of America image

Link to home
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial