Link to home
Start Free TrialLog in
Avatar of wchestnut
wchestnutFlag for United States of America

asked on

SQL 2012 FileTable Full Text Search Issue

I have a FileTable that consists mostly of resumes in Word DOC/DOCX format.  For some reason, I have certain people who's e-mail address or name is not being found even though their resume exists in the FileStream Share.  I even opened their resume, copied the e-mail address directly from it and pasted it into the query (not real e-mail address):
SELECT	* 
FROM	RESUMES.dbo.ftbl_resume_files 
WHERE	CONTAINS(file_stream, 'john.smith@yahoo.com')

Open in new window

But, I can find my own resume with my e-mail address and many others.

I've tried initiating Rebuilds on all Indexes and the Catalog, but when they're finished, the same names and e-mail addresses aren't found.  I'm new to this so I don't even know where to start looking.  Any help would be appreciated.
Avatar of wchestnut
wchestnut
Flag of United States of America image

ASKER

Here's something else I noticed when trying to troubleshoot this.  In the two SELECT statements below, the first one does not work but the second one does:
SELECT	* 
FROM	RESUMES.dbo.ftbl_resume_files 
WHERE	CONTAINS(name, 'Keay')

SELECT	* 
FROM	RESUMES.dbo.ftbl_resume_files 
WHERE	LEFT(name,4) = 'Keay'

Open in new window

Avatar of lcohan
BTW....."I have a FileTable that consists mostly of resumes in Word DOC/DOCX format."
DOCX is not a supported default document type in SQL 2012 - try the select below to see supported attachment types and also:
http://social.msdn.microsoft.com/Forums/en-US/2346bf93-d7e9-4720-aac2-de71d866cb3e/sql-server-2012-filetables-document-types-docx?forum=sqldatabaseengine


select * from sys.fulltext_document_types
-- as you can see DOCX is not listed as defaults therefor most likely those emails are those type of documents only.



http://technet.microsoft.com/en-us/library/ms143272.aspx

"
Behavior Changes in Full-Text Search in SQL Server 2012
SQL Server 2012 installs a new version of the word breakers and stemmers for US English (LCID 1033) and UK English (LCID 2057). However you can switch to the previous version of these components if you want to retain the previous behavior. For more information, see Change the Word Breaker Used for US English and UK English.
"

http://technet.microsoft.com/en-us/library/ms187787.aspx

"
CONTAINS can search for:
•A word or phrase.

•The prefix of a word or phrase.

•A word near another word.

•A word inflectionally generated from another (for example, the word drive is the inflectional stem of drives, drove, driving, and driven).

•A word that is a synonym of another word using a thesaurus (for example, the word "metal" can have synonyms such as "aluminum" and "steel").

"

Therefor try for instance:

SELECT      *
FROM      RESUMES.dbo.ftbl_resume_files
WHERE      CONTAINS(file_stream, 'john and smith and yahoo')
That statement didn't work, either.  I see where you were going with this... but supposedly the new components in 2012 are supposed to include more not less results.  An e-mail address should be considered a "phrase" according to that article and they list "cat@dog.com" should be found under the new word breakers and stemmers.  And it doesn't explain why some whole e-mail addresses are found while others are not.
An e-mail address should be considered a "phrase" according to that article and they list "cat@dog.com" should be found under the new word breakers and stemmers.
So that means you have attempted to search for email addresses as phrases?  As in:
SELECT      *
FROM      RESUMES.dbo.ftbl_resume_files
WHERE      CONTAINS(file_stream, '"john.smith@yahoo.com"')
Yes.  That works for most other e-mail addresses, but not for others.  It's inconsistent as if the index doesn't include them.  The same goes for "this and that" name searches as well.  Some names are found but not others yet I can see the names un the "name" column.

I'm thinking there's a problem with the index and/or catalog as if it didn't complete the rebuild but don't know how to troubleshoot them.  I didn't see anything unusual in the SQL Logs, either.
Maybe the ones not found are located in DOCX documents as per your original question comments and I believe these are not supported by default SQL 2012.

http://social.msdn.microsoft.com/Forums/en-US/2346bf93-d7e9-4720-aac2-de71d866cb3e/sql-server-2012-filetables-document-types-docx?forum=sqldatabaseengine


Try run query below against your FT database:

select * from sys.fulltext_document_types

and see if DOCX is listed in the result set
That sounded like another great idea.  Unfortunately, some of the doc types not found are DOC's and some are DOCX's.  Your query resulted in this:User generated imageMy existing queries are pulling up some Names and E-mail Addresses from DOCX files.  

Here's another tidbit.  Each night, we delete all of the documents in the FILESTREAM Share and copy a fresh set of files from our main File Server.  I used to use the DEL command, but that blew up my LOG file.  So, I created a query that does it better:
TRUNCATE TABLE RESUMES.dbo.ftbl_resume_files
CHECKPOINT
DBCC SHRINKFILE('RESUMES_log', 1000)
ALTER FULLTEXT CATALOG [ResumeFullTextCatalog] REBUILD

Open in new window

Then I have a separate job that copies the files over.  Do you think this could be part of the problem?
I haven't received a reply since Monday.  Is there anyone that can help??  Here's a real example again.  These queries work:
SELECT	* 
FROM	RESUMES.dbo.ftbl_resume_files 
WHERE	CONTAINS(file_stream, 'wchestnut@gmail.com')

SELECT	* 
FROM	RESUMES.dbo.ftbl_resume_files 
WHERE	LEFT(name,4) = 'Keay'

Open in new window

But neither of these work:
SELECT	* 
FROM	RESUMES.dbo.ftbl_resume_files 
WHERE	CONTAINS(name, 'Keay')

SELECT	* 
FROM	RESUMES.dbo.ftbl_resume_files 
WHERE	CONTAINS(file_stream, 'john_keay@madeup.com')

Open in new window

I verified the document does contain the (real) e-mail address.
Here's another update:

I opened one of the original Word DOC files that can't be found with CONTAINS but could with LEFT(name,4), then I saved it as a TXT (for kicks) and a DOCX file.  Then I went back and ran the CONTAINS query -- and both showed up, yet the original DOC does not and they're all in the same folder.

Then I opened the original Word DOC again, made a change, then Saved it (as DOC).  I was hoping an Index/Catalog would be updated, but it did not after running the CONTAINS query again.

So, for some reason, this DOC is not being indexed/cataloged but it is showing up in the FileTable.
ASKER CERTIFIED SOLUTION
Avatar of wchestnut
wchestnut
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial