wchestnut
asked on
SQL 2012 FileTable Full Text Search Issue
I have a FileTable that consists mostly of resumes in Word DOC/DOCX format. For some reason, I have certain people who's e-mail address or name is not being found even though their resume exists in the FileStream Share. I even opened their resume, copied the e-mail address directly from it and pasted it into the query (not real e-mail address):
I've tried initiating Rebuilds on all Indexes and the Catalog, but when they're finished, the same names and e-mail addresses aren't found. I'm new to this so I don't even know where to start looking. Any help would be appreciated.
SELECT	*
FROM	RESUMES.dbo.ftbl_resume_files
WHERE	CONTAINS(file_stream, 'john.smith@yahoo.com')
But, I can find my own resume with my e-mail address and many others.I've tried initiating Rebuilds on all Indexes and the Catalog, but when they're finished, the same names and e-mail addresses aren't found. I'm new to this so I don't even know where to start looking. Any help would be appreciated.
BTW....."I have a FileTable that consists mostly of resumes in Word DOC/DOCX format."
DOCX is not a supported default document type in SQL 2012 - try the select below to see supported attachment types and also:
http://social.msdn.microsoft.com/Forums/en-US/2346bf93-d7e9-4720-aac2-de71d866cb3e/sql-server-2012-filetables-document-types-docx?forum=sqldatabaseengine
select * from sys.fulltext_document_type s
-- as you can see DOCX is not listed as defaults therefor most likely those emails are those type of documents only.
http://technet.microsoft.com/en-us/library/ms143272.aspx
"
Behavior Changes in Full-Text Search in SQL Server 2012
SQL Server 2012 installs a new version of the word breakers and stemmers for US English (LCID 1033) and UK English (LCID 2057). However you can switch to the previous version of these components if you want to retain the previous behavior. For more information, see Change the Word Breaker Used for US English and UK English.
"
http://technet.microsoft.com/en-us/library/ms187787.aspx
"
CONTAINS can search for:
•A word or phrase.
•The prefix of a word or phrase.
•A word near another word.
•A word inflectionally generated from another (for example, the word drive is the inflectional stem of drives, drove, driving, and driven).
•A word that is a synonym of another word using a thesaurus (for example, the word "metal" can have synonyms such as "aluminum" and "steel").
"
Therefor try for instance:
SELECT *
FROM RESUMES.dbo.ftbl_resume_fi les
WHERE CONTAINS(file_stream, 'john and smith and yahoo')
DOCX is not a supported default document type in SQL 2012 - try the select below to see supported attachment types and also:
http://social.msdn.microsoft.com/Forums/en-US/2346bf93-d7e9-4720-aac2-de71d866cb3e/sql-server-2012-filetables-document-types-docx?forum=sqldatabaseengine
select * from sys.fulltext_document_type
-- as you can see DOCX is not listed as defaults therefor most likely those emails are those type of documents only.
http://technet.microsoft.com/en-us/library/ms143272.aspx
"
Behavior Changes in Full-Text Search in SQL Server 2012
SQL Server 2012 installs a new version of the word breakers and stemmers for US English (LCID 1033) and UK English (LCID 2057). However you can switch to the previous version of these components if you want to retain the previous behavior. For more information, see Change the Word Breaker Used for US English and UK English.
"
http://technet.microsoft.com/en-us/library/ms187787.aspx
"
CONTAINS can search for:
•A word or phrase.
•The prefix of a word or phrase.
•A word near another word.
•A word inflectionally generated from another (for example, the word drive is the inflectional stem of drives, drove, driving, and driven).
•A word that is a synonym of another word using a thesaurus (for example, the word "metal" can have synonyms such as "aluminum" and "steel").
"
Therefor try for instance:
SELECT *
FROM RESUMES.dbo.ftbl_resume_fi
WHERE CONTAINS(file_stream, 'john and smith and yahoo')
ASKER
That statement didn't work, either. I see where you were going with this... but supposedly the new components in 2012 are supposed to include more not less results. An e-mail address should be considered a "phrase" according to that article and they list "cat@dog.com" should be found under the new word breakers and stemmers. And it doesn't explain why some whole e-mail addresses are found while others are not.
An e-mail address should be considered a "phrase" according to that article and they list "cat@dog.com" should be found under the new word breakers and stemmers.
So that means you have attempted to search for email addresses as phrases? As in:
SELECT *
FROM RESUMES.dbo.ftbl_resume_fi les
WHERE CONTAINS(file_stream, '"john.smith@yahoo.com"')
So that means you have attempted to search for email addresses as phrases? As in:
SELECT *
FROM RESUMES.dbo.ftbl_resume_fi
WHERE CONTAINS(file_stream, '"john.smith@yahoo.com"')
ASKER
Yes. That works for most other e-mail addresses, but not for others. It's inconsistent as if the index doesn't include them. The same goes for "this and that" name searches as well. Some names are found but not others yet I can see the names un the "name" column.
I'm thinking there's a problem with the index and/or catalog as if it didn't complete the rebuild but don't know how to troubleshoot them. I didn't see anything unusual in the SQL Logs, either.
I'm thinking there's a problem with the index and/or catalog as if it didn't complete the rebuild but don't know how to troubleshoot them. I didn't see anything unusual in the SQL Logs, either.
Maybe the ones not found are located in DOCX documents as per your original question comments and I believe these are not supported by default SQL 2012.
http://social.msdn.microsoft.com/Forums/en-US/2346bf93-d7e9-4720-aac2-de71d866cb3e/sql-server-2012-filetables-document-types-docx?forum=sqldatabaseengine
Try run query below against your FT database:
select * from sys.fulltext_document_type s
and see if DOCX is listed in the result set
http://social.msdn.microsoft.com/Forums/en-US/2346bf93-d7e9-4720-aac2-de71d866cb3e/sql-server-2012-filetables-document-types-docx?forum=sqldatabaseengine
Try run query below against your FT database:
select * from sys.fulltext_document_type
and see if DOCX is listed in the result set
ASKER
That sounded like another great idea. Unfortunately, some of the doc types not found are DOC's and some are DOCX's. Your query resulted in this:My existing queries are pulling up some Names and E-mail Addresses from DOCX files.
Here's another tidbit. Each night, we delete all of the documents in the FILESTREAM Share and copy a fresh set of files from our main File Server. I used to use the DEL command, but that blew up my LOG file. So, I created a query that does it better:
Here's another tidbit. Each night, we delete all of the documents in the FILESTREAM Share and copy a fresh set of files from our main File Server. I used to use the DEL command, but that blew up my LOG file. So, I created a query that does it better:
TRUNCATE TABLE RESUMES.dbo.ftbl_resume_files
CHECKPOINT
DBCC SHRINKFILE('RESUMES_log', 1000)
ALTER FULLTEXT CATALOG [ResumeFullTextCatalog] REBUILD
Then I have a separate job that copies the files over. Do you think this could be part of the problem?
ASKER
I haven't received a reply since Monday. Is there anyone that can help?? Here's a real example again. These queries work:
SELECT *
FROM RESUMES.dbo.ftbl_resume_files
WHERE CONTAINS(file_stream, 'wchestnut@gmail.com')
SELECT *
FROM RESUMES.dbo.ftbl_resume_files
WHERE LEFT(name,4) = 'Keay'
But neither of these work:
SELECT *
FROM RESUMES.dbo.ftbl_resume_files
WHERE CONTAINS(name, 'Keay')
SELECT *
FROM RESUMES.dbo.ftbl_resume_files
WHERE CONTAINS(file_stream, 'john_keay@madeup.com')
I verified the document does contain the (real) e-mail address.
ASKER
Here's another update:
I opened one of the original Word DOC files that can't be found with CONTAINS but could with LEFT(name,4), then I saved it as a TXT (for kicks) and a DOCX file. Then I went back and ran the CONTAINS query -- and both showed up, yet the original DOC does not and they're all in the same folder.
Then I opened the original Word DOC again, made a change, then Saved it (as DOC). I was hoping an Index/Catalog would be updated, but it did not after running the CONTAINS query again.
So, for some reason, this DOC is not being indexed/cataloged but it is showing up in the FileTable.
I opened one of the original Word DOC files that can't be found with CONTAINS but could with LEFT(name,4), then I saved it as a TXT (for kicks) and a DOCX file. Then I went back and ran the CONTAINS query -- and both showed up, yet the original DOC does not and they're all in the same folder.
Then I opened the original Word DOC again, made a change, then Saved it (as DOC). I was hoping an Index/Catalog would be updated, but it did not after running the CONTAINS query again.
So, for some reason, this DOC is not being indexed/cataloged but it is showing up in the FileTable.
ASKER CERTIFIED SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
SOLUTION
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
ASKER
Open in new window