Solved

Full Text Indexing problem with PDF Ifilter

Posted on 2005-03-11
Last Modified: 2012-06-21
I am having a problem indexing large PDF files. I have two large PDFs, both around 23 MB and around 2000 pages. When the gatherer tries to index them it fails and retries. It appears to be a 30-second timeout failure, as CPU usage drops after 30 seconds and then ramps up again.

It retries repeatedly without moving on to other documents, effectively getting stuck. It does not log an error in the Windows Event Log or the SQL Server log. It logs the following in the gatherer log:


09/03/2005 14:38:24 Add The gatherer has started
09/03/2005 14:38:26 Add The initialization has completed
09/03/2005 16:10:36 Add The gatherer has started
09/03/2005 16:10:40 Add The recovery has completed
09/03/2005 16:45:06 MSSQL75://SQLServer/76cba758/F87750AC4AACBF4BA9F2816993FBE5EA Add Error fetching URL, (80041201 - The object was not found. )


However, this is only logged after the PDF files are deleted from the document library. Nothing is logged before that.


Other documents in the database get indexed properly (if they were indexed before these PDFs) and the full-text catalogs are searchable. If the two large PDFs are removed then the indexing completes successfully. Other PDFs in the database are indexable and searchable.

I am using SQL Server 2000 SP3, Adobe IFilter 6.0, and Windows 2003. The database is a Windows SharePoint Services content database.

Any ideas?
Question by:noelkennedy

Author Comment

by:noelkennedy
ID: 13555438
I have fixed this myself. The fix is SharePoint specific, so much of what follows will be irrelevant to general full-text users (though there is plenty of general material, so if you don’t care about SharePoint, skip the second paragraph). That said, I don’t even know whether my problem exists outside of SharePoint.

On further investigation I discovered that the problem both does and doesn’t exist in SharePoint Portal Server (SPS). When a small web farm is created (one back-end SQL server and one front-end web server that is also the search and index server), the front-end index server fails to index the document. It reports that it has partially indexed it (but I couldn’t get it to show up in any searches). More importantly, the indexing process actually finishes, with errors. This is significant because in Windows SharePoint Services (WSS) the indexing process never finishes: it repeatedly tries again. The reason I say the problem also exists in SPS is that I discovered full-text indexing is turned on in the SPS site database, and this fails in the same way as in WSS. Essentially, the document is being indexed in two places: the front-end index/search server AND the SQL back-end database. If you open WSS Central Administration in the farm and turn off searching at the WSS level, the full-text catalogs are deleted in SQL for the SPS site database. You can still search WSS sites from Portal but not from within WSS. This means that documents stored in the Portal areas are indexed through full-text indexing in the back-end SQL database as well as in the index catalogs on the front-end web servers.

It is going to be difficult to prevent this problem from occurring or to detect automatically when it has occurred. Possible ways to prevent it are to limit the size of files that users can upload or not to index PDFs at all. To spot it ‘in the wild’ you can monitor the CPU usage of the mssdmn.exe process, which performs filtering through the IFilters. If it is ramped up all the time, or repeatedly ramping up and down, then it’s likely you have hit this problem. Another way is to check the full-text catalog status in Enterprise Manager or Query Analyzer. If it is ‘notifications processing’ or ‘change tracking’ for a significant length of time, then it is likely you have hit this problem. Another way to check is to look in the temp directory used by the indexer, usually:

C:\Program Files\Microsoft SQL Server\MSSQL\FTDATA

If this directory contains large PDF files with recent creation dates (within the last couple of minutes) then you are likely to be experiencing the problem. Another way of checking is to use PerfMon:
1. Select the Performance Object: Microsoft Gatherer Projects.
2. Select the Retries counter.
3. Select all the instances (if you have more than one). These instances can be matched back to SQL databases: the number at the end, e.g. SQLServ~1c SQL00009~1c, can be matched to database ID 00009 by using Query Analyzer (SELECT DB_ID() tells you the ID of the current database).
4. These counters should probably be at 0. If they are incrementing at a rate of 1 or 2 per minute, you are probably experiencing the problem.
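
The PerfMon check above could be sketched roughly as follows in Python, assuming the Retries counter has already been sampled by some other means (for example exported from PerfMon to a log). The function names, the `SQL(\d+)` pattern for pulling the database ID out of an instance name, and the default threshold of 1 retry per minute are my own assumptions for illustration, based only on the single instance name and the "1 or 2 per minute" rate quoted above.

```python
import re
from typing import List, Optional, Tuple


def database_id_from_instance(instance: str) -> Optional[int]:
    """Extract the numeric database ID from a gatherer instance name.

    Assumes the ID is the run of digits directly after "SQL",
    as in "SQLServ~1c SQL00009~1c" -> 9.
    """
    match = re.search(r"SQL(\d+)", instance)
    return int(match.group(1)) if match else None


def retries_per_minute(samples: List[Tuple[float, int]]) -> float:
    """Average growth of the Retries counter per minute.

    samples is a list of (timestamp_in_seconds, counter_value) pairs,
    oldest first. Only the first and last samples are compared.
    """
    if len(samples) < 2:
        return 0.0
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        return 0.0
    return (c1 - c0) * 60.0 / elapsed


def looks_stuck(samples: List[Tuple[float, int]],
                threshold_per_minute: float = 1.0) -> bool:
    """True when retries grow at or above the (assumed) threshold."""
    return retries_per_minute(samples) >= threshold_per_minute
```

For example, a counter that goes from 10 to 16 over five minutes gives 1.2 retries per minute, which this sketch would flag; a counter that stays at 0 would not be flagged.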

Obviously none of these are satisfactory!
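
The temp directory check described earlier could also be scripted. This is only a sketch: the function name, the 10 MB size cut-off and the 5 minute age window are assumptions I picked for illustration, and it assumes the temp files keep a .pdf extension.

```python
import os
import time
from typing import List, Optional


def recent_large_pdfs(directory: str,
                      min_bytes: int = 10 * 1024 * 1024,
                      max_age_seconds: float = 300.0,
                      now: Optional[float] = None) -> List[str]:
    """Return paths of large, recently created PDF files in directory.

    st_ctime is the creation time on Windows (the platform discussed
    here); on Unix it is the last metadata change time instead.
    """
    now = time.time() if now is None else now
    hits = []
    for entry in os.scandir(directory):
        if not entry.is_file() or not entry.name.lower().endswith(".pdf"):
            continue
        info = entry.stat()
        age = now - info.st_ctime
        if info.st_size >= min_bytes and age <= max_age_seconds:
            hits.append(entry.path)
    return sorted(hits)
```

It would be called with the FTDATA path given above, e.g. `recent_large_pdfs(r"C:\Program Files\Microsoft SQL Server\MSSQL\FTDATA")`. A non-empty result on repeated runs suggests the indexer keeps re-fetching the same documents.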

Resolution to large PDF problem:
1. Open WSS Central Administration.
2. Under Component Configuration, click Configure Data Retrieval Service Settings.
3. Under Data Source Time Out, set the Request time-out to a number larger than 30 (e.g. 120).
4. Sometimes this fixes the problem straight away. Sometimes you have to rebuild the catalog by going into WSS Central Administration, clicking Configure Full-Text Search, then clicking OK.

I was unable to locate the registry setting that is changed (it might not be a registry setting at all, since I used software to compare the registry before and after the change on both the front-end and back-end servers), so general full-text users are on their own from here (but, as I said earlier, I don’t even know whether my problem exists outside of SharePoint).


Accepted Solution

by:
ee_ai_construct earned 0 total points
ID: 13582602
Question answered by asker or dialog deemed valuable.
Closed, 500 points refunded.
ee_ai_construct (replacement part #xm34)
Community Support Admin