Link to home
Start Free TrialLog in
Avatar of sedostal
sedostal

asked on

What could a Database reboot be doing to fix a hanging query

If you have a query that is stuck in processing, what would a database reboot be doing that would fix the query after restarting Statistics don't appear to be involved here as they are updated and are not changed before, and after, the reboot. This has happned 3 times where a reboot of the database fixed the problem.

We are using MSSQL 2008 ( only version supported by the application) on WIndows server 2008

I have query that that can hang will , only at times, but not all the time  ( This is inside a cobol program)  When the database is rebooted, the query completes in 10 minutes. What would the reboot be doing to fix the problem with the query.  Here are the symptoms of  of the query duing the hanging process and what was done up to this point.

Symtoms: Numbered 1-6
When running this statement
1. SELECT session_id,status,row_count,reads,writes,logical_reads,cpu_time FROM sys.dm_exec_sessions for that SPID, (all numbers run to a certain point and then freeze.

2. When the TEMPDB is examined , the numbers freeze and CPU time is stopped
FROM sys.dm_db_file_space_usage;
SELECT
                    sys.dm_exec_sessions.session_id AS [SESSION ID],
                    DB_NAME(database_id) AS [DATABASE Name],
                    HOST_NAME AS [System Name],
                    program_name AS [Program Name],
                    login_name AS [USER Name],
                    status,
                    cpu_time AS [CPU TIME (in milisec)],
                    total_scheduled_time AS [Total Scheduled TIME (in milisec)],
                    total_elapsed_time AS    [Elapsed TIME (in milisec)],
                    (memory_usage * 8)      AS [Memory USAGE (in KB)],
                    (user_objects_alloc_page_count * 8) AS [SPACE Allocated FOR USER Objects (in KB)],
                    (user_objects_dealloc_page_count * 8) AS [SPACE Deallocated FOR USER Objects (in KB)],
                    (internal_objects_alloc_page_count * 8) AS [SPACE Allocated FOR Internal Objects (in KB)],
                    (internal_objects_dealloc_page_count * 8) AS [SPACE Deallocated FOR Internal Objects (in KB)],
                    CASE is_user_process
                                         WHEN 1      THEN 'user session'
                                         WHEN 0      THEN 'system session'
                    END         AS [SESSION Type], row_count AS [ROW COUNT]
FROM sys.dm_db_session_space_usage INNER join sys.dm_exec_sessions
     ON sys.dm_db_session_space_usage.session_id = sys.dm_exec_sessions.session_id
3. dbcc opentran shows the LSN has stopped moving for the SPID and but continues to show the transaction running
4. The query does show locks on the tables (sp_lock2) but the locks are exlusive to that SPID
5. When the database is refreshed to another instance, the process runs fine.
6. Several times, when this happened, the query was trapped in the DBCC inputbuffer( spid ) and analyzed.  statistics were updated after the hang, but it was only the  reboot still fixed the problem.

Other things tried at this point.
Statistics were updated on all tables that are involved in the query trapped by the DBCC Input buffer.
Before the reboot, the DBCC FREEPROCCACHE was executed and the attempt was made again, still the query hangs in a "running" state.
Avatar of David Favor
David Favor
Flag of United States of America image

When you reboot your server, the query is aborted.

A better way to think about this is all queries are stopped when a reboot occurs.

You can accomplish the same thing by bouncing your database server (bounce == stop/restart), which will also abort all current queries.

Hopefully your SQL version supports the EXPLAIN statement, so just insert EXPLAIN in front of your entire query, on the command line using whatever command line process your product provides.

More than likely this will turn up some sort of lock contention or searching a table, returning every row, which means adding an index may speed up your query, so it finishes.
Avatar of sedostal
sedostal

ASKER

Actually, the query is aborted(stopped thru the application ( Cobol that's running it) .  The stopping and starting of the query has been tried with each time the hanging symptom re-surfacing. Its only the database reboot that fixes the problem.  Is it possible that one of two things are happening:
There is one table with 1.2 million rows, when the database it rebooted, there is room in the memory for that table but later, with other queries, its removed and has to be retrieved from disk.  This table is also used several times in the same query. is it possible, that with the delay, the query is locking within itself causing these symptoms?
Run EXPLAIN + likely the answer will come up.

For example, if your query is reading 1.2 million rows without using an index, likely adding an index will help.

EXPLAIN is your friend.
I follow your concept of the index.  :)
but I would think that the problem would resurface after a reboot. It actual runs fine after a reboot.  It goes from never finishing to finishing in 3 minutes.  The system as a whole, is not stressed, Cache hit ratio is fine, CPU's are on idle and this job is executed on off hours and even than, will only complete after a reboot.

I was just curious if someone had thoughts on what the reboot was doing to make this work vs the normal performance/indexing/stats issue.
We were able to re-create the hanging problem in a test environment. First, it ran fine, then we selected a series of very large tables and ran processes, re-ran the original process and it hung.

After experimenting with removing rows from the biggest tables used in the query and not having any luck and having the process continue to hang, the execution of the
DBCC FREESYSTEMCACHE('ALL');then runs to success.  
This is Microfocus cobol program running on
Microsoft SQL Server 2008 (SP2) - 10.0.4000.0 (X64)   Sep 16 2010 19:43:16   Copyright (c) 1988-2008 Microsoft Corporation  Enterprise Edition (64-bit) on Windows NT 6.0 <X64> (Build 6002: Service Pack 2)
Any idea what still could be causing this?
Post the output from EXPLAIN + likely someone can assist you.
How would everyone like the output, xml format ... screen shots of the explain plan graphics?
Output EXPLAIN as text + highlight + click the CODE menu item, before you hit submit.
I have attached an explain plan. I'm running 2008 that probably doesn't have the code snip it feature.  

What we are seeing here in the explain plan are work tables that populated at the time of the run, this may not be a good reflection of what is happening.  I might state here that I'm still getting 2 minute run completion run when the database is rebooted and prior to the reboot, it doesn't finish even after 4+ hours of running indicating to me that maybe a poor part of the query is consuming memory.

This Peoplesoft cobol program has complaints everywhere regarding this problem and more proposed "fixes" then I think I've ever seen.
ExplainPlan0719.txt
ASKER CERTIFIED SOLUTION
Avatar of David Favor
David Favor
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
I agree with ya. This is very odd.   Out of about 5,000 programs and 14,000 data tables and this is the only one that I'm having issues with, its really looking like something off the wall.   I have some ideas and will continue to work the problem
Good luck!
Not sure what the reboot is doing to fix the problem yet, but at least I'm not missing something obvious.
Thank you David.