TempDB log grows without bounds

We're experiencing periods where our tempdb log grows until it fills up the disk.  Its normal operating size is about 2 GB, but every once in a while it will grow past 100 GB.

Here are the facts:

- Tempdb is by default set to simple recovery and is supposed to truncate when the log exceeds 70% full.
- In our case it will blow right through that 70% mark and proceed to grow the log.
- The problem is intermittent.  We have had some success reproducing it by running a FULLSCAN stats update along with a less-than-ideal proc that uses an overly simplistic split function that returns its output in a table variable.
- The system is a VMware 8-core virtual machine.  There is currently only one tempdb file.  However, we've resisted adding more files as tempdb contention is not evident (we used the query here: http://www.sqlsoldier.com/wp/sqlserver/breakingdowntempdbcontentionpart2)
- Manually checkpointing tempdb does not seem to help the issue.
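For context, the problematic pattern looked roughly like the following. This is a hypothetical sketch of that style of splitter, not our actual proc; the function name and shape are illustrative only:

```sql
-- Hypothetical sketch of a naive WHILE-loop splitter that returns a
-- table variable (the general shape of the proc in question).
CREATE FUNCTION dbo.NaiveSplit (@csv varchar(8000), @delim char(1))
RETURNS @out TABLE (Item varchar(8000))
AS
BEGIN
    DECLARE @pos int = CHARINDEX(@delim, @csv);
    WHILE @pos > 0
    BEGIN
        INSERT INTO @out (Item) VALUES (LEFT(@csv, @pos - 1));
        SET @csv = SUBSTRING(@csv, @pos + 1, 8000);
        SET @pos = CHARINDEX(@delim, @csv);
    END;
    -- Last (or only) element.
    INSERT INTO @out (Item) VALUES (@csv);
    RETURN;
END;
```

Each call materializes its table variable in tempdb, so calling something like this inside a loop peppers tempdb with small allocations, as described above.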

To be clear, the proc is a monstrosity - it splits a CSV varchar and returns a table variable.  Besides not using one of the recommended split implementations, it is called in a loop, which ends up peppering tempdb with new table variables.  However, sometimes the problem will occur, and other times we can run three of these procs at a time and the tempdb log will truncate and recycle as it should, never growing past its 2 GB.

The only reason I've seen for the tempdb log not auto-truncating is when the I/O is under such load that the checkpoint won't run.  But so far, no I/O metrics show such a load - disk queue stays low, I/O utilization is only about 20%.

We're resolving the issue with this proc, but I'd like evidence of why the log isn't truncating: finding a particular wait, identifying that the system at least tried to checkpoint and/or seeing that it was delayed, etc.  We've spent a lot of time finding this proc, but it would be nice to finish the lesson as to what we should be looking for and why.
b_levittAsked:
b_levittAuthor Commented:
@Anthony Perkins

I understand it can only use simple recovery.  That's why I said "by default"; I was trying to head off any "make sure it's in simple recovery" comments.

I did not configure it to truncate.  Databases in simple recovery mode autotruncate when they are 70% full.  Tempdb is no exception.

I do have additional data, though, and for now it may be enough to satisfy me: sys.databases showed the log_reuse_wait reason was ACTIVE_TRANSACTION.  While it's not clear which transaction was responsible, at least it's good to know that it was a transaction.
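For reference, that reuse-wait reason comes straight from sys.databases:

```sql
-- Check why the tempdb log cannot currently be truncated/reused.
-- ACTIVE_TRANSACTION means an open transaction is pinning the log;
-- NOTHING means the log is free to recycle.
SELECT name, log_reuse_wait, log_reuse_wait_desc
FROM sys.databases
WHERE name = 'tempdb';
```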



 
Anthony PerkinsCommented:
Tempdb is by default set to simple recovery and is supposed to truncate when the log exceeds 70% full.
That is not the way it works.  Tempdb can only use the simple recovery model, and you cannot configure it to truncate at any percentage.  Truncation happens when tempdb is done with the specific process.
 
Scott PletcherSenior DBACommented:
First, replace the bad splitter function.  DelimitedSplit8K is readily available; I can give you the code if needed.

Next, FULLSCAN stats updates on very large tables should be run one at a time if possible, rather than several at once.

Also, look for snapshot isolation being enabled on a large database, particularly during index rebuilds and other maintenance, as that can cause heavy tempdb use.  Replication could also do it, although that's usually less likely, since you'd probably be aware of it.

Finally, remember that any large sort will use tempdb, so very large sorted queries, reports, etc., could also consume a lot of tempdb.
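To see which of these candidates (stats updates, sorts/hashes, version store) is actually consuming tempdb, a query along these lines against the standard space-usage DMV can help; this is a sketch, run in the tempdb context:

```sql
-- Per-session tempdb allocations for the current database context:
-- user objects (temp tables, table variables) vs internal objects
-- (sorts, hashes, spools).
USE tempdb;
SELECT  session_id,
        user_objects_alloc_page_count     AS user_obj_pages,
        internal_objects_alloc_page_count AS internal_obj_pages
FROM    sys.dm_db_session_space_usage
ORDER BY internal_objects_alloc_page_count DESC;
```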
 
Anthony PerkinsCommented:
I understand it can only use simple recovery.  That's why I said "by default",
I guess my definition of "by default" and yours are different, but at least we agree there is no other option.

Databases in simple recovery mode autotruncate when they are 70% full.  
Are you referring to this:
Also, under the simple recovery model, an automatic checkpoint is also queued if the log becomes 70 percent full.
The part you may be overlooking is:
Under the simple recovery model, unless some factor is delaying log truncation, an automatic checkpoint truncates the unused section of the transaction log.

While it's not clear which transaction was doing it, at least it's good to know that it was a transaction.
Have you checked DBCC OPENTRAN() ?
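For reference, that check is a one-liner; it reports the oldest active transaction in the named database, if any:

```sql
-- Show the oldest active transaction in tempdb (SPID, start time,
-- transaction name); returns "No active open transactions" otherwise.
DBCC OPENTRAN ('tempdb');
```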
 
b_levittAuthor Commented:
I guess I'm not sure what you're telling me.  If you're trying to make sure everything is semantically correct, OK.  But if you're trying to say that truncation is not guaranteed, I think I recognized that:

The only reason I've seen for the tempdb log not auto-truncating is when the I/O is under such load that the checkpoint won't run.  But so far, no I/O metrics show such a load - disk queue stays low, I/O utilization is only about 20%.

As for your other question:
Have you checked DBCC OPENTRAN() ?

Yes, but unfortunately log_reuse_wait only shows the reason the truncation failed the last time, and we never saw one very long transaction.  If we catch it again I'd try this query with the sys.dm_tran_* views:
http://www.sqlskills.com/blogs/paul/script-open-transactions-with-text-and-plans/

The fact that it was intermittent also made it difficult.  What we believe is the same scenario would trigger this condition sometimes but not other times.  To be honest, it felt like once it missed the automatic truncation at 70% it didn't try again until the system was nearly idle.  In most cases DBCC SQLPERF(LOGSPACE) would show the log truncating right at 70%; in the problem case, it would pass 70% and never come down.
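A simplified sketch of that kind of check, pairing the log-usage snapshot with the sys.dm_tran_* views (this is an assumption of the general approach, not the linked script itself):

```sql
-- Snapshot: % of each database's log currently in use.
DBCC SQLPERF (LOGSPACE);

-- Oldest open transactions and their owning sessions.
SELECT  st.session_id,
        at.transaction_id,
        at.name,
        at.transaction_begin_time,
        DATEDIFF(second, at.transaction_begin_time, GETDATE()) AS open_seconds
FROM    sys.dm_tran_active_transactions AS at
JOIN    sys.dm_tran_session_transactions AS st
        ON st.transaction_id = at.transaction_id
ORDER BY at.transaction_begin_time;
```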
 
Anthony PerkinsCommented:
The fact that it was intermittent also made it difficult.
What you need is something like sp_WhoIsActive, and to log the results to a table; that way you can look back and find the exact query that was the root cause.
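A minimal sketch of that approach, assuming sp_WhoIsActive is installed and using a logging-table name of our own choosing:

```sql
-- One-time setup: have sp_WhoIsActive generate the DDL for the log table.
DECLARE @schema varchar(max);
EXEC sp_WhoIsActive @get_transaction_info = 1,
                    @return_schema = 1,
                    @schema = @schema OUTPUT;
-- dbo.WhoIsActiveLog is an illustrative name, not a standard one.
SET @schema = REPLACE(@schema, '<table_name>', 'dbo.WhoIsActiveLog');
EXEC (@schema);

-- Then run this on a schedule (e.g. an Agent job every 30-60 seconds):
EXEC sp_WhoIsActive @get_transaction_info = 1,
                    @destination_table = 'dbo.WhoIsActiveLog';
```

When the log blows up again, the table gives a timeline of sessions, open transactions, and log usage to correlate against.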
 
b_levittAuthor Commented:
Found the log_reuse_wait indicator.
Question has a verified solution.
