Steve Wales

asked on

Alert Log shows Thread 1 cannot allocate new log

Environment:

Linux x64 Virtual Machine
Oracle 11.2.0.4

I have a production database that keeps spitting out this message:

Thread 1 cannot allocate new log, sequence xxx.

I know what triggers the message: a redo log has filled and the database tries to move to the next one in the cycle, but that log is still marked as active, so the database must first make sure that all the blocks it protects are written to disk before the online log space can be reused.

I have been researching this error message all over the place.  More often than not, it appears to be reported by people who are having very fast log switches and DBWn can't keep up with the furious pace of the log switches.

The general consensus seems to be "add more log files or increase the size of the redo logs until the message goes away" (Tom Kyte has said this in several replies on AskTom).

However, that's not my issue.  I started with 3 log groups, 2 members per group, at 50 MB each.

I run archive log backups every hour.  RMAN does a log switch automatically when an archive log backup occurs.  Average redo generation appears to be between 30 and 40 MB/hr, so my logs rarely fill and do an automatic switch.

Following the advice (although it didn't make much sense to me) I added a 4th, then a 5th and eventually a 6th redo log group.  Still the message appears.

Checking v$log, it is not unusual to see the current log marked as CURRENT and all the other 5 marked as ACTIVE.  If I understand this correctly, a redo log is marked as current as long as some data block it is protecting is still in memory and hasn't been written to disk yet.

If my understanding is correct, this can mean (6 log groups, swapping once per hour) that I can have dirty blocks in memory for 5-6 hours that haven't been written to disk.

If I manually issue ALTER SYSTEM CHECKPOINT and check v$log again, all 5 CURRENT logs become INACTIVE.
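For reference, the check described above can be reproduced with something like the following (a sketch only, run as a user with SELECT privilege on v$log and the ALTER SYSTEM privilege):

```sql
-- Show the state of each online redo log group
SELECT group#, sequence#, status, first_time
  FROM v$log
 ORDER BY group#;

-- Force a full checkpoint; groups marked ACTIVE should
-- show as INACTIVE on the next query of v$log
ALTER SYSTEM CHECKPOINT;
```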

I started mucking around trying to get checkpoints to occur more frequently automatically.  Since this database is Standard Edition I can't use FAST_START_MTTR_TARGET (which seems to be the preferred way to do it).  I've been playing with LOG_CHECKPOINT_INTERVAL but can't seem to get Oracle to checkpoint more frequently.  I started with a value of 2500 (the OS block size is 4K, so 2500 blocks is 10 MB - I figured that should trigger a checkpoint).  I've lowered it all the way down to 1000 at the moment and still am not seeing any change.
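For the record, the changes I tried look like this (a sketch; LOG_CHECKPOINT_INTERVAL is specified in OS blocks, and SCOPE = BOTH assumes the instance uses an spfile):

```sql
-- 2500 OS blocks x 4K = 10 MB of redo between incremental checkpoints
ALTER SYSTEM SET LOG_CHECKPOINT_INTERVAL = 2500 SCOPE = BOTH;

-- Later lowered to 1000 blocks (4 MB) - still no visible change
ALTER SYSTEM SET LOG_CHECKPOINT_INTERVAL = 1000 SCOPE = BOTH;
```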

The wait every hour on the log switch is only 2-3 seconds which I guess isn't the end of the world (I'd rather it went away though), but I'm concerned about unsaved dirty blocks.

I am seeing in the alert log that it will say that it's starting a checkpoint for a particular marker .... but then it won't come up as complete for anywhere from 20 to 40 minutes for that particular checkpoint.  It's almost like it starts but then doesn't have the priority to complete it.  However if I manually execute a checkpoint, it tidies everything up in seconds.

The VM administrator is seeing nothing server side that concerns him.  No excessive CPU or disk activity other than when expected (during backups, etc.).

Other possible solutions I have seen for this involve enabling async I/O or increasing the number of DBWR processes (if you have the former, the latter is best left alone).  I do not have async I/O enabled at the moment, but I wonder if it is even relevant since I do not have frequent log switches going on.

So, after all of the above I guess my question is twofold.

* Do I understand what CURRENT in the v$log status means (do I have dirty blocks in the buffer cache sitting there for hours)?
* Is there a way to make Oracle checkpoint more frequently and set my redo logs back to INACTIVE more frequently?

Thanks in advance for any insight into what I've got going on here.
David VanZandt

Excellently stated question :)  Just for a laugh, please eyeball the free storage left on the device that holds the archive log(s).  Weirder things have happened.....
Steve Wales (Asker)
Forgot to mention in my question:

Archive destination has plenty of free space (I monitor that closely)
Tablespaces and internal storage are also fine.
SOLUTION
johnsone
That is a great report.  I will save that.

Usually 1 swap per hour - which is what my perusing of the alert log is telling me.  Definitely not switching too fast.  (Except the 1 AM hour, where there's some process running that pumps out about 15 of them - but would that have an impact on the rest of the day?)

My archive destination is a 20GB file system, currently using 13.5 GB, 6.5 GB free.

I'm generating at most 70 logs a day @ 50 MB each which is 3.5 GB.  The file system space is monitored, it never breaks the 90% threshold.  So it's not that.

My main concern here comes back to the point of: do I have dirty buffers sitting in the buffer cache hours and hours after they have been committed?

This is a concern for me.

If I issue ALTER SYSTEM CHECKPOINT, I know all the logs immediately get set to INACTIVE (except for the CURRENT one, of course), and I'll set that up in the job scheduler if I have to, but I really shouldn't have to.
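If I do end up scheduling it, it would be something along these lines (a sketch only - the job name is made up, and the PL/SQL block just wraps the checkpoint command):

```sql
BEGIN
  DBMS_SCHEDULER.create_job(
    job_name        => 'HOURLY_CHECKPOINT',   -- hypothetical name
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN EXECUTE IMMEDIATE ''ALTER SYSTEM CHECKPOINT''; END;',
    repeat_interval => 'FREQ=HOURLY',
    enabled         => TRUE);
END;
/
```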

Oddly enough, last time I forced a checkpoint (so I had 1 CURRENT and 5 inactive) I still got the 'cannot allocate log' message.

I am completely confused on this one.
SOLUTION
As a matter of course, I review my alert logs daily - I have a script I've developed over the years - where I look for certain things and investigate them as they come up.  Once I've discovered that certain messages don't mean anything, I exclude them (like the one about private strand flush not complete or whatever it is - the developers on MOS say ignore it).

I researched this message (I've seen it pop up from time to time but not with the frequency that occurs in this one DB - this one is only 3-4 months old).

>>Are you equating this as a bad thing?

The information I was able to find on this message indicated a possible performance issue - when the log switch occurs and the database can't write to a log, it has to kick DBWR in the pants and clear the next log that needs writing to.

For the database in question, there have been no complaints about any performance issues, in fact the users are deliriously happy since this new machine is performing much better than the old SQL Server the app came from.  However, since the notes I had read indicated an issue, I was trying to make it go away.

>>I understand the fear of possible 'dirty' blocks but as long as they are on disk, does it really matter?

That depends, I suppose.  I understand that if they are in the online redo logs and a crash occurs, then the blocks that need amending can be read from the online redo logs instead of the data files.  

Maybe I'm just being overly cautious, but having them in the data file where they belong gives me the warm and fuzzies.  We've both seen questions here before talking about backups and data security and insurance etc and I like warm and fuzzies.

>>What is the 'issue' you are trying to 'fix' here?

Just trying to make a message that I possibly thought was indicative of a problem go away.

>>I believe current is just, well, the current one.

I have spent quite a bit of time the last few days researching and trying things.

I mis-stated in my question - I meant ACTIVE, not CURRENT (shame on you for not being able to read my mind and understand what I meant instead of what I wrote :) ).  

Current is of course the current one.  Active appeared to mean that the online redo logs were still needed because they were covering dirty blocks in the buffer cache - I was seeking confirmation.  The docs (here) say this:

ACTIVE - Log is active but is not the current log. It is needed for crash recovery. It may be in use for block recovery. It may or may not be archived.
 
That doesn't really address the question - says it's needed for recovery but doesn't say why.  If it was a recently written one I could understand it, but an older one ?

>>Do you think ACTIVE is 'bad' and INACTIVE 'good'?

Not necessarily, no.  The fact that it's needed for crash recovery was indicative to me that there may still be dirty blocks that needed attention - especially when the log marked ACTIVE that is to become current upon the next log switch contains redo that is 5 to 6 hours old.  THAT is what concerned me.

>>I really think you might be chasing a non-issue here.

You may be right.  I was about "this far" (holds finger and thumb really close together) from adding this message to my ignore list in my monitoring script - but I have multiple databases I watch over, from 8i (only one of these dinosaurs left and it's almost dead) through 11gR2 - on HPUX, Windows and Linux - and all Enterprise Edition - and I never saw this issue on any of them before (except maybe during backups it would pop up briefly).

The only thing that I can see as different here (possibly) is that I'm working with Standard Edition (maybe SE One, can't remember) for this one and wondered if there was something different.

It may be that this is a storm in a teacup.  I'll leave this question open another day or two in case anyone else wants to throw in their two cents on the issue - but your example on the XE one does make one think.

I have an XE instance on my laptop.  I tried out your example and saw exactly the same thing.

Thanks to all participants in this discussion so far.
ASKER CERTIFIED SOLUTION
That much is true - Oracle is VERY robust - I've had servers die, disks fail and goodness knows what else happen in the last dozen years and apart from having a sysadmin actually physically do a "rm *" on a database directory once, it's always just come back up.

(Of course there was also the time when a sysadmin removed this file from "/" called "hpux" back in the HPUX 9.x days because it was taking up all his root volume space, but that's another story completely :) )

Alrighty, will see if anyone else has anything to add on this one, if not I'll close it and split up some points early next week - you've pretty much convinced me.
slightwv (䄆 Netminder)

>>apart from having a sysadmin actually physically do a "rm *" on a database directory once

I had a developer looking for space 'rm' a database file once.  He had management convinced that a server reboot crashed the database.  A review of shell histories showed the rm command on an unused database file, so the database remained up and usable until the reboot.


>> back in the HPUX 9.x days

I almost remember them!  I like HP-UX but mainly remember the 10.x days...
SOLUTION
Thanks to all for your inputs.  You've convinced me that this message can be ignored.  Some good information came out of this discussion too (including a nice script!).