Solved

Alert Log shows Thread 1 cannot allocate new log

Posted on 2014-01-31
6,451 Views
Last Modified: 2014-02-03
Environment:

Linux x64 Virtual Machine
Oracle 11.2.0.4

I have a production database that keeps spitting out this message:

Thread 1 cannot allocate new log, sequence xxx.

I know what causes the message to trigger: a redo log has filled and the database goes to use the next one in the cycle, but that log is still marked as active, so Oracle must first make sure that all the blocks it protects are written to disk before the online log space can be reused.

I have been researching this error message all over the place.  More often than not, it appears to be reported by people who are having very fast log switches and DBWn can't keep up with the furious pace of the log switches.

The general consensus seems to be "add more log files or increase the size of the redo logs until the message goes away" (Tom Kyte has said this in several replies on asktom).

However, that's not my issue.  I started with 3 log groups, 2 members per group, at 50 MB each.

I run archive log backups every hour.  RMAN performs a log switch automatically when an archive log backup occurs.  Average redo generation appears to be between 30 and 40 MB/hr, so my logs rarely fill and switch on their own.
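The hourly RMAN job is essentially this (sketched from memory - the DELETE INPUT clause is illustrative, not necessarily my exact script):

```sql
-- Hourly archive log backup (illustrative sketch)
BACKUP ARCHIVELOG ALL DELETE INPUT;
-- BACKUP ARCHIVELOG ALL first performs the equivalent of
-- ALTER SYSTEM ARCHIVE LOG CURRENT, which is the hourly switch I see
```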

Following the advice (although it didn't make much sense to me) I added a 4th, then a 5th and eventually a 6th redo log group.  Still the message appears.

Checking v$log, it is not unusual to see the current log marked as CURRENT and all the other 5 marked as ACTIVE.  If I understand this correctly, a redo log is marked as current as long as some data block it is protecting is still in memory and hasn't been written to disk yet.

If my understanding is correct, this can mean (6 log groups, switching once per hour) that I can have dirty blocks in memory for 5-6 hours that haven't been written to disk.

If I manually issue ALTER SYSTEM CHECKPOINT and check v$log again, all 5 ACTIVE logs become INACTIVE.
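For anyone wanting to reproduce this, the check is nothing exotic - just v$log before and after the checkpoint:

```sql
-- Redo log group status before the manual checkpoint
SELECT group#, sequence#, bytes/1024/1024 AS mb, archived, status
FROM   v$log
ORDER  BY group#;

ALTER SYSTEM CHECKPOINT;

-- Re-run the query: the groups that showed ACTIVE now show INACTIVE
```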

I started mucking around trying to get checkpoints to occur more frequently automatically.  Since this database is Standard Edition I can't use FAST_START_MTTR_TARGET (which seems to be the way to do it).  I've been playing with LOG_CHECKPOINT_INTERVAL but can't seem to get Oracle to checkpoint more frequently.  I started with a value of 2500 (the OS block size is 4K, so 2500 blocks is about 10 MB - I thought that should trigger the checkpoint).  I've lowered it all the way down to 1000 at the moment and still am not seeing any change.
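The parameter changes I've been experimenting with look like this (values illustrative):

```sql
-- LOG_CHECKPOINT_INTERVAL is expressed in OS blocks (4K here),
-- so 2500 blocks is roughly 10 MB of redo between checkpoint targets
ALTER SYSTEM SET log_checkpoint_interval = 2500 SCOPE=BOTH;

-- The companion time-based knob, in seconds (default 1800)
ALTER SYSTEM SET log_checkpoint_timeout = 1800 SCOPE=BOTH;
```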

The wait every hour on the log switch is only 2-3 seconds which I guess isn't the end of the world (I'd rather it went away though), but I'm concerned about unsaved dirty blocks.

I am seeing in the alert log that it will say that it's starting a checkpoint for a particular marker .... but then it won't come up as complete for anywhere from 20 to 40 minutes for that particular checkpoint.  It's almost like it starts but then doesn't have the priority to complete it.  However if I manually execute a checkpoint, it tidies everything up in seconds.

The VM administrator is seeing nothing server side that concerns him.  No excessive CPU or disk activity other than when expected (during backups etc).

Other possible solutions I have seen also suggest enabling async I/O or increasing the number of DBWR processes - if you have the former, the latter is best left alone.  I do not have async I/O enabled at the moment, but I wonder if it is even relevant since I do not have frequent log switches going on.
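For completeness, the settings those write-ups refer to are the following (I have not applied these; values are illustrative and both require an instance restart):

```sql
-- Enable async + direct I/O on supported filesystems
ALTER SYSTEM SET filesystemio_options = 'SETALL' SCOPE=SPFILE;

-- Extra DBWR processes - generally only worthwhile if async I/O is unavailable
ALTER SYSTEM SET db_writer_processes = 2 SCOPE=SPFILE;
```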

So, after all of the above I guess my question is twofold.

* Do I understand what CURRENT in the v$log status means (do I have dirty blocks in the buffer cache sitting there for hours) ?
* Is there a way to make Oracle checkpoint more frequently and set my redo logs back to INACTIVE more frequently?

Thanks in advance for any insight into what I've got going on here.
Question by:Steve Wales
11 Comments
 
LVL 23

Expert Comment

by:David
ID: 39825254
Excellently stated question :)  Just for a laugh, please eyeball the free storage left on the device that holds the archive log(s).  Weirder things have happened.....
 
LVL 22

Author Comment

by:Steve Wales
ID: 39825336
Forgot to mention in my question:

Archive destination has plenty of free space (I monitor that closely)
Tablespaces and internal storage is also fine.
 
LVL 34

Assisted Solution

by:johnsone
johnsone earned 250 total points
ID: 39825530
One thing that I would do is run the script below in SQL*Plus.  It should give you a good idea of how many log switches are taking place.

alter session set nls_date_format='DD-MON-YYYY HH24:MI:SS';
set lines 132
set pause off
column 00 format 999
column 01 format 999
column 02 format 999
column 03 format 999
column 04 format 999
column 05 format 999
column 06 format 999
column 07 format 999
column 08 format 999
column 09 format 999
column 10 format 999
column 11 format 999
column 12 format 999
column 13 format 999
column 14 format 999
column 15 format 999
column 16 format 999
column 17 format 999
column 18 format 999
column 19 format 999
column 20 format 999
column 21 format 999
column 22 format 999
column 23 format 999
column Tot format 999
select to_char(first_time, 'MM/DD') day,
       sum(decode(substr(first_time, 13, 2), '00', 1, 0)) "00",
       sum(decode(substr(first_time, 13, 2), '01', 1, 0)) "01",
       sum(decode(substr(first_time, 13, 2), '02', 1, 0)) "02",
       sum(decode(substr(first_time, 13, 2), '03', 1, 0)) "03",
       sum(decode(substr(first_time, 13, 2), '04', 1, 0)) "04",
       sum(decode(substr(first_time, 13, 2), '05', 1, 0)) "05",
       sum(decode(substr(first_time, 13, 2), '06', 1, 0)) "06",
       sum(decode(substr(first_time, 13, 2), '07', 1, 0)) "07",
       sum(decode(substr(first_time, 13, 2), '08', 1, 0)) "08",
       sum(decode(substr(first_time, 13, 2), '09', 1, 0)) "09",
       sum(decode(substr(first_time, 13, 2), '10', 1, 0)) "10",
       sum(decode(substr(first_time, 13, 2), '11', 1, 0)) "11",
       sum(decode(substr(first_time, 13, 2), '12', 1, 0)) "12",
       sum(decode(substr(first_time, 13, 2), '13', 1, 0)) "13",
       sum(decode(substr(first_time, 13, 2), '14', 1, 0)) "14",
       sum(decode(substr(first_time, 13, 2), '15', 1, 0)) "15",
       sum(decode(substr(first_time, 13, 2), '16', 1, 0)) "16",
       sum(decode(substr(first_time, 13, 2), '17', 1, 0)) "17",
       sum(decode(substr(first_time, 13, 2), '18', 1, 0)) "18",
       sum(decode(substr(first_time, 13, 2), '19', 1, 0)) "19",
       sum(decode(substr(first_time, 13, 2), '20', 1, 0)) "20",
       sum(decode(substr(first_time, 13, 2), '21', 1, 0)) "21",
       sum(decode(substr(first_time, 13, 2), '22', 1, 0)) "22",
       sum(decode(substr(first_time, 13, 2), '23', 1, 0)) "23",
       count(1) "Tot"
from v$log_history
group by to_char(first_time, 'MM/DD')
;


I took that from somewhere a very long time ago and it always helps give you a good picture of when you are at your highest number of log switches.  It may not help a lot, but it may give you some pointers.

Typically there are 2 scenarios where I have seen this: either you are switching too fast, or your log destination is full.  It could also be a combination of both.  You say you are backing up archive logs - are they also being purged at the same time?  If so, you may be self-correcting the problem when the backups run.
 
LVL 22

Author Comment

by:Steve Wales
ID: 39825549
That is a great report.  I will save that.

Usually 1 switch per hour - which is what my perusing of the alert log is telling me.  Definitely not switching too fast.  (Except the 1 AM hour, when some process runs that pumps out about 15 of them - but would that have an impact on the rest of the day?)

My archive destination is a 20GB file system, currently using 13.5 GB, 6.5 GB free.

I'm generating at most 70 logs a day @ 50 MB each which is 3.5 GB.  The file system space is monitored, it never breaks the 90% threshold.  So it's not that.

My main concern here comes back to the point of:  Do I have dirty buffers sitting in the buffer cache hours and hours after they have been committed ?

This is a concern for me.

If I issue ALTER SYSTEM CHECKPOINT, I know all the logs immediately get set to INACTIVE (except for the CURRENT one of course) and I'll set up that in the job scheduler if I have to but I really shouldn't have to.

Oddly enough, last time I forced a checkpoint (so I had 1 CURRENT and 5 inactive) I still got the 'cannot allocate log' message.

I am completely confused on this one.
 
LVL 76

Assisted Solution

by:slightwv (䄆 Netminder)
slightwv (䄆 Netminder) earned 250 total points
ID: 39825599
It's been years since I've had to peel back the layers and troubleshoot low level type issues so I reserve the right to be mildly incorrect.

>>that keeps spitting out this message

Are you equating this as a bad thing?  I understand the fear of possible 'dirty' blocks but as long as they are on disk, does it really matter?

What is the 'issue' you are trying to 'fix' here?

>>Do I understand what CURRENT in the v$log

I believe current is just, well, the current one.

I hope you have looked in the docs for how the current status is defined.

>> Is there a way to make Oracle checkpoint

Disclaimer:  I'm sure this can have unforeseen consequences...

dbms_scheduler and stored procedure with an execute immediate?
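Something like this, off the top of my head (the job name and interval are made up, and for a plain ALTER SYSTEM you don't even need a wrapper procedure - a PL/SQL block with EXECUTE IMMEDIATE will do):

```sql
-- Hypothetical sketch: hourly forced checkpoint via the scheduler
BEGIN
  DBMS_SCHEDULER.create_job(
    job_name        => 'FORCE_CHECKPOINT_JOB',   -- name is made up
    job_type        => 'PLSQL_BLOCK',
    job_action      => 'BEGIN EXECUTE IMMEDIATE ''ALTER SYSTEM CHECKPOINT''; END;',
    repeat_interval => 'FREQ=HOURLY',
    enabled         => TRUE);
END;
/
```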

>> I forced a checkpoint (so I had 1 CURRENT and 5 inactive) I still got the 'cannot allocate log' message.

I really think you might be chasing a non-issue here.

Do you think ACTIVE is 'bad' and INACTIVE 'good'?

Personally I think "it is what it is" and you shouldn't worry about it.


Do you have a test system where only you are on it?  If not, get one...  XE is perfect as a play database.

I have one.  I have ZERO transactions outstanding.  I am the ONLY user connected.

I deleted my alert log so anything from the current point forward is logged.

I issued 6-10 logfile switch commands back to back.  I see "cannot allocate new log" all over the place in the alert log.

Do I have a problem with my database?  Man, I sure hope not...
 
LVL 22

Author Comment

by:Steve Wales
ID: 39825747
As a matter of course, I review my alert logs daily - I have a script I've developed over the years - where I look for certain things and investigate them as they come up.  Once I've discovered that certain messages don't mean anything, I exclude them (like the one about private strand flush not complete or whatever it is - the developers on MOS say ignore it).

I researched this message (I've seen it pop up from time to time but not with the frequency that occurs in this one DB - this one is only 3-4 months old).

>>Are you equating this as a bad thing?

The information I was able to find on this message indicated a possible performance issue - when the log switch occurs and it can't write to a log, the DB has to kick DBWR in the pants and clear the next log that needs writing to.

For the database in question, there have been no complaints about any performance issues, in fact the users are deliriously happy since this new machine is performing much better than the old SQL Server the app came from.  However, since the notes I had read indicated an issue, I was trying to make it go away.

>>I understand the fear of possible 'dirty' blocks but as long as they are on disk, does it really matter?

That depends, I suppose.  I understand that if they are in the online redo logs and a crash occurs, then the blocks that need amending can be read from the online redo logs instead of the data files.  

Maybe I'm just being overly cautious, but having them in the data file where they belong gives me the warm and fuzzies.  We've both seen questions here before talking about backups and data security and insurance etc and I like warm and fuzzies.

>>What is the 'issue' you are trying to 'fix' here?

Just trying to make a message that I possibly thought was indicative of a problem go away.

>>I believe current is just, well, the current one.

I have spent quite a bit of time the last few days researching and trying things.

I mis-stated in my question - I meant ACTIVE, not CURRENT (shame on you for not being able to read my mind and understand what I meant instead of what I wrote :) ).  

Current is of course the current one.  Active appeared to mean that the online redo logs were still needed because they were covering dirty blocks in the buffer cache - I was seeking confirmation.  The docs (here) say this:

ACTIVE - Log is active but is not the current log. It is needed for crash recovery. It may be in use for block recovery. It may or may not be archived.
 
That doesn't really address the question - says it's needed for recovery but doesn't say why.  If it was a recently written one I could understand it, but an older one ?

>>Do you think ACTIVE is 'bad' and INACIVE 'good'?

Not necessarily, no.  The fact that it's needed for crash recovery was indicative to me that there may still be dirty blocks that needed attention - especially when the log marked ACTIVE that is to be the next current upon the next log switch contains redo that is 5 to 6 hours old.  THAT is what concerned me.

>>I really think you might be chasing a non-issue here.

You may be right.  I was about "this far" (holds finger and thumb really close together) from adding this message to my ignore list in my monitoring script - but I have multiple databases I watch over, from 8i (only one of these dinosaurs left and it's almost dead) through 11gR2 - on HPUX, Windows and Linux - and all Enterprise Edition - and I never saw this issue on any of them before (except maybe during backups it would pop up briefly).

The only thing that I can see as different here (possibly) is that I'm working with Standard Edition (maybe SE One, can't remember) for this one and wondered if there was something different.

It may be that this is a storm in a teacup.  I'll leave this question open another day or two in case anyone else wants to throw in their two cents on the issue - but your example on the XE one does make one think.

I have an XE instance on my laptop.  I tried out your example and saw exactly the same thing.

Thanks to all participants in this discussion so far.
 
LVL 76

Accepted Solution

by:
slightwv (䄆 Netminder) earned 250 total points
ID: 39825768
>>and a crash occurs, then the blocks that need amending can be read from the online redo logs instead of the data files.

Disk is disk...  In the event of a crash, recovery is performed and transactions from the redo logs are replayed.

>>Just trying to make a message that I possibly thought was indicative of a problem go away.

I think this is the problem...  perception of an issue doesn't mean there is one.

I've never really monitored the alert log to the level you are but... different versions mean different behaviors in logging.

>> (shame on you for not being able to read my mind and understand what I meant instead of what I wrote :) ).  

lol...I thought that the first time you stated it:
"a redo log is marked as current as long as"

But then you repeated it:
"Do I understand what CURRENT in the v$log"

I didn't figure you would make the mistake twice?


>>The docs (here) say this:

Same docs I go by.  Notice all the 'may's? It is just for crash recovery.

Some blocks may be dirty in memory but not yet written to disk.  However, the committed transaction was recorded.

>>  I was about "this far"

I too am interested in what the other Experts have to say.  I am ALWAYS willing to learn something new.

Try this as a test (I did when I was playing because of this question):
open two sqlplus windows.
In #1:
create table tab1(col1 char(1));
insert into tab1 values('a');
commit;

In #2:
alter system checkpoint;

*outside of them: delete the alert log*

In #1:
update tab1 set col1='z';
--NO COMMIT

In #2:
alter system switch logfile;
--repeat at least 3 times the number of redo log groups you have


Now check the alert log.
Log switches occur.  Nothing hangs.  Yet, we STILL have an uncommitted transaction pending...

I had several "cannot allocate new log" and a couple of "checkpoint not complete"s.  Not sure if any are related since I had the same things without an uncommitted transaction pending.

It's magic?

Maybe I just trust too much that Oracle "has my back" and some things I am just not supposed to understand...

If I kill oracle.exe on Windows or pmon on *nix, Oracle is able to recover all committed transactions and rolls back all uncommitted ones.

It just works...
 
LVL 22

Author Comment

by:Steve Wales
ID: 39825781
That much is true - Oracle is VERY robust.  I've had servers die, disks fail and goodness knows what else happen over the last dozen years, and apart from having a sysadmin actually do an "rm *" on a database directory once, it's always just come back up.

(Of course there was also the time when a sysadmin removed this file from "/" called "hpux" back in the HPUX 9.x days because it was taking up all his root volume space, but that's another story completely :) )

Alrighty, will see if anyone else has anything to add on this one, if not I'll close it and split up some points early next week - you've pretty much convinced me.
 
LVL 76

Expert Comment

by:slightwv (䄆 Netminder)
ID: 39825805
>>apart from having a sysadmin actually physically do a "rm *" on a database directory once

Had a developer looking for space 'rm' a database file.  He had management convinced that a server reboot crashed the database.  A review of shell histories showed the rm command on an unused database file, so the database had remained up and usable until the reboot.


>> back in the HPUX 9.x days

I almost remember them!!!  I like HP-UX but mainly remember the 10.x days...
 
LVL 34

Assisted Solution

by:johnsone
johnsone earned 250 total points
ID: 39826215
I guess I misunderstood what we were looking at as well.  I thought you had a problem.

If the ACTIVE log has been fully written to disk as an archive log, then one other reason I can think of is that you have transactions that aren't committing.  Crash recovery would have to roll them back.  Keep an eye on V$TRANSACTION and maybe you can spot something.  I have seen this with SQL*Plus users (or other tools) doing things and not issuing a commit, although I never saw it cause the cannot allocate log message - I wasn't looking for that.  Still not a bad thing; if there is an instance crash, Oracle will handle it (the info should be in the UNDO segments as well, and those shouldn't get recycled if they are required for recovery).
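A quick way to watch for them (rough sketch - just joins V$TRANSACTION back to V$SESSION to see who owns each open transaction):

```sql
-- Open transactions and the sessions that own them, oldest first
SELECT s.sid, s.serial#, s.username, t.start_time, t.used_ublk
FROM   v$transaction t
       JOIN v$session s ON s.taddr = t.addr
ORDER  BY t.start_time;
```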

Personally, I wouldn't have that message in a script that is going to alert me.  I care more about the file system where the archive logs are being written.  I'll chase the message if people are seeing an issue due to excessive log switches and it is actually causing a problem, but it doesn't appear that you are seeing that.
 
LVL 22

Author Closing Comment

by:Steve Wales
ID: 39829808
Thanks to all for your inputs.  You've convinced me that this message can be ignored.  Some good information came out of this discussion too (including a nice script!).

