sharscho
asked on
RAC processes on linux
Hi experts,
I have a Linux Oracle cluster which has an average CPU load of 29557. When I do a ps -ef on the server I get lots of
root 31347 1 1 06:57 ? 00:05:12 /data1/oracle/10.2.0/crs/bin/racgmain check
I am not familiar with RAC yet and so I don't know what this process is or does.
When I run top, I get these at the top:
1184 root 25 0 20304 2720 1992 R 4 0.0 3:36.08 racgmain
and I rarely see an oracle process at the top.
Can someone help me out with these processes? Thank you in advance.
ASKER
Thanks for your prompt response. How do I know that a process is hanging? Is it by the time? For example:
14596 root 25 0 20304 2712 1984 R 4 0.0 11:33.11 racgmain
ps -ef|grep "racgmain check"|wc -l 1290
If you get something like this:
CAAMonitorHandler :: 0:Action Script /opt/oracle/product/crs/bin/racgwrap (check) timed out for ora.harac1.vip! (timeout=60)
CheckResource error for ora.harac1.vip error code = -2
CAAMonitorHandler :: 0:Could not join /opt/oracle/product/crs/bin/racgwrap (check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0,
other: Abnormal termination of the child
then
ASKER
I got this as output:
[root@vwtu200 ~]# ps -ef|grep "racgmain check"|wc -l 1290
wc: 1290: No such file or directory
?
Sorry. The inconveniences of copy paste.
ps -ef|grep "racgmain check"|wc -l
You must get 1. You are in trouble if you get more than one.
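One caveat with the count above: the grep process usually shows up in the ps output itself, because its own command line contains "racgmain check". A minimal sketch that avoids this uses the common bracket trick, so the pattern no longer matches grep's own command line. The function name count_racgmain_check is made up for illustration.

```shell
# Count "racgmain check" processes from `ps -ef`-style text on stdin.
# The [r] bracket expression still matches "racgmain check", but grep's
# own command line then reads "grep [r]acgmain check" and is not counted.
count_racgmain_check() {
    # grep -c prints 0 and exits non-zero when nothing matches; ignore that status.
    grep -c '[r]acgmain check' || true
}

# Usage on a live node:
#   ps -ef | count_racgmain_check
```

With this pattern the count no longer includes grep's own line, so keep that in mind when comparing it against the expected value.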
Why is that a problem?
Because Oracle says:
This is fixed in the 11.1.0.7 patchset. If you are running into this issue in 10gR2, please go ahead and apply the 10.2.0.4 patchset and the latest CRS bundle patch. This fix is included in the CRS bundle patch from bundle #2 onwards.
There is a temporary workaround
ASKER
OK I understand.
When I issue the command, I get 349, so it is a problem. Can I get the workaround, because the patch can't be applied now? I will describe the issue on paper and try to pursue a patch for the db. It is an essential db of the most important application here, and they want to stick to the standard which is applied at each location. Our location has the most issues with RAC because it is on a Linux RAC system.
Can I get the temporary workaround from you? And can you send me some links about this issue? I can search, but I want some relevant ones. Thank you in advance. Can I just kill some with kill -9?
ASKER
Hi, I was waiting for some more advice, so I decided to kill the 349 racgmain processes. But there were problems this weekend: I did the killing on Friday, and on Sunday at noon node1 went into a hanging state. My colleague restarted the server, but it was still not working properly; only after the second node was restarted did it start functioning correctly. I don't know the impact of the process killing on the database or the RAC server. The CPU load did get lower. In the alert log of node1 I see no activity after the racgmain processes were killed, so it stayed that way until Sunday, when the db was shut down the hard way. The alert log on node2 had some entries, which I list below.
Reconfiguration started (old inc 12, new inc 14)
List of nodes:
1
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Fri Apr 16 17:21:36 2010
LMS 1: 0 GCS shadows cancelled, 0 closed
Fri Apr 16 17:21:36 2010
LMS 0: 0 GCS shadows cancelled, 0 closed
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Post SMON to start 1st pass IR
Fri Apr 16 17:21:36 2010
LMS 1: 7736 GCS shadows traversed, 0 replayed
Fri Apr 16 17:21:36 2010
LMS 0: 7707 GCS shadows traversed, 0 replayed
Fri Apr 16 17:21:36 2010
Submitted all GCS remote-cache requests
Post SMON to start 1st pass IR
Fix write in gcs resources
Fri Apr 16 17:21:36 2010
Instance recovery: looking for dead threads
Fri Apr 16 17:21:36 2010
Beginning instance recovery of 1 threads
Reconfiguration complete
Fri Apr 16 17:21:37 2010
parallel recovery started with 3 processes
Fri Apr 16 17:21:37 2010
Started redo scan
Fri Apr 16 17:21:37 2010
Completed redo scan
3728 redo blocks read, 118 data blocks need recovery
Fri Apr 16 17:21:37 2010
Started redo application at
Thread 1: logseq 1468, block 70517
Fri Apr 16 17:21:37 2010
Recovery of Online Redo Log: Thread 1 Group 10 Seq 1468 Reading mem 0
Mem# 0 errs 0: /data3/oradata/INFRABV/redo10.log
Fri Apr 16 17:21:37 2010
Completed redo application
Fri Apr 16 17:21:37 2010
Completed instance recovery at
Thread 1: logseq 1468, block 74245, scn 83661866
116 data blocks read, 117 data blocks written, 3728 redo blocks read
Switch log for thread 1 to sequence 1469
Fri Apr 16 19:04:41 2010
GES: Potential blocker (pid=18839) on resource CF-00000000-00000000;
enqueue info in file /data1/oracle/10.2.0/db1/admin/INFRABV/bdump/infrabv2_lmd0_8671.trc and DIAG trace file
Fri Apr 16 19:04:41 2010
Killing enqueue blocker (pid=18839) on resource CF-00000000-00000000
by killing session 398.25
Fri Apr 16 19:06:07 2010
Killing enqueue blocker (pid=18839) on resource CF-00000000-00000000
by killing session 398.25
Fri Apr 16 19:12:11 2010
Killing enqueue blocker (pid=18839) on resource CF-00000000-00000000
by terminating the process
Fri Apr 16 19:27:29 2010
It continued with the killing until Sunday, when node2 was also rebooted.
Can you send me some more clues on the racgmain check process? When I do a ps -ef now I don't get any racgmain check process; is that wrong? What is the interval for the racgmain run? Can you tell me the effect of killing the racgmain processes on the db? There is no DIAG file present, so it was not created.
Any help is appreciated.
RACG—Extends clusterware to support Oracle-specific requirements and complex resources. Runs server callout scripts when FAN events occur. In Linux, the processes are racgmain and racgimon
You must have one racgmain!
I don't know for sure, but I think the best you can do now is to reboot the cluster in an orderly fashion and apply the patch in racgwrap (see above).
ASKER
Thanks for your response, I got some info which I used to look for things and below is what I got.
The only line that I get when I do a ps -ef is my own grep:
oracle 4797 4889 0 14:43 pts/3 00:00:00 grep racgmain check
I do see this one:
ps -ef|grep racgimon
oracle 19983 1 0 Apr18 ? 00:00:00 /data1/oracle/10.2.0/db1/bin/racgimon startd INFRABV
It exists on both nodes.
But when I do:
$ ps -ef|grep racgmain
oracle 12100 4889 0 14:45 pts/3 00:00:00 grep racgmain
I get my own grep. Is it OK like this? Is it OK to have only the racgimon? I will have to check with the corporate office to get permission to apply the patch. Can't I start the racgmain process manually?
ASKER
I also get this error in the crsd.log
2010-04-19 15:40:38.907: [ OCRSRV][3031468960]th_select_handler: Failed to retrieve procctx from ht. constr = [-1604294368] retval lht [-27] Signal CV.
From Oracle:
This message can be ignored as described in unpublished bug 4494370.
Schedule a maintenance window and turn off this messaging as user root on one of the clusternodes:
# crsctl debug log crs OCRSRV:0
Stop and start CRS after changing the tracing level to pick up the change.
ASKER
OK, thanks for the info. We will stop and start everything on Thursday morning. I will also do more reading on racgmain processes.
you're welcome
If the resource action times out, crsd kills the action script, which is racgwrap, while the racgmain process will not be killed. Over time, this might create a lot of orphan racgmain processes in the system. This would eventually slow down the system due to resource contention at the OS level.
(from myOracle support aka. Metalink)
Is this your case?
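To check whether this description fits, one rough approach is to list the racgmain processes that have been re-parented to init (PPID 1, as in the ps -ef line quoted in the question) together with their elapsed time. This is only a sketch: the helper name list_orphan_racgmain is hypothetical, and a PPID of 1 is merely a hint that a process was orphaned, not proof that it is hung.

```shell
# Print the lines for racgmain processes whose parent is PID 1.
# Expects `ps -eo pid,ppid,etime,args`-style text on stdin, so that
# field 2 is the parent PID and field 4 is the executable path.
list_orphan_racgmain() {
    awk '$2 == 1 && $4 ~ /racgmain$/ { print }'
}

# Usage on a live node:
#   ps -eo pid,ppid,etime,args | list_orphan_racgmain
```

A long list of old entries here would be consistent with the orphaned-process behaviour described above, although it does not confirm the cause by itself.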