sharscho asked:
RAC processes on Linux

Hi experts,
I have a Linux Oracle cluster with an average CPU load of 29557. When I do a ps -ef on the server I get lots of entries like this:
 root     31347     1  1 06:57 ?        00:05:12 /data1/oracle/10.2.0/crs/bin/racgmain check
I am not familiar with RAC yet, so I don't know what this process is or does.
When I run top, I see these at the top:
1184 root      25   0 20304 2720 1992 R    4  0.0   3:36.08 racgmain
and only rarely do I see an oracle process up there.
Can someone help me out with these processes? Thank you in advance.
jiruiz:
crsd.bin invokes racgmain to check the status of the resources that are managed by CRS. racgmain is invoked through the wrapper script racgwrap.

If the resource action times out, crsd kills the action script, which is racgwrap, but the racgmain process is not killed. Over time this can create a lot of orphan racgmain processes on the system, which eventually slows the node down due to resource contention at the OS level.

(from My Oracle Support, a.k.a. Metalink)

Is this your case?
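A quick way to check whether this orphan scenario applies, as a rough sketch using standard Linux ps (the "racgmain check" string comes from the ps output earlier in this thread; the PPID 1 heuristic is an assumption, since orphaned children are normally reparented to init):

# List "racgmain check" processes with parent PID and elapsed time;
# copies left behind after racgwrap was killed typically show PPID 1.
ps -eo pid,ppid,etime,args | grep 'racgmain check' | grep -v grep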
Avatar of sharscho
sharscho

ASKER

Thanks for your prompt response. How do I know that a process is hanging? Is it by the elapsed time? For example:
14596 root      25   0 20304 2712 1984 R    4  0.0  11:33.11 racgmain
ps -ef|grep "racgmain check"|wc -l 1290

If you get something like this:

CAAMonitorHandler :: 0:Action Script /opt/oracle/product/crs/bin/racgwrap(check) timed out for ora.harac1.vip! (timeout=60)
CheckResource error for ora.harac1.vip error code = -2
CAAMonitorHandler :: 0:Could not join /opt/oracle/product/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0,
other: Abnormal termination of the child


then you are hitting this issue.
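As a side note (not from the thread), you can also search the crsd log directly for those timeout messages. A sketch, assuming the usual 10.2 layout of $CRS_HOME/log/<nodename>/crsd/crsd.log and the CRS home /data1/oracle/10.2.0/crs shown in the ps output above; adjust the path to your installation:

# Look for racgwrap check timeouts in the crsd log (path is an assumption).
grep -i "timed out" /data1/oracle/10.2.0/crs/log/$(hostname -s)/crsd/crsd.log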
I got this as output:

[root@vwtu200 ~]# ps -ef|grep "racgmain check"|wc -l 1290
wc: 1290: No such file or directory
?
Sorry, that was a copy-and-paste slip. The command is:

ps -ef|grep "racgmain check"|wc -l

You should get 1. You are in trouble if you get more than one.
Why is that a problem?

Because Oracle says:

This is fixed in 11.1.0.7 patchset. If you are running into this issue in 10gR2, please go ahead and apply 10.2.0.4 patchset and the latest CRS bundle patch. This fix is included in CRS bundle patch from bundle #2 onwards.

There is a temporary workaround.
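A small aside on the counting command above (a generic shell detail, not something from the thread): the grep process itself also matches the pattern, so the count can be off by one. The bracket trick below avoids that:

# '[r]acgmain check' never matches the grep command line itself,
# so the count reflects only the real racgmain check processes.
ps -ef | grep '[r]acgmain check' | wc -l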
OK, I understand.
When I issue the command I get 349, so it is a problem. Can I have the workaround, since patching can't be applied right now? I will describe the issue on paper and try to pursue a patch for the db. It is an essential db for the most important application here, and they want to stick to the standard that is applied at each location. Our location has the most issues with RAC because it runs on a Linux RAC system.
Could you send me the temporary workaround, and also some links about this issue? I can search myself, but I would like some relevant ones. Thank you in advance. Can I just kill some of the processes with kill -9?
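For reference, a cautious sketch of how one might inspect the presumed orphans before deciding anything. This is an illustration only, not the accepted solution below; the /tmp file name is made up, and whether killing these processes is safe is exactly the open question here:

# Collect the PIDs of "racgmain check" processes whose parent is init (PPID 1),
# i.e. the presumed orphans, into a hypothetical file for review.
ps -eo pid,ppid,args | awk '$2 == 1 && /racgmain check/ {print $1}' > /tmp/orphan_racgmain.pids
cat /tmp/orphan_racgmain.pids
# Only after reviewing the list, and preferably with a plain SIGTERM first:
# xargs -r kill < /tmp/orphan_racgmain.pids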
ASKER CERTIFIED SOLUTION
jiruiz:
(The accepted solution text is available to Experts Exchange members only.)
Hi, I was waiting for some more advice, so I decided to kill the 349 racgmain processes. But there were problems this weekend: I did the killing on Friday, and around noon on Sunday node1 got into a hanging state. My colleague restarted the server, but it still was not working properly until the second node was restarted as well; then it started functioning right. I don't know what impact killing those processes had on the database or on the RAC server. The CPU load did get lower. In the alert log of node1 I see no activity after the racgmain processes were killed, so it stayed that way until Sunday, when the db was shut down the hard way. The alert log on node2 had some entries, which I list below.
Reconfiguration started (old inc 12, new inc 14)
List of nodes:
 1
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Fri Apr 16 17:21:36 2010
 LMS 1: 0 GCS shadows cancelled, 0 closed
Fri Apr 16 17:21:36 2010
 LMS 0: 0 GCS shadows cancelled, 0 closed
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
Fri Apr 16 17:21:36 2010
 LMS 1: 7736 GCS shadows traversed, 0 replayed
Fri Apr 16 17:21:36 2010
 LMS 0: 7707 GCS shadows traversed, 0 replayed
Fri Apr 16 17:21:36 2010
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Fri Apr 16 17:21:36 2010
Instance recovery: looking for dead threads
Fri Apr 16 17:21:36 2010
Beginning instance recovery of 1 threads
Reconfiguration complete
Fri Apr 16 17:21:37 2010
 parallel recovery started with 3 processes
Fri Apr 16 17:21:37 2010
Started redo scan
Fri Apr 16 17:21:37 2010
Completed redo scan
 3728 redo blocks read, 118 data blocks need recovery
Fri Apr 16 17:21:37 2010
Started redo application at
 Thread 1: logseq 1468, block 70517
Fri Apr 16 17:21:37 2010
Recovery of Online Redo Log: Thread 1 Group 10 Seq 1468 Reading mem 0
  Mem# 0 errs 0: /data3/oradata/INFRABV/redo10.log
Fri Apr 16 17:21:37 2010
Completed redo application
Fri Apr 16 17:21:37 2010
Completed instance recovery at
 Thread 1: logseq 1468, block 74245, scn 83661866
 116 data blocks read, 117 data blocks written, 3728 redo blocks read
Switch log for thread 1 to sequence 1469
Fri Apr 16 19:04:41 2010
GES: Potential blocker (pid=18839) on resource CF-00000000-00000000;
 enqueue info in file /data1/oracle/10.2.0/db1/admin/INFRABV/bdump/infrabv2_lmd0_8671.trc and DIAG trace file
Fri Apr 16 19:04:41 2010
Killing enqueue blocker (pid=18839) on resource CF-00000000-00000000
 by killing session 398.25
Fri Apr 16 19:06:07 2010
Killing enqueue blocker (pid=18839) on resource CF-00000000-00000000
 by killing session 398.25
Fri Apr 16 19:12:11 2010
Killing enqueue blocker (pid=18839) on resource CF-00000000-00000000
 by terminating the process
Fri Apr 16 19:27:29 2010
It continued with this killing until Sunday, when node2 was rebooted as well.
Can you give me some more clues about the racgmain check process? When I do a ps -ef now I don't see any racgmain check process; is that wrong? What is the interval at which racgmain runs? Can you tell me what effect killing the racgmain processes has on the db? There is no DIAG trace file present, so it was not created.
Any help is appreciated.
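For what it's worth, a minimal post-restart health-check sketch using standard 10.2 clusterware commands (run on each node; nothing here is specific to this thread beyond the "racgmain check" string):

# Clusterware daemons (CSS, CRS, EVM) should all report healthy:
crsctl check crs
# CRS-managed resources (instances, VIPs, listeners, ASM) should be ONLINE:
crs_stat -t
# The number of lingering racgmain check processes should stay near zero:
ps -ef | grep '[r]acgmain check' | wc -l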
RACG extends the clusterware to support Oracle-specific requirements and complex resources. It runs server callout scripts when FAN events occur. On Linux, the processes are racgmain and racgimon.

You must have one racgmain!

I don't know, but I think the best you can do now is restart the cluster in an orderly way and apply the patch for racgwrap (see above).
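As an illustration only, an orderly restart on 10.2 could look roughly like this; the database name INFRABV is taken from the alert log paths earlier in the thread, and the exact order and ownership (oracle vs. root) should be checked against your own procedures:

# As the oracle user: stop the database cleanly on all nodes.
srvctl stop database -d INFRABV
# As root, on each node in turn: stop and restart the clusterware stack.
crsctl stop crs
crsctl start crs
# As the oracle user: bring the database back and verify the resources.
srvctl start database -d INFRABV
crs_stat -t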
Thanks for your response. I used that info to look around, and below is what I found.
The only line I get when I grep the ps -ef output for "racgmain check" is my own grep:
oracle    4797  4889  0 14:43 pts/3    00:00:00 grep racgmain check
I do see this one:
 ps -ef|grep racgimon
oracle   19983     1  0 Apr18 ?        00:00:00 /data1/oracle/10.2.0/db1/bin/racgimon startd INFRABV
it exists on both nodes.
But when I do:
$  ps -ef|grep racgmain
oracle   12100  4889  0 14:45 pts/3    00:00:00 grep racgmain
I only get my own grep. Is it OK like this? Is it OK to have only racgimon? I will have to check with the corporate office to get permission to apply the patch. Can't I start the racgmain process manually?
I also get this error in crsd.log:
2010-04-19 15:40:38.907: [  OCRSRV][3031468960]th_select_handler: Failed to retrieve procctx from ht. constr = [-1604294368] retval lht [-27] Signal CV.
From Oracle:

This message can be ignored as described in unpublished bug 4494370.

Schedule a maintenance window and turn off this messaging as user root on one of the cluster nodes:

# crsctl debug log crs OCRSRV:0

Stop and start CRS after changing the tracing level to pick up the change.
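Putting those instructions together, a sketch assuming it is run as root during the maintenance window (note that stopping CRS on a node also takes the instance on that node down):

# Turn off the OCRSRV messaging, as described in the note above:
crsctl debug log crs OCRSRV:0
# Restart the clusterware stack on each node so the new level takes effect:
crsctl stop crs
crsctl start crs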
OK, thanks for the info. We will stop and start everything on Thursday morning. I will also do more reading on the racgmain processes.
You're welcome.