RAC processes on Linux

Hi experts,
I have a Linux Oracle cluster which has an average CPU load of 29557. When I do a ps -ef on the server, I get lots of lines like this:
 root     31347     1  1 06:57 ?        00:05:12 /data1/oracle/10.2.0/crs/bin/racgmain check
I am not familiar with RAC yet, so I don't know what this process is or does.
When I run top, I get entries like this at the top:
1184 root      25   0 20304 2720 1992 R    4  0.0   3:36.08 racgmain
I rarely see an oracle process at the top.
Can someone help me out with these processes? Thank you in advance.
sharschoAsked:

jiruizCommented:
crsd.bin invokes the racgmain to check the status of the resources that are managed by CRS. The racgmain is invoked through the wrapper script racgwrap.

If the resource action times out, crsd kills the action script, which is racgwrap, but the racgmain process is not killed. Over time, this can create a lot of orphaned racgmain processes on the system, which eventually slows the system down due to resource contention at the OS level.

(From My Oracle Support, a.k.a. Metalink)
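
The orphaning described above can be reproduced with a toy wrapper, without any Oracle binaries. This is only a sketch: `orphan_demo` is a made-up name, `sleep` stands in for racgmain, a small `sh -c` script stands in for racgwrap, and the forced `kill -9` is an assumption about how the timed-out action script gets terminated.

```shell
# Toy reproduction of an orphaned child; nothing here touches CRS.
orphan_demo() {
    pidfile=$(mktemp)
    # Stand-in for racgwrap: start a long-running child (stand-in for
    # racgmain), record the child's PID in the temp file, then wait for it.
    sh -c 'sleep 60 >/dev/null 2>&1 & echo $! > "$0"; wait' "$pidfile" &
    wrapper=$!
    sleep 1                                # let the wrapper start its child
    child=$(cat "$pidfile")
    kill -9 "$wrapper"                     # assumed stand-in for crsd killing the script
    wait "$wrapper" 2>/dev/null || true    # reap the killed wrapper quietly
    if kill -0 "$child" 2>/dev/null; then
        echo "child survived"
    else
        echo "child gone"
    fi
    kill "$child" 2>/dev/null || true      # clean up the toy orphan
    rm -f "$pidfile"
}
```

Running `orphan_demo` shows whether the child is still alive after its wrapper was killed, which is the situation the note describes for racgmain.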

Is this your case?
sharschoAuthor Commented:
Thanks for your prompt response. How do I know that a process is hung? Can I tell from the time column? For example:
14596 root      25   0 20304 2712 1984 R    4  0.0  11:33.11 racgmain
jiruizCommented:
ps -ef|grep "racgmain check"|wc -l 1290

If you get something like this:

CAAMonitorHandler :: 0:Action Script /opt/oracle/product/crs/bin/racgwrap(check) timed out for ora.harac1.vip! (timeout=60)
CheckResource error for ora.harac1.vip error code = -2
CAAMonitorHandler :: 0:Could not join /opt/oracle/product/crs/bin/racgwrap(check)
category: 1234, operation: scls_process_join, loc: childcrash, OS error: 0,
other: Abnormal termination of the child


then

sharschoAuthor Commented:
I got this as output:

[root@vwtu200 ~]# ps -ef|grep "racgmain check"|wc -l 1290
wc: 1290: No such file or directory
?
jiruizCommented:
Sorry, that was a copy-and-paste mistake. The command is:

ps -ef|grep "racgmain check"|wc -l

You should get 1. You are in trouble if you get more than one.
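
Note that a plain `grep` in this pipeline can also match its own command line. A minimal sketch of a stricter count follows, assuming the default `ps -ef` column layout (PPID in the third column) and that orphaned processes have been reparented to PID 1, as in the listing from the original question. The function name `count_orphan_racgmain` is made up for this example.

```shell
# Count orphaned "racgmain check" processes in `ps -ef` style input on stdin.
# A line is counted when its PPID (3rd column) is 1, it mentions
# "racgmain check", and it is not the grep command itself.
count_orphan_racgmain() {
    awk '$3 == 1 && /racgmain check/ && !/grep/ { n++ } END { print n + 0 }'
}
```

Usage would be `ps -ef | count_orphan_racgmain`. Reading from stdin also allows running it against a saved listing.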
jiruizCommented:
Why is that a problem?

Because Oracle says:

This is fixed in the 11.1.0.7 patchset. If you are running into this issue in 10gR2, please go ahead and apply the 10.2.0.4 patchset and the latest CRS bundle patch. This fix is included in the CRS bundle patch from bundle #2 onwards.

There is also a temporary workaround.
sharschoAuthor Commented:
OK, I understand.
When I issue the command, I get 349, so it is a problem. Can I get the workaround, because patching can't be applied now? I will write up the issue and try to pursue a patch for the db. It is an essential db for the most important application here, and they want to stick to the standard that is applied at each location. Our location has the most issues with RAC because it is on a Linux RAC system.
Can I get the temporary workaround from you? And can you send me some links about this issue? I can search, but I want some relevant ones. Thank you in advance. Can I just kill some with kill -9?
jiruizCommented:
From metalink:

...
Solution

    * This is fixed in the 11.1.0.7 patchset. If you are running into this issue in 10gR2, please go ahead and apply the 10.2.0.4 patchset and the latest CRS bundle patch. This fix is included in the CRS bundle patch from bundle #2 onwards.

    * The following option can be used as a temporary workaround until the patch is applied.


1.  Make a copy of racgwrap located under $ORACLE_HOME/bin and $CRS_HOME/bin on ALL Nodes

2.  Edit the file racgwrap and modify the last 3 lines from:

~~~
$ORACLE_HOME/bin/racgmain "$@"
status=$?
exit $status
~~~

to:

~~~
# Line added to fix for Bug 6196746
exec $ORACLE_HOME/bin/racgmain "$@"
~~~

3.  Kill all the orphan racgmain processes running.

$ ps -ef|grep "racgmain check"
oracle 18701 1 0 Aug 1 ? 0:00 /oracle/product/10.2.0/database/bin/racgmain check
oracle 14653 1 0 Aug 1 ? 0:00 /oracle/product/10.2.0/database/bin/racgmain check
oracle 24517 1 0 Aug 1 ? 0:00 /oracle/product/10.2.0/database/bin/racgmain check

$ kill -9 <PID of racgmain>

This is the document # 732086.1
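
Why the one-line change in step 2 should help can be seen with a toy wrapper. This is a sketch, not part of the Oracle note: `demo_wrapper` is a made-up name and an inner `sh` stands in for racgmain. Without `exec`, the wrapper forks a child with its own PID, so killing the wrapper's PID leaves the child running. With `exec`, the child replaces the wrapper and keeps its PID, so a kill aimed at the action script presumably hits racgmain itself.

```shell
# Print the wrapper's PID and the PID its "racgmain" stand-in runs under.
# $1 selects the variant: "fork" (original racgwrap tail) or "exec" (workaround).
demo_wrapper() {
    if [ "$1" = "exec" ]; then
        # Workaround shape: the child replaces the wrapper process.
        sh -c 'echo "wrapper=$$"; exec sh -c "echo child=\$\$"'
    else
        # Original shape: run the child, then collect and return its status.
        sh -c 'echo "wrapper=$$"; sh -c "echo child=\$\$"; status=$?; exit $status'
    fi
}
```

Calling `demo_wrapper fork` prints two different PIDs, while `demo_wrapper exec` prints the same PID twice.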

sharschoAuthor Commented:
Hi, I was waiting for some more advice, and then I decided to kill the 349 racgmain processes. But there were problems this weekend: I did the killing on Friday, and on Sunday at noon node1 went into a hung state. My colleague restarted the server, but it was still not working properly. Only after the second node was restarted did it start functioning correctly. I don't know the impact of killing the processes on the database or the RAC server. The CPU load did go down. In the alert log of node1 I see no activity after the racgmain processes were killed, so it stayed that way until Sunday, when the db was shut down the hard way. The alert log on node2 had some entries, which I list below.
Reconfiguration started (old inc 12, new inc 14)
List of nodes:
 1
 Global Resource Directory frozen
 * dead instance detected - domain 0 invalid = TRUE
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Fri Apr 16 17:21:36 2010
 LMS 1: 0 GCS shadows cancelled, 0 closed
Fri Apr 16 17:21:36 2010
 LMS 0: 0 GCS shadows cancelled, 0 closed
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Post SMON to start 1st pass IR
Fri Apr 16 17:21:36 2010
 LMS 1: 7736 GCS shadows traversed, 0 replayed
Fri Apr 16 17:21:36 2010
 LMS 0: 7707 GCS shadows traversed, 0 replayed
Fri Apr 16 17:21:36 2010
 Submitted all GCS remote-cache requests
 Post SMON to start 1st pass IR
 Fix write in gcs resources
Fri Apr 16 17:21:36 2010
Instance recovery: looking for dead threads
Fri Apr 16 17:21:36 2010
Beginning instance recovery of 1 threads
Reconfiguration complete
Fri Apr 16 17:21:37 2010
 parallel recovery started with 3 processes
Fri Apr 16 17:21:37 2010
Started redo scan
Fri Apr 16 17:21:37 2010
Completed redo scan
 3728 redo blocks read, 118 data blocks need recovery
Fri Apr 16 17:21:37 2010
Started redo application at
 Thread 1: logseq 1468, block 70517
Fri Apr 16 17:21:37 2010
Recovery of Online Redo Log: Thread 1 Group 10 Seq 1468 Reading mem 0
  Mem# 0 errs 0: /data3/oradata/INFRABV/redo10.log
Fri Apr 16 17:21:37 2010
Completed redo application
Fri Apr 16 17:21:37 2010
Completed instance recovery at
 Thread 1: logseq 1468, block 74245, scn 83661866
 116 data blocks read, 117 data blocks written, 3728 redo blocks read
Switch log for thread 1 to sequence 1469
Fri Apr 16 19:04:41 2010
GES: Potential blocker (pid=18839) on resource CF-00000000-00000000;
 enqueue info in file /data1/oracle/10.2.0/db1/admin/INFRABV/bdump/infrabv2_lmd0_8671.trc and DIAG trace file
Fri Apr 16 19:04:41 2010
Killing enqueue blocker (pid=18839) on resource CF-00000000-00000000
 by killing session 398.25
Fri Apr 16 19:06:07 2010
Killing enqueue blocker (pid=18839) on resource CF-00000000-00000000
 by killing session 398.25
Fri Apr 16 19:12:11 2010
Killing enqueue blocker (pid=18839) on resource CF-00000000-00000000
 by terminating the process
Fri Apr 16 19:27:29 2010
It continued with the killing until Sunday, when node2 was rebooted as well.
Can you give me some more clues about the racgmain check process? When I do a ps -ef now I don't get any racgmain check process; is that wrong? What is the interval between racgmain runs? Can you tell me the effect of killing the racgmain processes on the db? There is no DIAG file present, so it was not created.
Any help is appreciated.
jiruizCommented:
RACG: Extends clusterware to support Oracle-specific requirements and complex resources. Runs server callout scripts when FAN events occur. On Linux, the processes are racgmain and racgimon.

You must have one racgmain!

I am not sure, but I think the best you can do now is to reboot the cluster in an orderly fashion and apply the change to racgwrap (see above).
sharschoAuthor Commented:
Thanks for your response. It gave me some pointers on what to look for, and below is what I found.
The only line I get when I do a ps -ef and grep for "racgmain check" is my own grep:
oracle    4797  4889  0 14:43 pts/3    00:00:00 grep racgmain check
I do see this one:
 ps -ef|grep racgimon
oracle   19983     1  0 Apr18 ?        00:00:00 /data1/oracle/10.2.0/db1/bin/racgimon startd INFRABV
It exists on both nodes.
But when I do:
$  ps -ef|grep racgmain
oracle   12100  4889  0 14:45 pts/3    00:00:00 grep racgmain
I get only my own grep. Is it OK like this? Is it OK to have only racgimon? I will have to check with the corporate office to get permission to apply the patch. Can't I start the racgmain process manually?
sharschoAuthor Commented:
I also get this error in crsd.log:
2010-04-19 15:40:38.907: [  OCRSRV][3031468960]th_select_handler: Failed to retrieve procctx from ht. constr = [-1604294368] retval lht [-27] Signal CV.
jiruizCommented:
From Oracle:

This message can be ignored as described in unpublished bug 4494370.

Schedule a maintenance window and turn off this messaging as user root on one of the cluster nodes:

# crsctl debug log crs OCRSRV:0

Stop and start CRS after changing the tracing level to pick up the change.
sharschoAuthor Commented:
OK, thanks for the info. We will stop and start everything on Thursday morning. I will also do more reading on the racgmain processes.
jiruizCommented:
You're welcome.