Calbrenar

Red Hat Cluster Suite -- Oracle HA -- Trying to repair cluster

I'm trying to repair a cluster that was set up by someone who has since left the office. I'm attempting to follow the commands in the manual to remove the broken node so I can re-add it, but the service command is telling me the service is unrecognized.

When I run service --status-all I can see services such as clurgmgrd and clvmd running, but when I run service clurgmgrd stop I get "unrecognized service".

I'm not sure why it's doing that, but I'm leery of following the rest of the directions to delete and re-add the node when the services are definitely running even though the service command doesn't see them.
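
One likely explanation, offered as an assumption rather than something confirmed in this thread: service looks up an init script by file name under /etc/init.d, and that name can differ from the daemon the script starts. In Red Hat Cluster Suite, clurgmgrd is normally started by the init script called rgmanager, which would explain why it shows up in the status output but not as a service name. A minimal Python sketch for finding which init script mentions a given daemon (the helper name find_init_scripts is hypothetical):

```python
import os


def find_init_scripts(daemon, initd_dir="/etc/init.d"):
    """Return the sorted names of init scripts whose text mentions `daemon`."""
    matches = []
    if not os.path.isdir(initd_dir):
        return matches
    for name in sorted(os.listdir(initd_dir)):
        path = os.path.join(initd_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            # Init scripts are plain shell text; tolerate odd bytes.
            with open(path, "r", errors="replace") as fh:
                if daemon in fh.read():
                    matches.append(name)
        except OSError:
            # Unreadable script (permissions, broken link): skip it.
            continue
    return matches


if __name__ == "__main__":
    for script in find_init_scripts("clurgmgrd"):
        print(script)
```

Whatever name this prints is the one to pass to service, for example service <name> stop. A script can mention the daemon without owning it, so treat the result as a hint.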
Calbrenar

ASKER

Oh, I wanted to add: we are using scripts to modify the default cluster setup for Oracle, as described in this article:
 
http://www.samag.com/documents/s=9370/sam0704a/0704a.htm
ASKER CERTIFIED SOLUTION
jgiordano
This solution is only available to members.
When I first attempted to get Oracle working, it kept telling me Oracle wasn't started even after I had started it. So I started looking at the cluster, but there was no way to start it. After rebooting the machines one at a time I was able to get into the cluster manager on both servers, but one of them is not part of the cluster and shows an error about going to the manager:
 
because this node is not currently part of a cluster
 
So I decided to try removing it and re-adding it, and ran into the problems above. I'm trying your steps above now; it is currently stuck at "Waiting for services to stop:"
It sounds like your cluster is in a pretty funky state. I feel your pain; RHEL Cluster Suite isn't exactly the easiest cluster product to work with. Is this system in production?

Have you considered using Conga to manage the cluster in addition to the command line? I find it much easier to manage/add/remove the resources of the cluster.
Not sure what Conga is, but we're basically stuck with using the cluster as configured. It has already been released and is in use by the customer, so we have to maintain it and develop releases for it. I think it may be just this one member that is screwed up; the other appears to be working, but I'm afraid to delete the broken one from the cluster without fully stopping all the services. I tried pressing CTRL-C to stop the command and then rerunning it, but it still sticks at "Waiting for services to stop".
Conga consists of two packages that you can install on a separate RH box or on members of the cluster. Luci is the management piece and ricci is the agent piece. It lets you do many of the tasks you are doing now through a GUI.

Back to your server, I would try rebooting again.
Do you have any type of fencing enabled?
Are you referring to the Cluster Configuration tool? It has two tabs, Cluster Configuration and Cluster Management. It shows the machine I'm having problems with, but its status says it is not a member.
 
I believe fencing is enabled, because when I reboot the computer and watch the console I see it talking about fencing, and it takes a long time to do anything.
No, Conga is web-based; I haven't used the cluster config GUI tool.


The long wait may mean that fencing is not enabled. I had a HUGE issue with my cluster until I enabled some type of fencing. Can you cat /etc/cluster/cluster.conf on both nodes and make sure they are at the same revision level? Also check the fencing section and see what is in it.
The one on the broken node says config_version 15. The good one says 14.
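
The check above can also be scripted. This is a minimal Python sketch, not an official tool: it assumes cluster.conf has the usual layout (a config_version attribute on the root cluster element, fencedevice elements carrying an agent attribute, and per-node device entries), and the sample XML and node names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like a two-node cluster.conf with manual fencing.
SAMPLE = """<cluster name="example" config_version="15">
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="node1" votes="1">
      <fence><method name="1"><device name="manual" nodename="node1"/></method></fence>
    </clusternode>
    <clusternode name="node2" votes="1">
      <fence><method name="1"><device name="manual" nodename="node2"/></method></fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_manual" name="manual"/>
  </fencedevices>
</cluster>"""


def summarize_cluster_conf(xml_text):
    """Pull out the revision level and the fencing-related settings."""
    root = ET.fromstring(xml_text)
    fence_daemon = root.find("fence_daemon")
    return {
        "config_version": root.get("config_version"),
        "post_fail_delay": (
            fence_daemon.get("post_fail_delay") if fence_daemon is not None else None
        ),
        # Fence agents declared in the <fencedevices> section.
        "fence_agents": sorted(d.get("agent", "") for d in root.iter("fencedevice")),
        # Which fence device names each node refers to.
        "node_fence_devices": {
            node.get("name"): [dev.get("name") for dev in node.iter("device")]
            for node in root.iter("clusternode")
        },
    }


if __name__ == "__main__":
    for key, value in summarize_cluster_conf(SAMPLE).items():
        print(key, "=", value)
```

To use it on a real node, read /etc/cluster/cluster.conf into a string and pass that in place of SAMPLE, then compare the output from both nodes.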
 
Both say fence_daemon post_fail_delay = 0.
On both machines the nodes are enclosed in <fence></fence> tags and the fencedevice is set to fence_manual.
 
The version is most likely different because I deleted the broken member from the failover domain. In fact I'm sure that's true, because when I hit save on the good one and reloaded the config, both were set to 15.
Just because they are both 15 doesn't mean that they are identical; you want to have them identical. You currently are not using any sort of fencing since it is set to manual. I would recommend you configure some type of fencing. Many of your problems will go away once fencing is correctly configured. Even though they have an option for manual, the cluster takes forever to start and stop fencing when it is not configured.  
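
A quick way to confirm the two files really are identical is to compare them byte for byte. The sketch below is only an illustration: it assumes both copies of cluster.conf have been read into strings (for example after copying one file across with scp), and the function and label names are hypothetical.

```python
import difflib
import hashlib


def config_digest(text):
    """SHA-256 of the file contents; equal digests mean identical files."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def config_diff(good_text, bad_text, good_name="good-node", bad_name="broken-node"):
    """Return unified-diff lines; an empty list means no differences."""
    return list(
        difflib.unified_diff(
            good_text.splitlines(),
            bad_text.splitlines(),
            fromfile=good_name,
            tofile=bad_name,
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Same config_version on both sides, but the contents still differ.
    good = '<cluster name="example" config_version="15">\n  <rm/>\n</cluster>\n'
    bad = (
        '<cluster name="example" config_version="15">\n'
        "  <rm><failoverdomains/></rm>\n"
        "</cluster>\n"
    )
    print("identical:", config_digest(good) == config_digest(bad))
    for line in config_diff(good, bad):
        print(line)
```

The sample shows the case described above: both files claim config_version 15, yet the diff still reports a changed line.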
I'd be happy to configure it once I can get rid of the bad cluster member. If I can remove everything from it, then I can add it back into the cluster from scratch. But I'm not sure how to do that, since the command you listed never finishes.
It might be hung, in which case the only solution is to reboot. I dealt with a similar issue with clvmd and needed to reboot.
The machine wouldn't even reboot; I had to reboot the VM. Now the commands work. Going to remove it from the cluster and re-add it. Wish me luck -- thanks for the help, A+!
Great! Definitely use the VM fencing; you will notice that the fence daemon starts and stops much more easily.