Solved

Red Hat Cluster Suite -- Oracle HA -- Trying to repair cluster

Posted on 2008-11-02
Last Modified: 2013-12-19
I'm trying to repair a cluster that was set up by someone who has since left the office. I'm attempting to follow the commands in the manual to remove the broken member so I can re-add it, but the service command is telling me the service is unrecognized.

When I run service --status-all I can see services running, such as clurgmgrd and clvmd, but when I run service clurgmgrd stop I get "unrecognized service".

I'm not sure why it's doing that, but I'm leery of following the rest of the directions to delete and re-add the cluster when the services are definitely running even though the service command doesn't see them.
Question by:Calbrenar
 

Author Comment

by:Calbrenar
ID: 22863460
Oh, I wanted to add: we are using scripts to modify the default cluster setup for Oracle, following this article:
 
http://www.samag.com/documents/s=9370/sam0704a/0704a.htm
 
LVL 11

Accepted Solution

by:
jgiordano earned 500 total points
ID: 22863610
Here are the steps to stop the cluster services on a member. What are you trying to repair?

 To stop the cluster software on a member, type the following commands in this order:

   1.      service rgmanager stop
   2.      service gfs stop, if you are using Red Hat GFS
   3.      service clvmd stop, if CLVM has been used to create clustered volumes
   4.      service cman stop
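
The stop order above can be sketched as a small shell script. This is only an illustration, not something from the Red Hat manual: the DRY_RUN, USE_GFS and USE_CLVM variables and the two function names are made up for the example, and by default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: stop the cluster software on one member in the order listed above.
# DRY_RUN, USE_GFS and USE_CLVM are hypothetical knobs for this example.
DRY_RUN="${DRY_RUN:-1}"     # 1 = only print the commands (default)
USE_GFS="${USE_GFS:-0}"     # 1 = this member uses Red Hat GFS
USE_CLVM="${USE_CLVM:-1}"   # 1 = CLVM was used to create clustered volumes

# Print the init script names in the order they should be stopped.
stop_order() {
    echo rgmanager
    if [ "$USE_GFS" = 1 ]; then echo gfs; fi
    if [ "$USE_CLVM" = 1 ]; then echo clvmd; fi
    echo cman
}

# Stop each service in turn; give up at the first failure.
stop_cluster_member() {
    for svc in $(stop_order); do
        if [ "$DRY_RUN" = 1 ]; then
            echo "service $svc stop"
        else
            service "$svc" stop || return 1
        fi
    done
}

stop_cluster_member
```

Run as-is, it just lists the commands in order. Setting DRY_RUN=0 makes it actually call service, and it stops at the first command that fails rather than carrying on to cman.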


http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/en-US/RHEL510/Cluster_Administration/s1-admin-start-CA.html

 

Author Comment

by:Calbrenar
ID: 22863632
When I first attempted to get Oracle working, it kept telling me Oracle wasn't started even after I started it. So I started looking at the cluster, but there was no way to start it. After rebooting the machines one at a time I was able to get into the cluster manager on both servers, but one of them is not part of the cluster and gives an error about going to the manager:
 
because this node is not currently part of a cluster
 
So I decided to try removing it and re-adding it, and ran into the problems above. I'm trying your steps now; it's currently stuck at "Waiting for services to stop:".
 
LVL 11

Expert Comment

by:jgiordano
ID: 22863645
It sounds like your cluster is in a pretty funky state. I feel your pain; RHEL Cluster Suite isn't exactly the easiest cluster product to work with. Is this system in production?

Have you considered using Conga to manage the cluster in addition to the command line? I find it much easier to manage/add/remove the resources of the cluster.
 

Author Comment

by:Calbrenar
ID: 22863760
Not sure what Conga is, but we're basically stuck with using the cluster as configured, since it's already been released and is in use by the customer, so we have to maintain it and develop releases for it. I think it may just be this one member that is screwed up; the other appears to be working, but I'm afraid to delete it out of the cluster without fully stopping all the services. I tried pressing CTRL-C to stop the command and then rerunning it, but it still sticks at "Waiting for services to stop".
 
LVL 11

Expert Comment

by:jgiordano
ID: 22863789
Conga is two packages that you can install on a separate RH box or on members of the cluster. Luci is the management piece and ricci is the agent piece. It lets you do many of the tasks you are doing now through a GUI.

Back to your server, I would try rebooting again.
Do you have any type of fencing enabled?
 

Author Comment

by:Calbrenar
ID: 22863800
Are you referring to the Cluster Configuration tool? It has two tabs, Cluster Configuration and Cluster Management. It shows the machine I'm having problems with, but its status says "not a member".
 
I believe fencing is enabled, because when I reboot the computer and watch the console I see it talking about fencing, and it takes a long time to do anything.
 
LVL 11

Expert Comment

by:jgiordano
ID: 22863812
No, Conga is web-based. I haven't used the cluster config GUI tool.


The slow fencing at boot may mean that it is not enabled. I had a HUGE issue with my cluster until I enabled some type of fencing. Can you cat /etc/cluster/cluster.conf on both nodes and make sure that they are at the same revision level? Also check the section for fencing and see what is in it.
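
A rough way to do that comparison from one node is sketched below. It assumes you have already copied the other node's cluster.conf to a local file; the REMOTE_COPY path and the function names are hypothetical, and the sed pattern assumes config_version is an attribute on the opening <cluster> tag.

```shell
#!/bin/sh
# Sketch only: compare two copies of cluster.conf.
# REMOTE_COPY is a hypothetical path for a copy fetched from the other node.
LOCAL_CONF="${LOCAL_CONF:-/etc/cluster/cluster.conf}"
REMOTE_COPY="${REMOTE_COPY:-/tmp/cluster.conf.othernode}"

# Print the config_version attribute of the <cluster> element.
config_version() {
    sed -n 's/.*<cluster[^>]*config_version="\([0-9][0-9]*\)".*/\1/p' "$1" | head -n 1
}

# Succeed only if the two files are byte-for-byte identical.
same_config() {
    cmp -s "$1" "$2"
}

# Print every line that mentions fencing, with line numbers.
fence_lines() {
    grep -n -i 'fence' "$1"
}

# Only report when both copies are actually present and readable.
if [ -r "$LOCAL_CONF" ] && [ -r "$REMOTE_COPY" ]; then
    echo "local version:  $(config_version "$LOCAL_CONF")"
    echo "remote version: $(config_version "$REMOTE_COPY")"
    if same_config "$LOCAL_CONF" "$REMOTE_COPY"; then
        echo "files are identical"
    else
        echo "files differ:"
        diff "$LOCAL_CONF" "$REMOTE_COPY" || true
    fi
    fence_lines "$LOCAL_CONF" || true
fi
```

Matching version numbers are not enough on their own, which is why the sketch also does a byte-for-byte comparison and prints the diff.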
 

Author Comment

by:Calbrenar
ID: 22863828
The copy on the broken node says config_version 15. The one on the good node says 14.
 
Both say fence_daemon post_fail_delay = 0
Both machines are enclosed in <fence></fence> tags and fencedevice is set to fence_manual.
 
The version is most likely different because I deleted the broken member from the failover domain. In fact I'm sure that's true, because when I hit save on the good one and reloaded the config, they were both set to 15.
 
LVL 11

Expert Comment

by:jgiordano
ID: 22863841
Just because they are both 15 doesn't mean that they are identical; you want to have them identical. You currently are not using any sort of fencing since it is set to manual. I would recommend you configure some type of fencing. Many of your problems will go away once fencing is correctly configured. Even though they have an option for manual, the cluster takes forever to start and stop fencing when it is not configured.  
 

Author Comment

by:Calbrenar
ID: 22863856
I'd be happy to configure it once I can get rid of the bad cluster member. If I can remove everything from it, then I can add it back into the cluster from scratch. But I'm not sure how to do it, since the command you listed never finishes.
 
LVL 11

Expert Comment

by:jgiordano
ID: 22863962
It might be hung, in which case the only solution is to reboot. I dealt with a similar issue with clvmd and needed to reboot.
 

Author Comment

by:Calbrenar
ID: 22864010
The machine wouldn't even reboot; I had to reboot the VM. Now the commands work. Going to remove it from the cluster and re-add it. Wish me luck -- thanks for the help, A+!
 
LVL 11

Expert Comment

by:jgiordano
ID: 22864014
Great! Definitely use the VM fencing; you will notice that the fence daemon starts and stops much more easily.