Solved

Red Hat Cluster Suite -- Oracle HA -- Trying to repair cluster

Posted on 2008-11-02
Last Modified: 2013-12-19
I'm trying to repair a cluster that was set up by someone else who has since left the office. I'm following the commands in the manual to remove the broken one so I can re-add it, but the service command tells me the service is unrecognized.

When I run service --status-all I can see services running such as clurgmgrd and clvmd, but when I run service clurgmgrd stop I get "unrecognized service".

I'm not sure why it's doing that, but I'm leery of following the rest of the directions to delete and re-add the cluster when the services are definitely running even though the service command doesn't see them.
Question by:Calbrenar
14 Comments
 

Author Comment

by:Calbrenar
ID: 22863460
Oh, I wanted to add: we are using scripts to modify the default cluster setup for Oracle, following this article:
 
http://www.samag.com/documents/s=9370/sam0704a/0704a.htm
 
LVL 11

Accepted Solution

by:
jgiordano earned 500 total points
ID: 22863610
Here are the steps to stop the cluster services on a member. What are you trying to repair?

 To stop the cluster software on a member, type the following commands in this order:

   1.      service rgmanager stop
   2.      service gfs stop, if you are using Red Hat GFS
   3.      service clvmd stop, if CLVM has been used to create clustered volumes
   4.      service cman stop


http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/en-US/RHEL510/Cluster_Administration/s1-admin-start-CA.html
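
A minimal sketch of the stop order listed above, written as a shell function. The function name stop_cluster_services and the DRY_RUN variable are made up for illustration; they are not part of Red Hat Cluster Suite. Note that the init script is named rgmanager; as far as I understand RHEL 5 (this is not stated in the manual excerpt above), clurgmgrd is the daemon that script manages, which would explain why "service clurgmgrd stop" is reported as unrecognized.

```shell
#!/bin/sh
# Sketch only: stop the cluster software on one member, in the order
# rgmanager -> gfs -> clvmd -> cman.
# DRY_RUN is a hypothetical switch; it defaults to 1, so nothing is
# stopped unless the caller sets DRY_RUN=0.

stop_cluster_services() {
    for svc in rgmanager gfs clvmd cman; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run: only show what would be executed.
            echo "service $svc stop"
        elif [ -x "/etc/init.d/$svc" ]; then
            # Stop the service; give up on the first failure so later
            # services are not stopped out of order.
            service "$svc" stop || return 1
        else
            # gfs and clvmd are optional, so a missing init script is skipped.
            echo "skipping $svc: no init script found" >&2
        fi
    done
}
```

Calling stop_cluster_services with DRY_RUN unset prints the four commands; running it as root with DRY_RUN=0 would actually stop the services. It does nothing about a stop that hangs.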

 

Author Comment

by:Calbrenar
ID: 22863632
When I first tried to get Oracle working, it kept telling me Oracle wasn't started even after I started it. So I started looking at the cluster, but there was no way to start it. After rebooting the machines one at a time I was able to get into the cluster manager on both servers, but one of them is not part of the cluster and gives an error about going to the manager.

because this node is not currently part of a cluster

So I decided to try removing it and re-adding it, and ran into the problems above. I'm trying your steps above now; it's currently stuck at "Waiting for services to stop:"
 
LVL 11

Expert Comment

by:jgiordano
ID: 22863645
It sounds like your cluster is in a pretty funky state. I feel your pain; RHEL Cluster Suite isn't exactly the easiest cluster product to work with. Is this system in production?

Have you considered using Conga to manage the cluster in addition to the command line? I find it much easier to manage/add/remove the resources of the cluster.
 

Author Comment

by:Calbrenar
ID: 22863760
Not sure what Conga is, but we're basically stuck with using the cluster as configured, since it's already been released and is in use by the customer, so we have to maintain it and develop releases for it. I think it may just be this one member that is screwed up; the other appears to be working, but I'm afraid to delete it out of the cluster without fully stopping all the services. I tried pressing CTRL-C to stop it and then rerunning it, but it still sticks at "Waiting for services to stop".
 
LVL 11

Expert Comment

by:jgiordano
ID: 22863789
Conga consists of two packages that you can install on a separate RH box or on members of the cluster. Luci is the management piece and ricci is the agent piece. It lets you do many of the tasks you are doing now via a GUI.

Back to your server, I would try rebooting again.
Do you have any type of fencing enabled?
 

Author Comment

by:Calbrenar
ID: 22863800
Are you referring to the Cluster Configuration tool? It has two tabs, Cluster Configuration and Cluster Management. It shows the machine I'm having problems with, but its status says it is not a member.

I believe fencing is enabled, because when I reboot the computer and watch the console I see it talking about fencing, and it takes a long time to do anything.
 
LVL 11

Expert Comment

by:jgiordano
ID: 22863812
No, Conga is web based; I haven't used the cluster config GUI tool.


That may mean that it is not enabled. I had a HUGE issue with my cluster until I enabled some type of fencing. Can you cat /etc/cluster/cluster.conf on both nodes and make sure that they are at the same revision level? Also check the fencing section and see what is in it.
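
One way to do that comparison, sketched as two shell functions. It assumes you have copied each node's /etc/cluster/cluster.conf to a local file first; the function names and any file names you pass in are hypothetical. It also assumes config_version appears as a double-quoted attribute, e.g. config_version="15".

```shell
#!/bin/sh
# Sketch only: compare two copies of cluster.conf.

config_version() {
    # Print the first config_version="N" value found in the file $1.
    sed -n 's/.*config_version="\([0-9][0-9]*\)".*/\1/p' "$1" | head -n 1
}

compare_cluster_conf() {
    # $1 and $2 are local copies of cluster.conf from the two nodes.
    echo "config_version: $(config_version "$1") vs $(config_version "$2")"
    if cmp -s "$1" "$2"; then
        echo "files are identical"
    else
        # Same version number does not guarantee same content, so show the diff.
        echo "files differ:"
        diff "$1" "$2"
        return 1
    fi
}
```

For example, after copying the files to node1.conf and node2.conf, compare_cluster_conf node1.conf node2.conf prints both version numbers and any differing lines, and returns non-zero when the files differ.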
 

Author Comment

by:Calbrenar
ID: 22863828
The file on the broken node says config_version 15. The one on the good node says 14.

Both say fence_daemon post_fail_delay = 0
Both machines are enclosed in <fence></fence> tags and fencedevice is set to fence_manual.

The version is most likely different because I deleted the broken member from the failover domain. In fact I'm sure that's true, because when I hit save on the good one and reloaded the config, they were both set to 15.
 
LVL 11

Expert Comment

by:jgiordano
ID: 22863841
Just because they are both 15 doesn't mean that they are identical; you want to have them identical. You currently are not using any sort of fencing since it is set to manual. I would recommend you configure some type of fencing. Many of your problems will go away once fencing is correctly configured. Even though they have an option for manual, the cluster takes forever to start and stop fencing when it is not configured.  
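
A rough way to check whether a cluster.conf still relies on manual fencing. This sketch assumes each fence device is declared on a single line as a fencedevice element with a double-quoted agent attribute (e.g. agent="fence_manual"); the function names are made up for illustration, and a real XML parser would be more robust.

```shell
#!/bin/sh
# Sketch only: list the fence agents declared in a cluster.conf and
# detect whether fence_manual is among them.

fence_agents() {
    # Print each distinct agent="..." value on <fencedevice ...> lines of $1.
    grep '<fencedevice ' "$1" | sed -n 's/.*agent="\([^"]*\)".*/\1/p' | sort -u
}

uses_manual_fencing() {
    # Succeeds (exit status 0) if fence_manual is one of the declared agents.
    fence_agents "$1" | grep -qx 'fence_manual'
}
```

For example: if uses_manual_fencing /etc/cluster/cluster.conf; then echo "still on manual fencing"; fi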
 

Author Comment

by:Calbrenar
ID: 22863856
I'd be happy to configure it once I can get rid of the bad cluster member. If I can remove everything off of it, then I can add it back into the cluster from scratch. But I'm not sure how to do that, since the command you listed never finishes.
 
LVL 11

Expert Comment

by:jgiordano
ID: 22863962
It might be hung, and the only solution may be to reboot. I dealt with a similar issue with clvmd and needed to reboot.
 

Author Comment

by:Calbrenar
ID: 22864010
The machine wouldn't even reboot; I had to reboot the VM. Now the commands work. Going to remove it from the cluster and re-add it. Wish me luck -- thanks for the help, A+!
 
LVL 11

Expert Comment

by:jgiordano
ID: 22864014
Great! Definitely use the VM fencing; you will notice that the fence daemon starts and stops much more easily.