Solved

Problems with a two-node Red Hat cluster with a quorum disk

Posted on 2010-09-02
Last Modified: 2012-08-14
Hello,

I'm building a RHEL 5.5 cluster with 2 nodes and a quorum disk. When I reboot both nodes at the same time, clustat shows both nodes and the quorum disk online. When I disable one node, the cluster fences it and it reboots. The problem is that this node cannot become quorate and never rejoins the cluster. The cluster is left with one node and the quorum disk, while the rebooted node thinks it is online (split-brain).

My cluster.conf is below. Any comments/suggestions will be helpful :)


<?xml version="1.0"?>
<cluster alias="clu" config_version="25" name="clu">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="30"/>
        <clusternodes>
                <clusternode name="node1" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                        --                                  
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="node2" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                  --
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="3"/>
        <fencedevices>
          --
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="nofailback" nofailback="1" ordered="1" restricted="1">
                                <failoverdomainnode name="node1" priority="1"/>
                                <failoverdomainnode name="node2" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources/>
        </rm>
        <totem consensus="4800" join="60" token="136000" token_retransmits_before_loss_const="20"/>
        <quorumd interval="3" label="quorumdisk" min_score="1" tko="15" votes="1"/>
</cluster>

Question by:smary

Expert Comment

by:arnold
ID: 33597128
Is the quorum disk on separate system?

I'm not seeing fencedevices nor fence method.

Author Comment

by:smary
ID: 33598267
The quorum disk is shared SAN space between these two nodes. I do have fencedevices and a fence method, but I've not included them here. It shouldn't make any difference in deciding this, but I can include them if required.

Assisted Solution

by:arnold
arnold earned 500 total points
ID: 33598457
Are you using iscsi to access the storage or do you have FC connection to the drives?
When both nodes are up, can both nodes access the data on the shared drive?
Which fencing method and fencing device are you using?
A complete cluster.conf would be helpful.

See if the info in http://sourceware.org/cluster/doc/usage.txt helps.
The example there uses FC-connected storage (Brocade fencing).

Expert Comment

by:arnold
ID: 33598525
One other thing: make sure that both nodes have an identical copy of the cluster.conf file.
Each node may have itself listed as active, and you have post_fail_delay="0", which I think may lead to both trying to come up as active nodes.

Are there any error messages in /var/log/messages on either node? You should also not reboot both at the same time. Power off one, then reboot the other, which should configure itself as the active one. Then power up the second node, which should then start and attempt to rejoin the cluster. Double-check that the IP resource is configured.

Expert Comment

by:arnold
ID: 33598598
I'm not sure which version this covers, but CentOS is based on Red Hat, so its documentation may apply: http://www.centos.org/docs/5/

If you have access to rhn.redhat.com a similar set of documents should be available there.

Author Comment

by:smary
ID: 33613217
"Are you using iscsi to access the storage or do you have FC connection to the drives?"
-> FC Connection

"When both nodes are up, can both nodes access the data on the shared drive?"
-> Should be. Both the nodes can see the quorum drive online at the same time.

"Which fencing method and fencing device are you using?"
-> IBM blade center.

I've verified that both nodes have identical cluster.conf files. The problem is that when one of the two nodes goes down and is rebooted, it cannot form a quorum with the cluster and shows as offline; it is not able to rejoin the cluster. The log in /var/log/messages says 'Cluster is not quorate. Refusing connections'.

I was not able to find much useful info about two-node clusters with a quorum disk, hence seeking help here.

Any idea why the node is not able to form quorum and join the cluster?
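
For reference, the vote arithmetic implied by the cluster.conf above can be sketched as follows. This is only an illustration, assuming cman's usual majority rule (quorum = expected_votes // 2 + 1); the function names are hypothetical and not part of any cluster tool.

```python
# Illustrative only: vote arithmetic for the cluster.conf posted above.
# Assumption: cman uses a simple majority, quorum = expected_votes // 2 + 1.
# The helper names below are hypothetical, not a real cluster API.


def quorum_threshold(expected_votes: int) -> int:
    """Votes required for the cluster to be quorate (simple majority)."""
    return expected_votes // 2 + 1


def is_quorate(votes_present: int, expected_votes: int) -> bool:
    """True when the votes currently visible reach the quorum threshold."""
    return votes_present >= quorum_threshold(expected_votes)


# Values taken from the posted config: two nodes with votes="1",
# <quorumd votes="1"/>, and <cman expected_votes="3"/>.
EXPECTED_VOTES = 3
NODE_VOTES = 1
QDISK_VOTES = 1

scenarios = {
    "both nodes + quorum disk": 2 * NODE_VOTES + QDISK_VOTES,
    "one node + quorum disk": NODE_VOTES + QDISK_VOTES,
    "one node alone": NODE_VOTES,
}

for name, votes in scenarios.items():
    state = "quorate" if is_quorate(votes, EXPECTED_VOTES) else "not quorate"
    print(f"{name}: {votes} vote(s) -> {state}")
```

Under that assumption the threshold is 2 votes, so a rebooted node that sees neither its peer nor the quorum disk vote holds only 1 vote and would report itself as not quorate. That is consistent with the log message, but it does not by itself explain why the node fails to rejoin.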

Accepted Solution

by:arnold
arnold earned 500 total points
ID: 33613987
Other than the storage fencing, don't you also have IP-based fencing?
Can you post the cluster config?
You need to fence based on all the shared resources.
IP,
http://sources.redhat.com/cluster/wiki/FAQ/Fencing

http://www.redhat.com/docs/manuals/csgfs/browse/rh-cs-en/s1-config-fence-devices.html

What is the error, if any, on the active node? I think the issue is that when a rebooted node comes up, it can not