Solved

Redhat Cluster issue

Posted on 2010-09-23
Last Modified: 2012-06-27
Hi,
I am having the issue below with my cluster.

I am using luci to administer the whole cluster.
One service, httpd1, is supposed to run on node1; its failover domain is node1 and node2.

When I try to relocate this service to node1
from the luci server, I see this error:

 luci[3666]: Unable to retrieve batch 2010230059 status from luci.local:11111: clusvcadm start failed to start httpd1:

I am very familiar with this error: as far as I know, it means ricci on node1 (where httpd1 is supposed to run) is not responsive.

I have checked that ricci is running on node1:
pgrep ricci
5544

I have checked that port 11111 is listening:
netstat -an  | grep 11111
tcp        0      0 0.0.0.0:11111               0.0.0.0:*                   LISTEN

I have checked that iptables is flushed.

But I still can't relocate this service to node1. It runs fine on node2, but not on node1.
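
A note on the checks above: a LISTEN line for 0.0.0.0:11111 shows that ricci accepts connections on every local interface, but it does not show that the luci host can actually complete a connection to it. The sketch below only pulls the bind address for a port out of `netstat -an` output; the function name `listening_addrs` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the local address(es) a port is bound to,
# reading `netstat -an` output on stdin.
listening_addrs() {
    awk -v port="$1" '
        $NF == "LISTEN" {
            n = split($4, parts, ":")       # field 4 is the local ADDR:PORT
            if (parts[n] == port) {
                addr = $4
                sub(":" port "$", "", addr) # drop the trailing :PORT
                print addr
            }
        }'
}

# Usage on node1:
#   netstat -an | listening_addrs 11111
# 0.0.0.0 means all interfaces; 127.0.0.1 would mean local-only.
```

The complementary check is a connection attempt to node1 port 11111 made from the luci host itself (with telnet or nc, whichever is installed there).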


From node1 (where the problem is):

 clustat
Cluster Status for ng1 @ Thu Sep 23 14:27:34 2010
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 riciserver..local                  1 Online, rgmanager
 dns1..local              2 Online, rgmanager
 node1..local                   3 Online, Local, rgmanager
 node2..local                  4 Online, rgmanager

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 service:httpd1                 node2..local     started
 

I'm missing something, but I don't understand what.
Can anyone please help me with this? Thanks.

0
Comment
Question by:fosiul01
  • 25
  • 12
37 Comments
 
LVL 76

Expert Comment

by:arnold
Comment Utility
Are there any log entries on node2 specifically for the attempt to start the httpd1 service?
Has the IP on which httpd1 is supposed to listen been moved as well?

0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
Currently the httpd1 service is running on node2 (as it is in its failover domain).

It is supposed to run on node1, but it does not.


When I try to relocate the service from the luci server, I see that error on the luci server, but nothing is logged on node1 or node2.

And when I try to get node1 information via the luci web GUI, it says:

The ricci agent for this node is unresponsive. Node-specific information is not available at this time.

But as I said, ricci is running on node1 and iptables is flushed.

I don't understand what the problem could be.


0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
From the luci server, when I try to view the log of node1, sometimes it shows the log and sometimes it says the node is not responsive. This is weird.

0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
For example, this log entry:

luci[3666]: Error reading from node1.xxxxx.local:11111: timeout

And ricci is running on node1.



0
 
LVL 76

Expert Comment

by:arnold
Comment Utility
If you run the same check on node1, what does it report?
I'm not familiar with luci or ricci.

Check iptables?

Are you able to log in to node1 and look through the various logs to see what is going on?
Could it be that the certificates they use have expired or are not trusted?
luci, ricci, messages?
0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
If you ran the same check on node1 what does it report? : What kind of check do you want me to run?

Check iptables? : It is flushed.

iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain RH-Firewall-1-INPUT (0 references)
target     prot opt source               destination


Are you able to login into node1 and see through the various logs what is going on? : There is nothing unusual in the node1 log; it is connected to the cluster.
Another service is running on node1, and I can't relocate that service to another node.
So basically I can't move or relocate any service from node1.


0
 
LVL 76

Expert Comment

by:arnold
Comment Utility
See if you configured exclusions, i.e. servicex and servicey cannot run on the same node.

Check the various logs (luci, ricci, clummgr, etc.) to see whether anything is being reported.
Have you updated the kernel or cluster suite on either of the systems?
0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
Have you updated either of the systems kernel/cluster suite? :
 I updated the server with yum update, but that should not make any difference: I checked that all the RPMs are the same as on the other nodes, and this node suddenly started giving this trouble.

 See if you configured exclusions i.e. servicex and servicey can not run on the same node? :
No, nothing. Basically ricci is very unstable on this server. I will try to delete this node and join it again to see if that helps.


0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
There is another service running on this node, and I can't even relocate that service to another node.

Sometimes it says (in the luci GUI):
The ricci agent for this node is unresponsive. Node-specific information is not available at this time.

Sometimes it's OK.

It is driving me mad now.


0
 
LVL 76

Expert Comment

by:arnold
Comment Utility
There has to be some log entry that describes what is preventing the relocation of the service.
0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
That's the problem...

Only these messages appear.

Either

luci[3666]: Unable to retrieve batch 2010230059 status from luci.local:11111: clusvcadm start failed to start httpd1:

or


luci[3666]: Error reading from node1.xxxxx.local:11111: timeout
or

The ricci agent for this node is unresponsive. Node-specific information is not available at this time.

Every one of these errors is related to the ricci daemon.

Is there any way to debug the cluster logs? Can I enable some debugging to see what is causing this?
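
One low-tech way to see which daemon is complaining is to tally the cluster-related lines in syslog. This is only a sketch: it assumes the daemons log to /var/log/messages and that rgmanager shows up there as clurgmgrd, neither of which is confirmed in this thread, and the function name `count_by_daemon` is invented.

```shell
#!/bin/sh
# Hypothetical helper: count syslog lines per cluster daemon.
# Reads log lines on stdin and prints "<daemon> <count>", sorted by name.
count_by_daemon() {
    awk '
        match($0, /(luci|ricci|clurgmgrd)\[[0-9]+\]:/) {
            name = substr($0, RSTART, RLENGTH)  # e.g. "luci[3666]:"
            sub(/\[.*/, "", name)               # keep only the daemon name
            count[name]++
        }
        END { for (d in count) print d, count[d] }' | sort
}

# Usage (log path is an assumption):
#   count_by_daemon < /var/log/messages
```

Running it on each node and on the luci host shows quickly whether ricci itself ever logs anything, or whether only luci reports the failures.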


0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
Someone on the Red Hat cluster mailing list confirmed to me that it is a bug for 5.6:

https://bugzilla.redhat.com/show_bug.cgi?id=564490

Now I need to know how to apply the patch.

0
 
LVL 76

Accepted Solution

by:
arnold earned 500 total points
Comment Utility
Download and create the patch file ricci.patch.

Download and install the source of conga:
wget http://mirrors.kernel.org/centos/5/os/SRPMS/conga-0.12.2-12.el5.centos.1.src.rpm
rpm -i conga-0.12.2-12.el5.centos.1.src.rpm
cd /usr/src/redhat/SOURCES
gzip -cd < conga-0.12.2.tar.gz | tar -xf -


patch -p0 < /path/to/where_the_patch/ricci.patch
This should report that it updated the file
conga-*/ricci/common/executils.cpp.

Alternatively, you can edit the file by hand: replace line 174, marked with (-), with the lines marked with (+), then do the same for the second line marked with (-) and the (+) lines that follow it.
You might want to configure this version to install into /usr/local so that, should a later update still have this issue, it will not cause problems on your system.

There is an INSTALL file; follow the instructions within.

0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
Hi,
Sorry, I have not been available for the last 2 days.


You said, "Download and create the patchfile ricci.patch":

How do you want me to create this patch file? Where do I download it from?
Thanks

0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
Hi, also please have a look at this one:
http://www.experts-exchange.com/OS/Linux/Q_26501563.html
0
 
LVL 76

Expert Comment

by:arnold
Comment Utility
The link you posted has a download section where the patch is displayed in plain text.
Copy and paste the data into a file, e.g. with vi patch_filename.
It's fairly straightforward. Download and install the conga source, then expand the conga tar in /usr/src/redhat/SOURCES/conga-*.tar.gz

With /usr/src/redhat/SOURCES/ as your CWD,
run:
patch -p0 < patch_filename
You will see what the patch did,
i.e. hunk #so-and-so at line number #such-and-such.
Once you're done, cd into the conga- directory and follow the INSTALL instructions.

0
 
LVL 76

Expert Comment

by:arnold
Comment Utility
Oh, note that the conga package includes other patches (bz*) that you need to apply before you compile and install the conga-related applications.

If you have a test environment, you might want to test first.
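
The bz* patches are normally applied through the spec file rather than by hand. As a rough sketch, the helper below lists the Patch entries a spec declares; the spec path in the usage comment is an assumption about where `rpm -i` placed it, and `list_spec_patches` is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: print each patch file name declared in a spec file.
list_spec_patches() {
    # $1 = path to a .spec file
    awk '/^Patch[0-9]*:/ { sub(/^Patch[0-9]*:[ \t]*/, ""); print }' "$1"
}

# Usage (assumed path):
#   list_spec_patches /usr/src/redhat/SPECS/conga.spec
# rpmbuild -bp <spec> runs the %prep stage, which unpacks the sources
# and applies the patches the spec lists.
```

A patch that is not listed in the spec, such as the ricci.patch copied from the bug report, would still have to be applied separately.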
0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
OK, I need a little bit of help.

It says it can't find the file to patch.

My ricci.patch is in: root/patch/ricci.patch

I went to

/usr/src/redhat/SOURCES/conga-0.12.2



then executed

patch -p0 < root/patch/ricci.patch

but it says:

can't find file to patch at input line 4
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|diff -up conga-0.12.2/ricci/common/executils.cpp.waitpidfix conga-0.12.2/ricci/common/executils.cpp
|--- conga-0.12.2/ricci/common/executils.cpp.waitpidfix 2010-03-17 14:38:38.000000000 -0500
|+++ conga-0.12.2/ricci/common/executils.cpp    2010-03-17 14:40:22.000000000 -0500
--------------------------
File to patch:



So where should I run the patch command from?
Thanks

0

 
LVL 29

Author Comment

by:fosiul01
Comment Utility
No, it's OK now.

It says:

patch -p0 < /root/patch/ricci.patch
patching file conga-0.12.2/ricci/common/executils.cpp
Hunk #1 succeeded at 173 (offset -15 lines).

I had to run this from the SOURCES directory.
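
Why the working directory mattered: with -p0, patch uses the path from the diff header unchanged, so conga-0.12.2/ricci/common/executils.cpp has to resolve relative to the current directory. From inside conga-0.12.2, -p1 would strip the leading component instead. A small sketch (the helper name `patch_target` is invented) that shows which path patch will look for at a given strip level:

```shell
#!/bin/sh
# Hypothetical helper: show the file path a patch refers to after
# stripping N leading path components (the N in patch -pN).
patch_target() {
    # $1 = patch file, $2 = strip count
    awk -v n="$2" '
        index($0, "+++ ") == 1 {
            path = $2
            for (i = 0; i < n; i++) sub("^[^/]*/", "", path)
            print path
            exit
        }' "$1"
}

# Usage:
#   patch_target /root/patch/ricci.patch 0   # path as seen from SOURCES
#   patch_target /root/patch/ricci.patch 1   # path as seen from conga-0.12.2
# With GNU patch, adding --dry-run reports what would happen without
# changing any files:
#   cd /usr/src/redhat/SOURCES && patch -p0 --dry-run < /root/patch/ricci.patch
```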
0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
What does this mean?

[root@beaver conga-0.12.2]# ./configure
D-BUS required, but I am unable to locate it. Is it installed?



I have been searching on Google, but no luck.

0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
This is giving me a problem.

I installed dbus:

[root@beaver conga-0.12.2]# ./configure --include_zope_and_plone=yes
D-BUS version 1.1.2 detected  -> major 1, minor 1
missing zope directory, extract zope source-code into it and try again


I don't understand how to tell it where the zope source code is.

The path of the zope source code is:
 /usr/src/redhat/SOURCES/conga-0.12.2/Zope-2.9.8-final




0
 
LVL 76

Expert Comment

by:arnold
Comment Utility
Note within the source directory there is a plone and a zope zipped archive that were included within the conga RPM.  You need to expand them if you want them included.
0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
This is the directory:

[root@beaver conga-0.12.2]# pwd
/usr/src/redhat/SOURCES/conga-0.12.2
[root@beaver conga-0.12.2]# ls
autogen.sh        doc             Plone-2.5.5-CMFPlone.patch
BRANCH            download_files  Plone-2.5.5.tar.gz
clustermon.spec   INSTALL         ricci
configure         luci            Zope-2.9.8-final
conga.spec        make            Zope-2.9.8-final.tgz
conga.spec.in.in  Makefile
COPYING           Plone-2.5.5
[root@beaver conga-0.12.2]#
 ./configure --include_zope_and_plone=yes
D-BUS version 1.1.2 detected  -> major 1, minor 1
missing zope directory, extract zope source-code into it and try again


So how do I tell ./configure that the zope source code is in /usr/src/redhat/SOURCES/conga-0.12.2/Zope-2.9.8-final?



0
 
LVL 76

Expert Comment

by:arnold
Comment Utility
Why did you expand zope and plone in the conga directory rather than in SOURCES?

Move Zope* and Plone* out of conga and see if that fixes it.
Do you need zope or plone?
0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
Do you need zope or plone? : No, I don't need them.

I selected no, but it does not do anything.

0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
move Zope* and Plone* out of conga and see if that is fixed? :

I tried that:

[root@beaver SOURCES]# ls
bz469881.patch  bz519268.patch              luci_db.tar.gz
bz501780.patch  bz521884.patch              Plone-2.5.5
bz508142.patch  bz530129.patch              Plone-2.5.5-CMFPlone.patch
bz514051.patch  conga-0.12.2                Plone-2.5.5.tar.gz
bz517114.patch  conga-0.12.2.tar.gz         Zope-2.9.8-final
bz519050.patch  conga-centos-updated.patch  Zope-2.9.8-final.tgz
bz519252.patch  Data.fs



Do you need zope or plone? :

 ./configure --include_zope_and_plone=no
D-BUS version 1.1.2 detected  -> major 1, minor 1
Run 'make' to compile conga and clustermon
Run 'make conga' to compile conga
Run 'make clustermon' to compile clustermon

make :

error: command 'gcc' failed with exit status 1
make[2]: *** [build] Error 1
make[2]: Leaving directory `/usr/src/redhat/SOURCES/conga-0.12.2/luci/conga_ssl'
make[1]: *** [luci] Error 2
make[1]: Leaving directory `/usr/src/redhat/SOURCES/conga-0.12.2/luci'
make: *** [luci] Error 2


I'm going mad.

0
 
LVL 76

Expert Comment

by:arnold
Comment Utility
Try the following:
make distclean
./autogen.sh
./configure
make
make install

0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
Installing from source is impossible for me.

However,

http://rhn.redhat.com/errata/RHBA-2010-0716.html

Red Hat published these packages yesterday, which are supposed to fix the issue:

ricci-0.12.2-12.el5_5.4.i386.rpm


But I already have these:

luci-0.12.2-12.el5.centos.1
 ricci-0.12.2-12.el5.centos.1


Aren't they the same version?
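
On the version question: RPM names follow name-version-release, and only the version part (0.12.2) matches here. The release strings differ (12.el5_5.4 in the erratum versus 12.el5.centos.1 installed), so these are different builds. A sketch of splitting the strings, with `split_nvr` as a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: split an RPM name-version-release string
# (or a NAME-VERSION-RELEASE.ARCH.rpm file name) into its three parts.
split_nvr() {
    nvr=$1
    case $nvr in
        *.rpm) nvr=${nvr%.rpm}; nvr=${nvr%.*} ;;  # drop ".rpm", then ".arch"
    esac
    release=${nvr##*-}      # text after the last dash
    rest=${nvr%-*}
    version=${rest##*-}     # text between the last two dashes
    name=${rest%-*}
    printf '%s %s %s\n' "$name" "$version" "$release"
}

# Usage:
#   split_nvr ricci-0.12.2-12.el5_5.4.i386.rpm
#   split_nvr ricci-0.12.2-12.el5.centos.1
# The installed values can also be read directly from the RPM database:
#   rpm -q --qf '%{NAME} %{VERSION} %{RELEASE}\n' ricci
```

Whether the CentOS build already contains the fix cannot be told from the name alone; the package changelog (rpm -q --changelog ricci) is the place to look.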


0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
I think I know the problem...


ip ro
10.0.0.0/26 dev eth0  proto kernel  scope link  src 10.0.0.4
10.0.0.0/26 dev br0  proto kernel  scope link  src 10.0.0.52
192.168.122.0/24 dev virbr0  proto kernel  scope link  src 192.168.122.1
169.254.0.0/16 dev br0  scope link
default via 10.0.0.10 dev br0   : it should go via eth0

But the question is how to change this to eth0?

This is getting interesting...
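
To check which interface currently carries the default route, the `ip route` output can be filtered as in this sketch (`default_route_dev` is an invented name, not part of iproute2):

```shell
#!/bin/sh
# Hypothetical helper: print the device used by the default route,
# reading `ip route` output on stdin.
default_route_dev() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++)
            if ($i == "dev") { print $(i + 1); exit }
    }'
}

# Usage:
#   ip route | default_route_dev
```

Running it before and after a route change confirms whether the change actually took effect.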


0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
br0 is here because I installed KVM (virtualization),

but it should not be a problem if traffic goes via br0.


0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
I changed it to eth0,

stopped all cluster-related services, then started them again. Still no luck.



0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
I found this interesting, but I don't know whether it's normal.

I ran this command on 3 servers:

tcpdump -i eth0 ip multicast


and for some reason I am seeing the same output on all 3 servers, which is

11:26:13.700399 IP http1.xxxxx.local.5149 > 239.192.2.185.netsupport: UDP, length 118


for example. Same output on all 3 servers.

Is this normal output? (Here http1 is the one having trouble relocating services in the cluster.)

So basically, whatever I am seeing on the http1 server, I am seeing the same output on the rest.

Here 239.192.2.185 is the multicast address of the cluster.

0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
I rebooted the whole cluster, every single server.

Once everything had been rebooted,

everything looked all right!

[root@http1 ~]# clustat
Cluster Status for ng1 @ Tue Sep 28 13:03:45 2010
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 beaver.xx.local                  1 Online, rgmanager
 publicdns1.xxx.local              2 Online, rgmanager
 http1.xxx.local                   3 Online, Local, rgmanager
 mail01.xxx.local                  4 Online, rgmanager

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 service:httpd1                 http1.xxx.local      started    ------------------- this is supposed to be here.
 service:mysql-server           mail01.xxx.local     started
 service:public-dns             publicdns1.xxx.local started

But then I tried to relocate that service from http1.xxx.local to mail01.xxx.local,

or even just to access http1.xxx.local from the luci server, and got the same problem again.


So something else is wrong. I don't know what.


0
 
LVL 76

Expert Comment

by:arnold
Comment Utility
run netstat -rn
Change the IP on the br0 interface.
Alternatively you can set a weight on eth0 to make it more preferred.
i.e. eth0 weight 0
br0 weight 100
0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
I changed it with:

ip route change default via 10.0.0.10 dev eth0

But when I run tcpdump on br0, it does not send any multicast.

Multicast is going out via eth0, which is good for the cluster,

isn't it?



0
 
LVL 76

Expert Comment

by:arnold
Comment Utility
You need to make it a permanent change. I think you can add a weight to /etc/sysconfig/network-scripts/ifcfg-eth0, and the same for br0, to make this change permanent, or you can always add the route command to rc.local.

I'm not sure the cluster communication is multicast.
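
For reference, two ways such a change is commonly persisted on RHEL/CentOS 5-style systems. Both are sketches under assumptions: GATEWAYDEV is a standard initscripts setting in /etc/sysconfig/network, but how it behaves with eth0 and br0 on the same subnet has not been tested here, and the gateway address is simply the one from the routing table earlier in the thread.

```shell
# /etc/sysconfig/network (fragment): prefer eth0 for the default route
GATEWAY=10.0.0.10
GATEWAYDEV=eth0

# Alternative, as a line in /etc/rc.local: re-apply the route at boot
# ip route replace default via 10.0.0.10 dev eth0
```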
0
 
LVL 29

Author Comment

by:fosiul01
Comment Utility
I broke this node and am reinstalling it from scratch to see at which point it breaks.

Anyway, thanks for your advice and patience.
0
