lhrslsshahi
asked on
AWS- KeepAlived notify script not working
Hello Experts,
To test HA, I stop keepalived. Keepalived does run the notify shell script, but the script does not disassociate the EIP and reassociate it with the other instance. I can see VRRP changing the instance from MASTER to BACKUP and vice versa.
However, if I run keepalived-notify.sh manually, it works as expected. Please see the attached keepalived.conf and the shell script.
The OS is CentOS 7.2. Any help would be appreciated.
-keepalived_conf.txt
-Keepalived_notify_sh.txt
ASKER
I will come back to you. Thanks for your patience.
ASKER
I have turned on the logging and get the below;
HTTPSConnectionPool(host='ec2.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<botocore.awsrequest.AWSHTTPSConnection object at 0x32f0a90>, 'Connection to ec2.us-east-1.amazonaws.com timed out. (connect timeout=60)'))
+ info=
+ '[' 255 -ne 0 ']'
+ echo 'Could not find info for 44.193.35.115'
Could not find info for 44.193.35.115
+ exit 1
I have no problem at all with that site
19:49:45$ ping -c1 ec2.us-east-1.amazonaws.com
PING ec2.amazonaws.com (54.239.28.168) 56(84) bytes of data.
64 bytes from 54.239.28.168: icmp_seq=1 ttl=227 time=241 ms
--- ec2.amazonaws.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 241.582/241.582/241.582/0.000 ms
09:32:48$ telnet ec2.us-east-1.amazonaws.com 443
Trying 54.239.28.168...
Connected to ec2.us-east-1.amazonaws.com.
Escape character is '^]'.
^]
telnet> q
Connection closed.
09:33:16$
Please try these commands yourself and post back.
ASKER
I don't have any problems with these tests. When I run the script manually, it goes through.
Are you saying you actually tried the specific commands in https:#a41920173 ? If not, please try them.
I know "when you run the script manually it goes through" but that may be because it does not try to visit amazonaws.com. I really need that sort of detail to get a handle on your problem.
Assuming that you did try the specific commands, I suspect you are running into a DNS time-out. If a local DNS doesn't have a requested address in its cache, it can take that DNS some time to get that address from somewhere else. Once it has it, it sticks around in cache for a while. That would agree with your observation that the whole script works for you later.
If this is your problem, you will need to force DNS to get the wanted address by inserting logic at the head of the script or in the calling script before invoking the failing script. Retry the ping -c1 ... command until it is successful. To cater for the site's actually being down, limit this to, say, 20 attempts.
ASKER
I am getting the same results
ping -c1 ec2.us-east-1.amazonaws.com
PING ec2.amazonaws.com (54.239.28.176) 56(84) bytes of data.
64 bytes from 54.239.28.176: icmp_seq=1 ttl=246 time=1.07 ms
telnet ec2.us-east-1.amazonaws.com 443
Trying 54.239.28.176...
Connected to ec2.us-east-1.amazonaws.com.
Escape character is '^]'.
q
Connection closed by foreign host.
The below is the script when run manually with debugging.
Fri Dec 9 12:17:37 EST 2016
+ echo BACKUP
+ export AWS_ACCESS_KEY_ID=BKIAIR4UDFXNGDCN2X5A
+ AWS_ACCESS_KEY_ID=AKIAIR4UDFXNGDZN2X5C
+ export AWS_SECRET_ACCESS_KEY=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
+ AWS_SECRET_ACCESS_KEY=EHup8GD8jZPLza3JbUO5LniWlWjNqogvKztH2gsL
+ export AWS_DEFAULT_REGION=us-east-1
+ AWS_DEFAULT_REGION=us-east-1
+ EIP=44.193.35.215
+ INSTANCE_ID=i-e9ca0d61
++ /bin/aws ec2 describe-addresses --public-ips 44.193.35.215 --output text
+ info='ADDRESSES eipalloc-fc6079c3 eipassoc-cf98aef0 vpc
i-f9ca0d61 eni-0e03f7e2 498606453616 172.30.2.61 34.193.35.215'
+ '[' 0 -ne 0 ']'
++ echo ADDRESSES eipalloc-fc6079c3 eipassoc-cf98aef0 vpc i-e9ca0d61 eni-0e03f7e2 498606453616 172.30.2.61 44.193.35.215
++ awk '{print $2;}'
+ allocation_id=eipalloc-fc6079c3
++ echo ADDRESSES eipalloc-fc6079c3 eipassoc-cf98aef0 vpc i-e9ca0d61 eni-0e03f7e2 498606453616 172.30.2.61 44.193.35.215
++ awk '{print $3;}'
+ association_id=eipassoc-cf98aef0
+ [[ eipassoc-cf98aef0 == eipassoc-* ]]
+ /bin/aws ec2 disassociate-address --association-id eipassoc-cf98aef0
+ /bin/aws ec2 associate-address --instance-id i-e9ca0d61 --allocation-id eipalloc-fc6079c3
ASKER
On the standby instance, when I keep the SSH session open, I get the output below.
It definitely does behave differently.
I get the error below because I have temporarily assigned an EIP to this instance on purpose: An error occurred (Resource.AlreadyAssociated) when calling the AssociateAddress operation: resource eni-0e03f7e2 and 172.30.2.61 is already associated with public address 44.193.13.64
Tue Dec 13 05:07:20 EST 2016
+ echo MASTER
+ export AWS_ACCESS_KEY_ID=BKIAIR4UDFXNGDCN2X5A
+ AWS_ACCESS_KEY_ID=CKIAIR4UDFXNGDCN2X5A
+ export AWS_SECRET_ACCESS_KEY=FHup8GD8jZPLza3JbUO5LniWlWjNqogvKztH2gsZ
+ AWS_SECRET_ACCESS_KEY=NHup8GD8jZPLza3JbUO5LniWlWjNqogvKztH2gsM
+ export AWS_DEFAULT_REGION=us-east-1
+ AWS_DEFAULT_REGION=us-east-1
+ EIP=44.193.35.215
+ INSTANCE_ID=i-e9ca0d61
++ /bin/aws ec2 describe-addresses --public-ips 44.193.35.215 --output text
+ info='ADDRESSES eipalloc-fc6079c3 eipassoc-e86360d7 vpc i-e9cb0c78 eni-320efade 498606453616 172.30.2.60 44.193.35.215'
+ '[' 0 -ne 0 ']'
++ echo ADDRESSES eipalloc-fc6079c3 eipassoc-e86360d7 vpc i-e9cb0c78 eni-320efade 498606453616 172.30.2.60 44.193.35.215
++ awk '{print $2;}'
+ allocation_id=eipalloc-fc6079c3
++ echo ADDRESSES eipalloc-fc6079c3 eipassoc-e86360d7 vpc i-e9cb0c78 eni-320efade 498606453616 172.30.2.60 44.193.35.215
++ awk '{print $3;}'
+ association_id=eipassoc-e86360d7
+ [[ eipassoc-e86360d7 == eipassoc-* ]]
+ /bin/aws ec2 disassociate-address --association-id eipassoc-e86360d7
+ /bin/aws ec2 associate-address --instance-id i-e9ca0d61 --allocation-id eipalloc-fc6079c3
An error occurred (Resource.AlreadyAssociated) when calling the AssociateAddress operation: resource eni-0e03f7e2 and 172.30.2.61 is already associated with public address 44.193.13.64
This is a different problem isn't it? Is it a problem at all?
Your new posts mention address 44.193.35.215 but your original error log https:#a41919846 mentions 44.193.35.115
ASKER
Sorry... I was trying to mask the IP :-) Same problem.
Resource.AlreadyAssociated is not the error you were getting originally. Originally, /bin/aws ec2 describe-addresses --public-ips 44.193.35.215 --output text was getting a connection timeout. When you ran the script manually, that did not happen.
Perhaps at boot time you run the script too early, before the network is ready. Or it could be DNS caching as I suggested before.
ASKER
You are right, that is not the original error. I caused it on purpose by assigning the second instance an EIP. The script was not run manually; the only difference is that I had the SSH session open. There are no DNS issues. Does the output from the following command behave differently when an SSH session is not open?
info=`/bin/aws ec2 describe-addresses --public-ips $EIP --output text`
I do not know what /bin/aws does, but I would not expect it to be affected by whether an ssh session is open or not.
I recommend you try my suggestion from https:#a41923068. Something like
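# Try up to 20 times to reach the EC2 endpoint; a success also primes the DNS cache.
for ((i = 0; i < 20; i++))
do
    ping -c1 ec2.us-east-1.amazonaws.com && break
    sleep 1   # brief pause before the next attempt
done
# i reaches 20 only if every attempt failed.
if [ "$i" -eq 20 ]
then
    echo "Unable to contact ec2.us-east-1.amazonaws.com" >&2
    exit 1
fi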
Unless other start-up items depend on this script's completion, background it. That way, something else may get done that allows it to complete.
ASKER
+ EIP=44.193.35.215
+ INSTANCE_ID=i-e9ca0d61
++ /bin/aws ec2 describe-addresses --public-ips 44.193.35.215 --output text
+ info='ADDRESSES eipalloc-fc6079c3 eipassoc-cf0d10f0 vpc i-e0cb0c78 eni-320efade 498606453616 172.30.2.60 44.193.35.215'
+ '[' 0 -ne 0 ']'
++ echo ADDRESSES eipalloc-fc6079c3 eipassoc-cf0d10f0 vpc i-e0cb0c78 eni-320efade 498606453616 172.30.2.60 44.193.35.215
++ awk '{print $2;}'
+ allocation_id=eipalloc-fc6079c3
++ echo ADDRESSES eipalloc-fc6079c3 eipassoc-cf0d10f0 vpc i-e0cb0c78 eni-320efade 498606453616 172.30.2.60 44.193.35.215
++ awk '{print $3;}'
+ association_id=eipassoc-cf0d10f0
+ [[ eipassoc-cf0d10f0 == eipassoc-* ]]
+ /bin/aws ec2 disassociate-address --association-id eipassoc-cf0d10f0
+ /bin/aws ec2 associate-address --instance-id i-f9ca0d61 --allocation-id eipalloc-fc6079c3
HTTPSConnectionPool(host='ec2.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<botocore.awsrequest.AWSHTTPSConnection object at 0x37e8c50>, 'Connection to ec2.us-east-1.amazonaws.com timed out. (connect timeout=60)'))
I have done the above and included the ping in the script. It is now disassociating the EIP from the first instance but timing out when associating the EIP with the second instance.
I don't have the detailed knowledge of what aws is doing here to be able to help.
In general though, a socket that has been used and closed does not become available again for a time-out period, unless it was opened (and is re-opened I think) with SO_REUSEADDR
ASKER
Thanks Duncan, so maybe I could add a pause after it has disassociated the EIP?
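One way to experiment with a pause between the two calls is a small retry wrapper. The following is only a sketch, not the accepted solution from this thread: the function name retry and the attempt and delay values are made up for illustration, and the variable names in the usage comments are the ones visible in the trace output above. Note that each failed aws call in the logs takes 60 seconds to time out, so a few retries can add up to several minutes.

```shell
#!/bin/bash
# retry ATTEMPTS DELAY COMMAND [ARGS...]
# Run COMMAND until it succeeds, at most ATTEMPTS times, sleeping DELAY
# seconds between tries. Returns 0 on success, 1 if every attempt failed.
# (Hypothetical helper, not part of keepalived or the AWS CLI.)
retry() {
    local attempts="$1" delay="$2"
    shift 2
    local n
    for ((n = 1; n <= attempts; n++)); do
        "$@" && return 0
        echo "attempt $n of $attempts failed: $*" >&2
        if ((n < attempts)); then
            sleep "$delay"
        fi
    done
    return 1
}

# Possible use inside the notify script (attempt count and delay are assumptions):
#   retry 5 3 /bin/aws ec2 disassociate-address --association-id "$association_id"
#   retry 5 3 /bin/aws ec2 associate-address --instance-id "$INSTANCE_ID" --allocation-id "$allocation_id"
```

If the associate call keeps timing out however long the pause is, the problem is more likely to be connectivity from the instance than timing.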
ASKER
Apologies for not coming back sooner; I have been off for the Christmas holidays. Please can we keep this question open.
Objecting on behalf of the question author
@lhrslsshahi: You could have done this yourself you know
Now that you're back, please try the solution in https://#a41931519 and get back with the results ASAP
ASKER
Please can we keep this open for a while. Apologies, I haven't had time to go back to this project.
ASKER
Thanks Duncan for your help!
That doesn't look like your problem though, so I suggest you log what the script is doing by adding a few lines at the top that write its output to a file in /tmp, then the rest of your script as before. Stop keepalived again and see what shows up in /tmp.
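The snippet that went with this suggestion is not preserved in the thread. As a rough sketch of the kind of logging it describes (the function name start_trace and the log path are assumptions, not the original code), something like this at the top of the notify script would produce a date line followed by "+ command" traces like the ones posted earlier:

```shell
#!/bin/bash
# start_trace LOGFILE: append all later output of this shell to LOGFILE
# and switch on command tracing. (Hypothetical helper name.)
start_trace() {
    local logfile="$1"
    exec >>"$logfile" 2>&1   # redirect stdout and stderr for the rest of the script
    date                     # timestamp this run
    set -x                   # print each command before it runs
}

# Possible use as the first line of the notify script (path is an assumption):
#   start_trace /tmp/keepalived-notify.log
```

Because keepalived runs the script without a terminal, anything the script prints is otherwise easy to miss; appending to a file keeps the output of every state change in one place.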