lhrslsshahi

AWS- KeepAlived notify script not working

Hello Experts,

To test HA, I stop keepalived. Keepalived does run the notify shell script, but the script does not disassociate the EIP and reassociate it with the other instance. I can see VRRP switching the instances between MASTER and BACKUP and vice versa.

However, if I run keepalived-notfy.sh manually, it works as expected. Please see the attached keepalived.conf and the shell script.

The OS version is CentOS 7.2. Any help would be appreciated.
-keepalived_conf.txt
-Keepalived_notify_sh.txt
Duncan Roe

There could be a difference in the environment between when you run it manually and when HA runs it. In particular, I always start bash scripts with #!/bin/bash -p, which stops functions exported to the environment from being used.
That doesn't look like your problem though, so I suggest you log what the script is doing. Something like:
#!/bin/bash -p
exec >>/tmp/Keepalived_notify.out 2>&1
date
set -x
echo $3 > /var/run/keepalived.status

export AWS_ACCESS_KEY_ID=ACIAIR4UDFXNGDCN2X5B

then the rest of your script as before.
Stop keepalived again and see what shows up in /tmp/Keepalived_notify.out.
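To test the suspicion that the environment differs between a manual run and a keepalived-launched run, the script could also record its runtime context and the two snapshots could be compared afterwards. This is only a minimal sketch: the function name `log_context` and the output path are hypothetical and not part of the attached script.

```shell
#!/bin/bash -p
# Hypothetical helper (not from the original script): append the current
# user, working directory, PATH and full environment to a log file so a
# keepalived-launched run can be compared with a manual run.
log_context() {
    local out=${1:-/tmp/Keepalived_notify.env}   # assumed log location
    {
        echo "=== $(date) ==="
        echo "user: $(id -un)  pwd: $PWD"
        echo "PATH=$PATH"
        env | sort
    } >>"$out" 2>&1
}

log_context /tmp/Keepalived_notify.env
```

Running the script once by hand and once through keepalived, then comparing the two snapshots in the log, would show any variables (proxy settings, PATH, HOME) that exist in only one of the two contexts.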
lhrslsshahi

ASKER

I will come back to you. Thanks for your patience.
I have turned on the logging and get the output below:

HTTPSConnectionPool(host='ec2.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<botocore.awsrequest.AWSHTTPSConnection object at 0x32f0a90>, 'Connection to ec2.us-east-1.amazonaws.com timed out. (connect timeout=60)'))
+ info=
+ '[' 255 -ne 0 ']'
+ echo 'Could not find info for 44.193.35.115'
Could not find info for 44.193.35.115
+ exit 1
I have no problem at all with that site
19:49:45$ ping -c1 ec2.us-east-1.amazonaws.com
PING ec2.amazonaws.com (54.239.28.168) 56(84) bytes of data.
64 bytes from 54.239.28.168: icmp_seq=1 ttl=227 time=241 ms

--- ec2.amazonaws.com ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 241.582/241.582/241.582/0.000 ms
09:32:48$ telnet ec2.us-east-1.amazonaws.com 443
Trying 54.239.28.168...
Connected to ec2.us-east-1.amazonaws.com.
Escape character is '^]'.
^]
telnet> q
Connection closed.
09:33:16$ 

Please try these commands yourself and post back.
I don't have any problems with these tests. When I run the script manually it goes through.
Are you saying you actually tried the specific commands in https:#a41920173? If not, please try them.
I know "when you run the script manually it goes through" but that may be because it does not try to visit amazonaws.com. I really need that sort of detail to get a handle on your problem.
Assuming that you did try the specific commands, I suspect you are running into a DNS time-out. If a local DNS doesn't have a requested address in its cache, it can take that DNS some time to get that address from somewhere else. Once it has it, it sticks around in cache for a while. That would agree with your observation that the whole script works for you later.
If this is your problem, you will need to force DNS to get the wanted address by inserting logic at the head of the script or in the calling script before invoking the failing script. Retry the ping -c1 ... command until it is successful. To cater for the site's actually being down, limit this to, say, 20 attempts.
I am getting the same results

ping -c1 ec2.us-east-1.amazonaws.com
PING ec2.amazonaws.com (54.239.28.176) 56(84) bytes of data.
64 bytes from 54.239.28.176: icmp_seq=1 ttl=246 time=1.07 ms

 telnet ec2.us-east-1.amazonaws.com 443
Trying 54.239.28.176...
Connected to ec2.us-east-1.amazonaws.com.
Escape character is '^]'.
q
Connection closed by foreign host.

Below is the output of the script when run manually with debugging.

Fri Dec  9 12:17:37 EST 2016
+ echo BACKUP
+ export AWS_ACCESS_KEY_ID=BKIAIR4UDFXNGDCN2X5A
+ AWS_ACCESS_KEY_ID=AKIAIR4UDFXNGDZN2X5C
+ export AWS_SECRET_ACCESS_KEY=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
+ AWS_SECRET_ACCESS_KEY=EHup8GD8jZPLza3JbUO5LniWlWjNqogvKztH2gsL
+ export AWS_DEFAULT_REGION=us-east-1
+ AWS_DEFAULT_REGION=us-east-1
+ EIP=44.193.35.215
+ INSTANCE_ID=i-e9ca0d61
++ /bin/aws ec2 describe-addresses --public-ips 44.193.35.215 --output text
+ info='ADDRESSES       eipalloc-fc6079c3       eipassoc-cf98aef0       vpc     i-f9ca0d61      eni-0e03f7e2    498606453616    172.30.2.61     34.193.35.215'
+ '[' 0 -ne 0 ']'
++ echo ADDRESSES eipalloc-fc6079c3 eipassoc-cf98aef0 vpc i-e9ca0d61 eni-0e03f7e2 498606453616 172.30.2.61 44.193.35.215
++ awk '{print $2;}'
+ allocation_id=eipalloc-fc6079c3
++ echo ADDRESSES eipalloc-fc6079c3 eipassoc-cf98aef0 vpc i-e9ca0d61 eni-0e03f7e2 498606453616 172.30.2.61 44.193.35.215
++ awk '{print $3;}'
+ association_id=eipassoc-cf98aef0
+ [[ eipassoc-cf98aef0 == eipassoc-* ]]
+ /bin/aws ec2 disassociate-address --association-id eipassoc-cf98aef0
+ /bin/aws ec2 associate-address --instance-id i-e9ca0d61 --allocation-id eipalloc-fc6079c3
On the standby instance, when I keep the SSH session open, I get the output below.

It definitely does behave differently.
The reason I get the error below is that I have temporarily assigned an EIP to this instance on purpose: An error occurred (Resource.AlreadyAssociated) when calling the AssociateAddress operation: resource eni-0e03f7e2 and 172.30.2.61 is already associated with public address 44.193.13.64

Tue Dec 13 05:07:20 EST 2016
+ echo MASTER
+ export AWS_ACCESS_KEY_ID=BKIAIR4UDFXNGDCN2X5A
+ AWS_ACCESS_KEY_ID=CKIAIR4UDFXNGDCN2X5A
+ export AWS_SECRET_ACCESS_KEY=FHup8GD8jZPLza3JbUO5LniWlWjNqogvKztH2gsZ
+ AWS_SECRET_ACCESS_KEY=NHup8GD8jZPLza3JbUO5LniWlWjNqogvKztH2gsM
+ export AWS_DEFAULT_REGION=us-east-1
+ AWS_DEFAULT_REGION=us-east-1
+ EIP=44.193.35.215
+ INSTANCE_ID=i-e9ca0d61
++ /bin/aws ec2 describe-addresses --public-ips 44.193.35.215 --output text
+ info='ADDRESSES       eipalloc-fc6079c3       eipassoc-e86360d7       vpc     i-e9cb0c78      eni-320efade    498606453616    172.30.2.60     44.193.35.215'
+ '[' 0 -ne 0 ']'
++ echo ADDRESSES eipalloc-fc6079c3 eipassoc-e86360d7 vpc i-e9cb0c78 eni-320efade 498606453616 172.30.2.60 44.193.35.215
++ awk '{print $2;}'
+ allocation_id=eipalloc-fc6079c3
++ echo ADDRESSES eipalloc-fc6079c3 eipassoc-e86360d7 vpc i-e9cb0c78 eni-320efade 498606453616 172.30.2.60 44.193.35.215
++ awk '{print $3;}'
+ association_id=eipassoc-e86360d7
+ [[ eipassoc-e86360d7 == eipassoc-* ]]
+ /bin/aws ec2 disassociate-address --association-id eipassoc-e86360d7
+ /bin/aws ec2 associate-address --instance-id i-e9ca0d61 --allocation-id eipalloc-fc6079c3

An error occurred (Resource.AlreadyAssociated) when calling the AssociateAddress operation: resource eni-0e03f7e2 and 172.30.2.61 is already associated with public address 44.193.13.64
This is a different problem isn't it? Is it a problem at all?
Your new posts mention address 44.193.35.215 but your original error log https:#a41919846 mentions 44.193.35.115
Sorry... I was trying to mask the IP :-) Same problem.
Resource.AlreadyAssociated is not the error you were getting originally. Originally, /bin/aws ec2 describe-addresses --public-ips 44.193.35.215 --output text was getting a connection timeout. When you ran the script manually, that did not happen.
Perhaps at boot time you run the script too early, before the network is ready. Or it could be DNS caching as I suggested before.
You are right, that is not the original error. I caused it on purpose by assigning the second instance an EIP. This run was not started manually; the only difference is that I had the SSH session open. There are no DNS issues. Does the output of the following command behave differently when an SSH session is not open?

info=`/bin/aws ec2 describe-addresses --public-ips $EIP --output text`
I do not know what /bin/aws does, but I would not expect it to be affected by whether an ssh session is open or not.
I recommend you try my suggestion from https:#a41923068. Something like:
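# Try up to 20 times to reach the EC2 endpoint; stop at the first success.
for((i = 0; i < 20; i++))
do
  ping -c1 ec2.us-east-1.amazonaws.com && break
done
# i only reaches 20 if every attempt failed.
if [ $i -eq 20 ]
then
  echo "Unable to contact ec2.us-east-1.amazonaws.com" >&2
  exit 1
fi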

Unless other start-up items depend on this script's completion, background it. That way, something else may get done that allows it to complete.
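A variation on the retry idea above, offered as a sketch rather than a tested fix: since the aws CLI talks to the endpoint over HTTPS, the probe below checks TCP port 443 instead of ICMP, and pauses between attempts so that 20 fast failures do not all happen within a second. The function names `retry` and `can_reach_ec2`, the 5-second probe limit and the 3-second delay are illustrative assumptions.

```shell
#!/bin/bash
# retry MAX DELAY CMD...: run CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts. Returns 0 on success, otherwise
# the exit status of the last failed attempt.
retry() {
    local max=$1 delay=$2 attempt status=1
    shift 2
    for ((attempt = 1; attempt <= max; attempt++)); do
        "$@"
        status=$?
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        echo "attempt $attempt/$max failed (status $status): $*" >&2
        if [ "$attempt" -lt "$max" ]; then
            sleep "$delay"
        fi
    done
    return "$status"
}

# Probe the HTTPS port using bash's built-in /dev/tcp redirection.
# 'timeout' comes from GNU coreutils; the 5-second limit is arbitrary.
can_reach_ec2() {
    timeout 5 bash -c 'exec 3<>/dev/tcp/ec2.us-east-1.amazonaws.com/443' 2>/dev/null
}

# Usage, left commented out so that defining the helpers has no side effects:
# retry 20 3 can_reach_ec2 || {
#     echo "Unable to contact ec2.us-east-1.amazonaws.com" >&2
#     exit 1
# }
```

The same `retry` wrapper could in principle be placed around the individual /bin/aws calls, although whether that is appropriate depends on why the connection times out in the first place.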
+ EIP=44.193.35.215
+ INSTANCE_ID=i-e9ca0d61
++ /bin/aws ec2 describe-addresses --public-ips 44.193.35.215 --output text
+ info='ADDRESSES       eipalloc-fc6079c3       eipassoc-cf0d10f0       vpc     i-e0cb0c78      eni-320efade    498606453616    172.30.2.60     44.193.35.215'
+ '[' 0 -ne 0 ']'
++ echo ADDRESSES eipalloc-fc6079c3 eipassoc-cf0d10f0 vpc i-e0cb0c78 eni-320efade 498606453616 172.30.2.60 44.193.35.215
++ awk '{print $2;}'
+ allocation_id=eipalloc-fc6079c3
++ echo ADDRESSES eipalloc-fc6079c3 eipassoc-cf0d10f0 vpc i-e0cb0c78 eni-320efade 498606453616 172.30.2.60 44.193.35.215
++ awk '{print $3;}'
+ association_id=eipassoc-cf0d10f0
+ [[ eipassoc-cf0d10f0 == eipassoc-* ]]
+ /bin/aws ec2 disassociate-address --association-id eipassoc-cf0d10f0
+ /bin/aws ec2 associate-address --instance-id i-f9ca0d61 --allocation-id eipalloc-fc6079c3

HTTPSConnectionPool(host='ec2.us-east-1.amazonaws.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<botocore.awsrequest.AWSHTTPSConnection object at 0x37e8c50>, 'Connection to ec2.us-east-1.amazonaws.com timed out. (connect timeout=60)'))

I have done the above and included the ping in the script. It is now disassociating the EIP from the first instance but timing out when associating the EIP with the second instance.
I don't have the detailed knowledge of what aws is doing here to be able to help.
In general though, a socket that has been used and closed does not become available again until a time-out period has passed, unless it was opened (and is re-opened, I think) with SO_REUSEADDR.
Thanks Duncan, so maybe I could do a break after it has disassociated the EIP?
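One option that may be worth trying, independent of the accepted solution (which is not visible here): the AWS CLI's `associate-address` command accepts `--allow-reassociation`, which lets an Elastic IP that is still associated elsewhere be moved in a single call, so the separate `disassociate-address` step is not needed. The sketch below also uses `--query` to fetch the allocation ID directly instead of parsing columns with awk. The names `AWS_BIN` and `move_eip` are hypothetical, and the thread does not establish whether this avoids the timeout.

```shell
#!/bin/bash
# Sketch only: move an Elastic IP with one associate call instead of a
# disassociate followed by an associate. AWS_BIN and move_eip are
# illustrative names, not part of the original script.
AWS_BIN=${AWS_BIN:-/bin/aws}

move_eip() {
    local eip=$1 instance_id=$2 allocation_id
    # Ask the API for just the allocation ID of the given public IP.
    allocation_id=$("$AWS_BIN" ec2 describe-addresses --public-ips "$eip" \
        --query 'Addresses[0].AllocationId' --output text) || return 1
    # --allow-reassociation permits moving an address that is currently
    # associated with another instance or network interface.
    "$AWS_BIN" ec2 associate-address --instance-id "$instance_id" \
        --allocation-id "$allocation_id" --allow-reassociation
}

# Usage, with the values from the trace above (commented out on purpose):
# move_eip 44.193.35.215 i-e9ca0d61
```

With only one call after the lookup, there is no window in which the address is attached to neither instance.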
ASKER CERTIFIED SOLUTION
Duncan Roe

This solution is only available to members.
Apologies for not coming back sooner; I have been off for the Christmas holidays. Please can we keep this question open?
Objecting on behalf of the question author

@lhrslsshahi: You could have done this yourself, you know.

Now that you're back, please try the solution in https://#a41931519 and get back with the results ASAP.
Please can we keep this open for a while? Apologies, I haven't had time to go back to this project.
Thanks Duncan for your help!