Awallisk (Singapore) asked:

BGP Flap and no issue with link - BGP hold time expire

Hi There

Could someone advise me on what the issue could be here? I have checked with the provider and there is no issue with their link: they are not detecting any alarms or noticing any problem at all. This is very perplexing.

Mar 28 15:43:59.167 SGT: %BGP-3-NOTIFICATION: received from neighbor 169.x.x.x 4/0 (hold time expired) 0 bytes
Mar 28 15:43:59.167 SGT: %BGP-5-ADJCHANGE: neighbor 169.x.x.x Down BGP Notification received
Mar 28 15:43:59.175 SGT: %BGP_SESSION-5-ADJCHANGE: neighbor 169.x.x.x IPv4 Unicast topology base removed from session  BGP Notification received
Mar 28 15:44:11.676 SGT: %BGP-3-BGP_NO_REMOTE_READ: 169.x.x.x connection timed out - has not accepted a message from us for 3000ms (hold time), 0 messages pending transmition.
Mar 28 15:44:11.676 SGT: %BGP-3-NOTIFICATION: sent to neighbor 169.x.x.x active 4/0 (hold time expired) 0 bytes
Mar 28 15:44:11.688 SGT: %BGP_SESSION-5-ADJCHANGE: neighbor 169.x.x.x IPv4 Unicast topology base removed from session  BGP Notification sent
Mar 28 15:44:20.808 SGT: %BGP-5-ADJCHANGE: neighbor 169.x.x.x Up
Mar 28 15:44:41.572 SGT: %BGP-3-NOTIFICATION: received from neighbor x.x.x.x 4/0 (hold time expired) 0 bytes
Mar 28 15:44:41.576 SGT: %BGP-5-ADJCHANGE: neighbor x.x.x.x Down BGP Notification received
Mar 28 15:44:41.588 SGT: %BGP_SESSION-5-ADJCHANGE: neighbor x.x.x.x IPv4 Unicast topology base removed from session  BGP Notification received
Mar 28 15:44:49.393 SGT: %BGP-5-ADJCHANGE: neighbor x.x.x.x Up

There are three tunnel sites: JKT, SG and HKG.
The logs are from the JKT router. Both HKG and SG are sending hold time expired notifications.
I have checked with the provider and they are not seeing any issue on their link. The interface at my end is showing up, but BGP is down, as you can see.

1. What could be the reason?
2. What can be done to resolve this?
3. What should I do to troubleshoot this?

I really don't know what is causing this.
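
For an intermittent problem like this, it can help to turn the router log into a timeline of how long each neighbor stayed down. The following is a minimal sketch, assuming the log lines have exactly the format quoted above (month, day, time with milliseconds, time zone, then the %BGP-5-ADJCHANGE message). The function name flap_durations and the placeholder year are illustrative choices, not part of any Cisco tool.

```python
import re
from datetime import datetime

# Matches lines such as:
# "Mar 28 15:43:59.167 SGT: %BGP-5-ADJCHANGE: neighbor 169.x.x.x Down BGP Notification received"
# The %BGP_SESSION-5-ADJCHANGE lines are deliberately not matched.
ADJCHANGE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2}\.\d{3})\s+\w+:\s+"
    r"%BGP-5-ADJCHANGE: neighbor (?P<peer>\S+) (?P<state>Up|Down)"
)

# The log carries no year; a placeholder is used because only differences matter.
ASSUMED_YEAR = "2000"


def flap_durations(lines):
    """Return (peer, down_time, up_time, seconds_down) for each Down/Up pair."""
    down_since = {}
    flaps = []
    for line in lines:
        match = ADJCHANGE.match(line.strip())
        if not match:
            continue
        ts = datetime.strptime(
            ASSUMED_YEAR + " " + match["ts"], "%Y %b %d %H:%M:%S.%f"
        )
        peer = match["peer"]
        if match["state"] == "Down":
            # Keep the first Down time if several are logged in a row.
            down_since.setdefault(peer, ts)
        elif peer in down_since:
            start = down_since.pop(peer)
            flaps.append((peer, start, ts, (ts - start).total_seconds()))
    return flaps


if __name__ == "__main__":
    sample = [
        "Mar 28 15:43:59.167 SGT: %BGP-5-ADJCHANGE: neighbor 169.x.x.x Down BGP Notification received",
        "Mar 28 15:44:20.808 SGT: %BGP-5-ADJCHANGE: neighbor 169.x.x.x Up",
    ]
    for peer, start, end, seconds in flap_durations(sample):
        print(f"{peer} down for {seconds:.3f}s from {start.time()} to {end.time()}")
```

Run over a few weeks of saved logs, the same function would show whether the outages cluster at particular times of day, which is useful evidence to bring to the provider.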
bamsi (Philippines)

Have you checked the MTU settings on the equipment at both ends (CE and PE)? The MTU should be the same along the path.
Awallisk (ASKER)

Hi Bamsi

  Thanks for your advice. I thought the same thing and have checked the MTU settings; they are set to the same value across the board. Just a question: if the MTU were indeed mismatched, would the flap occur constantly? Right now it is intermittent, and the flap occurred again today for 2 minutes, with the same hold timer expired error.

I have run a continuous ping to the PE and it shows no packet loss. This is still a mystery to me...
SOLUTION
bamsi (Philippines)

This solution is only available to members.
Hi Bamsi

  Okay, so MTU size might be the cause of it. I haven't tried pinging with a larger packet size. Should I ping with the DF bit set?

So if I ping with packet size 1000 and it drops, that means MTU might be the cause. Also, if I use "ip tcp mss 1400" on one of the routers, would that mean all routers use a default packet size of 1400 minus the header? Would it be best to use this command to control the packet size?
I would prefer asking your provider for their MTU settings rather than changing segment sizes. That way, if the MTU sizes do match, you can dig further using debugs rather than changing things on a working network.
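
For reference, the arithmetic behind a DF-bit ping test can be sketched as follows. A ping of size 1000 passing says little about packets close to the interface MTU, so the useful test is to search for the largest size that still gets through with DF set. This is a sketch under stated assumptions: IPv4 and TCP headers without options (20 bytes each), an 8-byte ICMP echo header, and a hypothetical probe callable that sends one DF-set ping of a given total IP size and returns True on a reply. Tunnel overhead (IPsec here) reduces the usable size further and is not modelled. Also check what your platform means by ping "size": on Cisco IOS it is, as far as I know, the whole IP datagram, while Linux and Windows ping take the ICMP payload size.

```python
IP_HEADER = 20    # IPv4 header without options
ICMP_HEADER = 8   # ICMP echo header
TCP_HEADER = 20   # TCP header without options


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER - ICMP_HEADER


def tcp_mss_for_mtu(mtu):
    """Largest TCP segment payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER - TCP_HEADER


def largest_passing_size(probe, low=576, high=1500):
    """Binary search for the largest total IP packet size that passes with DF set.

    probe(size) is a hypothetical callable supplied by the caller; it should
    send one DF-set ping of that total IP size and return True if a reply
    came back. Returns None if even the smallest size fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid       # mid fits, so the answer is mid or larger
        else:
            high = mid - 1  # mid is too big
    return low


if __name__ == "__main__":
    print("ICMP payload for a 1500-byte MTU:", icmp_payload_for_mtu(1500))
    print("TCP MSS for a 1500-byte MTU:", tcp_mss_for_mtu(1500))
    # Simulated path that only passes packets of up to 1400 bytes.
    print("Largest passing size:", largest_passing_size(lambda size: size <= 1400))
```

Because BGP runs over TCP, the MSS figure is the one that matters for the session itself: a large UPDATE sent in full-size segments can be dropped on a path that small keepalives and default-size pings cross without trouble.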
Thanks Bamsi... I will definitely ask the provider about the MTU setting. The weird thing is that it was working fine for almost two years, until recently I started getting intermittent BGP flaps with hold timer expired.
When I see BGP flaps on a previously working circuit that has no apparent link problem, it's almost always the circuit. Occasionally the hardware is at fault, but I would look at the circuit first.
I second Jan's advice. If it has been working for 2 years and suddenly went nuts without any changes to the hardware, I would always check with the circuit provider.

It's rarely a hardware issue, especially if only 1 out of 3 connections is problematic.

I would advise you to check with the provider for both ends of the circuit.
That was the first thing I checked, and they have now set up extended pings on all WAN IPs. During the outage today the provider shared their extended ping results, showing no packet loss for the last 24 hours, including the period in question. I have asked them to check numerous times, and there is no alarm on their media nor any issue with the link. 8-(

I can't blame them when they have shown there wasn't any issue at all. I will have to check the MTU part and see how it turns out.
Pings aren't data. I've seen extended pings work fine, but the minute that data is thrown on the line, errors occur.

Is this copper or fiber?
Hi Jan

The circuit is on fiber. The issue is that I have no visibility into the provider network, and furthermore they have provided logs that show the service was up during the period in question.

I guess I will ask the provider to recheck the circuit, and maybe provision a new parallel circuit for testing. As a customer, I guess I have the right to request that.
ASKER CERTIFIED SOLUTION
This solution is only available to members.
The interface is clean, with no errors at all, and there are no instances of the interface flapping.
Do you externally syslog router logs and, if so, can you debug BGP and interface information?  It may require some disk space but might shed some light.

And, as far as I'm concerned, if you are having BGP flaps, your provider has equal or greater responsibility to assist with troubleshooting.
We don't externally syslog router logs. Yes, they are being responsive to our needs, and I will need to work with them further on finding the issue.

I guess I have to do the following tomorrow during business hours:

1. Recheck the MTU setting with the provider.
2. Request the provider to provision a parallel link.
3. Run a debug on BGP and the interface, depending on whether management is okay with running debugs.

Thank you so much, Jan and Bamsi. Your insight on the issue is definitely helpful. I will provide an update on the debug 8-).

Thanks guys....
Is this a new BGP peering relationship, or an existing one that all of a sudden started going up and down? Is this continuous?

MTU differences will not cause this issue. PMTUD (Path MTU Discovery) will probe the path and establish an MTU acceptable to both peers, which is then used for the TCP connection between them.

Can you post the configuration of the interface used for BGP peering and the BGP config?
Do not use actual IP addresses.

Most likely it's one of the following: a recursive routing problem, a multi-hop problem, or the BGP next-hop is not reachable.

harbor235 ;}
Harbor went where I was going to go. If BGP is sending you a route to the neighbor IP that is unreachable, and you are preferring it due to AD or something, you're going to flap. Monitor your routing table carefully to see if the route to the neighbor IP changes or flaps. This is especially possible if you're using BGP multihop.
If the neighbor IP is on the same subnet as the interface and the interface remains up, there is no issue with knowing the route.

If the neighbor is a multi-hop/remote loopback IP, then a static route for that neighbor IP needs to be configured pointing that neighbor IP to the other end of the interface.
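
The two cases above can be checked offline from the interface address and a dump of the routing table. The following is a rough sketch using Python's standard ipaddress module. The addresses are documentation prefixes (192.0.2.0/24), and the route list with its "connected"/"static"/"bgp" labels is a hypothetical, hand-typed simplification of "show ip route" output, not something parsed from a real router.

```python
import ipaddress


def is_directly_connected(interface_cidr, neighbor_ip):
    """True if the neighbor address falls inside the interface's own subnet."""
    iface = ipaddress.ip_interface(interface_cidr)
    return ipaddress.ip_address(neighbor_ip) in iface.network


def best_route(neighbor_ip, routes):
    """Longest-prefix match over (prefix, source) pairs.

    Returns (network, source) for the most specific route that covers the
    neighbor address, or None if nothing covers it.
    """
    addr = ipaddress.ip_address(neighbor_ip)
    matches = []
    for prefix, source in routes:
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network, source))
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)


if __name__ == "__main__":
    # Hypothetical routing table entries.
    routes = [
        ("0.0.0.0/0", "static"),
        ("192.0.2.0/30", "connected"),
        ("192.0.2.0/24", "bgp"),
    ]
    print(is_directly_connected("192.0.2.1/30", "192.0.2.2"))
    print(best_route("192.0.2.2", routes))
    # A neighbor whose best route is itself learned via BGP is the
    # recursive-routing case to watch for.
    print(best_route("192.0.2.200", routes))
```

If the best route to the neighbor address ever comes back with a BGP source, the session depends on routes it carries itself, which is the recursive-routing scenario described above.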
Do we know that the neighbor is on the same subnet? I do see where Awallisk states it has been up for two years, but that does not mean a configuration change has not been made.

Awallisk, this is an existing connection that has been up for two years? No changes have been made to your config? Are you using BGP multi-hop?

If there are circuit issues, you can take a look at the interface stats (show interface, if Cisco?).

What type of hardware is your BGP speaker?
What type of circuit? Ethernet? SONET? etc ...
What type of optics? SM, MM?
Post show interface or similar to look for errors, clear counters?
Fiber type? Did you try to clean the fiber ends?
How long is the fiber run? How many connectors from the demarc to the router handoff?

Post the interface config and BGP config (remove public IPs).

I see where Awallisk states this is fiber. Have you tested the dB loss from the demarc to the router handoff? Each connector that the fiber run goes through adds ~0.5 dB of loss.

Have you tried new patch cables?

There could be lots of things wrong here, but we need to rule some things out. We can start to do that by looking at the interface config and the BGP config. The BGP config will also show us the current BGP filter policy, which could impact things as well.


harbor235 ;}
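
The connector-loss point lends itself to a quick loss-budget estimate that can be compared with a meter reading. This is a back-of-the-envelope sketch only: the ~0.5 dB per connector comes from the post above, while the per-kilometre attenuation, splice loss, transmit power and receiver sensitivity are illustrative assumptions that should be replaced with the figures from the actual fiber and optics datasheets.

```python
def link_loss_db(length_km, connectors, splices=0,
                 fiber_db_per_km=0.4, connector_db=0.5, splice_db=0.1):
    """Estimated end-to-end optical loss in dB.

    fiber_db_per_km and splice_db are assumed placeholder values;
    connector_db uses the ~0.5 dB per connector figure quoted above.
    """
    return (length_km * fiber_db_per_km
            + connectors * connector_db
            + splices * splice_db)


def margin_db(tx_dbm, rx_sensitivity_dbm, loss_db):
    """Remaining margin after subtracting the loss from the power budget."""
    return (tx_dbm - rx_sensitivity_dbm) - loss_db


if __name__ == "__main__":
    # Hypothetical run: 2 km of fiber and 4 connectors from demarc to router.
    loss = link_loss_db(length_km=2, connectors=4)
    # Hypothetical optic: -3 dBm transmit power, -14 dBm receiver sensitivity.
    print(f"estimated loss {loss:.1f} dB, margin {margin_db(-3.0, -14.0, loss):.1f} dB")
```

A measured loss well above the estimate points at a dirty or damaged connector or patch cable, which is why cleaning the fiber ends and swapping patch cables are cheap first steps.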
Hi Harbor/Jan/Mike

 Thank you so much for all the comments; I really appreciate the help. We build IPsec VPN tunnels from Jakarta to Hong Kong and Singapore over the provider's MPLS link and then form BGP sessions over these VPN tunnels.

The issue is intermittent: it can be okay for a few days, then it flaps for 2-3 minutes and comes back up again. Then it might flap again or be stable for a couple of days.


For the BGP config and interface config: I am trying to access the router but am having a problem accessing it right now. I hope to get this info soon.

•      The Jakarta to Singapore neighbor is on the same subnet and the Jakarta to HKG neighbor is on the same subnet, but Singapore and HKG are not on the same subnet as each other.
•      Attached is the tunnel configuration.
•      No changes to the config for the last two years.
•      The interface shows no CRC errors and all connections have been checked and rechecked at our end.
•      Awaiting an update from the provider on provisioning a new parallel link.
•      Have tested the dB loss with the meter and it is within spec.

Thank you so much, guys...
Tunnel-config.docx
What is your bandwidth utilization on these links?
Hi Jan

  Bandwidth utilization is around 60-80 percent during peak periods. It never reaches 100 percent. Would 60-80 percent utilization be a problem?

Regards
Iskandar
I wouldn't think at 60%.  But maybe at 80% or more.  It depends upon the usage of other links and the backplane capacity of the chassis.
Hi Jan

  Understood, but for the past week it never hit the 80% mark and yet I am still getting the flap. Today, for example, traffic utilization was around 60% and BGP flapped at 1720hrs GMT+8.
A debug would surely be helpful.
Hi Jan

  Yeah definitely will run the debug and paste the output here. 8:)...

Thanks
Do you see flapping from both sites at the same time every time you see flapping?  

Or can you see flapping from just Singapore or just HKG?

Can you post the policy SHAPE-TUNNEL-20MBPS-PQ-1500KBPS?
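
The first two questions can be answered from the Down timestamps alone. If both neighbors drop within seconds of each other, whatever they share (the JKT router or its access circuit) is the more likely suspect. If only one drops, that points at the path to that site. The following is a minimal sketch, assuming the Down events have already been extracted as (seconds since midnight, neighbor) pairs. The 60-second window and the "SG"/"HKG" labels are hypothetical, since the logs above do not say which masked address belongs to which site.

```python
def simultaneous_flaps(down_events, window_s=60.0):
    """Group Down events that start within window_s of the first event in a group.

    down_events: iterable of (seconds_since_midnight, neighbor) tuples.
    Returns only the groups that involve more than one distinct neighbor.
    """
    groups = []
    for t, peer in sorted(down_events):
        if groups and t - groups[-1][0][0] <= window_s:
            groups[-1].append((t, peer))
        else:
            groups.append([(t, peer)])
    return [g for g in groups if len({peer for _, peer in g}) > 1]


if __name__ == "__main__":
    events = [
        (15 * 3600 + 43 * 60 + 59.167, "SG"),   # hypothetical label
        (15 * 3600 + 44 * 60 + 41.576, "HKG"),  # hypothetical label
        (17 * 3600 + 20 * 60, "SG"),
    ]
    for group in simultaneous_flaps(events):
        print("neighbors down together:", sorted({peer for _, peer in group}))
```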
If the problem is intermittent, it's not the configuration. I thought you meant it used to work and has now started flapping constantly, which could have been a routing change.

So I would carry on with some of the other suggestions given here but don't worry about that, in my opinion.
Kaushik Pandya

Hi Jan,

Thanks a lot for the valuable information, which is not available in any of the documents.
MTU does matter for BGP flaps and stability.
I recently experienced BGP cycling from Idle to Established and then, after the hold time expired, from Established back to Idle.
This was resolved by setting the same MTU across the path, including the transmission network.

Immediately it resolved the issue :)

Thank you once again.
Hello @kaushik

Did you isolate the issue?