• Status: Solved
  • Priority: Medium
  • Security: Public

Unable to ping until arp cache cleared

Hello Experts

I have 3 servers connected to a Cisco Catalyst C2960 switch. This switch has an uplink to one of the access switches, and that access switch ultimately connects to our core switch.

Today I encountered an issue: these servers were unreachable from VLANs other than their own. I cleared the ARP cache and they started responding to ping again.

Can you please help me find out what the issue could be, so we can avoid it in future?

Thanks
0
Asked by: cciedreamer
 
BlueComputeCommented:
Did you dump the arp cache before you cleared it?  What were the incorrect entries in the ARP table?  How did they get there?
0
 
cciedreamerAuthor Commented:
Actually I was able to see the mac address of the server on Core Switch before clearing it.

Thanks
0
 
cciedreamerAuthor Commented:
This occurred again a few minutes ago.

I cleared the ARP cache and it started working.

Please help me to solve this.

Thanks
0
 
cciedreamerAuthor Commented:
Just wanted to let you know that we have HSRP running.

Thanks
0
 
SouljaCommented:
You state you are running HSRP, yet you state the access switch is uplinked to the core switch. Is there another core switch that isn't being mentioned?

How are these HSRP devices talking to each other? Is there a direct link between them?
0
 
cciedreamerAuthor Commented:
Sorry for the missing information.
The access switch has redundant links to Core 1 and Core 2.

Both cores also have a direct link between them.
0
 
SouljaCommented:
When this issue occurs do you notice any changes in STP or HSRP in your logs? How is your STP setup? Are you load balancing  VLANS between HSRP members, or just sending all to the primary switch?
0
 
cciedreamerAuthor Commented:
The most surprising part is that I don't receive any logs on any of the switches.
We are using the 2 core switches in an Active/Standby scenario.
Currently Core 1 is active and all traffic is going to Core 1. I have the default STP configuration.

Thanks
0
 
SouljaCommented:
Do you have a layer 3 or Layer 2 link between the cores? If layer two, are all vlans spanned across it?
0
 
cciedreamerAuthor Commented:
It's Layer 2 and all VLANs are spanned across it.
0
 
SouljaCommented:
I think the key here is you need to collect some logs to determine if there is anything going on during this ARP issue.
0
 
SouljaCommented:
Can you post your hsrp config?
0
 
cciedreamerAuthor Commented:
The problem has occurred now.

For the last 4 hours I was pinging these servers from a workstation in another VLAN and there was no problem.
Once I stopped the ping, the problem happened after 5 minutes.

If I clear the ARP cache on Core 1 it works again.

Thanks
0
 
cciedreamerAuthor Commented:
This is HSRP config of that vlan

Core1

interface Vlan1 description 
 ip address 10.1.1.254 255.255.255.0
 ip route-cache flow
 standby delay minimum 20 reload 25
 standby 1 ip 10.1.1.1
 standby 1 priority 110
 standby 1 preempt



Core 2

interface Vlan2
 ip address 10.1.1.253 255.255.255.0
 ip route-cache flow
 standby delay minimum 20 reload 25
 standby 2 ip 10.1.1.1
 standby 2 priority 95
 standby 2 preempt
end


0
 
SouljaCommented:
Why do you have two different vlan interface numbers and standby numbers?

Can you post the sh standby output from each switch?
0
 
cciedreamerAuthor Commented:
Sorry, I copied the wrong interface from Core 1.

interface Vlan2
 ip address 10.1.1.254 255.255.255.0
 ip route-cache flow
 standby delay minimum 20 reload 25
 standby 1 ip 10.1.1.1
 standby 1 priority 110
 standby 1 preempt



sh standby Core 1

Vlan2 - Group 2
  Local state is Active, priority 110, may preempt
  Hellotime 3 sec, holdtime 10 sec
  Next hello sent in 1.143
  Virtual IP address is 10.1.1.1 configured
  Active router is local
  Standby router is 10.1.1.253 expires in 9.844
  Virtual mac address is 0000.0c07.ac02
  1 state changes, last state change 8w1d
  IP redundancy name is "hsrp-Vl2-2" (default)



sh standby Core 2

Vlan2 - Group 2
  Local state is Standby, priority 95, may preempt
  Hellotime 3 sec, holdtime 10 sec
  Next hello sent in 1.931
  Virtual IP address is 10.1.1.1 configured
  Active router is 10.1.1.254, priority 110 expires in 7.440
  Standby router is local
  43 state changes, last state change 8w1d
  IP redundancy name is "hsrp-Vl2-2" (default)


0
 
InfamusCommented:
Why are the VLAN 1 and VLAN 2 virtual IPs the same?
0
 
cciedreamerAuthor Commented:
No, it was a typo. I have corrected it.
0
 
cciedreamerAuthor Commented:
Please help me, this keeps happening.
0
 
InfamusCommented:
Are you using any dynamic routing? if yes, which one?
0
 
SouljaCommented:
Would it be possible to disconnect the second core and see if the issue reoccurs?
0
 
cciedreamerAuthor Commented:
Disconnect the core from where? Do you mean from the access switch, or from Core 1 itself?
0
 
cciedreamerAuthor Commented:
I don't have any dynamic routing. I have 1 default static route.
0
 
InfamusCommented:
when you do sh ip route, what is the difference between when it is working and not working?

Is it possible to do what Soulja recommended?

If not, can you have one workstation configure the gateway to actual IP of the VLAN interface and do a continuous ping and see what happens?
0
 
cciedreamerAuthor Commented:
I have a lot of workstations in VLAN 2, but none of them has any issues except these 3 servers.

One thing to note: when I run a continuous ping to the server from a workstation in another VLAN, I don't see the issue. Once I stop the ping, it appears again.
0
 
cciedreamerAuthor Commented:
The problem appeared again.
0
 
SouljaCommented:
What are your logs saying. Can you post recent logs?
0
 
cciedreamerAuthor Commented:
No logs. It's empty.

Thanks
0
 
InfamusCommented:
did you try "sh log" on the switch?
0
 
cciedreamerAuthor Commented:
Yes. There are old logs, but nothing recent.
0
 
SouljaCommented:
Are you logging to a syslog server?
0
 
cciedreamerAuthor Commented:
Yes. I have seen the logs there but nothing related to this issue.
0
 
Craig BeckCommented:
Do you have a duplicate IP address anywhere?

If you check the logs on the active HSRP router what do you see?
0
 
cciedreamerAuthor Commented:
Hi,

No I don't have duplicated IP address. No logs on Core Switches related with this issue.

I keep a continuous ping running to the server; once it is stopped, I have to clear the ARP cache to bring the connection back up.

There are 2 physical servers with IPs 10.1.1.14 and 10.1.1.15 in an HA setup. The 3rd IP, 10.1.1.17, is the VIP.

I have to keep a continuous ping to 10.1.1.17 to keep connections alive.



Thanks
0
 
Craig BeckCommented:
That really does sound like a duplicate IP issue.  Do you have Proxy-ARP running on any devices on that VLAN?
0
 
cciedreamerAuthor Commented:
As I mentioned, there are 2 physical servers and their IPs are 10.1.1.14 and 10.1.1.15. They are in HA and share the virtual IP 10.1.1.17.

The clients see the VIP.

Now on my core I am seeing 2 IPs (10.1.1.15 and 10.1.1.17) with the same MAC.
0
 
Craig BeckCommented:
Seeing two IPs with the same MAC is fine, but seeing two MACs with the same IP would be bad.

So what I think is happening is the core is seeing the wrong MAC address mapped to 10.1.1.17 IP when the server is inaccessible from the other VLAN.

When you try to get to the server from a different VLAN and it fails, instead of clearing the ARP cache can you do:

show ip arp | include 10.1.1.

instead, then clear the ARP cache, re-run the command, and post the two results?
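
The distinction above (several IPs on one MAC is fine, one IP on several MACs is not) can be checked mechanically once the output is collected. A minimal Python sketch, assuming rows in the usual `show ip arp` layout; a single table normally holds one entry per IP, so the check is only useful when captures from more than one device or more than one point in time are combined. The sample rows and MAC addresses below are hypothetical, not taken from this network.

```python
from collections import defaultdict

# Hypothetical rows: two captures pasted together, e.g. from two switches
# or from the same switch at different times.
SAMPLE = """\
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  10.1.1.15              13   aaaa.bbbb.0002  ARPA   Vlan2
Internet  10.1.1.17               5   aaaa.bbbb.0002  ARPA   Vlan2
Internet  10.1.1.17               4   aaaa.bbbb.0001  ARPA   Vlan2
"""

def parse_arp(text):
    """Yield (ip, mac) pairs from `show ip arp` style rows."""
    for line in text.splitlines():
        fields = line.split()
        # Data rows look like: Internet <ip> <age> <mac> ARPA <interface>
        if len(fields) >= 4 and fields[0] == "Internet":
            yield fields[1], fields[3].lower()

def duplicate_ips(pairs):
    """Return {ip: sorted MACs} for every IP seen with more than one MAC."""
    seen = defaultdict(set)
    for ip, mac in pairs:
        seen[ip].add(mac)
    return {ip: sorted(macs) for ip, macs in seen.items() if len(macs) > 1}

found = duplicate_ips(parse_arp(SAMPLE))
print(found)
```

Here 10.1.1.15 and 10.1.1.17 sharing one MAC is not reported; only 10.1.1.17 appearing with two different MACs is.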
0
 
cciedreamerAuthor Commented:
Hi,
Here is the result during the issue

Core1 # sh arp | in 10.1.1.17          
Internet  10.1.1.17               7   0014.5ebc.7466  ARPA   Vlan2

Core1 # sh arp | in 10.1.1.15        
Internet  10.1.1.17               7   0014.5ebc.7466  ARPA   Vlan2

I cleared the arp-cache on Core then I did sh arp again. I received same result as above.
0
 
Craig BeckCommented:
Can you post what I asked for and what 'actually' comes back with no edits?
Core1 # sh arp | in 10.1.1.15        
Internet  10.1.1.17               7   0014.5ebc.7466  ARPA   Vlan2
The output you posted isn't consistent with what you asked the switch, and I asked for the complete output for the 10.1.1.x subnet, not just the IP in question.
0
 
cciedreamerAuthor Commented:
Ok I'll post it.

Meanwhile the server team has flushed the ARP cache on both servers. So far it's working. I'll have to monitor it for more than an hour, then I'll get back to you.

Thanks
0
 
Craig BeckCommented:
Ok that's a step in the right direction.  You shouldn't have to specifically flush the ARP cache though, ever.  If you do there might be something wrong with the way the vIP is working.
0
 
cciedreamerAuthor Commented:
Sir,

The problem appeared again

Here is the result as you requested

Before flushing arp cache

Core1#sh ip arp vlan 2   
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  10.1.1.11               0   0015.1736.a65d  ARPA   Vlan2
Internet  10.1.1.9               34   001e.4a16.edf5  ARPA   Vlan2
Internet  10.1.1.14               5   0014.5ebc.0c84  ARPA   Vlan2
Internet  10.1.1.15              13   0014.5ebc.7466  ARPA   Vlan2
Internet  10.1.1.13               0   0015.1736.a3fd  ARPA   Vlan2
Internet  10.1.1.1                -   0000.0c07.ac02  ARPA   Vlan2
Internet  10.1.1.26               7   001a.6424.88b0  ARPA   Vlan2
Internet  10.1.1.27               0   0019.9903.c098  ARPA   Vlan2
Internet  10.1.1.24               0   0014.5efc.2328  ARPA   Vlan2
Internet  10.1.1.25               8   001a.6424.87e0  ARPA   Vlan2
Internet  10.1.1.30               0   0019.9903.bf69  ARPA   Vlan2
Internet  10.1.1.31              35   0019.9903.c093  ARPA   Vlan2
Internet  10.1.1.28               0   0019.9902.166c  ARPA   Vlan2
Internet  10.1.1.29               0   0019.9903.bfac  ARPA   Vlan2
Internet  10.1.1.19              34   0015.1736.a3fd  ARPA   Vlan2
Internet  10.1.1.17               5   0014.5ebc.7466  ARPA   Vlan2
Internet  10.1.1.22              34   0015.1736.a65d  ARPA   Vlan2
Internet  10.1.1.41              34   0030.057a.52ff  ARPA   Vlan2



After flushing arp cache

Core1#sh ip arp vlan 2          
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  10.1.1.11               0   0015.1736.a65d  ARPA   Vlan2
Internet  10.1.1.9                0   001e.4a16.edf5  ARPA   Vlan2
Internet  10.1.1.14               0   0014.5ebc.0c84  ARPA   Vlan2
Internet  10.1.1.15               0   0014.5ebc.7466  ARPA   Vlan2
Internet  10.1.1.13               0   0015.1736.a3fd  ARPA   Vlan2
Internet  10.1.1.1                -   0000.0c07.ac02  ARPA   Vlan2
Internet  10.1.1.26               0   001a.6424.88b0  ARPA   Vlan2
Internet  10.1.1.27               0   0019.9903.c098  ARPA   Vlan2
Internet  10.1.1.24               0   0014.5efc.2328  ARPA   Vlan2
Internet  10.1.1.25               0   001a.6424.87e0  ARPA   Vlan2
Internet  10.1.1.30               0   0019.9903.bf69  ARPA   Vlan2
Internet  10.1.1.28               0   0019.9902.166c  ARPA   Vlan2
Internet  10.1.1.29               0   0019.9903.bfac  ARPA   Vlan2
Internet  10.1.1.19               0   0015.1736.a3fd  ARPA   Vlan2
Internet  10.1.1.17               0   0014.5ebc.7466  ARPA   Vlan2
Internet  10.1.1.22               0   0015.1736.a65d  ARPA   Vlan2
Internet  10.1.1.41               0   0030.057a.52ff  ARPA   Vlan2


0
 
cciedreamerAuthor Commented:
Just wanted to let you know that when this occurs, the servers cannot ping the VLAN VIP, but I can ping the L3 SVI (the VLAN interface).
0
 
Craig BeckCommented:
Ok can you get the ARP output from the servers when the issue happens?
0
 
cciedreamerAuthor Commented:
The server administration team is in Germany and we are in KSA, so it might be delayed.

Different time zones.
Is there any remote utility I can use to get the output from the server?
0
 
Craig BeckCommented:
If you can use PSEXEC or something similar you could execute the command on the server remotely.  Or, RDP to the server?
0
 
cciedreamerAuthor Commented:
It's a Linux server and I don't have access to these servers.
0
 
Craig BeckCommented:
Ah ok so someone should be able to do that remotely via SSH.
0
 
cciedreamerAuthor Commented:
I have got the credentials. How can I get the ARP output on Linux?
0
 
cciedreamerAuthor Commented:
I will post the ARP result once the problem appears.
0
 
cciedreamerAuthor Commented:
This is the interface configuration on the 2 servers.

Server 1

bond0     Link encap:Ethernet  HWaddr 00:14:5E:BC:0C:84
          inet addr:10.1.1.14  Bcast:10.1.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:319772441 errors:0 dropped:0 overruns:0 frame:0
          TX packets:438580960 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:99762580571 (95141.0 Mb)  TX bytes:438533096008 (418217.7 Mb)

bond1     Link encap:Ethernet  HWaddr 00:14:5E:BC:0C:85
          inet addr:192.168.128.30  Bcast:192.168.128.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:4117467 errors:0 dropped:0 overruns:0 frame:0
          TX packets:94902 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:482387910 (460.0 Mb)  TX bytes:4648806 (4.4 Mb)



Server 2

bond0     Link encap:Ethernet  HWaddr 00:14:5E:BC:74:66
          inet addr:10.1.1.15  Bcast:10.1.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:916867429 errors:0 dropped:0 overruns:0 frame:0
          TX packets:765935729 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:900507151734 (858790.5 Mb)  TX bytes:538632706090 (513680.1 Mb)

bond0:3   Link encap:Ethernet  HWaddr 00:14:5E:BC:74:66
          inet addr:10.1.1.17  Bcast:10.1.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1



0
 
cciedreamerAuthor Commented:
Sir,
I came to a conclusion:
the problem is only with server 2, which has 10.1.1.15 (interface details as shown above).
I have no problem with server 1.

Here is the arp -n result from server 2 (when the problem occurred again):

sdm2ha:~ # arp -n
Address                  HWtype  HWaddress           Flags Mask            Iface
10.1.1.29                ether   00:19:99:03:BF:AC   C                     bond0
10.1.1.30                ether   00:19:99:03:BF:69   C                     bond0
10.1.1.22                ether   00:15:17:36:A6:5D   C                     bond0
10.1.1.11                ether   00:15:17:36:A6:5D   C                     bond0
10.1.1.24                ether   00:14:5E:FC:23:28   C                     bond0
10.1.1.18                ether   00:14:5E:BC:0C:84   C                     bond0
10.1.1.28                ether   00:19:99:02:16:6C   C                     bond0
10.1.1.81                ether   00:19:99:80:38:8A   C                     bond0
10.1.1.1                 ether   00:E0:81:B6:50:1B   C                     bond0

sdm2ha:~ #
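
One cross-check that can be made from outputs already posted in this thread is whether the gateway MAC the server has learned for 10.1.1.1 is the same as the HSRP virtual MAC reported by `sh standby` on Core 1. A minimal Python sketch of that comparison; it only compares the two values and does not say why they might differ. The `0000.0c07.acXX` pattern is the standard HSRP version 1 virtual MAC, and using version 1 here is an assumption based on the MAC shown by the switch.

```python
def normalize_mac(mac):
    """Reduce Cisco (xxxx.xxxx.xxxx) or Linux (xx:xx:...) notation to 12 hex digits."""
    digits = "".join(ch for ch in mac.lower() if ch in "0123456789abcdef")
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits

def hsrp_v1_virtual_mac(group):
    """HSRP version 1 virtual MAC: 0000.0c07.acXX, where XX is the group in hex."""
    if not 0 <= group <= 255:
        raise ValueError("HSRP version 1 groups are 0-255")
    return f"00000c07ac{group:02x}"

switch_vmac = "0000.0c07.ac02"        # "Virtual mac address" from sh standby on Core 1
server_gw_mac = "00:E0:81:B6:50:1B"   # entry for 10.1.1.1 in arp -n on server 2

matches = normalize_mac(server_gw_mac) == normalize_mac(switch_vmac)
print("gateway MAC matches HSRP virtual MAC:", matches)
```

With the values posted in this thread the comparison comes out as a mismatch, which may be worth raising with the server team alongside the switch-side ARP tables.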
0
 
cciedreamerAuthor Commented:
Here is the interface details on server 2

bond0     Link encap:Ethernet  HWaddr 00:14:5E:BC:74:66
          inet addr:10.1.1.15  Bcast:10.1.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:916867429 errors:0 dropped:0 overruns:0 frame:0
          TX packets:765935729 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:900507151734 (858790.5 Mb)  TX bytes:538632706090 (513680.1 Mb)

bond0:0   Link encap:Ethernet  HWaddr 00:14:5E:BC:74:66
          inet addr:10.1.1.21  Bcast:10.1.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

bond0:2   Link encap:Ethernet  HWaddr 00:14:5E:BC:74:66
          inet addr:10.1.1.23  Bcast:10.1.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

bond0:3   Link encap:Ethernet  HWaddr 00:14:5E:BC:74:66
          inet addr:10.1.1.17  Bcast:10.1.1.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1


0
 
cciedreamerAuthor Commented:
Once the problem occurs I cannot ping any of the interface IPs of server 2.

Thanks
0
 
InfamusCommented:
Someone correct me if I'm wrong.

I was reading through the post and found that you assigned the same IP (.17) on both of the servers, is that correct?

How is HA configured to use the VIP (.17)?

Can you remove .17 from server 2 and see what happens?  (meaning able to ping .15).

Also the difference on the ARP table from before and after is that this ip is no longer there after clearing the cache.

Internet  10.1.1.31              35   0019.9903.c093  ARPA   Vlan2

Do you know what this device is?
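
Comparing the two captures by eye is error-prone, so the before/after difference can also be computed. A minimal Python sketch, assuming both captures are saved as text in the `sh ip arp` layout; only three rows from the tables posted earlier are used here to keep it short.

```python
BEFORE = """\
Internet  10.1.1.15              13   0014.5ebc.7466  ARPA   Vlan2
Internet  10.1.1.31              35   0019.9903.c093  ARPA   Vlan2
Internet  10.1.1.17               5   0014.5ebc.7466  ARPA   Vlan2
"""

AFTER = """\
Internet  10.1.1.15               0   0014.5ebc.7466  ARPA   Vlan2
Internet  10.1.1.17               0   0014.5ebc.7466  ARPA   Vlan2
"""

def arp_table(text):
    """Map IP -> MAC from `sh ip arp` style rows, ignoring the age column."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == "Internet":
            table[fields[1]] = fields[3].lower()
    return table

def diff_tables(before, after):
    """Return (removed, added, changed) between two IP -> MAC mappings."""
    removed = {ip: before[ip] for ip in before.keys() - after.keys()}
    added = {ip: after[ip] for ip in after.keys() - before.keys()}
    changed = {
        ip: (before[ip], after[ip])
        for ip in before.keys() & after.keys()
        if before[ip] != after[ip]
    }
    return removed, added, changed

removed, added, changed = diff_tables(arp_table(BEFORE), arp_table(AFTER))
print("removed:", removed)
print("added:", added)
print("changed:", changed)
```

An empty `changed` result means no IP moved to a different MAC between the two captures, which is the case for these rows.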
0
 
cciedreamerAuthor Commented:
No. After discussing with the server team, .17 is not assigned to server 1.

Server 1 has no issues.

The only issue is with server 2. This is confirmed.

No, I cannot remove .17; they have some bonding on the Linux server.

No idea about this device. It may be one of the workstations in this VLAN.
0
 
InfamusCommented:
Can you post sh run int gixx/xx for the switchport that server 2 is connected to?

I know they are connected to multiple ports, right?
0
 
cciedreamerAuthor Commented:
I have no access to this switch.
The only information I have for the switch is:


WS-C2960G-24TC-L
Catalyst 2960G IOS 12.2(25)SEE

Could the IOS version be the cause?

Thanks
0
 
InfamusCommented:
I'm thinking more of hardware NIC issue.

I was thinking about bpduguard or port security might be blocking since there are multiple IP's on the same MAC but that might not be the case.
0
 
cciedreamerAuthor Commented:
A continuous ping to this server keeps the connection running, but that's not a fix :)
0
 
cciedreamerAuthor Commented:
This weird issue is driving us mad. Please bear with me.
0
 
cciedreamerAuthor Commented:
I have noticed one more thing: during a continuous ping to 10.1.1.17 the reply time is 1ms, but at some intervals it goes to 10ms and then back to 1ms.
0
 
InfamusCommented:
Just a thought, is there a way to compare the NIC driver version between Server 1 and Server 2 if they are same hardware?

Any chance you can have server admin to update server 2's NIC driver?
0
 
Craig BeckCommented:
I think they've just got the bonding config wrong.  It happens quite a lot.
0
 
cciedreamerAuthor Commented:
But it was working and no changes took place. This started happening yesterday.
0
 
InfamusCommented:
Yeah, I'm trying to think of all the possibilities since this thread is going nowhere.
0
 
InfamusCommented:
Hardware always goes bad

:P
0
 
cciedreamerAuthor Commented:
See, once I stop the ping to any of the server's interface IPs, the problem appears.

Thanks
0
 
Craig BeckCommented:
It looks to me like the two servers have different bonding configurations.

Maybe server 1 was owning the VIP until you started experiencing issues, and when server 2 assumed the VIP that's when you started to see issues.

Thinking about it, that just looks like a pure bond.  We can't tell how the bond is working from your outputs.  Is it 802.3ad, or active/passive, etc?

But yes, hardware could be causing an issue too.
0
 
cciedreamerAuthor Commented:
How about Cisco IOS version ?
0
 
Craig BeckCommented:
I don't think this is a switch issue.  Granted though the version of IOS could do with an update, but generally it's fine.

I can't see any issues with that version of code related to ARP.
0
 
cciedreamerAuthor Commented:
Please let me know any questions that will help us identify the problem and I will forward them to the server team.
0
 
InfamusCommented:
Since this is a physical server and has multiple NICs, they can disable the NICs one at a time and try to find out which NIC is having an issue.
0
 
Craig BeckCommented:
+1

I'd ask the server team to disable bonding and VIP and see if issues persist.
0
 
cciedreamerAuthor Commented:
Any more suggestions on this?

The problem is that we can't reboot the servers and switches.
Rebooting the server requires 45 minutes of downtime and we can't afford that.

Is there any troubleshooting we can do on the core and access switches?

(2 Servers- Active/Active Clustering)<--->(2960 SW)<--->(Access Switch)
                                                           |       |
                                                           |       |
                                                        Core 1   Core 2
0
 
Craig BeckCommented:
You don't need to reboot the servers, just disconnect all NICs apart from the primary (or just leave one connected) and see if that has an effect.
0
 
cciedreamerAuthor Commented:
Is there anything else I can check apart from disconnecting NICs? Thanks for your understanding, I appreciate your help.
0
 
InfamusCommented:
I think you should really try above recommendations first to isolate the possible major cause.

As Craig mentioned, you don't need any down time by removing one NIC from the bond.
0
 
cciedreamerAuthor Commented:
Actually, I received the wrong information from one of the admins.

The configuration is Active/Passive.

They say it is a duplicate IP address, but I cannot see any duplication.

Thanks
0
 
cciedreamerAuthor Commented:
Please see the attached interfaces configuration. The cluster is Active/Passive.

Thanks for the help.
interfaces-configuration.docx
0
 
Craig BeckCommented:
Why do your servers have multiple IP addresses?

Can you show the routing table from the servers?
0
 
cciedreamerAuthor Commented:
I really have no clue, but the default gateway points to our core HSRP IP.
The servers have a range of IPs in the 10.1.1. subnet so that if one IP fails they can use another.
0
 
Craig BeckCommented:
So the servers have multiple IPs on the 10.1.1. subnet in case one fails??

That's not how bonding is supposed to be used.  You assign one IP to the bond and it chooses which NIC to use.
0
 
cciedreamerAuthor Commented:
This is what I meant, but maybe I couldn't explain it properly.
So far we have done most of the troubleshooting on the switching side but didn't get any clue.
0
 
Craig BeckCommented:
I really don't think it's a switch issue.  If normal clients are accessible there can't really be a switch problem.
0
 
cciedreamerAuthor Commented:
I am going crazy and trying every possible way.

Why does pinging server 2's IP addresses (any of its multiple IPs in 10.1.1.X) keep the connection alive? Once the ping is stopped, I cannot ping after a few minutes, and then I have to flush the ARP cache for this VLAN on Core 1.
0
 
Craig BeckCommented:
I don't know, but you really need to just connect each server with one NIC.
0
 
cciedreamerAuthor Commented:
I have noted something.

I have observed an ARP entry for only one of server 2's IPs on our Core 2.

Is that normal ?
0
 
cciedreamerAuthor Commented:
For example, server 2 has got the following ip addresses. But on core 2 I can see only 10.1.1.15

but on core 1 I can see them all

10.1.1.15 > 0014.5ebc.7466

10.1.1.17 > 0014.5ebc.7466

10.1.1.23 > 0014.5ebc.7466

10.1.1.21 > 0014.5ebc.7466
0
 
Craig BeckCommented:
That IP will be the primary IP address.  Things like broadcast traffic will come from that address so it's normal to see only that IP address especially if the services using the other IP addresses aren't being used for a period of time.

What would be bad is:

1] Seeing the same IP address with multiple MAC addresses.
2] Seeing the same MAC address via multiple ports on the core switches.

If [1] is true you have a duplicate IP address.
If [2] is true you have a loop somewhere.
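
Check [2] can be sketched the same way as a duplicate-IP check, this time grouping by MAC instead of IP. A minimal Python sketch, assuming the MAC table has already been reduced to (MAC, port) pairs; the port names below are hypothetical, and only the MAC addresses come from this thread. A MAC can legitimately move between ports over time, so the check is only meaningful for observations taken at the same moment.

```python
from collections import defaultdict

def macs_on_multiple_ports(observations):
    """Return {mac: sorted ports} for every MAC seen on more than one port."""
    ports = defaultdict(set)
    for mac, port in observations:
        ports[mac.lower()].add(port)
    return {mac: sorted(p) for mac, p in ports.items() if len(p) > 1}

# Hypothetical observations taken at the same moment on one switch.
observations = [
    ("0014.5ebc.7466", "Gi1/0/1"),
    ("0014.5ebc.7466", "Gi1/0/2"),
    ("0014.5ebc.0c84", "Gi1/0/1"),
]

suspects = macs_on_multiple_ports(observations)
print(suspects)
```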
0
 
cciedreamerAuthor Commented:
Would running Wireshark on a workstation outside this VLAN and sending ping packets to the server during the problem help?
0
 
Craig BeckCommented:
It might do, but I think there'd be more worth in trying to check the NIC/bond config on the servers at this point.
0
 
cciedreamerAuthor Commented:
We'll shut down server 2 and see if that makes any difference.
0
 
Craig BeckCommented:
Are the servers redundant, or just the NICs in each server?  What I mean is, does server1 provide the same services as server2?
0
 
cciedreamerAuthor Commented:
Yes. they do.
0
 
Craig BeckCommented:
Ok so how is the redundancy configured?  Are the servers active/passive?  Is the redundancy facilitated by DNS?
0
 
cciedreamerAuthor Commented:
They are active/passive. Its DNS redundancy.
0
 
cciedreamerAuthor Commented:
The problem appeared again after switching off server 2 :( :(.

I cannot ping server 1's real IP or the VIP.
0
 
cciedreamerAuthor Commented:
What if I decrease the arp timeout to 20 min ??
0
 
Craig BeckCommented:
No you shouldn't need to touch ARP timers at all.

Why does each server have multiple IP addresses on the bond?
0
 
cciedreamerAuthor Commented:
After shutting down server 2, all the IP addresses that were on server 2 have now switched to server 1. Bold are the IP addresses that switched.

Internet  10.1.1.14               1   0014.5ebc.0c84  ARPA   Vlan2
Internet  10.1.1.18               4   0014.5ebc.0c84  ARPA   Vlan2
Internet  10.1.1.17               4   0014.5ebc.0c84  ARPA   Vlan2
Internet  10.1.1.23               4   0014.5ebc.0c84  ARPA   Vlan2
Internet  10.1.1.21               4   0014.5ebc.0c84  ARPA   Vlan2
0
 
Craig BeckCommented:
Ok so let's just confirm...

Does server1 have any of the same IP addresses as server2?
0
 
cciedreamerAuthor Commented:
The active server takes these IP addresses:

Internet  10.1.1.17               4   0014.5ebc.0c84  ARPA   Vlan2
Internet  10.1.1.23               4   0014.5ebc.0c84  ARPA   Vlan2
Internet  10.1.1.21               4   0014.5ebc.0c84  ARPA   Vlan2
0
 
cciedreamerAuthor Commented:
Now we will shutdown server 1 and make 2 up
0
 
Craig BeckCommented:
Ok so I don't understand how each server knows that it's active or passive?

Your issue is duplicate IP addresses as I said right at the start of this thread.  Here you've just said that the server picks-up the addresses...
Internet  10.1.1.17               4   0014.5ebc.0c84  ARPA   Vlan2
...and earlier you showed us that same IP address with a different MAC...
Internet  10.1.1.17               5   0014.5ebc.7466  ARPA   Vlan2
So, we need to understand how the active/passive is configured between the servers?  If I'm correct there's no way you can do this with a plain bond on the NICs - they're server independent so each server doesn't know about the other's configuration.  Therefore it could easily be the case that the switch is seeing the wrong server with the IP address, or even worse - both servers with the same IP at the same time.
0
 
Craig BeckCommented:
Also there's just something that's sitting at the back of my mind, but are you running ARP inspection on the switches?
0
 
cciedreamerAuthor Commented:
They are running Veritas cluster software for managing the clustering.

The respective active server takes the IPs 10.1.1.17, .23 and .21.

The switch shows the MAC address of the active server.

Server 1 ( Active) 0014.5ebc.0c84      
10.1.1.14
10.1.1.17
10.1.1.18
10.1.1.21
10.1.1.23

Server 2 ( if active)  0014.5ebc.7466
10.1.1.15    
10.1.1.17
10.1.1.21
10.1.1.23


If the server is on standby the switch will show only the real ip and mac address associated with it.
Real IP addresses of servers

Server 1 10.1.1.14 and 10.1.1.18    0014.5ebc.0c84          
Server 2 10.1.1.15                           0014.5ebc.7466

Once
0
 
Craig BeckCommented:
Ok so why does server1 have two real IP addresses but server2 only has one?

The servers are supposed to be the same.
0
 
cciedreamerAuthor Commented:
Sorry, this is also a logical IP address, not a real one.
0
 
Craig BeckCommented:
So why are there two redundancy/logical IP addresses?
0
 
cciedreamerAuthor Commented:
ARP inspection is disabled on our Core

Core1#sh ip arp inspection vlan 2

Source Mac Validation      : Disabled
Destination Mac Validation : Disabled
IP Address Validation      : Disabled

 Vlan     Configuration    Operation   ACL Match          Static ACL
 ----     -------------    ---------   ---------          ----------
    2     Disabled         Inactive                      

 Vlan     ACL Logging      DHCP Logging
 ----     -----------      ------------
    2     Deny             Deny
0
 
cciedreamerAuthor Commented:
Each IP addresses having some services associated with it. It's called as PAC Radiology Server.
0
 
Craig BeckCommented:
0
 
cciedreamerAuthor Commented:
I turned on ARP inspection on this VLAN.

The problem appeared, I cleared the ARP cache on the core, and this time it didn't work.

I disabled ARP inspection again and cleared the cache, then it worked and started pinging.
0
 
cciedreamerAuthor Commented:
We found something: the problem is with both servers.
Whichever of them is active, it stops responding to ping. I tried this by shutting down the servers one by one.
0
 
Craig BeckCommented:
You don't want ARP inspection - that's why I was asking if you had it enabled.

In the link I posted the problem seems to be similar, so you should try the solution in that post.

After working with Veritas and Redhat, we have concluded that the problem is our router does not reply to "ARP REPLY" type Gratuitous ARP packets.  This is the default behavior for VCS.  We had to modify the 'online' script within VCS so that it sends out "ARP REQUEST" type Gratuitous ARP packets.

I would also check on the core switches that you don't have the no ip gratuitous-arp command configured.
0
 
cciedreamerAuthor Commented:
no ip gratuitous-arp  should be configured or not configured ??
0
 
cciedreamerAuthor Commented:
OK, we'll try the above solution as well.
0
 
Craig BeckCommented:
You don't want gratuitous ARP to be turned off.  So if you issue the following command...

show run | inc gratu

...and get nothing back you are ok.
0
 
cciedreamerAuthor Commented:
We tried above solution but same thing.

 ip gratuitous-arp is disabled.
0
 
Craig BeckCommented:
I think you would want gratuitous arp to be enabled on the switch.

conf t
ip gratuitous-arp
end
0
 
cciedreamerAuthor Commented:
but it was working before without this command.
0
 
Craig BeckCommented:
It should be enabled by default, so you won't see that line in the config.  So I think it's enabled, not disabled.

Gratuitous ARP will help to update the ARP table on other devices when the passive server takes the IP address from the other server.  Without this you won't be able to see the newly-active server for a few minutes once it assumes the IP address as the MAC-IP mapping will be wrong.

AFAIK the switch will only respond to ARP-REQUEST packets, not ARP-REPLY packets like the servers are sending, so you need to modify the server-side to send ARP-REQUEST packets instead.

As to why it was working before - I don't know.  What changed?
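
To make the ARP-REQUEST versus ARP-REPLY distinction concrete, here is a minimal Python sketch that builds the 28-byte ARP payload for both flavours of gratuitous ARP. It only constructs the bytes and sends nothing. In both cases the sender and target IP are the same address; the only field the discussion above turns on is the opcode (1 for request, 2 for reply). The target hardware address values are assumptions, since implementations differ on what they put there.

```python
import socket
import struct

ARP_REQUEST = 1
ARP_REPLY = 2

def gratuitous_arp(mac, ip, opcode):
    """Build an Ethernet/IPv4 ARP payload announcing that `ip` is at `mac`."""
    sender_mac = bytes.fromhex(mac.replace(":", "").replace(".", ""))
    sender_ip = socket.inet_aton(ip)
    # Assumption: zeros for the request form, broadcast for the reply form.
    target_mac = b"\x00" * 6 if opcode == ARP_REQUEST else b"\xff" * 6
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6, 4,         # hardware and protocol address lengths
        opcode,
        sender_mac, sender_ip,
        target_mac, sender_ip,  # target IP equals sender IP: gratuitous
    )

request_form = gratuitous_arp("0014.5ebc.7466", "10.1.1.17", ARP_REQUEST)
reply_form = gratuitous_arp("0014.5ebc.7466", "10.1.1.17", ARP_REPLY)
print(request_form[6:8].hex(), reply_form[6:8].hex())
```

The two payloads carry the same IP-to-MAC claim and differ in the opcode field, which is the property a receiver can use to decide whether to accept the update.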
0
 
InfamusCommented:
Have you tried to disable NIC's one at a time?

If everything was working fine and nothing was changed then I only can think of hardware failure.
0
 
cciedreamerAuthor Commented:
I tried shutting down the servers one by one and both have the problem.
0
 
cciedreamerAuthor Commented:
Here is the latest update to this problem.

Today we tried shutting down the servers one by one to see if this problem remains. First we turned off Server 2, so all the resources transferred to Server 1. There was still a problem: after a certain period I could not ping the real and logical IP addresses of Server 1.

Then we did the reverse, Server 1 down and Server 2 up. Still the same problem.

Then we ran both servers at the same time (active/passive). Same problem.

Just for more  info.

Server 1 has real ip address 10.1.1.14

Server 2 10.1.1.15

Logical IP addresses : 10.1.1.18, 10.1.1.21, 10.1.1.23 ( services associated to it )

VIP: 10.1.1.17

When one of the servers is active, it takes the above logical addresses, and the ARP table on the core switch maps them all to a single MAC address.

The standby server shows an ARP entry on the core only for its real IP address.

For example, if Server 1 is active, here is the ARP table on the core:

Internet  10.1.1.15               0   0014.5ebc.7466  ARPA   Vlan2

Internet  10.1.1.18              96   0014.5ebc.7466  ARPA   Vlan2

Internet  10.1.1.17              96   0014.5ebc.7466  ARPA   Vlan2

Internet  10.1.1.23              96   0014.5ebc.7466  ARPA   Vlan2

Internet  10.1.1.21              96   0014.5ebc.7466  ARPA   Vlan2

Server 2 arp entry on Core

Internet  10.1.1.14              0  0014.5ebc.0c84  ARPA   Vlan2

And if the server 2 is active, this will be arp table

Internet  10.1.1.14              0  0014.5ebc.0c84  ARPA   Vlan2

Internet  10.1.1.18              96  0014.5ebc.0c84 ARPA   Vlan2

Internet  10.1.1.17              96  0014.5ebc.0c84 ARPA   Vlan2

Internet  10.1.1.23              96  0014.5ebc.0c84 ARPA   Vlan2

Internet  10.1.1.21              96  0014.5ebc.0c84 ARPA   Vlan2

Server 1 arp entry on Core :

Internet  10.1.1.15               0   0014.5ebc.7466  ARPA   Vlan2
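
For anyone comparing tables like these by hand, here is a minimal Python sketch that groups the IP addresses in pasted ARP output by MAC address and reports which IPs changed MAC between two snapshots. The function names (group_ips_by_mac, changed_entries) are illustrative only and not part of any Cisco tooling, and the regular expression assumes the column layout shown in this thread.

```python
import re
from collections import defaultdict

# Rows copied from the "Server 1 is active" output in this thread.
ARP_OUTPUT = """\
Internet  10.1.1.15               0   0014.5ebc.7466  ARPA   Vlan2
Internet  10.1.1.18              96   0014.5ebc.7466  ARPA   Vlan2
Internet  10.1.1.17              96   0014.5ebc.7466  ARPA   Vlan2
Internet  10.1.1.23              96   0014.5ebc.7466  ARPA   Vlan2
Internet  10.1.1.21              96   0014.5ebc.7466  ARPA   Vlan2
Internet  10.1.1.14               0   0014.5ebc.0c84  ARPA   Vlan2
"""

# Assumed layout: Protocol, Address, Age, Hardware Addr, Type, Interface.
ROW = re.compile(
    r"^Internet\s+(?P<ip>\d+\.\d+\.\d+\.\d+)\s+(?P<age>\S+)\s+"
    r"(?P<mac>[0-9a-fA-F]{4}\.[0-9a-fA-F]{4}\.[0-9a-fA-F]{4})\s+ARPA\s+(?P<intf>\S+)"
)


def ip_to_mac(text):
    """Return {ip: mac} for every row that matches the assumed layout."""
    table = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            table[match.group("ip")] = match.group("mac").lower()
    return table


def group_ips_by_mac(text):
    """Return {mac: [ip, ...]} so shared MAC addresses are easy to spot."""
    groups = defaultdict(list)
    for ip, mac in ip_to_mac(text).items():
        groups[mac].append(ip)
    return dict(groups)


def changed_entries(before, after):
    """Return {ip: (old_mac, new_mac)} for IPs whose MAC differs between snapshots."""
    old, new = ip_to_mac(before), ip_to_mac(after)
    return {ip: (old[ip], new[ip]) for ip in old if ip in new and old[ip] != new[ip]}


if __name__ == "__main__":
    for mac, ips in group_ips_by_mac(ARP_OUTPUT).items():
        print(mac, ips)
```

Saving the ARP output before and after clearing the cache and passing both snapshots to changed_entries would show exactly which mappings were stale.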
0
 
InfamusCommented:
Have you discussed this with your server admin, using the link Craig mentioned?

Here is the modification we had to make to /opt/VRTSvcs/bin/IP/online:

     `$arping -A -c 5 -I $Device $Address`;
 
to
 
    `$arping -U -c 5 -I $Device $Address`;
0
 
cciedreamerAuthor Commented:
They have tried but no luck.

Within the same subnet, we are not facing any issue.
0
 
cciedreamerAuthor Commented:
What if I create a static ARP entry on the core switch?
0
 
Craig BeckCommented:
You can't create a static ARP entry if you have redundancy between the two servers - the MAC will change every time the active drops and the passive takes over.

If you don't see the issue on the same subnet this is definitely an ARP problem - it has to be.  Can you show us the config for the SVI on each core?
0
 
cciedreamerAuthor Commented:
Core 1

interface Vlan2
 ip address 10.1.1.254 255.255.255.0
 ip route-cache flow
 standby delay minimum 10 reload 25
 standby 2 ip 10.1.1.1
 standby 2 priority 110
 standby 2 preempt
end


Core  2

interface Vlan2
 ip address 10.1.1.253 255.255.255.0
 ip route-cache flow
 standby delay minimum 20 reload 25
 standby 2 ip 10.1.1.1
 standby 2 priority 95
 standby 2 preempt
end


I have also been reading forums. What if I add a static ARP entry for the default gateway on cluster Servers 1 and 2?
0
 
cciedreamerAuthor Commented:
I have found something very interesting.

On my core switch, the ARP entry for the HSRP VIP (10.1.1.1) is 0000.0c07.ac02.

The ARP entry on the active server is also 0000.0c07.ac02.


But when the problem occurs, the ARP entry on the server changes to 00:E0:81:B6:50:1B.
After tracing this MAC, it leads to one of the access switches in the network.

Any clues?
0
 
Craig BeckCommented:
That MAC address is a TYAN device - they typically make server and PC motherboards.

If you use the show mac address-table | inc 501b command that should tell you which port on the access switch it's connected to.  Find out what that device is and disconnect it if you can, then test again.
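
The server reports the MAC in colon notation while the switch prints it in dotted notation, so the filter string has to be converted first. A minimal Python sketch of that conversion follows; the helper name to_cisco_format is made up for illustration.

```python
import re


def to_cisco_format(mac):
    """Convert a MAC such as 00:E0:81:B6:50:1B to the dotted form 00e0.81b6.501b."""
    digits = re.sub(r"[^0-9a-fA-F]", "", mac).lower()
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    # Three groups of four hex digits, as printed in the switch's MAC address table.
    return ".".join(digits[i:i + 4] for i in range(0, 12, 4))


if __name__ == "__main__":
    dotted = to_cisco_format("00:E0:81:B6:50:1B")
    print(dotted)                  # 00e0.81b6.501b
    print(dotted.split(".")[-1])   # 501b, the string used with "| inc"
```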
0
 
Craig BeckCommented:
"I have been also reading forums what if add static arp entry on Cluster Server 1 and 2 for default gateway ??"
You could do that, but you shouldn't have to.  I don't think it would help anyway.

It's technically OK to do this.  I suggested not doing it on the switch for the servers because the switches use a virtual MAC address for the standby address, whereas the servers don't.
0
 
cciedreamerAuthor Commented:
Voila! I added a static ARP entry on the servers pointing to the default gateway (HSRP IP), and the problem disappeared.

As I mentioned above, I have found something very interesting.

On my core switch, the ARP entry for the HSRP VIP (10.1.1.1) is 0000.0c07.ac02.

The ARP entry on the active server is also 0000.0c07.ac02.


But when the problem occurs, the ARP entry on the server changes to 00:E0:81:B6:50:1B.
After tracing this MAC, it leads to one of the access switches in the network.


Thanks
0
 
cciedreamerAuthor Commented:
Basically, I still don't understand whether it's a clustering issue or a switch issue.

Thanks
0
 
InfamusCommented:
It's not the switch as far as I'm concerned.
0
 
InfamusCommented:
Did you look for the Tyan device as Craig suggested?
0
 
Craig BeckCommented:
I think you're just masking a problem.  What you've done is force the servers to ignore the MAC address of the other 'rogue' device when it talks using the 10.1.1.1 address.  This is more of a workaround than a permanent fix, and it doesn't indicate that the switch is causing a problem.

You really should find the device with the 00:E0:81:B6:50:1B MAC address and disconnect it, then test without the static ARP mapping.
0
 
InfamusCommented:
I totally agree and that's why I asked.
0
 
cciedreamerAuthor Commented:
I deleted the static ARP entry.
Then the problem occurred again.

I traced that MAC address and found it on one of the access switches. I disabled the port.

The client then started pinging the server without flushing the cache or adding static entries.

Let me investigate this device.
0
 
Craig BeckCommented:
That device must have the 10.1.1.1 address on it - or it's running Proxy ARP.
0
 
cciedreamerAuthor Commented:
What I have found so far is that an unmanaged switch (TP-Link) is connected to the access switch port, and a host with the problematic MAC address is connected to this unmanaged switch.

I am now trying to find this host in the network. In any case, I have disabled this port.
0
 
InfamusCommented:
That's scary...

:P
0
 
cciedreamerAuthor Commented:
Well, found it. There was a PC connected to this unmanaged switch, and the gateway (HSRP) IP was assigned to this PC.

Very scary :(
0
 
Craig BeckCommented:
Excellent, so I was right with my first and second post then!

Oh well, glad you got it sorted now :-)
0
 
cciedreamerAuthor Commented:
Thanks craigbeck and infamous for your help. Finally putting an end to this thread.

Just a summary of the solution:

- During the problem, I collected the ARP output on the server.
- I noticed the server's entry for the gateway was pointing to the wrong MAC address.
- We started tracing that MAC in the network.
- Once the device was found, we noted that the PC had the HSRP IP configured as a duplicate.
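
The second step of this summary can be sketched as a small check. This assumes HSRP version 1 (the SVI configs in this thread show no version 2 setting), where the virtual MAC is 0000.0c07.acXX and XX is the group number in hex. The function names are hypothetical.

```python
def hsrp_v1_virtual_mac(group):
    """Return the well-known HSRP version 1 virtual MAC for a group (0-255)."""
    if not 0 <= group <= 255:
        raise ValueError("HSRP version 1 groups range from 0 to 255")
    return f"0000.0c07.ac{group:02x}"


def normalize(mac):
    """Strip separators and case so colon and dotted notations compare equal."""
    return "".join(c for c in mac.lower() if c in "0123456789abcdef")


def gateway_mac_is_expected(observed_mac, group):
    """True if the MAC a host has cached for the gateway is the HSRP v1 virtual MAC."""
    return normalize(observed_mac) == normalize(hsrp_v1_virtual_mac(group))


if __name__ == "__main__":
    # "standby 2 ip 10.1.1.1" means group 2.
    print(hsrp_v1_virtual_mac(2))
    print(gateway_mac_is_expected("0000.0c07.ac02", 2))
    print(gateway_mac_is_expected("00:E0:81:B6:50:1B", 2))
```

A gateway entry that fails this check on a server points to another device answering ARP for the gateway IP, which is what the PC in this thread was doing.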
0
 
InfamusCommented:
Glad it is resolved.

Good Luck!!!
0
 
InfamusCommented:
On second thought, you should enable port security and BPDU guard on the access ports if they are not already enabled.
0
