ARP collisions cause DoS

Hi. We host our company's live web applications at a third-party datacentre. We have a Cisco PIX 515E firewall there that forms the edge of our equipment before it links to the ISP's equipment and hence to the internet. A few days ago the PIX experienced an almost total loss of connectivity for over an hour. This was the first time this had occurred in the two years that we've hosted our systems there. The chain of events was roughly this:

22:23:13 - Cisco PIX 515E firewall starts reporting dozens of ARP request collisions and ARP response collisions on its external interface. For instance:

<164>Mar 26 2007 23:27:05: %PIX-4-405001: Received ARP request collision from x.x.x.x/yyyy.yyyy.cb31 on interface outside
<164>Mar 26 2007 23:27:05: %PIX-4-405001: Received ARP response collision from x.x.x.x/yyyy.yyyy.cc31 on interface outside

(I have substituted x.x.x.x where the IP of our PIX's outside interface was, and yyyy.yyyy for the beginning of the MAC address reported in the log. Otherwise the log entries are unchanged)

The ARP collisions reported seemed to indicate a duplication of our IP address (x.x.x.x) within the ISP's network. I managed to trace it back to their gateway (x.x.x.1) and reported this to them. A traceroute to the IP address in question from outside got as far as the ISP's router and timed out but I think this normally happens anyway
 
The problem disappeared at 23:27:33 in the midst of the ISP's investigations (deduced from the PIX log):
<164>Mar 26 2007 23:27:19: %PIX-4-405001: Received ARP request collision from x.x.x.x/yyyy.yyyy.cc31 on interface outside
<164>Mar 26 2007 23:27:19: %PIX-4-405001: Received ARP request collision from x.x.x.x/yyyy.yyyy.cb31 on interface outside
<162>Mar 26 2007 23:27:19: %PIX-2-106001: Inbound TCP connection denied from a.a.a.a/1312 to z1.z1.z1.z1/445 flags SYN  on interface outside
<164>Mar 26 2007 23:27:19: %PIX-4-106023: Deny tcp src outside:a.a.a.a/1313 dst inside:z2.z2.z2.z2/445 by access-group "PERMIT_INET_IN"
<164>Mar 26 2007 23:27:19: %PIX-4-106023: Deny tcp src outside:a.a.a.a/1316 dst DMZ:z3.z3.z3.z3/445 by access-group "PERMIT_INET_IN"

(a.a.a.a is a host on the ISP's network which I believe was being used by their technicians to attempt to connect to our equipment, which I intentionally deny. z1.z1.z1.z1, z2.z2.z2.z2 and z3.z3.z3.z3 are our web servers.)

I reloaded the PIX for peace of mind at 23:55:37 although it had been running happily for months without being reloaded

Initially yyyy.yyyy.cc31 seemed to me to relate to a Broadcom card, perhaps a Dell machine, but then I determined that it resolved to x.x.x.1, the router one hop down from our firewall. That MAC address actually seemed to resolve to x.x.x.253 as well, so it seems to be set up as a failover system. Our ISP say that the traffic is nothing out of the ordinary but I have searched all our PIX logs from the past two years and this traffic has not appeared before now. In the hour we lost service the PIX logged almost nothing but this traffic. That would seem to indicate to me that the two are related but the ISP deny this
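
For anyone wanting to repeat that kind of log search, here is a minimal Python sketch that tallies the 405001 messages per MAC address and per minute. The names tally_arp_collisions, ARP_COLLISION and SAMPLE are my own for illustration, and the pattern is inferred only from the log lines quoted in this post, so it may need adjusting for other PIX versions or syslog formats.

```python
import re
from collections import Counter

# Pattern inferred from the %PIX-4-405001 lines quoted in this post only;
# it is an assumption, not a formal Cisco syslog grammar.
ARP_COLLISION = re.compile(
    r"(?P<stamp>\w{3} +\d{1,2} \d{4} \d{2}:\d{2}):\d{2}: "
    r"%PIX-4-405001: Received ARP (?P<kind>request|response) collision "
    r"from (?P<ip>\S+)/(?P<mac>\S+) on interface (?P<iface>\S+)"
)

# Redacted sample lines copied from this post (one is not an ARP message).
SAMPLE = [
    "<164>Mar 26 2007 23:27:05: %PIX-4-405001: Received ARP request collision from x.x.x.x/yyyy.yyyy.cb31 on interface outside",
    "<164>Mar 26 2007 23:27:05: %PIX-4-405001: Received ARP response collision from x.x.x.x/yyyy.yyyy.cc31 on interface outside",
    "<164>Mar 26 2007 23:27:19: %PIX-4-405001: Received ARP request collision from x.x.x.x/yyyy.yyyy.cc31 on interface outside",
    "<162>Mar 26 2007 23:27:19: %PIX-2-106001: Inbound TCP connection denied from a.a.a.a/1312 to z1.z1.z1.z1/445 flags SYN  on interface outside",
]


def tally_arp_collisions(lines):
    """Count ARP collision messages by (MAC, kind) and by minute."""
    by_mac = Counter()
    by_minute = Counter()
    for line in lines:
        match = ARP_COLLISION.search(line)
        if match is None:
            continue  # not a 405001 message, skip it
        by_mac[(match["mac"], match["kind"])] += 1
        by_minute[match["stamp"]] += 1
    return by_mac, by_minute


if __name__ == "__main__":
    macs, minutes = tally_arp_collisions(SAMPLE)
    for (mac, kind), count in macs.most_common():
        print(f"{mac} {kind}: {count}")
    for minute, count in sorted(minutes.items()):
        print(f"{minute}: {count}")
```

Run over the full log, a per-minute count like this would show whether the messages are a steady background or a sudden burst lining up with the outage window.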

I'm not saying that the ARP traffic isn't part of the normal working of the router failover monitoring process but there could still perhaps have been some event that caused a huge increase in that traffic. Perhaps the ISP restarted one or more routers as part of the investigation, or unplugged and reattached a cable. Any information would be helpful but none seems forthcoming from them, other than they say that everything was normal

The main issue is not apportioning blame as such, but determining whether the problem could be the fault of our firewall. Neither of the MAC addresses mentioned in the ARP collision messages in the PIX log relate to our equipment but the IP address (x.x.x.x) does. As I say it's the IP address of our outside interface. The way I see it the outside interface is seeing conflicting data but I need to confirm for sure that this confusion couldn't be caused by our PIX as, if it is, I need to take steps to avoid it happening again

I have done a lot of reading up on this issue since the downtime but I can't seem to determine exactly what happened. The Cisco documentation for error 405001 mentions that the traffic could be legitimate but whether legitimate or not the traffic seems to be deadly for our firewall. Would these messages be caused by a normal failover config on our ISP's router? The ARP collision traffic we saw coincided with a DoS so I'm sure the two are completely related as we've never seen that traffic on the PIX before in two years

As far as I'm concerned the traffic comes from one of four sources:

1) A cable plugged incorrectly by the ISP
2) A faulty config or a malfunctioning of the ISP's router (not our PIX)
3) A piece of kit from another hosted company at the ISP which shares the router (x.x.x.1) with us
4) Our PIX causing confusion or having a faulty config

My primary concern is eliminating any possibility of the source of the problem being number 4)

Anyway, I'd be interested to know if anybody has a different opinion as this has caused us a lot of hassle with some big clients and we are in danger of losing them. It's just one of those things no doubt but I'd be very interested to find out what happened and why so that we can avoid a repeat occurrence which we absolutely cannot afford
saville00 Asked:

lrmoore Commented:
I'm almost certain that it is the ISP at fault here.
Ask them if there is anyone else that can possibly share the same broadcast boundary with your outside interface and their router. If yes, then any other firewall, like another PIX for example, in that same broadcast domain will use proxy arp to answer up for all of the addresses within its interface subnet.
I've seen the ARP collisions when two PIXes were in failover mode and one of them had a NIC going bad...

saville00 (Author) Commented:
Thank you very much. That is what I thought was happening but with no access to the ISP's network to run any analysis and with the ISP themselves in full denial mode it's hard to be 100% sure on an issue such as this. Your opinion as somebody whose comments I've read with respect over the years counts for a lot with me

The answer is accepted but if anyone is still able to add comments I'd be interested in any additional thoughts

Thanks again