Internet outage - Solved

This is a question about an Internet outage at a client today.  Up front:

1. I decided to ask a question instead of writing an article: My choice.
2. I rated it high priority even though I know the answer to provide incentive for experts to answer.
3. I do not want and will not accept Googled answers. I already know how to use Google. Use only your own words, skill, knowledge and experience.
4. I could not find a solution here but that does not mean there isn't one.
5. I will select the first two best answers: 1,500 points each.

Background: This is my client downtown, for which I do the Financial Consulting work. I went there this morning and Internet was just fine from 8:00 am to about noon, when it went "thud" and was gone. Normally that is external, so I hooked up my Rocket Stick to finish a few things and asked the Office Administrator if she had called the ISP. She was on the phone to them. No one has internet, so no (hosted Exchange) Outlook email, but servers and printers are running.

Hookup:  ISP modem, business internet, 6 static IP addresses allocated. At this point, one IP for office, one IP for Wireless Guest access (no access to servers or network), and one IP for wireless POS solution for our ticketing system.

The modem is attached to an ISP Cisco 891 (?) box, which lashup is to provide high speed internet to the office. Hooked to this is a Juniper Netscreen VPN router / firewall. Hooked to the Juniper are a couple of HP Switches to distribute Internet services to employees and servers and the same switches provide employees with connection to folders and printers.

There is another router that has one of the Static IP addresses and provides wireless internet to the POS devices for ticketing.

Troubleshooting:  The internet is now out (as I stated earlier) and I went down to the room where the servers are and where the switches / internet gear is. I have passed by the Office Admin desk and she is being fed ISP Pablum that it is our issue and not theirs.  Turns out in hindsight it was our issue but that certainly was not apparent at the time. We asked for technical support onsite and they agreed.

We rent our facilities from a host and we learn their internet went out at the same time. Too easy! Must be an external issue but it turns out it was not.

By this time, I have contacted my colleague who normally provides IT services to this client but he was engaged at another of our clients today. We start talking.  I do not see any lights (even power) on the Cisco 891. Let's restart it. We do - nothing. It turns out Cisco (who cannot design user friendly gear for love nor money) puts the lights on the back where you cannot see them. I pull out the Netscreen and Cisco and see the lights.

Did you restart the Juniper Netscreen? No, I said. Let's restart it. But I will observe here that both he (somewhere else) and I (on my Rocket Stick) could make tunnels to the Netscreen. He disconnects and I restart the Netscreen. Nothing. He logs into the Netscreen and tells me the Netscreen CPU is running full tilt and orange on the dashboard. Why?

Did you restart the HP Switches?  No, I said. If you want to do that, I need everyone to log off so they do not lose work. I go to the office, gather everyone and tell them to log off immediately. They do. Back to the server room, unplug / re-plug the HP Switches. Nothing. Now the switches, router, Cisco box have all been restarted. Let's restart the ISP Modem. We do and nothing. My colleague tells me that his pings through the VPN tunnel are 4/5 lost and 1/5 connect.

Back to the office and the ISP Technician has arrived. We know each other because he has been here before (a year ago) when the ISP put in the Cisco box to raise speeds. I am happy to have him instead of a stranger. We go down to the server room and he asks: Did you restart this, that and the other thing? Yes said I.

He checks connections. Nothing obviously wrong. He verifies there is no internet, calls "home" to check a few things about our system, starts his computer. It is now a couple of hours or so later and still no internet.

He connects to the POS router (which is a different IP on the same ISP modem) and gets a decent signal. Interesting but that is not our business IP. No internet at the business IP address.

Now, have you grasped all this? Have you started to form an opinion?

He asked:  Have you restarted the POS router? No, I have not. Different system I said. He agrees. But he says, I am going to restart it. He does. After a couple of minutes, Internet to the business has returned, we walk down to the office and all is well. My colleague sends me an email that the Juniper Netscreen CPU is back to normal.

The question: What was wrong, and what caused the internet to go out (or at least to be so utterly slow as to not respond)?

Have you seen this before? Any thoughts?
John, Business Consultant (Owner), Asked:
 
Qlemo, Batchelor, Developer and EE Topic Advisor, Commented:
I agree packets going back and forth between routers (because of erroneous routing info) or switches (network loops) would lead to such behaviour.

The Juniper not able to deliver packets because of issues with the modem could explain a high CPU load. But since the VPN worked fine, that is unlikely.

I assume the POS router sent packets to Juniper instead of to the modem, but the POS traffic would have been severely hit by that too.

Maybe the POS router led to a change in the port settings (speed or duplex) on the modem, so a lot of traffic congestion happened (undetected?) between Juniper and ISP modem, and that could be fixed only by restarting or disconnecting the POS router. That would fit the CPU load and outgoing traffic not going through unless initiated from outside.

In hindsight I would have just disconnected the POS router from the modem to check if it was the culprit. Too late now, of course ;-).
 
Steve McCarthy, MCSE, MCSA, MCP x8, Network+, i-Net+, A+, CIWA, CCNA, FDLE FCIC, HIPAA Security Officer, IT Consultant, Network Engineer, Windows Network Administrator, VMware Administrator, Commented:
So, without a diagram of the connections I can't be sure, BUT it sounds to me that the POS router is not the silo it should be.  It sounds to me that there is a configuration issue on the network.  I would first say to look at this router as providing DHCP which your PCs are picking up, but if the rest of the network is running well and utilizes DHCP that would not be the issue.

Instead, it sounds to me like it is a DNS issue.  I understand that this router is a separate system, but if you restart it and it supposedly has no purpose with the normal running of the network, just POS and the POS Internet, and does restore the Internet, then that is where I am looking.  If this device is causing the DNS issue, then you would be getting timeouts.

So, one thing to check is where DNS is supposed to be taking place and ensure that it is setup properly there.  Make sure there are not forwarders or conditional forwarders to this router.  In a good working environment, this should not cause the problem you are having.

Yes, I have seen this problem before.

For networks, my best advice is to use the KISS method.  I think something is misconfigured in your network and I would bet it revolves around DNS pointing somewhere to this router or this router providing some of that functionality.
 
John, Business Consultant (Owner), Author Commented:
Thank you for your reply. It was not a DHCP issue. Servers are providing DHCP and that was all working. It was not a DNS issue at least at the source.

"it sounds to me that the POS router is not the silo it should be."
"It sounds to me that there is a configuration issue on the network."

Decent observations. The POS router is not a complete silo in that its internet is from the same ISP and same ISP modem as the business internet. That is not an uncommon configuration in a small business. The lack of being a completely isolated silo for the POS router plays into the issue, I think it is fair to say. I did allow that restarting the POS router brought back internet.

It is not (to my way of thinking) a network configuration issue. The basic network is simple:  ISP modem -> Juniper Netscreen Router -> Switches and company machines.

If this device is causing the DNS issue, then you would be getting timeouts.  <- The POS router was not causing a DNS issue. It appeared to be having a different issue.  Business DNS by the way is supplied by the Server 2012 R2 Domain Controller as is DHCP. DHCP and DNS have nothing to do with the routers.

So I am not being critical or dismissive, but the issue was not a DNS issue.

Thank you again for responding.
 
ecarbone Commented:
My thoughts:

1. Your DNS Servers have the POS router as their gateway or forwarder. And because the POS router was acting all wonky, your DNS servers could not resolve external addresses. Once the POS router was rebooted, DNS lookups were back to normal. OR, your DNS servers couldn't see that router as it was rebooting, so it failed over to the other router.

or
2. There was some major network activity happening that had priority/QoS and it was hogging up all of your bandwidth.

or
3. Pure coincidence because the on-site ISP tech already had someone on the outside resolving the issue and maybe they resolved it (externally) at the same time he was rebooting the POS router.
 
Steve McCarthy, MCSE, MCSA, MCP x8, Network+, i-Net+, A+, CIWA, CCNA, FDLE FCIC, HIPAA Security Officer, IT Consultant, Network Engineer, Windows Network Administrator, VMware Administrator, Commented:
Maybe that is not the issue, and I mentioned above that it was not a DHCP issue as the other things worked. I agree somewhat that the DHCP and DNS have nothing to do with routers.  Somewhat....  My premise was that something might be pointing DNS to that router or that possibly it was setup as a forwarder.  So in those situations, it DOES have everything to do with the router.
 
Qlemo, Batchelor, Developer and EE Topic Advisor, Commented:
The POS router connects to ISP modem or Juniper?
 
Sanga Collins, Systems Admin, Commented:
If I see a situation where

1. Users are unable to surf the internet because it's super slow
2. The Juniper VPN stays up and I can log in and see CPU at 80% +++
3. ping from lan to wan or wan to lan drops 75% of packets

Usually it ends up being a network loop where someone has plugged a cable from one network port accidentally into another.

On my linux laptop once on the lan I usually run arp-scan -l to check the network ARP table and it will show the duplicate IPs to help track down where the issue is.
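A rough sketch of that check in Python, for anyone who wants to automate it. It assumes the scan output has already been saved as text and that each host line is tab-separated as IP, MAC, vendor; the function name and the sample addresses below are made up for illustration and are not taken from this network.

```python
from collections import defaultdict


def find_duplicate_ips(arp_scan_output):
    """Return {ip: [mac, ...]} for every IP that answered from more than one MAC."""
    macs_by_ip = defaultdict(set)
    for line in arp_scan_output.splitlines():
        fields = line.split("\t")
        # Host lines are assumed to look like "IP<TAB>MAC<TAB>vendor".
        # Banner and summary lines do not have that shape and are skipped.
        if len(fields) < 2:
            continue
        ip = fields[0].strip()
        mac = fields[1].strip().lower()
        if ip.count(".") == 3 and mac.count(":") == 5:
            macs_by_ip[ip].add(mac)
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical sample output; every address here is invented.
SAMPLE = "\n".join([
    "Interface: eth0",
    "192.168.1.1\t00:aa:bb:cc:dd:01\tExample Vendor A",
    "192.168.1.10\t00:11:22:33:44:55\tExample Vendor B",
    "192.168.1.10\t66:77:88:99:AA:BB\tExample Vendor C",
    "",
    "3 packets received",
])

if __name__ == "__main__":
    for ip, macs in find_duplicate_ips(SAMPLE).items():
        print(ip, "answered from", ", ".join(macs))
```

An IP that shows up with two different MACs is worth chasing down at the switch port level; a clean result does not rule out a loop, it only rules out this one symptom.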


Your issue could be completely different of course. Or it could be the same. Not a lot of information in the post to make a good judgement call.
 
John, Business Consultant (Owner), Author Commented:
Qlemo wrote:  "The POS router connects to ISP modem or Juniper?"  Not in any way directly to the Juniper but both are connected to the ISP modem.

Sanga Collins wrote: "Usually it ends up being a network loop where someone has plugged a cable from one network port accidentally into another."   <-- Good thought but the server room cabling appeared to be fine.

Steve wrote: "So in those situations, it DOES have everything to do with the router."   The issue certainly does have to do with the POS router.

There was traffic (packets) coming from the router that should not have been.
 
masnrock Commented:
I would've guessed that there was a flood of traffic destined to the POS router, but the ISP would've noticed that. So that said, I would assume that something was connected or configured incorrectly on the POS network. While you did say that the network was wireless, I would not be shocked if someone connected something in a way that caused traffic on the network that didn't belong. An example that comes to mind is that the POS router was connected to the ISP modem via a LAN port instead of the WAN port. That would be all sorts of wrong, but would make sense for traffic to be going from the router that should not have been.
 
John, Business Consultant (Owner), Author Commented:
That is closer now. The packets were from the router, not to the router. The ISP technician was on site with me and would certainly have noticed incoming traffic.
 
John, Business Consultant (Owner), Author Commented:
Thanks all.

We did not know that the POS router was the cause initially partly because the router was able to provide internet where our system could not.

The solution is murky (which is not entirely satisfactory). The ISP technician believes the (cheap) POS router created a broadcast storm which does have the capacity to disable internet. We know that this broadcast storm caused high Juniper CPU usage because it has been configured to stop such storms.
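For anyone wanting to confirm a storm like this from a packet capture, here is a minimal Python sketch of the idea: count broadcast frames per source MAC over the capture window and flag any source above a rate limit. It assumes the capture has already been reduced to (source MAC, destination MAC) pairs; the function name, the MAC addresses and the 100 frames per second limit are illustrative assumptions, not values measured at this site.

```python
from collections import Counter

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"


def broadcast_talkers(frames, window_seconds, limit_per_second):
    """Return {src_mac: broadcast_frame_count} for sources above the limit.

    frames: iterable of (src_mac, dst_mac) pairs seen during the window.
    """
    counts = Counter(
        src.lower() for src, dst in frames if dst.lower() == BROADCAST_MAC
    )
    # Total broadcast frames one source may send in the window before it is flagged.
    allowed = limit_per_second * window_seconds
    return {src: n for src, n in counts.items() if n > allowed}


if __name__ == "__main__":
    # Hypothetical 10-second capture: one noisy device, one normal device.
    noisy = "aa:aa:aa:aa:aa:01"
    normal = "bb:bb:bb:bb:bb:02"
    capture = (
        [(noisy, BROADCAST_MAC)] * 5000
        + [(normal, BROADCAST_MAC)] * 20
        + [(normal, "cc:cc:cc:cc:cc:03")] * 1000
    )
    print(broadcast_talkers(capture, window_seconds=10, limit_per_second=100))
```

A sensible limit depends on the network; the point is only that a single source dominating broadcast traffic stands out quickly once frames are counted per MAC.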

As soon as we restarted, all this activity stopped and Internet returned.

Can it happen again?  Yes.  We will upgrade the router firmware early next week. Then we will talk to the Ticketing System software supplier and tell them we intend to replace the POS router with a Juniper router. This should end the difficulties.

Thanks.
 
John, Business Consultant (Owner), Author Commented:
Thank you.
Question has a verified solution.
