• Status: Solved
  • Priority: Medium
  • Security: Public

Loss of connectivity...kind of

We're running a Windows Server 2003 R2 SP2 box with VMware Server 1.0.2. At all times of the day and night we get an SNMP message saying the server has gone offline; pinging it by name and by IP gets no response. Ping the virtual machines hosted on that box, though, and they all work OK, with no loss of connectivity at all. Usually connectivity is restored of its own accord within a few minutes. There's nothing in the Event Logs that seems to correspond with the times it's reported as being offline. There's only one NIC in use on the server, so the VMs have to be talking out through the NIC we can't ping when it goes offline!

Anyone got any ideas?
Asked by: backoffice
 
Netman66 Commented:
 
backoffice (Author) Commented:
Thanks for the reply.

Sounds similar, but the machine's not a DC and is running SP2; it also has the latest NIC drivers and firmware. Working down the list in that support article:
1. DFS isn't installed.
2. Internet Connection Firewall, Internet Connection Sharing, and Routing and Remote Access are all either disabled or not installed.
3. The IP NAT Driver is already stopped.
4. RRAS is disabled.
5. This was removed in W2K3 SP1 and we're running SP2, so not applicable I think.
 
Netman66 Commented:
You might also want to check for any power management settings, both on the NIC and for the OS (hard drive spin down, etc).
 
Netman66 Commented:
I also read something about DNS where it stops serving queries.

They suggest clearing the cache occasionally.
 
backoffice (Author) Commented:
I've checked the power management; it's set to always on. The NIC doesn't seem to have any separate settings. Also, I'd have thought this would affect all the virtual machines on the box too, and they'd all become unreachable at the same time, which isn't happening - so far it's just been the host.

I've flushed the cache. Interestingly, as I'm writing this we've had some traps come through to say one of the VMs hosted on that machine is unreachable and then, 10ish seconds later, reachable again. No reports of the host or other VMs losing connectivity at the same time though....
 
Netman66 Commented:
Strange.....

If DNS stops resolving then things would certainly be unreachable by name. However, your monitoring software could be set to use IP addresses, which would tell you whether DNS is involved, since monitoring by IP should be unaffected by a DNS outage.
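For anyone wanting to run the name-versus-IP check by hand, here is a minimal Python sketch (not from the thread). The function names are made up for illustration, and the ping flags assume the stock Windows or Linux `ping` command; treat it as a starting point rather than a finished monitor.

```python
import socket
import subprocess
import sys


def resolves(name: str) -> bool:
    """Return True if the system resolver can turn the name into an address."""
    try:
        socket.gethostbyname(name)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def ping(target: str, timeout_s: int = 2) -> bool:
    """Send one echo request with the system ping command (assumed flags)."""
    if sys.platform.startswith("win"):
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), target]
    else:
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), target]
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def diagnose(name_resolves: bool, ip_answers: bool) -> str:
    """Classify one probe: a DNS problem, a network problem, or healthy."""
    if not ip_answers:
        return "network"  # unreachable by IP, so DNS cannot be the whole story
    if not name_resolves:
        return "dns"      # IP answers but the name does not resolve
    return "ok"


def check_host(name: str, ip: str) -> str:
    """Probe a host by name and by IP and classify the result."""
    return diagnose(resolves(name), ping(ip))
```

Calling `check_host` with the host's name and its IP address (both placeholders you would fill in) while it is reported offline shows whether the outage follows the name or the address.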
 
backoffice (Author) Commented:
Certainly is!

When it does lose connectivity it becomes unpingable by both name and IP. If it wasn't for the VMs on the box still being contactable I'd have put it down to a faulty lead, NIC, switch port etc. I'm just stumped at how they all use the same physical hardware, ports etc but it's affecting them as though they were individual machines.
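One way to confirm that the host and VM outages really are independent is to log ping results with timestamps and compare the outage windows. A rough Python sketch (not from the thread), assuming each log is a list of (timestamp, was_up) samples in time order; all names here are hypothetical:

```python
from typing import List, Tuple

Sample = Tuple[float, bool]   # (timestamp in seconds, True if the ping was answered)
Window = Tuple[float, float]  # (first failed sample, first good sample after it)


def outage_windows(samples: List[Sample]) -> List[Window]:
    """Collapse consecutive failed samples into (start, end) outage windows."""
    windows: List[Window] = []
    start = None
    last_t = None
    for t, up in samples:
        if not up and start is None:
            start = t                   # outage begins
        elif up and start is not None:
            windows.append((start, t))  # outage ends at the first good sample
            start = None
        last_t = t
    if start is not None:
        windows.append((start, last_t))  # still down at the end of the log
    return windows


def overlaps(a: Window, b: Window) -> bool:
    """True if two windows share any span of time."""
    return a[0] < b[1] and b[0] < a[1]


def shared_outages(host: List[Window], vm: List[Window]) -> List[Tuple[Window, Window]]:
    """Pairs of host and VM outages that coincide in time."""
    return [(h, v) for h in host for v in vm if overlaps(h, v)]
```

If the physical link or switch port were at fault, you would expect `shared_outages` to return pairs; an empty list points at something per machine, such as the virtual switch or the host's own network stack.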
 
Netman66 Commented:
I wonder if this is a virtual switch issue within VMware?

You could attempt to install a second NIC just for the host and let the VMs have the original NIC.

This may tell us if this is a virtual circuit issue, since the VMs must then use the physical switch and another NIC to talk to the host.
 
Netman66 Commented:
Are you running McAfee on the host?
 
backoffice (Author) Commented:
We can sort out the second NIC pretty soon, as the server has two that were teamed; we unteamed them as part of troubleshooting the issue. I'll try to configure that this afternoon if possible.

Nope, no McAfee. We're running Lightspeed's TTC Security Agent on there.

Thanks for the help - I'll post again when I've sorted the second NIC out.
 
Netman66 Commented:
I wonder if the teaming software was causing this? It sounds like a Compaq - if so, have you installed the latest PSP?
 
Netman66 Commented:
There is also an issue with SP2 and Receive Side Scaling - if your NICs support this, disable it on them directly.
 
backoffice (Author) Commented:
Yep, it's an HP DL360 G5 with the latest PSP installed.

Just been and disabled Receive Side Scaling on the NICs. I'll leave it a while and see if we get anything after that before doing the VM/second NIC thing.
 
Netman66 Commented:
Sounds like the right thing to do.
 
backoffice (Author) Commented:
Update - the host machine has not dropped its connection as of yet. The VMs on it, though, are now all dropping their connections for 1-2 seconds, individually rather than all at the same time.
 
Netman66 Commented:
Bizarre....

We've shifted the issue into the VMs now.

Is there DNS on the VMs? Perhaps clear their caches too.
 
backoffice (Author) Commented:
No joy I'm afraid - only about 10 minutes after flushing the DNS cache on the VMs, one of them dropped its connection again.
 
Netman66 Commented:
I guess the next step is separating the VM and host NICs so we can further isolate the issue.
 
backoffice (Author) Commented:
Yep, will do. Unfortunately I won't have a chance to do this until tomorrow now, but I'll post again as soon as it's done. Thanks again for your help today.
 
Netman66 Commented:
No problem. Lots of weird things happen when you host multiple domains or even forests on the same subnet and try to get them to communicate properly.

Since AD Sites and Services depends on subnet info, as does replication, having the same subnet info for everything would be a bit confusing for it.

The same goes for DNS - my bet is that the Reverse Lookup zones are interfering with each other since they host the same subnet.

At any rate, it's best to separate them so there is real routing going on between them.
 
backoffice (Author) Commented:
Just to update you, I've not had a chance to look at this again since, as some higher priority problems came up... best estimate at the moment is this afternoon or Monday morning now.
 
Netman66 Commented:
No problem.
 
backoffice (Author) Commented:
Well, I was due to look at this today, but going through my Inbox this morning I've not had any alerts to say that any other VMs have dropped off since the weekend. Maybe they just needed a while to take in the settings... either way, it seemed to get a lot better once the Receive Side Scaling setting had been changed, so I'll accept this as the solution. Thanks for all your help.
