Loss of connectivity...kind of

We're running Windows Server 2003 R2 SP2 with VMware Server 1.0.2. At all times of the day and night we get an SNMP message saying the server has gone offline; pinging it by name and by IP gets no response. Ping the virtual machines hosted on that box, though, and they all work OK, with no loss of connectivity at all. Usually connectivity is restored of its own accord within a few minutes. There's nothing in the Event Logs that seems to correspond with the times it's reported as being offline. There's only one NIC in use on the server, so the VMs have to be talking out through the very NIC we can't ping when it goes offline!

Anyone got any ideas?
backofficeAsked:
 
Netman66Commented:
There is also an issue with SP2 and Receive Side Scaling - if your NICs support this, disable it on them directly.
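Before changing anything, it can help to see how the OS-level switch is currently set. The sketch below is only an illustration: it assumes the OS-wide RSS setting is the `EnableRSS` DWORD under the `Tcpip\Parameters` key (check Microsoft's documentation for your build), and the helper names are made up for this example. It does not touch the per-adapter option in the NIC driver's advanced properties, which is what "disable it on them directly" refers to.

```python
# Assumed location of the OS-level RSS switch on Windows Server 2003 SP2.
# Verify against Microsoft's documentation before relying on it.
KEY_PATH = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters"
VALUE_NAME = "EnableRSS"


def read_enable_rss():
    """Return the EnableRSS DWORD, or None if the value is not set (Windows only)."""
    import winreg  # imported here so the module still loads on other platforms

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
        try:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
            return value
        except OSError:
            # The value does not exist under the key.
            return None


def describe_rss(value):
    """Turn the raw registry value into a readable status line."""
    if value is None:
        # Assumption: with no explicit value, SP2 leaves RSS enabled.
        return "EnableRSS not set: OS default applies (assumed enabled on SP2)"
    if value == 0:
        return "RSS disabled at the OS level"
    return "RSS enabled at the OS level"
```

On the host itself, `print(describe_rss(read_enable_rss()))` would report the current state; the script only reads the registry and changes nothing.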

 
Netman66Commented:
0
 
backofficeAuthor Commented:
Thanks for the reply.

Sounds similar, but the machine's not a DC and is running SP2, and it also has the latest NIC drivers and firmware. Working down the list in that support article: 1, DFS isn't installed; 2, Internet Connection Firewall, Internet Connection Sharing, and Routing and Remote Access are all either disabled or not installed; 3, the IP NAT driver's already stopped; 4, RRAS is disabled; 5, this was removed in W2K3 SP1 and we're running SP2, so it's not applicable, I think.
 
Netman66Commented:
You might also want to check for any power management settings, both on the NIC and for the OS (hard drive spin down, etc).

 
Netman66Commented:
I also read something about DNS where it stops serving queries.

They suggest clearing the cache occasionally.

 
backofficeAuthor Commented:
I've checked the power management; it's set to always on. The NIC doesn't seem to have any separate settings. Also, I'd have thought that this would affect all the virtual machines on the box too, and they'd all become unreachable at the same time, which isn't happening. So far it's just been the host.

I've flushed the cache. Interestingly, as I'm writing this we've had some traps come through to say one of the VMs hosted on that machine is unreachable and then, 10ish seconds later, reachable again. No reports of the host or other VMs losing connectivity at the same time though....
 
Netman66Commented:
Strange.....

If DNS stops resolving then things would certainly be unreachable by name. However, your monitoring software could be set to use IP addresses, which would tell you whether DNS is involved, since monitoring by IP should be unaffected by a DNS outage.
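A rough way to run that name-versus-IP comparison by hand is sketched below. It is an illustration only: the function names are invented for this example, and a TCP connect stands in for ping, because sending ICMP from Python needs raw sockets and elevated rights.

```python
import socket


def resolve(name):
    """Return the IPv4 address for name, or None if the DNS lookup fails."""
    try:
        return socket.gethostbyname(name)
    except OSError:
        return None


def tcp_reachable(ip, port, timeout=2.0):
    """Return True if a TCP connection to ip:port opens within timeout seconds."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose(name_resolves, ip_reachable):
    """Separate a DNS failure from a plain connectivity failure."""
    if not ip_reachable:
        return "Connectivity problem: no answer by IP, so DNS is not the cause"
    if not name_resolves:
        return "DNS problem: host answers by IP but the name does not resolve"
    return "OK: name resolves and host answers by IP"
```

For example, `diagnose(resolve("vmhost01") is not None, tcp_reachable("192.0.2.10", 445))` would classify one probe, where the host name, address and port are placeholders for your own values. If the host stops answering by IP as well, DNS can be ruled out.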

 
backofficeAuthor Commented:
Certainly is!

When it does lose connectivity it becomes unpingable by both name and IP. If it wasn't for the VMs on the box still being contactable I'd have put it down to a faulty lead, NIC, switch port, etc. I'm just stumped at how they all use the same physical hardware, ports, etc., yet it's affecting them as though they were individual machines.
 
Netman66Commented:
I wonder if this is a virtual switch issue within VMWare?

You could attempt to install a second NIC just for the Host and let the VMs have the original NIC.

This may tell us if this is a Virtual circuit issue since the VMs must now use the physical switch and another NIC to talk to the host.

 
Netman66Commented:
Are you running McAfee on the host?
 
backofficeAuthor Commented:
We can sort out the second NIC pretty soon, as the server has two, which were teamed and then unteamed as part of troubleshooting the issue. I'll try to get that configured this afternoon if possible.

Nope, no McAfee. We're running Lightspeed's TTC Security Agent on there.

Thanks for the help - I'll post up when I've sorted the second NIC out
 
Netman66Commented:
I wonder if the Teaming software was causing this?  It sounds like a Compaq - if so, have you installed the latest PSP?

 
backofficeAuthor Commented:
Yep, it's an HP DL360 G5 with the latest PSP installed.

Just been and disabled Receive Side Scaling on the NICs. I'll leave it a while and see if we get anything after that before doing the VM/second NIC thing.
 
Netman66Commented:
Sounds like the right thing to do.

 
backofficeAuthor Commented:
Update - the host machine has not dropped its connection as of yet. The VMs on it, though, are now all dropping their connections for 1-2 seconds, not all at the same time but individually.
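To check whether the drops really are independent, one option is to log timestamped up/down probe results per machine and collapse them into outage windows. The sketch below assumes you already have such samples (for instance from SNMP traps or a ping loop); the function names and data shape are made up for this example.

```python
from datetime import datetime


def outage_windows(samples):
    """Collapse (timestamp, is_up) samples, sorted by time, into outage windows.

    Each window is (start, end): start is the first failed probe, end is the
    first successful probe after it, or None if the machine is still down.
    """
    windows = []
    start = None
    for timestamp, is_up in samples:
        if not is_up and start is None:
            start = timestamp                    # outage begins
        elif is_up and start is not None:
            windows.append((start, timestamp))   # outage ends
            start = None
    if start is not None:
        windows.append((start, None))            # still down at the last sample
    return windows


def overlaps(a, b):
    """Return True if two outage windows share any time."""
    a_start, a_end = a
    b_start, b_end = b
    a_end = a_end or datetime.max                # open-ended window
    b_end = b_end or datetime.max
    return a_start < b_end and b_start < a_end
```

If no window for one VM overlaps a window for the host or another VM, the outages are hitting each machine separately, which points away from the shared cable, NIC or switch port.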
 
Netman66Commented:
Bizarre....

We've shifted the issue into the VMs now.

Is there DNS on the VMs? If so, perhaps clear their caches too.

 
backofficeAuthor Commented:
No joy, I'm afraid - only about 10 minutes after flushing the DNS cache on the VMs, one of them dropped its connection again.
 
Netman66Commented:
I guess the next step is separating the VM and host NICs so we can further isolate the issue.

 
backofficeAuthor Commented:
Yep, will do. Unfortunately I won't have a chance to do this until tomorrow now but I'll post again as soon as it's done. Thanks again for your help today.
 
Netman66Commented:
No problem.  Lots of weird things happen when you host multiple domains or even forests on the same subnet and try to get them to communicate properly.

Since AD Sites and Services depends on subnet info, as does Replication, having the same subnet info for everything would be a bit confusing for it.

The same goes for DNS - my bet is that the Reverse Lookup zones are interfering with each other, since they host the same subnet.

At any rate, it's best to separate them so there is real routing going on between them.
 
backofficeAuthor Commented:
Just to update you: I've not had a chance to look at this again, as some higher priority problems came up. Best estimate at the moment is this afternoon or Monday morning.
 
Netman66Commented:
No problem.
 
backofficeAuthor Commented:
Well, I was due to look at this today, but going through my Inbox this morning I've not had any alerts to say that any other VMs have dropped off since the weekend. Maybe they just needed a while to take in the settings. Either way, it seemed to get a lot better once the Receive Side Scaling setting had been changed, so I'll accept this as the solution. Thanks for all your help.
Question has a verified solution.
