DigitalInfuzion asked:

VMware iSCSI reset on each hour

We have a strange problem: at the top of each hour (7:00, 8:00, 9:00, etc.) we get an iSCSI reset. I am working with Dell, EMC and VMware on this but still don't have an answer. The key issue seems to be that a VMware host drops connections when latency builds up and then does a power-on reset, which affects all hosts. I'm not sure why this happens at the top of every hour as opposed to a random time. Once the reset happens, everything is back to normal within a minute until the next hour. The EMC VNXe3200 does not report any errors; only the VMware hosts report errors, and this happens on all 3 hosts at the same time.

iSCSI configuration: We have 3 Dell R630s running VMware ESXi 6.0 (2809209), which connect to an EMC VNXe 3200 (SP A/SP B, Active/Active) through two separate Dell 6224 (1 Gb) switches. Each host has 2 connections to the SAN, on two different subnets. MTU is 1500. This is a separate iSCSI-only network.


When the reset occurs, the symptom in the applications is one of network loss (Outlook can't connect, then connects OK).
We did have only 1 LUN configured, though I'm in the process of adding more LUNs and distributing the VMs across them to help with I/O. It seems that some process must kick off at the top of each hour, but I can't find it, and all 3 vendors say they don't have any idea what it could be.

Any help would be much appreciated. - Thanks...
Julian Parker

What network driver is being used on the guest VMs?

Have you tried increasing the MTU? Be aware that in some cases this can make things worse; we had issues with it in one of our environments.

Alas, we don't use Dell kit and don't have that model NAS :-(

Can you monitor the systems to find out what each is doing prior to the fault showing? Is it heavy on the Network IO?
DigitalInfuzion (ASKER)

We upgraded the Dell NIC Firmware and VMware NIC driver (Broadcom) to the latest version.  I can't change the MTU, unless I bring down the whole environment which is something that is hard to do at this time.

VMware Errors:
Lost access to volume ... due to connectivity issues.  Recovery attempt is in progress and outcome will be reported shortly.
Sometimes I get: Path redundancy to storage device ... degraded.
Successfully restored access to volume ... following connectivity issues.
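
To measure how long each outage actually lasts, the "Lost access" and "Successfully restored access" events above can be paired by timestamp. The following is a minimal Python sketch, not a vendor tool. It assumes each log line begins with an ISO 8601 timestamp (adjust the pattern if your export differs); the function name and the sample lines are hypothetical and are not taken from the attached logs.

```python
import re
from datetime import datetime

# Assumption: each line begins with an ISO 8601 timestamp such as
# 2015-10-01T07:00:03.120Z. Fractions of a second are ignored.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")

def outage_durations(lines):
    """Pair each 'Lost access' event with the next 'restored access' event.

    Returns a list of (lost_time, seconds_until_restored) tuples.
    Simplification: all volumes are treated as one stream of events.
    """
    results = []
    lost_at = None
    for line in lines:
        m = TS.match(line)
        if not m:
            continue  # skip lines without a leading timestamp
        ts = datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S")
        if "Lost access to volume" in line and lost_at is None:
            lost_at = ts
        elif "Successfully restored access to volume" in line and lost_at is not None:
            results.append((lost_at, (ts - lost_at).total_seconds()))
            lost_at = None
    return results

# Hypothetical sample lines, for illustration only.
sample = [
    "2015-10-01T07:00:03.120Z Lost access to volume ... due to connectivity issues.",
    "2015-10-01T07:00:41.870Z Successfully restored access to volume ... following connectivity issues.",
    "2015-10-01T08:00:02.000Z Lost access to volume ... due to connectivity issues.",
    "2015-10-01T08:00:30.000Z Successfully restored access to volume ... following connectivity issues.",
]

for lost, seconds in outage_durations(sample):
    print(lost.isoformat(), seconds)
```

With several LUNs in play, it would be more accurate to key the pairing on the volume name rather than treating all events as a single stream.
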
So does the VM host have 2 physical NICs assigned, or is it all in a vSwitch and the guests are using the vmnic as opposed to, say, e1000?

Have you checked the physical switch connectivity/error logs? Changed ports/VLANs?

It could still be a physical cable!

You need to go thru all the logs on the guest, the ESX server, the switch and anything else in between and get as much info as possible. I would expect that the support personnel from Dell et al. have gone thru all of this, but I'm surprised they came up with nothing!

Feel free to attach logs here. The more info you can give, the easier it is for people to make informed comments, which in the long run also helps others.
I have attached a network diagram and Host 1 logs for events that start at the top of an hour.  These logs are very similar across each host (1/2/3) and each LUN (1/2/3).   They occur every hour on the hour.

The one thing we can't answer is why this happens every hour on the hour.  What is the trigger?  It seems like some process must kick off then, otherwise if it was only a network/driver/hardware/iSCSI problem it would happen more randomly.
iSCSI-Topology1.png
Host1-Logs.txt
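
One way to confirm the "every hour on the hour" pattern across hosts is to bucket the events by minute of the hour. This is a minimal Python sketch under the same assumption that each log line starts with an ISO 8601 timestamp; the function name, the search string default and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Assumption: lines start with an ISO 8601 timestamp; capture only the minute.
TS_MINUTE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:(\d{2}):\d{2}")

def minute_histogram(lines, needle="Lost access to volume"):
    """Count matching events per minute-of-hour (0..59)."""
    counts = Counter()
    for line in lines:
        m = TS_MINUTE.match(line)
        if m and needle in line:
            counts[int(m.group(1))] += 1
    return counts

# Hypothetical sample lines, for illustration only.
sample_events = [
    "2015-10-01T07:00:03.120Z Lost access to volume ... due to connectivity issues.",
    "2015-10-01T08:00:02.000Z Lost access to volume ... due to connectivity issues.",
    "2015-10-01T09:01:10.500Z Lost access to volume ... due to connectivity issues.",
    "2015-10-01T10:37:44.000Z Lost access to volume ... due to connectivity issues.",
    "2015-10-01T10:38:00.000Z Successfully restored access to volume ... following connectivity issues.",
]

hist = minute_histogram(sample_events)
for minute in sorted(hist):
    print(f"minute {minute:02d}: {hist[minute]} event(s)")
```

If nearly all events land in minute 00 or 01 on every host, that points to a scheduled trigger somewhere on the network rather than a random hardware or driver fault.
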
What's the guest OS on the hosts?
We have multiple VM guests, mainly Windows 2008 R2  and 2012 R2 with a few Windows 7, 8, 10 and a couple Ubuntu servers.   We can move these around to various hosts within the cluster and we still have the same pattern (every hour on the hour we get a power on reset from VMware).

We have now upgraded to ESXi 6.0.0 3029758 (update 1) and still have the issues.  Dell has provided the proper drivers/firmware which is all up to date.  Still no errors on the EMC side of things.  I've sent many logs to VMware and they are still reviewing them.
ASKER CERTIFIED SOLUTION
Julian Parker
The switch was the key. It turns out that the switch management port and the iSCSI network ended up connected to the same LAN. This allowed corporate network traffic to interfere with the iSCSI network even though they are on different subnets. Disconnecting the management port resolved the issue. Why this only happened on the hour, who knows? Thank you for your assistance with this.
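
For anyone chasing a similar problem, one rough way to spot this kind of cross-connection is to collect the source addresses seen on the iSCSI segment (for example from a packet capture) and flag any that fall outside the iSCSI subnets. A minimal Python sketch follows; the subnets and addresses are hypothetical placeholders, since the real ones are not given in this thread, and the function name is made up for illustration.

```python
import ipaddress

def foreign_sources(observed_ips, iscsi_subnets):
    """Return the observed addresses that do not belong to any iSCSI subnet,
    de-duplicated and sorted numerically."""
    nets = [ipaddress.ip_network(s) for s in iscsi_subnets]
    foreign = {
        ip for ip in observed_ips
        if not any(ipaddress.ip_address(ip) in net for net in nets)
    }
    return sorted(foreign, key=ipaddress.ip_address)

# Hypothetical values, for illustration only.
iscsi_subnets = ["10.10.1.0/24", "10.10.2.0/24"]
observed = [
    "10.10.1.11",    # host iSCSI port
    "10.10.2.11",    # host iSCSI port
    "192.168.5.20",  # does not belong on the iSCSI segment
    "10.10.1.50",    # storage processor port
    "172.16.0.9",    # does not belong on the iSCSI segment
    "192.168.5.20",  # duplicate sighting
]

for ip in foreign_sources(observed, iscsi_subnets):
    print("unexpected source on iSCSI segment:", ip)
```

This only catches IP traffic with a foreign source address. Layer 2 traffic leaking in from the corporate LAN, such as broadcasts, would need to be looked for in the capture itself.
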