
ESXi Hosts repeatedly logging in/out of 10GbE iSCSI targets

msidnam
I have a strange problem. Our ESXi hosts keep logging in and out of the 10GbE targets on our SAN (Nexsan E18). I've been working with Nexsan support and VMware, but everything seems to be in order: jumbo frames are enabled, the switch settings are correct, and the MTU settings on the vSwitch and port groups are set correctly. Everything is working and we can see the datastores, but the constant logins and logouts sometimes lock up the 10GbE ports. We have:

Nexsan E18
PowerConnect 5524 switch
10GbE ports on the 5524 connected to the Nexsan
ESXi 4.1 hosts
Jumbo frames enabled on the Nexsan and the ESXi hosts
ESXi hosts connected via 1GbE
The PowerConnect 5524 handles only the iSCSI traffic. It's completely separate from our data and VoIP traffic.

Nexsan support sees around 13 logins/logouts in the space of 2 minutes. When we connect to the 1GbE iSCSI port on the Nexsan, we do not see this behavior. Unfortunately, I don't have another switch with 10GbE ports to test with. I can vmkping -s 8000 to the 10GbE ports but not 9000. VMware told me to set the MTU on the switch to something above 9000. Dell told me that when you "iscsi enable" on the switch, it already sets the MTU to 9216. I have a NetApp (connected to 1GbE iSCSI) and I can vmkping it with 9000.
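
A note on the vmkping sizes above: the -s value is the ICMP payload size, not the frame size. The IPv4 header (20 bytes, no options) and the ICMP echo header (8 bytes) are added on top, so the largest payload that fits in a single 9000-byte MTU packet is 8972. The sketch below only works through that arithmetic. Whether it explains the failure at 9000 here is an assumption, since the thread does not say whether the don't-fragment flag was set or how each target handles fragmented pings.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def max_icmp_payload(mtu: int) -> int:
    """Largest ping payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def fits_unfragmented(payload: int, mtu: int) -> bool:
    """True if a ping with this payload needs no fragmentation at this MTU."""
    return payload <= max_icmp_payload(mtu)


if __name__ == "__main__":
    for size in (8000, 8972, 9000):
        print(size, fits_unfragmented(size, 9000))
```

In practice the usual end-to-end jumbo frame check is vmkping -d -s 8972 against the target, where -d sets the don't-fragment bit.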

Has anyone else come across anything like this?  
Commented:
Probably a bad cable or a bad 10 Gb port on either the switch or SAN.  Do you have any spare cables or other 10Gb devices that you can use for testing?

Author

Commented:
We have a duplicate setup for DR and it happens on that setup as well: different cables, switches, hosts and Nexsan E18.
Paul Solovyovsky, Senior IT Advisor
Top Expert 2008

Commented:
I have seen some switches that can't handle auto-detection and keep trying to renegotiate the connection.

I would recommend connecting a crossover cable between the ESXi host and the SAN.  This will isolate the NIC.

If the issue still occurs, I would use a 1Gb NIC and connect the 10Gb port to another server, then mount a LUN from Windows. This will isolate ESXi and/or the SAN.

I would also take a look at your actual traffic. In most cases the drives on your SAN will be the choke point before your network connection. If you don't see traffic reaching 1Gb of throughput (about 600Mbps in reality), I would configure two 1Gb NICs, configure link aggregation, and multipath. This will give you not only throughput but also load balancing and failover.

My $.02

Commented:
If 2 different Nexsan E18s are having the same problem, then it may be a negotiation issue with the PowerConnect 5524. As a test, can you try connecting to the 10Gb port, but without jumbo frames?
Andrew Hancock (VMware vExpert / EE Fellow), VMware and Virtualization Consultant
Fellow 2018
Expert of the Year 2017

Commented:
Are you running ESXi 4.1 U1?

What is the make/model of the 10GbE NIC?

Commented:
Is the 10Gb NIC using copper or fibre?

Author

Commented:
Thank you for all the suggestions. I'm going to answer below. I am very new to 10GbE, so my apologies if I'm not answering correctly.

I don't have 10GbE NICs on the hosts. We only have 1GbE. The only 10GbE is on the Nexsan itself, connected to the switch via SFP (they are from Cisco and came with the Nexsan). I am assuming they are copper but I could be wrong.

I am running ESXi 4.1, no Update 1. I'm not sure of the make/model of the NIC on the Nexsans. If we take the 10GbE out of the picture and only connect to the 1GbE iSCSI Nexsan ports, we don't see the logins/logouts and everything is stable.

I've noticed in my port group that the NICs are not set to auto-negotiate. They are at 1000 full duplex. So I tried dropping the ports on the switch to 1000 full duplex, but we still see the logins/logouts. Maybe I should change the port group to auto-negotiate?

It's very baffling. Nexsan can't duplicate it in their labs.
Andrew Hancock (VMware vExpert / EE Fellow), VMware and Virtualization Consultant

Commented:
Okay, what are the make and model of the NICs in the server?

Are the NICs making and breaking the link between the host and the physical switch?

Author

Commented:
The NICs on the server: Broadcom Corporation Broadcom NetXtreme II BCM5709 1000Base-T.

Here is a log of the session dropping and then coming back from the vmware side:
iscsivmk_ConnRxNotifyFailure: vmhba39:CH:1 T:1 CN:0: Connection rx notifying failure: iSCSI Task Not Found. State=Online
messages.0.gz:Aug  4 13:11:05 vmkernel: 1:00:20:18.372 cpu1:4851)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: Sess [ISID: 00023d000002 TARGET: iqn.1999-02.com.nexsan:p5-10ge:nxs-b01-000:01b612bd TPGT: 2 TSIH: 0]
messages.0.gz:Aug  4 13:11:05 vmkernel: 1:00:20:18.372 cpu1:4851)iscsi_vmk: iscsivmk_ConnRxNotifyFailure: Conn [CID: 0 L: 172.0.0.22:60098 R: 172.0.0.26:3260]
messages.0.gz:Aug  4 13:11:05 vmkernel: 1:00:20:18.372 cpu1:4851)WARNING: iscsi_vmk: iscsivmk_StopConnection: vmhba39:CH:1 T:1 CN:0: iSCSI connection is being marked "OFFLINE" (Event:6)
messages.0.gz:Aug  4 13:11:05 vmkernel: 1:00:20:18.372 cpu1:4851)WARNING: iscsi_vmk: iscsivmk_StopConnection: Sess [ISID: 00023d000002 TARGET: iqn.1999-02.com.nexsan:p5-10ge:nxs-b01-000:01b612bd TPGT: 2 TSIH: 0]
messages.0.gz:Aug  4 13:11:05 vmkernel: 1:00:20:18.372 cpu1:4851)WARNING: iscsi_vmk: iscsivmk_StopConnection: Conn [CID: 0 L: 172.0.0.22:60098 R: 172.0.0.26:3260]
messages.0.gz:Aug  4 13:11:09 vmkernel: 1:00:20:22.147 cpu1:4851)iscsi_vmk: iscsivmk_ConnNetRegister: socket 0x4100ae164510 network resource pool netsched.pools.persist.iscsi associated
messages.0.gz:Aug  4 13:11:09 vmkernel: 1:00:20:22.540 cpu1:4851)WARNING: iscsi_vmk: iscsivmk_StartConnection: vmhba39:CH:1 T:1 CN:0: iSCSI connection is being marked "ONLINE"
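
For anyone wanting to quantify these drops from the host side, the OFFLINE/ONLINE markers in the log above can be pulled out and paired up. The following is a rough sketch, not a VMware tool: the regular expression is written only against the two marker lines quoted in this thread, the function names are made up for illustration, and it assumes each OFFLINE/ONLINE pair falls within the same day.

```python
import re
from datetime import datetime

# Pattern modelled on the StopConnection/StartConnection lines quoted above.
EVENT_RE = re.compile(
    r'(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) vmkernel: \S+ cpu\d+:\d+\)'
    r'(?:WARNING: )?iscsi_vmk: iscsivmk_(?:Stop|Start)Connection: '
    r'(?P<path>vmhba\d+:CH:\d+ T:\d+ CN:\d+): '
    r'iSCSI connection is being marked "(?P<state>OFFLINE|ONLINE)"'
)


def connection_events(text):
    """Return (timestamp, path, state) for each OFFLINE/ONLINE marker."""
    return [(m["stamp"], m["path"], m["state"]) for m in EVENT_RE.finditer(text)]


def outage_seconds(events):
    """Pair each OFFLINE with the next ONLINE on the same path (same day only)."""
    went_down = {}
    durations = []
    for stamp, path, state in events:
        clock = datetime.strptime(stamp.split()[-1], "%H:%M:%S")
        if state == "OFFLINE":
            went_down[path] = clock
        elif path in went_down:
            gap = (clock - went_down.pop(path)).total_seconds()
            durations.append((path, gap))
    return durations


# The two marker lines from the log in this thread.
SAMPLE = (
    'Aug  4 13:11:05 vmkernel: 1:00:20:18.372 cpu1:4851)WARNING: iscsi_vmk: '
    'iscsivmk_StopConnection: vmhba39:CH:1 T:1 CN:0: iSCSI connection is being '
    'marked "OFFLINE" (Event:6) '
    'Aug  4 13:11:09 vmkernel: 1:00:20:22.540 cpu1:4851)WARNING: iscsi_vmk: '
    'iscsivmk_StartConnection: vmhba39:CH:1 T:1 CN:0: iSCSI connection is being '
    'marked "ONLINE"'
)

if __name__ == "__main__":
    for path, gap in outage_seconds(connection_events(SAMPLE)):
        print(f"{path}: offline for {gap:.0f}s")
```

On the excerpt above this reports a single drop on vmhba39:CH:1 T:1 CN:0 lasting 4 seconds, which matches the 13:11:05 and 13:11:09 timestamps.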

On the Nexsan side we see this:

0027:C1 07-Aug-2011 at 17:53:47:(I): [0] TCP DEV 141 Connected to 172.0.0.9:52033
0028:C0 07-Aug-2011 at 17:53:46:(I): [1] TCP DEV 254 Connected to 172.0.0.6:59711
0029:C1 07-Aug-2011 at 17:53:45:(I): [1] TCP DEV 108 Connected to 172.0.0.6:53187
0030:C0 07-Aug-2011 at 17:53:44:(I): [1] TCP DEV 253 Connected to 172.0.0.8:61154
0031:C0 07-Aug-2011 at 17:53:41:(I): [1] TCP DEV 252 Connected to 172.0.0.4:57413
0032:C0 07-Aug-2011 at 17:53:41:(I): [0] TCP DEV 370 Connected to 172.0.0.5:50244
0033:C0 07-Aug-2011 at 17:53:27:(I): [1] TCP DEV 251 Connected to 172.0.0.9:62519
0034:C1 07-Aug-2011 at 17:53:28:(I): [0] TCP DEV 140 Connected to 172.0.0.8:52679
0035:C0 07-Aug-2011 at 17:53:11:(I): [0] TCP DEV 369 Connected to 172.0.0.8:50694
0036:C1 07-Aug-2011 at 17:53:00:(I): [1] TCP DEV 107 Connected to 172.0.0.8:59458
0037:C1 07-Aug-2011 at 17:52:56:(I): [0] TCP DEV 139 Connected to 172.0.0.9:54036
0038:C1 07-Aug-2011 at 17:52:56:(I): [1] TCP DEV 106 Connected to 172.0.0.8:61143
0039:C0 07-Aug-2011 at 17:52:44:(I): [0] TCP DEV 368 Connected to 172.0.0.7:49416
0040:C1 07-Aug-2011 at 17:52:26:(I): [0] TCP DEV 138 Connected to 172.0.0.7:52359
0041:C0 07-Aug-2011 at 17:52:25:(I): [1] TCP DEV 250 Connected to 172.0.0.7:52670
0042:C0 07-Aug-2011 at 17:52:09:(I): [0] TCP DEV 367 Connected to 172.0.0.7:55498
0043:C0 07-Aug-2011 at 17:51:09:(I): [1] TCP DEV 249 Connected to 172.0.0.9:55000
0044:C1 07-Aug-2011 at 17:51:08:(I): [0] TCP DEV 137 Connected to 172.0.0.9:49278
0045:C0 07-Aug-2011 at 17:50:49:(I): [1] TCP DEV 248 Connected to 172.0.0.5:57524

The logs above are from two different time periods but you see the reconnects.
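
To see which initiators are reconnecting most often, the Nexsan event lines can be tallied per source IP. This is a minimal sketch that assumes the line layout shown in the excerpt above. The pattern and function names are illustrative only, and it makes no attempt to interpret the C0/C1 or [0]/[1] fields.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern modelled on the "Connected to" lines quoted above.
CONNECT_RE = re.compile(
    r"(?P<date>\d{2}-[A-Za-z]{3}-\d{4}) at (?P<time>\d{2}:\d{2}:\d{2}):\(I\): "
    r"\[\d+\] TCP DEV \d+ Connected to (?P<ip>\d+\.\d+\.\d+\.\d+):\d+"
)


def parse_connects(log_text):
    """Return a list of (datetime, initiator_ip) for each connect event."""
    events = []
    for m in CONNECT_RE.finditer(log_text):
        when = datetime.strptime(f"{m['date']} {m['time']}", "%d-%b-%Y %H:%M:%S")
        events.append((when, m["ip"]))
    return events


def connects_per_initiator(events):
    """Count connect events per initiator IP."""
    return Counter(ip for _, ip in events)


def span_seconds(events):
    """Seconds between the earliest and latest event (0 if fewer than two)."""
    if len(events) < 2:
        return 0.0
    times = [when for when, _ in events]
    return (max(times) - min(times)).total_seconds()


# First six lines of the Nexsan log in this thread.
SAMPLE = """\
0027:C1 07-Aug-2011 at 17:53:47:(I): [0] TCP DEV 141 Connected to 172.0.0.9:52033
0028:C0 07-Aug-2011 at 17:53:46:(I): [1] TCP DEV 254 Connected to 172.0.0.6:59711
0029:C1 07-Aug-2011 at 17:53:45:(I): [1] TCP DEV 108 Connected to 172.0.0.6:53187
0030:C0 07-Aug-2011 at 17:53:44:(I): [1] TCP DEV 253 Connected to 172.0.0.8:61154
0031:C0 07-Aug-2011 at 17:53:41:(I): [1] TCP DEV 252 Connected to 172.0.0.4:57413
0032:C0 07-Aug-2011 at 17:53:41:(I): [0] TCP DEV 370 Connected to 172.0.0.5:50244
"""

if __name__ == "__main__":
    events = parse_connects(SAMPLE)
    for ip, count in connects_per_initiator(events).most_common():
        print(ip, count)
    print("span (s):", span_seconds(events))
```

Run against a full export, the same counts make it easy to compare the 10GbE ports with the 1GbE ports over the same period.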
Andrew Hancock (VMware vExpert / EE Fellow), VMware and Virtualization Consultant

Commented:
Have you updated the NIC drivers for this NIC from the VMware site?
Andrew Hancock (VMware vExpert / EE Fellow), VMware and Virtualization Consultant

Commented:
Broadcom are notorious for NIC driver issues. I'm surprised VMware have not had you try updating the NIC drivers; maybe you have.

It's interesting, because we currently have issues with this NIC: under heavy load it stops responding, goes down, comes back up, and then resets. It works again, and then intermittently goes down again.
Andrew Hancock (VMware vExpert / EE Fellow), VMware and Virtualization Consultant

Commented:
But you should be able to detect this issue using ping. You can also see the NIC link going up and down in the logs.

Author

Commented:
No, I haven't. We installed straight from the 4.1 download. I can install NIC updates at my DR site; that way it shouldn't interfere with production. I've never installed updated NIC drivers before. Is there a step-by-step guide online?

We currently have 4 production ESXi hosts (all with the exact same hardware specs, including the NIC) that are also connected to a NetApp. But the NetApp is not 10GbE, only 1GbE, and we don't see any issues.

Do you think it would be worth getting a PCI NIC and putting it in one of the hosts to see if the issue happens with a different NIC manufacturer? Nexsan is going to build us a custom firmware that will reset the 10GbE NIC controller if it locks up. But we still can't figure out the logins/logouts. I'm just concerned that all these logins/logouts will cause corruption on the VMs. We have about 20 now but will be adding more, including desktops.
Andrew Hancock (VMware vExpert / EE Fellow), VMware and Virtualization Consultant
Commented:
I would try updating the NIC drivers. It's easy to do: download the update bundle, use the command-line vihostupdate.pl to upload and install the NIC driver, restart the ESXi server, and you're done.

Author

Commented:
Everyone, thank you for your comments. As it turns out, it was still an issue with the Nexsan. According to engineering, the Nexsan would offer old sessions to the ESXi hosts instead of new ones. The ESXi hosts would log in but then (I guess) realize that it was an old session and log out. Engineering created a beta firmware for us that resolved this issue by giving out new sessions. We've been running the beta firmware for several days with no logins/logouts.

If no one objects, I will be giving points to hanccocka. While no one had the answer, one of his comments did help me with updating the drivers for my NICs.