Hyper v, HP ISCSI SAN

Morning

We currently have an issue with iSCSI connectivity on a setup we have inherited from another IT provider.

To summerise, this is a Hyper-V 2K8 R2 cluster, which is using CSV's through ISCSI stored on a HP MSA 2324i iSCSI SAN.  Each host has 2 NIC's for ISCSI, and they are on there own subnet.  They are split between two switches for redundancy - nothing has changed prior to the fault.

Friday evening, all of the VM's went offline and the storage was marked as failed in the cluster.  On investigation I pinged the iSCSI interfaces on the SAN from the hyper-V hosts.  The ping times were either 1300ms+ or they were dropped, on all controller interfaces, from all hosts.

After several hours the storage came back online, the cluster storage could be brought back online and the ISCSI initiators on the hosts went from "reconnecting" to "connected".  Nothing had changed to bring it back online.

Several hours later the same issue occurs.  After 20+ hours it comes back up again, for one hour then dropped.

I have tried the following:

- Removed the physical cables and tried one at a time into each controller
- restarted the storage and management controllers
- restarted the hosts
- changed the IP's on the host interfaces on the SAN
- enabled loop protection on the switches

Given that all of the hosts are disconnected I am certain the SAN is the issue.  In the logs of the SAN you can see the port connections going up and down, however everything is reporting as healthy.  

Colleague is on the phone to HP but thought I would throw it out to EE in case anyone has had a similar issue they managed to fix.....

Thanks
LVL 12
DLeaverAsked:
Who is Participating?

[Webinar] Streamline your web hosting managementRegister Today

x
 
DLeaverConnect With a Mentor Author Commented:
Apologies for the delay

The issue ended up being due to another colleague plugging in two network hubs in a seperate part of the building.

The cabling at this site appears to be a little odd, and it wasn't behaving like a normal loop.  The degradation increased more by the day.  As soon as it was stated there had been two hubs added they were removed and the network performance came back - thats what happens when you are relying on all of the information being relayed and it isn't!

Marking as solved myself
0
 
VB ITSSpecialist ConsultantCommented:
I've seen a similar issue with a three node cluster, except my pings would just time out the majority of the time. Are all of your DCs on the cluster?

When you restarted the storage and management controllers, did you restart the hosts at the same time?
0
 
DLeaverAuthor Commented:
Solution found from further troubleshooting onsite
0
All Courses

From novice to tech pro — start learning today.