Hi all, Here’s the overview.
Current Infrastructure:
C7001 Blade enclosure containing
3 x ProLiant BL460c Gen8 blades
2 x HP B-series 8/12c SAN Brocade Switch BladeSystem c-Class
Storage:
1 x MSA 2050 SAN (about 8 months old)
1 x MSA 2040 SAN
The Gen8 blades are used as vmware 6.5 hosts connected through the brocade to the 2040 and 2050 SANs
The 2040 is used for archive storage
The VMS reside on the 2050 on fast SSD tierd storage with other slower storage for data.
The 2050 SAN has suddenly ground to a halt. VMs take 40 mins to start up and accessing them results in hanging.
The Hosts take 40 mins to boot hanging on nfs41client loaded successfully. They eventually boot and become available after about an hour.
Pinging the main file server results in massive pings plus response timeouts every 15 mins. Then it comes back. Its like the IO has been throttled to almost zero.
The 2040 SAN storage which is attached to a file server VM on the 2050 is ok. When the vm eventually boots, the storage can be written to via a share at full 1GB speed but anything copied to the 2050 storage grinds to a halt.
VMs on all hosts are affected.
The 2040 has dual fibre controllers of which two pairs are active.
The 2050 has dual fibre controllers of which one pair is active and one doesn’t seem to be active. I’m not sure why or if it was before.
Three things to note which may or may not be of interest:
We had a power failure before Christmas but everything came back up ok and seemed ok.
Another thing to note is that I noticed that I had patched the hosts by mistake to gen9+ image instead of the pregen9. Big mistake I thought (although it seemed ok until recently) I thought that that was the issue and that the network drivers were incompatible. I decided to re-image one of the hosts afresh with the correct VMware-ESXi-6.5.0-Update3-14990892-HPE-preGen9-650.U3.9.6.10.1-Dec2019.iso
However it took 40 mins to scan the hardware. It found all the storage eventually and I was able to install the image on the host’s internal 4GB Flash Storage.
On reboot with the correct image it again hangs with nfs41client client loaded successfully, with an even longer period till it moves onto the next file which is VFS and hangs.
If its hanging on the boot of an ISO it can’t be the mispatching of the hosts which is causing it? Another thought was that maybe thats because the other two hosts which were switched on at the time were flooding the fibre network with crap packets. These are just random thoughts though. I’m stumped.
Can fibre be switched to a slow speed by mistake?
Logs on all devices are unremarkable.
Recommendations/Questions?
I need the infrastructure up and running asap and was thinking of swapping over fibre from the working 2040 to the 2050 to test to see if that solves the issue but this is where my knowledge breaks down. I’m not up to speed with fibre and maybe you can’t just do that?
Gideon
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00075761en_us&docLocale=en_US