asked on

Ultra SLOW 2050 SAN

Hi all, Here’s the overview.

Current Infrastructure:

C7001 Blade enclosure containing
3 x ProLiant BL460c Gen8 blades
2 x HP B-series 8/12c SAN Brocade Switch BladeSystem c-Class

Storage:
1 x MSA 2050 SAN (about 8 months old)
1 x MSA 2040 SAN

The Gen8 blades are used as vmware 6.5 hosts connected through the brocade to the 2040 and 2050 SANs
The 2040 is used for archive storage
The VMS reside on the 2050 on fast SSD tierd storage with other slower storage for data.
The 2050 SAN has suddenly ground to a halt. VMs take 40 mins to start up and accessing them results in hanging.

The Hosts take 40 mins to boot hanging on nfs41client loaded successfully. They eventually boot and become available after about an hour.

Pinging the main file server results in massive pings plus response timeouts every 15 mins. Then it comes back. Its like the IO has been throttled to almost zero.
The 2040 SAN storage which is attached to a file server VM on the 2050 is ok. When the vm eventually boots, the storage can be written to via a share at full 1GB speed but anything copied to the 2050 storage grinds to a halt.
VMs on all hosts are affected.

The 2040 has dual fibre controllers of which two pairs are active.
The 2050 has dual fibre controllers of which one pair is active and one doesn’t seem to be active. I’m not sure why or if it was before.

Three things to note which may or may not be of interest:

We had a power failure before Christmas but everything came back up ok and seemed ok.

Another thing to note is that I noticed that I had patched the hosts by mistake to gen9+ image instead of the pregen9. Big mistake I thought (although it seemed ok until recently) I thought that that was the issue and that the network drivers were incompatible. I decided to re-image one of the hosts afresh with the correct VMware-ESXi-6.5.0-Update3-14990892-HPE-preGen9-650.U3.9.6.10.1-Dec2019.iso
However it took 40 mins to scan the hardware. It found all the storage eventually and I was able to install the image on the host’s internal 4GB Flash Storage.

On reboot with the correct image it again hangs with nfs41client client loaded successfully, with an even longer period till it moves onto the next file which is VFS and hangs.

If its hanging on the boot of an ISO it can’t be the mispatching of the hosts which is causing it? Another thought was that maybe thats because the other two hosts which were switched on at the time were flooding the fibre network with crap packets. These are just random thoughts though. I’m stumped.

Can fibre be switched to a slow speed by mistake?

Logs on all devices are unremarkable.

Recommendations/Questions?

I need the infrastructure up and running asap and was thinking of swapping over fibre from the working 2040 to the 2050 to test to see if that solves the issue but this is where my knowledge breaks down. I’m not up to speed with fibre and maybe you can’t just do that?

Gideon

Member_2_231077

First step would be to get the logs via the SMU and then look through them or post them here, it may be something simple like a dead battery.
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00075761en_us&docLocale=en_US

Nick Smith

ASKER

Hi, Thanks for the reply. As there were about 200 files I've zipped them and put them in a wetransfer link here:
https://we.tl/t-ODppauqFW2

I've skimmed through them but not sure what I'm looking for.

Member_2_231077

One of the controllers appears to have died, that kills performance since it disables the write cache on the remaining controller because it cannot mirror that RAM to its partner. Should be a red light on one controller but the SMU should also show a controller as bad.

The log is in the root of that zip - store<date>dot log. I don't know exactly what the following means but a string of entries with "abort" in them is obviously not good.

01/09 09:14:33.605592 [0]FCP Abort Received XRI=0xFFFFFFFF:
01/09 09:14:33.605625 [0]OID=0 SID=0x010300 OXID=0x07C2 RXID=0xFFFF
01/09 09:14:33.605666 [0]no nexus found
01/09 09:14:33.605751 w[0]ABTS RECEIVED s_id 0x10300 handle 0x3 ox_id 0x07C3

Edit:

Had another look and seems it has been rebooted and is working OK now? If it's still under warranty I would get HP to look through the log, they may have a log interpreter to get more info or they may send the logs to Dot Hill who make the MSA2000 series.

Nick Smith

ASKER

Hi, We found that one of the SFPs had died but the problem still remains so looks like a surge damaged a bit more. Yes an HP call is in progress. Thanks for the pointer in the log. I'll let you know how we progress.

Nick Smith

ASKER

Hi, Just an update on progress:
I have a call with HP open where they looked at the 2050 logs and also a supportshow dump from brocade switches, none of which showed any errors. I bought in a replacement HPE SFP which I have used to replace the failed one and speed seems to have been restored. They still can't explain why unless all four links are connected speed is crippled. I thought it would in theory work fine with one connected?

ASKER CERTIFIED SOLUTION

Member_2_231077

membership

This solution is only available to members.

To access this solution, you must be a member of Experts Exchange.

Start Free Trial