Design: Our SAN has 4 iSCSI NICs. Each ESX host has 4 NICs dedicated to iSCSI, all connected to a single vSwitch (screenshot attached). There is no routed network between the SAN and the ESX hosts, just one flat switch configured for jumbo frames and optimized for iSCSI traffic.
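In case the jumbo frame config matters, this is how I've been sanity-checking it end to end (vmk1 and 10.0.0.10 are placeholders for one of our iSCSI vmknics and one of the SAN ports):

```
# Confirm MTU 9000 on the vmkernel ports
esxcli network ip interface list

# Send an 8972-byte ping (9000 minus 28 bytes of IP/ICMP headers)
# with "don't fragment" set, from a specific vmknic to a SAN port.
# vmk1 and 10.0.0.10 are placeholders for our setup.
vmkping -d -s 8972 -I vmk1 10.0.0.10
```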
We are getting terrible latency and sometimes All Paths Down (APD) events on our ESX hosts when the SAN is under heavy IOPS load (from an overnight SAN-to-SAN replication). When the hosts hit that latency, they curl up in a ball and die.
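For reference, this is how I'm spotting the events in the host logs (assuming ESXi 5.x log locations); the timestamps line up with the replication window:

```
# APD events show up in vmkernel.log on each host
grep -i "all paths down" /var/log/vmkernel.log

# The SCSI layer also logs latency spikes against the devices
grep -i "performance has deteriorated" /var/log/vmkernel.log
```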
I wonder whether our VMware iSCSI configuration has something to do with it. (We are already investing in more disks for the SAN to get higher IOPS and lower latency.)
Each SAN volume presents 16 possible paths to each host (4 SAN NICs x 4 vmknics = 16 paths), and with a Round Robin PSP, I/O rotates across all of them.
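This is how I'm verifying the PSP and the path count per volume (the naa ID below is a placeholder for one of our Compellent volumes):

```
# Show every device with its PSP (should be VMW_PSP_RR) and config
esxcli storage nmp device list

# Count the paths behind a single volume
# (the naa ID is a placeholder for one of our volumes)
esxcli storage core path list -d naa.6000d310000ed4000000000000000001 | grep -c "Runtime Name"
```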
Should I spread the vmknics across four vSwitches to better handle load balancing? Someone suggested this, but I can't find any evidence to support it. Or should we reduce the default Round Robin IOPS limit from 1000 to a lower number, perhaps 3? (I can't find a recommended setting for a Compellent SAN, but Dell's recommendation for an EqualLogic is 3 IOPS.) Or can you spot anything else? If the IOPS limit turns out to be the answer, the command I'd test with is sketched below.
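This is my understanding of how the per-device change would be made; the naa ID is a placeholder, and 3 is the EqualLogic value rather than a confirmed Compellent recommendation:

```
# Lower the Round Robin path-switch trigger from the default of
# 1000 I/Os down to 3 for one device. The naa ID is a placeholder;
# 3 is Dell's EqualLogic number, not a confirmed Compellent value.
esxcli storage nmp psp roundrobin deviceconfig set \
    --device=naa.6000d310000ed4000000000000000001 --type=iops --iops=3

# Verify the change took effect
esxcli storage nmp psp roundrobin deviceconfig get \
    --device=naa.6000d310000ed4000000000000000001
```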