I am running ESXi 5.5u2 with the VMs stored on an illumos NFS server (OmniOS) using ZFS. Average write latencies are low for the most part (< 1 ms), but several times per hour there are spikes into the 50-100 ms range. More than 75% of the time these come from domain controllers, but other VMs will randomly show similar spikes in write latency.
Examining write throughput and write operations per second shows no spikes in either metric during these events. Likewise, read rate and read ops are low during these events.
ZFS is configured with a ZeusRAM SLOG device. The IOPS going through the ZeusRAM are nowhere near its limit (the most I have seen is 500 write IOPS, and the limit is in the tens of thousands). To work around vSphere's single-TCP-connection-per-datastore limit for NFS, the illumos server is multihomed with several IP addresses, and each datastore is tied to one of those addresses. This avoids funneling all VMs through a single TCP connection; each VM gets its own private TCP connection. The OmniOS server has 32 GB of RAM and 10 TB of storage, so there is no RAM deficit.
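For reference, this is roughly how the multihomed datastores are mounted on the ESXi side (addresses, share paths, and volume names here are placeholders, not my actual values):

```shell
# Each datastore is mounted through a different IP alias on the same
# OmniOS box, so each gets its own TCP connection from the ESXi host.
esxcli storage nfs add --host 10.0.0.11 --share /tank/ds1 --volume-name ds1
esxcli storage nfs add --host 10.0.0.12 --share /tank/ds2 --volume-name ds2

# Verify the mounts and which server IP each datastore uses
esxcli storage nfs list
```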
I've tried tweaking multiple illumos kernel parameters, nfsd parameters, .vmx config file parameters, and vSphere parameters, all to no effect. I cannot determine why these intermittent spikes occur.
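One measurement I can take on the OmniOS side is a latency distribution of the NFSv3 writes themselves, to see whether the spikes originate in the NFS/ZFS stack or somewhere in between. A sketch using the illumos nfsv3 DTrace provider (run as root; patterned on the well-known nfsv3rwtime.d script):

```shell
dtrace -qn '
/* Record the start time of each NFSv3 write, keyed by RPC XID */
nfsv3:::op-write-start { start[args[1]->noi_xid] = timestamp; }

/* On completion, add the elapsed time to a power-of-two histogram */
nfsv3:::op-write-done
/start[args[1]->noi_xid]/
{
    @["NFSv3 write latency (ns)"] =
        quantize(timestamp - start[args[1]->noi_xid]);
    start[args[1]->noi_xid] = 0;
}

/* Print and reset the histogram every 10 seconds */
tick-10s { printa(@); trunc(@); }'
```

If the server-side histogram stays flat while vSphere reports a spike, the delay would have to be in the network path or on the ESXi host rather than in ZFS.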
Admittedly, these are not very high latencies, but there has to be a reason for them. It does not make sense that throughput and IOPS load are low at the very moments these spikes occur.
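One hypothesis I want to rule out is that the spikes line up with ZFS transaction group syncs, which happen periodically regardless of load. A sketch for timing each spa_sync on the OmniOS box (fbt probes, so this assumes the function name matches my kernel build):

```shell
dtrace -qn '
/* Timestamp each transaction group sync as it begins */
fbt::spa_sync:entry { self->t = timestamp; }

/* On return, print wall-clock time and sync duration in ms */
fbt::spa_sync:return
/self->t/
{
    printf("%Y spa_sync took %d ms\n",
        walltimestamp, (timestamp - self->t) / 1000000);
    self->t = 0;
}'
```

Correlating these timestamps against the vSphere latency graphs would show whether the spikes coincide with txg sync activity.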
Does anyone have any suggestions on where to look for the root cause?