SOFS horrendous performance

CaptainGiblets
Asked:
I believe I am having performance issues with the Scale-Out File Server (SOFS) running in our domain.

We are using Windows Server 2016, but we aren't using Storage Spaces Direct (S2D).

Our setup has 2 hosts, Storage1 and Storage2. Each machine has three networks: Domain, Cluster, and Storage, each used for what its name implies.

Each host is connected to all 3 JBODs via two paths, with MPIO installed on both machines.

Virtual machines are running without much of an issue; however, I am trying to diagnose another problem with DFSR, and I think it is related to the speed of the storage.

Each JBOD has 16 × 1 TB HDDs and 5 SSDs (one has 6). The virtual disk has a column count of 8 and two data copies (a 2-way mirror).
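
For reference, those layout numbers can be confirmed with the standard storage cmdlets on a storage node (a minimal sketch; the friendly names will differ per deployment):

# Show resiliency, data copies, column count, and interleave for each virtual disk
Get-VirtualDisk | Select-Object FriendlyName, ResiliencySettingName, NumberOfDataCopies, NumberOfColumns, Interleave

# Show the SSD/HDD tiers backing the pool
Get-StorageTier | Select-Object FriendlyName, MediaType, Size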

When I first set this up I was getting great speeds using the following command:

diskspd.exe -b64K -d10 -h -L -o32 -w30 -t2 -c2G io1.dat io2.dat io3.dat io4.dat io5.dat

thread |       bytes     |     I/Os     |     MB/s   |  I/O per s |  AvgLat  | LatStdDev |  file
-----------------------------------------------------------------------------------------------------
     0 |      5158862848 |        78718 |     491.24 |    7859.87 |    4.070 |    26.885 | io1.dat (2048MB)
     1 |      5099421696 |        77811 |     485.58 |    7769.31 |    4.116 |    31.502 | io1.dat (2048MB)
     2 |      3890741248 |        59368 |     370.49 |    5927.81 |    5.394 |     9.878 | io2.dat (2048MB)
     3 |      7534411776 |       114966 |     717.45 |   11479.18 |    2.786 |     2.321 | io2.dat (2048MB)
     4 |      6189678592 |        94447 |     589.40 |    9430.39 |    3.391 |     7.073 | io3.dat (2048MB)
     5 |      6811746304 |       103939 |     648.63 |   10378.15 |    3.081 |     2.714 | io3.dat (2048MB)
     6 |      3873636352 |        59107 |     368.86 |    5901.75 |    5.418 |    10.154 | io4.dat (2048MB)
     7 |      5721554944 |        87304 |     544.82 |    8717.17 |    3.669 |    11.433 | io4.dat (2048MB)
     8 |      6418595840 |        97940 |     611.20 |    9779.16 |    3.271 |     4.274 | io5.dat (2048MB)
     9 |      6093799424 |        92984 |     580.27 |    9284.31 |    3.450 |     6.195 | io5.dat (2048MB)
-----------------------------------------------------------------------------------------------------
total:       56792449024 |       866584 |    5407.94 |   86527.11 |    3.697 |    13.982

Now I have been looking into why one of my file servers was going slow. To try to fix it, I pinned its VHDX directly to the SSD tier. Users are still reporting slowness, so I decided to run the tests again on both the pinned and unpinned VHDXs running on the same server.
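
The pinning was done with the file storage tier cmdlets, roughly like this (a sketch; the drive letter, path, and tier name below are placeholders, and on a CSV you would target the ClusterStorage volume rather than a drive letter):

# Pin the VHDX to the SSD tier (path and tier name are examples only)
Set-FileStorageTier -FilePath "V:\VMs\FileServer.vhdx" -DesiredStorageTierFriendlyName "SSD_Tier"

# Move the file immediately instead of waiting for the scheduled optimization task
Optimize-Volume -DriveLetter V -TierOptimize

# Confirm the pin completed
Get-FileStorageTier -VolumeDriveLetter V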

On the pinned VHDX I was receiving:

total:         775946240 |         2960 |      12.33 |      49.33 | 3571.875 |  4524.332

which is slower than a single 7.2k HDD.

On the unpinned VHDX I got:

total:        5881462784 |        22436 |      93.48 |     373.93 |  476.933 |  2971.906

which is better, but still pathetically slow for the hardware in use.

Watching the storage server during this period, it only averages around 300 Mbps of network throughput, where it used to hit 2,000+.
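
This is roughly how I have been sampling the storage node (a minimal sketch using standard performance counters; counter instances will vary per machine):

# Sample network throughput and physical disk latency/queueing every 2 seconds
Get-Counter -Counter @(
    '\Network Interface(*)\Bytes Total/sec',
    '\PhysicalDisk(*)\Avg. Disk Queue Length',
    '\PhysicalDisk(*)\Avg. Disk sec/Transfer'
) -SampleInterval 2 -MaxSamples 10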

Is there anything I can do to investigate these issues? It is kind of grinding my domain to a halt.
Cliff, Distinguished Expert 2018
Commented:
I'm a bit confused about your topology, or perhaps there's just not enough information.

You mention having "2 hosts" but don't go into too much detail. Questions off the top of my head:

1) How many compute nodes do you have?
2) How many storage nodes do you have? (I assume 2 since you named your "hosts" storage1 and storage2)
3) How are your compute nodes connected to your storage nodes (you only talked about storage-to-JBOD)?
4) How many CSVs have you defined? If only one, then you'll have some balancing issues with I/O, with one storage node basically doing nothing or redirecting (performance hit); a quick check for that is sketched below.
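
A sketch of that check, assuming the FailoverClusters module on one of the storage nodes:

# List CSVs and which node currently owns each one
Get-ClusterSharedVolume | Select-Object Name, OwnerNode, State

# Show whether I/O on each CSV is Direct or redirected (FileSystemRedirected / BlockRedirected)
Get-ClusterSharedVolumeState | Select-Object Name, Node, StateInfo, FileSystemRedirectedIOReason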

Author

Commented:
Hi Cliff, thanks for the fast reply!

By compute node, do you mean the hosts connecting to the storage nodes? If so, there are 20 hosts.

There are 2 storage nodes.

The compute nodes (the Hyper-V hosts) are connected over a 10 Gb network; running speed tests, the network is capable of much faster speeds than what I am currently getting.

I only have the one CSV; however, the storage node that hosts this CSV isn't being taxed at all, averaging 4% CPU and 10 of 192 GB of RAM. I have tried pausing the second storage node while running speed tests to make sure there are no redirects, etc., but I still get the same results.
Philip Elder, Technical Architect - HA/Compute/Storage
Commented:
The physical disk count should be a multiple of the redundancy type (2-Way Mirror or 3-Way Mirror, i.e. the number of data copies) times the column count.

Assuming a 3-Way Mirror based on above:
 * 2 Columns = 3*2 = 6 (HDD count should be in multiples of 6)
 * 3 Columns = 3*3 = 9 (multiples of 9)
 * 4 Columns = 3*4 = 12 (multiples of 12)

With 16 capacity drives, it's hard to hit the above numbers. That's where the bite is coming from.
 * 8 Columns = 3*8 = 24 (HDD should be in multiples of 24)

Ideally, we populate our JBODs in groups of 12 drives to hit 2, 3, and 4 columns evenly. Then we tweak the interleave and the underlying format of the CSV to line up with, say, 64KB for higher-IOPS needs or 256KB (default) for basic workloads.
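
As a rough illustration of lining those numbers up (a sketch only, using a plain mirror space; pool and disk names are placeholders, and the column count must match your own drive counts):

# 2-way mirror with 4 columns: wants capacity drives in multiples of 2*4 = 8
New-VirtualDisk -StoragePoolFriendlyName "Pool1" -FriendlyName "VDisk1" `
    -ResiliencySettingName Mirror -NumberOfDataCopies 2 `
    -NumberOfColumns 4 -Interleave 65536 -UseMaximumSize

# Format so the allocation unit lines up with the 64KB interleave
Get-VirtualDisk -FriendlyName "VDisk1" | Get-Disk |
    Initialize-Disk -PartitionStyle GPT -PassThru |
    New-Partition -UseMaximumSize |
    Format-Volume -FileSystem NTFS -AllocationUnitSize 65536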

EDIT: 3x JBODs = 48 so above rule applies. My bad, I missed that on my original read-through.

Question: Is MPIO set to Least Blocks (LB)?

Author

Commented:
I only have a 2-way mirror set up, not a 3-way.

I have run the command Get-MSDSMGlobalDefaultLoadBalancePolicy and it just replies with "None".
Philip Elder, Technical Architect - HA/Compute/Storage
Commented:
Elevated CMD:
mpclaim -s -d

