Solved

Performance degradation in new Hyper-V cluster

Posted on 2011-03-23
6,069 Views
Last Modified: 2013-11-06
Good morning!

Setup: Dell T710 (purchased Jan 2010) and Dell R710 (purchased Jan 2011), using a Dell MD3200i (iSCSI) as our SAN. 12 NICs on the T710, 8 on the R710 (I have another quad-port card sitting on my desk, waiting for a maintenance window to install it). The SAN direct-attaches to the servers via Cat6 cables and has dual RAID controller cards.

The original VM environment ran purely on the Dell T710. It used a RAID 5 array with about 1.5 TB of local storage, in addition to the RAID 1 system drive. We never had any performance issues.

We have about 5 networks dedicated to the cluster (listed in binding order):
1) Management NICs (one on each node)
2) Direct connection to an iSCSI port on the SAN
3) Direct connection to a second iSCSI port on the SAN (each server has dual connections, crossed, for MPIO purposes, on separate subnets; both have jumbo frames enabled)
4) Cluster heartbeat network (private network, 10.1.1.x)
5) Live Migration network (private network, 10.1.2.x, with jumbo frames enabled)

The rest of the NICs on the nodes are used for the virtual networks.

We have about a 2 GB LUN for the witness disk and about 4 TB of storage for the VMs, running RAID 6 on the SAN with a 128 KB segment size.

We started experiencing massive slowdowns on one of our SQL servers and on a server that runs our CRM software, which uses one of the databases on that SQL server. The CRM server is used by the sales force. The SQL server hosts the CRM database as well as an enterprise database.

These slowdowns seem to have occurred after we moved the VMs, via SCVMM, into the CSV storage.

I can't seem to find where the degradation is coming from. We've rebooted the SQL server often, and performance improves, only to degrade again within a few hours.

I tried turning the VM off and copying the VHD files back to local storage on the original host server, but the copy process just hangs after a few seconds, on either server. (I'm attributing this to it being a CSV?)

I've contacted Dell enterprise storage to check the iSCSI initiator settings and it all seems good now. I've also upgraded the NIC drivers on the older server.

Any ideas on things I can check? I'm kind of at a loss as to why we're seeing these issues.
Question by:HornAlum
19 Comments
 
LVL 20

Expert Comment

by:Svet Paperov
ID: 35206293
This looks to me like a case of network card misconfiguration. Jumbo frames need to be enabled along the entire iSCSI path: on the hosts, on the switches, and on the MD3200i. Also, make sure you use the same jumbo frame size everywhere or, at least, that the jumbo frame size on the switches is higher.

You could also check whether you have everything offloaded and that flow control is enabled on the same network cards.

Are you experiencing problems with other VHDs too? It doesn't seem so, but you could also have a corrupted VHD file.
 
LVL 5

Author Comment

by:HornAlum
ID: 35206414
The two iSCSI NICs on both servers have jumbo frames enabled and set to 9000. The SAN also has jumbo frames enabled at 9000 (as high as it goes). I connect a Cat6 cable directly from the NICs to the iSCSI host ports; there's no switch between the host nodes and the SAN. Flow control seems to be set to Auto. Any suggestions on a specific setting, Rx & Tx enabled? I've disabled all protocols except TCP/IPv4.

I have a suspicion that it's probably an I/O bandwidth issue to the SAN. The SQL server has very high I/O in the mornings and I believe there's a bottleneck somewhere. I've been reading about numerous people who have had bottlenecks with MPIO and these Dell MD3200i units.

The SAN has two RAID controllers. Here are the IP configs:

Controller 0, Slot 1 - 172.16.1.101
Controller 0, Slot 2 - 172.16.2.101
Controller 0, Slot 3 - 172.16.3.101
Controller 0, Slot 4 - 172.16.4.101
Controller 1, Slot 1 - 172.16.1.102
Controller 1, Slot 2 - 172.16.2.102
Controller 1, Slot 3 - 172.16.3.102
Controller 1, Slot 4 - 172.16.4.102

Node 1, iSCSI NIC 1 - 172.16.1.11 (connects to port 0/1)
Node 1, iSCSI NIC 2 - 172.16.2.12 (connects to port 1/2)
Node 2, iSCSI NIC 1 - 172.16.2.11 (connects to port 1/1)
Node 2, iSCSI NIC 2 - 172.16.1.12 (connects to port 0/2)

Per the cluster validation wizard, each node has iSCSI NICs on different subnets; they crisscross to the SAN. The quorum disk LUN is owned by controller 0 and the data disk holding the VMs is owned by controller 1. This seems to be how the SAN auto-configured it.

I adjusted MPIO from Least Queue Depth to Round Robin with Subset on the data drive path and put the other path on standby. From what I've been reading via Google, people have been experiencing MPIO issues. I'm wondering if I should configure my cabling some other way, maybe put both the quorum and data LUN ownership on the same controller card and connect all 4 cables to that one controller.
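For what it's worth, the current MPIO policy and path state can be dumped from the command line (an illustrative sketch; the disk number is an example, not necessarily the CSV LUN):

rem Show all MPIO-claimed disks and the load-balance policy applied to each.
mpclaim -s -d

rem Show the individual paths (and their state) behind MPIO disk 0; "0" is an
rem example - use the disk number reported by the previous command.
mpclaim -s -d 0

rem List the active iSCSI sessions to confirm which initiator NIC is logged in
rem to which SAN port.
iscsicli SessionList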

Those of you who are using a switch instead of direct connecting: any suggestions? Maybe that would solve the I/O bottleneck issues?

 
LVL 5

Author Comment

by:HornAlum
ID: 35206458
Oops, I also have the QoS Packet Scheduler enabled under the properties of each iSCSI NIC.
 
LVL 20

Expert Comment

by:Svet Paperov
ID: 35207337
Your network setup doesn't seem right to me.

Just to clarify my experience with iSCSI so far: a year ago I set up two Dell R710s with an MD3000i and two Dell 5424 switches dedicated to iSCSI traffic only. I am pretty happy with it; I am running 10 virtual machines, one of which is a SQL Server 2008 used by a CRM application and heavy Websense logging.

First, how did you do that, implementing multi-path iSCSI without switches? While doing my research and consulting with Dell engineers, I never saw a multi-path iSCSI implementation with crossover cables.

Also, you omitted the subnet masks of the IP addresses, but they look wrong to me.

I could suggest a setup, but you would need two switches, for example Dell 5424s, which are iSCSI-optimized. You can also check the following guide: PowerVault MD3200 Series Hyper-V Implementation - Dell (the hyperlink is too long, but you can Google it).

Example:
Controller 0, Slot 1 - 172.16.1.101/24
Controller 0, Slot 2 - 172.16.2.101/24
Controller 0, Slot 3 - 172.16.1.103/24
Controller 0, Slot 4 - 172.16.2.103/24
Controller 1, Slot 1 - 172.16.1.102/24
Controller 1, Slot 2 - 172.16.2.102/24
Controller 1, Slot 3 - 172.16.1.104/24
Controller 1, Slot 4 - 172.16.2.104/24

Node 1, iSCSI NIC 1 - 172.16.1.11/24
Node 1, iSCSI NIC 2 - 172.16.2.11/24
Node 2, iSCSI NIC 1 - 172.16.1.12/24
Node 2, iSCSI NIC 2 - 172.16.2.12/24

We need to set up two VLANs on both switches (and don't forget the jumbo frames):

VLAN 21 for subnet 172.16.1.0/24
VLAN 22 for subnet 172.16.2.0/24

And connect everything as shown in Figures 2 and 3 of the guide.

One more thing: it is not advisable to mix iSCSI traffic and ordinary IP traffic on the same switches.

Regarding Flow control: yes, it should be set to Rx&Tx Enabled.

There are some additional settings you could implement to improve the performance of your iSCSI, but you should start with the basics.

In some cases iSCSI can be a bottleneck for SQL traffic, but I doubt that is the case here. Your overall configuration doesn't suggest that you are running a highly saturated SQL Server.
 
LVL 5

Author Comment

by:HornAlum
ID: 35207788
I'm not using crossover cables; I'm using standard Cat6 cables. When I said they were crossed, I meant that I don't have a node where the cables go to ports 0/1 and 1/1, because those two IPs are on the same subnet. So each node has a cable on the .1 and the .2 subnet, which satisfies the validation wizard.

We didn't have the budget to get additional switches.

The subnet mask is 255.255.255.0 on every NIC and on every host port on the controllers.

I've got the Dell guide right in front of me, and for both implementations (switched or direct-attached), each port on each controller card is on a separate subnet.

Dell uses these defaults
Controller 1
192.168.130.101/24
192.168.131.101/24
192.168.132.101/24
192.168.133.101/24
Controller 2
192.168.130.102/24
192.168.131.102/24
192.168.132.102/24
192.168.133.102/24

I changed everything to 172.16.1-4.x because my primary LAN uses 192.168.20.x and I wanted to eliminate confusion.

I'm not sure about multipathing. I just know it's enabled, even though I direct-connect everything. My suspicion is that there are performance issues there. I'm not sure if I can just turn it off.
 
LVL 20

Expert Comment

by:Svet Paperov
ID: 35209200
OK, I got it. I'm sorry, I misunderstood something. When you said:

Node 2, iSCSI NIC 1 - 172.16.2.11 connects to port 1/1
Node 2, iSCSI NIC 2 - 172.16.1.12 connects to port 0/2


I presumed your ports 1/1 and 0/2 were, respectively, controller 1, slot 1 with 172.16.1.102 and controller 0, slot 2 with 172.16.2.101, which obviously won't work if you were using /24 as the subnet mask and could have been the source of the problem, so I thought you were using /16, which also leads to problems.

But if it is the other way round (controller 0, slot 2 and controller 1, slot 1 respectively), it is correct.

I didn’t know that MD3200i allows you to run without switches (I think MD3000i doesn’t).

Did you configure the iSCSI initiator on the hosts properly? The Dell engineer who configured my iSCSI pointed out that with MPIO it is important to select the initiator IP address that corresponds to each target portal IP. You can do that when adding the target in the Discovery Portal (on the Discovery tab of the iSCSI Initiator) by clicking the Advanced button, then selecting Microsoft iSCSI Initiator as the Local adapter and the corresponding IP address as the Initiator IP.
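To double-check those bindings from the command line, the built-in iscsicli tool can list what the initiator currently has configured (an illustrative sketch):

rem Show the target portals the initiator has been pointed at.
iscsicli ListTargetPortals

rem Show the persistent (favorite) targets; the initiator portal listed for each
rem should be the host NIC on the same subnet as that target portal.
iscsicli ListPersistentTargets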

You can turn off multipathing simply by unplugging one of the network cables.
 
LVL 5

Author Comment

by:HornAlum
ID: 35209319
Yes, I had a WebEx with a Dell tech and we fixed a few iSCSI initiator problems. All of the discovery portal paths now have the correct initiator IP and target IP.
 
LVL 5

Author Comment

by:HornAlum
ID: 35216205
I've tried the following, and still no luck:

- The Dell unit had put the witness disk on controller 0 and the data disk on controller 1; I changed ownership so both are now on controller 1.
- Disabled the iSCSI connections to controller 0, effectively disabling multipathing.
- Disabled TOE on all of the Broadcom iSCSI NICs, since a Dell engineer told me it may conflict with jumbo frames, which are probably more important here. Also disabled Receive Side Scaling.

Two of the VMs continue to have terrible performance ever since we moved them from local storage out to the SAN.
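For reference, the OS-level side of those offload changes can be checked and set with netsh (a sketch; the per-adapter offload properties in the Broadcom driver are configured separately in Device Manager):

rem Disable TCP Chimney Offload (TOE) and Receive Side Scaling at the OS level,
rem then display the global TCP settings to confirm the change.
netsh int tcp set global chimney=disabled
netsh int tcp set global rss=disabled
netsh int tcp show global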
 
LVL 20

Expert Comment

by:Svet Paperov
ID: 35217281
OK. It doesn’t make any sense.

Quote: "I tried turning the VM off and copying the VHD files back to local storage on the original host server, but the copy process just hangs after a few seconds." Is this still the case?

If you don't have problems with the other VMs, then maybe it is not the iSCSI that should be blamed.

You could try one of these (a rough command sketch follows below):
- Remove all VSS copies on the guest, shut it down, and try to make a copy of the VHD (locally or to a remote drive)
- Attach the VHD of the problem VM as a disk on the Hyper-V host and run a check-disk on it
- Compact the VHD if it is dynamically expanding
It would be nice to take a backup before trying all of this but, as I understand it, you cannot do that.
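Here is that sketch from an elevated prompt (the VHD path, test drive letter, and file names are placeholders, not taken from your environment):

rem Inside the guest: list the VSS shadow copies and, if you decide to, delete them
rem before shutting the VM down (C: is an example volume).
vssadmin list shadows
vssadmin delete shadows /for=C: /all

rem On the Hyper-V host, with the VM off: mount the VHD read-only through a diskpart
rem script, check it, then detach it. The VHD path below is a placeholder.
echo select vdisk file="C:\ClusterStorage\Volume1\SQLVM\disk0.vhd" > mount.txt
echo attach vdisk readonly >> mount.txt
diskpart /s mount.txt

rem Replace X: with whatever drive letter the mounted volume receives (check Disk Management).
chkdsk X:

echo select vdisk file="C:\ClusterStorage\Volume1\SQLVM\disk0.vhd" > unmount.txt
echo detach vdisk >> unmount.txt
diskpart /s unmount.txt

rem For a dynamically expanding VHD, "compact vdisk" (run in diskpart with the vdisk
rem selected and either detached or attached read-only) reclaims unused space.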

You don’t have snapshots on the virtual machines, I hope.

I am running out of options… Sorry man

 
LVL 5

Author Comment

by:HornAlum
ID: 35217562
That's OK, I appreciate any input anyone can give.

The disks are fixed size disks.

The other VMs are not nearly as I/O-intensive as the primary VM in question, which is the SQL server. It was perfectly fine when it was stored on the local SAS RAID 5 array; as soon as it ended up on the SAN/CSV, we've seen nothing but timeouts.

I did an experiment: I started copying files over to the SAN device, and it writes very quickly. However, if I rename the file (for the sake of maybe making it look different) and try to copy it back, it's painfully slow. It writes the file to the disk at about 480-500 MB/second; copying it back, it moves at under 5 MB/second.

So I could be seeing poor read performance, which would explain the performance hit.
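One way to make that copy test repeatable without Explorer's caching skewing the numbers is robocopy (a sketch; the paths and file name are placeholders):

rem Write test: push a large file from local disk to the SAN volume
rem (X:\iotest and bigfile.bin are placeholders).
robocopy C:\temp X:\iotest bigfile.bin /J /NP

rem Read test: pull the same file back to a different local folder.
robocopy X:\iotest C:\temp\readback bigfile.bin /J /NP

rem The "Speed" line in each robocopy summary gives bytes/sec, so the two runs give a
rem direct write-vs-read comparison. /J requests unbuffered I/O to keep the file cache
rem out of the measurement; drop it if your robocopy build does not accept it.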
 
LVL 20

Expert Comment

by:Svet Paperov
ID: 35218096
If you want to do a real test, you should use a small tool called SQLIO. It is designed to simulate SQL Server I/O, and you can download it from Microsoft. Here is a small script.bat that you can use to run it with different I/O request sizes:

rem Random writes at 8, 64, 128 and 256 KB request sizes (600 seconds each, 8 outstanding I/Os)
sqlio -kW -s600 -frandom -o8 -b8 -LS -Fparam.txt
timeout /T 60
sqlio -kW -s600 -frandom -o8 -b64 -LS -Fparam.txt
timeout /T 60
sqlio -kW -s600 -frandom -o8 -b128 -LS -Fparam.txt
timeout /T 60
sqlio -kW -s600 -frandom -o8 -b256 -LS -Fparam.txt
timeout /T 60

rem Sequential writes
sqlio -kW -s600 -fsequential -o8 -b8 -LS -Fparam.txt
timeout /T 60
sqlio -kW -s600 -fsequential -o8 -b64 -LS -Fparam.txt
timeout /T 60
sqlio -kW -s600 -fsequential -o8 -b128 -LS -Fparam.txt
timeout /T 60
sqlio -kW -s600 -fsequential -o8 -b256 -LS -Fparam.txt
timeout /T 60

rem Random reads
sqlio -kR -s600 -frandom -o8 -b8 -LS -Fparam.txt
timeout /T 60
sqlio -kR -s600 -frandom -o8 -b64 -LS -Fparam.txt
timeout /T 60
sqlio -kR -s600 -frandom -o8 -b128 -LS -Fparam.txt
timeout /T 60
sqlio -kR -s600 -frandom -o8 -b256 -LS -Fparam.txt
timeout /T 60

rem Sequential reads
sqlio -kR -s600 -fsequential -o8 -b8 -LS -Fparam.txt
timeout /T 60
sqlio -kR -s600 -fsequential -o8 -b64 -LS -Fparam.txt
timeout /T 60
sqlio -kR -s600 -fsequential -o8 -b128 -LS -Fparam.txt
timeout /T 60
sqlio -kR -s600 -fsequential -o8 -b256 -LS -Fparam.txt

Param.txt contains:
d:\testfile.dat 2 0x0 500
where D: is a drive on your iSCSI storage, 2 is the number of threads, 0x0 is the CPU affinity mask, and 500 is the test file size in MB, which should be larger than the cache on the controller.

Then you can play with different settings on MD3200i and your hosts.
 
LVL 5

Author Comment

by:HornAlum
ID: 35218285
I will definitely try this on Monday.

I created a new 2 GB LUN to play with and used the ATTO Disk Benchmark. These are my results so far: significant read slowdowns at the 32K and 64K block sizes.

Capture.JPG
 
LVL 5

Assisted Solution

by:HornAlum
HornAlum earned 0 total points
ID: 35243007
So I've kind of made a breakthrough. It turns out the system wasn't performing very well with a jumbo frame setting of 9000. We lowered it and settled around 6500, and we're seeing much better performance in the ATTO benchmark. I'm also seeing somewhat better performance on requests made to the SQL servers for data.

I ran some of those SQLIO benchmarks after I made the packet-size change. I've output the results to the attached text files if you want to look them over; I used 360 seconds instead of 600.


Capture-6500-jumboframe.JPG
write-random.txt
write-sequential.txt
read-sequential.txt
read-random.txt
 
LVL 20

Accepted Solution

by:Svet Paperov
Svet Paperov earned 300 total points
ID: 35243990
So it was the jumbo frame size? I am rather surprised, because the MD3000i supports an MTU of up to 9000 bytes/frame and Broadcom NICs also support a 9 KB jumbo frame size. Did you set the same size on both ends? You can use ping -l [buffersize] -f from the host to test what the correct MTU size is.

Your results look OK. I've uploaded some of my test results. They show slightly better numbers than yours, but that could be because of the hard disks: 146 GB 15,000 rpm SCSI disks, one group of 8 disks in RAID 5 and one of 6 in RAID 10.

I would still try to get the MTU size fixed at 9000. You may have a lot of SQL traffic, and there it really matters.
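As a worked example of that ping test (the target address is one of the SAN ports posted earlier; the payload sizes account for 28 bytes of IP and ICMP headers):

rem Probe the path MTU toward one of the SAN ports with the don't-fragment flag set.
rem A payload of 8972 bytes corresponds to a 9000-byte MTU (9000 - 20 IP - 8 ICMP).
ping -f -l 8972 172.16.1.101

rem If that fails, step the size down (for example 6472 for a 6500-byte MTU) until the
rem ping succeeds; the largest payload that passes shows the real end-to-end limit.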


sqlio-results.xls
 
LVL 5

Author Comment

by:HornAlum
ID: 35244029
The poor results we were getting were with an MTU of 9000; that's why I'm surprised as well. I set the MTU on both the NICs and the 3200i. I tried 8000 and 7000 with similarly poor performance. I settled in around 6500 and am seeing better results, as you can tell from the ATTO benchmarks.

I'm using 450 GB, 15K RPM SAS disks, 12 disks in a RAID 6.
 
LVL 20

Assisted Solution

by:Svet Paperov
Svet Paperov earned 300 total points
ID: 35244276
You should be getting better results than me with those disks. As far as I can see, the major difference between your setup and mine is the switches. I would try to get at least one Dell PowerConnect 5424; it has some iSCSI optimizations and maybe it will help.
 
LVL 5

Author Closing Comment

by:HornAlum
ID: 35308463
The working jumbo frame size was found via self-experimentation.
 
LVL 1

Expert Comment

by:BRAHelpdesk
ID: 39395364
Had the same problem with Dell 5524 switches. Replaced them with Brocade VDX and jumbo frames work.
 
LVL 5

Author Comment

by:HornAlum
ID: 39396082
I stopped using the Broadcoms and used Intel NICs instead, and have had no issues since. The Intels seem to handle the 9000 frame size much better than the Broadcoms.
