indigo6

VMware ESXi 5 has high read latency, no write latency

Hello,
     I have a single ESXi 5 host hosting a few VMs; we will likely move to a multi-server setup with a SAN, vMotion and the like later. Right now, however, there seems to be some sluggishness on a SQL/application server for about 12 users. It's generally quick, but there are certain times when it is slow. The VM is running Windows Server 2008 R2 and SQL Server 2008 R2, and is allocated 8 cores and 12GB of RAM. It only shares the host with two other VMs, so it's not running into performance issues in that regard.

     However, when I look at the disk latencies, sometimes the read latency can peak at over 50ms, the write latency stays at 0ms. (See attachment) This is a RAID 10 virtual disk with 4 600GB 15K SAS drives on a Dell PERC H710. It is configured with Adaptive Read Ahead, and Write Back, while the disk cache policy shows disabled (I assume because the member disks are SAS). The stripe element size is 64KB. I'm not sure if this is the cause of the intermittent sluggishness, or if it's something else. Any help is appreciated!
lcohan

"It's generally quick, but there are certain times where it is slow" - if anyone can check, my guess is that this happens due to some scheduled job in SQL or a repeated user action on tables with missing indexes, so a larger than needed number of reads occurs.

Besides that, a 50ms latency on a host with multiple VMs is not huge in my opinion, and the latency on its own can be induced by the host itself:

http://communities.vmware.com/thread/397646
I don't see any attachment on your posting.  Are these numbers for the whole disk array, or for just the SQL Server VM?

First of all, I sure hope you have battery backup on the RAID controller cache. If not, you are taking a risk using write-back.

Second, having a much lower write latency is normal when using a cache for writes. In fact, depending on the details of the controller (which I am not familiar with), your write performance may be at the expense of your read performance. Writes are probably getting priority in the cache, and the reads get what is left over. But 50ms read times are not a problem if they are rare.

If your bottleneck is disk reads, then you might be able to reduce them by adding memory to SQL Server. But there are way too many unknowns in your configuration to be sure of that:
- Are you sure the application slowness is due to waiting for SQL responses?
- How big is the database?
- What are the memory settings for the SQL server instance?
- What is the page-get rate on SQL when the application is slow?  And what is the buffer pool hit ratio?
- What is the disk transfers/sec and average secs/transfer on the SQL server Virtual disks?
how many cores or virtual cpus have you allocated to the VM - 8 way vCPU, vSMP?

do you really need 8 CPUs for the server? Over-allocation can be detrimental to performance.

also check you are NOT running on a Snapshot virtual disk.
Good points by hanccocka. I generally avoid giving any VM more vCPUs than half the number of cores in the host. So, for a two-socket, quad-core host that would be no more than 4. I don't count hyperthreading, which provides only marginal extra throughput in most cases.
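
That rule of thumb can be written down as a quick calculation. This is only a sketch of the guideline stated above, not anything VMware publishes; the function name `max_vcpus_per_vm` is made up for illustration.

```python
def max_vcpus_per_vm(sockets: int, cores_per_socket: int) -> int:
    """Rule of thumb from this thread: at most half the physical cores.

    Hyperthreaded logical processors are deliberately not counted.
    """
    physical_cores = sockets * cores_per_socket
    # Never go below one vCPU, even on a single-core host.
    return max(1, physical_cores // 2)


# Two-socket, quad-core host from the example above.
print(max_vcpus_per_vm(sockets=2, cores_per_socket=4))  # 4
```

Treat the result as a ceiling to stay under, not as the number to assign.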
indigo6

ASKER

I will check if anything is going on in SQL. As it's a new install with only a few DBs, I don't think so, but I will look nonetheless. Not sure why the attachment didn't come through, but I've attached it here.

So what I'm hearing is that the latency is likely not responsible for the intermittent performance issues I'm experiencing?

This VM is currently allocated 8 vCPUs; the host has two six-core processors. So, I'm curious: can giving the VM too many resources be detrimental to that VM's own performance, not just to the others? Is it wise to reconfigure the number of CPUs on this server? I'm not sure I want to rebuild it, but if there is no other option then I most certainly will.

Thank you for all the suggestions!
Read-Latency.jpg
ASKER CERTIFIED SOLUTION
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
This solution is only available to members.
indigo6

ASKER

I see, that's very good advice! The physical server this is replacing has 2 single-core hyper-threaded processors. It is an older Dell PowerEdge 2600 running Windows Server 2003 R2 and SQL Server 2000. Is it better to use multiple cores in a single virtual processor, or multiple virtual processors? I'm also wondering if changing the shares value for the Disk/Memory/CPU has any appreciable effect. I want this VM to have priority over the others if it comes under heavy load.
Nobody is saying the read latency is not the cause.  But we don't have any immediate reason to focus on that.  

Too many vCPUs will not be directly detrimental to the individual SQL VM. But the problem is that all vCPUs for a single VM must be scheduled together. So if you have 12 cores and two VMs that each have 8 vCPUs, then only one VM can be running at once. This would be detrimental to everything on that host, unless your application really needs 8 vCPUs on each VM. I don't know if I would go as far as hanccocka. Starting with just one vCPU for each VM until they show they need more would be the way to get the best use of host resources, but I would start with 2 or 4 (on a 12-core host) for any server I expect to be under any significant load.
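
To make the 12-core example concrete, here is a toy model of the "all vCPUs scheduled together" behaviour described above. It assumes strict co-scheduling; as far as I know the real ESXi scheduler is more relaxed than this, so read it as an illustration of the argument and not as how ESXi works. The function name `vms_runnable_together` is made up for the sketch.

```python
from typing import Sequence


def vms_runnable_together(host_cores: int, vcpus_per_vm: Sequence[int]) -> int:
    """Count how many VMs can hold all of their vCPUs on cores at once.

    Assumes strict co-scheduling: a VM only runs if every one of its
    vCPUs gets a physical core at the same moment.
    """
    free_cores = host_cores
    running = 0
    # Placing the smallest VMs first maximises how many fit.
    for vcpus in sorted(vcpus_per_vm):
        if vcpus <= free_cores:
            free_cores -= vcpus
            running += 1
    return running


print(vms_runnable_together(12, [8, 8]))  # 1: the second VM has to wait
print(vms_runnable_together(12, [4, 4]))  # 2: both fit side by side
```

Under this model, dropping each VM from 8 to 4 vCPUs is what lets both run at the same time on a 12-core host.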

In any case, though hanccocka is right to point out you have too many vCPUs on your VM, there is no particular reason to think this is the cause of your slowness.  There's no way anybody is going to identify the problem without a lot more information.
see my EE Article

HOW TO:  Performance Monitor vSphere 4.x or 5.0

and SQL can sometimes behave poorly under any Hypervisor! (compared to a physical host!).
indigo6

ASKER

Ok, I have cut back the number of vCPUs to 4. I will monitor it and report back! Thanks everybody!
also if you are using the E1000 NIC, replace it with the VMXNET3 interface.
indigo6

ASKER

Ok, I have implemented both of your suggestions. Thank you. Just curious, is the VMXNET3 better?
E1000 is only used for legacy implementations.

VMXNET3 uses less overhead than the emulated E1000!

E1000 should not really be used for production installations!
indigo6

ASKER

I made the changes, and there are definitely improvements! I also now understand more about the scheduler, how it waits for all CPUs to be available, etc... So thanks! However there are still a few issues with slowness when certain operations are taking place on the client side. Also, I have a message in the log that says: "Device naa.6d4ae52099b5660017a19dff134071d7 performance has deteriorated. I/O latency increased from average value of 958 microseconds to 35966 microseconds."
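
For scale, the figures in that log message can be converted to milliseconds. The snippet below is just arithmetic on the quoted line; the regular expression assumes the wording shown above and the function name `latency_change_ms` is made up for illustration.

```python
import re
from typing import Tuple

LOG_LINE = (
    "Device naa.6d4ae52099b5660017a19dff134071d7 performance has deteriorated. "
    "I/O latency increased from average value of 958 microseconds "
    "to 35966 microseconds."
)


def latency_change_ms(message: str) -> Tuple[float, float, float]:
    """Return (baseline_ms, current_ms, ratio) from a latency warning line."""
    match = re.search(
        r"average value of (\d+) microseconds to (\d+) microseconds", message
    )
    if match is None:
        raise ValueError("not a latency warning in the expected format")
    baseline_us, current_us = (int(group) for group in match.groups())
    return baseline_us / 1000, current_us / 1000, current_us / baseline_us


baseline, current, ratio = latency_change_ms(LOG_LINE)
print(f"{baseline:.3f} ms -> {current:.3f} ms ({ratio:.1f}x)")
# 0.958 ms -> 35.966 ms (37.5x)
```

So the warning describes a jump from just under 1 ms to about 36 ms, which is in the same range as the read latency peaks mentioned in the original question.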

Might there also be some SQL optimizations that I'm missing out on?
indigo6

ASKER

Hello again, not to fill the thread with my own comments, but perhaps this is a SQL issue. While the latency on the host appears to be high at certain times, I did some synthetic benchmarks with ATTO and the new server's storage is anywhere from 2X to 26X faster. (I've attached the benchmark screenshots, first is the old, second the new.) Now, I don't know the ins and outs of SQL I/O, and I know a synthetic benchmark doesn't prove anything definitively, but I think it may be a database issue... Again, I really appreciate all the helpful insights!
APPSRV.PNG
APPSRV-NEW.PNG
- How big is the database?
- What are the min and max memory settings for the SQL server instance?
- What is the page-get rate on SQL when the application is slow?  And what is the buffer pool hit ratio?
- What is the disk transfers/sec and average secs/transfer on the SQL server Virtual disks at a time when slowness is occurring?
indigo6

ASKER

There are two databases, the main one is about 9GB, and the other is about 25MB.
Max memory says: 2147483647MB, which I left at the default.
I'm unsure, but I will test, how can I log that info?

One more thing: since this is now more of a SQL question, would it be better to split it off into another question? Thanks!
Well, with 12GB RAM and a database of only 9GB, the problem is most likely not due to SQL disk reads.  You have enough memory to buffer the entire database!!  Unless the problem only occurs once, and then when you retry it is faster??

It could be that the new hardware will make your problem go away.  Or that the problem is not SQL at all.  It could be in your application code.  But I do agree this is no longer an ESX question.
indigo6

ASKER

So will SQL load the entire database into memory? It is slower the first time the operation takes place, but even after that it is slower than the old server. These databases are for a custom software package, so I will contact the vendor as well.
It seems I misunderstood.  You mean this slow performance is occurring on the new hardware?  I thought it was still on the old hardware.
indigo6

ASKER

Yes, that is correct, the new VM is having the issue.
if the old server was physical, and ran faster than this new virtual machine, we have seen this many times with SQL, and we have moved SQL databases back to physical servers.

hypervisors take too many cpu cycles away causing poor performance.
indigo6

ASKER

Ok, I'll see what I can do. Can a 9 year old physical server really be faster? I'll be in contact with the vendor of our custom software to see if they have any insights. I may try putting this server on native hardware if it comes down to it. Thanks for all the helpful insights!
depends on clock speed and number of cycles....

everything is virtualised in a VM.
indigo6

ASKER

I see, I see. The old server has two of these Xeons, and 3GB of RAM. The new one has two of these Xeons. The processor usage doesn't seem that high on the new server...
You should be able to get better performance on VMs on new hardware.    

Maybe you should tell us more about the workload on the host server: can you give us a list of all VMs on the host, and the resources (vCPUs, memory) assigned to each?  Also total memory on the host?

I hope you have not overcommitted RAM....
Is your SQL server running anything other than the SQL database engine?  The current (default) memory settings will allow SQL to use all the VM's memory for database buffer pool, which can kill performance of any other service running on that VM.  You should probably set the max memory for SQL to 10GB.

and ask your application supplier if they have any recommendations for the MAXDOP setting in SQL (which affects how SQL uses multiple processors).
here is a test to try if you can, turn OFF ALL other VMs on the host.

just run the SQL VM, test performance.

Check and compare results.

We have used a bare metal dual processor quad core 3GHz with 32GB, with a hypervisor, both ESXi and Hyper-V,

VM of Win2k8, SQL 2K8, 500MB database, 250 concurrent clients, when a query was run on the 2 vCPU 8GB VM, it flatlined the CPU for 13 mins...

same setup but on bare metal no hypervisor, reduced to 8GB only, query instantaneous....

we moved sql vm back to physical hardware......
indigo6

ASKER

The Host only has 3VMs. A file server, A Windows 7 Client that I'm using to test the DB, and the SQL server. Memory total is 32GB, and is not overcommitted. The SQL server is also serving the applications, but the old hardware was doing that as well. I will run these tests and report back. I am in contact with the vendor as well.
All right, if the application is on the same VM as SQL and the only other VM is the Win7 client, then I recommend:
- Set SQL Server MAX Memory to 8GB
- 4 vCPUs for the server, 2 vCPUs for the client
Test performance.  If still slow
-  check server "available memory" in task manager.
      -   If less than 500MB, decrease SQL MAX Memory
If you still have a problem
-  try setting SQL Server MAXDOP option to 1
     (ask application vendor if they have a recommendation)

Right now, my best guess is that you are getting memory contention between SQL and the application.  SQL is configured to take all the available memory by default.  I am also guessing that your old server is 32-bit, which would have ensured SQL could not take more than 2GB and would have still left 1GB for the OS and application.
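
The memory reasoning above comes down to simple arithmetic, sketched below. It is a deliberately simplified model: it treats the SQL max memory setting as a hard ceiling on everything SQL Server uses, which is an assumption, and the names `headroom_mb` and `should_lower_sql_cap` are made up for illustration. The 500MB threshold is the one from the checklist above.

```python
# Default "max server memory" value reported earlier in this thread (2**31 - 1 MB),
# which in practice means "no limit".
UNLIMITED_MB = 2_147_483_647


def headroom_mb(vm_ram_mb: int, sql_max_mb: int) -> int:
    """Memory left for the OS and application if SQL grows to its cap."""
    # SQL cannot take more than the VM actually has.
    effective_cap = min(sql_max_mb, vm_ram_mb)
    return vm_ram_mb - effective_cap


def should_lower_sql_cap(available_mb: int, threshold_mb: int = 500) -> bool:
    """Checklist step: lower SQL max memory if available memory is too low."""
    return available_mb < threshold_mb


print(headroom_mb(12 * 1024, UNLIMITED_MB))  # 0: new VM with default settings
print(headroom_mb(12 * 1024, 8 * 1024))      # 4096: new VM with an 8GB cap
print(headroom_mb(3 * 1024, 2 * 1024))       # 1024: old 32-bit server as guessed
```

With the default setting nothing is reserved for the application on the new VM, whereas the guessed 2GB limit on the old server would always have left it about 1GB.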

Regardless of the example cited by hanccocka (from which no lessons can be drawn without a lot more detail), in your case I am certain that you can get more performance out of a VM on the new hardware than you ever got out of the old hardware.  Yes, running it on bare hardware will always be a little faster, but hypervisor overhead is not going to wipe out 9 years of hardware upgrade effect.  Not even close.
SOLUTION
This solution is only available to members.
indigo6

ASKER

After looking through everything, I don't think VMware is the cause of the issue. I'm going to try and vet this out with SQL and the vendor. I do appreciate all the good ESXi optimization advice! Thank you!
Did you try what I suggested for SQL memory settings?  Because you are running the application on the same server, you really need to limit SQL memory usage.
indigo6

ASKER

I did, I also tried playing with the MAXDOP setting, to no avail. I'm working with the software vendor to resolve the issue. Thanks!