mkalugotla

asked on

Low Performance of ESXi 4.1 Server on Dell M710

Hi,
We have a Dell PowerEdge M710 blade server with 144GB of RAM and four SATA disk drives of 600GB each. I don't remember the RAID level used, but the total disk space available for use is 1.63TB. We have installed ESXi 4.1 with the free license. After creating 10 virtual machines (VMs) with 4GB of RAM each, and using only 6 of them, the ESXi host has become very slow. All the VMs are extremely slow; even small pieces of software take a huge amount of time to install. I have heard this may be caused by I/O processing and may also be related to the RAID setup. Can anyone help identify what the issue might be?

thanks
IanTh

How many cores have you got?

Are the SATA drives on a hardware RAID controller? If so, which RAID controller are you using?
You are probably running RAID 5: 4 x 600GB is 2400GB, and RAID 5 across four drives gives up one drive's worth of capacity to parity, which leaves about 1800GB and shows up as roughly 1.63TB (a quick worked calculation follows below).
Also, I have noticed that a Windows VM takes more ESX processing slices than a Linux one.
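As a quick sanity check on that capacity figure (a rough calculation only; real controllers reserve a little extra for metadata, so the reported number is slightly lower):

# RAID 5 capacity check for 4 x 600 GB drives.
DRIVES = 4
DRIVE_GB = 600                                   # "marketing" GB, i.e. 10**9 bytes

raw_bytes = DRIVES * DRIVE_GB * 10**9
usable_bytes = (DRIVES - 1) * DRIVE_GB * 10**9   # RAID 5: one drive's worth goes to parity

print(f"raw:    {raw_bytes / 10**12:.2f} TB (decimal)")
print(f"usable: {usable_bytes / 10**12:.2f} TB (decimal)")
print(f"usable: {usable_bytes / 2**40:.2f} TiB (what the client reports as 'TB')")

The usable figure comes out to about 1.64 TiB, which lines up with the 1.63 TB reported and supports the RAID 5 guess.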
Are you sure those aren't SAS drives? I can't see spending a small fortune on RAM and going cheap on the drives. If they are SATA, I would get those changed out.

What do the performance counters on the VMs look like? Is there a lot of CPU activity? Have you assigned more vCPUs than necessary to the VMs?

Could be the storage driver too.
ragnarok89

Possible causes for poor performance:

- Not enough CPUs in the blade server
- Thin-provisioned VM disks that are trying to grow, but not enough disk space in ESX
- Not enough RAID 5 arrays (5 VMs on a single RAID array might already be too much,
   never mind 10 VMs)
You didn't mention what roles these VM servers perform. Anything with high I/O? This sounds like disk I/O and latency to me.

Take a look at it in ESXi by going to the performance tab and changing the chart options to Disk. In the counters window, enable the I/O latency counters.

You want latency to stay below 20 ms. It is normal for it to bounce above 20 occasionally, but if it is sustained above 20 that is more than likely your problem.
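To illustrate the "sustained above 20" rule of thumb, here is a minimal sketch; the sample values and the two-minute window are made up for illustration, and it assumes you have latency samples (one per 20-second chart interval, in milliseconds) pulled from the performance chart:

# Occasional spikes above the threshold are normal; a long run above it is not.
def sustained_above(samples_ms, threshold_ms=20, min_consecutive=6):
    """True if latency stays above threshold_ms for min_consecutive samples
    in a row (6 samples at 20-second intervals = about 2 minutes)."""
    run = 0
    for value in samples_ms:
        run = run + 1 if value > threshold_ms else 0
        if run >= min_consecutive:
            return True
    return False

latency_ms = [4, 7, 35, 9, 26, 31, 44, 52, 38, 29, 61, 12]    # hypothetical samples
print("sustained high latency:", sustained_above(latency_ms))  # True for this series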
The VM performance issue in your case is caused by the SATA drives and the RAID level (but most likely by the SATA drives). SATA is not recommended for that many virtual machines on that RAID level with that number of disks. It also depends on the type of services running (for example Exchange and SQL with their transaction logs, plus Active Directory and maybe SharePoint). That will degrade your performance dramatically.

I would recommend either moving to RAID 1+0 (which means data loss: reformatting your RAID, rebuilding the machines from backup, etc.) or buying storage with SAS or FC disks and migrating the servers to it.

Sorry for the bad news.
mkalugotla

ASKER

It's not SATA. What I have is: 4 x 600GB 2.5-inch 10K RPM, 6Gbps SAS hot-plug hard drives.

Configured with RAID 5, so logically I have 1.63TB of drive space.

Also, I am creating Windows VMs.

CPU Cores : 12 CPUs X 2.659 GHz
Physical Processors : 2
Logical Processors : 24
You said SATA, not SAS; there is an I/O difference between the two.

I expect your CPU cores are maxed out. You can see this on the performance tab in the vSphere Client when you select the ESX server.
So how many vCPUs have you added to the VMs?
// So how many vCPUs have you added to the VMs?
How can I check this? Please let me know.
In summary:
How many CPUs assigned per VM?
Did you thin provision the disks?
What do your CPU and Disk Counters reveal?
What are the functions of the Windows servers?
Go into the settings of the VM.
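The VM settings dialog is the direct way to check. As an alternative, if you prefer to script it, a minimal pyvmomi sketch along these lines can list every VM's configured vCPU count; the host name and credentials are placeholders, and it assumes the pyvmomi package is installed and the host's API is reachable:

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()       # skip certificate checks for a lab host
si = SmartConnect(host="esxi-host", user="root", pwd="password", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        cfg = vm.summary.config
        print(f"{cfg.name}: {cfg.numCpu} vCPU, {cfg.memorySizeMB} MB RAM")
    view.DestroyView()
finally:
    Disconnect(si)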
The Summary of each VM shows: 1 vCPU.
I chose 'Typical VM Creation', not Custom.

Yes, I chose thin provisioning for the disks.

The performance tab was showing data, but after powering on around 6 VMs and installing products like SharePoint, Active Directory, IIS and Exchange, the performance tab now shows 'performance data is currently not available for this entity'.

The Windows Server functions are installing SharePoint, Exchange, an anti-virus product, etc.
So the ESX server is being maxed out, which is more than likely all the installs; that is really a one-off load when you do an install.
Can you please suggest a solution?
Do I need to increase the number of virtual processors when creating a VM?
Well, can't you just let the installations finish first?
They are finished.
What, and the performance tab is still not responding?
I have attached the performance data (screenshot attached).
ASKER CERTIFIED SOLUTION
bgoering
How can I check these from an SSH console? Please let me know.
You can easily tell if you have the battery module by looking at the health tab (see below). If it is present, there is still no easy way to determine whether the configuration is set for write-back. You would have to go into the controller setup at boot time (F8 maybe? There will be a message about which key to press) to see and/or change the cache configuration.

[Screenshot: ESXi health tab showing the storage controller status]
The performance tab can show datastore, CPU and RAM data.

Also, is your RAID battery still charging? That can affect the RAID controller's performance.
Seems BBWC is not installed.
I agree I/O is likely the cause, although I doubt it is battery related. My wild guess is that when you installed your AV software, it started a full scan and you had 6 full virus scans all running at the same time.
Hmmm - I haven't actually used the CERC controller - but from your screenshot it doesn't appear to have a battery. You might want to review http://support.dell.com/support/edocs/software/smarrman/marb35/en/controll.htm in the section on caching.

"Write policy for PERC 2, 2/Si, 3/Si, 3/Di, and CERC SATA1.5/6ch controllers
Write Cache Enabled. When the write cache is enabled, the controller writes data to the write cache before writing data to the array disk. Because it takes less time to write data to the write cache than it does to a disk, enabling the write cache can improve system performance. Once data is written to the write cache, the system is free to continue with other operations. The controller, in the meantime, completes the write operation by writing the data from the write cache to the array disk. The Write Cache Enabled option is only available if the controller has a functional battery. The presence of a functional battery ensures that data can be written from the write cache to the array disk even in the case of a power outage.

Write Cache Disabled. This is the only available option if the controller does not have a functional battery. "

CERC is a built-in PERC, isn't it?
The PERC 6/i is the integrated, built-in PERC.

You might want to try a disk throughput tool like Disk Bench (http://www.nodesoft.com/diskbench/) in order to actually measure your disk performance (a rough Python equivalent is also sketched after my results below).

For my R710 (a rack version of your M710) that I posted the health screen from, I get results like the following on a SATA RAID 5 array - with SAS disks yours should be a bit better.

10 MB Create File Bench
Starting Create File Bench...

Created file: dummy1
  Size: 251658240 bytes
  Time: 1438 ms
  Transfer Rate: 166.898 MB/s

Create File Bench ended

Create File Batch (with all default settings)
Starting Batch Create File Bench...

48 MB; C:\Users\Administrator.BGDOMAIN\AppData\Local\Temp\1\Test; 50331648 bytes; 281 ms; 170.819 MB/s
52 MB; C:\Users\Administrator.BGDOMAIN\AppData\Local\Temp\1\Test; 54525952 bytes; 297 ms; 175.084 MB/s
56 MB; C:\Users\Administrator.BGDOMAIN\AppData\Local\Temp\1\Test; 58720256 bytes; 297 ms; 188.552 MB/s
60 MB; C:\Users\Administrator.BGDOMAIN\AppData\Local\Temp\1\Test; 62914560 bytes; 313 ms; 191.693 MB/s
64 MB; C:\Users\Administrator.BGDOMAIN\AppData\Local\Temp\1\Test; 67108864 bytes; 219 ms; 292.237 MB/s
68 MB; C:\Users\Administrator.BGDOMAIN\AppData\Local\Temp\1\Test; 71303168 bytes; 359 ms; 189.415 MB/s
72 MB; C:\Users\Administrator.BGDOMAIN\AppData\Local\Temp\1\Test; 75497472 bytes; 359 ms; 200.557 MB/s
76 MB; C:\Users\Administrator.BGDOMAIN\AppData\Local\Temp\1\Test; 79691776 bytes; 266 ms; 285.714 MB/s
80 MB; C:\Users\Administrator.BGDOMAIN\AppData\Local\Temp\1\Test; 83886080 bytes; 391 ms; 204.604 MB/s
84 MB; C:\Users\Administrator.BGDOMAIN\AppData\Local\Temp\1\Test; 88080384 bytes; 391 ms; 214.834 MB/s

Create Batch File Bench ended

Let me know your results
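If you would rather not install a separate tool, a rough Python equivalent of the Create File Bench above could look like the following; the file path and size are placeholders, and a simple test like this only exercises sequential writes, not the mixed random I/O a busy VM generates:

import os, time

TEST_FILE = "dummy1.bin"        # put this on the volume you want to test
SIZE_MB = 240
CHUNK = 1024 * 1024             # write in 1 MB chunks

buf = os.urandom(CHUNK)
start = time.perf_counter()
with open(TEST_FILE, "wb") as f:
    for _ in range(SIZE_MB):
        f.write(buf)
    f.flush()
    os.fsync(f.fileno())        # make sure the data really left the OS cache
elapsed = time.perf_counter() - start

print(f"wrote {SIZE_MB} MB in {elapsed * 1000:.0f} ms ({SIZE_MB / elapsed:.1f} MB/s)")
os.remove(TEST_FILE)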
Possibly bad news... this page (http://stuff.mit.edu/afs/athena/dept/cron/documentation/dell-server-admin/en/Perc6i_6e/chapterb.htm) indicates the BBWC isn't available on the CERC controller. That function is pretty much a must for decent VM performance...
Does this mean the CERC controller will have bad performance with ESXi?
@mkalugotla
No, your performance should be acceptable, but it could be better with write caching enabled. Your performance should be OK right now, correct? Now that the installations, patching and AV scans have completed?
You will probably need to avoid things like scheduled AV scans all kicking off and running at the same time; instead, stagger the schedules and make sure they run in off hours. Also, on the new servers you add, if you have the disk space, try fixed-size volumes. If you have many different VMs all trying to expand their volumes at the same time, it will impact performance. Also, stagger things like automatic update reboot times.
In addition, 10 servers, especially things like Exchange or other potentially high-I/O systems, running on just 4 drives can be a lot to ask. You would get better performance by spreading the load over more disks, preferably using a SAN.
I wouldn't panic though; you are just learning the ins and outs of running in a virtual environment. I would recommend increasing the vCPUs assigned to the VMs to 2 as well (at a minimum, depending on usage).
One thing I have done in the past is to "pre-expand" thin volumes by copying large files to the volume, then deleting them. You might copy about 10-15GB worth of data to each server and then delete it to free up that space. That seemed to work for me (a rough sketch of the idea follows below).
Your environment just needs a little "fine-tuning" to get the most out of it.
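For what it's worth, the "pre-expand by copying then deleting" trick above can be scripted; this is only a sketch with a placeholder path and size, run inside the guest on the thin-provisioned volume:

import os

# Write a large throwaway file so the thin VMDK grows now instead of under
# production load, then delete it. The VMDK does not shrink back when the
# guest file is removed, which is the point of the trick.
FILLER = r"C:\temp\preexpand.tmp"    # any path on the thin-provisioned volume
GROW_GB = 10
CHUNK_MB = 64

data = b"\xff" * (CHUNK_MB * 1024 * 1024)   # non-zero data so blocks really get written
with open(FILLER, "wb") as f:
    for _ in range(GROW_GB * 1024 // CHUNK_MB):
        f.write(data)
    f.flush()
    os.fsync(f.fileno())

os.remove(FILLER)                    # the space is free again inside the guest
print(f"pre-expanded roughly {GROW_GB} GB of the underlying thin disk")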
Yes, things like Exchange are not really required to learn ESX/vSphere; just a DC and a vCenter server.

You will need an iSCSI solution to set up an ESX cluster, as you will not be able to learn HA, vMotion and DRS without a cluster. I use Openfiler for that; my virtual ESX hosts connect to iSCSI on the Openfiler box.

Also, setting it up with virtual ESX servers makes the cluster easier to set up, as the cluster members have to be identical.
I would probably find performance to be unacceptable without write-back caching. However, it all depends on you, your expectations and your needs.

I would take issue with the previous recommendation to increase the vCPU count to at least two - I never exceed a single vCPU unless the demands of the application dictate allocating more than that. Workloads that run fine on one vCPU will actually suffer a performance hit if you allocate more than one, because of the overhead of managing and scheduling CPU in a virtual environment, as well as the overhead within the guest OS of managing SMP.

As noted by another expert, to fully exploit virtualization capabilities you will need a SAN or NFS for shared storage between multiple ESX(i) systems, as well as some form of paid licensing (as opposed to the free license) to enable the more advanced features. If you are just starting your evaluation of virtualization, local storage is fine - but for a fair indication of how it will perform I would make the additional investment in a RAID controller with BBWC for your disks. From what I can determine, that option is not available on a CERC 6 controller. You can get the PERC 6 controller with the BBWC module as another card that you plug in - the investment for that would be minimal considering the investment already made in the M710 server.

Good Luck
Hi,

I have enabled the 'Force WB with no battery' option in the RAID configuration and am verifying. So far I have not seen any performance issue.
I don't think you will; a card with a BBWC, like a PERC, will be faster. The 'force WB with no battery' setting is there for compatibility reasons, not performance.
@mkalugotla - I wouldn't leave that setting in place for any production load, or for data on the drives that you want to keep. It might be OK for testing, but be aware it does put the data on the drives at risk should you have a power failure.

Did you ever get those disk bench numbers? It might be interesting to see the before and after results of your setting change.
The system is configured with RAID 5, but with 'Force WB with no battery' enabled.
Is the data still at risk? If a power failure happens, could the entire VMs and VMDK files get deleted? That would mean we are at even more risk.

Pretty much any or all of the data on the RAID volumes can be at risk. The difference between write-through (the standard with no battery) and write-back is this:

Write-through - the data is actually committed and written to disk before the OS gets notification that the I/O has completed.

Write-back - the OS is notified that the I/O is completed as soon as the controller has the data in cache, which happens a lot faster. If a power loss occurs while data is in cache but not yet on disk, the results will be unpredictable; the OS may have done many I/O operations that have not been committed to disk.

Typically with write-back, if data that has been written is still in cache, a read request for that data can also be served from cache without actually going to disk.
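A rough OS-level analogy for that difference (this uses the operating system's page cache rather than a RAID controller cache, so the numbers are only illustrative, but the trade-off between speed and what survives a power loss is the same idea):

import os, time

PATH = "cache_demo.bin"
BLOCK = os.urandom(4096)
WRITES = 2000

def run(label, force_to_disk):
    start = time.perf_counter()
    with open(PATH, "wb") as f:
        for _ in range(WRITES):
            f.write(BLOCK)
            if force_to_disk:
                f.flush()
                os.fsync(f.fileno())   # "write-through": done means it is on disk
    print(f"{label}: {time.perf_counter() - start:.2f} s")

run("cached writes  (write-back-like)  ", force_to_disk=False)
run("fsynced writes (write-through-like)", force_to_disk=True)
os.remove(PATH)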
Also, the more cache the better - often when BBWC is installed the amount of cache goes up too. Did this change take care of your performance issue? That was actually the topic of this question.
But I don't have BBWC, because the hardware has a CERC.
If I understand correctly, if a power failure happens and data is in the cache, only that data will be lost. All the other data, like snapshots and VMX files, will not be affected.
If I have the option to revert back to snapshots, even after a power failure, then I am safe.
And the Dell server is on a corporate UPS system for backup power.
Thank you very much for the valuable information.
I still wouldn't consider it to be "safe" - suppose, for example, you lost the directory information for the volume -- after that you may not be able to find anything on the datastore! No VMDK or snapshot files. I would still highly recommend you obtain another controller to replace the CERC, and that it is properly equipped with BBWC.
Just set the VMs to auto-snapshot if you have enough space, I suppose.