menreeq asked:

file I/O not good on my ESXi server

I am running VMware ESXi 3.5 on a server with RAID 5 on 7200 RPM SATA drives (I know, SAS/SCSI is faster).

When I try to copy a 4GB test file to one of my VMs (tried multiple VMs, on both XP and Server 2003), it moves along quickly until about halfway through, where it stalls.  The disk graph for the VM still shows high disk activity for a few more minutes and then eventually tapers down.

I have been playing with adding different counters and noticed that even when the average disk usage drops, the "physical disk write latency" stays high, much higher than when I first started the copy.

Then, after 5-10 minutes during which the transfer appears stalled, it resumes and finishes the file transfer.

Any thoughts on how I can further troubleshoot the bottleneck on my server?
Paul Solovyovsky:

Are your VMs on local storage or SAN/NAS?

What type of network connection do you have, 1Gb or 100Mb?
menreeq (Asker):

Local storage, 1Gb network.
Try using IOMeter to test I/O.  Here's a good how-to:

http://articles.techrepublic.com.com/5100-10878_11-5735721.html
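
(Before setting up IOMeter, a crude sequential-write check inside the guest can at least reproduce the stall pattern. The sketch below is only a rough sanity test, not a substitute for IOMeter; the file path and sizes are arbitrary examples.)

import os, time

# Crude sequential-write check: writes a 1GB file in 1MB chunks and prints the
# running average throughput every 100MB, so you can see roughly where it stalls.
PATH = r"C:\iotest.bin"    # hypothetical path on the VM's datastore-backed disk
CHUNK = 1024 * 1024        # 1MB per write
TOTAL = 1024               # 1024 chunks = 1GB total

buf = os.urandom(CHUNK)
start = time.time()
with open(PATH, "wb", buffering=0) as f:
    for i in range(1, TOTAL + 1):
        f.write(buf)
        if i % 100 == 0:
            mb = i * CHUNK / (1024 * 1024)
            print(f"{mb:.0f} MB written, {mb / (time.time() - start):.1f} MB/s average")
os.remove(PATH)
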
Your RAID5 stripe size is probably ridiculously large -- what is it?
menreeq (Asker):

dlethe, maybe.  I will have to reboot the machine after hours to access the RAID card and check.  What would you consider "ridiculously large"?
It depends on how many disks are in the RAID5.  Let's say you set it to 64KB, because most people make the mistake of thinking more is better.

That means internally, the RAID controller is going to read/write 64KB on each disk whenever it gets a write request.  I don't know how you set up the ESX I/O, but let's assume it writes in chunks of 64KB -- please check into that also, it is critical.

So anyway, your O/S wants to read 64KB worth of data, and you have a 4-disk RAID5.

The RAID controller reads a full chunk from each disk, so it is forced to put 192KB worth of usable data into the controller's cache and, more importantly, to read far more data than it needs and throw the excess away.

If you were just using NTFS, the standard I/O size is 4KB, so in this case you want 4KB of data, but your RAID is forced to read 192KB.   On writes it is much worse, as 64KB gets written to all 4 disks in the RAID.

Another issue: if you have an even total number of disks, your data-disk count is not a power of 2.   Say you want to write 256KB, but you have a 4-drive set, i.e. a 3+1, so you only get 64KB x 3 = 192KB worth of data per full stripe, and your RAID has to go around again for the remainder.  You end up touching 128KB on each disk, i.e. 4 x 128KB = 512KB, when you only want 256KB.

Now I am oversimplifying; some RAID controllers do short reads, meaning they will not read a full chunk from each disk if the pending I/O request is smaller, but on writes they must do the full-stripe write, so on a 256KB write from your O/S, your RAID must write 128KB to each disk drive.

Summary
1. Use 2+1, 4+1, or 8+1 disks in a RAID5 -- any other combination is not as efficient.
2. Enable write-back (write cache) if you have a UPS.
3. Look at the ESX & NTFS allocation/chunk size.   Optimize it so it matches the full-stripe size of the RAID: take the stripe size and multiply it by one less than the total number of disks in the RAID5 set.   That is the value to use for optimal performance and efficiency (see the sketch below for the arithmetic).
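
To make that arithmetic concrete, here is a minimal sketch in plain Python, using the 4-disk, 64KB-stripe example from above. It assumes whole-stripe reads and writes (which, as noted, is an oversimplification), and the function names are just for illustration.

def full_stripe_data(stripe_kb, total_disks):
    # RAID5: one disk's worth of each stripe is parity, so the usable data
    # per full stripe is stripe size x (disks - 1).
    return stripe_kb * (total_disks - 1)

def read_amplification(request_kb, stripe_kb, total_disks):
    # Assume whole-stripe reads of the data disks (no short reads).
    data = full_stripe_data(stripe_kb, total_disks)
    stripes = -(-request_kb // data)          # ceiling division
    return stripes * data / request_kb

def write_amplification(request_kb, stripe_kb, total_disks):
    # Full-stripe writes hit every disk, parity included.
    data = full_stripe_data(stripe_kb, total_disks)
    stripes = -(-request_kb // data)
    return stripes * stripe_kb * total_disks / request_kb

print(full_stripe_data(64, 4))           # 192  -> KB of usable data per full stripe
print(read_amplification(4, 64, 4))      # 48.0 -> a 4KB NTFS read pulls in 192KB
print(write_amplification(256, 64, 4))   # 2.0  -> a 256KB write touches 512KB of disk
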
menreeq (Asker):

dlethe, thank you for the detailed info.  I am afraid some of this is just a bit over my head, but I am looking it up to better understand your points.  Is allocation/chunk size the same as "block size"?  I have the block size for this datastore listed as 1MB.
Also, since I/O stalls and the queue depth is high, that means the RAID is the bottleneck.  While I/O performance looked "good", the I/O was obviously being cached, but at some point the disks got saturated.  This is a classic indication that the stripe size is either too big or too small.  If you match the I/O sizes, throughput is typically much more consistent.

I am also assuming the RAID is not degraded, that you don't have XOR (parity) errors, and that the disks don't have a bunch of bad blocks that are having to be reallocated.  The only way to determine that is to run RAID diagnostics and/or look at the controller's event log, if it has one.  The cheapo RAID controllers don't give you such things.

Finally, SATA disks have NCQ.  There are LOTS of bugs with NCQ, and many controllers turn it off by default.  That means the disk drives won't queue up I/Os internally and reorder them for efficiency.  Don't even consider enabling NCQ unless your RAID vendor has qualified your disks running your firmware; there are known sev-1 bugs with many SATA chips and drive firmware that cause data loss when this feature is enabled.   Enabling NCQ will typically double write performance in a transactional environment, so look into that setting as well.
Yes, they are the same.   A 1MB block size is bad, bad, bad.   That means ESX caches up I/Os until it has 1MB worth to do, or it runs out of cache and has to flush.  This is probably the root cause of all of your problems.  The block size should match the equation I mentioned.  At the very least, even if you can't shut down, drop the block size to something much smaller, like 128KB; you should see a nice performance increase.  Then you can fine-tune it later.
menreeq (Asker):

dlethe, I am reading that many people say write performance on RAID 5 is a known issue and that RAID 10 is a better option.  What do you think?  If it does turn out that my stripe size is too large, maybe I should just take the time and change to RAID 10?
menreeq (Asker):

I am running ESXi and don't see an option for adjusting the block size; maybe that's not available in ESXi.  Do you know?
menreeq (Asker):

Found the following link on changing the block size; it appears that 1MB is the smallest size:

http://www.petri.co.il/forums/showthread.php?t=27413

Right?
Those bozos who generalize about how great RAID10 is really get to me.  First, yes, RAID10 will almost always be "better" than RAID5; the reason is the "RAID5 write penalty", which you can Google.  But a grossly incorrect RAID10 stripe size against an optimally configured RAID5 array can make that RAID5 outperform the RAID10.  Those "many people" never bother to mention that.

Also, one can easily configure a pair of RAID1s to outperform a single RAID10.

In any event, the best approach is to configure it properly, starting with the RAID controller's stripe size.  In your case, since you mentioned RAID10, do you have 4 disks?   If so, RAID10 or a pair of RAID1s is in order, not RAID5.
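
For reference, the "RAID5 write penalty" is usually modeled as roughly 4 disk I/Os per small random write (read old data, read old parity, write new data, write new parity) versus 2 for RAID10 (one write to each side of the mirror). Here is a back-of-the-envelope sketch, assuming that standard model and a nominal ~80 IOPS per 7200 RPM SATA drive -- both of those figures are assumptions, not measurements from this server:

def usable_write_iops(disks, iops_per_disk, level):
    # Rough random-write capacity using the usual write-penalty factors:
    # RAID5 = 4 back-end I/Os per write, RAID10 = 2.
    penalty = {"raid5": 4, "raid10": 2}[level]
    return disks * iops_per_disk / penalty

print(usable_write_iops(4, 80, "raid5"))    # ~80 random-write IOPS
print(usable_write_iops(4, 80, "raid10"))   # ~160 random-write IOPS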

menreeq (Asker):

OK, I just rebooted this machine at lunch and checked the RAID; the stripe is set to 1MB... that's likely my issue, right?  When I rebuild this over the weekend, what size do you recommend?
ASKER CERTIFIED SOLUTION
David:

(solution text available to Experts Exchange members only)
menreeq (Asker):

dlethe, you were right.  I did some testing, and reducing the stripe size on the RAID resolved the issue and drastically increased the performance of my VMs.  Thanks!

Can you confirm: if the stripe size on my RAID 10 is 256KB, then I want to set the ESXi & NTFS block size to 1MB, is that right?
This is a fuzzy area, as I do not know the internals of ESXi well enough to know whether it will do partial reads.  That is, let us say there is only one I/O request, for 64KB.   Is ESXi smart enough to do only a 64KB read instead of a 1MB read?   If you can find out the answer to that, it may improve performance slightly to go to 512KB, but no matter what, you would want the minimum block size for ESXi, which is 1MB.
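
Following the earlier sizing logic, here is a minimal sketch for the RAID10 case. It assumes a 4-disk RAID10 (two mirrored pairs striped together), which is only a guess at the layout:

def raid10_full_stripe_data(stripe_kb, total_disks):
    # RAID10 stripes across mirrored pairs, so the data stripe width is half
    # the disk count; mirroring doubles the writes but adds no parity work.
    return stripe_kb * (total_disks // 2)

print(raid10_full_stripe_data(256, 4))   # 512 -> KB of data per full stripe
# 1MB (the ESXi minimum block size) is an exact multiple of 512KB, so 1MB
# blocks line up with full stripes, assuming the partition itself is aligned.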

Menreeq,
If you need to check your VMs' partition alignment, you can use VM Align Check, available on this tools ISO -> http://www.kendrickcoleman.com/index.php?/Tech-Blog/vm-advanced-iso-free-tools-for-advanced-tasks.html
Regards,
Pascal