VM virtual servers excessive reboot times

SBSIAdmin asked:
We're relatively new to VMware.  We have two physical boxes running the latest version of free ESXi.  The servers are new HP ProLiant ML350 G6s configured with RAID-5, with 32GB of RAM.  Plenty of horsepower, I would think.  Each physical box is hosting 3 virtual servers, none of which have excessive load; they're basically file servers.  The virtual servers are all running Win2K3 R2 and each is allocated 4GB of RAM.

We've noticed that if we initiate a Shutdown/Restart on a virtual server, it takes upwards of 30 minutes to finish the reboot cycle.  Compare this with a similar non-VM Windows server, which takes 4-5 minutes.  If we pull up the console via the vSphere Client, we stare at a gray screen for 20+ minutes.  Often we end up powering off the instance in question just because it takes too long to wait for it to finish.  Both physical boxes and all six virtual servers are exhibiting these symptoms.  I'm thinking we must be missing some VM-101 setting that we're unaware of.

Any ideas??
ASKER CERTIFIED SOLUTION
bgoering (United States of America)
[member-only content not shown]
SBSIAdmin (ASKER):
Yes.  We originally thought that was it, but no luck.
SOLUTION
[member-only content not shown]
Thanks for the response bgoering.  Those sound like promising suggestions.  It's about 10pm EST for me right now and I may not get to your suggested changes until Monday.
The servers were native builds as opposed to P2Vs.
I'll try the memory trick and also get the specs on the RAID controller.  I'm guessing the RAID controller is fairly basic, since it's the embedded controller that comes with the server on the motherboard.
I'm looking in the vSphere Client and don't see where to obtain the disk latency numbers.  On [Performance] I can select [Disk], but I don't see anything for storage path or storage adapter.
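(A side note for anyone else hunting for those counters: they're easier to find from the host itself with esxtop, assuming you can get a shell on the box, which on that era of free ESXi means enabling Tech Support Mode.  A rough sketch:

esxtop          start the interactive monitor on the ESXi console
d               switch to the disk adapter view
u               or switch to the disk device view

Watch the DAVG/cmd (device latency), KAVG/cmd (kernel latency), and GAVG/cmd (total guest latency) columns.  Sustained DAVG values in the tens of milliseconds would point at the storage, which fits the write-cache theory that comes up later in this thread.)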
I did just peruse the [Events] tab and was reminded of a message that I had seen in the past.
Message from VM1:
"Insufficient video RAM. The maximum resolution of the virtual machine will be limited to 1176x885 at 16 bits per pixel. To use the configured maximum resolution of 2360x1770 at 16 bits per pixel, increase the amount of video RAM allocated to this virtual machine by setting svga.vramSize="16708800" in the virtual machine's configuration file."

I wonder if this could be adding to the reboot times.
Did you set a memory limit (edit settings | resources, click on memory)?  If so, change it back to unlimited.

This sounds more like a memory contention/limitation issue rather than an I/O issue.  You might also try uninstalling VMware Tools to see if the balloon driver or another component is interfering somehow.  (It's not clear from your earlier reply whether you meant that you didn't have Tools installed, or had just thought to check that they were installed.)

Are you over-committed on memory?  What is your host memory consumption looking like?
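A quick way to answer both questions across all the guests without clicking through each one is a short pyVmomi script.  This is only a sketch: the host name and credentials are placeholders, and pyVmomi itself is an assumption (any vSphere API client would do).

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim
import ssl

# Placeholder host and credentials -- substitute your own.
ctx = ssl._create_unverified_context()  # hosts of this era use self-signed certs
si = SmartConnect(host="esxi-host", user="root", pwd="password", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem, vim.VirtualMachine], True)
for obj in view.view:
    if isinstance(obj, vim.HostSystem):
        total_mb = obj.hardware.memorySize // (1024 * 1024)
        used_mb = obj.summary.quickStats.overallMemoryUsage
        print("host %s: %d of %d MB in use" % (obj.name, used_mb, total_mb))
    else:
        qs = obj.summary.quickStats
        limit = obj.config.memoryAllocation.limit  # -1 means unlimited
        print("vm %s: limit=%s MB ballooned=%d MB swapped=%d MB"
              % (obj.name, limit, qs.balloonedMemory, qs.swappedMemory))
view.DestroyView()
Disconnect(si)

Ballooned or swapped memory above zero, or a limit lower than the allocated 4096 MB, would confirm the contention theory.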
No, that is just an informational message and does not indicate a problem. If you want to make the message go away take a look at this knowledge base article: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1024990
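(For completeness, the fix is exactly what the event text says: power the VM off, add this line to its .vmx file, and power it back on.

svga.vramSize = "16708800"

Purely cosmetic, and unrelated to the reboot times.)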
SOLUTION
[member-only content not shown]
Where are the VMs located? Are they on local disks or a SAN?
Most of the time, delays on VMs happen at disk I/O. If you have a SAN, try moving your SAN card to a different, higher-numbered slot. If it's local storage, you may want to check the RAID array and hard disks to ensure everything is OK. Next, run esxcfg-rescan to refresh the connection to your disks.
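For example (a sketch; the adapter name is a placeholder, so list yours first):

esxcfg-scsidevs -a     lists the storage adapters (vmhba0, vmhba1, ...)
esxcfg-rescan vmhba1   rescans the adapter your RAID volume hangs off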

K
I will admit, I posted this question on behalf of our senior-level tech, who had been the point person for our VM installs.  He was a newbie as well.  As it turns out, he has left our organization, so our two VM servers are now my responsibility.  Thus the delay since the last posting.
I just did a full reboot on one of our physical boxes.  I wanted to peruse the BIOS because I had also heard that there may be a BIOS setting that needs to be adjusted if the box will be running VMs.  I saw a couple of entries about memory and Intel virtualization, but nothing jumped out at me about VMs per se.
As the machine was POSTing, I noticed a message about there being no battery backup on the RAID controller.  It seemed to indicate that one could be added, but it isn't installed by default.  There was a previous comment about RAID controllers and battery-backed write cache.  There was also a BIOS setting for enabling write caching.  That's currently disabled, and it comes with a warning about potentially losing data if there's a power outage, so I haven't enabled it.  I'm wondering if I should pursue getting the battery module for my RAID controller.
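(For anyone following along: on an ML350 G6 the embedded controller is most likely a Smart Array P410i, and HP's Array Configuration Utility CLI, hpacucli, will report whether a cache module and battery are present.  This is a sketch; since ESXi has no service console to run it from, you would typically boot the host from HP's SmartStart/offline ACU media instead:

hpacucli ctrl all show status          controller, cache, and battery status
hpacucli ctrl all show config detail   array layout plus cache settings

A controller that reports cache status OK but no battery, or no cache module at all, is exactly the write-through situation being described here.)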
Yeah, not having the battery backed write cache enabled definitely does negatively affect storage performance.
Yes, if the guests are hosted on the local RAID, the addition of battery-backed cache will make a huge improvement for disk writes. If the long delay is on the shutdown part of the reboot process, consider what BloodRed indicated above about the clear-pagefile setting; that would add a large amount of time to the shutdown process.

If you shut down and power off one of these VMs, how long does it take to power on and come up?
It was less than a year ago that we set our pagefiles to clear on all servers, as a result of a recommendation from our security consulting firm.  I tried turning that off to see if it would make a difference on the VM servers, but it hasn't.
I was unable to find a specific setting in the BIOS that would "enable" the server to be a hypervisor.
I guess that leaves me with the battery-backed cache option.  I'll have to obtain the exact model of RAID controller, confirm its capabilities, and get a price.
You could give some VMs a full memory reservation and see if that makes a difference, too.  You'll still need to read from disk, but it will preclude the need for a swap file, so it would be a decent test to see if you're going down the right path before laying out the money.
I ran a couple of timed tests tonight to get real numbers.  I rebooted one server instance on each of our two ESX boxes.  Both were quite consistent: 11 minutes to shut down and less than 2 minutes to boot up.  This test was done with pagefile clearing disabled.
Then I reset the GPO to enable clearing of the pagefile on shutdown, ran gpupdate /force, and rebooted again.  The shutdown and restart times were just about the same.  That leads me to believe the pagefile isn't significantly impacting the process.
I'm not certain what a "full memory reservation" means.  The physical boxes have 16GB of RAM in them.  One ESX host is running 3 virtual servers and the other is running 2.  Each virtual server has been allocated 4GB of RAM, so the guests consume 12GB on one host and 8GB on the other, leaving 4GB or 8GB respectively for the hypervisor.
I'm kind of leaning back toward the battery-backed cache.  Maybe the shutdown process is so disk-intensive that we feel the effects, but daily usage as a file server or domain controller doesn't push the disks enough to notice it??
Double check to be sure the GPO update took - take a look at

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management
Value Name: ClearPageFileAtShutdown
Value Type: REG_DWORD

If the value is 1, the pagefile will be cleared; if 0, it will not be cleared.  (A quick way to check is shown below.)
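From a command prompt on the guest, for example:

reg query "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v ClearPageFileAtShutdown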

In any event, in the long run, to achieve satisfactory performance you will want to be able to configure write-back (rather than write-through) cache on your RAID controller. I would definitely get that upgrade.

Two minutes sounds reasonable for a startup time.

So far as the full memory reservation goes - I really don't recommend using that. It is generally better to let ESX manage the memory unless you have very special cases of critical workloads. To set it for a test, go into Edit Settings on your VM, click the Resources tab, and there will be a place to set a reservation. For a full reservation, change it to the amount of memory you have allocated to the VM. What this buys you is that ESX doesn't have to create a swap file to back the RAM on that virtual machine. That is pretty much a low-overhead activity unless you are severely memory-constrained on your host, and it doesn't sound like that is the case. In any event, the overhead of a swap file isn't really incurred until ESX has to swap memory pages out to disk. When that happens there are two writes to disk: one to zero the area, the other to write the memory contents, so that physical RAM can be freed for another VM.

Good Luck
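(If you'd rather script that test than click through each VM, here is a minimal pyVmomi sketch of the same reservation change.  The host, credentials, and VM name are placeholders, and note that the free ESXi license may reject API write operations, in which case the Edit Settings steps above are the way to go.

from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim
import ssl

ctx = ssl._create_unverified_context()
si = SmartConnect(host="esxi-host", user="root", pwd="password", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "VM1")  # placeholder VM name

# Reserve everything the VM is allocated, so ESX needn't back it with a swap file.
alloc = vim.ResourceAllocationInfo()
alloc.reservation = vm.config.hardware.memoryMB  # full reservation, in MB
spec = vim.vm.ConfigSpec(memoryAllocation=alloc)
WaitForTask(vm.ReconfigVM_Task(spec=spec))

view.DestroyView()
Disconnect(si)

Setting the reservation back to 0 afterwards undoes the test.)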
I very much appreciate all of the responses, but unfortunately none have helped to this point.  My next step is going to be battery-backed cache, but that's going to take some time, a purchase, etc.  For the time being I'm going to close this question.  Again, thanks for all the responses.
Unfortunately, we still have the issue, but I'm not able to continue working on it at this time.  I need to get the proper part number for our hardware, determine the cost, get budget approval, make the purchase, install it, yada, yada.  There's no point in keeping the question open in the meantime.  The responses were great, but nothing I tried worked, and the last item to try requires more planning.