We are having an issue with a virtual machine that we have set up at a client site. It was built in ESX 3, converted to run as a stand-alone system under VMware Workstation 5, and then converted back to run on an ESX 3 server. I have included two emails below that outline the issue. Any help on this is greatly appreciated!
Email 1:
After shutting down all of our services, the machine still sits at 2.75-2.81 GB of Page File usage. To see if this was some sort of memory leak, I disabled all of our services and rebooted the machine. When it came up, the PF usage started at around 600 MB and climbed steadily until it hit 2.75 GB. I rebooted again, and it did exactly the same thing. This leaves only 1 GB of RAM for our services to use. We tried changing various server settings to coax it into cutting down on its memory usage, but nothing really changed.
We did suggest locking the kernel in physical RAM to prevent it from being swapped out, which did make the server more responsive. It’s more usable than it was before, but still very slow.
I was able to benchmark its hard drives and they perform as well as those in our test server (they are approx. 10k rpm drives). So this system should rock if we can figure out what’s eating memory. At this point, it sounds like some sort of VM issue. I started the copy of the slice we have here (the secondary slice at our location, as opposed to the client’s) and it does not behave this way. Something is different between our server and theirs.
I put all the services back up and they are running a little better. However, this thing is on the hairy edge of being out of memory, so it’s only going to get slower. I’ll check it again in the morning to see how it’s running.
Email 2: (this was sent after I requested more specific information on the server/system our VM was running on)
Their system is running ESX, and is a six-server cluster. Each server has 2 quad-core processors, a metric ton of RAM, and is connected to a pretty powerful SAN. The server we are on has 4 other VMs. Our VM is set to request 4 virtual processors, one of the others requests 2 and the rest request 1 each, but they were not all running when we were there, so there should be plenty of processor capacity.
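As a quick sanity check on the processor claim, the counts in that description can be added up. This is only a back-of-the-envelope sketch: it assumes the "others" are the three remaining VMs at 1 virtual processor each and that every VM is running at once, which was not the case during the visit. The names `SOCKETS`, `CORES_PER_SOCKET`, `vcpus_requested`, and `vcpu_to_core_ratio` are made up for illustration.

```python
# Physical cores on one host, from the description: 2 sockets x 4 cores.
SOCKETS = 2
CORES_PER_SOCKET = 4

# Virtual processors requested per VM on this host:
# our VM (4), one other VM (2), and the three remaining VMs (1 each).
vcpus_requested = [4, 2, 1, 1, 1]


def vcpu_to_core_ratio(vcpus, sockets, cores_per_socket):
    """Return total requested vCPUs divided by physical cores."""
    return sum(vcpus) / (sockets * cores_per_socket)


physical_cores = SOCKETS * CORES_PER_SOCKET
total_vcpus = sum(vcpus_requested)
ratio = vcpu_to_core_ratio(vcpus_requested, SOCKETS, CORES_PER_SOCKET)

print(f"{total_vcpus} vCPUs requested on {physical_cores} cores "
      f"(ratio {ratio:.3f})")
```

Even in the worst case of all five VMs running together, the host would be only slightly oversubscribed on virtual processors, which fits the impression that processor is not the bottleneck here.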
The big thing we are noticing is the size of the paging file on our slice in their environment. It is much larger there than on the original slice in our environment. Looking at the Windows Task Manager, you can see the size of the page file, but none of the processes are using anywhere near that much. I suspect fairly strongly that it’s being taken up by the VMware “balloon” process that it uses to help it manage and balance memory on each slice that it has, but I’m not sure why it would behave so differently in their environment vs. ours.
From what I’ve been reading, it might not be a bad idea to have them re-install the VMware Tools on the slice. Evidently if they are incorrect it can cause this sort of problem, and we did move it from our ESX server to Workstation and then they imported it back to ESX. But this is a “maybe yea, maybe nay” kinda thing to try; I’m just noting that VMware pointed to it for someone else who had a similar-looking issue.
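The Task Manager comparison described here amounts to subtracting the per-process totals from the overall PF usage. A minimal sketch of that arithmetic follows; the per-process figures and process names are hypothetical placeholders (the emails do not give them), and only the 2.75 GB total comes from the report. A large gap is consistent with memory held outside ordinary processes, such as by a driver, but it does not by itself prove the balloon driver is responsible.

```python
def unaccounted_mb(total_pf_usage_mb, per_process_mb):
    """Return PF usage (MB) not attributable to any listed process."""
    return total_pf_usage_mb - sum(per_process_mb.values())


def gap_fraction(total_pf_usage_mb, per_process_mb):
    """Return the unaccounted share of PF usage as a fraction of the total."""
    if total_pf_usage_mb <= 0:
        return 0.0
    return unaccounted_mb(total_pf_usage_mb, per_process_mb) / total_pf_usage_mb


# 2.75 GB total PF usage, as reported, expressed in MB.
total_mb = 2.75 * 1024

# HYPOTHETICAL per-process values, for illustration only; read the real
# figures from the Task Manager "Processes" tab on the affected machine.
processes_mb = {"service_a": 300, "service_b": 200, "other": 150}

gap = unaccounted_mb(total_mb, processes_mb)
print(f"Unaccounted: {gap:.0f} MB "
      f"({gap_fraction(total_mb, processes_mb):.0%} of PF usage)")
```

Running the same comparison on the slice at our location, which does not show the problem, would give a baseline for how large the gap normally is.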
I’m still looking at other possibilities as well.