MarkMichael

Server on single ESX node, rebooted and services fail

Hi all,

We've had this customer for about 1.5 years.
When they came to us, they required 2 servers on a dedicated host. So we put in a single VMware ESX node (as they don't need HA) and used local storage to install 2 servers:

1 to run exchange 2007
1 to run sharepoint on it

Last week, we rebooted 1 of their servers and it appeared that the server came up, but no one could connect to it. After looking at the server, it looks like VMware Tools hadn't loaded, and a lot of the other automatic services hadn't started either. We were unable to start them manually.

To fix this, we had to rebuild the server and restore data.

Today, I have rebooted the Exchange server and this has happened again, to a different server on the same ESX host.

After logging into the console for ESX I can see an alert for disk usage, which is red.

'Datastore usage on disk' - I'm not sure if this is any cause, but may be helpful.

Has anyone seen this issue before?


ASKER CERTIFIED SOLUTION
bgoering
MarkMichael

ASKER

Capacity: 1.63TB
Provisioned Space: 1.54TB
Free space: 98.01GB

Is this possibly not enough space to make a snapshot?

Both servers take up a total of 65GB of used space.

Do you think there is a possibility of finding this snapshot?

Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
Check the VM properties, Snapshot Manager. Do you have any snapshots listed there? Do you use snapshots for backup?
That is only about 6% free space on your datastore - a little bit tight. I try to keep 15% to 20% Free.
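
To make the 6% figure concrete, here is a rough sketch of the arithmetic using the capacity and free space the asker posted. It assumes the client reports binary units (1 TB = 1024 GB); the function name and the 15% target are just illustrative.

```python
def percent_free(capacity_gb: float, free_gb: float) -> float:
    """Return free space as a percentage of total datastore capacity."""
    if capacity_gb <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * free_gb / capacity_gb


# Figures from the thread; the TB-to-GB conversion factor is an assumption.
capacity_gb = 1.63 * 1024
free_gb = 98.01

pct = percent_free(capacity_gb, free_gb)
print(f"Free: {pct:.1f}% of the datastore")

# How much would have to be freed to reach a 15% free-space target?
target_fraction = 0.15
needed_gb = target_fraction * capacity_gb - free_gb
print(f"Need roughly {needed_gb:.0f} GB more free space to reach 15%")
```

On these numbers the datastore is a little under 6% free, so something on the order of 150 GB would have to be freed or added to get into the 15% range.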

As Hanccocka says, check for any snapshots in your administration client, either Virtual Infrastructure Client or vSphere Client, depending on your version of ESX. There will be an icon on the toolbar that looks kind of like a clock with a wrench on it - click that to get into Snapshot Manager.

Also let us know how large the hard drives allocated to your virtual machines are.
Nothing showing in the snapshot manager.

Just the simple 'You are here.' meaning I'm at the latest.

Could this have possibly made a snapshot and it didn't show up?

Is this something I can try to resolve, do you think?
Server 1:

System drive: 100GB
Disk 2: Pagefile disk of 8GB
Data drive: 256GB
have a look at the datastore for snapshots, and post screengrab here...

is the datastore used for anything else other than VMs?
confused now, you said

"Both servers, take up a total of 65GB of used space."

but server 1 takes 364GB? (unless they are thin provisioned)

Server 2:

System - 50GB
Pagefile - 8GB
Data Disk that uses these 4 drives:

a. 256GB
b. 256GB
c. 256GB
d. 256GB

(There's also an old VM that we kept, with a 100GB system drive on the same datastore.)

Sorry, when I said used space, I meant the actual used space showing in Windows, added together.

I think it's all thick provisioned. I'm no ESX expert.
As for the alarm - with 6% free space that would be a normal alarm. But if you have no snapshots (let us verify by posting a directory listing for each server) the 98 GB free space might be enough for now. Can't remember the defaults, but 6% would definitely be a red alarm.

Did you have no alarms before?
Okay, so that totals about 1.5TB allocated for server 1, server 2, and the old VM.

Adding the free space brings it to about 1.6TB.

That's very tight, and I've not included the swap space needed for each VM, which is equal to its memory.

So that's certainly why you've got a disk alert.
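
The totals above can be sketched as follows, using the disk sizes posted earlier in the thread. This assumes every disk is thick provisioned (the asker was not sure) and that 1 TB = 1024 GB; the dictionary keys are just labels. Swap files are left out because the VM memory sizes were never posted.

```python
# Allocated virtual disk sizes in GB, as posted in the thread.
allocations_gb = {
    "server1": [100, 8, 256],                 # system, pagefile, data
    "server2": [50, 8, 256, 256, 256, 256],   # system, pagefile, 4 data disks
    "old_vm": [100],                          # old VM system drive
}

per_vm_gb = {name: sum(disks) for name, disks in allocations_gb.items()}
allocated_gb = sum(per_vm_gb.values())

free_gb = 98.01
capacity_gb = 1.63 * 1024  # reported capacity; binary units assumed

for name, total in per_vm_gb.items():
    print(f"{name}: {total} GB")
print(f"Allocated: {allocated_gb} GB ({allocated_gb / 1024:.2f} TB)")
print(f"Allocated + free: {(allocated_gb + free_gb) / 1024:.2f} TB")
print(f"Unaccounted for: {capacity_gb - allocated_gb - free_gb:.1f} GB")
```

If the allocated disks alone come close to the reported capacity, any extra files on the datastore (swap, logs, snapshot deltas) have very little room to grow.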

Whatever you do, DON'T start using snapshots, or any backup product that uses them: Veeam, vRanger, vDR etc.
It sounds like all of the space can pretty much be accounted for by allocated drives. I am thinking this is possibly a virus problem. Do you have virus protection on your servers?

Go to http://malwarebytes.org, then download and run the free scanner there. You may have to download it on another box, burn it to CD, mount the CD on your Windows box, and boot your troublesome VM into safe mode in order to run it.
Hi there,

You suggest downloading this and creating an ISO to connect it to the server? This will run within Windows, I guess?
Personally, I would use Microsoft Security Essentials (yes, they do have a free virus and malware checker)

http://www.microsoft.com/security_essentials/

download and install, direct on server.
Yes, it's a malware scanner - it does an install, then tries to download a database. It won't be able to do that in safe mode. Run a full scan in safe mode, reboot and hopefully it will let you in. Let Malwarebytes update itself and run another full scan.

Either burn to cd or to iso and connect the iso to your vm in order to get the scanner on the possibly infected machine.
I was just thinking - it might be easier to install it on your other vm - scan it first.

Then power down your troublesome vm and attach the system hard drive to the good vm and give it a drive letter. You can attach it by going into edit settings, add hard disk, browse to the system disk and add it.

At that point you should be able to fully scan the system drive. Finally, remove the hard drive from the helper VM (DO NOT DELETE FROM DISK) and try to bring up your Exchange box again.
I've completed a full scan and nothing found.

Not a single item.

Any other suggestions please?
If you can boot to safe mode, try looking at the event logs to see if there are any errors.
Hmm.

Ok, I have found out what process is causing this.

LSASS.exe (local security authority)

When I kill this process off, all the other services start. However, killing it off also makes the server display a message that it is going to restart in 1 minute due to a server error. I assume this is to stop users from turning the security authority off.
yes, that's typical behavior for lsass...  used to see it a lot back when the sasser virus was running rampant.
don't forget, if you can get in with safe mode, you can use msconfig.exe to select which services to start and you can even filter out MS services.
I've been through all that, unfortunately.

Looks like the only fix is:

a) Patch LSASS somehow; I can't find a way of doing that.
b) Fresh install of Exchange using /recover mode, and use the VMDK that contains the Exchange database and logs on the new VM to recover the mail.

Can you think of anything I'll need to do, in case I shoot myself in the foot halfway through b?
If you've got enough space (but I take it from the thread that you don't) you could clone the disk before making changes to it. You can clone it with the datastore browser or in the console with the command 'vmkfstools -i oldname.vmdk newname.vmdk'.

 The alternative to that is to take a snapshot, but that might be dangerous depending upon how much the process rewrites and how much free space you have.
Upload the LSASS.exe file you have to http://www.virustotal.com/ just to check it's okay and the right one.
If you can get to a cmd prompt try

sfc /scannow

See if it will fix any system file inconsistencies.
Sorry guys,

Looks like the reason this occurred was that NetBackup was having issues backing up the servers via the VCB method.

It looks like there were several snapshots of this server hidden in the directory, taking up over 1TB of space, which looks to have stopped the server from being able to read/write correctly. After rebooting, it looks like it 'lost its way' back to the snapshot.
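
For anyone hitting the same thing, here is a minimal sketch of how leftover snapshot files in a VM's directory could be spotted and sized when Snapshot Manager shows nothing. It assumes the usual ESX naming for snapshot files (`<vm>-000001.vmdk`, `<vm>-000001-delta.vmdk`, `<vm>-Snapshot1.vmsn`) and that the datastore directory is readable from wherever Python runs; the function names are made up for illustration, and the name match is only a heuristic.

```python
import re
from pathlib import Path

# Heuristic match for snapshot descriptors, delta disks and snapshot state files.
SNAPSHOT_PATTERN = re.compile(r"(-\d{6}(-delta)?\.vmdk|\.vmsn)$", re.IGNORECASE)


def find_snapshot_files(vm_dir):
    """Return a list of (path, size_bytes) for files that look like snapshot files."""
    results = []
    for path in sorted(Path(vm_dir).iterdir()):
        if path.is_file() and SNAPSHOT_PATTERN.search(path.name):
            results.append((path, path.stat().st_size))
    return results


def report(vm_dir):
    """Print each suspected snapshot file with its size and return the total in bytes."""
    files = find_snapshot_files(vm_dir)
    total = sum(size for _, size in files)
    for path, size in files:
        print(f"{size / 1024**3:10.2f} GB  {path.name}")
    print(f"{total / 1024**3:10.2f} GB  total in {len(files)} file(s)")
    return total
```

Calling `report()` on each VM folder in the datastore would give the kind of directory listing that was asked for earlier in the thread. A non-empty result while Snapshot Manager shows only 'You are here' suggests orphaned snapshots, like the ones a failed VCB backup left behind here.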

Thanks for all your help everyone, very much appreciate it.
My first response (34884609) indicated that it was likely a problem with snapshots filling the disk. This was concurred with by several other experts. At one point (34884747) we even asked for a directory listing that was never received.
Indeed, you are correct.

It's been a tough week. Sorry bgoering, I should have taken longer looking back at the answers.