SBS 2011 random crashes

I have an SBS 2011 server that started crashing a few days ago.

It crashed and then got stuck in a boot sequence where it would blue-screen trying to Apply Computer Settings.

I was able to get into safe mode and safe mode with networking with no issues.

I was able to boot back into Windows normally once, but I think I just got lucky, as the next 2 boots also blue-screened at Applying Computer Settings.

I logged a ticket with MS and we have been working on the issue for 2 days with no success. I am reaching out for help here hoping someone has ideas.

I got back into Windows normally using Last Known Good Config, but after a couple more reboots the issue came back.

We discovered a strange issue where the network logon service was not starting (this had never happened before). MS determined that somehow the hostname of the computer was changed in a couple of places in the registry. We disabled Exchange services as they were also failing due to the Network Logon Service failing to start. Once we modified the registry settings back to the actual name of the server, the network logon service started up again normally.

Thinking the issue was fixed, we began restarting the Exchange services and then we crashed again when about half of them were started up. We rebooted and then got a couple more started and then crashed again.

MS then tried to disable 3rd party drivers and storage drivers (the ones that don't load in safe mode) but the server was unstable in that state. My MS engineer then quit for the night.

I had the data center run a full diagnostic on the hardware which came back clean.

I disabled all Exchange services again, and behold it has not crashed since.

So, any ideas?

I can't get the idea out of my head that it is related to RAM. This server is very undersized; it's running 8 GB RAM. Even with Exchange disabled 6.5 GB of RAM is used up just booting to the desktop.

My thought was that as services were starting up and RAM was being given to processes, the system either hit a fault in a physical module, or the page file filled up and somehow caused the crash. Is this valid reasoning?

Another thought concerns the registry entry that had been changed, which was causing the network logon service to fail. The server name that appeared in the registry was generic, like WIN-67L5UNORI4I.

I scanned the security logs for failed logon attempts and I see similar PC names appearing from strange IP addresses (China, South Korea, Brazil, Germany).
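A minimal sketch of how such a scan could be tallied, assuming the failed-logon records have already been exported as (workstation name, source IP) pairs. The function names and sample addresses are hypothetical, and the pattern assumes Windows auto-generated names take the form "WIN-" plus 11 letters or digits, as in the example above.

```python
import re
from collections import Counter

# Assumption: auto-generated Windows names look like "WIN-" followed by
# 11 uppercase letters or digits, e.g. WIN-67L5UNORI4I.
DEFAULT_NAME = re.compile(r"^WIN-[A-Z0-9]{11}$")


def is_default_hostname(name):
    """Return True if the name matches the assumed auto-generated pattern."""
    return bool(DEFAULT_NAME.match(name.strip().upper()))


def tally_suspect_logons(events):
    """Count failed logons per source IP where the workstation name looks
    auto-generated. `events` is an iterable of (workstation, ip) pairs."""
    hits = Counter()
    for workstation, ip in events:
        if is_default_hostname(workstation):
            hits[ip] += 1
    return hits


if __name__ == "__main__":
    # Made-up sample data using documentation-only IP ranges.
    sample = [
        ("WIN-67L5UNORI4I", "203.0.113.5"),
        ("OFFICE-PC", "198.51.100.7"),
    ]
    for ip, count in tally_suspect_logons(sample).most_common():
        print(ip, count)
```

A high count from a single address would only suggest repeated attempts from one source; it does not by itself show that anyone got in.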

Could someone have gained access and caused some damage that is making it crash?

Any advice you can give would be great.

Thanks!
IT_ServiceAsked:
ded9Commented:
Upload the last three minidumps for analysis.

You can also upload the minidumps to SkyDrive and share the link here.

C:\Windows\minidump
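One way to pick out the three most recent dump files, as a rough sketch. The helper name `newest_dumps` is made up for illustration, and it assumes the dumps carry the usual .dmp extension.

```python
from pathlib import Path


def newest_dumps(folder, count=3):
    """Return up to `count` .dmp files in `folder`, newest first by
    modification time. Returns an empty list if the folder is missing."""
    root = Path(folder)
    if not root.is_dir():
        return []
    dumps = [p for p in root.glob("*.dmp") if p.is_file()]
    dumps.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return dumps[:count]


if __name__ == "__main__":
    # Folder taken from the comment above; adjust if dumps are stored elsewhere.
    for dump in newest_dumps(r"C:\Windows\minidump"):
        print(dump)
```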



Ded9
IT_ServiceAuthor Commented:
The last thing that MS did was to enable a complete memory dump, but the server has not crashed since then. I will try starting up services to see if I can force it to crash and dump the memory.

Should I change the option to Small Memory Dump or Kernel? MS wanted complete.
DhananjayTechnical ConsultantCommented:
If MS wants a complete memory dump, then configure the server for a complete memory dump so they can troubleshoot and find the exact cause of the crashes.

To configure a complete memory dump, and for more information about it, see the links below:

http://support.microsoft.com/kb/307973

http://support.microsoft.com/kb/969028
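For checking which dump type is currently configured, here is a small sketch. It assumes the setting lives in the CrashDumpEnabled value under the CrashControl registry key, with 0 to 3 meaning none, complete, kernel, and small dump. The function names are hypothetical, and the registry read only works on Windows.

```python
import sys

# Assumed location of the dump-type setting.
CRASH_CONTROL_KEY = r"SYSTEM\CurrentControlSet\Control\CrashControl"

# Assumed meaning of CrashDumpEnabled on this generation of Windows Server.
DUMP_TYPES = {
    0: "none",
    1: "complete memory dump",
    2: "kernel memory dump",
    3: "small memory dump (minidump)",
}


def describe_dump_setting(value):
    """Translate a CrashDumpEnabled value into a readable label."""
    return DUMP_TYPES.get(value, f"unrecognized value {value}")


def read_dump_setting():
    """Read CrashDumpEnabled from the local registry (Windows only)."""
    import winreg  # imported here so the module still loads elsewhere
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, CRASH_CONTROL_KEY) as key:
        value, _value_type = winreg.QueryValueEx(key, "CrashDumpEnabled")
    return value


if __name__ == "__main__":
    if sys.platform == "win32":
        print(describe_dump_setting(read_dump_setting()))
    else:
        print("Not on Windows; example:", describe_dump_setting(1))
```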
ded9Commented:
Configure it for minidump.

Is the minidump folder empty? If you have old minidump files, you can upload them.



Ded9
IT_ServiceAuthor Commented:
Ok, I'm a bit further along. I was able to force the system to crash under these circumstances:

 - Set all Exchange services to manual
 - Started the services one at a time
 - While trying to start the Exchange RpcClientAccess service, the service hangs on startup, but a process is created that keeps consuming more and more RAM until the server crashes
 - Disabled automatic restart on crash, which let me see the BSOD error through the remote console application

BSOD ERROR:

KERNEL_DATA_INPAGE_ERROR
Technical Information: STOP: 0x0000007A (0xFFFFF6FC4000A9D0, 0xFFFFFFFFC000000E, 0x0000000137CDF860, 0xFFFFF8800153A758)
*** Ntfs.sys - Address FFFFF8800153A758 at base FFFFF8800144C000, Datestamp 5167f5fc
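As a rough aid for reading those numbers: for stop 0x7A, Microsoft's bug check reference describes the second parameter as the I/O error status (an NTSTATUS code). The sketch below decodes it and computes the faulting offset inside Ntfs.sys. The table covers only a handful of status codes as I understand them from that reference, and the function names are hypothetical.

```python
# NTSTATUS values commonly listed for parameter 2 of bug check 0x7A.
# Assumption: meanings as summarized from Microsoft's bug check reference.
IO_STATUS = {
    0xC000000E: "STATUS_NO_SUCH_DEVICE",          # hardware failure or bad drive configuration
    0xC000009A: "STATUS_INSUFFICIENT_RESOURCES",  # nonpaged pool exhausted
    0xC000009C: "STATUS_DEVICE_DATA_ERROR",       # typically bad blocks on disk
    0xC000009D: "STATUS_DEVICE_NOT_CONNECTED",
    0xC000016A: "STATUS_DISK_OPERATION_FAILED",
    0xC0000185: "STATUS_IO_DEVICE_ERROR",
}


def decode_status(param2):
    """Map the second stop parameter to an NTSTATUS name."""
    code = param2 & 0xFFFFFFFF  # NTSTATUS is 32 bits; the upper half is sign extension
    return IO_STATUS.get(code, f"unknown status 0x{code:08X}")


def module_offset(address, base):
    """Offset of the faulting address from the module's load address."""
    return address - base


if __name__ == "__main__":
    print(decode_status(0xFFFFFFFFC000000E))
    print(hex(module_offset(0xFFFFF8800153A758, 0xFFFFF8800144C000)))
```

If that reading is right, the status here is STATUS_NO_SUCH_DEVICE, which would point toward the disk or controller path more than toward RAM.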

This article (http://msdn.microsoft.com/en-us/library/ms854944.aspx) says:
"This Stop message indicates that the requested page of kernel data from the paging file could not be read into memory. This Stop message is usually caused by a bad block (sector) in a paging file, a virus, a disk controller error, or failing RAM. In rare cases, it is caused when nonpaged pool resources run out. It is also caused by defective hardware."

That certainly sounds like what is going on (failing RAM, or nonpaged pool resources running out).

Anyone see this before?

If it's the paging file, how can I eliminate the bad block? Would chkdsk need to be run?

Thanks!
ded9Commented:
Run a test on the RAM:

www.memtest.org

Configure the system for minidumps.

Enable Driver Verifier
1) Open an elevated command prompt
2) Type "verifier /standard /all"  (no quotes)
3) Reboot your machine
4) Use machine again until it crashes

After the crash & reboot, go into safe mode with networking or normal mode

Disable driver verifier
1) Open an elevated command prompt
2) Type "verifier /reset" (no quotes)
3) Reboot your machine


Upload the new dmp file for analysis



Ded9
Gary ColtharpSr. Systems EngineerCommented:
You have a corrupt paging file.

Remove the paging file from all drives and reboot. It will run badly, but it will boot.

Then re-enable paging on C: with system-managed sizing.

You should be okay to start Exchange after that.

I am not saying that will be the end of your issues with this box, but it should stop the BSOD.

HTH

Gary
IT_ServiceAuthor Commented:
After this issue was escalated within Microsoft, we are finally getting somewhere.

There was an error that did not show up often in the logs, but it indicated an issue with the RAID controller firmware.

I had the data center update the firmware, and when the server booted back up it reported one of the disks in the RAID as failed. It seems the failing drive and outdated firmware were causing the issues. When Exchange was trying to mount the 65 GB database, the system would crash. When we created a new empty DB, Exchange would start without issue.

Before the firmware update, even trying to make a copy of the 65 GB database would cause the server to crash. After the update and drive replacement there have been no issues so far.

I have run a repair on the copy of the DB, and am getting set to re-mount it in Exchange.

I will update you all to confirm resolution.

Thanks!
IT_ServiceAuthor Commented:
Looking for a quick answer on this ... I have not been able to reach my MS engineers all day for the open ticket.

I'm in the process of mounting the repaired db, but it's taking a long time. I have no idea if this should complete in 5 minutes, or 5 hours.  It's been about 2 hours so far. I don't want to wait this process out only to have it fail, then find out it's a process that should have been quick.

Can anyone comment?

The DB is about 65 GB.
Gary ColtharpSr. Systems EngineerCommented:
I can't imagine it taking that long to mount under ordinary circumstances. Did you run ESEUTIL against it to make sure it was in a clean state?
IT_ServiceAuthor Commented:
Yes it was clean. The issue was I hadn't removed the old log files from the directory. Once I did that, it mounted right away.

Thank you so much everyone.
IT_ServiceAuthor Commented:
I split up the points to those with the most helpful answers.

Thanks again, all.

0 points to the MS Engineer who took 2 days to get almost nowhere.
1,000,000 points to the MS Engineer who put me on the right path in less than 2 hours.