System randomly hangs and becomes unresponsive

Greetings EE!

I have 2 identical boxes that we initially built to use as a sandbox, but about 2 months ago we moved one into production and users began complaining that the server was constantly offline.  Because I have two identical boxes, I was fortunate to have one on which to test solutions before implementing them on the production machine.

What is happening is the system would randomly hang.  Not crash or BSOD, just lock up completely.

This is the build:

Hardware Specs:
MB - ASUS Z9PE-D8 WS
	BIOS updated on Unit1 (test unit) from 3206 to 5109
Configured with 2 separate RAID Groups.
Group0 - RAID 1 - OS Installed here
Group1 - RAID 5 - VM's stored here

(RAID 5 shows as "Intel Raid 5 Volume SCSI Disk Device")

CPU: 2x Intel Xeon E5-2609 Sandy Bridge-EP, 2.4GHz 80w Quad-Core, BX80621E52609
RAM: 8x 4GB - Kingston 240-pin DDR3 SDRAM ECC Registered DDR 1333 1.35V VLP KVR13LR9D8L/4HC
NIC: 2x onboard, 1x Intel Gigabit CT Desktop Adapter
HDD: 5x HGST Travelstar H2IK5001672SP (0S02858) 500GB 7200 RPM 32MB Cache SATA 6.0Gb/s 2.5" Internal 
PSU: CORSAIR Professional Series Gold AX1200 (CMPSU-1200AX) 1200W ATX12V v2.31 / EPS12V v2.92
Graphics: EVGA e-GeForce 8400 GS (Nvidia 8400) PCI-E 2.0 x16

Drive Caddy: Thermaltake RC1600101A MAX-1562 5.25" (x1) Bay to 2.5" (x6) Bay Mobile Rack HDD Canister

O/S: Windows Server 2008 R2, SP1 (Build 7601)
Roles: Hyper-V


The BIOS version initially on the test machine was 3206, but I updated it to 5109.  The system stopped hanging, but now it keeps crashing.  I also enabled WHEA in the BIOS, so now, beyond "WhoCrashed" reporting that the failing module was "hal.dll", the report actually shows that there is a fatal memory issue.

I ran MemTest86 and it kept locking up at around 15% or so.  Some research showed that changing the RAM voltage from automatic to the setting recommended by the RAM manufacturer could resolve this, and it did.  MemTest86 passed with flying colors...but the system will still BSOD with the same error.

I contacted ASUS, and the support tech said to flash the BIOS to version 3302 (currently running 5109), as that is the highest version I need for my processor.

The issue I'm having, and one they seem unwilling to assist with, is that none of the tools provided will flash the older BIOS.  The EZ Flash utility refuses to use the file because it is older than the currently installed version, as does the Windows utility, and BUPDATER.exe won't work because the file is a CAP file and not a ROM file.

In the meantime I still have to bounce the production server at least once a day when it hangs...with no resolution in sight.  Again, when the server hangs it gives no error message at all.  I'm at a loss as to what else could possibly be wrong, as the odds that I have two sets of bad hardware are extremely long, though not impossible.
cjake2299 Asked:
ded9 Commented:
Go to

C:\Windows\minidump

Upload the last three minidumps for analysis.
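For anyone collecting the files by hand, here is a minimal Python sketch of one way to pick out the three most recent dumps. It assumes the default folder named above and that "last three" means the three most recently modified `.dmp` files; the function name `newest_dumps` is made up for illustration, and reading that folder usually needs an elevated prompt.

```python
from pathlib import Path


def newest_dumps(folder, count=3):
    """Return the `count` most recently modified .dmp files in `folder`, newest first."""
    dumps = [p for p in Path(folder).glob("*.dmp") if p.is_file()]
    # Sort by last-modified time so the most recent crash comes first.
    dumps.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return dumps[:count]


if __name__ == "__main__":
    # Default minidump location mentioned above (an assumption for other setups).
    for dump in newest_dumps(r"C:\Windows\minidump"):
        print(dump)
```

If the folder does not exist or holds no dumps, the function simply returns an empty list.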



Ded9
Pramod Ubhe Commented:
Could be a memory leak issue.

http://blogs.technet.com/b/askperf/archive/2007/09/25/troubleshooting-server-hangs-part-one.aspx

Do you see any kind of errors/warnings in the event logs?


If it is taking longer to troubleshoot, you could consider replacing the production box with its identical twin. Or, since the boxes are identical, you can just swap the disks to see whether it is a hardware issue or an OS issue.
Brian B (EE Topic Advisor, Independent Technology Professional) Commented:
Based on what the previous expert said, if everything works, there are a couple of items that spring to mind.

Swap the memory.
Check the board on the "bad" unit to see if any of the capacitors have "popped".
Check to make sure fans are moving freely and that you don't have any other overheating issues.
cjake2299 (Author) Commented:
@ded9 The MiniDumps are added.

The production and test unit are identical.  The test unit is the one I updated the BIOS on to see if it would correct the issue with the system hanging, and this unit is now the one that crashes within 5-10 minutes after startup.

The production unit still hangs, but only once or twice a day.

Looking on the ASUS website, BIOS 3109 was supposed to fix the issue with the NVRAM causing the system to lock up, but both units already had BIOS version 3206.

@Pramod_ubhe I'll take a look at the link and get back to you.

@TBone2k no popped caps, and cooling is fine (climate controls in the server room keep the temp at 68°F).  CPU temp barely gets over 34°C.
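Since the two readings above are in different units, a quick conversion puts them on the same scale. This is only an illustrative sketch using the standard Fahrenheit-to-Celsius formula; the function name is made up for the example.

```python
def fahrenheit_to_celsius(deg_f):
    """Convert a Fahrenheit reading to Celsius."""
    return (deg_f - 32) * 5 / 9


room_f = 68  # server room temperature quoted above, in Fahrenheit
cpu_c = 34   # CPU temperature quoted above, in Celsius

room_c = fahrenheit_to_celsius(room_f)
print(f"Room: {room_c:.1f} C, CPU: {cpu_c} C, difference: {cpu_c - room_c:.1f} C")
# prints "Room: 20.0 C, CPU: 34 C, difference: 14.0 C"
```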

MemTest86 passed two separate runs with no issues after I set the voltage statically to the manufacturer's specs.
100213-24975-01.dmp
100213-25506-01.dmp
100313-24679-01.dmp
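As a rough aid for reading the attachment names, the sketch below pulls a crash date out of each one. It assumes the first field of a minidump name is the date written as MMDDYY, which matches the usual Windows naming but is not confirmed anywhere in this thread; the other two fields are left uninterpreted, and `dump_date` is a made-up helper name.

```python
from datetime import datetime


def dump_date(filename):
    """Parse the leading field of a minidump name, assumed to be MMDDYY."""
    stamp = filename.split("-", 1)[0]
    return datetime.strptime(stamp, "%m%d%y").date()


for name in ["100213-24975-01.dmp", "100213-25506-01.dmp", "100313-24679-01.dmp"]:
    print(name, "->", dump_date(name).isoformat())
```

Under that assumption, two of the dumps come from the same day and the third from the day after.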
cjake2299 (Author) Commented:
@pramod_ubhe, after reviewing the article, the issue that is on the production server (and was on the test server before the BIOS update) is definitely a Hard Hang.  System is totally unresponsive...not even the monitor works.

The only error that shows in the event log before I reboot the system is from the SCCM Ops Manager connector looking for the old SCCM Server.

Both of these servers were previously in a cluster together as part of our sandbox.  The cluster was destroyed following MS guidelines.  There were no apparent issues, but the servers were only on while vetting updates/modifications to the application and testing the security of the application.  Once that was complete for the day, the servers were powered off (rarely on for longer than 4 hours).
ded9 Commented:
The dumps point to an overheating issue. Check whether the fan on the processor is seated properly.

If overclocking is enabled in the BIOS, disable it. Check the CPU temperature.



Ded9
Brian B (EE Topic Advisor, Independent Technology Professional) Commented:
Maybe I missed that you have already done it, but I would still suggest swapping the memory. It's quick and you don't have to change it back if you are wrong.

Just to clarify, you said both units are now crashing/hanging?
cjake2299 (Author) Commented:
Both units were hanging.

Unit1 (test unit) I updated the BIOS to 5109 and it started crashing.

Unit2 I have left alone thus far.  The manufacturer finally responded and said I need to update the BIOS on Unit2 to 3302 to correctly support my CPU, but has thus far failed to help roll back the BIOS on Unit1.

I'd prefer to set the BIOS to version 3302 on Unit1 to verify it corrects the issue before I update the BIOS on Unit2 to the same version.
Brian B (EE Topic Advisor, Independent Technology Professional) Commented:
I am surprised in all of this that there would be these kinds of problems related to the BIOS, but it sounds like ASUS has confirmed it. Have you tried searching on Google with your specific model of server to see if others are having the same problem?

It really does sound BIOS related at this point.... or memory, or both. I know I keep saying that, but just because it passed memtest doesn't mean there isn't some random problem. That's the last I'll bring it up.
cjake2299 (Author) Commented:
UPDATE: ASUS says I need the BIOS update to correctly support my CPU.  I updated Machine2 to the BIOS level that supports my CPU and it has been running fine.

Had many other weird issues with Machine1 while trying to get the BIOS rolled back, but ASUS says it can't be done (this after talking to the fifth tech support rep in two weeks).

I ordered a BIOS chip from ASUS with the correct BIOS version installed on it for Machine1; it should be here by Friday.  During my internet searches I've found a large variety of issues with this particular ASUS board (Z9PE-D8 WS), so I'll avoid it in the future.  My Sabertooth boards have been running fine for years; sad to see that the only ASUS board that supports 2 CPU sockets is having so many issues.

Once I get the BIOS chip I'll test the memory as well.  This has been a very odd experience.  I'd avoid the Z9PE-D8 WS board if you can.  Also going to tighten up the interior to see if I can improve air flow, maybe replace a few of the standard CoolerMaster fans that came with the chassis with something that pushes a much higher volume.

I'm going to close this for now and issue points evenly to each of you for your assistance.

Thanks again!