Asked by cjake2299

System randomly hangs and becomes unresponsive

Greetings EE!

I have 2 identical boxes that we built initially to use as a sandbox, but we moved one into production about 2 months ago, and users have been complaining that the server is constantly offline.  Because I have two identical boxes, I was fortunate to have one to test solutions on before implementing them on the production machine.

What is happening is the system would randomly hang.  Not crash or BSOD, just lock up completely.

This is the build:

Hardware Specs:
MB - ASUS Z9PE-D8 WS
	BIOS updated on unit 1 from 3206 to 5109
Configured with 2 separate RAID groups:
Group0 - RAID 1 - OS installed here
Group1 - RAID 5 - VMs stored here

(RAID 5 shows as "Intel Raid 5 Volume SCSI Disk Device")

CPU: 2x Intel Xeon E5-2609 Sandy Bridge-EP, 2.4GHz 80w Quad-Core, BX80621E52609
RAM: 8x 4GB - Kingston 240-pin DDR3 SDRAM ECC Registered DDR 1333 1.35V VLP KVR13LR9D8L/4HC
NIC: 2x onboard, 1x Intel Gigabit CT Desktop Adapter
HDD: 5x HGST Travelstar H2IK5001672SP (0S02858) 500GB 7200 RPM 32MB Cache SATA 6.0Gb/s 2.5" Internal 
PSU: CORSAIR Professional Series Gold AX1200 (CMPSU-1200AX) 1200W ATX12V v2.31 / EPS12V v2.92
Graphics: EVGA e-GeForce 8400 GS (Nvidia 8400) PCI-E 2.0 x16

Drive Caddy: Thermaltake RC1600101A MAX-1562 5.25" (x1) Bay to 2.5" (x6) Bay Mobile Rack HDD Canister

O/S: Windows Server 2008 R2, SP1 (Build 7601)
Roles: Hyper-V


The BIOS version initially on the test machine was 3206, but I updated it to 5109.  The system stopped hanging, but now it keeps crashing.  I also enabled WHEA in the BIOS, so now, instead of "WhoCrashed" only telling me that the failing module was "hal.dll", it actually shows that there is a fatal memory issue.

I ran MemTest86 and it kept locking up at around 15% or so.  Some research showed that changing the voltage from automatic to the settings recommended by the RAM manufacturer could resolve this, and it did.  MemTest86 passed with flying colors, but the system still BSODs with the same error.

I contacted ASUS, and the support tech said to flash the BIOS to version 3302 (I am currently running 5109), as that is the highest version I need for my processor.

The issue I'm having, and which they seem unwilling to assist with, is that none of the tools provided will update the BIOS.  The EZ Flash utility refuses to use the file because it is older than the currently installed version, the Windows utility does the same, and BUPDATER.exe won't work because the file is a CAP file and not a ROM file.

In the meantime I still have to bounce the production server at least once a day when it hangs, with no resolution in sight.  Again, when the server hangs it gives no error message at all.  I'm at a loss as to what else could possibly be wrong, as the odds of having two sets of bad hardware are slim, though it is not impossible.
SOLUTION
ded9
This solution is only available to members.
SOLUTION
Pramod Ubhe
This solution is only available to members.
SOLUTION
This solution is only available to members.
cjake2299

ASKER

@ded9 The MiniDumps are added.

The production and test unit are identical.  The test unit is the one I updated the BIOS on to see if it would correct the issue with the system hanging, and this unit is now the one that crashes within 5-10 minutes after startup.

The production unit still hangs, but only once or twice a day.

According to the ASUS website, BIOS 3109 was supposed to fix the issue with the NVRAM causing the system to lock up; both units had BIOS version 3206.

@Pramod_ubhe I'll take a look at the link and get back to you.

@TBone2k No popped caps, and cooling is fine (climate controls in the server room keep the temperature at 68°F).  CPU temp barely gets over 34°C.

After I set the voltage statically to the manufacturer's specs, MemTest86 passed two separate tests with no issues.
100213-24975-01.dmp
100213-25506-01.dmp
100313-24679-01.dmp
@pramod_ubhe, after reviewing the article, the issue that is on the production server (and was on the test server before the BIOS update) is definitely a Hard Hang.  System is totally unresponsive...not even the monitor works.

The only error that shows in the event log before I reboot the system is from the SCCM Ops Manager connector looking for the old SCCM Server.

Both of these servers were previously in a cluster together as part of our sandbox.  The cluster was destroyed following MS guidelines.  There were no apparent issues at the time, but the servers were only on while we vetted updates/modifications to the application and tested its security.  Once that was complete for the day, the servers were powered off (they were rarely on for longer than 4 hours).
SOLUTION
This solution is only available to members.
SOLUTION
This solution is only available to members.
Both units were hanging.

Unit1 (test unit) I updated the BIOS to 5109 and it started crashing.

Unit2 I have left alone thus far.  The manufacturer finally responded and said I need to update the BIOS on Unit2 to 3302 to properly support my CPU, but has thus far been unable to help me roll back the BIOS on Unit1.

I'd prefer to set the BIOS to version 3302 on Unit1 to verify it corrects the issue before I update the BIOS on Unit2 to the same version.
ASKER CERTIFIED SOLUTION
This solution is only available to members.
UPDATE: ASUS says I need the BIOS update to correctly support my CPU.  I updated Machine2 to the BIOS level that supports my CPU and it has been running fine.

I had many other weird issues with Machine1 while trying to get the BIOS rolled back, but ASUS says it can't be done (this after talking to the fifth tech support rep in two weeks).

I ordered a BIOS chip from ASUS with the correct BIOS version installed on it for Machine1; it should be here by Friday.  During my internet searches I've found a large variety of issues reported with this particular ASUS board (Z9PE-D8 WS), so I'll avoid it in the future.  My Sabertooth boards have been running fine for years; sad to see that the only ASUS board that supports 2 CPU sockets is having so many issues.

Once I get the BIOS chip I'll test the memory as well.  This has been a very odd experience.  I'd avoid the Z9PE-D8 WS board if you can.  I'm also going to tidy up the interior to see if I can improve airflow, and maybe replace a few of the standard CoolerMaster fans that came with the chassis with something that pushes a much higher volume of air.

I'm going to close this for now and issue points evenly to each of you for your assistance.

Thanks again!