Help with HP ProLiant ML350 G5 Constant BSOD:

Hi all.
We've had this server as a domain controller in a remote branch for a few years now, and it has been rebooting at random.
Each reboot logs a BSOD error in the event log.
I've looked at the minidump using BSOD View and it always states the same thing:
"The problem seems to be caused by the following file: ntoskrnl.exe
PAGE_FAULT_IN_NONPAGED_AREA
*** STOP: 0x00000050 (0xe2474000, 0x00000000, 0xbfab34ec, 0x00000000)
*** ntoskrnl.exe - Address 0x8087c4a0 base at 0x80800000 DateStamp 0x4b27c5b8
"
Some info was omitted because it doesn't seem relevant.
If you need more details from the crash, just let me know.
This server is running Windows Server 2003 R2 SP2 (32-bit).
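For anyone who wants to read the STOP line above without a viewer, here is a minimal Python sketch that splits it into the bug check code and its four parameters and works out the ntoskrnl.exe offset from the address and base shown. The function names (`decode_stop_line`, `module_offset`) are made up for illustration, and the parameter descriptions in the comments are my understanding of Microsoft's documentation for bug check 0x50 on Windows of this era, so treat them as an assumption rather than gospel.

```python
import re

def decode_stop_line(line):
    """Return (bug_check_code, [param1..param4]) parsed from a STOP line."""
    m = re.search(r"STOP:\s*(0x[0-9a-fA-F]+)\s*\(([^)]*)\)", line)
    if m is None:
        raise ValueError("no STOP code found in line")
    code = int(m.group(1), 16)
    params = [int(p.strip(), 16) for p in m.group(2).split(",")]
    return code, params

def module_offset(address, base):
    """Offset of an address inside a module loaded at the given base."""
    return address - base

# The line exactly as quoted in the minidump summary above.
stop_line = "*** STOP: 0x00000050 (0xe2474000, 0x00000000, 0xbfab34ec, 0x00000000)"
code, params = decode_stop_line(stop_line)

# Assumed meaning for bug check 0x50 (PAGE_FAULT_IN_NONPAGED_AREA):
#   params[0] = memory address that was referenced
#   params[1] = 0 for a read; nonzero for a write (or execute on newer Windows)
#   params[2] = address of the instruction that made the reference, if known
#   params[3] = reserved
print("bug check:", hex(code))
print("referenced address:", hex(params[0]))
print("operation:", "read" if params[1] == 0 else "write/other")
print("referencing instruction:", hex(params[2]))

# Address and base for ntoskrnl.exe, also taken from the quoted output.
print("ntoskrnl offset:", hex(module_offset(0x8087c4a0, 0x80800000)))
```

Under that reading, this crash was a read of 0xe2474000 made by an instruction at 0xbfab34ec, which lies outside the ntoskrnl.exe range shown, so the kernel being named does not by itself say which component made the bad reference.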

There are also the same events repeating in the event log every xx minutes:
1) HP NC373i: The network link is down.  Check to make sure the network cable is properly connected.
2) The power subsystem is now in a non-redundant state.
3) Power supply 2 has failed.
4) Power supply 2 is now operating correctly.

I figured these are related to the HP Management agents.

I tried updating the server to the newest HP PSP at the time (8.60) but it didn't seem to help.
I downloaded the newest one (8.70) but I don't feel like it's going to solve the problem.

Are the blue screen reboots and the HP agent warnings related?

This has been an issue for quite a while now and I honestly don't know how to handle it.

Any help would be appreciated.
homerslmpson asked:
 
David (President) commented:
Memtest, the HP diagnostics, and UBCD are toys compared to a test board that one plugs into a motherboard to run hardware diagnostics.  Pure software tools simply do not have the ability to fully test any motherboard.  So don't assume the hardware is OK when you haven't given it a thorough test.

As you probably can't justify the expense of purchasing proper test equipment, and this is a DC, then why not create another DC as a virtual machine at that office to take over so you can get a good downtime window, and then load a fresh copy of everything?   It is either hardware or software or interaction between the two.

You do not have what you need to certify the hardware is 100%, but if you reload the O/S (sorry, backup/restore won't cut it, as you could have corrupted DLLs or files) and patch it up, then you can see if it becomes stable.

If it starts crashing again, then you know it is hardware-related, and can get it repaired by HP or a pro who knows what they are doing.  If not, and the system stays up OK, then the problem is solved.


 
rindi commented:
Test the RAM using memtest86+. You'll find that on the UBCD. I'd also clean out all dust. As you have redundant PSUs, the crashes shouldn't have been caused by that, but if it is always the same PSU that fails I would change it ASAP. The LAN link going down also shouldn't cause crashes.

http://ultimatebootcd.com

If the RAM is fine, make sure your Antivirus software is updated, and if you are using some remote control tool also make sure that is up-to-date. Update all your drivers.
 
homerslmpson (author) commented:
I downloaded the Windows version of UBCD but I'm unsure if I should have done that.  It looks like they have a different one on the main page.  Does it matter?

The antivirus is up to date.

This may actually be the first time I got the warning about the power supply.

The one that usually shows up is below:
"System Information Agent: Health: The Fan Sub-system has lost redundancy.  Replace any failed or missing fans.
 Chassis: '0'
[SNMP TRAP: 6037 in CPQHLTH.MIB]
"

That will immediately be followed up with this one:
"System Information Agent: Health: The Fan Sub-system has returned to a redundant state.
 Chassis: '0'
[SNMP TRAP: 6055 in CPQHLTH.MIB]
"
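Since the "lost redundancy" and "returned to a redundant state" events arrive in pairs, one way to judge whether this is a flapping sensor or a fan that really stays down is to measure the gap between each pair. Below is a rough Python sketch of that idea. The trap numbers 6037 and 6055 come from the events quoted above; everything else (the `flap_durations` name, the input shape, and especially the sample timestamps) is made up for illustration and is not taken from this server's logs.

```python
from datetime import datetime

LOST, RESTORED = 6037, 6055  # trap numbers quoted in the events above

def flap_durations(events):
    """Pair each 'lost' trap with the next 'restored' trap.

    events: iterable of (datetime, trap_id) tuples sorted by time.
    Returns the length of each outage in seconds.
    """
    durations = []
    lost_at = None
    for when, trap in events:
        if trap == LOST and lost_at is None:
            lost_at = when
        elif trap == RESTORED and lost_at is not None:
            durations.append((when - lost_at).total_seconds())
            lost_at = None
    return durations

# Placeholder timestamps purely for illustration, NOT real log data.
sample = [
    (datetime(2000, 1, 1, 10, 0, 0), LOST),
    (datetime(2000, 1, 1, 10, 0, 2), RESTORED),
    (datetime(2000, 1, 1, 10, 30, 0), LOST),
    (datetime(2000, 1, 1, 10, 30, 1), RESTORED),
]
print(flap_durations(sample))
```

Outages of only a second or two each would point more toward a marginal fan or sensor reading than a fan that has actually stopped, though that is an interpretation and not something the agents state.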

Perhaps I can have someone in that branch take a look at the back of the server to see if there are any warning lights, etc.

The memtest needs to be done before Windows loads, yes?

Any idea how long I should run it?

Sorry for all the questions and thanks in advance.
 
rindi commented:
I meant the other UBCD, but as memtest is on both it should be OK. The problem with the Windows version is that you first have to build it using an XP CD, or probably also a 2003 CD (I've never used that, though). The standard UBCD is just an ISO that you burn to a CD. Both have to be booted from, so you need someone at the server's location to put the disc in the server and boot from it. And yes, you need to run memtest by booting from the CD, without Windows running. Three passes should be fine (how long that takes is difficult to say, as it depends on the CPU speed and the size of the RAM).

You will need someone there anyway so they can clean out the dust and look at the fans. If the fans cause errors like the ones you just posted, that could certainly be a reason for the crashes.
 
homerslmpson (author) commented:
Hmm.  I see.
I will download the other UBCD as that sounds a lot easier to work with.
Looks like I'm going to have to coordinate with someone in that branch to assist me (this is going to be fun).
Well, thanks for your help.  I guess we're going to have to put this thread on hold for a while as I'll need to have the memory tested and all that jazz.
We'll be in touch!!
Thanks again.
 
marcustech commented:
I agree with rindi; intermittent ntoskrnl BSODs are often caused by bad memory.

You'll also want to install the HP system management tools and run the Insight online diagnostics and the HP Integrated Management Log (IML) viewer.  If the Internal Health LED on the front panel is illuminated, you will need to take the side panel off and get a note or photo of the diagnostic LEDs on the motherboard, which are quite informative on ProLiant ML servers.  If you've got enough memory and don't want the downtime of running memtest, remove half the RAM the next time the server is shut down. If it crashes again, swap in the other half of the RAM and see if it still crashes.

Of course if it's under warranty then call HP and get them to walk you through troubleshooting and replace parts if necessary.
 
sifuedition commented:
Most ntoskrnl.exe dumps I have seen come back to hardware. That is definitely not to imply that is the only thing that can cause this, just the most common. If the memory checks out, be sure to go to HP's website and find any bootable hardware diagnostics they offer. With a mix of fan errors, power supply, and BSOD, this could also be motherboard or chipset related.
 
homerslmpson (author) commented:
OK well I'm sending the UBCD to the manager of that branch and gave him clear instructions on what to do in order to test the RAM.
He's going to let it run overnight and then send me a picture of the screen the next morning.
Guess we'll take it from there.
 
homerslmpson (author) commented:
Wow.  After almost 2 months I finally got someone in that branch to run the memory test.
After running the test overnight, the test showed there were no errors.  That's kind of a bummer.
I was hoping that was the issue.  Now I'm not sure what the next step is.
Any ideas?
[Attachment: Memory test]
 
homerslmpson (author) commented:
I ran the newest version of the HP Insights Diagnostics Online Edition software and when you go to the diagnostics tab you can only run "Logical Drive 1, Storage Controller in Slot 0".
Power Supply 1 and 2 are greyed out and are "not diagnosable".

So I ran the diagnostics for the logical drive and get the following:

Hard drive 1:
Error: F155: The read/write hard error rate recorded in the monitor and performance log is above the acceptable threshold.

Hard drive 2:
OK

Hard drive 3:
Error: F155: The read/write hard error rate recorded in the monitor and performance log is above the acceptable threshold.

Do I take this to be the truth and replace the drives?  Or is this something likely to do with the agents reporting inaccurate information?

Thanks.
 
marcustech commented:
I would advise pulling the drives one at a time and running the manufacturer's diagnostics on them on another PC.
 
homerslmpson (author) commented:
Hmmm.  I see.
The thing is they are 2.5" SAS drives.
I'd need to find a server that accepts these drives which is unlikely in my company.
Any other options?
What if we order one replacement drive, replace one of the bad drives, and then run the diagnostics on the server again?  If the error doesn't show up, we can assume the drive was bad. If the new drive also shows up as bad, we can assume it's not the drive and it's a different problem altogether.
 
homerslmpson (author) commented:
Well, I'm at the point now where the HP tools are confusing me all too much.
The HP Insights Diagnostics shows the errors listed 3 posts up, but if I run the HP Array Diagnostics Utility (8.12.1.0) it shows no errors at all.
Do I need to replace these drives or not?
I'm showing one spare drive already in that server so I don't know if that's of any use.
Any help would be appreciated.
Question has a verified solution.