Solved

Help with HP ProLiant ML350 G5 Constant BSOD:

Posted on 2011-10-07
Last Modified: 2012-05-12
Hi all.
We've had this server as a domain controller in a remote branch for a few years now and have had problems with rebooting at random.
The rebooting would log a BSOD error in the event log.
I've looked at the minidump using BSOD View and it always states the same thing:
"The problem seems to be caused by the following file: ntoskrnl.exe
PAGE_FAULT_IN_NONPAGED_AREA
*** STOP: 0x00000050 (0xe2474000, 0x00000000, 0xbfab34ec, 0x00000000)
*** ntoskrnl.exe - Address 0x8087c4a0 base at 0x80800000 DateStamp 0x4b27c5b8
"
Some info was omitted because it doesn't seem relevant.
If you need more details from the crash, just let me know.
This server is running Windows Server 2003 R2 SP2 (32-bit).
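For anyone reading the STOP line above by hand, here is a minimal sketch of how the four parameters can be pulled apart. It assumes the argument meanings Microsoft documents for bug check 0x50 (first: the memory address referenced; second: 0 for a read, non-zero for a write; third: the address of the instruction that referenced the memory, if known; fourth: reserved). The function names `parse_stop_line` and `module_offset` are hypothetical helpers made up for illustration and are not part of any tool mentioned in this thread.

```python
import re

# The two lines exactly as quoted in the question.
STOP_LINE = "*** STOP: 0x00000050 (0xe2474000, 0x00000000, 0xbfab34ec, 0x00000000)"
MODULE_LINE = "*** ntoskrnl.exe - Address 0x8087c4a0 base at 0x80800000 DateStamp 0x4b27c5b8"


def parse_stop_line(line):
    """Return (bug_check_code, [param1..param4]) as integers."""
    match = re.search(r"STOP:\s*(0x[0-9a-fA-F]+)\s*\(([^)]*)\)", line)
    if match is None:
        raise ValueError("not a STOP line")
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


def module_offset(line):
    """Return the offset of the reported address from the module base."""
    match = re.search(r"Address\s+(0x[0-9a-fA-F]+)\s+base at\s+(0x[0-9a-fA-F]+)", line)
    if match is None:
        raise ValueError("not a module line")
    return int(match.group(1), 16) - int(match.group(2), 16)


code, params = parse_stop_line(STOP_LINE)
# Assumed meaning for bug check 0x50: second parameter 0 = read, otherwise write.
access = "read" if params[1] == 0 else "write"

print(f"Bug check code:          {code:#x}")
print(f"Memory address touched:  {params[0]:#010x} ({access})")
print(f"Referencing instruction: {params[2]:#010x}")
print(f"Offset into ntoskrnl.exe: {module_offset(MODULE_LINE):#x}")
```

Note that the third parameter is a different address from the one reported for ntoskrnl.exe, so the file named by a minidump viewer is not necessarily the component at fault.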

The same events also repeat in the event log every xx minutes:
1)  HP NC373i: The network link is down.  Check to make sure the network cable is properly connected.
2)  The power subsystem is now in a non-redundant state.
3)  Power supply 2 has failed.
4)  Power supply 2 is now operating correctly.

I figured these are related to the HP Management agents.

I tried updating the server to the newest HP PSP at the time (8.60) but it doesn't seem to help.
I downloaded the newest one (8.70) but I don't feel like it's going to solve the problem.

Are the blue screen reboots and the HP agent warnings related?

This has been an issue for quite a while now and I honestly don't know how to handle it.

Any help would be appreciated.
Question by:homerslmpson
13 Comments
 
LVL 88

Assisted Solution

by:rindi
rindi earned 600 total points
ID: 36931640
Test the RAM using memtest86+. You'll find that on the UBCD. I'd also clean out all dust. As you have redundant PSUs, the crashes shouldn't have been caused by that, but if it is always the same PSU that fails, I would change it ASAP. The LAN link going down also shouldn't cause crashes.

http://ultimatebootcd.com

If the RAM is fine, make sure your Antivirus software is updated, and if you are using some remote control tool also make sure that is up-to-date. Update all your drivers.
 
LVL 1

Author Comment

by:homerslmpson
ID: 36932300
I downloaded the Windows version of UBCD but I'm unsure if I should have done that.  It looks like they have a different one on the main page.  Does it matter?

The antivirus is up to date.

This may actually be the first time I got the warning about the power supply.

The one that usually shows up is below:
"System Information Agent: Health: The Fan Sub-system has lost redundancy.  Replace any failed or missing fans.
 Chassis: '0'
[SNMP TRAP: 6037 in CPQHLTH.MIB]
"

That will immediately be followed up with this one:
"System Information Agent: Health: The Fan Sub-system has returned to a redundant state.
 Chassis: '0'
[SNMP TRAP: 6055 in CPQHLTH.MIB]
"
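To get a feel for how often the fan subsystem flaps, the lost/restored pairs can be counted from an exported event log. The sketch below assumes the log has been saved as plain text with the messages worded as quoted above; the function name `count_flaps` and the export format are assumptions for illustration, not something the HP agents provide.

```python
import re

# Trap numbers taken from the two messages quoted above.
TRAP_LOST = 6037       # "The Fan Sub-system has lost redundancy"
TRAP_RESTORED = 6055   # "The Fan Sub-system has returned to a redundant state"

SAMPLE = """\
System Information Agent: Health: The Fan Sub-system has lost redundancy.  Replace any failed or missing fans.
 Chassis: '0'
[SNMP TRAP: 6037 in CPQHLTH.MIB]
System Information Agent: Health: The Fan Sub-system has returned to a redundant state.
 Chassis: '0'
[SNMP TRAP: 6055 in CPQHLTH.MIB]
"""


def count_flaps(log_text):
    """Count 'lost redundancy' traps that are immediately followed by a 'restored' trap."""
    traps = [int(n) for n in re.findall(r"SNMP TRAP:\s*(\d+)\s+in CPQHLTH\.MIB", log_text)]
    flaps = 0
    previous = None
    for trap in traps:
        if previous == TRAP_LOST and trap == TRAP_RESTORED:
            flaps += 1
        previous = trap
    return flaps


print("Fan redundancy flaps found:", count_flaps(SAMPLE))
```

A high count with no lasting failure would suggest an intermittent fan or connector rather than a fan that has died outright, which is worth passing on to whoever inspects the server.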

Perhaps I can have someone in that branch take a look at the back of the server to see if there are any warning lights, etc.

The memtest needs to be done before Windows loads, yes?

Any idea how long I should run it?

Sorry for all the questions and thanks in advance.
 
LVL 88

Expert Comment

by:rindi
ID: 36932366
I meant the other UBCD, but as memtest is on both it should be OK. The problem is that you have to build the Windows version first, using an XP CD or probably also a 2003 CD (I've never used that, though). The standard UBCD is just an ISO which you burn to a CD. Both have to be booted from, so you need someone at the server's location to put it in the server and boot from it, and yes, you need to run memtest by booting from CD without Windows running. 3 passes should be fine (how long that takes is difficult to say, as it depends on the CPU speed and the size of the RAM).

You will need someone there anyway so he can clean out the dust, and look at the fans. If they cause errors like you just posted that could certainly be a reason for the crashes.
 
LVL 1

Author Comment

by:homerslmpson
ID: 36932390
Hmm.  I see.
I will download the other UBCD as that sounds a lot easier to work with.
Looks like I'm going to have to coordinate with someone in that branch to assist me (this is going to be fun).
Well, thanks for your help.  I guess we're going to have to put this thread on hold for a while as I'll need to have the memory tested and all that jazz.
We'll be in touch!!
Thanks again.
 
LVL 12

Assisted Solution

by:marcustech
marcustech earned 600 total points
ID: 36938904
I agree with Rindi: intermittent ntoskrnl BSODs are often caused by bad memory.

You'll also want to install the HP system management tools and run the Insight online diagnostics and the HP Integrated Management Log (IML) viewer.  If the Internal Health LED on the front panel is illuminated, then you will need to take the side off and get a note / photo of the diagnostic LEDs on the motherboard, which are quite informative on ProLiant ML servers.  If you've got enough memory and don't want the downtime of running memtest, remove half the RAM the next time it's shut down. If it crashes again, swap in the other half of the RAM and see if it still crashes.

Of course if it's under warranty then call HP and get them to walk you through troubleshooting and replace parts if necessary.
 
LVL 6

Expert Comment

by:sifuedition
ID: 36943437
Most ntoskrnl.exe dumps I have seen come back to hardware. That is definitely not to imply that is the only thing that can cause this, just the most common. If the memory checks out, be sure to go to HP's website and find any bootable hardware diagnostics they offer. With a mix of fan errors, power supply, and BSOD, this could also be motherboard or chipset related.
 
LVL 1

Author Comment

by:homerslmpson
ID: 36943476
OK well I'm sending the UBCD to the manager of that branch and gave him clear instructions on what to do in order to test the RAM.
He's going to let it run overnight and then send me a picture of the screen the next morning.
Guess we'll take it from there.
 
LVL 1

Author Comment

by:homerslmpson
ID: 37220543
Wow.  After almost 2 months I finally got someone in that branch to run the memory test.
After running the test overnight, the test showed there were no errors.  That's kind of a bummer.
I was hoping that was the issue.  Now I'm not sure what the next step is.
Any ideas?
Memory test
 
LVL 1

Author Comment

by:homerslmpson
ID: 37237755
I ran the newest version of the HP Insight Diagnostics Online Edition software, and on the Diagnostics tab the only thing I can run is "Logical Drive 1, Storage Controller in Slot 0".
Power Supply 1 and 2 are greyed out and are "not diagnosable".

So I ran the diagnostics for the logical drive and get the following:

Hard drive 1:
Error: F155: The read/write hard error rate recorded in the monitor and performance log is above the acceptable threshold.

Hard drive 2:
OK

Hard drive 3:
Error: F155: The read/write hard error rate recorded in the monitor and performance log is above the acceptable threshold.

Do I take this to be the truth and replace the drives?  Or is this something likely to do with the agents reporting inaccurate information?

Thanks.
 
LVL 12

Expert Comment

by:marcustech
ID: 37237817
I would advise pulling the drives one at a time and running the manufacturer's diagnostics on them in another PC.
 
LVL 1

Author Comment

by:homerslmpson
ID: 37237865
Hmmm.  I see.
The thing is they are 2.5" SAS drives.
I'd need to find a server that accepts these drives which is unlikely in my company.
Any other options?
What if we order one replacement drive, replace one of the bad drives and then run the diagnostics on the server again?   If the error doesn't show up, we can assume the drive was bad and if the new drive also shows up as bad, we can assume it's not the drive and it's a different problem altogether.
 
LVL 1

Author Comment

by:homerslmpson
ID: 37278906
Well I'm at the point now where the HP tools are confusing me all too much.
HP Insight Diagnostics shows the errors listed 3 posts up, but if I run the HP Array Diagnostics Utility (8.12.1.0) it shows no errors at all.
Do I need to replace these drives or not?
I'm showing one spare drive already in that server so I don't know if that's of any use.
Any help would be appreciated.
 
LVL 47

Accepted Solution

by:
David earned 800 total points
ID: 37289706
Memtest & the HP diagnostics and UBCD are toys compared to a test board that one plugs into a motherboard to run hardware diagnostics.  The pure software tools simply do not have the ability to fully test any motherboard.  So don't assume the hardware is OK when you haven't given it a thorough test.

As you probably can't justify the expense of purchasing proper test equipment, and this is a DC, then why not create another DC as a virtual machine at that office to take over so you can get a good downtime window, and then load a fresh copy of everything?   It is either hardware or software or interaction between the two.

You do not have what you need to certify the hardware is 100%, but if you reload the O/S (sorry, backup/restore won't cut it, as you could have corrupted DLLs or files) and patch it up, then you can see if it becomes stable.

If it starts crashing again, then you know it is hardware-related, and you can get it repaired by HP or a pro who knows what they are doing.  If not, and the system stays up OK, then problem solved.



Question has a verified solution.
