Solved

Proliant DL380 G4  - Win 2k3 Server BSOD

Posted on 2013-01-12
Last Modified: 2013-01-18
Beginning last night, one of our financial production servers began rebooting itself.  It would come up to the login prompt and before you could log on, it would reboot.  I was able to get into safe mode and set it to not restart automatically but after being in safe mode for just a few minutes, it would reboot there too.  A lot of articles online say that bad memory could be a culprit, so I replaced the 4 sticks with 4 new sticks but that did not change the situation.  I was able to get a small memory dump file off of the box and I read it with the Windows debugging tool.  This was my first time using this tool so I’m not completely sure what it’s telling me, but it appeared that the problem was with the cpqasm2.sys driver.  While in safe mode, I tried updating this driver but I’m not able to do much of anything in safe mode of course since the Windows installer service won’t run.  I did disable a bunch of the HP management services and rebooted into regular mode but that didn’t help any.
I am just wondering how I’m supposed to update drivers or remove applications in safe mode.  I’m not sure where to go next to get this server up and running and time is ticking.  If anyone could provide some assistance, it would be most appreciated.  I have attached the memory dump here in the hopes that some kind soul with a lot more knowledge than I might help out.

Thanks in advance
-Chris
Mini011213-06.txt
0
Question by:HarkinsIT
34 Comments
 

Author Comment

by:HarkinsIT
ID: 38770481
Here is another dump file.  To make it more interesting, the "probable cause of the error" this time is different.  The first one was ntfs.sys and this one is ntkrnlmp.exe.  

Thanks.
Mini011213-07.txt
0
 

Author Comment

by:HarkinsIT
ID: 38770501
If it helps any, here are all of the mini dumps since the problem started last night.
minidumps.zip
0
 

Author Comment

by:HarkinsIT
ID: 38770581
I booted into the recovery console and replaced the cpqasm2.sys driver with the newest one from HP's site.  I noticed that the file was the same size and the same date as the existing one but I gave it a shot anyhow.  It did not work.
0
 
LVL 12

Expert Comment

by:Sandeep
ID: 38770674
Please refer to this

http://support.microsoft.com/kb/832212

Have you tried rebooting your Server in Last Known Good Configuration?
0
 

Author Comment

by:HarkinsIT
ID: 38770679
Thanks for the response.

The article you linked me to refers to updating the cpqasm2.sys driver which I have already done, as mentioned above.  It was already the most current version but I updated it via the recovery console anyhow.

Yup, last known good is usually the first thing I try.  Haven't ever had any luck with it yet but that will never stop me from trying.  :-)
0
 
LVL 24

Accepted Solution

by:
smckeown777 earned 500 total points
ID: 38770682
Hi, I've only looked at a few of the minidumps but they seem to be referencing HP related drivers/services as you've already noticed...

The 2 BSOD codes I've seen are BAD_POOL_HEADER and DRIVER_CORRUPTED_EXPOOL

These usually refer to driver issues (sometimes AV related as well)

You've replaced the 4 sticks of ram - but have you tested the ram? These are all memory related BSOD's, so to eliminate all things you need to at least test the running ram - www.memtest86.com for the software you need...

Anything new installed on server in the last week?
The fact that the dumps point to multiple possible causes (different drivers etc.) makes me think this could be memory related (maybe a bad slot on the board of the server for example)

Run memtest just to confirm it passes and report back thanks...
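
To make the "multiple possible causes" point concrete, here is a rough Python sketch that tallies which module each dump blames. It assumes each dump's debugger output has been saved as text and contains the "Probably caused by" summary line; the sample strings are illustrative stand-ins built from module names mentioned in this thread, not the contents of the attached dumps.

```python
import re
from collections import Counter

# Matches the debugger's summary line, e.g. "Probably caused by : ntkrnlmp.exe ( ... )".
PROBABLE_CAUSE = re.compile(r"^Probably caused by\s*:\s*(\S+)", re.MULTILINE)


def tally_probable_causes(reports):
    """Count how often each module is blamed across saved debugger reports."""
    counts = Counter()
    for text in reports:
        match = PROBABLE_CAUSE.search(text)
        if match:
            # Lower-case so "Ntfs.sys" and "NTFS.SYS" count as the same module.
            counts[match.group(1).lower()] += 1
    return counts


# Illustrative stand-ins only; real reports would be read from the saved text files.
sample_reports = [
    "Probably caused by : Ntfs.sys\n",
    "Probably caused by : ntkrnlmp.exe\n",
    "Probably caused by : SCSIPORT.SYS\n",
]

for module, hits in tally_probable_causes(sample_reports).most_common():
    print(f"{module}: {hits}")
```

If one module dominates the tally, a single driver is the likelier suspect; an even spread across unrelated modules fits the memory/hardware theory better.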
0
 

Author Comment

by:HarkinsIT
ID: 38770685
Will do.  Thanks very much.
0
 

Author Comment

by:HarkinsIT
ID: 38770770
Wow, how long do you think this will take approximately?  I have four 1GB sticks of RAM in this box.  Should I just go home and come back in the morning?  :-)
0
 
LVL 24

Expert Comment

by:smckeown777
ID: 38770775
Nope!

Ok, I assume you've been running for 30 mins yes? If so you are ok, looks good...

Memtest runs forever(sorry if I didn't mention that), it just loops over and over...

If you didn't see any errors then we can say the ram is clean...

You mentioned you booted into Safe mode and still got a BSOD yes? Safe mode eliminates 3rd party drivers which 'should' take the HP stuff out of the loop(I would have thought)

Have you any idea which of those minidumps were from running in Safe mode?

You didn't say if anything was installed on this server recently?  No new packages? What AV is running?
0
 

Author Comment

by:HarkinsIT
ID: 38770780
Yup, still got BSOD in safe mode.  I'm not sure which of the dumps happened when I was in safe mode though unfortunately.  I can generate a few more if that would be helpful.

Nothing was installed onto the server recently except for some Windows Updates earlier in the week.  We're running Kaspersky AV.

Thanks.
0
 

Author Comment

by:HarkinsIT
ID: 38770784
Actually I was wrong.  We had uninstalled the AV from this server several months ago so there is none running on it ATM.

Thanks.
0
 
LVL 24

Expert Comment

by:smckeown777
ID: 38770786
Yes, give safe mode another try, want to see if it helps point to another potential issue...

How are you getting these minidumps off the drive? Does the server stay up for 5 minutes or so, I mean is the dump happening at exactly the same time each time the server boots?

Last thing - have you run chkdsk on the drive?
0
 

Author Comment

by:HarkinsIT
ID: 38770802
0
 

Author Comment

by:HarkinsIT
ID: 38770810
Here's a new dump that happened when in safe mode.

I am booting to safe mode with networking and I'm able to copy the dumps off of the drive and onto a file server.  In safe mode the server does stay up for maybe 3 or 4 minutes before crashing.  In regular mode, I never even get to the logon prompt.  It says "Applying computer settings" for a minute or two and then crashes.  I don't know if the crashes are happening at exactly the same time each time but they are definitely very close to each other.  

I have not run chkdsk.  This server has a C and D partition, both residing on a single RAID 5 array.  I can try that next.
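
For sorting out which dump came from which crash, the minidump file names themselves help: Windows Server 2003 names them MiniMMDDYY-NN, so the date and that day's sequence number can be read straight off the name. A small Python sketch, assuming that naming and assuming all the years fall in the 2000s; the names are the ones attached in this thread. It does not give the time of day of each crash, which would have to come from the file timestamps.

```python
import re
from datetime import date

# MiniMMDDYY-NN, followed by any extension (.dmp, or .txt for converted copies).
MINIDUMP_NAME = re.compile(r"^Mini(\d{2})(\d{2})(\d{2})-(\d{2})", re.IGNORECASE)


def parse_minidump_name(filename):
    """Return (crash_date, sequence) for a name like 'Mini011213-06.dmp', else None."""
    match = MINIDUMP_NAME.match(filename)
    if not match:
        return None
    month, day, year, sequence = (int(group) for group in match.groups())
    # Assumption: two-digit years belong to the 2000s.
    return date(2000 + year, month, day), sequence


names = ["Mini011313-01.dmp", "Mini011213-07.txt", "Mini011213-06.txt"]

# Sort oldest first: by date, then by that day's sequence number.
for name in sorted(names, key=parse_minidump_name):
    crash_date, sequence = parse_minidump_name(name)
    print(f"{name}: {crash_date.isoformat()} dump #{sequence}")
```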
0
 

Author Comment

by:HarkinsIT
ID: 38770811
Oddly enough, after the crash in safe mode, the server restarted and I went to normal mode this time.  I did get to the point where I could put in my credentials but as soon as I hit enter, it crashed again.
0
 
LVL 24

Expert Comment

by:smckeown777
ID: 38770819
Load safe mode without networking(want to strip down to the last)

That last dump points to cpqasm2.sys again...now here's the issue...
Based on this - http://www.spydig.com/file-diagnosis/cpqasm2-sys.html - that file 'could' be something else(i.e. not a HP file at all)

You mentioned that you got the same file off HP to copy over the existing one on the system earlier, but what path was that in? C:\Windows\system32? Cause there could be other locations with the same file(i.e. the infection)

Normally I wouldn't have picked up on this but you said you removed the AV from the server a while back(in my opinion that's a very bad idea)

Which means now we can't be sure it's clean at all...

What I'd try to do is get that cpqasm2.sys file from the server and upload it to www.virustotal.com to see if it reports anything bad, just as a first step - can you try this?
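
A rough Python sketch of the "other locations with the same file" check: walk a folder tree, list every file named cpqasm2.sys and print a SHA-256 hash for each, so copies can be compared with each other and with the fresh download from HP. The function names are made up for illustration, and it assumes the script can be run somewhere that can read the files (for example against a copy of the Windows folder on another machine), which may not be practical on a server that only stays up a few minutes.

```python
import hashlib
import os


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_copies(root, filename):
    """Return {full_path: sha256} for every file under root with the given name."""
    target = filename.lower()  # Windows file names are case-insensitive
    found = {}
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            if name.lower() == target:
                full_path = os.path.join(folder, name)
                found[full_path] = sha256_of(full_path)
    return found


# Example call (the root path is an assumption, adjust to wherever the files are):
#   for path, digest in find_copies(r"C:\Windows", "cpqasm2.sys").items():
#       print(digest, path)
```

Two copies with different hashes are not proof of infection (they could just be different driver versions), but an unexpected copy outside the usual driver folder would be worth a closer look.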
0
 
LVL 24

Expert Comment

by:smckeown777
ID: 38770826
Ignore that last message(spydig isn't a reputable site unfortunately), so my bad...

Just as a test though go ahead and upload that file to virustotal for the sake of it
0
 

Author Comment

by:HarkinsIT
ID: 38770844
VirusTotal says the file checks out clean.

After the last reboot, I chose safe mode again and ya know how when you're booting into safe mode you see the list of drivers that are loading up?  My boot has stopped at system32\acpitabl.dat.  I tried a hard reboot but when it came up it stalled at the same place again.  I found an article suggesting that I try running chkdsk from the recovery console.  I may give that a shot seeing as I can't get anywhere at the moment.
0
 

Author Comment

by:HarkinsIT
ID: 38770866
This server is quite old, nearly eight years old at this point.  What are the chances that this could be hardware failure of some sort?  I guess if the array controller card was bad, it wouldn't boot at all, right?  Perhaps something else internal?  All of the lights on the front of the drives are green and blinking happily.  I would really not like having to rebuild this server, of all the servers we have here.  My backups should be fine but having to reload these ancient apps could be a chore.

So do you think there's much hope of saving this thing?  Things seem to be getting worse.

I appreciate your assistance.
0
 
LVL 24

Expert Comment

by:smckeown777
ID: 38770876
Biggest issue is the fact you can't even keep it up for 5 minutes...even if we do manage to get to a root cause you may not be able to actually do anything(i.e. uninstall for example) due to the speed it is dropping...

Since the BSOD's are memory related it's still possible the drive is to blame. I assume you are getting into recovery console to run chkdsk, yes? The ACPI error relates to hardware issues with the drives (in your case RAID I assume) so yes we are starting to move towards the disk/controller possibly being at fault...

If you can get chkdsk running, if it stays up to run you may be in luck, if it doesn't then yes you may have an unrecoverable situation...

I'm away for the night, late over here, hopefully you'll have some success before I talk to ye again ;)
0
 

Author Comment

by:HarkinsIT
ID: 38770881
Thanks.  I'm just about to call it a night and head out since my belly is rumbling quite loudly.  Recovery console is staying up fine but the fact that chkdsk is taking so long to scan the smaller of the partitions is a bit troubling.  It's currently at 62% so I'm just going to come back in the morning and see how it went.  

I actually have six 146.8GB 10k drives in a single RAID5 array on this server.  

I have a question that I'm pretty sure I already know the answer to.  I have an almost identical DL380 G4 in the rack that is not currently being used for anything.  If I were to take the 6 drives out of the bad server and put them into the spare, would that work?  The array on the spare server is configured the same way as the problem server.  I'm pretty sure that all data would be lost if I did that, but I thought I'd bring it up.

Thanks again.
-Chris
0
 
LVL 24

Expert Comment

by:smckeown777
ID: 38770886
Well chkdsk can take time, a lot of time, so let it be...

As for the drive switch I couldn't say for sure if that would work, not a RAID expert by any means but I don't think it would be that easy...so wouldn't just do that too quick...

Leave it running chkdsk and see if things improve after it finishes...cheers
The fact that recovery console is up means the hardware might be ok, so time will tell

Shane
0
 

Author Comment

by:HarkinsIT
ID: 38772081
When I came in this morning, chkdsk had completed.  It said that it had found and repaired some errors.  I exited the recovery console and rebooted the server and when it started to boot Windows, the Windows disk check started to run.  I've been sitting at step 4 of 5 for about a half hour now, waiting hopefully.......
0
 
LVL 24

Expert Comment

by:smckeown777
ID: 38772100
Right, might still be related to a controller issue of some sort (the fact that it wanted to run a 2nd chkdsk straight after the first)...let it play out and you might get a result, if it BSOD's again after this attempt we are running out of options, I suspect...
0
 

Author Comment

by:HarkinsIT
ID: 38772110
Came up in normal mode, logged on and was feeling pretty good since it appeared to stay up.  About 30 seconds later, it crashed again.  Here's the latest dump file that I pulled off of it when I came back up in safe mode.  (It lasted about 3 or 4 minutes in safe mode before it crashed again.)
Mini011313-01.dmp
0
 
LVL 24

Expert Comment

by:smckeown777
ID: 38772155
This one points to SCSIPORT.SYS...again a different file than previous...

Unless someone else can see anything I'm missing in the dump files I'm lost on this one - one question, how come you removed the AV from the server? With all these different crash files pointing to different items it's hard to say that this isn't caused by some sort of infection...not saying that it is, but normally BSOD's will remain pretty consistent if the cause is related to a driver issue
0
 

Author Comment

by:HarkinsIT
ID: 38772162
This server is very old and the AV was causing some pretty major performance issues.  It is not standard operating procedure for me to remove AV from servers but in this case, it made a pretty big difference to the performance so I let it slide.  I knew it wasn't a good idea when I did it, but now I'm really regretting it.  

Thanks for all of your help.  I really appreciate it.
0
 
LVL 24

Expert Comment

by:smckeown777
ID: 38772215
If this was a pc I'd simply recommend an offline scan, i.e. boot from a CD with an AV suite and run an external scanner against the C drive to see what it found...

Problem for you is this is a server with RAID drives, finding a utility to boot up with existing drivers for the RAID is hard, if we could get that we may be able to then scan the drive and just confirm whether it is an infection...

What time have you left on this? Or is this something that is needed resolved tomorrow?
0
 

Author Comment

by:HarkinsIT
ID: 38772241
Well, I have already accepted the reality that this server is unrecoverable so I started loading up my spare server as a replacement.  Our PeopleSoft guy is here now trying to figure out how he's going to recover all of the apps that are going to need to be re-installed so we're going forward with that plan.

In the meantime, I am now downloading the latest Kaspersky Rescue disk so I can at least give it a shot and see if it'll recognize the RAID array.  I'll keep you posted.......
0
 
LVL 24

Expert Comment

by:smckeown777
ID: 38772259
No bother, good luck with it hope u get something sorted...cheers
0
 

Author Comment

by:HarkinsIT
ID: 38772304
Unfortunately it did not recognize the RAID array and I haven't been able to find a way to get it to load the array controller drivers.  Do you know of a tool that lets you load RAID drivers?
0
 
LVL 24

Expert Comment

by:smckeown777
ID: 38772334
Only one I know of is Microsoft DART - http://www.microsoft.com/windows/enterprise/products/mdop/dart.aspx

But to get this you need an MSDN or Technet subscription unfortunately, it has an AV scanner as well which would be great...I've never been able to locate a boot cd that lets you inject any drivers and would love one for servers as it would be a lifesaver for sure...

Maybe you have or know someone with a subscription(or just search old favourite Google and you'll probably find another source ;)
0
 

Author Closing Comment

by:HarkinsIT
ID: 38789000
I ended up just replacing the server with my spare.  Thanks for your help.
0
 
LVL 24

Expert Comment

by:smckeown777
ID: 38792579
Cheers, thanks for the update...sometimes there's no answer unfortunately!
0
