Win 2003 Enterprise server BSOD whenever cluster resources swing to it

I have a pair of clustered Win 2003 Enterprise Servers.

The DB1 server had been the active node for ages until it BSOD'ed
just a few days ago.  It's an IBM x3650 M3.  IBM replaced the RAID
controller's cache battery, but since then I'm not able to swing the
cluster resources to it.  IBM ran diagnostics & said there's nothing
wrong with the hardware.

These 2 servers are connected to EMC storage.  On the server
that BSODs, a few partitions are shown as "Unallocated".  I logged
a case with EMC: the EMC reporting tool was run & the outputs sent
to them, & EMC got back to say it's a Microsoft bug.  Refer to the
attached Event Viewer entry that corresponds to the BSOD.

Well, the last time we applied MS security patches was about 2 weeks
back, so I'm inclined to disbelieve it's an MS bug.

I haven't had the chance to fix the 'unallocated' partitions yet.

I don't have MS support maintenance, so seeking solutions &
opinions here.

Below is the analysis from EMC support:


==================================================

Please contact Microsoft and have the memory dump analyzed.

Bug Check 0x19: BAD_POOL_HEADER

The BAD_POOL_HEADER bug check has a value of 0x00000019. This indicates that a pool header is corrupt.
 

http://support.microsoft.com/kb/948289

Error message on a Windows Server 2003-based computer:
"Stop error code 0x00000019"

 

### System Event Log

08/12/2012 07:32:44 PM  Information   DB1Svr_tt_BSOD     1001    Save Dump                        The computer has rebooted from a bugcheck.  The bugcheck was:  0x00000019 (0x0000000000000020, 0xfffffa800773aeb0, 0xfffffa800773af20, 0x0000000006070105).  A dump was saved in: C:\WINDOWS\MEMORY.DMP.  

08/12/2012 06:56:26 PM  Information   DB1Svr_tt_BSOD     1001    Save Dump                        The computer has rebooted from a bugcheck.  The bugcheck was:  0x00000019 (0x0000000000000020, 0xfffffa80056df590, 0xfffffa80056df600, 0x0000000006070203).  A dump was saved in: C:\WINDOWS\MEMORY.DMP.  

08/12/2012 01:25:45 AM  Information   DB1Svr_tt_BSOD     1001    Save Dump                        The computer has rebooted from a bugcheck.  The bugcheck was:  0x00000019 (0x0000000000000020, 0xfffffa8006f23520, 0xfffffa8006f23590, 0x0000000006070102).  A dump was saved in: C:\WINDOWS\MEMORY.DMP.  

08/12/2012 12:28:47 AM  Information   DB1Svr_tt_BSOD     1001    Save Dump                        The computer has rebooted from a bugcheck.  The bugcheck was:  0x00000019 (0x0000000000000020, 0xfffffa81552578c0, 0xfffffa8155257930, 0x0000000006070107).  A dump was saved in: C:\WINDOWS\MEMORY.DMP.  

04/13/2011 12:28:06 PM  Information   DB2_Server     1001    Save Dump                        The computer has rebooted from a bugcheck.  The bugcheck was:  0x00000020 (0x0000000000000000, 0x000000000000fffb, 0x0000000000000000, 0x0000000000000001).  A dump was saved in: C:\WINDOWS\MEMORY.DMP.  
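
For anyone reading the Save Dump entries above without a debugger at hand, here is a minimal Python sketch that pulls the stop code and its four parameters out of an event 1001 message. The function names, the regular expression, and the shortened sample lines are illustrative and not part of the EMC analysis. The comment on the parameter layout for stop 0x19 (first parameter 0x20 meaning a corrupt pool block header size, second and third parameters being the expected and the next pool entry) is my reading of Microsoft's bug check reference and should be checked against it.

```python
import re

# Names of the two stop codes that appear in the log above.
BUGCHECK_NAMES = {
    0x19: "BAD_POOL_HEADER",
    0x20: "KERNEL_APC_PENDING_DURING_EXIT",
}

# Matches "The bugcheck was:  0x00000019 (p1, p2, p3, p4)".
BUGCHECK_RE = re.compile(r"The bugcheck was:\s+(0x[0-9a-fA-F]+)\s+\(([^)]*)\)")

# Message text of the four DB1 entries, shortened to the relevant sentence.
SAMPLE_LINES = [
    "The bugcheck was:  0x00000019 (0x0000000000000020, 0xfffffa800773aeb0, 0xfffffa800773af20, 0x0000000006070105).",
    "The bugcheck was:  0x00000019 (0x0000000000000020, 0xfffffa80056df590, 0xfffffa80056df600, 0x0000000006070203).",
    "The bugcheck was:  0x00000019 (0x0000000000000020, 0xfffffa8006f23520, 0xfffffa8006f23590, 0x0000000006070102).",
    "The bugcheck was:  0x00000019 (0x0000000000000020, 0xfffffa81552578c0, 0xfffffa8155257930, 0x0000000006070107).",
]


def parse_bugcheck(line):
    """Return (stop_code, [p1, p2, p3, p4]), or None if no bugcheck is found."""
    match = BUGCHECK_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


def pool_entry_gap(params):
    """Distance in bytes between parameter 3 and parameter 2.

    Assumption: for stop 0x19 with parameter 1 == 0x20 these are the
    expected pool entry and the next pool entry.
    """
    return params[2] - params[1]


for line in SAMPLE_LINES:
    code, params = parse_bugcheck(line)
    name = BUGCHECK_NAMES.get(code, "unknown")
    print(f"{code:#x} {name}: p1={params[0]:#x}, gap={pool_entry_gap(params):#x}")
```

In all four DB1 crashes the first parameter is 0x20 and the two addresses are 0x70 bytes apart. That repetition may hint that the same kind of allocation is hit each time, but only an analysis of MEMORY.DMP can show which driver owns it.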
 

08/12/2012 07:05:47 PM  Information   DB1Svr_tt_BSOD     1201    ClusSvc                          The Cluster Service brought the Resource Group "Group 2_SGFSH" online.  

08/12/2012 07:32:43 PM  Error         DB1Svr_tt_BSOD     6008    EventLog                         The previous system shutdown at 7:07:25 PM on 8/12/2012 was unexpected.  

08/12/2012 06:30:22 PM  Information   DB2_Server     7036    Service Control Manager          The MSSQL$I01 service entered the stopped state.  

08/12/2012 06:30:53 PM  Warning       DB1Svr_tt_BSOD     256     PlugPlayManager                  Timed out sending notification of device interface change to window of "ClusterDiskPnPWatcher"  

08/12/2012 06:56:25 PM  Error         DB1Svr_tt_BSOD     6008    EventLog                         The previous system shutdown at 6:32:03 PM on 8/12/2012 was unexpected.  
 

08/12/2012 12:59:16 AM  Information   DB2_Server     1204    ClusSvc                          The Cluster Service brought the Resource Group "Group 1_SGCE10" offline.  

08/12/2012 12:59:46 AM  Warning       DB1Svr_tt_BSOD     256     PlugPlayManager                  Timed out sending notification of device interface change to window of "ClusterDiskPnPWatcher"  

08/12/2012 01:25:44 AM  Error         DB1Svr_tt_BSOD     6008    EventLog                         The previous system shutdown at 1:00:47 AM on 8/12/2012 was unexpected.  
 

08/11/2012 11:44:50 PM  Information   DB1Svr_tt_BSOD     7036    Service Control Manager          The Microsoft Software Shadow Copy Provider service entered the running state.  

08/11/2012 11:44:50 PM  Information   DB1Svr_tt_BSOD     7035    Service Control Manager          The Acronis VSS Provider service was successfully sent a start control.  

08/11/2012 11:44:50 PM  Information   DB1Svr_tt_BSOD     7036    Service Control Manager          The Acronis VSS Provider service entered the running state.  

08/12/2012 12:28:47 AM  Error         DB1Svr_tt_BSOD     6008    EventLog                         The previous system shutdown at 11:48:40 PM on 8/11/2012 was unexpected.
EvtVwrBSODB1.png
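
To make the timing in the log excerpt easier to see, here is a small Python sketch that measures how long DB1 stayed up after the last event logged before each crash. The pairing of events is my own reading of the excerpt: each row takes the last DB1 or cluster entry shown before a crash and the shutdown time reported by the event 6008 that follows it. The excerpt is selective, so other events may have occurred in between.

```python
from datetime import datetime

# Timestamp layout used in the log excerpt, e.g. "08/12/2012 07:05:47 PM".
FMT = "%m/%d/%Y %I:%M:%S %p"

# (last event shown before the crash, shutdown time from the next event 6008)
# Shutdown times are rewritten in the same zero-padded layout as the log rows.
CRASHES = [
    ("08/11/2012 11:44:50 PM", "08/11/2012 11:48:40 PM"),  # VSS providers started
    ("08/12/2012 12:59:46 AM", "08/12/2012 01:00:47 AM"),  # ClusterDiskPnPWatcher timeout
    ("08/12/2012 06:30:53 PM", "08/12/2012 06:32:03 PM"),  # ClusterDiskPnPWatcher timeout
    ("08/12/2012 07:05:47 PM", "08/12/2012 07:07:25 PM"),  # group "Group 2_SGFSH" online
]


def seconds_between(start, end):
    """Whole seconds elapsed between two timestamps in the log's format."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


for last_event, shutdown in CRASHES:
    print(f"{last_event} -> {shutdown}: {seconds_between(last_event, shutdown)} s")
```

The three crashes on 08/12/2012 happen within one to two minutes of cluster disk or resource group activity on DB1, which fits the observation that the BSOD follows the resources moving to that node.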
sunhux Asked:
 
xDUCKx Commented:
Update your SCSI Driver and your NIC driver to the latest version.  If you have an RDAC driver you might want to look into that also.
 
sunhux (Author) Commented:
I want to do the minimum needed to fix the BSOD so that server is
able to function again;  I don't want to make / apply too many
changes.  So are the SCSI & NIC drivers the most likely to help?

Btw, if I apply the driver upgrades on this server, do I need to
apply the same on the other cluster server (that's currently OK)?

Can the 2 nodes of an MS Cluster co-exist with different BIOS, NIC,
SCSI & HBA driver versions?
 
Manpreet SIngh Khatra (Solutions Architect, Project Lead) Commented:
Which resource fails? If a couple of minutes of downtime is an option, you can check which resource has the issue: go to the properties of each resource in the cluster group, uncheck "Affect the Group", and fail over to the affected node.

Please monitor very closely which resource is the first to fail. I assume it's going to be the Disk or Quorum. Anyway, once it happens, check the location that resource uses. I guess that's where the issue is, as every resource has a location where it saves its files\data.

- Rancy
