  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1747

Win 2003 Enterprise server BSOD whenever cluster resources swing to it

I have a pair of clustered Win 2003 Enterprise servers.

The DB1 server had been the active node for ages, and just a few days
ago it BSOD'ed.  It's an IBM x3650 M3.  IBM replaced the RAID
controller's cache battery, but since then I have not been able to
swing the cluster resources to it: it BSODs whenever I try.  IBM ran
diagnostics & said there's nothing wrong with the hardware.

These 2 servers are connected to EMC storage.  On the server
that BSODs, a few partitions are shown as "Unallocated".  I logged
a case with EMC: the EMC reporting tool was run & its output sent to
them, & EMC got back to say it's a Microsoft bug.  Refer to the
attached Event Viewer entry that corresponds to the BSOD.

Well, the last time we applied MS security patches was about 2 weeks
back, so I'm inclined to disbelieve it's an MS bug.

I haven't had the chance to fix the "Unallocated" partitions yet.

I don't have MS support maintenance, so I'm seeking solutions &
opinions here.

Below is the analysis from EMC support:


==================================================

Please contact Microsoft and have the memory dump analyzed.

Bug Check 0x19: BAD_POOL_HEADER

The BAD_POOL_HEADER bug check has a value of 0x00000019. This indicates that a pool header is corrupt.
 

http://support.microsoft.com/kb/948289

Error message on a Windows Server 2003-based computer:
"Stop error code 0x00000019"

 

### System Event Log

08/12/2012 07:32:44 PM  Information   DB1Svr_tt_BSOD     1001    Save Dump                        The computer has rebooted from a bugcheck.  The bugcheck was:  0x00000019 (0x0000000000000020, 0xfffffa800773aeb0, 0xfffffa800773af20, 0x0000000006070105).  A dump was saved in: C:\WINDOWS\MEMORY.DMP.  

08/12/2012 06:56:26 PM  Information   DB1Svr_tt_BSOD     1001    Save Dump                        The computer has rebooted from a bugcheck.  The bugcheck was:  0x00000019 (0x0000000000000020, 0xfffffa80056df590, 0xfffffa80056df600, 0x0000000006070203).  A dump was saved in: C:\WINDOWS\MEMORY.DMP.  

08/12/2012 01:25:45 AM  Information   DB1Svr_tt_BSOD     1001    Save Dump                        The computer has rebooted from a bugcheck.  The bugcheck was:  0x00000019 (0x0000000000000020, 0xfffffa8006f23520, 0xfffffa8006f23590, 0x0000000006070102).  A dump was saved in: C:\WINDOWS\MEMORY.DMP.  

08/12/2012 12:28:47 AM  Information   DB1Svr_tt_BSOD     1001    Save Dump                        The computer has rebooted from a bugcheck.  The bugcheck was:  0x00000019 (0x0000000000000020, 0xfffffa81552578c0, 0xfffffa8155257930, 0x0000000006070107).  A dump was saved in: C:\WINDOWS\MEMORY.DMP.  

04/13/2011 12:28:06 PM  Information   DB2_Server     1001    Save Dump                        The computer has rebooted from a bugcheck.  The bugcheck was:  0x00000020 (0x0000000000000000, 0x000000000000fffb, 0x0000000000000000, 0x0000000000000001).  A dump was saved in: C:\WINDOWS\MEMORY.DMP.  
 

08/12/2012 07:05:47 PM  Information   DB1Svr_tt_BSOD     1201    ClusSvc                          The Cluster Service brought the Resource Group "Group 2_SGFSH" online.  

08/12/2012 07:32:43 PM  Error         DB1Svr_tt_BSOD     6008    EventLog                         The previous system shutdown at 7:07:25 PM on 8/12/2012 was unexpected.  

08/12/2012 06:30:22 PM  Information   DB2_Server     7036    Service Control Manager          The MSSQL$I01 service entered the stopped state.  

08/12/2012 06:30:53 PM  Warning       DB1Svr_tt_BSOD     256     PlugPlayManager                  Timed out sending notification of device interface change to window of "ClusterDiskPnPWatcher"  

08/12/2012 06:56:25 PM  Error         DB1Svr_tt_BSOD     6008    EventLog                         The previous system shutdown at 6:32:03 PM on 8/12/2012 was unexpected.  
 

08/12/2012 12:59:16 AM  Information   DB2_Server     1204    ClusSvc                          The Cluster Service brought the Resource Group "Group 1_SGCE10" offline.  

08/12/2012 12:59:46 AM  Warning       DB1Svr_tt_BSOD     256     PlugPlayManager                  Timed out sending notification of device interface change to window of "ClusterDiskPnPWatcher"  

08/12/2012 01:25:44 AM  Error         DB1Svr_tt_BSOD     6008    EventLog                         The previous system shutdown at 1:00:47 AM on 8/12/2012 was unexpected.  
 

08/11/2012 11:44:50 PM  Information   DB1Svr_tt_BSOD     7036    Service Control Manager          The Microsoft Software Shadow Copy Provider service entered the running state.  

08/11/2012 11:44:50 PM  Information   DB1Svr_tt_BSOD     7035    Service Control Manager          The Acronis VSS Provider service was successfully sent a start control.  

08/11/2012 11:44:50 PM  Information   DB1Svr_tt_BSOD     7036    Service Control Manager          The Acronis VSS Provider service entered the running state.  

08/12/2012 12:28:47 AM  Error         DB1Svr_tt_BSOD     6008    EventLog                         The previous system shutdown at 11:48:40 PM on 8/11/2012 was unexpected.
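To compare the repeated bugchecks side by side, the stop code and its four parameters can be pulled out of the "Save Dump" (event 1001) message text. The following is only a minimal sketch: `parse_bugcheck` and the `messages` list are hypothetical names introduced here, and the two sample strings are copied from the log above rather than read from a live event log.

```python
import re
from collections import Counter

# Matches the "Save Dump" (event 1001) message text shown in the log above.
BUGCHECK_RE = re.compile(
    r"The bugcheck was:\s+(0x[0-9a-fA-F]+)\s+\(([^)]*)\)"
)

def parse_bugcheck(message):
    """Return (stop_code, [param1..param4]) as ints, or None if no match."""
    m = BUGCHECK_RE.search(message)
    if m is None:
        return None
    code = int(m.group(1), 16)
    params = [int(p.strip(), 16) for p in m.group(2).split(",")]
    return code, params

# Two message texts copied from the event log above (one per server).
messages = [
    "The computer has rebooted from a bugcheck.  The bugcheck was:  "
    "0x00000019 (0x0000000000000020, 0xfffffa800773aeb0, "
    "0xfffffa800773af20, 0x0000000006070105).  A dump was saved in: "
    "C:\\WINDOWS\\MEMORY.DMP.",
    "The computer has rebooted from a bugcheck.  The bugcheck was:  "
    "0x00000020 (0x0000000000000000, 0x000000000000fffb, "
    "0x0000000000000000, 0x0000000000000001).  A dump was saved in: "
    "C:\\WINDOWS\\MEMORY.DMP.",
]

parsed = [parse_bugcheck(m) for m in messages]

# Tally how often each stop code occurs.
stop_codes = Counter(hex(code) for code, _ in parsed)

for code, params in parsed:
    print(hex(code), [hex(p) for p in params])
```

Feeding all five 1001 entries through the same function would show that the four DB1 crashes share stop code 0x19 with the same first parameter, while the older DB2 entry is a different stop code.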
EvtVwrBSODB1.png
 
xDUCKx commented:
Update your SCSI Driver and your NIC driver to the latest version.  If you have an RDAC driver you might want to look into that also.
 
sunhux (author) commented:
I want to do the minimum needed to fix the BSOD so that server can
function again; I don't want to make / apply too many changes.
So are the SCSI & NIC drivers the most likely to help?

Btw, if I apply the driver upgrades on this server, do I need to
apply the same on the other cluster server (that's currently OK)?

Can the 2 nodes of an MS Cluster co-exist with different BIOS, NIC,
SCSI & HBA driver versions?
 
Manpreet SIngh Khatra (Solutions Architect, Project Lead) commented:
Which resource fails? If a couple of minutes of downtime is an option, you can check which resource has the issue: go to the properties of each resource in the cluster group, uncheck "Affect the group", and fail over to the affected node.

Please monitor very closely which resource is the first to fail. I assume it's going to be the Disk or Quorum resource. Once it happens, check the location that resource uses; I guess that's where the issue is, as every resource has a location where it saves its files\data.

- Rancy
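One way to support the "which resource fails first" check from the logs already posted is to list the events that fall shortly before each unexpected shutdown. The sketch below is only an illustration under stated assumptions: the function names, the `events` tuples and the 5-minute window are hypothetical choices, the sample entries are copied from the System Event Log in the question, and the timestamps are assumed to be in month/day/year order.

```python
import re
from datetime import datetime, timedelta

# Event 6008 text: "The previous system shutdown at 7:07:25 PM on 8/12/2012 ..."
SHUTDOWN_RE = re.compile(r"shutdown at (\d+:\d+:\d+ [AP]M) on (\d+/\d+/\d+)")
STAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # assumed month/day/year order

def shutdown_time(message):
    """Extract the crash time reported by an event 6008 message."""
    m = SHUTDOWN_RE.search(message)
    if m is None:
        return None
    return datetime.strptime(f"{m.group(2)} {m.group(1)}", STAMP_FORMAT)

def events_before(crash, events, minutes=5):
    """Return events logged within `minutes` before the crash, oldest first."""
    start = crash - timedelta(minutes=minutes)
    hits = []
    for stamp, source, text in events:
        when = datetime.strptime(stamp, STAMP_FORMAT)
        if start <= when <= crash:
            hits.append((when, source, text))
    return sorted(hits)

# (timestamp, source, message) tuples copied from the log in the question.
events = [
    ("08/12/2012 07:05:47 PM", "ClusSvc",
     'The Cluster Service brought the Resource Group "Group 2_SGFSH" online.'),
    ("08/12/2012 06:30:53 PM", "PlugPlayManager",
     "Timed out sending notification of device interface change to window "
     'of "ClusterDiskPnPWatcher"'),
]

crash = shutdown_time(
    "The previous system shutdown at 7:07:25 PM on 8/12/2012 was unexpected."
)
for when, source, text in events_before(crash, events):
    print(when, source, text)
```

With these sample entries, only the ClusSvc event for "Group 2_SGFSH" falls inside the window before the 7:07:25 PM shutdown; widening `minutes` pulls in earlier events.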
