SQL Server on Microsoft Cluster Service

I have a 2 node SQL Server cluster running on Microsoft Cluster Service, this has been running without issue for at least a year, with at least one failover per month (Microsoft Updates).  This last month however, when the cluster failed over from Node 1 to Node 2, Node 2 would not start the services.  Forcing everything back to Node 1 brought the services back online.  Investigation on Node 2 shows an event MSSQLSERVER - ID: 9003 "The log scan number passed to log scan in database 'master' is invalid."

Everything I have read points to a corrupted Master database, but if the database was corrupt, why does it run without issue on Node 1?  Have not attempted to force a failover back to Node 2 as I don't know if it will in fact corrupt the master database causing it to not run on Node 1.  I have good full backups of all databases including master, so I may just force a failover and deal with whatever happens, just confused as to why it would fine on Node 1 and not Node 2, both servers are identical.

Have also considered evicting Node 2 and re-adding, but don't really know what difference that would make.
TC3CTAsked:
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Paul MacDonaldDirector, Information SystemsCommented:
It's possible the disk resource wasn't ready when the SQL service tried to start.

When it's convenient, you can try rolling the cluster back over to see if the problem recurs.  If not, that may have been the issue.
0
geek_vjCommented:
As this is happening on one node of a cluster, this might be a one-time issue. Please try failing over to node 2 again during downtime and check if the error exists.

It is highly impossible that the same error occurs again.

If yes, please post the same so that we can assist you further.

Hope this helps!
0
Paul MacDonaldDirector, Information SystemsCommented:
[geek_vj] - restating what I've already posted does not advance the discussion.  Please refrain from doing it in the future.
0
Cloud Class® Course: Amazon Web Services - Basic

Are you thinking about creating an Amazon Web Services account for your business? Not sure where to start? In this course you’ll get an overview of the history of AWS and take a tour of their user interface.

TC3CTAuthor Commented:
At the next maintenance window, I will give it a shot.
0
TC3CTAuthor Commented:
Tried another failover, same result.  More info...

Everything is  running on VMWare, ESXi 5.1.0, 1117900, HP Blades, HP SAN/

Not interested in a debate about whether or not I need a MS cluster since I am running on VMWare, or whether or not I should even be on VMWare with SQL, existing setup worked for 2 years with no issue.

It appears the time on SQL1 was about 1 min 15 secs behind SQL2, but I can’t believe that was the issue as the named instance works just fine on both nodes.  The only errors are with the default instance and only on SQL2.

Node 1
Windows Server 2008 R2 Enterprise 64 Bit
      - (4) E5530 @ 2.40 GHz
      - RAM: 15GB

SQL Server 2008 R2 64-bit
      -Default Instance
      -1 Named Instance

Node 2
Windows Server 2008 R2 Enterprise 64 Bit
      - (4) E5530 @ 2.40 GHz
      - RAM: 15GB

SQL Server 2008 R2 64-bit
      -Default Instance
      -1 Named Instance

Failover Cluster Manager
      Services and Applications
             DBClusterDTC
             SQL Server (MSSQLSERVER)
             SQL Server(STUDENTEMAIL)
        Nodes
             SQL1
             SQL2
        Storage
             Disk Witness in Quorum
                 Cluster Disk 1
             SQL Server (MSSQLSERVER)
                  Cluster Disk 2
                Cluster Disk 3
                  Cluster Disk 4
             SQL Server (STUDENTEMAIL)
                Cluster Disk 5
       Network
            Public
            Private

Initiate Fail over to SQL2
System Log
The Distributed Transaction Coordinator (5ad38b7c-3999-46ef-985f-b474c9073c7b) service entered the running state.

The SQL Server (MSSQLSERVER) service entered the stopped state.

The SQL Server (MSSQLSERVER) service terminated with service-specific error WARNING: You have until SQL Server (MSSQLSERVER) to logoff. If you have not logged off at this time, your session will be disconnected, and any open files or devices you have open may lose data.

Cluster resource 'SQL Server' in clustered service or application 'SQL Server (MSSQLSERVER)' failed.

The Cluster service failed to bring clustered service or application 'SQL Server (MSSQLSERVER)' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.

Application Log
The log scan number (2818:402:2) passed to log scan in database 'master' is not valid. This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf). If this error occurred during replication, re-create the publication. Otherwise, restore from backup if the problem results in a failure during startup.

Cannot recover the master database. SQL Server is unable to run. Restore master from a full backup, repair it, or rebuild it. For more information about how to rebuild the master database, see SQL Server Books Online.

[sqsrvres] StartResourceService: Failed to start MSSQLSERVER service.  CurrentState: 1

[sqsrvres] OnlineThread: ResUtilsStartResourceService failed (status 435)

[sqsrvres] OnlineThread: Error 435 bringing resource online.

The Cluster service failed to bring clustered service or application 'SQL Server (MSSQLSERVER)' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered service or application.

Fail back to SQL1
System Log:
The SQL Server (MSSQLSERVER) service entered the running state.

Application Log
Starting up database 'master'.

The tail of the log for database master is being rewritten to match the new sector size of 4096 bytes.  1536 bytes at offset 846336 in file H:\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\mastlog.ldf will be written.

Recovery is writing a checkpoint in database 'master' (1). This is an informational message only. No user action is required.

CHECKDB for database 'master' finished without errors on 2013-09-20 01:00:06.377 (local time). This is an informational message only; no user action is required.

At this point I am not sure if I should just try an uninstall / re-install of SQL Server on SQL2, or if I should just start a new server from scratch.
0
Paul MacDonaldDirector, Information SystemsCommented:
"The log scan number (2818:402:2) passed to log scan in database 'master' is not valid. This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf). If this error occurred during replication, re-create the publication. Otherwise, restore from backup if the problem results in a failure during startup.

Cannot recover the master database. SQL Server is unable to run. "

This is almost certainly the problem, but why it's an issue on one server but not the other I'm not sure.   I'd urge you to make sure you have a good back-up, then try to run a repair on your Master database and see if that has an effect.
0
TC3CTAuthor Commented:
I am fully aware that is the problem....as pointed out in my original post...

"Investigation on Node 2 shows an event MSSQLSERVER - ID: 9003 "The log scan number passed to log scan in database 'master' is invalid."

Everything I have read points to a corrupted Master database, but if the database was corrupt, why does it run without issue on Node 1?"
0
TC3CTAuthor Commented:
Solution was to uninstall the default SQL instance on node 2 and reinstall.
0

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
TC3CTAuthor Commented:
Solved problem on my own.
0
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Microsoft SQL Server 2008

From novice to tech pro — start learning today.