Solved

SQL won't fail over with MS 2003 Cluster

Posted on 2010-11-23
Last Modified: 2012-05-10
I have an active/passive cluster that runs a SQL database. When the SQL group is owned by one server (WDB06) it runs fine. When I try to fail over to the second server in the group (WDB07), everything in the cluster's MSCS group comes online (DTC, cluster name, cluster IP, and quorum; the DTC log is also on the quorum drive). Then, when I try to fail over the SQL Server group, a few things happen. At first SQL Server would not start and was throwing this error:
----------------------------------------------------------------------------
Event Type:      Error
Event Source:      MSSQLSERVER
Event Category:      (3)
Event ID:      19019
Date:            11/23/2010
Time:            9:02:02 PM
User:            N/A
Computer:      DA-GHN-WDB07
Description:
[sqsrvres] OnlineThread: service stopped while waiting for QP.
----------------------------------------------------------------------------
What I did for this problem was run this command on the server that had control of the quorum drive: MSDTC -resetlog
After that I did not get this error again for SQL Server. But I am having a problem with SQL Server Agent coming online, and again it is only when I try to move the SQL group onto server WDB07. The odd thing is that when I move the group to this physical server, everything comes online, then within 3 minutes the Server Agent turns from online to failed. Event Viewer has the following log for this:
------------------------------------------------------------------
Event Type:      Error
Event Source:      SQLSERVERAGENT
Event Category:      Service Control
Event ID:      103
Date:            11/23/2010
Time:            9:20:36 PM
User:            N/A
Computer:      DA-GHN-WDB08
Description:
SQLServerAgent could not be started (reason: Unable to connect to server '(local)'; SQLServerAgent cannot start).
-----------------------------------------------------------------------------
Event Type:      Information
Event Source:      SQLSERVERAGENT
Event Category:      Service Control
Event ID:      102
Date:            11/23/2010
Time:            9:20:37 PM
User:            N/A
Computer:      DA-GHN-WDB08
Description:
SQLServerAgent service successfully stopped.
------------------------------------------------------------------------------------------
Event Type:      Error
Event Source:      SQLSERVERAGENT
Event Category:      Failover
Event ID:      53
Date:            11/23/2010
Time:            9:20:38 PM
User:            N/A
Computer:      DA-GHN-WDB07
Description:
[sqagtres] CheckServiceAlive: Service is dead
----------------------------------------------------------------------------------
I am not too sure why this agent will not stay online when the SQL group is run on server WDB07 but runs fine when server WDB06 is handling it. I'm not an expert with MS Server 2003 clustering, and I only know a few basic things about SQL, so any detailed help would be great. Let me know if you need any more information.

thank you
Steven
Question by:sdmarek
7 Comments
 
LVL 4

Expert Comment

by:rlog
ID: 34205924
Seems like the SQL Agent can't connect to the SQL Server service. Do you use separate accounts for the server and the agent?
 
LVL 2

Author Comment

by:sdmarek
ID: 34206428
Under system services, both SQL Server and SQL Agent (on both servers) run logged in as: production\prdsqlsrvc
On the servers, when I go to Computer Management, this domain account is in the Administrators group on both servers.

-steven
 
LVL 2

Author Comment

by:sdmarek
ID: 34206512
Another note: I'm at my new job, and this is a network I inherited, so I did not set any of this up, but I would assume this was working before. When I got to this job, the systems admin before me had started to move this cluster's physical disk resources onto a new SAN. He did the hard part: he attached the LUNs and moved the SQL database and log drives over to the new SAN. All I did was follow:
http://support.microsoft.com/default.aspx?scid=kb;en-us;280353
to move the quorum drive, and then reset the DTC log to the new quorum drive. I don't think that would have broken any permissions or anything, because the domain user that runs SQL Server and Agent is in the admin group on both servers, but that's a little more history if it helps.

-steven

 
LVL 4

Accepted Solution

by:
rlog earned 500 total points
ID: 34209134
Have you ever tried the filemon.exe utility from MS (Sysinternals.com)? In Cluster Administrator, set the SQL Agent resource so that it does not affect the group if it fails. Move the cluster group over to the faulty node and the SQL Agent will fail. Start Filemon, which logs all disk activity, then start SQL Agent (it will fail). Stop Filemon and assess the file operations. Look for "File not found" and "Access denied".

You can see which files it tries to open or can't find. Try locating these files. I've come across misspelled paths in the registry as well (the SQL Agent path), so correcting the path in a registry key has often done the trick.

If you're out of options, you could uninstall the faulty node (start the installer on the active, working node and uninstall the passive node). Once it's uninstalled, you can evict the node from the cluster and add it again (maybe after a fresh install?). Once it has rejoined the cluster, you can install SQL Server (and service packs) on the passive node from the active node.
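
For anyone repeating the capture step above, here is a rough sketch of sifting the log for the failures worth looking at. It assumes the capture was saved as a Process Monitor CSV export with the column headers "Process Name", "Operation", "Path", and "Result"; the original Filemon log format is different and would need its own parsing. The process name `sqlagent.exe` and every row in the sample are made up for illustration, so adjust them to whatever the agent binary is called on your node.

```python
import csv
import io

# Result values that usually point at a missing file or a permission problem.
SUSPECT_RESULTS = ("NOT FOUND", "ACCESS DENIED")


def suspect_operations(csv_text, process_name="sqlagent.exe"):
    """Return (operation, path, result) for one process's failed-looking rows.

    Assumes Process Monitor style CSV headers; process_name is a guess
    that should be changed to match the actual agent executable.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row.get("Process Name", "").lower() != process_name.lower():
            continue
        result = row.get("Result", "").upper()
        if any(marker in result for marker in SUSPECT_RESULTS):
            rows.append((row.get("Operation", ""), row.get("Path", ""), result))
    return rows


# Made-up sample rows for illustration only; not from the poster's system.
SAMPLE = r'''"Time of Day","Process Name","PID","Operation","Path","Result","Detail"
"9:20:30 PM","sqlagent.exe","1234","CreateFile","X:\example\SQLAGENT.OUT","ACCESS DENIED",""
"9:20:31 PM","sqlagent.exe","1234","RegOpenKey","HKLM\Software\Example\SQLServerAgent","NAME NOT FOUND",""
"9:20:32 PM","sqlagent.exe","1234","ReadFile","X:\example\other.dat","SUCCESS",""
"9:20:33 PM","sqlservr.exe","5678","CreateFile","X:\example\ERRORLOG","ACCESS DENIED",""
'''

if __name__ == "__main__":
    for operation, path, result in suspect_operations(SAMPLE):
        print(f"{result:15} {operation:12} {path}")
```

When reading a real export from disk, opening the file with `encoding="utf-8-sig"` and passing its text to `suspect_operations` avoids a stray byte-order mark in the first header name if one is present.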
 
LVL 2

Author Comment

by:sdmarek
ID: 34209688
I've used Filemon many times (now rolled into Process Monitor), so I can run that, then start messing around with the cluster if I can't find errors there. One little thing: since the SQL DB is running (even without failover right now) and this is my production environment, I'm not going to take it offline until next Tuesday between 8 PM and 12 AM, when I can work on it (that's my weekly server maintenance window). So this ticket is going to sit for a little while. I'd love any other tips you guys have, but I'm not going to touch the running production environment until Tuesday. I'll keep in touch when I do make changes and have the option to crash the DB for all I care... as long as it's running by 12.

Thank you, talk to you guys by Tuesday
-steven
 
LVL 2

Author Comment

by:sdmarek
ID: 34319927
Sorry I haven't been back to this post; I got distracted for the last couple of weeks. I will now actually be trying to get this done this weekend and will keep you posted.

-steven
 
LVL 2

Author Closing Comment

by:sdmarek
ID: 34493328
We did have a read/write permission problem: the agents were not set to run as the same user on both nodes of the cluster. Filemon showed that.

thanks,
steven
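
A quick way to catch this kind of mismatch up front is to compare the logon account reported for the agent service on each node. The sketch below assumes you have saved the text output of `sc \\NODE qc SQLSERVERAGENT` for each node and that it contains a `SERVICE_START_NAME` line; the sample outputs are trimmed, and the `LocalSystem` value for the second node is made up for illustration, since the thread does not say which account that node was actually using.

```python
import re


def start_name(sc_qc_output):
    """Extract the SERVICE_START_NAME value from `sc qc` text, or None."""
    match = re.search(
        r"^\s*SERVICE_START_NAME\s*:\s*(.+?)\s*$", sc_qc_output, re.MULTILINE
    )
    return match.group(1) if match else None


def mismatched_accounts(outputs_by_node):
    """Return {node: account} if the nodes disagree, else an empty dict."""
    accounts = {node: start_name(text) for node, text in outputs_by_node.items()}
    # Windows account names are case-insensitive, so compare lowercased.
    distinct = {(account or "").lower() for account in accounts.values()}
    return accounts if len(distinct) > 1 else {}


# Trimmed, illustrative `sc qc` output; the second account is hypothetical.
NODE_A = r"""SERVICE_NAME: SQLSERVERAGENT
        TYPE               : 10  WIN32_OWN_PROCESS
        START_TYPE         : 3   DEMAND_START
        SERVICE_START_NAME : production\prdsqlsrvc
"""
NODE_B = r"""SERVICE_NAME: SQLSERVERAGENT
        TYPE               : 10  WIN32_OWN_PROCESS
        START_TYPE         : 3   DEMAND_START
        SERVICE_START_NAME : LocalSystem
"""

if __name__ == "__main__":
    for node, account in mismatched_accounts({"WDB06": NODE_A, "WDB07": NODE_B}).items():
        print(f"{node}: {account}")
```

The same check can be run against the MSSQLSERVER service output to confirm the server and agent accounts line up on both nodes.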
