Solved

SQL won't fail over with MS 2003 Cluster

Posted on 2010-11-23
Last Modified: 2012-05-10
I have an active/passive cluster that runs a SQL database. When the SQL group is owned by one server (WDB06), it runs fine. When I try to fail over to the second server in the group (WDB07), everything in the cluster's MSCS group comes online (DTC, cluster name, cluster IP, and quorum; the DTC log is also on the quorum drive). Then, when I try to fail over the SQL Server group, a few things happen. At first SQL Server would not start and was throwing this error:
----------------------------------------------------------------------------
Event Type:      Error
Event Source:      MSSQLSERVER
Event Category:      (3)
Event ID:      19019
Date:            11/23/2010
Time:            9:02:02 PM
User:            N/A
Computer:      DA-GHN-WDB07
Description:
[sqsrvres] OnlineThread: service stopped while waiting for QP.
----------------------------------------------------------------------------
To fix this, on the server that had control of the quorum drive I ran the command: MSDTC -resetlog
After that I did not get this error again for SQL Server. But I am now having a problem with the SQL Server Agent coming online, and again it is only when I try to move the SQL group onto server WDB07. The odd thing is that when I move the group to this physical server, everything comes online, then within 3 minutes the SQL Server Agent resource turns from online to failed. Event Viewer has the following entries for this:
------------------------------------------------------------------
Event Type:      Error
Event Source:      SQLSERVERAGENT
Event Category:      Service Control
Event ID:      103
Date:            11/23/2010
Time:            9:20:36 PM
User:            N/A
Computer:      DA-GHN-WDB08
Description:
SQLServerAgent could not be started (reason: Unable to connect to server '(local)'; SQLServerAgent cannot start).
-----------------------------------------------------------------------------
Event Type:      Information
Event Source:      SQLSERVERAGENT
Event Category:      Service Control
Event ID:      102
Date:            11/23/2010
Time:            9:20:37 PM
User:            N/A
Computer:      DA-GHN-WDB08
Description:
SQLServerAgent service successfully stopped.
------------------------------------------------------------------------------------------
Event Type:      Error
Event Source:      SQLSERVERAGENT
Event Category:      Failover
Event ID:      53
Date:            11/23/2010
Time:            9:20:38 PM
User:            N/A
Computer:      DA-GHN-WDB07
Description:
[sqagtres] CheckServiceAlive: Service is dead
----------------------------------------------------------------------------------
I am not sure why this agent will not stay online when the SQL group is run on server WDB07 but runs fine when server WDB06 is handling it. I'm not an expert with MS Server 2003 clustering, and I only know a few basic things about SQL, so any detailed help would be great. Let me know if you need any more information.

thank you
Steven
Question by:sdmarek
LVL 4

Expert Comment

by:rlog
ID: 34205924
Seems like the SQL Agent can't connect to the SQL Server service. Do you use separate accounts for the server and the agent?
LVL 2

Author Comment

by:sdmarek
ID: 34206428
Under system services, both SQL Server and SQL Agent (on both servers) run logged in as: production\prdsqlsrvc
On both servers, under Computer Management, this domain account has been added to the Administrators group.

-steven
LVL 2

Author Comment

by:sdmarek
ID: 34206512
Another note: I'm at a new job and this is a network I have inherited, so I did not set any of this up, but I would assume it was working before. When I got to this job, the systems admin before me had started to move this cluster's physical disk resources onto a new SAN. He did the hard part: he attached the LUNs and moved the SQL database and log drives over to the new SAN. All I did was follow:
http://support.microsoft.com/default.aspx?scid=kb;en-us;280353
to move the quorum drive, and then reset the DTC log to the new quorum drive. I don't think that would have broken any permissions, because the domain user that runs SQL Server and the Agent is in the admin group on both servers, but that's a little more history in case it helps.

-steven
LVL 4

Accepted Solution

by:
rlog earned 500 total points
ID: 34209134
Have you ever tried the filemon.exe utility from MS (Sysinternals.com)? In Cluster Administrator, set the SQL Agent resource so that it does not affect the group if it fails. Move the cluster group over to the faulty node and the SQL Agent will fail. Start filemon, which logs all disk activity, then start the SQL Agent (it will fail). Stop filemon and assess the file operations. Look for "File not found" and "Access denied".

You can see which files it tries to open or can't find. Try locating these files. I've also come across misspelled paths in the registry (the sqlagent path), so correcting the path in a reg key has often done the trick.
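The "look for file not found and access denied" step can be scripted once the capture is saved as CSV. The following is only a minimal sketch: it assumes a Process Monitor style export with "Process Name", "Operation", "Path" and "Result" columns, and it assumes the result strings and the agent's process name shown in the comments (older Filemon builds and other SQL Server versions may use different ones). The sample rows and paths are made up for illustration.

```python
import csv
import io

# Result strings that usually point at a missing file or a permissions problem.
# Assumption: Process Monitor wording; Filemon may word these differently.
SUSPECT_RESULTS = {"NAME NOT FOUND", "PATH NOT FOUND", "ACCESS DENIED"}


def suspicious_operations(csv_text, process_name="SQLAGENT.EXE"):
    """Return (operation, path, result) tuples for failed operations of one process."""
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Skip activity from every process except the one under investigation.
        if row["Process Name"].upper() != process_name.upper():
            continue
        if row["Result"].upper() in SUSPECT_RESULTS:
            hits.append((row["Operation"], row["Path"], row["Result"]))
    return hits


# Hypothetical capture, not taken from the thread.
SAMPLE = r'''"Time of Day","Process Name","PID","Operation","Path","Result","Detail"
"9:20:30 PM","SQLAGENT.EXE","1234","CreateFile","S:\MSSQL\LOG\SQLAGENT.OUT","ACCESS DENIED",""
"9:20:30 PM","SQLAGENT.EXE","1234","CreateFile","C:\Example\ok.txt","SUCCESS",""
"9:20:31 PM","sqlservr.exe","4321","ReadFile","C:\Example\other.dat","NAME NOT FOUND",""
"9:20:31 PM","SQLAGENT.EXE","1234","QueryOpen","C:\Example\missing.dll","NAME NOT FOUND",""
'''

for operation, path, result in suspicious_operations(SAMPLE):
    print(f"{result:15} {operation:12} {path}")
```

Filtering on the process name first keeps the list short, since a capture taken during a failover also contains a lot of unrelated disk activity.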

If you're out of options, you could uninstall the faulty node (start the installer on the active, working node and uninstall the passive node). Once it's uninstalled, you can evict the node from the cluster and add it again (maybe after a fresh install). Once it has rejoined the cluster, you can install SQL Server (and service packs) on the passive node from the active node.
LVL 2

Author Comment

by:sdmarek
ID: 34209688
I've used filemon many times (now rolled into Process Monitor), so I can run that, then start messing around with the cluster if I can't find errors there. One thing: since the SQL DB is running (even without failover right now) and this is my production environment, I'm not going to take it offline until next Tuesday between 8pm and 12am (that's my weekly server maintenance window). So this ticket is going to sit for a little while. I'd love any other tips you guys have, but I'm not going to touch the running production environment until Tuesday. I'll keep in touch when I make changes; during the window I can crash the DB for all I care, as long as it's running by 12.

thank you, talk to you guys by tuesday
-steven
LVL 2

Author Comment

by:sdmarek
ID: 34319927
Sorry I haven't been back to this post, I got distracted for the last couple of weeks. I will now actually be trying to get this done this weekend and will keep you posted.

-steven
LVL 2

Author Closing Comment

by:sdmarek
ID: 34493328
We did have a read/write permission problem: the agents were not set to run as the same user on both nodes of the cluster. Filemon showed that.

thanks,
steven
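For anyone wanting to check for the same cause without a file trace, the logon account of each service can be compared across nodes. This is a rough sketch only: it assumes the text comes from running `sc \\<node> qc <service>` against each node, which reports the account on the SERVICE_START_NAME line. The outputs below are trimmed and hypothetical; in particular, the mismatched account on WDB07 is invented for the example, since the thread does not say which account the agent was actually using there.

```python
import re


def start_name(sc_qc_output):
    """Extract the logon account (SERVICE_START_NAME) from `sc qc` text, or None."""
    match = re.search(r"^\s*SERVICE_START_NAME\s*:\s*(.+?)\s*$", sc_qc_output, re.MULTILINE)
    return match.group(1) if match else None


def mismatched_services(outputs):
    """outputs maps (node, service) -> `sc qc` text.

    Returns {service: {node: account}} for services whose account differs between nodes.
    """
    accounts = {}
    for (node, service), text in outputs.items():
        accounts.setdefault(service, {})[node] = start_name(text)
    return {
        service: by_node
        for service, by_node in accounts.items()
        # Windows account names are case-insensitive, so compare lowercased.
        if len({(account or "").lower() for account in by_node.values()}) > 1
    }


# Trimmed, hypothetical `sc qc` outputs (only the lines the sketch needs).
DOMAIN_ACCOUNT = r"""
SERVICE_NAME: EXAMPLE
        SERVICE_START_NAME : production\prdsqlsrvc
"""
OTHER_ACCOUNT = r"""
SERVICE_NAME: EXAMPLE
        SERVICE_START_NAME : LocalSystem
"""

OUTPUTS = {
    ("WDB06", "MSSQLSERVER"): DOMAIN_ACCOUNT,
    ("WDB07", "MSSQLSERVER"): DOMAIN_ACCOUNT,
    ("WDB06", "SQLSERVERAGENT"): DOMAIN_ACCOUNT,
    ("WDB07", "SQLSERVERAGENT"): OTHER_ACCOUNT,  # invented mismatch
}

mismatches = mismatched_services(OUTPUTS)
for service, by_node in mismatches.items():
    print(service, by_node)
```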