Solved

Interesting SQL 2012 Cluster Failover Question

Posted on 2014-03-21
5
327 Views
Last Modified: 2014-04-05
Guys I need help here and if someone has a reasonable explanation about what happened, I need a proof.
I think I know the reason, but just wanted to see out if someone else has another interesting explanation.
Here's the case:
I have an SQL 2012 Cluster configured with two nodes. Unfortunately, the connection to the shared storage (SAN) was down due to a dhcp failure; everything else was statically configured.
So node A; which was the active node, failed over to node B; which was the passive node.
Node B became the active node, and I had connectivity back to the shared storage, but SQL traffic was not flowing through node B until I had to restart node A. Does anyone knows the reason behind that and why I had to restart node A to have SQL traffic flowing through node B.
Note: Before restarting node A, I was able to connect to node B; and it became the active node and everything seemed fine, but I couldn't telnet on port 1433 to node B until I restarted node A.
0
Comment
Question by:kfdiab
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 2
  • 2
5 Comments
 
LVL 28

Expert Comment

by:Ryan McCauley
ID: 39947891
When you say "I couldn't telnet on port 1433 to node B", do you mean to that port on Node B's primary IP address, or to the shared IP address that was owned by Node B? Also, was this the case for all clients connecting during this period, or just for the one you were testing from? It's odd behavior, though.

When you have a cluster, connectivity to the clustered resource is through the clustered IP address that's owned by the active node. When Node A gave up control of the resource and Node B became active, the IP address should have moved over to Node B - more than that, the SQL resource depends on the shared IP address, so it shouldn't have come online if there was a problem bringing up control of the IP address.

All that said, I can't see any reason why you'd be unable to connect to the SQL instance mounted on Node B until you rebooted Node A - I've never seen that occur. I suspect (though this is going out a limb a bit) that control was transferred to Node B properly and it came online as expected, but for some reason the client you were connecting from continued to route requests to the old host, which no longer owned the IP address. Maybe the switch was routing to the old server without knowing the IP had moved, and only the host physically going offline forced the switch to update its path? That's a bit outside, but it's the only thing I can think of to cause what you're describing.

If you have any more detail, I'm interested in hearing the cause you determine.
0
 

Accepted Solution

by:
kfdiab earned 0 total points
ID: 39951770
ryanmccauley,

I couldn't connect to the primary IP address on Node B.
All users were affected.
Well the problem looks in the dependencies. Currently I have the Cluster IP address configured as the only dependency. The cluster disk group needs to be added as a dependency, don't you think so?
Because the problem appeared on the storage level and not the network level.
I still need to simulate the problem and test it after adding the shared disk group as a dependency.
I will keep you posted, but let me have your opinion about that.

Thank you.
0
 
LVL 28

Assisted Solution

by:Ryan McCauley
Ryan McCauley earned 100 total points
ID: 39951891
Absolutely - the resource dependency chain should look like the following:

SQL Server Agent - depends on SQL Server
SQL Server - Depends on all storage and network name
Network name - depends on IP Address
Additional storage - depends on primary storage (if you have additional LUNS mounted to folders on the primary drive)

If you've got the SQL Service waiting for the network name/IP, but not waiting for storage, though could definitely cause a problem. I'm not sure why the storage would come online when you reboot the primary node, but what you're describing would definitely cause an issue.
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 39954522
Hmmmm... While it could be dependency issues, it sounds more like network and permissions...

You haven't mentioned the op sys - assuming it is Win Server 2008 ?

You said : Note: Before restarting node A, I was able to connect to node B; and it became the active node and everything seemed fine

Which indicates to me that Node B was not available via the network rather than having to wait for storage (or other dependants).

When Node A came back online it restored the "network" integrity.

I would be inclined to triple check your DNS resolution and if using Win Server 2008 then there has been some changes with accounts and permissions for clusters and maybe static IP addresses which have not been mapped (or not available to all). So make sure that you haven't inadvertently set up (essentially) a private network which excludes "clients".

Have a read of : http://technet.microsoft.com/en-us/library/cc771404.aspx just to make sure you have the physical network set.
0
 

Author Closing Comment

by:kfdiab
ID: 39979851
The issue was that the disk group was not a dependency.
0

Featured Post

MS Dynamics Made Instantly Simpler

Make Your Microsoft Dynamics Investment Count  & Drastically Decrease Training Time by Providing Intuitive Step-By-Step WalkThru Tutorials.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
Does INTERSECT return opposite from FULL OUTER JOIN? 4 40
SQL State HYT00. Timeout expired proplem 8 43
T-SQL Query 9 35
Need split for SQL data 7 44
Why is this different from all of the other step by step guides?  Because I make a living as a DBA and not as a writer and I lived through this experience. Defining the name: When I talk to people they say different names on this subject stuff l…
In the first part of this tutorial we will cover the prerequisites for installing SQL Server vNext on Linux.
Familiarize people with the process of utilizing SQL Server functions from within Microsoft Access. Microsoft Access is a very powerful client/server development tool. One of the SQL Server objects that you can interact with from within Microsoft Ac…
Via a live example, show how to shrink a transaction log file down to a reasonable size.

732 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question