Solved

Interesting SQL 2012 Cluster Failover Question

Posted on 2014-03-21
5
315 Views
Last Modified: 2014-04-05
Guys I need help here and if someone has a reasonable explanation about what happened, I need a proof.
I think I know the reason, but just wanted to see out if someone else has another interesting explanation.
Here's the case:
I have an SQL 2012 Cluster configured with two nodes. Unfortunately, the connection to the shared storage (SAN) was down due to a dhcp failure; everything else was statically configured.
So node A; which was the active node, failed over to node B; which was the passive node.
Node B became the active node, and I had connectivity back to the shared storage, but SQL traffic was not flowing through node B until I had to restart node A. Does anyone knows the reason behind that and why I had to restart node A to have SQL traffic flowing through node B.
Note: Before restarting node A, I was able to connect to node B; and it became the active node and everything seemed fine, but I couldn't telnet on port 1433 to node B until I restarted node A.
0
Comment
Question by:kfdiab
  • 2
  • 2
5 Comments
 
LVL 28

Expert Comment

by:Ryan McCauley
ID: 39947891
When you say "I couldn't telnet on port 1433 to node B", do you mean to that port on Node B's primary IP address, or to the shared IP address that was owned by Node B? Also, was this the case for all clients connecting during this period, or just for the one you were testing from? It's odd behavior, though.

When you have a cluster, connectivity to the clustered resource is through the clustered IP address that's owned by the active node. When Node A gave up control of the resource and Node B became active, the IP address should have moved over to Node B - more than that, the SQL resource depends on the shared IP address, so it shouldn't have come online if there was a problem bringing up control of the IP address.

All that said, I can't see any reason why you'd be unable to connect to the SQL instance mounted on Node B until you rebooted Node A - I've never seen that occur. I suspect (though this is going out a limb a bit) that control was transferred to Node B properly and it came online as expected, but for some reason the client you were connecting from continued to route requests to the old host, which no longer owned the IP address. Maybe the switch was routing to the old server without knowing the IP had moved, and only the host physically going offline forced the switch to update its path? That's a bit outside, but it's the only thing I can think of to cause what you're describing.

If you have any more detail, I'm interested in hearing the cause you determine.
0
 

Accepted Solution

by:
kfdiab earned 0 total points
ID: 39951770
ryanmccauley,

I couldn't connect to the primary IP address on Node B.
All users were affected.
Well the problem looks in the dependencies. Currently I have the Cluster IP address configured as the only dependency. The cluster disk group needs to be added as a dependency, don't you think so?
Because the problem appeared on the storage level and not the network level.
I still need to simulate the problem and test it after adding the shared disk group as a dependency.
I will keep you posted, but let me have your opinion about that.

Thank you.
0
 
LVL 28

Assisted Solution

by:Ryan McCauley
Ryan McCauley earned 100 total points
ID: 39951891
Absolutely - the resource dependency chain should look like the following:

SQL Server Agent - depends on SQL Server
SQL Server - Depends on all storage and network name
Network name - depends on IP Address
Additional storage - depends on primary storage (if you have additional LUNS mounted to folders on the primary drive)

If you've got the SQL Service waiting for the network name/IP, but not waiting for storage, though could definitely cause a problem. I'm not sure why the storage would come online when you reboot the primary node, but what you're describing would definitely cause an issue.
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 39954522
Hmmmm... While it could be dependency issues, it sounds more like network and permissions...

You haven't mentioned the op sys - assuming it is Win Server 2008 ?

You said : Note: Before restarting node A, I was able to connect to node B; and it became the active node and everything seemed fine

Which indicates to me that Node B was not available via the network rather than having to wait for storage (or other dependants).

When Node A came back online it restored the "network" integrity.

I would be inclined to triple check your DNS resolution and if using Win Server 2008 then there has been some changes with accounts and permissions for clusters and maybe static IP addresses which have not been mapped (or not available to all). So make sure that you haven't inadvertently set up (essentially) a private network which excludes "clients".

Have a read of : http://technet.microsoft.com/en-us/library/cc771404.aspx just to make sure you have the physical network set.
0
 

Author Closing Comment

by:kfdiab
ID: 39979851
The issue was that the disk group was not a dependency.
0

Featured Post

How your wiki can always stay up-to-date

Quip doubles as a “living” wiki and a project management tool that evolves with your organization. As you finish projects in Quip, the work remains, easily accessible to all team members, new and old.
- Increase transparency
- Onboard new hires faster
- Access from mobile/offline

Join & Write a Comment

Suggested Solutions

Title # Comments Views Activity
Help with SQL Query 23 39
Group by and order by clause 28 36
Sql query for filter 12 23
login and database user 3 21
Performance is the key factor for any successful data integration project, knowing the type of transformation that you’re using is the first step on optimizing the SSIS flow performance, by utilizing the correct transformation or the design alternat…
Ever needed a SQL 2008 Database replicated/mirrored/log shipped on another server but you can't take the downtime inflicted by initial snapshot or disconnect while T-logs are restored or mirror applied? You can use SQL Server Initialize from Backup…
Using examples as well as descriptions, and references to Books Online, show the documentation available for datatypes, explain the available data types and show how data can be passed into and out of variables.
Viewers will learn how the fundamental information of how to create a table.

746 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

10 Experts available now in Live!

Get 1:1 Help Now