Solved

Interesting SQL 2012 Cluster Failover Question

Posted on 2014-03-21
5
317 Views
Last Modified: 2014-04-05
Guys I need help here and if someone has a reasonable explanation about what happened, I need a proof.
I think I know the reason, but just wanted to see out if someone else has another interesting explanation.
Here's the case:
I have an SQL 2012 Cluster configured with two nodes. Unfortunately, the connection to the shared storage (SAN) was down due to a dhcp failure; everything else was statically configured.
So node A; which was the active node, failed over to node B; which was the passive node.
Node B became the active node, and I had connectivity back to the shared storage, but SQL traffic was not flowing through node B until I had to restart node A. Does anyone knows the reason behind that and why I had to restart node A to have SQL traffic flowing through node B.
Note: Before restarting node A, I was able to connect to node B; and it became the active node and everything seemed fine, but I couldn't telnet on port 1433 to node B until I restarted node A.
0
Comment
Question by:kfdiab
  • 2
  • 2
5 Comments
 
LVL 28

Expert Comment

by:Ryan McCauley
ID: 39947891
When you say "I couldn't telnet on port 1433 to node B", do you mean to that port on Node B's primary IP address, or to the shared IP address that was owned by Node B? Also, was this the case for all clients connecting during this period, or just for the one you were testing from? It's odd behavior, though.

When you have a cluster, connectivity to the clustered resource is through the clustered IP address that's owned by the active node. When Node A gave up control of the resource and Node B became active, the IP address should have moved over to Node B - more than that, the SQL resource depends on the shared IP address, so it shouldn't have come online if there was a problem bringing up control of the IP address.

All that said, I can't see any reason why you'd be unable to connect to the SQL instance mounted on Node B until you rebooted Node A - I've never seen that occur. I suspect (though this is going out a limb a bit) that control was transferred to Node B properly and it came online as expected, but for some reason the client you were connecting from continued to route requests to the old host, which no longer owned the IP address. Maybe the switch was routing to the old server without knowing the IP had moved, and only the host physically going offline forced the switch to update its path? That's a bit outside, but it's the only thing I can think of to cause what you're describing.

If you have any more detail, I'm interested in hearing the cause you determine.
0
 

Accepted Solution

by:
kfdiab earned 0 total points
ID: 39951770
ryanmccauley,

I couldn't connect to the primary IP address on Node B.
All users were affected.
Well the problem looks in the dependencies. Currently I have the Cluster IP address configured as the only dependency. The cluster disk group needs to be added as a dependency, don't you think so?
Because the problem appeared on the storage level and not the network level.
I still need to simulate the problem and test it after adding the shared disk group as a dependency.
I will keep you posted, but let me have your opinion about that.

Thank you.
0
 
LVL 28

Assisted Solution

by:Ryan McCauley
Ryan McCauley earned 100 total points
ID: 39951891
Absolutely - the resource dependency chain should look like the following:

SQL Server Agent - depends on SQL Server
SQL Server - Depends on all storage and network name
Network name - depends on IP Address
Additional storage - depends on primary storage (if you have additional LUNS mounted to folders on the primary drive)

If you've got the SQL Service waiting for the network name/IP, but not waiting for storage, though could definitely cause a problem. I'm not sure why the storage would come online when you reboot the primary node, but what you're describing would definitely cause an issue.
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 39954522
Hmmmm... While it could be dependency issues, it sounds more like network and permissions...

You haven't mentioned the op sys - assuming it is Win Server 2008 ?

You said : Note: Before restarting node A, I was able to connect to node B; and it became the active node and everything seemed fine

Which indicates to me that Node B was not available via the network rather than having to wait for storage (or other dependants).

When Node A came back online it restored the "network" integrity.

I would be inclined to triple check your DNS resolution and if using Win Server 2008 then there has been some changes with accounts and permissions for clusters and maybe static IP addresses which have not been mapped (or not available to all). So make sure that you haven't inadvertently set up (essentially) a private network which excludes "clients".

Have a read of : http://technet.microsoft.com/en-us/library/cc771404.aspx just to make sure you have the physical network set.
0
 

Author Closing Comment

by:kfdiab
ID: 39979851
The issue was that the disk group was not a dependency.
0

Featured Post

Best Practices: Disaster Recovery Testing

Besides backup, any IT division should have a disaster recovery plan. You will find a few tips below relating to the development of such a plan and to what issues one should pay special attention in the course of backup planning.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
SQL Select - column value lookup. 3 40
SQL Query 17 30
How to use odbc in vb to connect to ms sql 14 31
Permissions on Database 11 36
Everyone has problem when going to load data into Data warehouse (EDW). They all need to confirm that data quality is good but they don't no how to proceed. Microsoft has provided new task within SSIS 2008 called "Data Profiler Task". It solve th…
The Delta outage: 650 cancelled flights, more than 1200 delayed flights, thousands of frustrated customers, tens of millions of dollars in damages – plus untold reputational damage to one of the world’s most trusted airlines. All due to a catastroph…
Via a live example, show how to extract information from SQL Server on Database, Connection and Server properties
Via a live example, show how to extract insert data into a SQL Server database table using the Import/Export option and Bulk Insert.

932 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

9 Experts available now in Live!

Get 1:1 Help Now