Solved

Interesting SQL 2012 Cluster Failover Question

Posted on 2014-03-21
5
323 Views
Last Modified: 2014-04-05
Guys I need help here and if someone has a reasonable explanation about what happened, I need a proof.
I think I know the reason, but just wanted to see out if someone else has another interesting explanation.
Here's the case:
I have an SQL 2012 Cluster configured with two nodes. Unfortunately, the connection to the shared storage (SAN) was down due to a dhcp failure; everything else was statically configured.
So node A; which was the active node, failed over to node B; which was the passive node.
Node B became the active node, and I had connectivity back to the shared storage, but SQL traffic was not flowing through node B until I had to restart node A. Does anyone knows the reason behind that and why I had to restart node A to have SQL traffic flowing through node B.
Note: Before restarting node A, I was able to connect to node B; and it became the active node and everything seemed fine, but I couldn't telnet on port 1433 to node B until I restarted node A.
0
Comment
Question by:kfdiab
  • 2
  • 2
5 Comments
 
LVL 28

Expert Comment

by:Ryan McCauley
ID: 39947891
When you say "I couldn't telnet on port 1433 to node B", do you mean to that port on Node B's primary IP address, or to the shared IP address that was owned by Node B? Also, was this the case for all clients connecting during this period, or just for the one you were testing from? It's odd behavior, though.

When you have a cluster, connectivity to the clustered resource is through the clustered IP address that's owned by the active node. When Node A gave up control of the resource and Node B became active, the IP address should have moved over to Node B - more than that, the SQL resource depends on the shared IP address, so it shouldn't have come online if there was a problem bringing up control of the IP address.

All that said, I can't see any reason why you'd be unable to connect to the SQL instance mounted on Node B until you rebooted Node A - I've never seen that occur. I suspect (though this is going out a limb a bit) that control was transferred to Node B properly and it came online as expected, but for some reason the client you were connecting from continued to route requests to the old host, which no longer owned the IP address. Maybe the switch was routing to the old server without knowing the IP had moved, and only the host physically going offline forced the switch to update its path? That's a bit outside, but it's the only thing I can think of to cause what you're describing.

If you have any more detail, I'm interested in hearing the cause you determine.
0
 

Accepted Solution

by:
kfdiab earned 0 total points
ID: 39951770
ryanmccauley,

I couldn't connect to the primary IP address on Node B.
All users were affected.
Well the problem looks in the dependencies. Currently I have the Cluster IP address configured as the only dependency. The cluster disk group needs to be added as a dependency, don't you think so?
Because the problem appeared on the storage level and not the network level.
I still need to simulate the problem and test it after adding the shared disk group as a dependency.
I will keep you posted, but let me have your opinion about that.

Thank you.
0
 
LVL 28

Assisted Solution

by:Ryan McCauley
Ryan McCauley earned 100 total points
ID: 39951891
Absolutely - the resource dependency chain should look like the following:

SQL Server Agent - depends on SQL Server
SQL Server - Depends on all storage and network name
Network name - depends on IP Address
Additional storage - depends on primary storage (if you have additional LUNS mounted to folders on the primary drive)

If you've got the SQL Service waiting for the network name/IP, but not waiting for storage, though could definitely cause a problem. I'm not sure why the storage would come online when you reboot the primary node, but what you're describing would definitely cause an issue.
0
 
LVL 51

Expert Comment

by:Mark Wills
ID: 39954522
Hmmmm... While it could be dependency issues, it sounds more like network and permissions...

You haven't mentioned the op sys - assuming it is Win Server 2008 ?

You said : Note: Before restarting node A, I was able to connect to node B; and it became the active node and everything seemed fine

Which indicates to me that Node B was not available via the network rather than having to wait for storage (or other dependants).

When Node A came back online it restored the "network" integrity.

I would be inclined to triple check your DNS resolution and if using Win Server 2008 then there has been some changes with accounts and permissions for clusters and maybe static IP addresses which have not been mapped (or not available to all). So make sure that you haven't inadvertently set up (essentially) a private network which excludes "clients".

Have a read of : http://technet.microsoft.com/en-us/library/cc771404.aspx just to make sure you have the physical network set.
0
 

Author Closing Comment

by:kfdiab
ID: 39979851
The issue was that the disk group was not a dependency.
0

Featured Post

The Eight Noble Truths of Backup and Recovery

How can IT departments tackle the challenges of a Big Data world? This white paper provides a roadmap to success and helps companies ensure that all their data is safe and secure, no matter if it resides on-premise with physical or virtual machines or in the cloud.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Title # Comments Views Activity
MSSQL join different row from other table 14 65
SQL Add Parameter in Variable 4 21
SQL Server code help needed 14 23
Need age at date of document 3 8
Introduction In my previous article (http://www.experts-exchange.com/Microsoft/Development/MS-SQL-Server/SSIS/A_9150-Loading-XML-Using-SSIS.html) I showed you how the XML Source component can be used to load XML files into a SQL Server database, us…
Everyone has problem when going to load data into Data warehouse (EDW). They all need to confirm that data quality is good but they don't no how to proceed. Microsoft has provided new task within SSIS 2008 called "Data Profiler Task". It solve th…
Via a live example, show how to extract insert data into a SQL Server database table using the Import/Export option and Bulk Insert.
Viewers will learn how the fundamental information of how to create a table.

713 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question