Link to home
Start Free TrialLog in
Avatar of anushahanna
anushahannaFlag for United States of America

asked on

Best Practices for testing a 2-node Cluster, by bringing down a node.

We would like to test the performance of our 2 node cluster for SQL Server 2005 on Win Server 2003.

Any recommendations/suggestions/best practices before we just bring down one node, and see how the other one is performing on load for daily operations, and if there are any issues that could not be isolated in the cluster environment?

 We are thinking of running PerfMon to do the performance monitoring. We can try SQL Server 2005 Performance Dashboard & RML Utilities, if recommended to test a cluster, too.

Thanks
Avatar of Jim P.
Jim P.
Flag of United States of America image

I take it you are running an Active/Active setup?

I would also fire up the profiler from a client machine and use the Lightweight Lock Tracing template I've attached. (Rename to .tdf and import to your profiler.) Use it beforehand to see lock timeouts and such. That will be your base line.

Then run it afterward to what kind of stress you are under. I suggest dumping the data to a table on a separate instance -- even an instance of SQL installed on a client. That way it is easy to slice and dice and do grouping and summary queries.

The other question is how are you going to do the failover -- graceful shutdown, or yank the plug from the wall? The best method is to yank the plug. The downside is you could lose data and/or disrupt production ops.
Lightweight-Lock-Tracing.txt
Avatar of anushahanna

ASKER

jimpen
right now it is active/passive.

so we want to 'shut down' the active one and see how the passive one becomes active and responds.

Thanks for the template; so, you recommend running the profiler, just before 'plug from the wall'?, and on the passive node?

thanks much

ASKER CERTIFIED SOLUTION
Avatar of Jim P.
Jim P.
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
jimpen
let me make sure I understand you:(kindly correct me if i am off)

 I will run the profiler on the active node now (production setting). Then do the plug from the wall - induced failure on the active node. Then run the profiler again on the node that took over (failover situation), and record it to another local table.

The point is to monitor the server transactions now (when one node is passive -normal production state) and when the system 'fails', then again, see the transactions of the new active node (failover state) and see how the performance of the transactions are - like locks, duration, cpu, cachehits etc.

Apart from this, monitor the PerfMon on the recommended 11 tests before and after the 'plug from the wall' testing.

Thanks
>> The point is to monitor the server transactions now....

Exactly.

You'll have to do the perfmon on each individual node as well. You can fire it up on the inactive node with the exception of the PhysicalDisk(0 x:) tracking. Those are the SQL Server drives assigned to the cluster.

The MSSQL$SYSTEMSDATA entries will exist on both nodes and available to be selected in perfmon, but will be a flat line on the inactive node.

As a side note: you can save the settings from the perfmon to a separate .msc file. That way if you are looking at tuning issues and/or slowness complaints then you can fire it up quickly and easily to see if the bottleneck is the system.

jimpen
would you recommend me to check anything in Cluster Administration before doing the test, just to make any baseline check-ups.

thanks
Make sure the SQL Server Service is dependent on the data Drive(s) -- not the quorum drives; the IP address, and the SQL Network Name.

Then the SQL Server Agent is dependent only on the SQL Service.

I dropped the Full Text service and set them to manual on both nodes -- I didn't see any great purpose to it. But if you have it, then again a dependency only on the SQL Service.

On the individual nodes, make sure the SQL Server Browser Service is up and running in Automatic mode in the services.

Also by default, I leave all other SQL Services in manual or automatic start-up mode. None are disabled.

That's about all I can think of off-hand.
jimpen,
I would like to test first usin gthe gracious method- no pulling the wire from the wall. (That will come next). Could I just restart the server or do I need to do anything else before doing the failover(in the gracious way). Would you have me go to the services before the reboot to check anything (other than to make sure the SQL Server Browser Service is running.)

Thanks
You don't even need to reboot. The very gracious way is to open the Cluster Administration -- Right-click the SQL Cluster and do Move Group. If it doesn't restart in a few moments, then just force it back by reversing the move.
jimpen,
Where would the default cluster logs be for me to check what went on during the transition?

How much latency to expect between the failover and failback.?

Would you recommend the same method (Move Group) to bring the Cluster from Failover to Failback  (after the testing is done)?

on a personal note, if you can write up about the whole process, it may be a great resource in the articles section. I went and looked for any written articles on Cluster Failover and found none. Your knowledge may benefit much for those who are doin gthis for the first time. Just a thought.

Thanks again.
I hate to say this, but getting to this level would take a fortune. ;-) Actually I have several articles to write.

As far as failback, its the same method.

The events register in the normal event logs in
compmgmt.msc.

Failover time is dependent on hardware, sw, apps, and other factors. Over a minute is too long, in general.
jimpen
I see you have 6 events in the template. Do you have a query, SP that you use to analyze the results, for it?

You stressed the need for SQL Server Browser Service to be running in automatic mode before the failover. Otherwise, the passive node will not be able to get active, is that right?

Where do I see which data drives are the SQL Server Service dependent on? and that the SQL Server Agent is dependent on the SQL Service?

can you kindly explain the following:
I do not see Buffer Manager, Locks, and Memory Manager as Performance Objects.

\\NODESERVERNAME\MSSQL$SYSTEMSDATA:Buffer Manager\Buffer cache hit ratio
\\NODESERVERNAME\MSSQL$SYSTEMSDATA:Buffer Node (000)\Page life expectancy
\\NODESERVERNAME\MSSQL$SYSTEMSDATA:Locks (_Total)\Lock Requests/sec
\\NODESERVERNAME\MSSQL$SYSTEMSDATA:Locks (_Total)\Lock Timeouts/sec
\\NODESERVERNAME\MSSQL$SYSTEMSDATA:Locks (_Total)\Number of Deadlocks/sec
\\NODESERVERNAME\MSSQL$SYSTEMSDATA:Memory Manager\Lock Memory (KB)
Thanks

>> Do you have a query, SP that you use to analyze the

I would do basic group by queries on database name, sum of duration, count of records. I would be looking for the highs and then from that looking at the TextData to see if you can do any tuning, indexing, or tightening up on the queries/SPs.

>> You stressed the need for SQL Server Browser Service to be running in......

The browser service is not really needed if you are only one instance. But it basically acts like the buddy standing in line. Instead of starting from the back of the line, the browser has a slot already set up for the SQL Server to slide into.

>> Where do I see which data drives are the SQL Server Service dependent on?

Both are in the Cluster Administrator. If the cluster was set up right -- you should have a Cluster Group and then a separate group that is the SQL Group. (These names are variables but it should be like it.)

In the SQL Group there should be a network name, SQL IP address, and at least one drive. You will also have the SQL Server and other services. If you right click the properties on the SQL Server the second tab should be Dependencies. That is saying these items need to be up for me to fire.

>> I do not see Buffer Manager, Locks, and Memory Manager

This is on the server? The prefix may be different. Maybe just SQLServer:Buffer Manager.  Then the rest are subcategories of the performance object.
Perfmon-SQL-Counters.jpg
I did a small pre-test experiment.

I only stopped the SQL Server Service on the Active node. But the passive node did not get turned on, rather the active node automatically turned the service on again, and in 55 seconds, the active node was up and running again.

I was running the profiler before the 'switch' for a little less than an hour, and I got 60 rows in the profiler, with the event classes of 12(SQL:BatchCompleted),45(SP:StmtCompleted),65528,65532,65534.-Not sure about the 655xx Event code, what they are:

After the Active node came back again, I ran the profiler again for 5 minutes. Got 17 records, and all the events as before but also 3 'Lock:Escalation' (Event 60). Does that tell anything?

So I guess the Passive node failed for the 'Failover', and hence the active node took over?

I did not do a restart, because there is a critical service running in the Active node that is using a Virtual IP address, and depends on the Active node being alive, and is not dependent on the SQL Server service. That is why I limited my test to stopping the SQL Service.

There are 2 groups: Cluster and SQL; and the SQL Group has a drive,one IP Address, One network name, the SS and the SSAgent, and the SS is dependent on the drive and network name. I hope this sounds right.

I guess you can test for several things separately: Disk - pull the hard drive out;
Power - pull the plugs ;Network - pull out a network adapter ;Fiber Channel - disconnect fiber connections; Windows 2003 - shut down ;SQL Server - Fail the SQL Server and/or Agent; Cluster Service - disable Cluster Service ;Quorum  un-present the drive. Would unplugging the plug cover all these?

I do not have physical access to the nodes, only Cluster Admin. So for the next real round of test, how many of the above, do you think I can do just from Cluster Admin tool?
Thanks

jimpen
I should mention this also:

On the active node, under resources, I can see the drive, IP Address, Network Name, SS, SSAgent, all falling under SQL Group,which is it's active and only group.

On the passive node, under resources, I see Cluster IP address, Cluster Name, a different disk, and MSDTC, all belonging to the Cluster Group, which is it's active and only group.

Is this the reason why the SQL failover did not go to the passive node, and the active took control again?
thanks
>> I only stopped the SQL Server Service on the Active node. ....

That is a pretty good restart time -- depending on hardware. We're running Dell X950's with quad cores and 12+GB of RAM with and multiple instances running active/active. Our failover time is lower -- but we paid for the performance.

Remember you always have a choice: Fast, Cheap, Good. Pick any two. ;-)

>> So I guess the Passive node failed for the 'Failover', and hence
>> the active node took over?

You can't just fail the SQL Server service. It is dependent on the all the rest moving to the other node as well.

>> I did not do a restart, because there is a critical service .....

What I would suggest for that is to add a separate group for that IP address that is set not to failover. If the A Node goes down, then it is just down.


>> I guess you can test for several things separately:.....

The simple way to see if it will work with the least chance of damage is to Right-Click the group and do the Move Group. It will then turn it off on the Active node and try to move it to other node. If it works then you are good.

Server Cluster How To...
http://technet.microsoft.com/en-us/library/cc786260%28WS.10%29.aspx
Cluster-Move.jpg
Thanks for the detailed helpful information.
So did it work out?

Glad to be of assistance. May all your days get brighter and brighter.
jimpen
thanks for your patient help. I will be implementing some of your ideas after some ground work , for some real testing.. perhaps i will bring that up in another post.

thanks much. brighter and brighter.. sounds good...
jimpen

Can you let me know your thoughts abt the 3 lock:escalations when the active node started again.

also the passive node did not have any of the meaningful resources of the active node. Is this right?
thanks
The lock:escalations don't really concern me with a newly restarted SQL Server. Even if the server is accepting remote connections and servicing queries, it takes a few minutes for SQL to "settle back in."

If you see them constantly after the server has been up, you can be starved for memory, you have large or poorly written queries running, or there may be contention for resources from various maintenance plans and/or applications.

>> also the passive node did not have any ...

Yes. The SQL Service is dependent on the IP, Network Name and the data drive being on the other node all at the same time before it starts. Since it wasn't, it couldn't start.