Link to home
Start Free TrialLog in
Avatar of marrowyung
marrowyung

asked on

SQL server 2017 AG failover with potential data lost

hi,

we are going to test the manual AG failover and  I right click on AG console and select failover I saw this:




User generated image
1)  is the warning matter? it will stop me from doing manual failover?
2) what should I do in order to fix this? we do this as we want to do SQL server 2017 CU22/23 update and the first node to install will be the secondary replica, then primary one.

  but I need to make sure that the failover working well before the patch happens.
3) as from primary and secondary the replication is in Async mode, what should I do before and after the failover and fallback?

the AG daskboard is showing everything is ok .

User generated image
and from the photo I am not sure why the primary instance is on 02 but 01 is hosted by 03, 03 is the secondary replica! 02 is primary replica.

any thing to check which one is hosting the primary now?

any bugs in the AG dashbard ?

any way to check failover log on when the primary instance is failovered to 03 instead of staying in 02?
Avatar of arnold
arnold
Flag of United States of America image

It is a question of probability.
When a resource fails over after a failure, the resource that is being used failed; meaning wherever it might have processed made it through, replicated. When the other group is brought up through the recovery, automated process has the same risk, but it does not have an interactive user to present the information to. Or the possible data loss did not conclude prior to failure.


What it warns deals with data being handled right now by the active node might not have been accepted and replicated through the failover.
I.e. When you start the failover process, the transition Windows .

The scope and scale of possible data loss is limited to current sessions
Avatar of marrowyung
marrowyung

ASKER

hi,

tks.

What it warns deals with data being handled right now by the active node might not have been accepted and replicated through the failover.

so it just warn us by a "it might be" only ? and also because it is in async mode so data might be lost /not replicate to secondary replica BEFORE it become the primary?

I.e. When you start the failover process, the transition Windows .

you mean during the time the primary failover to secondary any data committed in primary might not be in secondary at all AFTER failover?


how about what the dashboard is saying ? 02 is primary or 03 is primary ?  I cant' see what "01 is hosted by 02 "means?

I understand you are obscuring your server/host/availability group's name.
I can not tell whether the 01,02,03 are names you choose, or ..
you have a system/node/FCI as an availability group host.


https://codingsight.com/sql-always-on-availability-groups-computer-objects/

The item identifies the availability groups you have defined and which instance holds each availability group

Any action you perform manually, log shipping, db mirror, etc. they always indicate that a data loss is possible.

Whether the data is committed while the transition is taking place, or data is in the process of being submitted,
I think if the trigger to failover occurs after the data is committed by the primary, the data will be on the other side.
The issue is likely you triggering the failover in the middle of a transaction being handled. as the transaction will not complete, there is nothing to replicate and this last transaction will be lost/generate an error on the application or if it has error detection/handling it might recover by submitting the same transaction to the new active availability group host.
https://codingsight.com/sql-always-on-availability-groups-computer-objects/

The item identifies the availability groups you have defined and which instance holds each availability group

I toke a look but it do not teach me how to look from the AG dashboard.


Any action you perform manually, log shipping, db mirror, etc. they always indicate that a data loss is possible.]
so for my screen, just ignore it as it is not alert and critical error ?

Whether the data is committed while the transition is taking place, or data is in the process of being submitted,
I think if the trigger to failover occurs after the data is committed by the primary, the data will be on the other side
anyway to force sync of this async AG before I failover ?
one thing I feel interesting is the AG dashboard of primary and secondary replica is NOT the SAME!

primary :


User generated image
Secondary:
User generated image
why is it ?

I read this:
https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/perform-a-planned-manual-failover-of-an-availability-group-sql-server?view=sql-server-ver15

and it say:
A planned manual failover is supported only when the primary replica and the target secondary replica are running in synchronous-commit mode and are currently synchronized.

so it is not applicable to me ..  ! mind is async mode.


and I tried to run the query in it on my async AG:

SELECT ag.name,    drs.database_id,    drs.group_id,    drs.replica_id,    drs.synchronization_state_desc,    ag.sequence_number
FROM sys.dm_hadr_database_replica_states drs, sys.availability_groups ag
WHERE drs.group_id = ag.group_id; 

Open in new window


and my result is :






User generated imagewhy it is Synchronizing and synchronized together on the SAME database_id
?

and so should I force them back to sync mode first:
1)ALTER AVAILABILITY GROUP [AGRScale]      MODIFY REPLICA ON N'<node2>'      WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT); ALTER AVAILABILITY GROUP [AGRScale] MODIFY REPLICA ON N'<node2>' WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

2) run this one:

SELECT ag.name,    drs.database_id,    drs.group_id,    drs.replica_id,    drs.synchronization_state_desc,    ag.sequence_number FROM sys.dm_hadr_database_replica_states drs, sys.availability_groups ag WHERE drs.group_id = ag.group_id;  SELECT ag.name, drs.database_id, drs.group_id, drs.replica_id, drs.synchronization_state_desc, ag.sequence_number FROM sys.dm_hadr_database_replica_states drs, sys.availability_groups ag WHERE drs.group_id = ag.group_id;

to make sure all synced !

3) failover form AG dashboard? 


Not certain I understand. Your preference is asynch.
Prior to manual failover, you could use the step to synchronize_commit, though it will take as long as it needs to get the replica in sync.
not sure whether it is information from a different topic, but your AG updates are pushed every 15 minutes. which would represent the max 15 minutes of data loss that might

sync, commit will reflect the DB going through a sync, and confirm that it completed/synchronized
, but your AG updates are pushed every 15 minutes. which would represent the max 15 minutes of data loss that might

async!

so I should
1)convert to to sync mode first
2) make sure that data all sync by the query from the URL I post
3) failover using dashboard failover.
4) make sure again in dashboard all are green to make sure data all sync again.
5) fallback to original primary and do 2) and 4) again.
6) convert the AG back to Async until we upgrade the AG node to CU22/23

is that right above for a AG failover/DB DR drill ?

Consider the test where you have the time to wait for the sync, before failover versus a real event, when the current primary fails.
The manual process to bring up the primary AG replica, will be more complex and likely will have an extended down time

why do you prefer async versus sync? Do you have a significant latency between the locations?
why do you prefer async versus sync? Do you have a significant latency between the locations?
 I think there is a bug on SQL server 2017 RTM, we are using this version, just 2 x nodes AG slow down the DB performance and we decided to change to async and speed much faster!

The manual process to bring up the primary AG replica, will be more complex and likely will have an extended down time
why more complex?
like what I said above ?
1)convert to to sync mode first
2) make sure that data all sync by the query from the URL I post
3) failover using dashboard failover.
4) make sure again in dashboard all are green to make sure data all sync again.
5) fallback to original primary and do 2) and 4) again.
6) convert the AG back to Async until we upgrade the AG node to CU22/23

Open in new window


is this works ?


hi,

as changing AG sync mode from Async to sync can be change using tSQL:

ALTER AVAILABILITY GROUP [AGRScale]      MODIFY REPLICA ON N'<node2>'      WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);

how about change it back to Async?
this:

ALTER AVAILABILITY GROUP [AGRScale] FORCE_FAILOVER_ALLOW_DATA_LOSS;
?

hi,

so this queries should run on secondary replica and check if data all replicated?

SELECT ag.name,    drs.database_id,    drs.group_id,    drs.replica_id,    drs.synchronization_state_desc,    ag.sequence_number FROM sys.dm_hadr_database_replica_states drs, sys.availability_groups ag WHERE drs.group_id = ag.group_id;  SELECT ag.name,    drs.database_id,    drs.group_id,    drs.replica_id,    drs.synchronization_state_desc,    ag.sequence_number
FROM sys.dm_hadr_database_replica_states drs, sys.availability_groups ag
WHERE drs.group_id = ag.group_id; 

Open in new window


one thing, if I change the replication mode from async to sync, do I need to restart SQL server? and how about change from sync back to async ?

hi,

any update for me ?
ASKER CERTIFIED SOLUTION
Avatar of arnold
arnold
Flag of United States of America image

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
one thing, if I change the replication mode from async to sync, do I need to restart SQL server? and how about change from sync back to async ?

no !  I meant make sure the replication of the AG is good looking before and after failover.
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/monitor-performance-for-always-on-availability-groups?view=sql-server-ver15 

that one just monitoring, not about switching sync mode.

Not sure you would have the flexibility similar to your manual test dealing with ore-planing the switch from asynchronously to synch in order to transition to the replica.
taht's why I planned to change the mode prior to my failover test!  so I have to know changing the async to sync has to followed by restart or not !

I think you are drawing your approach from your log shipping experience.
Versus applying the new tech
no! I learn new thing !

 duration to transition/potential loss of data.
for async, likely to have data lost. buy sync mode do not have data lose. that's why I planned to switch mode before failing over and once failover is done, switch back to async mode.



The switch between sync/async is an operating mode change; do not think you need to restart SQL. 

I hope!


Answering last first, do you think you will have the time to get to sync mode when the Primary group experiences a failure?
Why complicate things more than they need to be. If you think async mode is advantegeous to your circumstance and you are setting the async frequency short enough, tolerance of how much data loss you are ok with loosing, every 15 minutes.
However, if you need to minimize loss and the environment to be resilient. Test what if any performance hit you see if the operating of the AG is in sync mode versus in async.

The info on the last link includes estimates of time to sync, transition. as well as estimate on loss of data.

I get that you are learning the new, the point made is that you are applying what you already know to what you are learning.

Lets try it another way, test out the scenario with the operating mode is async every 15minutes.
you are advised that you are loosing power at the main location in the next five minutes.
see whether you can switch to sync mode, get the data replicate/sync up and manually failover the AG.

Much depends on how many DBS are part of the availability group and the transaction that need to be replicated in the remaining time.

Yes, to everything there is a cost and a benefit.

if you are testing, setup a local AG pair in sync mode.
while monitoring, run load tests that hit the DB.
now repeat the same test while the primary replica is offline. diconnected.
This way you could estimate what performance hit your setup might take should the primary replica is no accessible for sometime in sync mode.

then you can repeat the same test with ASYNC mode
but in this case, when your load is many writes. how long and what the scale of the possible data loss is.
Test what if any performance hit you see if the operating of the AG is in sync mode versus in async.

we tried in sync before and it is slow, that's why we switch to async mode.

Lets try it another way, test out the scenario with the operating mode is async every 15minutes. you are advised that you are loosing power at the main location in the next five minutes.
see whether you can switch to sync mode, get the data replicate/sync up and manually failover the AG.

you mean change to sync and get data all sync in 5 minutes?

Much depends on how many DBS are part of the availability group and the transaction that need to be replicated in the remaining time.
agree! I can only test tomorrow

while monitoring, run load tests that hit the DB.

any good suggestion on what tools to run the load against DB which record all action from us and replay it.



Before under what circumstances? how many DBs? when it was virtual, VM versus physical systems?

you can script/scale load testing.
What resources do you have available.
a list that might help
https://www.softwaretestinghelp.com/tools/40-best-database-testing-tools/#:~:text=1%20HammerDB%20is%20an%20open-source%20tool%20for%20database,on%20industry%20standards%20like%20TPC-C%20and%20TPC-H%20Benchmarks.

https://www.red-gate.com/products/sql-development/sql-test/
Before under what circumstances? how many DBs? when it was virtual, VM versus physical systems?
it is VM.

before under sync mode, we found it is slow but that one is SQL server 2017 enterprise RTM, not patch at all, that's why we switch to async mode ! then performance much better!

and now I got this task and I want to enhance it !  today I found a lot of SQL CODE can't change at all as the front end is NOT our application, so can't change any code at all! So we can only focus SQL infrastructure side and SQL server 2017 RTM has problem on SQL query reroute, e.g. reroute to secondary replica make select only query even slower !

so we plan to upgrade it to CU23, which we have tested ! CU22 and CU23 fixed AG And read reroute problem.

so I have to:
1) If you have updated to Cu23/CU22, is it good for your AG ?
 
https://www.experts-exchange.com/questions/29210748/SQL-server-2017-CU22-and-CU23.html

2) Change from async to sync and sync back to async need SQL server down time /hangs down ?







when I make change to the following:


User generated image
is that mean if 02 failover to 03, it is automated, but it will be manual failover if 03 failover to 02 ?

so if both set to automated, then secondary 03, once 02 is restart/boot up after failed, will also failover back to 02 automatically ?

I read this one:

 https://www.mssqltips.com/sqlservertip/3437/manual-sql-server-availability-group-failover/

I think there are something wrong with the query from that topic:
USE master; GO 
SELECT is_failover_ready, * FROM sys.dm_hadr_database_replica_cluster_states WHERE replica_id = (SELECT replica_id FROM sys.availability_replicas WHERE replica_server_name = 'yourreplicanamehere') GO

Open in new window

I run this on secondary replica and primary one, both return empty set, any problem?
so is that mean if I  force the failover to your secondary replica, data loss is possible? actually as the secondary replica is async mode anyway, then data might lost AFTER failover , no need to use that script to check, right?



Most software shoukd be kept up-to-date.
You shoukd take precaution by having a test environment where you would apply the update and then confirm that it does not cause issues.
You shoukd take precaution by having a test environment where you would apply the update and then confirm that it does not cause issues.

just failover test, I have to verify the AG failover before patching.

Most software shoukd be kept up-to-date.

but it don't means that one is free, right?
Much depends on which software you mean. Those that offer/issue updates are commonly free.

When there is a cost, you have to assess whether the fixes, enhancements offered/included in the updates are worth while or not.

Many these days include support functions as a means to maintain cash flow while at the same time provide those with the contracts "free" updates/upgrades.

MS OS, Applicaiton updates should be applied when their category is Critical/secutiry related. the others should be carefully considered.
Cummulative, Service Pack, types ....
we just finish the failover test, it is ok. change from async to sync and change back DO NOT need to restart/

tks.