SQL Server Always On Availability Group Failover Testing: A Production Runbook That Actually Works

Most AGs go untested until a real failover happens. By then it is too late to discover your application connection string has a hardcoded node name, your secondary is 45 minutes behind, or your monitoring does not alert on replica state. Here is the runbook I run on every new AG setup before it goes to production.


A manufacturing client called me after their primary node failed at 6am. The AG had failed over — technically. The secondary came online. Applications could not connect. The listener was not routing. The connection string had the primary server name hardcoded in two places nobody remembered. They were down for 3 hours while someone reconstructed the connection string and tested every application endpoint.

That outage should have been a 5-minute failover with zero business impact. The AG worked perfectly. The testing never happened.

I have since made AG failover testing a required step before any AG goes to production — regardless of whether the client "doesn't have time." This is the runbook I use. Everything here is reproducible during a maintenance window. You do not need to simulate a disaster. You need to verify your assumptions.

Before You Test: Check Current AG Health

Do not start a failover test from a degraded state. Run this query on the primary replica before touching anything:

-- Current AG health snapshot — run on primary replica\nSELECT\n  ag.name                                    AS ag_name,\n  ag.is_distributed,\n  ar.replica_server_name                     AS replica,\n  ar.availability_mode_desc,\n  ar.failover_mode_desc,\n  ars.operational_state_desc,\n  ars.connection_state_desc,\n  ars.role_desc,\n  drs.synchronization_state_desc,\n  drs.synchronization_health_desc,\n  drs.log_send_queue_size                    AS log_send_queue_kb,\n  drs.log_send_rate                          AS log_send_rate_kb_sec,\n  drs.redo_queue_size                        AS redo_queue_kb,\n  drs.redo_rate                              AS redo_rate_kb_sec,\n  drs.last_commit_time,\n  -- Catch replicas that are behind by more than a few seconds\n  DATEADD(SECOND,\n    DATEDIFF(SECOND, drs.last_commit_time, SYSDATETIME()) / 2,\n    drs.last_commit_time)                     AS estimated_primary_lag_time\nFROM sys.availability_groups ag\nJOIN sys.availability_replicas ar\n  ON ag.group_id = ar.group_id\nJOIN sys.dm_hadr_availability_replica_states ars\n  ON ar.replica_id = ars.replica_id\nJOIN sys.dm_hadr_database_replica_states drs\n  ON ar.replica_id = drs.replica_id\nORDER BY ag.name, ar.replica_server_name;

What you are looking for before testing:

  • Synchronization state = SYNCHRONIZED on all secondaries. SYNCHRONIZING means the secondary has not caught up — a failover now loses data.
  • Log send queue = near zero on secondaries. High queue means the secondary cannot keep up with the primary write rate.
  • Operational state = ONLINE on all replicas. RESOLVING means the replica is recovering — it cannot be failed over to.
  • Connection state = CONNECTED. DISCONNECTED means the replica is unreachable over the AG endpoint.

If any secondary shows SYNCHRONIZING, disconnected state, or a redo queue over a few MB, resolve that first. Failover testing from a healthy state only.

Step 1: Document Everything Before You Touch Anything

Free · 2 Minutes
How healthy is your database, really?
Get your free database health score — spot risks before they become incidents.
Get my health score

Before initiating any failover, document the current state. If something goes wrong, you need a baseline to return to. Capture:

  • Current primary replica — server name, IP, instance name
  • All secondary replicas — server names, roles, synchronization state
  • Listener DNS name and IP — which replica owns it currently
  • Application connection strings — explicitly verify they use the listener name, not a direct server name. This is where the manufacturing client outage came from.
  • Last full backup timestamp — you want a clean restore point before testing
  • Current failover mode settings — automatic vs. manual for each replica

If you find hardcoded server names in connection strings, fix those before testing. That is the whole point of a listener — application-level HA that survives node failure without application changes.

Step 2: Planned Failover — The Correct Sequence

A planned failover is a graceful operation: the primary syncs fully to secondaries, all replicas agree on the transition, and there is zero data loss. Here is the sequence:

2a. Verify the target secondary is ready

-- Run on the secondary you plan to fail over TO\n-- Verify the secondary can become primary\nSELECT\n  replica_server_name,\n  availability_mode_desc,\n  failover_mode_desc,\n  seeding_mode_desc,\n  is_primary_replica,\n  synchronization_state_desc,\n  last_commit_time\nFROM sys.dm_hadr_availability_replica_states\nJOIN sys.availability_replicas ar\n  ON replica_id = replica_id\nORDER BY is_primary_replica DESC;

The target secondary must show synchronization_state_desc = SYNCHRONIZED and availability_mode_desc = SYNCHRONOUS_COMMIT for a zero-data-loss planned failover.

2b. Route the AG to prefer the target replica

-- Run on the PRIMARY replica\n-- Configure failover preference (optional step — can also fail over directly)\nALTER AVAILABILITY GROUP [YourAGName]\n  MODIFY REPLICA ON\n    N'TargetServer\\InstanceName' WITH (\n      FAILOVER_MODE = AUTOMATIC\n    );\n\n-- Initiate planned failover\nALTER AVAILABILITY GROUP [YourAGName] FAILOVER;\n-- The command completes when the old primary has synchronized and handed off.\n-- Your session reconnects to the new primary automatically.

The ALTER AVAILABILITY GROUP FAILOVER command is synchronous — it completes only after the new primary confirms it is ready. Your connection automatically redirects to the new primary replica.

2c. Verify the failover completed correctly

-- Run on the NEW primary replica\n-- Verify the AG is healthy after failover\nSELECT\n  ag.name,\n  ar.replica_server_name                     AS current_primary,\n  drs.is_local,\n  drs.synchronization_state_desc,\n  drs.synchronization_health_desc,\n  ar.availability_mode_desc,\n  ar.failover_mode_desc\nFROM sys.dm_hadr_availability_replica_states drs\nJOIN sys.availability_replicas ar\n  ON drs.replica_id = ar.replica_id\nJOIN sys.availability_groups ag\n  ON ar.group_id = ag.group_id\nWHERE ag.name = 'YourAGName'\nORDER BY drs.is_local DESC;

The new primary should show is_local = true, synchronization_state_desc = SYNCHRONIZED, synchronization_health_desc = HEALTHY. If it shows anything else, something went wrong in the failover — investigate before proceeding.

Step 3: Test Application Connectivity Through the Listener

This is the step most teams skip and regret later. After failover:

  1. Test from a non-DBA application host — not from the database server where you have elevated access
  2. Connection string must use the listener DNS name, not the old primary server name
  3. Run a representative query — not just SELECT 1, but a query that exercises your application typical workload pattern
  4. Verify read-only routing works if you use readable secondaries — connect to the listener with ApplicationIntent=ReadOnly and confirm it routes to the secondary
-- Test AG connectivity from a client session\n-- This should work from any server in your network\nSELECT\n  @@SERVERNAME                          AS connected_server,\n  DB_NAME()                             AS current_database,\n  GETDATE()                             AS connection_time;

Log the result — you want to compare this against a query from before the failover. The connected_server should show the new primary node name if you connected directly, or the listener name if you used the listener. Either way, the application query must succeed without any configuration changes.

Step 4: Unplanned Failover Simulation — What Happens When the Primary Dies

An unplanned failover is what happens when the primary node bluescreens, loses network, or is shut down unexpectedly. SQL Server forces the secondary to become primary, potentially with some committed transactions that had not been received from the old primary — this is the potential data loss scenario and why synchronous commit matters.

You cannot simulate an unplanned failover cleanly, but you can test what the application experiences when it has to reconnect to a different node:

  • Kill the application connection and reconnect — test how the client driver handles reconnection. Does it retry? Does it timeout? Does it fail permanently?
  • Check AG failover timing — how long did the failover take? From the moment the primary died to when the secondary was accepting connections. Run this query on the new primary after recovery:
-- Check when the replica became primary — run on new primary\n-- Shows failover history if AlwaysOn_health extended events session is running\nSELECT\n  event_time,\n  session_id,\n  event_data.value('(/event/data[@name="failover_type"])[1]', 'varchar(50)')  AS failover_type,\n  event_data.value('(/event/data[@name="failover_reason"])[1]', 'varchar(100)') AS failover_reason,\n  event_data.value('(/event/data[@name="target_replica"])[1]', 'varchar(128)') AS target_replica\nFROM sys.fn_xe_file_target_read_file('AlwaysOn_health*.xel', NULL, NULL, NULL)\nWHERE event_object_name = 'failover'\nORDER BY event_time DESC;

Note: The extended events session name is AlwaysOn_health — this is created by default when you enable ALLOW_ANMOUS or HEALTH_TRACKING on the AG. If you do not have this session running, you will not have the history — and that is worth fixing now. Enable it:

-- Create the system health session if it does not exist\nIF NOT EXISTS (SELECT 1 FROM sys.server_event_sessions WHERE name = 'AlwaysOn_health')\nBEGIN\n  CREATE EVENT SESSION [AlwaysOn_health] ON SERVER\n  ADD EVENT sqlserver.alwayson_ddl_executed,\n  ADD EVENT sqlserver.hadr_db_manager_health_report,\n  ADD EVENT sqlserver.hadr_transport_connection_error,\n  ADD EVENT sqlserver.ag_database_detach,\n  ADD EVENT sqlserver.ag_failover_partial,\n  ADD EVENT sqlserver.ag_replica_manager_health_report\n  ADD TARGET package0.event_file(SET filename = 'AlwaysOn_health')\n  WITH (\n    MAX_MEMORY = 4MB,\n    MAX_DISPATCH_LATENCY = 5 SECONDS,\n    ON_FAILURE = INCLUDE_NULL_INCOMPLETE_EVENTS\n  );\n\n  ALTER EVENT SESSION [AlwaysOn_health] ON SERVER STATE = START;\nEND

Step 5: Post-Failover Validation Checklist

After every failover — planned or unplanned — work through this checklist before declaring success:

  1. AG dashboard in SSMS — green across all replicas
  2. Listener is onlineSELECT * FROM sys.dm_tcp_listener_states shows state = ON
  3. Application connection string test — run from a non-DBA application host
  4. Read-only routing test — if applicable, test ApplicationIntent=ReadOnly
  5. Jobs and alerts — verify SQL Agent jobs that reference the old primary server name have been updated or will fail over correctly
  6. Log backups — confirm log backup chain is intact on the new primary; AG databases should still be taking log backups normally
  7. DR replica health — check any off-site or async replicas are still synchronizing and have not stalled

One of the most common post-failover surprises: SQL Agent jobs. If a job references MyServer\\MyInstance and you fail over to a different server, the job target changes — which may be correct behavior, or may mean you are running a backup job against the wrong server. Jobs should either use the listener name or use msdb.dbo.sp_add_jobserver to update targets after failover.

The 30-Day Post-Go-Live AG Health Program

A single failover test is not enough. Set up a recurring health check to catch degradation before it becomes an outage. Run this weekly:

-- Weekly AG health check — catch issues before they become outages\n-- Run on primary replica\nSELECT\n  ag.name,\n  ar.replica_server_name,\n  ar.availability_mode_desc,\n  ars.operational_state_desc,\n  drs.synchronization_state_desc,\n  -- Alert if queue is growing (potential sync lag)\n  drs.log_send_queue_size,\n  drs.log_send_rate,\n  -- Alert if redo is falling behind\n  drs.redo_queue_size,\n  drs.redo_rate,\n  -- Check for databases not joined to AG\n  db.name                                    AS database_name,\n  drs.is_database_joined\nFROM sys.availability_groups ag\nJOIN sys.availability_replicas ar\n  ON ag.group_id = ar.group_id\nJOIN sys.dm_hadr_availability_replica_states ars\n  ON ar.replica_id = ars.replica_id\nJOIN sys.dm_hadr_database_replica_states drs\n  ON ar.replica_id = drs.replica_id\nJOIN sys.databases db\n  ON drs.database_id = db.database_id\nWHERE ar.replica_id = (SELECT primary_replica FROM sys.dm_hadr_availability_primary_replica(ag.group_id))\n  AND (drs.log_send_queue_size > 1024         -- More than 1MB queue\n    OR drs.redo_queue_size > 1024             -- Redo falling behind\n    OR drs.synchronization_state_desc <> 'SYNCHRONIZED'\n    OR drs.is_database_joined = 0);          -- Database not joined

Any rows returned from this query warrant immediate investigation. A growing log send queue means the secondary cannot keep up — whether from network saturation, disk I/O pressure, or a large batch operation. Catch it weekly, not at 2am.

How Often Should You Test?

Test at go-live. Test after any network or infrastructure change. Test quarterly minimum for production AGs. Test before a major version upgrade of SQL Server — especially when you are moving from Windows Server Failover Cluster to a different cluster configuration, which sometimes requires rebuilding the AG entirely.

The 3am failover is not the time to discover your application connection string uses a hardcoded server name, your monitoring does not see replica state changes, or your async DR replica is 4 hours behind with no alerts. Run the test. Fix what breaks. Repeat until nothing breaks.

Running an AG that has not been tested?
Download the Always On AG Health & Failover Playbook — 10 DMV queries, pre/post-failover checklists, and a 30-day health program. Or get a free environment assessment that includes a live AG health check.

Free · Takes 2 Minutes

Get your free database health score

Find out exactly where your database is vulnerable before it causes an incident. 30+ years of DBA expertise, distilled into a single assessment.

Back to all posts