Hello Experts,

We doing the stress test where replication hits about 22 minutes of latency. The threshold limits is 15 mins. How to analysis and find the Root cause.

Steps taken.

Ran below query and find the output details in the attachment. I personally feel there is I/O issue. Avg.disk queue lenght is about 40. But when we escalated to SAN team, they said this is normal. But in internet blogs says ,avg.disk queue length shouldn't be morethan 2 .
Please help me out of trouble.

SELECT  r.session_id ,
        r.[status] ,
        r.wait_type ,
        r.scheduler_id ,
        SUBSTRING(qt.[text], r.statement_start_offset / 2,
            ( CASE WHEN r.statement_end_offset = -1
                   THEN LEN(CONVERT(NVARCHAR(MAX), qt.[text])) * 2
                   ELSE r.statement_end_offset
              END - r.statement_start_offset ) / 2) AS [statement_executing] ,
        DB_NAME(qt.[dbid]) AS [DatabaseName] ,
        OBJECT_NAME(qt.objectid) AS [ObjectName] ,
        r.cpu_time ,
        r.total_elapsed_time ,
        r.reads ,
        r.writes ,
        r.logical_reads ,
FROM    sys.dm_exec_requests AS r
        CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS qt
WHERE   r.session_id > 50
ORDER BY r.scheduler_id

Open in new window

My question :
Database has around 20 filegroups and each has 1/2 files in it.
How find which files is causing this issue.

Most of time the time writelog thread goes into suspended state.What does it mean?
Who is Participating?
Ryan McCauleyConnect With a Mentor Data and Analytics ManagerCommented:
For starters, a disk queue length of 40 is really concerning - If I saw anything over about 5, I'd start to get concerned. Can you tell what the translates into as far as disk wait time (in ms)? A queue of 40 is really long, and your SAN should be keeping things moving much more quickly than that, but if you're seeing really low latency and high throughput anyways, that may not be the root cause of the lag you're seeing, though I suspect it is.

Have you run a non-replication stress test on the storage? Is the storage performing well under normal usage, but just lagging excessively under replication load, or is it just that the replication lag actually puts a number (in minutes of lag) to slowness you've been seeing all along?

I use a tool called SQLIO to stress test the storage, as it simulates SQL-esque disk activity and gives you real throughput, latency, and Iops numbers that you can use to compare things. You can read about it and download it here:


or you can use Powershell to import/analyze the output if you want:


Once you download and install that tool, you can use the script I've attached, which is what I use when testing (change the extension to BAT from TXT) - it runs for about 45 minutes and performs a pretty wide array of tests (each for 15 seconds), and should give you a good overall feel for your performance. If you're seeing excessive latency, iops in the hundreds, or MB/sec in the single-digits, I think it's time to turn it back over to your storage team as that's definitely not normal.
How many replication are setup? Is the distributor server different or the publisher itself is acting as a distributor? What kind of replication is it and are all the subscription agents of push type? How many different disks are there?
How many replication are setup? 2 transactional replication
 Is the distributor server different or the publisher itself is acting as a distributor? -both lies on the same server
What kind of replication is it and are all the subscription agents of push type? push
How many different disks are there? 12 disk
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.

All Courses

From novice to tech pro — start learning today.