cyberleo2000

Exchange SCR Replication ReplayQueueLength very high

I am running Exchange 2007 SP3 RU9 in a CCR active-passive cluster. I recently set up another server for SCR. I used the following cmdlet: Enable-StorageGroupCopy -id "StorageGroup" -StandByMachine "SCR-Server" -ReplayLagTime 0.0:0:0

I have 20 storage groups and had successfully seeded 16 of them. All 16 had been replicating without a problem, CopyQueueLength was 0 and ReplayQueueLength was 50. A few days ago I noticed the ReplayQueueLength for almost all of them was in the thousands. The logs are being copied, but the SCR server does not seem to be truncating them fast enough.

We had a flurry of activity on our Exchange server last week: a number of mailbox moves, plus a large amount of public folder data was moved into shared mailboxes. Could this have caused the high ReplayQueueLength? Do I simply need to wait for the SCR server to "catch up"?

I really don't want to reseed because that process takes up to 12 hours per mailstore. The SCR server is in the Amazon cloud.

I attached a screenshot of the current status.

Any help is greatly appreciated. Thank you.

User generated image
Kotteeswaran Rajendran

Hi,

Yes, it must be because of too much activity or a slow network.

I would suggest you wait and let it complete, but keep an eye on it.

FYI: I used to see copy queue lengths of around 50k or more in my environment.
What is set for -TruncationLagTime?

http://technet.microsoft.com/en-us/library/bb676502(EXCHG.80).aspx
-TruncationLagTime   This parameter is used to specify the amount of time that the Microsoft Exchange Replication service should wait before truncating log files that have been copied to the SCR target computer and replayed into the copy of the database. The time period begins after the log has been successfully replayed into the copy of the database. The format for this parameter is (Days.Hours:Minutes:Seconds). The maximum allowable setting for this value is 7 days. The minimum allowable setting is 0 seconds, although setting this value to 0 seconds effectively eliminates any delay in log truncation activity. After the value for this parameter is set, it cannot be changed without disabling and then enabling SCR.
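To make the (Days.Hours:Minutes:Seconds) format concrete, here is a minimal Python sketch that parses a value such as 0.0:0:0 and checks it against the 0-second minimum and 7-day maximum quoted above. The function name `parse_lag` is hypothetical, and this is not how Exchange itself parses the value; it only handles the exact format described in the quoted documentation.

```python
from datetime import timedelta

# Upper bound taken from the quoted documentation for -TruncationLagTime.
MAX_LAG = timedelta(days=7)


def parse_lag(value: str) -> timedelta:
    """Parse 'Days.Hours:Minutes:Seconds' (e.g. '0.0:0:0') into a timedelta.

    Hypothetical helper for illustration only.
    """
    days_part, _, time_part = value.partition(".")
    if not time_part:
        raise ValueError(f"expected Days.Hours:Minutes:Seconds, got {value!r}")
    # Unpacking raises ValueError if there are not exactly three fields.
    hours, minutes, seconds = (int(p) for p in time_part.split(":"))
    lag = timedelta(days=int(days_part), hours=hours,
                    minutes=minutes, seconds=seconds)
    if lag < timedelta(0) or lag > MAX_LAG:
        raise ValueError(f"{value!r} is outside the 0 seconds to 7 days range")
    return lag


if __name__ == "__main__":
    # The value used in the original Enable-StorageGroupCopy command.
    print(parse_lag("0.0:0:0"))
```

A value of 0.0:0:0 parses to a zero-length delay, which matches the "no added delay" reading of the command used in the question.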

- Rancy
cyberleo2000

ASKER

To: amitkulshrestha: what checkpoint file is deleted? I'm assuming only the one on the target SCR server node?

To: Rancy: See original question. I did not add a TruncationLagTime value, therefore using the default, which I believe should truncate the logs immediately after they are replayed. I used this command - Enable-StorageGroupCopy -id "StorageGroup" -StandByMachine "SCR-Server" -ReplayLagTime 0.0:0:0

thank you
I don't think so. Why don't we run the command to confirm the truncation time?

- Rancy
ReplayQueueLength is the number of log files waiting to be committed to the Exchange database, i.e. replayed. So my problem is that the SCR node is not replaying log files fast enough. I'm just trying to find out whether I have a problem or whether I simply have to be patient and wait for the log files to be replayed.
Create one test DB and check how fast it replicates.
I guess with SCR it doesn't replay the last 50 logs.

- Rancy
Thanks imkottees ..... for the articles :)

Cyberleo: if it's the same, that's fine; if not, it may be something to check. I guess you have more DBs. Do they all have the same issue, or only one or a few?

- Rancy
All the replicating DBs have large ReplayQueueLengths ranging from a few hundred to over 8000. And all replicating DBs have a CopyQueueLength of 0. The logs are copying with no problem. The issue seems to be with the SCR server not replaying logs fast enough. SCR replication was working just fine for about a month or two. CopyQueueLengths were all 0 and ReplayQueueLengths were all 50.

Then we had a flurry of activity: mailbox moves, public folder data being transferred to shared mailboxes, etc. Is it coincidence, or did this increase in activity cause more logs to be generated than the SCR node can handle?

If it's a matter of waiting for the SCR node to catch up and replay all the extra logs, OK, no problem, I can wait. But if there is a problem, I don't want to wait for it to get worse.

Reseeding takes too long, and I have 20 stores, so that's really a last-resort, worst-case solution.

thank you
Wait and watch is the right choice for now, as you were doing a lot of activity.
"Is it coincidence, or did this increase in activity cause more logs to be generated than the SCR node can handle?" - I don't think so, but the weekend is a good time, and logs should purge after a backup on the active CCR node.

I personally don't think it's an issue. As Amit also said, the best option is to wait for this weekend, since I guess there was too much activity over the entire week with what you just mentioned.

- Rancy
So I waited about 5 days. My latest SCR replication status is still way too high. ReplayQueueLength should be around 50. See attachment. At this point I'm thinking that maybe the SCR node doesn't have enough resources to process the logs fast enough. The specs of the server are as follows: processor is Xeon E5-2665 @ 2.4 GHz and memory is 34.1 GB @ 2.64 GHz. I would think that this is enough.
scrstatus.jpg
I do agree, but what if we suspend all of them and leave only 2-3 SGs with SCR running? Does the queue then stay as low as 50?

- Rancy
I guess that's my next test. Right now I am waiting to hear back from a server engineer who is checking the box for any I/O errors or latency of any kind.
Perfect, please do keep us updated ..... getting hands on SCR-CCR after a long time now :)

- Rancy
I suspended replication on all storage groups except 16, which had a ReplayQueueLength of over 15000. The queue has gone down to 10000 in about 15 minutes, so maybe the problem is with the SCR node not having enough resources to keep up.
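As a rough sanity check on those numbers: a drop from about 15000 to about 10000 logs in 15 minutes is a net replay rate of roughly 333 logs per minute. The minimal Python sketch below does that arithmetic. The function name `drain_estimate` and the 50-log target are illustrative choices, and it assumes the net rate stays constant, which it probably will not once the other storage groups are resumed and compete for the same disk I/O.

```python
def drain_estimate(start_len, end_len, elapsed_min, target_len=50):
    """Return (net logs replayed per minute, minutes left to reach target_len).

    Hypothetical helper: the rate is net of any new logs that arrived during
    the sample window. Minutes left is None if the queue is not shrinking.
    """
    if elapsed_min <= 0:
        raise ValueError("elapsed_min must be positive")
    rate = (start_len - end_len) / elapsed_min
    if rate <= 0:
        return rate, None
    remaining = max(end_len - target_len, 0)
    return rate, remaining / rate


if __name__ == "__main__":
    # Figures from the post: ~15000 down to ~10000 in about 15 minutes.
    rate, minutes_left = drain_estimate(15000, 10000, 15)
    print(f"{rate:.0f} logs/min, about {minutes_left:.0f} min to reach 50")
    # prints "333 logs/min, about 30 min to reach 50"
```

Sampling the queue length a few times and re-running the estimate gives a quick way to tell a backlog that is draining from one that is only holding steady.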

One thing I noticed is that when I suspend replication for a storage group, the replaying of logs to the database on the SCR node also stops. Why would that be? Why wouldn't the SCR node continue to replay logs until the configured limit was reached? In my case, 50 logs. That does not make much sense.
ASKER CERTIFIED SOLUTION
Manpreet SIngh Khatra
I figured that. I just wish the software was programmed so that there was a way to pause replication but continue to replay logs. Anyhow, it turns out that the problem is low resources on the SCR node, specifically disk I/O. I have a server engineer looking at moving the data to faster disks. Of course that means convincing my company to spend money :)

Thank you for the help. Greatly appreciated. I will try to "pay it forward".