Distribution_Log File growth out of control. - Agent history clean up: distribution job not working.

Hi Experts,

Need some help here. Today I had our DISTRIBUTION Database Log File Grow from 100mb - 85GB. This crashed the server by taking up all the disk space. After a quick review i noticed the following job running and seemed to not stop. This usually ran every 10 mins but this time if never stopped.

Agent history clean up: distribution

which executes the following T-SQL script.

EXEC dbo.sp_MShistory_cleanup @history_retention = 48

To remedy this and get the server back online, i did the flowing.

STOP/START SQL Services

-- Changed the database recovery model to SIMPLE.
ALTER DATABASE DISTRIBUTION
SET RECOVERY SIMPLE;
GO

-- Shrink the truncated log file to 1 MB.
DBCC SHRINKFILE (DISTRIBUTION_Log, 1);
This worked fine and i recovered the 85GB back.

However, As soon as I re-ran that job, the LOG grew again and eventually took all the space. I redid the same steps above but this time Disabled that job. The server is running optimally again and replication is working fine.

The distribution_Log file options are

 Enable autogrowth
    File Growth
            In Percent 10%
    MAximum File Growth (MB)  2,097,152 or 2TB.

These are the defaults. The TAble distribution.sp_MShistory_cleanup has 41174557 records and that job should technically cleanup this table for 48hr back. The strange part is that the table still has the first initializing record plus the latest ones.


Initializing      0      0x0000000011E41BF5      0      2010-12-05 23:18:03.390
...
Waiting 60 seconds      0      0x00000000173A7AA1      1      2011-03-30 20:22:26.843

Any idea what i should do? I don't want to blow up this DB and recreate all the replications snapshots and rules.

PLATFORM:
SQL 2005 Standard 9.0.4035

-Hades666
LVL 30
Brad HoweDevOps ManagerAsked:
Who is Participating?

[Product update] Infrastructure Analysis Tool is now available with Business Accounts.Learn More

x
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

lluddenCommented:
I found a thread about it on the microsoft forums.  The basic answer was:

Run a job which deletes the contents of msmerge_history with enteries beyond your retention period.

Do it in batches of 5000. Put an index on the date column (time).

The history cleanup job is a low priority transaction victim and fails often, by design.  

You can read the whole thread at http://social.msdn.microsoft.com/Forums/en/sqlreplication/thread/480537f3-14d5-4819-bc7f-672d408296fd


SQLSergentMikeCommented:
If that database is growing then replication isn't replicating the transactions to the subscriber or as said above your retention period is too long. The Distribution database shouldn't be that large under normal circumstances. The only other issue that I can think of that would cause that type of growth would be deferred transaction.
Anthony PerkinsCommented:
Somehow I suspect the author is MIA or no longer cares...
Brad HoweDevOps ManagerAuthor Commented:
Hi All,

sorry about going MIA. Broke my leg and essentially did nothing for 6 weeks.  Anyways, I did correct this with pieces from the article above. After many hours babysitting this DB, it is now corrected.  Disk space has been restored, there is now 90GB Free and replication is optimized.

What I did was create an INDEX on the Time Column to speed up the process. This took 2 hrs to comment against 45+ million records.

CREATE INDEX IDX_MSMerge_History_Time
on MSMerge_History (time)


After that was completed, I proceeded to delete the older records manually at a rate of about 1,000,000 a time.

delete records
from (select top (1000000) *
        from [distribution].[dbo].[MSmerge_history] order by time) records


Each delete took about 30 minutes to complete.

After each delete, I needed to purge the LOG file back to 1mb to recover the space to continue to delete the remaining records as it grew again to 70+GB.

DBCC SHRINKFILE (DISTRIBUTION_Log, 1);

I did this 14 times and now all Replication jobs run properly and the table is optimized. I also finished by shrinking the DB Datafile since the space was 98% free after all the deletions and it shrunk from 20 GB to 200 mb.

Thanks for keeping an eye on it.
cheers,
Hades666

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
Brad HoweDevOps ManagerAuthor Commented:
A little more detailed then the original post. He gets the points but my comment was the solution :)
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today
Microsoft SQL Server

From novice to tech pro — start learning today.