Exchange 2007 RPC requests growing, mail stuck in queue

Win SBS 2008 server running Exchange 2007 for about 25 users.

Dell T310, 4 x 1 TB RAID 5, 8 GB RAM

It was working fine until last night, when it suddenly stopped delivering mail. The mail is ending up in the Exchange mail queue but not being delivered properly. I thought the mail DB was dead (that's the error message on the queued mails), but it is still delivering mail very slowly. No back pressure, and no useful logs in the Application section of Event Viewer.

Checked Performance Monitor for RPC requests: they start around 20 right after a reboot and grow to over a hundred fairly quickly. No mail is being delivered at that point. After I reboot, it delivers a few more messages until it locks up again. Disks look fine in the Dell server manager; no idea how it could go so wrong, so fast! Tried killing most non-essential services, but it's still not working. Tried to run ExMon to see if the problem is with one of the clients' RPC requests (I'm skeptical, we've been running fine for 5 years here!), but it only runs once per reboot and also locks up after collecting 60 seconds' worth of data, so it's not that useful (:

Any ideas? I'm about out of them after 24 straight hours ...

Adam Farage (Enterprise Arch) Commented:
The RPC request limit on the mailbox database is only 500, so you are probably hitting the maximum. Can you provide the actual error you are getting within the SMTP queue?

Get-TransportServer | Get-Queue to display the queues
Once you know the queue that is backed up run the following:

Get-Queue "queue name here" | Get-Message -ResultSize 10 | FL

If my suspicions are correct you are most likely seeing a "server rejecting connection" type of error, which is causing this. If you look at the RPC Counter on the database what is it at?

- Open Perfmon
- Select only the following monitors:

MSExchangeIS\RPC Requests (Information Store service in 2007 can only handle 500 concurrent connections, otherwise they are dropped)
MSExchangeIS\RPC Averaged Latency (should be less than 25 milliseconds)
MSExchangeIS\RPC Operations/sec (shows client activity - no "best practice" threshold)
MSExchangeIS\RPC Num. of Slow Packets (should be less than two)

What those RPC counters show should help narrow down what is causing this. I would also check AV exclusions, or just shut off AV and see if that helps.
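For anyone who prefers the command line over the Perfmon GUI, the same four counters can probably be sampled with typeperf, which ships with Windows Server. This is only a sketch: the counter paths are the ones listed above, while the 5-second interval and 24 samples are arbitrary choices of mine, not recommended values.

```powershell
# Sample the four MSExchangeIS RPC counters every 5 seconds, 24 times (about 2 minutes).
# The interval (-si) and sample count (-sc) are illustrative assumptions; adjust as needed.
typeperf "\MSExchangeIS\RPC Requests" `
         "\MSExchangeIS\RPC Averaged Latency" `
         "\MSExchangeIS\RPC Operations/sec" `
         "\MSExchangeIS\RPC Num. of Slow Packets" `
         -si 5 -sc 24
```

typeperf prints comma-separated values to the console, one row per sample, so the trend in RPC Requests right after a reboot is easy to follow.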

Also make sure all services are online by running "Test-ServiceHealth"
call_me_ishmael (Author) Commented:
Hi Adam,

The error that all of the queued messages are getting is
4.3.2 Mailbox database is offline. That's strange, because it occasionally delivers a message. There are about 700 in the queue now.

RPC requests is 80 and growing slowly
Averaged latency is 1060!
RPC operations/sec averages around 3, but it is spiky. Runs at 0 for a while and spikes up to 10 or 20 for a second
Num of slow packets is 2

Test-ServiceHealth shows everything running. Shut off backup s/w as well as most other services. No AV running. I'm baffled.
Adam Farage (Enterprise Arch) Commented:
Make sure backups are not running, but it sounds like something is hitting your disks hard. I would look at the average read / write latency on your disks, as high disk latency would produce exactly these symptoms.
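A rough way to look at that latency from the command line is sketched below, using the standard PhysicalDisk counters and typeperf. The sampling interval and count are arbitrary assumptions. The values are reported in seconds, so 0.020 means 20 ms; commonly cited Exchange guidance is to keep average read latency on database disks around 20 ms or lower, but treat that figure as a rule of thumb rather than a hard limit.

```powershell
# Sample per-disk read/write latency (in seconds) and queue length
# every 5 seconds, 24 times. Interval and count are illustrative assumptions.
typeperf "\PhysicalDisk(*)\Avg. Disk sec/Read" `
         "\PhysicalDisk(*)\Avg. Disk sec/Write" `
         "\PhysicalDisk(*)\Current Disk Queue Length" `
         -si 5 -sc 24
```

Latency in seconds per transfer is usually easier to interpret on a RAID array than the percentage-style disk time counters.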

Another thing you can do is run ExMon (Exchange Monitor, it's a separate download) on the SBS box itself to see which user is taking up the most CPU (1-5% is normal; something like a sustained 20% is abnormal) and then shut down that person's CAS access:

Set-CASMailbox -Identity "user alias here" -MAPIEnabled $false -ActiveSyncEnabled $false

call_me_ishmael (Author) Commented:
I looked at % Disk Read Time and % Disk Write Time on the physical disks. Read time seems high (2300, not sure of the units); write time is in single digits. I'm not sure what this means for a RAID 5 array or how to fix it. Dell server manager says the disks are OK. What should I do to run this down?

Ran ExMon. It was a pain and kept crashing, and there was no one individual who was a consistent offender. Many of them are above 10% for short samples.
call_me_ishmael (Author) Commented:
Thanks Adam. Just as a post-mortem: after your comment about disk usage, I downloaded Process Explorer.

This gave me a look at the processes and their I/O usage. The only thing that stuck out was that the system tray program that manages shutdown in a UPS event seemed to be using a lot of disk I/O. Killed that, and the email firehose began!
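Process Explorer is what was actually used here. As a rough alternative when nothing can be downloaded, the built-in Win32_Process WMI class exposes per-process I/O byte totals. The sketch below is an assumption about how one might get a similar view, not what was run on this server; note that the numbers are cumulative since each process started, not a rate, so long-running processes will naturally look heavier.

```powershell
# List the ten processes with the most bytes read since they started.
# ReadTransferCount / WriteTransferCount are cumulative totals, not per-second rates.
Get-WmiObject Win32_Process |
    Sort-Object ReadTransferCount -Descending |
    Select-Object -First 10 Name, ProcessId, ReadTransferCount, WriteTransferCount |
    Format-Table -AutoSize
```

Running it twice a minute or so apart and comparing the totals gives a crude idea of which process is generating I/O right now.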

Funny that a tiny memory-resident utility for power management would cause database lockups, undelivered mail, and general panic. Anyway, thanks for the help, and glad that's over!