I have a single Exchange 2003 server running on Windows 2003 Standard SP1. Exchange has SP2 loaded, plus a couple of security fixes (including the one released last week). I have 80 mailboxes, with about 40 of those as daily active users. The server has 2 GB of RAM and two RAID-5 arrays on 10k drives. One array holds the OS, pagefile, transaction logs, etc. The other holds the private mailbox database and its streaming file. The server is also a DC and a GC, but does not hold any of the FSMO roles. It also runs DNS, WINS, and DHCP. Most of my users are on Outlook XP or 2003, but I have between 5 and 10 users on Entourage (OS X).

Recently, I have had many complaints of mail responding extremely slowly. Running some performance logs, I have found some interesting numbers, and I don't know what else to look at. The CPU utilization is extremely low, 3-7%. I have almost a gig of free memory, and pages and interrupts are in the low-normal range. The average disk write queue length for the disk holding the mailbox store is extremely high, though. It is common to see numbers >250. This is on a 3-spindle RAID-5. Read queue length on that disk stays in the normal range, from about 0.05 to 1.5. Read and write queue lengths stay normal on the other array, in about the same range as the read queue on the database disk.

Naturally, performance of the entire system goes downhill when the write queue gets this backed up. The strange part is that it fluctuates regularly, but not on 1-2 second intervals as you might expect for a high-load server. It is more like 10-minute intervals. For 10 minutes, the system is slammed and the queue backs up. Then, for 10 minutes, there is almost no activity and everything is running great. (The mail queues never back up during this time; there is rarely more than 1 piece of mail in any queue, and there is no mail in the badmail folder.) I have traced everything that I can think of to try to find where this activity is coming from.
Filemon reports that it is indeed store.exe writing to priv1.edb that is causing the backlog. Performance logs of active users, SMTP counters, and ExchangeIS counters for RPC times only show the obvious: RPC latency is bad when the server is lagging due to the backed-up write queue. Exchange Mailbox counters show that user activity does not vary noticeably between periods of good and bad performance. The performance analyzer also shows the obvious, that RPC latency is high.

The best practice analyzer shows a couple of things: that the transaction logs and temp files are on the same drive as the pagefile, and that the /3GB switch is not set. I actually do have the /3GB switch set, although I have /USERVA=2900 instead of /USERVA=3030 because I was having difficulties running some things (such as Remote Desktop inbound) and decided maybe the OS needed just a bit more memory (changing it to 2900 did fix my issues).

The Exchange User Monitor displays varying server latencies as performance fluctuates: from 0-3 ms when performance is good to >2000 ms when performance is bad. The amount of data sent and received during these times, according to the User Monitor, is relatively unchanged.

DCDIAG comes back clean, and my event logs are almost perfectly clean. I loaded and ran our Symantec Anti-Virus on the server to check for viruses, and loaded and ran Windows Defender to check for spyware etc. Both came back clean, so I removed them again, since they were causing more disk activity. (We have a Barracuda hardware anti-virus/spyware device in front of the Exchange server.)
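To put firmer numbers on the roughly 10-minute cycle, the performance log could be exported to CSV and scanned for stretches where the write queue stays above some threshold. The following is only a rough Python sketch of that idea, not something I have run against this server. The function names, the threshold, the counter-name fragment, and the timestamp format are all assumptions and would need to match the actual export.

```python
import csv
from datetime import datetime


def load_counter(path, column_fragment, time_format="%m/%d/%Y %H:%M:%S.%f"):
    """Read (timestamp_seconds, value) pairs for one counter from a CSV log.

    Assumptions: the first column holds the sample time in `time_format`,
    and the wanted counter is the first column whose header contains
    `column_fragment` (e.g. "Avg. Disk Write Queue Length").
    """
    samples = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        idx = next(i for i, name in enumerate(header) if column_fragment in name)
        for row in reader:
            try:
                ts = datetime.strptime(row[0], time_format).timestamp()
                samples.append((ts, float(row[idx])))
            except (ValueError, IndexError):
                # Skip blank or malformed samples.
                continue
    return samples


def find_bursts(samples, threshold):
    """Return (start, end) times of consecutive runs with value > threshold."""
    bursts = []
    start = None
    last = None
    for ts, value in samples:
        if value > threshold:
            if start is None:
                start = ts  # a new burst begins at this sample
            last = ts
        elif start is not None:
            bursts.append((start, last))  # burst ended at the previous sample
            start = None
    if start is not None:
        bursts.append((start, last))  # log ended while still in a burst
    return bursts
```

Subtracting each burst's start from its end gives the burst length, and subtracting one burst's end from the next burst's start gives the quiet gap. If both come out close to 600 seconds every time, that would point at something running on a timer rather than at user load.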
One thing worth mentioning, as it is very non-standard, is that there are a bunch of ports open to the internet. Aside from the standard SMTP and POP ports, there is 445 (DS), 88 (Kerberos), 135 (DCOM), 3268 (GC), 389 (LDAP), and 53 (DNS). This was configured before I arrived (I have only worked here for 3 weeks). I have enabled RPC over HTTP and am in the process of moving remote users off of the old config (which was in place so users could access Exchange remotely through Outlook). Once I get management off of this old config, I will close the ports. I have considered that we may have gotten hacked and have some sort of malicious program running in the background, but I have yet to find any evidence to support that. There are no new processes that appear during the high-load periods.
Does anyone have any idea what could be causing this up-and-down trend in disk writing? What other performance counters should I look at for more insight? Are there other settings that I have overlooked that would further tune disk usage in Exchange? Any ideas would be greatly appreciated!