Outlook locking up when server Avg. Disk Queue Length spikes to 500-800

I inherited a small network 10 days ago, and I am having lock-up issues with Outlook on the terminal server; everyone in the main office is working fine. It's a small environment with 2 servers: one running Server 2003 R2 with Exchange 2003, the other a Server 2008 terminal server. The local workstations are a mixture of XP and Windows 7, all Professional editions with Office 2007. There are 20 users, 6-8 of them remote and using the terminal server.

So far I have tracked the lock-ups to times when the Avg. Disk Queue Length is above 150. The desktop users are not experiencing the problem because they have Outlook in cached mode. Take them out of cached mode and they experience the same problem as the terminal server users.

Using perfmon and Process Explorer, I can see the most active process is store.exe. When Outlook switches to not responding or reports that the server is not responding, the Avg. Disk Queue Length in perfmon is above 500, with readings up to 800.

The disk spikes last anywhere from 15 seconds to 2 minutes, effectively locking Outlook and even the console of the server during the event. I am seeing a few ftdisk warnings in the event viewer, 2 from a few days ago and 12 from a few weeks ago, but nothing during the event. It's happening every 40 to 70 minutes on the server.
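For anyone wanting to put numbers on spike duration and spacing, here is a minimal Python sketch. It assumes the counter has been logged to a CSV in the usual perfmon/typeperf layout (timestamp in the first column, counter value in the second). The server name, the sample values, and the function names are made up for illustration; the 150 threshold is the one mentioned above.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample in perfmon/typeperf CSV layout; values are invented.
SAMPLE = r'''"(PDH-CSV 4.0)","\\SERVER\PhysicalDisk(_Total)\Avg. Disk Queue Length"
"10/23/2011 05:00:00.000","2.1"
"10/23/2011 05:00:15.000","510.0"
"10/23/2011 05:00:30.000","790.5"
"10/23/2011 05:00:45.000","3.0"
"10/23/2011 05:45:00.000","620.0"
"10/23/2011 05:45:15.000","1.2"
'''

def parse_perfmon_csv(text):
    """Return a list of (timestamp, value) pairs from the CSV text."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    samples = []
    for row in reader:
        if len(row) < 2 or not row[1].strip():
            continue  # skip blank or incomplete samples
        ts = datetime.strptime(row[0], "%m/%d/%Y %H:%M:%S.%f")
        samples.append((ts, float(row[1])))
    return samples

def find_spikes(samples, threshold=150.0):
    """Group consecutive samples above the threshold into (start, end, peak)."""
    spikes = []
    current = None
    for ts, value in samples:
        if value > threshold:
            if current is None:
                current = [ts, ts, value]  # a new spike begins
            else:
                current[1] = ts            # extend the running spike
                current[2] = max(current[2], value)
        elif current is not None:
            spikes.append(tuple(current))  # spike ended on this sample
            current = None
    if current is not None:
        spikes.append(tuple(current))      # log ended mid-spike
    return spikes

def minutes_between_starts(spikes):
    """Minutes from the start of each spike to the start of the next."""
    return [(b[0] - a[0]).total_seconds() / 60 for a, b in zip(spikes, spikes[1:])]

if __name__ == "__main__":
    spikes = find_spikes(parse_perfmon_csv(SAMPLE))
    for start, end, peak in spikes:
        print(start, end, peak)
    print(minutes_between_starts(spikes))
```

Lining the spike start times up against the event viewer or the controller log is then a matter of comparing timestamps.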

My question: what's the best method to tell which Exchange process is causing the IO spike? Am I dealing with a damaged store, or is this just the beginning of a hardware failure? The firmware is out of date, and I plan on updating the disk, controller, and system board firmware this weekend.

The server is an older Dell SC1430 with a simple SATA RAID 1. No errors are reported by the controller, but it is listing a number of firmware initialization information notifications for some reason. I don't recall rebooting the server 17 times in the last few days, but the card is listing initializations occurring.

The nightly Exchange defrags are running and are listed as completing successfully in the event viewer.

Jeff_Creed (Author) commented:
Update - The RAID card hasn't listed a firmware initialization for the past 6 hours, but perfmon is showing 17 jumps above 500. It even appears to be happening every 15 to 17 minutes.
Paul S (Desktop Support Manager / Network Administrator) commented:
I suspect hardware; be worried. Make sure you are backing up this server. I had an Exchange 2003 server with RAID 5 have a disk fail, and when we replaced the disk everything appeared fine until we found EDB store corruption two or three days later.

I would definitely update everything (BIOS, firmware, RAID driver, etc.). Maybe start locating a compatible RAID card to replace the current one in case it is failing. Do you have spare drives on site already? Do you have OpenManage installed? Can you download the RAID firmware logs from the card?

Paul S (Desktop Support Manager / Network Administrator) commented:
chkdsk c: /f might be a good idea too.

also, read this:

Jeff_Creed (Author) commented:
Thanks - I think I will perform a DR test restore of the system into a VM this evening instead of performing the firmware upgrade. I will reschedule the firmware upgrade for tomorrow evening.

Before I arrived, the store had not been correctly backed up since 6/19/11. They had uninstalled Backup Exec and switched to Mozy, which had never run correctly. Because Mozy wasn't running correctly it never cleared the log files, and circular logging had been disabled for Symantec, so they had almost filled the drive with log files before I came on board. I used ntbackup to clear the log files, but I am now using StorageCraft, which uses VSS snapshot technology.

I will restore the server right now actually.

The Dell SC1430 line doesn't have OpenManage tools like the other systems; I checked with Dell. I have standard SATA drives on site.

The system drive was at 54% fragmentation; it took over 3 hours to defrag 9GB of data just a few nights ago. The drive with the Exchange store is at 26%.

On the plus side, the store has been mounting and dismounting nicely on all the nights I have worked on the server.
Jeff_Creed (Author) commented:
DR restore went clean. I have an ESXi host with an open-source iSCSI target for testing. The IO spiking that the physical server does while idle is not happening in the restored virtual environment. So thanks for the insight on the hardware.

I moved forward with BIOS and firmware updates for the drives, controller, and system board. I also ran Exchange Server Analyzer and found some non-standard memory configurations. I made the following changes to the production system over the weekend.


Also ran chkdsk /f on all drives, twice on the Exchange volume - no errors.

After the work, the IO looks very different now and Outlook is responding differently. The administrator account had 22K warning emails in the inbox. Before, trying to select them all and delete them from the terminal server would immediately send the disk latency into the 500 range and cause Outlook to stop responding and show communication warnings.

This time, though, it worked as expected and took 15 minutes. The disk latency, instead of shooting up to 500 and beyond, only rose to 30 to 40. Outlook only warned twice about a lost connection during the deletion. I don't like seeing the warning, but it's much better than having everyone taken out.

Going forward: I am going to watch it tomorrow. I have two consumer-grade 500GB SATA drives and I am ordering a replacement controller card. I am entertaining the thought of just installing the two 500GB drives directly on the motherboard and doing a software mirror, but I am not sure whether software mirroring will keep up with, say, 20 Exchange users. Thoughts? The server has 4GB and dual quad-core 1.86 CPUs.

I have high hopes that the firmware upgrade will buy me the time to get the new controller in, and that the user group will be patient.
Jeff_Creed (Author) commented:
All drives replaced along with the controller card, and I am still seeing spikes to 500. Now that I am on a new controller and new hard drives, I am performing a defrag/integrity check on the store.
Jeff_Creed (Author) commented:
The integrity check was successful and the server has been running like a charm all day.  The integrity check took several hours to complete but was successful.  Ran out of time last night to do the defrag, but will perform one this evening.
Paul S (Desktop Support Manager / Network Administrator) commented:
Sounds like you are making great progress. Have Outlook and/or the server frozen since the controller and disk change?
Jeff_Creed (Author) commented:
The defrag of the store started running outside my window and I had to cancel it. I was getting around 3GB an hour, and the store including the stream file is 40GB. I have scheduled it for this weekend.
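As a rough sanity check on the maintenance window, the two figures above (about 3GB an hour, 40GB store) can be turned into an estimate. This is only a sketch that assumes the rate stays constant for the whole run, which it may not; the function name is made up for illustration.

```python
def estimate_defrag_hours(store_gb, rate_gb_per_hour):
    """Estimate offline defrag duration, assuming a constant throughput."""
    if rate_gb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return store_gb / rate_gb_per_hour

if __name__ == "__main__":
    hours = estimate_defrag_hours(40, 3)  # figures quoted in the post
    print(f"Estimated defrag time: {hours:.1f} hours")
```

At that rate the run needs a bit over 13 hours, which is why it only fits in a weekend window.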

Yesterday the server ran great all day. I only had one jump to 250 on the server, and all users reported great performance and no lock-ups. But from 5pm last night until morning the old behavior was back, shooting up to 500.

I can see firmware initialization information listed on the new controller also, just like before. There is no timestamp on the log for them, just ##################.

Controller ID: 0 MegaRAID firmware initialization started:    (PCI ID 0x1000/ 0x0054/ 0x1028  / 0x1f09)  I have nine listed between 10/23 5AM and this morning.

I am going to schedule a reboot of the server this evening if the IO spikes move into the day.
Jeff_Creed (Author) commented:
Everything about this server had been goofed with in some fashion; regrettably, I am crying uncle. I am planning a migration to Server 2008/Exchange 2010 this month instead of next year. The disk IO spikes return after a day or so, with store.exe and the System process listed as the heavy hitters, but a simple reboot clears the condition. The nightly Exchange defrags and weekly tasks are completing normally, chkdsk is clean, and offline integrity checks are passing for the store. All the DR restores have been successful.
Jeff_Creed (Author) commented:
Dumped the hardware completely and moved to a virtual environment: restored the server from backup into a VM, no issues since. Failing hardware.
Question has a verified solution.

Are you are experiencing a similar issue? Get a personalized answer when you ask a related question.

Have a better answer? Share it in a comment.