Jeff_Creed

Outlook locking up when server Avg. Disk Queue Length spikes to 500-800

I inherited a small network 10 days ago and I am having lock-up issues with Outlook on the terminal server; everyone in the main office is working fine. It’s a small environment with 2 servers: one running 2003 R2 with Exchange 2003, the other a 2008 terminal server. The local workstations are a mixture of XP and Win7, all Professional editions, all with Office 2007. There are 20 users, 6-8 of them remote and using the terminal server.

So far I have tracked the lock-ups to times when the Avg. Disk Queue Length is above 150. The desktop users are not experiencing the problem because they have Outlook in cached mode; take them out of cached mode and they experience the same problem as the terminal server users.

Using perfmon and Process Explorer, I can see that the most active process is store.exe. When Outlook switches to not responding or reports that the server is not responding, the Avg. Disk Queue Length in perfmon is above 500, sometimes as high as 800.

The disk spikes can last anywhere from 15 seconds to 2 minutes, effectively locking Outlook and even the console of the server during the event. I am seeing a few ftdisk warnings in the event viewer, 2 from a few days ago and 12 from a few weeks ago, but nothing during the event. It’s happening every 40 to 70 minutes on the server.
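To put numbers on how long each spike lasts, the perfmon samples can be reduced to spike intervals. The following is only a rough sketch: the `find_spikes` helper, the sample values, and the timestamps are all made up for illustration, and the 150 threshold is simply the level at which the lock-ups were observed above.

```python
from datetime import datetime, timedelta

def find_spikes(samples, threshold=150.0):
    """Group consecutive samples above threshold into (start, end, peak) tuples.

    samples is an iterable of (timestamp, avg_disk_queue_length) pairs,
    for example rows read from a perfmon log exported to CSV.
    """
    spikes = []
    start = last = peak = None
    for ts, value in samples:
        if value > threshold:
            if start is None:
                start, peak = ts, value      # a new spike begins
            else:
                peak = max(peak, value)      # spike continues, track the peak
            last = ts
        elif start is not None:
            spikes.append((start, last, peak))  # spike ended on the previous sample
            start = None
    if start is not None:                    # log ended while still spiking
        spikes.append((start, last, peak))
    return spikes

# Hypothetical samples taken every 15 seconds (not real data from this server).
t0 = datetime(2011, 10, 20, 9, 0, 0)
values = [2, 5, 520, 810, 640, 3, 1, 180, 4]
samples = [(t0 + timedelta(seconds=15 * i), v) for i, v in enumerate(values)]

spikes = find_spikes(samples)
for start, end, peak in spikes:
    print(start.time(), end.time(), peak)
```

Lining the start and end times up against the event viewer and the controller log is then a matter of comparing timestamps.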

My question: what’s the best method to tell which Exchange process is causing the IO spike? Or am I dealing with a damaged store? Is this just the beginning of a hardware failure? The firmware is out of date and I plan on updating the disk, controller, and system board firmware this weekend.

The server is an older Dell SC1430 with a simple SATA RAID1. No errors are reported by the controller, but it is listing a number of firmware initialization information notifications for some reason. I don’t recall rebooting the server 17 times in the last few days, but the card is listing initializations occurring.

The nightly Exchange defrags are running and are listed as completing successfully in the event viewer.
Jeff_Creed

ASKER

Update - The RAID card hasn't listed a firmware initialization for the past 6 hours, but perfmon is showing 17 jumps above 500. They even appear to be coming every 15 to 17 minutes.
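One way to check whether the spikes really are periodic is to compute the gaps between their start times. This is a minimal sketch with invented timestamps; `spike_gaps_minutes` is a hypothetical helper, not part of any tool mentioned in this thread.

```python
from datetime import datetime
from statistics import mean

def spike_gaps_minutes(spike_starts):
    """Return the gap in minutes between each pair of consecutive spike starts."""
    ordered = sorted(spike_starts)
    return [(b - a).total_seconds() / 60 for a, b in zip(ordered, ordered[1:])]

# Hypothetical spike start times, as might be read off a perfmon graph.
starts = [
    datetime(2011, 10, 20, 9, 0),
    datetime(2011, 10, 20, 9, 16),
    datetime(2011, 10, 20, 9, 31),
    datetime(2011, 10, 20, 9, 48),
]

gaps = spike_gaps_minutes(starts)
print(gaps, mean(gaps))
```

A tight spread of gaps would suggest something running on a timer rather than user activity, though that is an inference to verify, not a diagnosis.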
ASKER CERTIFIED SOLUTION
Paul S
chkdsk c: /f might be a good idea too.

also, read this:
http://www.msexchange.org/tutorials/exchange-isinteg-eseutil.html
Thanks - I think I will perform a DR test restore of the system into a VM this evening instead of performing the firmware upgrade. I will reschedule the firmware upgrade for tomorrow evening.

Before I arrived, the store had not been correctly backed up since 6/19/11. They had Backup Exec uninstalled and switched to Mozy, which had never run correctly. Since Mozy was not running correctly, it never truncated the log files, and with circular logging disabled from the Symantec setup, they had almost filled the drive with log files before I came on board. I used ntbackup to clear the log files, but I am now using StorageCraft, which uses VSS snapshot technology.

I will restore the server right now actually.

The Dell SC1430 line doesn’t have OpenManage tools like the other systems; I checked with Dell. I have standard SATA drives onsite.

The system drive was at 54% fragmentation; it took over 3 hours to defrag 9GB of data just a few nights ago. The drive with the Exchange store is at 26%.

On the plus side, the store has been mounting and dismounting nicely on all the nights I have worked on the server.
The DR restore went clean. I have an ESXi host with an open source iSCSI target for testing. The IO spiking that the physical server does while idle is not happening in the restored virtual environment. So thanks for the insight on the hardware.

I moved forward with BIOS updates for the drives, controller, and system board. I also ran Exchange Server Analyzer and found some non-standard memory configurations. I made the following changes to the production system over the weekend.

http://support.microsoft.com/kb/315407
http://technet.microsoft.com/en-us/library/aa996786(EXCHG.80).aspx

Also ran chkdsk /f on all drives, twice on the exchange volume - no errors.

After the work, the IO looks very different and Outlook is responding differently. The administrator account had 22K warning emails in the inbox. Before, trying to select them all and delete them from the terminal server would immediately send the disk latency into the 500 range and cause Outlook to stop responding and show communication warnings.

This time, though, it worked as expected and took 15 minutes. Instead of shooting up to 500 and beyond, the disk latency only rose to 30 to 40. Outlook only warned twice about loss of connection during the deletion. I don’t like seeing the warning, but it is much better than having everyone taken out.

Going forward: I am going to watch it tomorrow. I have two consumer-grade 500GB SATA drives and I am ordering a replacement controller card. I am entertaining the thought of just installing the two 500GB drives direct to the motherboard and doing a software mirror, but I am not sure if software mirroring will keep up with, say, 20 Exchange users. Thoughts? The server has 4GB of RAM and dual quad-core 1.86GHz CPUs.

High hopes the firmware upgrade will buy me the time to get the new controller in and the user group will be patient.
All drives replaced along with the controller card, and I am still seeing spikes to 500. Now that I am on a new controller and new hard drives, I am performing a defrag/integrity check on the store.
The integrity check was successful and the server has been running like a charm all day.  The integrity check took several hours to complete but was successful.  Ran out of time last night to do the defrag, but will perform one this evening.
Sounds like you are making great progress. Has Outlook and/or the server frozen since the controller and disk change?
The defrag of the store started running outside my window and I had to cancel it. I was getting around 3GB an hour and the store with the stream file is 40GB. It is scheduled for this weekend.
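For sizing the weekend window, the arithmetic is just store size divided by throughput. A quick sketch, assuming the roughly 3GB an hour rate holds for the whole run (the `estimated_hours` name is only for illustration):

```python
def estimated_hours(store_gb, rate_gb_per_hour):
    """Estimate offline defrag duration from store size and observed throughput."""
    if rate_gb_per_hour <= 0:
        raise ValueError("rate_gb_per_hour must be positive")
    return store_gb / rate_gb_per_hour

# 40GB store at about 3GB an hour, the figures observed on this server.
hours = estimated_hours(40, 3)
print(f"{hours:.1f} hours")
```

That works out to a little over 13 hours, which is why it does not fit in a normal evening window.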

Yesterday the server ran great all day. There was only one jump to 250 on the server, and all users reported great performance and no lock-ups. But from 5pm last night until morning the old behavior was back, shooting up to 500.

I can see firmware initialization information listed on the new controller also, just like before. There is no time stamp in the log for them, just ##################.

Controller ID: 0 MegaRAID firmware initialization started:    (PCI ID 0x1000/ 0x0054/ 0x1028  / 0x1f09)

I have nine of these listed between 10/23 5AM and this morning.

Going to schedule a reboot of the server this evening if the IO spikes move into the day.
Everything about this server has been goofed with in some fashion; regrettably, I am crying uncle. I am just planning for a migration to Server 2008/Exchange 2010 this month instead of next year. The disk IO spikes return after a day or so, with store.exe and the System process listed as the heavy hitters, but a simple reboot clears the condition. The nightly Exchange defrags and weekly tasks are completing normally, chkdsk is clean, and offline integrity checks are passing for the store. All the DR restores have been successful.
Dumped the hardware completely and moved to a virtual environment: restored the server from backup into a VM, no issues since. It was failing hardware.