sybadmin1040

Exchange Average Disk Queue Length

My Microsoft Exchange server will "randomly" slow down to the point where Outlook clients intermittently receive the message "Outlook is trying to retrieve data from the Microsoft Exchange server".  When I run Performance Monitor on the server at these times, Average Disk Queue Length is pegged at 100%.  We had an Exchange health check performed and were told everything was fine.  I recently replaced the HP Smartarray controller card; the batteries are fully charged and hardware-level caching is enabled.  We have 3 partitions (one 100 GB and one 300 GB, both on the same RAID 5 array, and an additional 200 GB on a separate RAID 5 array).  I know that RAID 5 isn't the "best practice" for Exchange, but we went this route for other reasons (internal demands for journalling).  Journalling is now disabled.  There is more than enough disk space available.  The server has 4 GB of RAM with PAE enabled, and the drives are 10K (15K would be better).  Does anyone know of anything that would cause this spike of activity and slow down Outlook clients?
tigermatt


My first thoughts are that this could be caused by some anti-virus software scanning the Exchange databases. You may want to make sure that all the Exchange .EDB and .STM files have been excluded from Anti-Virus scans in your AV software. Manually scanning those files can cause corruption and other strange issues.

Similarly, are you sure that the server is not acting as an open relay and relaying spam? If it's being hit hard by a spammer, you would notice these sorts of issues. Check the Message Tracking log and verify the messages the server is handling.

RAID 5 for Exchange is not ideal - RAID 10 is much better for storing and accessing database files. Also, ideally, you should have the databases on a separate array to the OS and pagefile.

-tigermatt
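
To put rough numbers on the RAID 5 versus RAID 10 point, here is a minimal sketch using the common write-penalty rule of thumb (4 back-end I/Os per write on RAID 5, 2 on RAID 10). The disk count, per-disk IOPS, and write mix below are assumed figures for illustration, not measurements from the server in this thread.

```python
def effective_iops(disks, iops_per_disk, write_fraction, write_penalty):
    """Estimate host-visible IOPS for an array using the write-penalty rule of thumb."""
    raw = disks * iops_per_disk              # total back-end IOPS the spindles can deliver
    read_fraction = 1.0 - write_fraction
    # Each host read costs 1 back-end I/O; each host write costs `write_penalty`.
    return raw / (read_fraction + write_fraction * write_penalty)


# Assumed figures, for illustration only.
DISKS = 6
IOPS_PER_DISK = 130      # rough ballpark for a 10K drive
WRITE_FRACTION = 0.4     # assumed share of writes in the workload

raid5 = effective_iops(DISKS, IOPS_PER_DISK, WRITE_FRACTION, write_penalty=4)
raid10 = effective_iops(DISKS, IOPS_PER_DISK, WRITE_FRACTION, write_penalty=2)

print(f"RAID 5:  ~{raid5:.0f} host IOPS")
print(f"RAID 10: ~{raid10:.0f} host IOPS")
```

The gap widens as the write share grows, which is why a write-heavy database workload tends to suffer more on RAID 5 than a read-mostly one.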
sybadmin1040

ASKER

Thank you for the suggestions.  There is no real-time AV running on the Exchange server itself.  We're currently running Trend ScanMail, which is set to only scan email messages for malicious content.  This has actually been running for several years and these symptoms just recently started.  The Exchange server itself is not accessible from the public Net - there is a SurfControl server sitting in front of the Exchange server which accepts requests on port 25 and then relays them on to Exchange.  I've checked for any scheduled tasks that may have been scheduled on the Windows side and there is nothing going on.  The symptoms will persist for a few minutes and then will magically clear up.  I've checked Task Manager and it is definitely the combination of store.exe and inetinfo.exe.  I saw a TechNet article which described reinstalling the SMTP service (which is apparently part of inetinfo.exe) and didn't know if anyone had run into this issue with that service.  The OS and mailstores are running on one RAID5 array (separate logical drives) and the transaction logs are pointed to the second RAID5 array on another logical drive.  The Exchange Best Practices Analyzer shows no critical issues and only shows warnings at this point - none of which would seem to cause a spike in the disk I/O.  The RAID controller card is an HP Smartarray 6400 series card (with the latest firmware - 2.84 - from HP).  The latest SupportPaq files have been installed on the server as well.  I'm not sure if information store fragmentation would cause issues like this, and I may do some offline defragging and/or new Information Store creation and mailbox moving if things don't improve.  Thanks again for the suggestions, tigermatt - if you can think of anything else that I may be overlooking, just let me know.
nmcdermaid

I really don't know much about Exchange but I'm surprised that you need IIS (Internet Information Services) installed. If you uninstalled it, or at least stopped it, you wouldn't have an inetinfo.exe process anymore.
Or do you require that for webmail?
sybadmin1040

ASKER

We are running OWA for certain users and I believe that at least a portion of IIS will install with the installation of SMTP - I may be wrong about that, but, regardless, we do use Outlook Web Access.
nmcdermaid

OK, bowing out as that's the extent of my knowledge... Good luck.
sybadmin1040

ASKER

nmcdermaid - thanks for the suggestion!  I appreciate your time.
andyalder

You say disk queue is at 100%, but what about disk I/Os per second (or bytes per second or anything similar)? If the disk queue goes up and IOPS isn't high, it's a sign of a stalled disk subsystem. You've replaced the controller (presumably to try to fix this problem) but it could be the SCSI cables, backplane or even a flaky disk pulling the bus down. Anything in the IML log on the Systems Management Homepage?
sybadmin1040

ASKER

Nothing at all - ran diags on all disks yesterday afternoon and each one shows clean and gives the message "This drive IS functioning within the proper operating specifications and should NOT be replaced".  On the main Systems Management Homepage it shows no errors on any device.  If drive I/O goes crazy again today at some point, I may disable ScanMail temporarily to see if that clears it up at all.  No settings have changed, but, some update to the software (which is done automatically) may have done something.  I just added bytes per sec to my performance monitor to see what it tells me.  Again, I do know that store.exe and inetinfo.exe are the main disk i/o users which is to be expected since Exchange is very disk intensive.
BrianLanter

We too are experiencing the same problem.  The errors went away for a little while, but now they are back.  The Disk Queue Length pegs at 100% and users receive the "mail server is unavailable" error message.  Looking into a solution now.
sybadmin1040

ASKER

Well, at least I'm alone - if I come up with anything, I'll post it.  Thanks again everybody!
sybadmin1040

ASKER

Whoops - meant to say not alone - big pet peeve of mine not checking what I wrote.
BrianLanter

We recently moved our exchange server to a blade.  Are you running in a similar environment?
sybadmin1040

ASKER

No, ours is on a standalone HP ML370 server.  4 GB RAM, two RAID5 drive arrays running from a single HP Smartarray 6400 series controller card.  Our security guy needed a product called "snare" installed almost 6 months ago that seems to be consistently hammering the system.  In addition to perfmon I'm now running filemon to see what files are being accessed during the slow times.  Of course, nothing has happened since I launched filemon, but if I see anything that seems really out of the ordinary, I'll be sure to post it.  You aren't running anything similar to snare, are you?  I never knew what it was, but apparently it pipes out event logs to a monitoring system, so obviously it could be running 24x7.
sybadmin1040

ASKER

Update - one thing I did change this morning, and, knock on wood, the issue hasn't happened again since - the SMTP directory was located on the same logical drive as the transaction logs.  Again, due to limited options at this time on the drive partitions, I have now moved it to the same drive as the information stores.  So far, it seems to have helped a little, but I'm definitely not considering it fixed at this point.
sybadmin1040

ASKER

That last update didn't help for long - still getting pegged at times.
sybadmin1040

ASKER

Found an interesting article on TechNet.  The Outlook issue was the first clue from some users that something was going on.  http://support.microsoft.com/kb/892764
tigermatt


OK, a few more questions. I'm not ruling out that it could be a hardware issue, but:

What sort of RAM usage is the store.exe service utilising?
How many users do you have? Do they have "Cached Mode" enabled in their Outlook profiles?

-tigermatt
sybadmin1040

ASKER

The store.exe is using a little over 1gb of RAM - about 1.2gb.  We have about 700 mail users, but, not all of those actually have "active" mailboxes.  We use Cisco Unity for Unified Messaging and I had to configure dummy Exchange accounts for a lot of the greetings to play properly.  The users (at least most of them) have been configured to not use cached mode.  This is actually part of our Ghost images and anytime a client is installed, this setting is unchecked.
tigermatt


It is known that using Cached Mode in Outlook does reduce the load on your Exchange Server, and that having many Outlook users connecting non-cached to Exchange can cause Outlook to go offline on occasions, should the server be overloaded. It doesn't seem this is the case though, as the store.exe RAM usage would be much higher if this was the case.

I am inclined to suggest that this could be an issue with the hardware, per andyalder's comments. What RAID array format (RAID 0, 1, 1+0, 5 etc.?) are the databases stored on?
sybadmin1040

ASKER

They are on a separate RAID5 array.  I know that RAID 1 has much better performance, but at this time I'm unable to re-partition and there are no more available drive bays in the server.  They are running from an HP Smartarray 6400 series controller with the latest firmware.  Caching is enabled at the hardware level.  I'm leaning toward the hardware itself and I'm convinced that faster drives and a better RAID setup would help greatly.
tigermatt


It's RAID 1+0 (RAID 10) which has the best performance for storing databases. I appreciate the fact that you can't rebuild the array at this time though into a different RAID type.

What was the security software you were referring to in your previous comment? Has that been uninstalled?
andyalder

Unity can significantly increase the load on the Exchange server, with all those audio file attachments. But it wouldn't normally give bursts of higher-than-normal disk queue; it would just add to the general load on the server.
tigermatt


I thought that might have been the issue with the security software. Andy, what do you reckon? Could the actual issue here possibly be wrongly classified as a disk issue - a possibility, I am thinking?
andyalder

You'd have to compare processed I/Os per second with queued I/Os to see if one ramps up while the other goes down; that is what separates a hardware problem from a software one.

If some 3rd party app or even Exchange itself suddenly throws lots of work at the disks all at once the disks get swamped same as the queue gets swamped if the disks stop. A badly written virus checking app may even make the pagefile go bonkers. It can be hard to tell whether the disks have stopped or are simply being thrashed.
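
The comparison described above can be sketched roughly as follows. This is only an illustration: the counter names correspond to the PhysicalDisk counters "Avg. Disk Queue Length" and "Disk Transfers/sec", while the `Sample` type, both thresholds, and the labels are assumptions that would need tuning for a real array. The latency estimate relies on Little's law (queue length is roughly throughput times time per transfer).

```python
from dataclasses import dataclass


@dataclass
class Sample:
    queue_length: float        # PhysicalDisk \ Avg. Disk Queue Length
    transfers_per_sec: float   # PhysicalDisk \ Disk Transfers/sec


def implied_latency_ms(sample):
    """Little's law: queue length ~= throughput * time per transfer."""
    if sample.transfers_per_sec <= 0:
        return float("inf")
    return sample.queue_length / sample.transfers_per_sec * 1000.0


def classify(sample, queue_threshold=2.0, busy_iops=300.0):
    """Label one sample. Both thresholds are assumed values, not vendor figures."""
    if sample.queue_length < queue_threshold:
        return "ok"
    if sample.transfers_per_sec >= busy_iops:
        return "saturated"   # deep queue while the disks are still doing lots of work
    return "stalled"         # deep queue but little work completing


# Hypothetical samples taken during a spike.
for s in (Sample(0.5, 100.0), Sample(50.0, 400.0), Sample(50.0, 20.0)):
    print(classify(s), f"{implied_latency_ms(s):.0f} ms per transfer (implied)")
```

A run of "stalled" samples points toward hardware (cables, backplane, a failing disk), while "saturated" points toward something throwing extra work at the disks.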
jchri66

I'd like to get in on this discussion since I'm having the same issue.  I have had it since mid December and the users are really starting to grumble.  I have a ticket open with Microsoft and they are still going through the motions of detecting the problem.  I have about 75 mailboxes but half of them are in the 2 GB range, and 10 of them were 3 - 5 GB (yikes).  All are on the Exchange server, not using PSTs.  10 or so people have folder structures that are mind boggling.  I just spent the weekend archiving every mailbox that was over 2 GB.  The store is about 105 GB.

Exchange 2003 SP2
Server 2003 R2 SP2
Outlook 2003  / some 2007

Some things I have done:
Changed NICs and switch ports
Uninstalled all Symantec software.
Tried Outlook in SAFE MODE
Rebooted
Currently trying to take a closer look at the network to see if there is anything hidden going on.  When I look at the traffic on the switch I don't see high utilization, but I think I see fragmentation when I watch the traffic from the DCs and Exchange server.

Changes to my environment in the last month:
Added 9 iPhones
Added Blackberry Server and 6 Blackberry phones.
Sugar Outlook Addin to allow Email upload to Sugar system
Remote site downed a DC without allowing me to DCPROMO. (I did clean ADS though)

Figured maybe together we can figure this out.
Does anyone know if a patch might be the problem?
jchri66

I have also looked at the same counters and see the disk gets pegged and RPC latency goes up, but CPU and memory are great.  I ran the Best Practices tool, ExMon and the troubleshooter.  They all show high RPC latency but offer no help as to what is causing it.

I'm still brain dumping here so I'll probably have more information and I'll keep you posted with MS finds.  I believe I'm waiting for a 2003 Server guy to call back.
jchri66

More:

I checked the time sync to make sure Kerberos wasn't off. RPCPings are good between DCs/GCs and Exchange server and clients.  There are 2 GCs on the subnet with the Exchange server and Exchange sees both of them.
sybadmin1040

ASKER

The security software I referenced is a product called SNARE - free download - which basically pipes Windows event logs out to a syslog box or something similar.  The service is disabled and I just had a spike in the average disk queue.  I thought perhaps it might have been information store fragmentation, so this weekend I downed each one and ran an offline defrag - apparently, that wasn't the cause either.  I have done the same thing - Best Practices Analyzer, and my boss paid for an "Exchange Expert" to come in and take a look at my setup and he gave me the green light (big waste of money I know - we could have bought a new server with that cash).  Anyway, our setup is pretty basic - one Exchange server and one front end server - and the front end server is only being used for iPhone access at the present time.  Thanks for checking everyone - if someone comes up with something, please let me know. I seem to be in the same boat as jchri66 - users are starting to grumble.
tigermatt


I can't remember if it was this question or another one - however, did you say your users have Cached Mode disabled? Based on that, I think it would be interesting to see whether you see these disk spikes over a weekend period, when nobody is around. It seems the 'Outlook is disconnected' message and Average Disk Queue are related in some way.
sybadmin1040

ASKER

Most of the weekend is flatlined.  We do have some Saturday workers, but, after hours - it's basically nothing.  I just can't put my finger on what goes on during the spikes.  I know that upgrading hardware would work - there's not too much of a doubt - it's just the fact that I can't pinpoint why the spike has happened.  It may have been high all along and we hit a threshold and just can't recover from it, but, journalling has been disabled (it was enabled for over a year) which should save the system from basically duplicating work by copying every email.
jchri66

I solved my problem!!  Hopefully this will help you.  

I have two Macs on my network and on one of them Entourage was pegging the Exchange server.  I had noticed a lot of broadcasting from the Mac's IP last week and decided to take note and then go down that path if all else failed.  I had also noticed that the timeouts occurred at specific times during the day.  This morning everything was still good at 10:30am; I checked and the Mac user wasn't in yet, so I kept watching.  Sure enough, at 11:15am RPC latency started climbing; he had gotten into work at 11:05.  I disabled his port and the issue resolved in about 10 seconds.  After enabling the port, a big write spike would happen and then Disk Queue would go off the charts.  After that, RPC latency would go up slowly and kill all the Outlook clients.  We closed Entourage on his Mac and all was well again, and we could replicate the issue by opening and closing the app, so it wasn't the Mac overall, just the app.  We are working on a fix for him, but I truly hope this helps you guys out, because I've been working on this for about a month and finally have a breakthrough!!
jchri66

By the way....

Jan 12th will forever be known as VOE(Vistory Over Entourage) Day for my small IT systems group.   :)
jchri66

Or rather "Victory" Over Entourage.
sybadmin1040

ASKER

Well, we have only one Mac on our network, but I will check it out for sure! Thank you so much for your help - I'll let everyone know if this works on my network.
sybadmin1040

ASKER

Well, I checked with our one Mac user and he doesn't use Entourage.  I guess it could still be something on the Mac side that's doing it, I'll have to investigate further.  Thanks for the tip!
ASKER CERTIFIED SOLUTION
jchri66

sybadmin1040

ASKER

Just disabled it and we'll see what happens.  Since running an offline defrag of all information stores, disk activity is nowhere near what it was before.  I'm guessing that was part of the issue, but I still see spikes at times that last for a minute or so.  Again, with older 10K drives and older hardware (approximately 3-4 years old), this may be typical.  I'll see if disabling TOE (TCP Offload Engine) helps.  Thanks!
sybadmin1040

ASKER

Well, I tried disabling TOE and Disk Queueing is still being pegged at 100% intermittently.
sybadmin1040

ASKER

I'm closing this one and giving jchri66 the points.  We still have this issue intermittently and I'm going to be migrating to different hardware shortly (faster drives, etc).  Thanks for everyone's suggestions.
jcoyle9

I had the exact same problem: it was a Mac using Entourage that was causing it.  I actually shut down every Outlook client on the network and started to bring them up one at a time to find which client was responsible.  As soon as the offending computer opened the mail client, the server pegged to 100%.  It's nice to have two people to do this: have one person start up the mail client while the other watches the server's performance.  It worked for me, good luck.
jchri66

As an update to this problem, the accepted solution above really just identifies the cause of the problem and not the solution.  After I found that it was the Entourage client, we still worked for a couple of weeks trying to figure out how to stop it.  What I ended up finding out was that ONE email in the Entourage client's inbox was causing the problem.  We had to clean out the inbox and move portions of the inbox back in while testing each time for the problem.  This revealed the one email that was bringing the Exchange server to its knees.  Again, ONE email in someone's inbox was killing the Exchange server.  We have actually added another Mac to the network, bringing the total to 3, so I still monitor average disk queue.

To add to the information for digestion: in the weeks spent trying to fix the client, we did everything from uninstalling the Entourage 2008 upgrade version to installing the full version of Entourage 2008, thinking maybe the upgrade caused something.  We had upgraded his application 3 months earlier.  We also went back to 2004, which still had the problem.  So go straight for the email hunt to save time.
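
For anyone repeating the email hunt, moving portions of the inbox back in and retesting amounts to a bisection. Here is a minimal sketch of that idea, assuming exactly one message triggers the problem and that the problem reproduces reliably. The name `find_trigger` and the `reproduces` callback are hypothetical; in practice `reproduces` is a person moving that batch of messages back into the inbox and watching Disk Queue on the server.

```python
def find_trigger(items, reproduces):
    """Return the single item that triggers the problem, or None if it never reproduces.

    `reproduces(subset)` must return True when the problem occurs with only
    `subset` present. Assumes exactly one culprit.
    """
    candidates = list(items)
    if not candidates or not reproduces(candidates):
        return None
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if reproduces(half):
            candidates = half                      # culprit is in the first half
        else:
            candidates = candidates[len(half):]    # otherwise it is in the rest
    return candidates[0]


# Simulated inbox: message names and the "bad" message are made up for the demo.
inbox = [f"msg-{i}" for i in range(1000)]
bad = "msg-617"
tests_run = 0


def reproduces(subset):
    global tests_run
    tests_run += 1
    return bad in subset


print(find_trigger(inbox, reproduces), "found after", tests_run, "tests")
```

Halving each time keeps the number of manual test rounds to roughly the base-2 logarithm of the inbox size, instead of one round per message.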

Because I had opened a support ticket with Microsoft on this, they were very interested in everything I did to solve the problem and called back several times to take notes.  They also credited back my service call since they didn't solve the problem for me.  So now I have another parachute in case another haystack problem comes up.

On a non-IT management note, after we got everything fixed I had to deal with doubts from this group of people that it was actually fixed for another two weeks.  The phrase, "there's nothing wrong with the email system, I can replicate the problem and show you we fixed it" was used numerous times.  I'm sure we can all relate.
jcoyle9

Yes, you are correct, it was one single email.  I just put all of the emails in an archive folder off of the server and made a new mailbox on the server for the Mac.  The question is when this is going to happen again.
borgmember

I have an iPhone 3G at a customer that, when ActiveSync is enabled, will slow Exchange to a crawl after 24 hours. Around the 24-hour mark, store.exe on the server stays at around 30% CPU all the time and makes the Outlook clients on the network unusable. The only way to fix it is to turn off ActiveSync for that user and reboot the server. Very frustrating.