Windows 2003 unresponsive during high IO utilization.

Hello Experts,

First the specs:

Role: Main File Server/Data Dump.  The server holds small and large files, including Outlook PSTs ranging from 200 MB to 5 GB, large creative images (1 GB), and small documents from 1 KB to 25 MB in size.

Dell Poweredge 2800
2x Intel Xeon 3ghz hyperthreading off
2gig DDR2 ECC
AHA-39160 Ultra 160 PCI used by Powervault 124T LTO3 Autoloader
PERC 4e/Di 256meg (Embedded)
8x 300gig Ultra 320 RAID 5 on PERC 4e/Di (Data)
2x 36gig Ultra 320 RAID 1 on PERC 4/SC (Windows/Boot files)
Adaptive Read Ahead/Write Back/Direct IO Raid Policy
Intel Pro 1000/MT running at 100BaseT Full Duplex
Windows Server 2003 R2 SP1 /w up to date patches
120 users approx 50 users connected at any given time

•      Symantec Antivirus Client 10.0.1
•      Symantec Backup Exec 10d
•      Diskeeper 2007 Enterprise Edition

Symptoms: During peak hours, disk read time on the data drive randomly jumps to 100% for a few minutes, which slows the server down to the point where users get disconnected and Windows becomes unresponsive.  According to perfmon, write time on the data drive and CPU/RAM/network utilization are all minimal during this time.

Solutions that didn’t work:

•      Originally Windows was installed on a separate partition but on the same RAID 5 array as the data drive (8x 300gig Ultra 320 RAID 5 on PERC 4e/Di), so Windows became unresponsive whenever the IO jumped.  I recently added a separate PERC 4/SC and moved the Windows/boot files to their own RAID 1 array.  This relieved Windows of its unresponsiveness, but the problem still persists on the data drive.
•      Installed
•      Turned off antivirus
•      Defragged Drive
•      Installed Diskeeper 2007 Enterprise Edition
•      Switched from 1000BaseTx to 100BaseTx
•      Changed to No Read Ahead/Write Through/Cache IO Raid Policy
•      Ran various Dell Diagnostics tools… server passed with no errors.  Event Viewer shows no errors.
•      Ran raid consistency checks
•      Tried other various solutions found in

Temporary Solutions that work:

•      Reboot the server
•      Disable the network card thereby disconnecting all users.

Solutions that I’m considering:

•      Replace the PERC 4e/Di (faulty?) and recreate the RAID 5 array (back up and restore the data)
•      Install Windows Server 2003 SP2

Thank you, and I hope someone can provide some insight.

1) Are there any warnings or errors in the Event logs?
2) Have you tried running perfmon to see exactly where the issue is occurring?

I hope this helps !
JDCRCAuthor Commented:
1. Unfortunately there are no warnings or errors in event viewer.  
2. I monitor the server in real time using another box running perfmon.  The only anomaly perfmon shows is 100% disk read time on the data drive's logical disk, with virtually no disk write time on the same logical disk.  Since I moved the Windows/boot files to a separate RAID 1 controller, Windows hasn't been affected during the IO jumps.  However, users still get disconnected, and browsing files on the server takes a horrendously long time.
You said that you defragged the drive; since you are running Diskeeper, you can get a report on the MFT.

It's quite possible that the number of files has filled a good portion of the MFT.  Once the MFT gets full, it can start to become corrupted.  If it is fragmented and/or corrupted, you will see exactly this problem as the system attempts to find all the files.

The MFT lists the locations and sizes of all the files on the drives.  When that file is full or fragmented, that process of reading the MFT takes a long time and that is added onto the time it takes to find and read the files in question.

What's the status of your MFT?
JDCRCAuthor Commented:
Total MFT size: 1,642 MB
MFT records in use: 1,458,938
Percent MFT in use: 86%
Total MFT Fragments: 0

I've run Diskeeper's MFT defragment utility multiple times before, and usage stays at 86% even though I set it to resize automatically.  It did, however, defrag the MFT during the first run.

Thanks for all the inputs so far.
OK, so that is not the problem.  That is high usage but not horrible, and there are no fragments, which is very good and suggests the MFT is not overused.

Since you mentioned that disconnecting the NIC resets the problem, have you tried changing NICs?  There could be a problem with the NIC itself.  Drop an off-the-shelf workstation NIC in and use that for a while, just to see what happens.
JDCRCAuthor Commented:
Yes, the server comes with 2x Intel Pro 1000/MT.  I've tried both NICs at 100Tx half and full duplex, and I also changed the Cat6 cable and the switch.

Thank you.
Dean ChafeeIT/InfoSec ManagerCommented:
I have had this same problem with a very similar setup. The issue has been when a user takes a sizable (multi-gig) archive file like a .ZIP and extracts it back to the same location, or to another folder on the same network drive. The IO contention on the drive array from a large read and write back to the same location can kill... especially RAID 5, as it is not the highest-performing of the RAID options.
The trick is nailing down the "who" and retraining them to use a separate source and target drive. I have a small set of users (about 40), so it's not too bad here. I'm not sure how you get disk IO performance counters on a per-user basis...     Anyone know?
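One way to see why a mixed read/write job hurts RAID 5 more than other layouts is the commonly quoted small-write penalty: a small RAID 5 write typically costs four disk operations (read old data, read old parity, write new data, write new parity), versus two on a mirror. A rough Python sketch of that arithmetic follows. The 150 IOPS per disk figure and the function name are illustrative assumptions, not measurements from any server in this thread.

```python
def effective_iops(disks, per_disk_iops, read_fraction, write_penalty):
    """Rough host-visible IOPS for an array under a mixed workload.

    write_penalty is the number of disk operations one small host write
    costs: commonly quoted as 4 for RAID 5 and 2 for RAID 1/10.
    """
    raw = disks * per_disk_iops           # total operations the spindles can do
    write_fraction = 1.0 - read_fraction
    # Each host read costs 1 disk op; each host write costs write_penalty ops.
    cost_per_host_io = read_fraction + write_fraction * write_penalty
    return raw / cost_per_host_io


# Hypothetical 8-disk array at an assumed 150 IOPS per disk.
print(effective_iops(8, 150, 1.0, 4))     # pure reads
print(effective_iops(8, 150, 0.5, 4))     # half reads, half writes on RAID 5
print(effective_iops(8, 150, 0.5, 2))     # same mix on a mirrored layout
```

This ignores controller cache and full-stripe writes, so treat the results as a ballpark for comparing layouts rather than a prediction.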
JDCRCAuthor Commented:
Interesting... I was looking for an application that would track IO usage on a per-user level but was unsuccessful in locating one (I was using Task Manager's IO process counters, but those only track per program).  That would definitely help me track down the user or users causing this, if that is the case.  Another thing I noticed is that the same 100% IO read utilization happens during backup; this sometimes causes all my shadow copies to be deleted due to high disk activity... another pain in itself.

Thanks again. Keep the suggestions coming.
Too many apps doing synchronous disk I/O calls.  This is why for Windows servers, IOPS matters more than MB/sec.

Picture this:

Your hard drive is a bank.
Your disk controller is the ATM.
Your apps are people wanting to get to the ATM.

The backup app takes $64k out at a time, then gets right back in line to deposit $64k.  If you're lucky, he's going to a different ATM so he's in a second line somewhere else.  If you're not lucky, he's back in the same line.

The ZIP app takes $32k out at a time, then gets right back in line, deposits $20k, and repeats.  Like the Enron CEO, just on a smaller scale.

The Explorer app gets in line to get an account inquiry (directory list).  He's got 200 accounts (files).  He gets back in line and does one account inquiry for each of the 200 accounts just so he can get a pretty deposit slip (file icon).  Explorer apps should be shot.

The more apps that want disk I/O, the longer the line.

Once you get in line, you cannot get out.  Synchronous I/O calls make the I/O request, and WAITS for the I/O to complete.  This is a bad thing in a heavily multi-user environment.

Asynchronous I/O calls mean dispatching your intern to go stand in line for you and bring back the results when he's done.  In the meantime, you can get other things done (like updating the user interface).  But an application needs to be designed to use asynchronous I/O.  And as any HR department will tell you, managing interns is a complicated process.  It's simpler to just do the I/O in synchronous mode than to write an intern thread to take care of it asynchronously.
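The difference between standing in line yourself and sending an intern can be sketched in a few lines of Python. This is only an illustration using worker threads (not true OS-level overlapped I/O), and `slow_read` is a hypothetical stand-in that sleeps instead of touching a disk.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_read(name):
    """Stand-in for a blocking disk read (hypothetical; sleeps instead of reading)."""
    time.sleep(0.05)
    return "data from " + name


def read_synchronously(names):
    # The caller stands in line itself: nothing else happens until each read returns.
    return [slow_read(n) for n in names]


def read_with_interns(names):
    # The caller hands each request to a worker thread (the "intern") and
    # stays free to do other work until the results come back.
    other_work_done = 0
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(slow_read, n) for n in names]
        while not all(f.done() for f in futures):
            other_work_done += 1          # e.g. repaint the user interface
            time.sleep(0.005)
        results = [f.result() for f in futures]
    return results, other_work_done


names = ["a.pst", "b.pst", "c.pst"]
sync_results = read_synchronously(names)
async_results, ticks = read_with_interns(names)
print(sync_results == async_results, ticks > 0)
```

Both versions return the same data; the second one simply leaves the caller responsive while the requests wait their turn, at the cost of the extra bookkeeping the post describes.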

To FIX the problem...

Increase I/O queue depth on the disk controller if possible.  This is like placing multiple ATMs in front of the people in line for faster reads/writes.  It is faster because the line is shorter and the controller can decide the fastest way to reorder the read/write requests.

Having multiple smaller disks will be faster than fewer larger disks.  I/O queuing is on a per-LUN/disk basis, not per controller.  The controller typically has a finite number of buffers shared across all LUNs.

Reorder read/writes to different controllers (preferred) or disks.  Withdraw from one ATM, deposit to another ATM with a different bank.

Separate USER drives from WORK drives.  The amount of disk I/O generated by simply browsing folders with Explorer is insane.

You may want to think about going to a higher performance disk array.  An HDS 9980V can easily do 48,000 IOPS (I/Os per second).  The downside is that it's a bit pricey (seven figures).  The AMS/WMS line is five figures and still has outstanding IOPS.

During all this, your throughput may be quite small.  The system is not necessarily transferring lots of data, just making lots of requests, and the requests sit in line until they're processed.  This is when the system's UI becomes unresponsive, apps die, explorer explodes, and mouse clicks take fifteen seconds to process.
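To put rough numbers on "lots of requests, little data", here is a small Python sketch. All figures (1,200 requests per second, 4 KB per request) are hypothetical and chosen only to show the arithmetic; the function names are made up for this example.

```python
def throughput_mb_per_s(iops, request_kb):
    """Data rate implied by a given request rate and request size."""
    return iops * request_kb / 1024.0


def seconds_to_drain(queued_requests, iops):
    """How long the last request in line waits if the disk serves iops requests/s."""
    return queued_requests / float(iops)


# Hypothetical array saturated at 1,200 small 4 KB requests per second.
print(throughput_mb_per_s(1200, 4))

# Explorer asking about 200 files, one request each, behind 1,000 others.
print(seconds_to_drain(200 + 1000, 1200))
```

Under these assumptions the disks are fully busy while moving only a few MB per second, which would fit the pattern of minimal network and CPU utilization during the stalls.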

Dean ChafeeIT/InfoSec ManagerCommented:
@ VXDguy

"Like the Enron CEO, just on a smaller scale."  LOL

Good analogy. I totally agree with the IOPS and disk queue length.

@ JDCRC   In Perfmon, you should watch the Physical Disk, Avg. Disk Queue Length to see if this indeed is an issue.    Also, to help narrow down the possible "offending user" list, you can use 'Net Files' at the command line to see who has files open, or use Computer Management Console, Shared Folders, Open Files. Same list different methods.
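If you would rather log the counter than watch it live, the `typeperf` command-line tool can write it to a CSV file, for example `typeperf "\PhysicalDisk(_Total)\Avg. Disk Queue Length" -si 5 -f CSV -o queue.csv`. Below is a minimal Python sketch for scanning such a log for spikes. The host name `FILESRV`, the sample values, the simplified header, and the threshold of 16 (the often-quoted rule of thumb of about 2 per spindle, times 8 disks) are all assumptions for illustration.

```python
import csv
import io


def queue_spikes(csv_text, threshold):
    """Return (timestamp, value) pairs whose queue length exceeds threshold."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)                          # skip the header row
    spikes = []
    for row in reader:
        if len(row) < 2:
            continue                      # ignore blank or truncated lines
        try:
            value = float(row[1])
        except ValueError:
            continue                      # a missed sample may be logged as blank
        if value > threshold:
            spikes.append((row[0], value))
    return spikes


# Hypothetical log contents, shaped like a typeperf CSV file.
SAMPLE_LOG = r'''"(PDH-CSV 4.0)","\\FILESRV\PhysicalDisk(_Total)\Avg. Disk Queue Length"
"05/14/2007 10:00:00.000","0.41"
"05/14/2007 10:00:05.000","23.70"
"05/14/2007 10:00:10.000"," "
"05/14/2007 10:00:15.000","41.20"
'''

for stamp, value in queue_spikes(SAMPLE_LOG, 16.0):
    print(stamp, value)
```

Lining the spike timestamps up against the 'Net Files' or Open Files list taken at the same moments may help narrow down which users were active.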
JDCRCAuthor Commented:
Ok, perfect.  I will add the disk queue length counter on my perfmon box and see if that is what is causing the issue.  I'm already using the file server management console to check which users are connected during the IO jumps.  As far as I can tell, some of the users are attached to their Outlook PST files, which are by and large pretty big.  Maybe it's the Outlook PST files that are causing the issue.

There are around 20 locked PST files at any given time ranging from 200 megs to 4 gigs.  

Great Feedback!
I really find it hard to believe that it's the disks causing the problem.  I run 100 users on a virtual server with 4 250GB RAID 5 drives for what sounds like roughly the same thing.  These same disks also host 2 web servers and a certificate server.  All of my users have their PST files open all day long, and they range in size from 200MB to over 10GB for a few users (who constantly get yelled at).

It may end up being the drive, but since things lock up... I'd be more willing to think somewhere else is causing the issue.  It could be a controller issue.
JDCRCAuthor Commented:
Aackley it's very good to know that someone actually has a very similar setup without any issues.  I do find it hard to believe that a file server of this caliber can't handle this amount of traffic.  I was planning to move the PST files (and have finance yell at me for making a PO for another file server) to a new server.  

Can you please tell me the model of your raid controller as well as the raid policy?

PS FixingStuff:  I added the avg. disk read queue length and avg. disk write queue length counters in perfmon.  We had the issue pop up again today, and the new counters seem to track my old % Disk Read Time counter.  So other than having another counter in perfmon, I don't see the advantage over % Disk Read Time.
Ok actual full setup of this server is:
Dell 2950
Dual Xeon PIVs with 8GB RAM
Perc 5i controller (128MB) with 6 146GB SAS drives, configured as 2 in a RAID 1 (mirror) and 4 in a RAID 5.

The server runs Win2k3 Enterprise with Microsoft Virtual Server 2005 R2.

This hosts a file server for approx. 100 users including department files, application files, user folders, psts, etc.  2 Web Servers, a Small SQL 2005 Database server and a Certificate Server.

This is a huge upgrade from the old Dell server with dual P2 400s and 4 18GB SCSI drives on an old Perc 3 card, which ran the file server for the same users up until 6 months ago without lockups.  It was quite a bit slower, but never to the point you're talking about.