Solved

Windows 2003 unresponsive during high IO utilization.

Posted on 2007-03-27
Last Modified: 2012-08-14
Hello Experts,

First the specs:

Role: Main File Server/Data Dump.  The server holds small and large files, including Outlook PSTs ranging from 200 MB to 5 GB, as well as large creative images (1 GB) and small documents from 1 KB to 25 MB in size.

Dell Poweredge 2800
2x Intel Xeon 3ghz hyperthreading off
2gig DDR2 ECC
AHA-39160 Ultra 160 PCI used by Powervault 124T LTO3 Autoloader
PERC 4/SC PCI
PERC 4e/Di 256meg (Embedded)
8x 300gig Ultra 320 RAID 5 on PERC 4e/Di (Data)
2x 36gig Ultra 320 RAID 1 on PERC 4/SC (Windows/Boot files)
Adaptive Read Ahead/Write Back/Direct IO Raid Policy
Intel Pro 1000/MT running at 100BaseT Full Duplex
Windows Server 2003 R2 SP1 /w up to date patches
120 users, approx. 50 connected at any given time

•      Symantec Antivirus Client 10.0.1
•      Symantec Backup Exec 10d
•      Diskeeper 2007 Enterprise Edition

Symptoms: Disk read time on the data drive jumps to 100% at random for a few minutes during peak hours, which slows the server down to the point where users get disconnected and Windows becomes unresponsive. According to perfmon, write time on the data drive and CPU/RAM/network utilization are minimal during this time.

Solutions that didn’t work:

•      Originally Windows was installed on a separate partition but on the same RAID 5 array as the data drive (8x 300gig Ultra 320 RAID 5 on PERC 4e/Di), resulting in Windows becoming unresponsive when the IO jumps.  I recently added a separate PERC 4/SC and moved the Windows/boot files to their own RAID 1 array.  This relieved Windows of its unresponsiveness, but the problem still persists on the DATA drive.
•      Installed http://support.microsoft.com/kb/915691
•      Turned off antivirus
•      Defragged Drive
•      Installed Diskeeper 2007 Enterprise Edition
•      Switched from 1000BaseTx to 100BaseTx
•      Changed to No Read Ahead/Write Through/Cache IO Raid Policy
•      Ran various Dell Diagnostics tools… server passed with no errors.  Event Viewer shows no errors.
•      Ran raid consistency checks
•      Tried other various solutions found in experts-exchange.com

Temporary Solutions that work:

•      Reboot the server
•      Disable the network card thereby disconnecting all users.

Solutions that I’m considering:

•      Replace the PERC 4e/Di (faulty?) and recreate the RAID 5 array (backup/restore data)
•      Install Windows Server 2003 SP2

Thank you, and I hope someone can provide some insight.




Question by:JDCRC
17 Comments
 
LVL 63

Expert Comment

by:SysExpert
1) Are there any warnings or errors in the Event logs?
2) Have you tried running perfmon to see exactly where the issue is occurring?

I hope this helps!
 

Author Comment

by:JDCRC
1. Unfortunately there are no warnings or errors in Event Viewer.
2. I monitor the server in real time using another box running perfmon.  The only problem perfmon shows me is 100% disk read time on the data drive's logical disk, with virtually no disk write time on the same logical disk.  Since I moved the Windows/boot files to a separate RAID 1 controller, Windows hasn't been affected during the IO jumps.  However, users still get disconnected, and browsing files on the server takes a horrendously long time.
 
LVL 1

Expert Comment

by:AAckley
You said that you defragged the drive; since you are running Diskeeper, you can get a report on the MFT.

It's quite possible that the number of files has used up a good portion of the MFT.  Once the MFT gets full, it can start to get corrupted quite a bit.  If it is fragmented and/or corrupted, you will see exactly this problem as the system attempts to find all the files.

The MFT lists the locations and sizes of all the files on the drives.  When that file is full or fragmented, that process of reading the MFT takes a long time and that is added onto the time it takes to find and read the files in question.

What's the status of your MFT?
 

Author Comment

by:JDCRC
Total MFT size: 1,642 MB
MFT records in use: 1,458,938
Percent MFT in use: 86%
Total MFT Fragments: 0

I've run Diskeeper's MFT defragment utility multiple times before, and usage stays at 86% even though I set it to resize automatically.  It did, however, defrag the MFT during the first run.

Thanks for all the inputs so far.
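
As a rough cross-check of the figures above, the short Python sketch below recomputes the percentage from the record count. It assumes each MFT record occupies 1 KB (the usual NTFS default, not something the Diskeeper report states) and that the report's "MB" means 2^20 bytes. The function name is made up for illustration.

```python
# Assumption: 1 KB per MFT record (common NTFS default, not taken from the report).
MFT_RECORD_BYTES = 1024

def mft_percent_in_use(records_in_use, total_mft_mb, record_bytes=MFT_RECORD_BYTES):
    """Share of the MFT occupied by in-use records, as a percentage."""
    used_mb = records_in_use * record_bytes / (1024.0 * 1024.0)
    return 100.0 * used_mb / total_mft_mb

# Figures from the Diskeeper report above.
pct = mft_percent_in_use(1458938, 1642)
print("%.1f%% of the MFT in use" % pct)
```

Under those assumptions the result lands between 86% and 87%, which agrees with the reported 86% and suggests the report is internally consistent.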
 
LVL 1

Expert Comment

by:AAckley
Ok so that is not the problem.  That is high usage but not horrible and there are no fragments which is very good and says that it's not overly used.

Since you mentioned that disconnecting the NIC resets the problem, have you tried changing NICs?  There could be a problem with the NIC.  Drop an off-the-shelf workstation NIC in and use that for a while just to see what happens.
 

Author Comment

by:JDCRC
Yes, the server comes with 2x Intel Pro 1000/MT.  I've tried both NICs @ 100Tx half and full duplex, and I've also changed the cat6 cable and the switch.

Thank you.
 
LVL 9

Expert Comment

by:FixingStuff
I have had this same problem with a very similar setup. The issue arose when a user took a sizable (multi-gig) archive file like a .ZIP and extracted it back to the same location, or to another folder on the same network drive. The IO contention on the drive array from a large read and write back to the same location can kill... especially RAID 5, as it is not the highest-performing of the RAID options.
The trick is nailing down the "who" and retraining them to use separate source and target drives. I have a small set of users (about 40) so it's not too bad here. I'm not sure how you get disk IO performance counters on a per-user basis...     Anyone know?
fs
 

Author Comment

by:JDCRC
Interesting... I was looking for an application that would track IO usage on a per-user level but was unsuccessful in locating one (I was using Task Manager's IO process counters, but those only track per program).  That would definitely help me track down the user or users causing this, if that is the case.  Another thing I did notice is that the same 100% IO read utilization happens during backup; this sometimes causes all my shadow copies to be deleted due to high disk activity... another pain in itself.

Thanks again. Keep the suggestions coming.
 
LVL 3

Accepted Solution

by:VXDguy
VXDguy earned 168 total points
Too many apps doing synchronous disk I/O calls.  This is why for Windows servers, IOPS matters more than MB/sec.

Picture this:

Your hard drive is a bank.
Your disk controller is the ATM.
Your apps are people wanting to get to the ATM.

The backup app takes $64k out at a time, then gets right back in line to deposit $64k.  If you're lucky, he's going to a different ATM so he's in a second line somewhere else.  If you're not lucky, he's back in the same line.

The ZIP app takes $32k out at a time, then gets right back in line, deposits $20k, and repeats.  Like the Enron CEO, just on a smaller scale.

Explorer apps get in line to get an account inquiry (directory list).  He's got 200 accounts (files).  He gets back in line and does one account inquiry for each of the 200 accounts just so he can get a pretty deposit slip (file icon).  Explorer apps should be shot.

The more apps that want disk I/O, the longer the line.

Once you get in line, you cannot get out.  Synchronous I/O calls make the I/O request, and WAITS for the I/O to complete.  This is a bad thing in a heavily multi-user environment.

Asynchronous I/O calls mean dispatching your intern to go stand in line for you and bring back the results when he's done.  In the meantime, you can get other things done (like updating the user interface).  But an application needs to be designed to use asynchronous I/O.  And as any HR department will tell you, managing interns is a complicated process.  It's simpler to just go do the I/O in synchronous mode than to write an intern thread to take care of it asynchronously.
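
The difference between the two styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `slow_read` is a made-up stand-in that sleeps instead of touching a disk, the file names are hypothetical, and worker threads play the part of the interns. On Windows, real asynchronous I/O is done with overlapped I/O rather than a thread pool like this one.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_read(name, delay=0.05):
    """Stand-in for a blocking disk read; the sleep simulates waiting in line."""
    time.sleep(delay)
    return "data from " + name

def read_synchronously(names):
    # The caller stands in line itself: each request must finish
    # before the next one is even issued.
    return [slow_read(n) for n in names]

def read_asynchronously(names):
    # Worker threads (the "interns") do the waiting. The caller gets
    # futures back immediately and collects the results later.
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        futures = [pool.submit(slow_read, n) for n in names]
        # ... the caller could update the user interface here ...
        return [f.result() for f in futures]

names = ["mail.pst", "archive.zip", "report.doc", "image.psd"]  # hypothetical files

start = time.perf_counter()
sync_results = read_synchronously(names)
sync_seconds = time.perf_counter() - start

start = time.perf_counter()
async_results = read_asynchronously(names)
async_seconds = time.perf_counter() - start

print("synchronous:  %.2f s" % sync_seconds)
print("asynchronous: %.2f s" % async_seconds)
```

Both versions return the same data; the asynchronous one simply does not make the caller wait on each request in turn. Note that the simulated reads here do not compete for a shared disk, so on a real array the gain depends on how well the controller can queue and reorder the outstanding requests.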

To FIX the problem...

Increase I/O queue depth on the disk controller if possible.  This would place multiple ATMs for the people in line, giving faster reads/writes.  Faster because the line is shorter and the controller can decide the fastest way to reorder the read/write requests.

Having multiple smaller disks will be faster than fewer larger disks.  I/O queuing is on a per-LUN/disk basis, not per controller.  The controller typically has a finite number of buffers shared across all LUNs.

Reorder read/writes to different controllers (preferred) or disks.  Withdraw from one ATM, deposit to another ATM with a different bank.

Separate USER drives from WORK drives.  The amount of disk I/O generated by simply browsing folders with Explorer is insane.

You may want to think about going to a higher performance disk array.  An HDS 9980V can easily do 48,000 IOPS (I/Os per second).  Downside is that it's a bit pricey (seven figures).  The AMS/WMS line is five figures and still has outstanding IOPS.

During all this, your throughput may be quite small.  It's not necessarily transferring lots of data, just making lots of requests, and the requests sit in a line until they're processed.  This is when the system's UI becomes unresponsive, apps die, Explorer explodes, and mouse clicks take fifteen seconds to process.
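
To put rough numbers on the last point, here is a back-of-the-envelope sketch in Python. Only the 48,000 IOPS figure comes from the post above; the 4 KB request size, the 1,000 IOPS array, and the 15,000 queued requests are hypothetical values chosen for illustration, and the wait estimate assumes requests are served strictly in order.

```python
def throughput_mb_per_s(iops, request_kb):
    """Data rate implied by a given request rate and request size."""
    return iops * request_kb / 1024.0

def wait_seconds(requests_ahead, iops):
    """Time a new request waits if everything ahead of it is served first."""
    return requests_ahead / float(iops)

# 48,000 IOPS is the figure quoted above; the 4 KB request size is an assumption.
print(throughput_mb_per_s(48000, 4))   # 187.5

# Hypothetical array: 1,000 IOPS with 15,000 small requests already queued.
print(throughput_mb_per_s(1000, 4))    # 3.90625
print(wait_seconds(15000, 1000))       # 15.0
```

In the hypothetical case the array moves under 4 MB/sec, which looks idle on a throughput graph, yet a new request still waits about fifteen seconds behind the queue.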
 
LVL 9

Assisted Solution

by:FixingStuff
FixingStuff earned 166 total points
@ VXDguy

"Like the Enron CEO, just on a smaller scale."  LOL

Good analogy. I totally agree with the IOPS and disk queue length.

@ JDCRC   In Perfmon, you should watch the Physical Disk, Avg. Disk Queue Length to see if this indeed is an issue.    Also, to help narrow down the possible "offending user" list, you can use 'Net Files' at the command line to see who has files open, or use Computer Management Console, Shared Folders, Open Files. Same list different methods.
fs
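
If you would rather log the counter than watch it live, the built-in typeperf tool can write it to a CSV file, for example: typeperf "\PhysicalDisk(_Total)\Avg. Disk Queue Length" -si 5 -sc 60 -o queue.csv. The Python sketch below shows one way to pick out the busy samples from such a log. The sample rows, the server name FILESRV, and the threshold of 2.0 are all made up for illustration, and the function name is hypothetical; choose a threshold that suits your own array.

```python
import csv
import io

# Illustrative data in typeperf's CSV layout (timestamp, counter value).
# These are NOT real measurements from the server in this thread.
SAMPLE = r'''"(PDH-CSV 4.0)","\\FILESRV\PhysicalDisk(_Total)\Avg. Disk Queue Length"
"03/27/2007 10:15:00.000","0.41"
"03/27/2007 10:15:05.000","0.87"
"03/27/2007 10:15:10.000","12.30"
"03/27/2007 10:15:15.000","25.06"
"03/27/2007 10:15:20.000","1.02"
'''

def busy_samples(csv_text, threshold):
    """Return (timestamp, value) pairs whose queue length exceeds the threshold."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows)  # skip the header row with the counter names
    hits = []
    for row in rows:
        if len(row) < 2:
            continue
        try:
            value = float(row[1])
        except ValueError:
            continue  # skip samples with no numeric value
        if value > threshold:
            hits.append((row[0], value))
    return hits

hits = busy_samples(SAMPLE, 2.0)  # 2.0 is an assumed threshold
for stamp, value in hits:
    print(stamp, value)
```

Lining up the flagged timestamps with the Open Files list from the same minutes should help narrow down who was doing what when the queue built up.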
 

Author Comment

by:JDCRC
Ok perfect.  I will add the disk queue length counter on my perfmon box and see if that is what is causing the issue.  I'm already using the file server management console to check which users are connected during the IO jumps.  As far as I can tell, some of the users are attached to their Outlook PST files, which are by and large pretty big.  Maybe it's the Outlook PST files that are causing the issue.

There are around 20 locked PST files at any given time ranging from 200 megs to 4 gigs.  

Great Feedback!
 
LVL 1

Assisted Solution

by:AAckley
AAckley earned 166 total points
I really find it hard to believe that it's the disks causing the problem.  I run 100 users on a virtual server with 4 250GB drives in RAID 5 for what sounds like roughly the same thing.  These same disks also host 2 web servers and a certificate server.  All of my users have their PST files open all day long, and they range in size from 200MB to over 10GB for a few users (who constantly get yelled at).

It may end up being the drive, but since things lock up... I'd be more willing to think somewhere else is causing the issue.  It could be a controller issue.
 

Author Comment

by:JDCRC
AAckley, it's very good to know that someone actually has a very similar setup without any issues.  I do find it hard to believe that a file server of this caliber can't handle this amount of traffic.  I was planning to move the PST files to a new server (and have finance yell at me for making a PO for another file server).

Can you please tell me the model of your raid controller as well as the raid policy?

PS FixingStuff:  I added the avg. disk read queue length and avg. disk write queue length counters in perfmon.  We had the issue pop up again today, and the counters seem to match my old % Disk Read Time counter.  So other than having another counter in perfmon, I don't see the advantage over % Disk Read Time.
 
LVL 1

Expert Comment

by:AAckley
Ok actual full setup of this server is:
Dell 2950
Dual Xeon PIVs with 8GB RAM
Perc 5i controller (128MB) with 6 146GB SAS drives, configured as 2 in a RAID 1 (mirror) and 4 in a RAID 5.

The server runs Win2k3 Enterprise with Microsoft Virtual Server 2005 R2.

This hosts a file server for approx. 100 users including department files, application files, user folders, psts, etc.  2 Web Servers, a Small SQL 2005 Database server and a Certificate Server.

This is a huge upgrade from the old Dell server with dual P2 400s and 4 18GB SCSI drives on an old Perc 3 card, which ran the file server for the same users up until 6 months ago without lockups.  It was quite a bit slower, but never to the point you're talking about.