JDCRC

Windows 2003 unresponsive during high IO utilization.

Hello Experts,

First the specs:

Role: Main file server/data dump.  The server holds small and large files, including Outlook PSTs ranging from 200 megs to 5 gigs, as well as large creative images (1 gig) and small documents from 1k to 25 megs in size.

Dell Poweredge 2800
2x Intel Xeon 3ghz hyperthreading off
2gig DDR2 ECC
AHA-39160 Ultra 160 PCI used by Powervault 124T LTO3 Autoloader
PERC 4/SC PCI
PERC 4e/Di 256meg (Embedded)
8x 300gig Ultra 320 RAID 5 on PERC 4e/Di (Data)
2x 36gig Ultra 320 RAID 1 on PERC 4/SC (Windows/Boot files)
Adaptive Read Ahead/Write Back/Direct IO Raid Policy
Intel Pro 1000/MT running at 100BaseT Full Duplex
Windows Server 2003 R2 SP1 /w up to date patches
120 users approx 50 users connected at any given time

•      Symantec Antivirus Client 10.0.1
•      Symantec Backup Exec 10d
•      Diskeeper 2007 Enterprise Edition

Symptoms: Disk read time on the data drive randomly jumps to 100% for a few minutes during peak hours, which slows the server to the point where users get disconnected and Windows becomes unresponsive. According to perfmon, write time on the data drive and CPU/RAM/network utilization are minimal during this time.

Solutions that didn’t work:

•      Originally Windows was installed on a separate partition of the same RAID 5 array as the data drive (8x 300gig Ultra 320 RAID 5 on PERC 4e/Di), so Windows became unresponsive whenever the IO jumped.  I recently added a separate PERC 4/SC and moved the Windows/boot files to their own RAID 1 array.  This relieved Windows of its unresponsiveness, but the problem persists on the DATA drive.
•      Installed http://support.microsoft.com/kb/915691
•      Turned off antivirus
•      Defragged Drive
•      Installed Diskeeper 2007 Enterprise Edition
•      Switched from 1000BaseTx to 100BaseTx
•      Changed to No Read Ahead/Write Through/Cache IO Raid Policy
•      Ran various Dell Diagnostics tools… server passed with no errors.  Event Viewer shows no errors.
•      Ran raid consistency checks
•      Tried other various solutions found in experts-exchange.com

Temporary Solutions that work:

•      Reboot the server
•      Disable the network card thereby disconnecting all users.

Solutions that I’m considering:

•      Replace the PERC 4e/Di (faulty?), recreate the RAID 5 array and restore the data from backup
•      Install Windows Server 2003 SP2

Thank you, and I hope someone can provide some insight.




Storage, Windows Server 2003, Server Hardware

SysExpert

1) Are there any warnings or errors in the Event logs?
2) Have you tried running perfmon to see exactly where the issue is occurring?

I hope this helps !
JDCRC

ASKER

1. Unfortunately there are no warnings or errors in event viewer.  
2. I monitor the server in real time from another box running perfmon.  The only anomaly perfmon shows is 100% Disk Read Time on the data drive's logical disk, with virtually no Disk Write Time on the same logical disk.  Since I moved the Windows/boot files to a separate RAID 1 controller, Windows itself hasn't been affected during the IO jumps.  However, users still get disconnected and browsing files on the server takes a horrendously long time.
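
One way to pin down when the read spikes happen is to save the perfmon counters to a CSV log and scan it for sustained runs at or near 100%. The sketch below is only an illustration: the column names and sample rows are made up, and `find_spikes` is a hypothetical helper, not part of any Windows tool.

```python
import csv
import io


def find_spikes(csv_text, counter_substring, threshold=95.0, min_samples=3):
    """Return (first_timestamp, last_timestamp, sample_count) for every run of
    consecutive samples where the chosen counter stays at or above threshold."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Use the first column whose header contains the requested counter name.
    col = next(i for i, name in enumerate(header) if counter_substring in name)
    spikes, run = [], []
    for row in reader:
        try:
            value = float(row[col])
        except (ValueError, IndexError):
            value = None  # blank or malformed sample ends the current run
        if value is not None and value >= threshold:
            run.append(row[0])  # column 0 is assumed to hold the timestamp
        else:
            if len(run) >= min_samples:
                spikes.append((run[0], run[-1], len(run)))
            run = []
    if len(run) >= min_samples:
        spikes.append((run[0], run[-1], len(run)))
    return spikes


# Hypothetical log excerpt, sampled every 15 seconds.
SAMPLE = '''"time","% Disk Read Time","% Disk Write Time"
"10:00:00","12.5","1.0"
"10:00:15","99.8","0.7"
"10:00:30","100.0","0.4"
"10:00:45","100.0","0.9"
"10:01:00","35.0","1.2"
"10:01:15","100.0","0.5"
'''

for start, end, count in find_spikes(SAMPLE, "% Disk Read Time"):
    print(f"read spike from {start} to {end} ({count} samples)")
```

The timestamps of each run can then be compared against whatever else is logged for the same period, such as the backup schedule or the list of connected users.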
AAckley

You said that you defragged the drive; since you are running Diskeeper you can get a report on the MFT.

It's quite possible that the number of files has used up a good portion of the MFT.  Once that gets full it starts corrupting quite a bit.  If that is fragmented and/or corrupted you will see exactly this problem as it attempts to find all the files.  

The MFT lists the locations and sizes of all the files on the drives.  When that file is full or fragmented, that process of reading the MFT takes a long time and that is added onto the time it takes to find and read the files in question.

What's the status of your MFT?
JDCRC

ASKER

Total MFT size: 1,642 MB
MFT records in use: 1,458,938
Percent MFT in use: 86%
Total MFT Fragments: 0

I've run Diskeeper's MFT defragment utility multiple times before, and usage stays at 86% even though I set it to resize automatically.   It did, however, defragment the MFT during the first run.

Thanks for all the inputs so far.
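
The MFT figures reported above look internally consistent, which may explain why the percentage does not move after a defrag. The quick check below assumes the default NTFS file record size of 1 KB (an assumption; the thread does not state the record size on this volume) and simply multiplies it out.

```python
MFT_SIZE_MB = 1642          # "Total MFT size" from the Diskeeper report
RECORDS_IN_USE = 1_458_938  # "MFT records in use" from the Diskeeper report
RECORD_SIZE_BYTES = 1024    # assumption: default NTFS file record size

# Space occupied by the records that are in use, in megabytes.
used_mb = RECORDS_IN_USE * RECORD_SIZE_BYTES / (1024 * 1024)
percent_in_use = used_mb / MFT_SIZE_MB * 100

print(f"{used_mb:.0f} MB of {MFT_SIZE_MB} MB used ({percent_in_use:.1f}%)")
# prints "1425 MB of 1642 MB used (86.8%)"
```

Under that assumption the usage figure is close to the reported 86% and is driven by the number of files and folders on the volume rather than by fragmentation, so defragmenting the MFT would not be expected to lower it.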
AAckley

Ok so that is not the problem.  That is high usage but not horrible and there are no fragments which is very good and says that it's not overly used.

Since you mentioned that disconnecting the NIC resets the problem, have you tried changing NICs?  There could be a problem with the NIC.  Drop an off-the-shelf workstation NIC in and use that for a while just to see what happens.
JDCRC

ASKER

Yes, the server comes with 2x Intel Pro 1000/MT.  I've tried both NICs at 100BaseTx, half and full duplex, and also changed the Cat6 cable and the switch.

Thank you.
Dean Chafee

I have had this same problem with a very similar setup. The issue occurred when a user took a sizable (multi-gig) archive file such as a .ZIP and extracted it back to the same location, or to another folder on the same network drive. The IO contention on the drive array from a large read and write back to the same location can kill performance, especially on RAID 5, as it is not the highest-performing of the RAID options.
The trick is nailing down the "who" and retraining them to use separate source and target drives. I have a small set of users (about 40) so it's not too bad here. I'm not sure how you get disk IO performance counters on a per-user basis...     Anyone know?
fs
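
To put rough numbers on the RAID 5 point, the common back-of-the-envelope model charges four disk operations for every small random write (read data, read parity, write data, write parity). The sketch below applies that model to the 8-disk array from the original post. The per-disk IOPS figure is an assumed round number, not a measurement, and the model ignores controller cache and full-stripe writes, so treat the output as an order-of-magnitude guide only.

```python
def effective_iops(disks, iops_per_disk, write_fraction, write_penalty):
    """Host-visible random IOPS an array can sustain for a given read/write mix.

    Each host read costs one disk operation; each host write costs
    `write_penalty` disk operations.
    """
    raw = disks * iops_per_disk
    return raw / ((1 - write_fraction) + write_fraction * write_penalty)


DISKS = 8            # from the original post: 8x 300gig in RAID 5
IOPS_PER_DISK = 150  # assumption: rough figure for a single SCSI spindle
RAID5_PENALTY = 4    # small-write penalty in the simple RAID 5 model

for write_fraction in (0.0, 0.1, 0.5):
    iops = effective_iops(DISKS, IOPS_PER_DISK, write_fraction, RAID5_PENALTY)
    print(f"{write_fraction:.0%} writes: ~{iops:.0f} IOPS")
```

In this model a workload that is half writes, such as extracting an archive back onto the same array, gets well under half the throughput of a pure read workload.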
JDCRC

ASKER

Interesting... I was looking for an application that would track IO usage at a per-user level but was unsuccessful in locating one (I was using Task Manager's IO process counters, but those only track per program).  That would definitely help me track down the user or users causing this, if that is the case.   Another thing I noticed is that the same 100% IO read utilization happens during backup; this sometimes causes all my shadow copies to be deleted due to high disk activity.... another pain in itself.

Thanks again. Keep the suggestions coming.
ASKER CERTIFIED SOLUTION
VXDguy
SOLUTION
Dean Chafee
JDCRC

ASKER

Ok perfect.  I will add the disk queue length counter on my perfmon box and see if that is what is causing the issue.   I'm already using the file server management console to check which users are connected during the IO jumps.  As far as I can tell, some of the users are attached to their Outlook PST files, which are by and large pretty big.   Maybe it's the Outlook PST files that are causing the issue.

There are around 20 locked PST files at any given time ranging from 200 megs to 4 gigs.  

Great Feedback!
SOLUTION
AAckley
JDCRC

ASKER

AAckley, it's very good to know that someone actually has a very similar setup without any issues.  I do find it hard to believe that a file server of this caliber can't handle this amount of traffic.  I was planning to move the PST files to a new server (and have finance yell at me for making a PO for another file server).  

Can you please tell me the model of your raid controller as well as the raid policy?

PS FixingStuff:  I added Avg. Disk Read Queue Length and Avg. Disk Write Queue Length in perfmon.   We had the issue pop up again today and the new counters seem to track my old % Disk Read Time counter.  So other than having another counter in perfmon, I don't see the advantage over % Disk Read Time.
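
The two counters tracking each other is expected if, as is commonly documented, the % Disk Time family is essentially the matching Avg. Disk Queue Length multiplied by 100 and capped at 100 in the perfmon display. If that holds, the queue length counter is the more useful one during a spike because it keeps rising once the percentage has pegged. The sketch below illustrates the relationship with made-up queue values. The threshold of about 2 outstanding requests per spindle is a widely quoted rule of thumb, not something stated in this thread.

```python
def percent_disk_time(avg_queue_length):
    """Displayed percentage, assuming it is queue length x 100 capped at 100."""
    return min(avg_queue_length * 100.0, 100.0)


def queue_per_spindle(avg_queue_length, spindles):
    """Average outstanding requests per physical disk in the array."""
    return avg_queue_length / spindles


SPINDLES = 8         # the 8-disk RAID 5 data array from the original post
RULE_OF_THUMB = 2.0  # assumption: commonly quoted per-spindle threshold

for queue in (0.4, 1.0, 6.0, 40.0):  # hypothetical Avg. Disk Read Queue Length samples
    per_disk = queue_per_spindle(queue, SPINDLES)
    status = "backlogged" if per_disk > RULE_OF_THUMB else "ok"
    print(f"queue {queue:5.1f}: {percent_disk_time(queue):5.1f}% read time, "
          f"{per_disk:.2f} per spindle ({status})")
```

In this illustration queue lengths of 1, 6 and 40 all show as 100% read time, but only the last one exceeds the per-spindle rule of thumb.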
AAckley

Ok, the actual full setup of this server is:
Dell 2950
Dual Xeon PIVs with 8GB RAM
Perc 5i controller (128MB) with 6 146GB SAS drives, configured as 2 in a RAID 1 (mirror) and 4 in a RAID 5.

The server runs Win2k3 Enterprise with Microsoft Virtual Server 2005 R2.

This hosts a file server for approx. 100 users, including department files, application files, user folders, PSTs, etc., plus 2 web servers, a small SQL 2005 database server and a certificate server.

This is a huge upgrade from the old Dell server with dual P2 400s and 4 18GB SCSI drives on an old Perc 3 card that was running the file server for the same users up until 6 months ago without lockups.  It was quite a bit slower, but never to the point you're talking about.