Solved

Windows 2003 unresponsive during high IO utilization.

Posted on 2007-03-27
Last Modified: 2012-08-14
Hello Experts,

First the specs:

Role: Main file server/data dump.  The server holds small and large files, including Outlook PSTs ranging from 200 MB to 5 GB, large creative images (1 GB), and small documents from 1 KB to 25 MB.

Dell Poweredge 2800
2x Intel Xeon 3ghz hyperthreading off
2gig DDR2 ECC
AHA-39160 Ultra 160 PCI used by Powervault 124T LTO3 Autoloader
PERC 4/SC PCI
PERC 4e/Di 256meg (Embedded)
8x 300gig Ultra 320 RAID 5 on PERC 4e/Di (Data)
2x 36gig Ultra 320 RAID 1 on PERC 4/SC (Windows/Boot files)
Adaptive Read Ahead/Write Back/Direct IO Raid Policy
Intel Pro 1000/MT running at 100BaseT Full Duplex
Windows Server 2003 R2 SP1 /w up to date patches
120 users approx 50 users connected at any given time

•      Symantec Antivirus Client 10.0.1
•      Symantec Backup Exec 10d
•      Diskeeper 2007 Enterprise Edition

Symptoms: Disk read time on the data drive jumps to 100% at random for a few minutes during peak hours, which slows the server to the point where users get disconnected and Windows becomes unresponsive. According to perfmon, write time on the data drive and CPU/RAM/network utilization are all minimal during this time.
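One way to pin down these episodes is to log the counters and flag the samples where reads are saturated while writes stay idle. The following is a minimal sketch, assuming the counters have been exported to CSV (for example from a perfmon counter log). The column names, thresholds, and sample values are hypothetical and are not taken from this server.

```python
import csv
import io

# Hypothetical column names; a real perfmon CSV export uses full counter paths.
READ_COL = "% Disk Read Time"
WRITE_COL = "% Disk Write Time"


def find_read_spikes(csv_text, read_threshold=95.0, write_threshold=10.0):
    """Return timestamps where read time is saturated but write time is low."""
    spikes = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        read_pct = float(row[READ_COL])
        write_pct = float(row[WRITE_COL])
        if read_pct >= read_threshold and write_pct <= write_threshold:
            spikes.append(row["Time"])
    return spikes


# Made-up sample data, for illustration only.
SAMPLE = """Time,% Disk Read Time,% Disk Write Time
10:00:00,12.5,3.0
10:00:15,100.0,1.2
10:00:30,99.1,0.8
10:00:45,40.0,35.0
"""

if __name__ == "__main__":
    print(find_read_spikes(SAMPLE))
```

The flagged timestamps can then be lined up against backup schedules or the list of connected users to look for a pattern.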

Solutions that didn’t work:

•      Originally Windows was installed on a separate partition but on the same RAID 5 array as the data drive (8x 300gig Ultra 320 RAID 5 on PERC 4e/Di), so Windows became unresponsive whenever the IO jumped.  I recently added a separate PERC 4/SC and moved the Windows/boot files to their own RAID 1 array.  This relieved Windows of its unresponsiveness, but the problem still persists on the data drive.
•      Installed http://support.microsoft.com/kb/915691
•      Turned off antivirus
•      Defragged Drive
•      Installed Diskeeper 2007 Enterprise Edition
•      Switched from 1000BaseTx to 100BaseTx
•      Changed to No Read Ahead/Write Through/Cache IO Raid Policy
•      Ran various Dell Diagnostics tools… server passed with no errors.  Event Viewer shows no errors.
•      Ran raid consistency checks
•      Tried other various solutions found in experts-exchange.com

Temporary Solutions that work:

•      Reboot the server
•      Disable the network card thereby disconnecting all users.

Solutions that I’m considering:

•      Replace the PERC 4e/Di (faulty?) and recreate the RAID 5 array, backing up and restoring the data
•      Install Windows Server 2003 SP2

Thank you, and I hope someone can provide some insight.




Question by:JDCRC
17 Comments
 
LVL 63

Expert Comment

by:SysExpert
ID: 18802435
1) Are there any warnings or errors in the Event logs?
2) Have you tried running perfmon to see exactly where the issue is occurring?

I hope this helps !
0
 

Author Comment

by:JDCRC
ID: 18802854
1. Unfortunately there are no warnings or errors in event viewer.  
2. I monitor the server in real time from another box running perfmon.  The only anomaly perfmon shows is 100% disk read time on the data drive's logical disk, with minimal disk write time on the same logical disk.  Since I moved the Windows/boot files to a separate RAID 1 controller, Windows hasn't been affected during the IO jumps.  However, users still get disconnected, and browsing files on the server takes a horrendously long time.
0
 
LVL 1

Expert Comment

by:AAckley
ID: 18803393
You said that you defragged the drive; since you are running Diskeeper, you can get a report on the MFT.

It's quite possible that the number of files has exceeded a good amount of the MFT.  Once that gets full it starts corrupting quite a bit.  If that is fragmented and/or corrupted you will see exactly this problem as it attempts to find all the files.  

The MFT lists the locations and sizes of all the files on the drives.  When that file is full or fragmented, that process of reading the MFT takes a long time and that is added onto the time it takes to find and read the files in question.

What's the status of your MFT?
0

 

Author Comment

by:JDCRC
ID: 18803637
Total MFT size: 1,642 MB
MFT records in use: 1,458,938
Percent MFT in use: 86%
Total MFT Fragments: 0

I've run Diskeeper's MFT defragment utility multiple times before, and usage stays at 86% even though I set it to resize automatically.  It did, however, defrag the MFT during the first run.
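As a sanity check on those figures, the reported usage can be reproduced from the record count. This is a rough sketch that assumes each MFT record occupies 1 KB (the usual NTFS default; the actual record size on this volume is not stated in the report):

```python
MFT_SIZE_MB = 1642            # "Total MFT size" from the report
RECORDS_IN_USE = 1_458_938    # "MFT records in use" from the report
RECORD_SIZE_BYTES = 1024      # Assumption: default NTFS file record size


def mft_percent_in_use(records, record_bytes, mft_size_mb):
    """Share of the MFT occupied by in-use records, as a percentage."""
    used_mb = records * record_bytes / (1024 * 1024)
    return 100.0 * used_mb / mft_size_mb


if __name__ == "__main__":
    pct = mft_percent_in_use(RECORDS_IN_USE, RECORD_SIZE_BYTES, MFT_SIZE_MB)
    print(f"{pct:.1f}% of the MFT in use")
```

Under that assumption the result comes out just under 87%, in line with the 86% that Diskeeper reports, so the number appears to reflect how many file records exist rather than anything the defragmenter can shrink.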

Thanks for all the inputs so far.
0
 
LVL 1

Expert Comment

by:AAckley
ID: 18803655
Ok so that is not the problem.  That is high usage but not horrible and there are no fragments which is very good and says that it's not overly used.

Since you mentioned that disconnecting the NIC resets the problem, have you tried changing NICs?  There could be a problem with the NIC.  Drop an off-the-shelf workstation NIC in and use that for a while just to see what happens.
0
 

Author Comment

by:JDCRC
ID: 18803711
Yes, the server comes with 2x Intel Pro 1000/MT.  I've tried both NICs at 100BaseTX half and full duplex, and I've also changed the Cat6 cable and the switch.

Thank you.
0
 
LVL 9

Expert Comment

by:FixingStuff
ID: 18804453
I have had this same problem with a very similar setup. The issue arises when a user takes a sizable (multi-gig) archive file such as a .ZIP and extracts it back to the same location, or to another folder on the same network drive. The IO contention on the drive array from a large read and write back to the same location can kill performance, especially on RAID 5, as it is not the highest-performing of the RAID options.
The trick is nailing down the "who" and retraining them to use a separate source and target drive. I have a small set of users (about 40) so it's not too bad here. I'm not sure how you get disk IO performance counters on a per-user basis...     Anyone know?
fs
0
 

Author Comment

by:JDCRC
ID: 18805360
Interesting... I was looking for an application that would track IO usage at a per-user level but was unsuccessful in locating one (I was using Task Manager's IO process counters, but those only track per program).  That would definitely help me track down the user or users causing this, if that is the case.   Another thing I noticed is that the same 100% IO read utilization happens during backup, which sometimes causes all my shadow copies to be deleted due to high disk activity.... another pain in itself.

Thanks again. Keep the suggestions coming.
0
 
LVL 3

Accepted Solution

by:VXDguy
VXDguy earned 168 total points
ID: 18805687
Too many apps doing synchronous disk I/O calls.  This is why for Windows servers, IOPS matters more than MB/sec.

Picture this:

Your hard drive is a bank.
Your disk controller is the ATM.
Your apps are people wanting to get to the ATM.

The backup app takes $64k out at a time, then gets right back in line to deposit $64k.  If you're lucky, he's going to a different ATM so he's in a second line somewhere else.  If you're not lucky, he's back in the same line.

The ZIP app takes $32k out at a time, then gets right back in line, deposits $20k, and repeats.  Like the Enron CEO, just on a smaller scale.

Explorer apps get in line to get an account inquiry (directory list).  He's got 200 accounts (files).  He gets back in line and does one account inquiry for each of the 200 accounts just so he can get a pretty deposit slip (file icon).  Explorer apps should be shot.

The more apps that want disk I/O, the longer the line.

Once you get in line, you cannot get out.  Synchronous I/O calls make the I/O request, and WAITS for the I/O to complete.  This is a bad thing in a heavily multi-user environment.

Asynchronous I/O calls mean dispatching your intern to go stand in line for you and bring back the results when he's done.  In the meantime, you can get other things done (like updating the user interface).  But an application needs to be designed to use asynchronous I/O.  And as any HR department will tell you, managing interns is a complicated process.  It's simpler to just go do the I/O in synchronous mode than to write an intern thread to take care of it asynchronously.
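The difference between standing in line yourself and sending an intern can be sketched in a few lines. This is only an illustration: `slow_read` is a made-up stand-in that sleeps instead of touching a disk, and a thread pool is just one of several ways to get asynchronous behaviour.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_read(name, delay=0.05):
    """Stand-in for a blocking disk read; time.sleep simulates the wait."""
    time.sleep(delay)
    return f"data from {name}"


def read_synchronously(names):
    # The caller stands in line for each request in turn.
    return [slow_read(n) for n in names]


def read_asynchronously(names):
    # Worker threads (the "interns") wait in line; the caller stays free
    # until it collects the results.
    with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
        futures = [pool.submit(slow_read, n) for n in names]
        # ... the caller could update the user interface here ...
        return [f.result() for f in futures]


if __name__ == "__main__":
    files = ["a.pst", "b.pst", "c.pst", "d.pst"]  # hypothetical file names
    for fn in (read_synchronously, read_asynchronously):
        start = time.perf_counter()
        fn(files)
        print(fn.__name__, round(time.perf_counter() - start, 2), "s")
```

Note that this does not make the disk any faster. It only keeps the caller from freezing while the requests sit in the queue, which is exactly the extra design work most applications skip.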

To FIX the problem...

Increase I/O queue depth on the disk controller if possible.  This is like adding multiple ATMs for the people in line, which gives faster reads/writes.  Faster because the line is shorter and the controller can decide the fastest way to reorder the read/write requests.
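Why reordering helps can be shown with a textbook elevator-style sweep. This is a simplified sketch, not how any particular PERC firmware schedules requests; the block numbers and the starting head position are made up.

```python
def head_travel(start, requests):
    """Total distance the head moves servicing requests in the given order."""
    travel, pos = 0, start
    for block in requests:
        travel += abs(block - pos)
        pos = block
    return travel


def elevator_order(start, requests):
    """Service everything at or above the head going up, then sweep back down."""
    upward = sorted(b for b in requests if b >= start)
    downward = sorted((b for b in requests if b < start), reverse=True)
    return upward + downward


if __name__ == "__main__":
    queue = [98, 183, 37, 122, 14, 124, 65, 67]  # made-up block numbers
    start = 53                                   # made-up head position
    print("FIFO travel:     ", head_travel(start, queue))
    print("Reordered travel:", head_travel(start, elevator_order(start, queue)))
```

With a queue depth of one the controller only ever sees a single request, so it has nothing to reorder. The deeper the queue it is allowed to see, the more seek distance it can save.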

Having multiple smaller disks will be faster than fewer larger disks.  I/O queuing is on a per-LUN/disk basis, not per controller.  The controller typically has a finite number of buffers shared across all LUNs.

Reorder read/writes to different controllers (preferred) or disks.  Withdraw from one ATM, deposit to another ATM with a different bank.

Separate USER drives from WORK drives.  The amount of disk I/O generated by simply browsing folders with Explorer is insane.

You may want to think about going to a higher performance disk array.  An HDS 9980V can easily do 48,000 IOPS (I/Os per second).  The downside is that it's a bit pricey (seven figures).  The AMS/WMS line is five figures and still has outstanding IOPS.

During all this, your throughput may be quite small.  The system is not necessarily transferring lots of data, just making lots of requests, and the requests sit in line until they're processed.  This is when the system's UI becomes unresponsive, apps die, Explorer explodes, and mouse clicks take fifteen seconds to process.
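A little arithmetic shows how a busy array can look idle in MB/sec terms. The numbers below are hypothetical round figures chosen for illustration, not measurements from the server in question.

```python
def throughput_mb_per_s(iops, request_kb):
    """Data rate implied by a given request rate and request size."""
    return iops * request_kb / 1024


def wait_for_last_in_line_s(queued_requests, iops):
    """Time until a request at the back of the queue gets serviced."""
    return queued_requests / iops


if __name__ == "__main__":
    iops = 500         # hypothetical: small random reads the array can sustain
    request_kb = 4     # hypothetical request size
    queued = 5000      # hypothetical backlog of outstanding requests

    print(f"Throughput: {throughput_mb_per_s(iops, request_kb):.2f} MB/s")
    print(f"Wait at back of line: {wait_for_last_in_line_s(queued, iops):.0f} s")
```

With those figures the array moves only about 2 MB/sec while a newly queued request waits ten seconds, which is why watching throughput alone can hide the problem.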
0
 
LVL 9

Assisted Solution

by:FixingStuff
FixingStuff earned 166 total points
ID: 18810310
@ VXDguy

"Like the Enron CEO, just on a smaller scale."  LOL

Good analogy. I totally agree with the IOPS and disk queue length.

@ JDCRC   In Perfmon, you should watch the Physical Disk, Avg. Disk Queue Length to see if this indeed is an issue.    Also, to help narrow down the possible "offending user" list, you can use 'Net Files' at the command line to see who has files open, or use Computer Management Console, Shared Folders, Open Files. Same list different methods.
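Once the open-file list is in hand, tallying it per user makes the heavy hitters easier to spot. A minimal sketch, assuming the list has already been reduced to (user, path) pairs; it does not parse the output of 'Net Files' itself, and the user names and paths are made up.

```python
from collections import Counter


def open_files_per_user(rows, suffix=""):
    """Count open files per user, optionally only paths ending in suffix."""
    counts = Counter()
    for user, path in rows:
        if path.lower().endswith(suffix.lower()):
            counts[user] += 1
    return counts.most_common()


# Hypothetical (user, path) pairs, for illustration only.
SAMPLE_ROWS = [
    ("alice", r"D:\Mail\alice.pst"),
    ("bob", r"D:\Docs\plan.doc"),
    ("alice", r"D:\Mail\archive.PST"),
    ("carol", r"D:\Mail\carol.pst"),
]

if __name__ == "__main__":
    print("All open files:", open_files_per_user(SAMPLE_ROWS))
    print("PST files only:", open_files_per_user(SAMPLE_ROWS, ".pst"))
```

Taking a snapshot like this during each IO spike and comparing the snapshots should show which users or file types keep turning up.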
fs
0
 

Author Comment

by:JDCRC
ID: 18811484
Ok perfect.  I will add the disk queue length counter on my perfmon box and see if that is what is causing the issue.   I'm already using the file server management console to check which users are connected during the IO jumps.  As far as I can tell, some of the users are attached to their Outlook PST files, which are by and large pretty big.   Maybe it's the Outlook PST files that are causing the issue.

There are around 20 locked PST files at any given time ranging from 200 megs to 4 gigs.  

Great Feedback!
0
 
LVL 1

Assisted Solution

by:AAckley
AAckley earned 166 total points
ID: 18811521
I really find it hard to believe that it's the disks causing the problem.  I run 100 users on a virtual server with 4 250GB RAID 5 drives for what sounds like roughly the same thing.  These same disks also host 2 web servers and a certificate server.  All of my users have their PST files open all day long, and they range in size from 200MB to over 10GB for a few users (who constantly get yelled at).

It may end up being the drive, but since things lock up... I'd be more willing to think somewhere else is causing the issue.  It could be a controller issue.
0
 

Author Comment

by:JDCRC
ID: 18811932
Aackley it's very good to know that someone actually has a very similar setup without any issues.  I do find it hard to believe that a file server of this caliber can't handle this amount of traffic.  I was planning to move the PST files (and have finance yell at me for making a PO for another file server) to a new server.  

Can you please tell me the model of your raid controller as well as the raid policy?

PS FixingStuff:  I added the avg disk read queue length and avg disk write queue length counters in perfmon.   We had the issue pop up again today, and the new counters seem to match my old % Disk Read Time counter.  So other than having another counter in perfmon, I don't see the advantage over % Disk Read Time.
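The two counters tracking each other is what one would expect if, as Microsoft's performance counter documentation is generally understood to describe, the % Disk Time family is derived from the matching Avg. Disk Queue Length counter multiplied by 100 and capped at 100% in the display. Treat that relationship as an assumption here; the sketch below only illustrates what it would imply, using made-up queue lengths.

```python
def percent_disk_time(avg_queue_length, cap=100.0):
    """Assumed relationship: queue length x 100, capped at 100%."""
    return min(avg_queue_length * 100.0, cap)


if __name__ == "__main__":
    for queue_length in (0.25, 1.0, 3.7, 12.0):  # made-up sample values
        print(f"queue {queue_length:>5}: {percent_disk_time(queue_length):.0f}%")
```

If that holds, the percentage counter stops at 100% once the disk is saturated, while the queue length keeps climbing and shows how deep the backlog actually gets.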
0
 
LVL 1

Expert Comment

by:AAckley
ID: 18812227
Ok actual full setup of this server is:
Dell 2950
Dual Xeon PIVs with 8GB RAM
Perc 5i controller (128MB) with 6 146GB SAS drives, configured as 2 in a RAID 1 (mirror) and 4 in a RAID 5.

The server runs Win2k3 Enterprise with Microsoft Virtual Server 2005 R2.

This hosts a file server for approx. 100 users including department files, application files, user folders, psts, etc.  2 Web Servers, a Small SQL 2005 Database server and a Certificate Server.

This is a huge upgrade from the old Dell server with dual P2 400s and 4 18GB SCSI drives on an old Perc 3 card, which ran the file server for the same users up until 6 months ago without lockups.  It was quite a bit slower, but never to the point you're talking about.
0


Question has a verified solution.
