Server slow - very high disk activity
Posted on 2009-07-10
We have a client who has a real issue with their server. The spec is:
Supermicro tower with hot-swap bays
Supermicro X7DVL-E motherboard
Intel Xeon 5405 CPU
3GB ECC Kingston DDR2 RAM
4x WD-10EACS 1TB HDD in RAID-5 with hot-spare
Single RAID-5 container partitioned into two drives: C: 50GB; D: 1.77TB
On-board Intel ESB2 RAID controller with latest 8.5 drivers from Intel
Latest build and updates for SBS 2003 R2
20 or so client machines. Reasonably large Exchange MDB (16GB). Lots of data transfer back and forth, with some quite large files being transferred to and from the data drive.
Gigabit Ethernet network with structured cabling.
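A quick sanity check of the partition sizes against the array layout, as a rough sketch only. It assumes three of the four disks are active RAID-5 members with the fourth as the hot-spare, that each drive holds 10^12 bytes, and that Windows reports sizes in binary units; none of this has been confirmed on the controller itself.

```python
# Assumptions (not confirmed on the controller):
DISK_BYTES = 10**12      # a marketed "1TB" drive
ACTIVE_MEMBERS = 3       # 4 disks minus 1 hot-spare
TIB = 2**40
GIB = 2**30


def raid5_usable_bytes(members, disk_bytes):
    """RAID-5 spends one disk's worth of space on parity: usable = (n - 1) disks."""
    return (members - 1) * disk_bytes


usable_tib = raid5_usable_bytes(ACTIVE_MEMBERS, DISK_BYTES) / TIB
# C: 50GB plus D: 1.77TB, read as binary units the way Windows displays them
reported_tib = 50 * GIB / TIB + 1.77

print(f"usable:   {usable_tib:.2f} TiB")
print(f"reported: {reported_tib:.2f} TiB")
```

Under these assumptions both figures come out at about 1.82 TiB, so the two partitions appear to account for the whole container.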
1. The server is very slow and unresponsive at totally random times. When logged into it over RDP, I can click something and it may open straight away, or it may take up to 1 minute to happen. This will happen over and over again in the same session. All the while, Task Manager shows nothing going on that might cause it.
2. Users are reporting "lock-ups" and "lock-outs" when accessing the drives across the network. This can happen two or three times in a day or not for two weeks.
3. The disk light on the front of the server is on _all the time_ as if the disk is really churning badly.
4. When using Matrix Storage Manager (which frequently crashes on opening and is very slow to use) to view the RAID, it says all is OK with the RAID. When you right-click to enable disk cache and hard-drive data cache, you can enable them but, crucially, when you next open MSM, they are disabled again. The same thing happens when enabling both options at Device Manager/hardware level.
5. The server sometimes takes ages to log us in over RDP. This morning, the user reported no access to the users and it took us over 10 minutes to get a log-in screen.
6. Software and Process Explorer report bizarre amounts of resources being used by random programs: Backup Exec Remote Agent maxing out on I/O; SQL Server from SBS monitoring doing the same; ESET HTTP server (updates AV clients on the network) maxing out on memory and I/O. Crucially, however, when the server is doing its slow thing and we stop one or more of these services, it temporarily gets better.
7. The machine is _always_ paging to HDD. It seems to be maxed out on physical RAM all the time. It's only 32-bit so we can only stick 3GB in it.
We have set up Performance Monitor counters on the HDD and we are seeing reasonably normal I/O most of the time. Right now the Avg. Disk Queue Length is 1.5 and the idle time is 95%, but at other times these are very different.
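To catch the bad periods rather than the quiet ones, the counter log could be filtered for spikes. Below is a minimal sketch in Python, assuming the counters have been exported to CSV. The column names, the sample rows and the threshold (roughly 2 queued requests per active spindle, a commonly quoted rule of thumb) are all assumptions for illustration, not values from this server.

```python
import csv
import io

# Assumptions: 3 active spindles in the array, and a rule-of-thumb
# limit of about 2 queued requests per spindle.
ACTIVE_SPINDLES = 3
QUEUE_PER_SPINDLE_LIMIT = 2.0


def find_busy_samples(csv_text,
                      queue_col="avg_disk_queue_length",
                      idle_col="pct_idle_time"):
    """Return (timestamp, queue, idle) for samples whose queue exceeds the limit."""
    limit = ACTIVE_SPINDLES * QUEUE_PER_SPINDLE_LIMIT
    busy = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        queue = float(row[queue_col])
        idle = float(row[idle_col])
        if queue > limit:
            busy.append((row["timestamp"], queue, idle))
    return busy


# Made-up rows for illustration only; the column names are hypothetical.
SAMPLE = """timestamp,avg_disk_queue_length,pct_idle_time
09:00:00,1.5,95
09:00:15,22.0,3
09:00:30,0.8,97
"""

if __name__ == "__main__":
    for ts, queue, idle in find_busy_samples(SAMPLE):
        print(f"{ts}: queue={queue:.1f} idle={idle:.0f}%")
```

Matching the flagged timestamps against what Process Explorer shows at the same moments would indicate whether the spikes line up with the backup agent, SQL Server or the ESET HTTP server.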
Scanned properly for viruses and spyware. RAM has been tested.
My gut feeling is we have a duff RAID controller but I'd really like an expert to have an opinion on this.
Please don't post a reply with basic stuff in it, no matter how helpful you think you are being! We now need someone who _really_ knows what they are talking about.