quadraspleen

Server slow - very high disk activity

We have a client who has a real issue with their server. The spec is:

Supermicro tower with hot-swap bays
Supermicro X7DVL-E motherboard
Intel Xeon 5405 CPU
3GB ECC Kingston DDR2 RAM
4x WD-10EACS 1TB HDD in RAID-5 with hot-spare
Single RAID-5 container partitioned into two drives: C: 50GB; D: 1.77TB
On-board Intel ESB2 RAID controller with latest 8.5 drivers from Intel
Latest build and updates for SBS 2003 R2

20 or so client machines. Reasonably large Exchange MDB (16GB). Lots of data transfer back and forth, with some quite large files being transferred to and from the data drive.

Gigabit Ethernet network with structured cabling.

Issues:

1. The server is very slow and unresponsive at totally random times. When logged into it over RDP, I can click something and it may open straight away, or it may take up to a minute. This will happen over and over again in the same session. All the while, Task Manager shows nothing going on that might cause it.

2. Users are reporting "lock-ups" and "lock-outs" when accessing the drives across the network. This can happen two or three times in a day or not for two weeks.

3. The disk light on the front of the server is on _all the time_ as if the disk is really churning badly.

4. When using Matrix Storage Manager (which frequently crashes on opening and is very slow to use) to view the RAID, it says all is OK with the RAID, but when you right click to enable disk cache and hard-drive data cache, you can enable them but, crucially, when you next open MSM, they will be disabled again. The same thing applies when enabling both options at device manager/hardware level.

5. The server sometimes takes ages to log us in over RDP. This morning, the user reported no access to the users and it took us over 10 minutes to get a log-in screen.

6. Software and Process Explorer report bizarre amounts of resources being used by random programs: Backup Exec Remote Agent maxing out on I/O; SQL Server from SBS monitoring doing the same; ESET HTTP server (updates AV clients on the network) maxing out in memory and I/O. Crucially, however, when the server is doing its slow thing and we stop one or more of these services, it temporarily gets better.

7. The machine is _always_ paging to HDD. It seems to be maxed out on physical RAM all the time. It's only x32 so we can only stick 3GB in it.

We have set up perfmonitors on the HDD and we are seeing reasonably normal I/O most of the time. The av. disk queue length is now 1.5, the average idle time is now 95, but other times these will be very different.
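To catch the "other times" when the counters look very different, one option is to log the queue length to CSV and scan it for sustained spikes. The following is a minimal sketch, assuming a two-column export (timestamp, Avg. Disk Queue Length). The column layout, the sample data, the threshold of 2 and the three-sample minimum are all illustrative assumptions, not values taken from this server:

```python
import csv
import io


def find_queue_spikes(csv_text, threshold=2.0, min_samples=3):
    """Return (start, end) timestamp pairs where the queue length stays
    above `threshold` for at least `min_samples` consecutive samples."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    spikes, run = [], []
    for row in reader:
        if len(row) < 2 or not row[1].strip():
            continue  # skip blank or incomplete samples
        stamp, value = row[0], float(row[1])
        if value > threshold:
            run.append(stamp)  # still inside a busy period
            continue
        if len(run) >= min_samples:
            spikes.append((run[0], run[-1]))
        run = []
    if len(run) >= min_samples:  # log ended while the disk was still busy
        spikes.append((run[0], run[-1]))
    return spikes


# Hypothetical samples at 5-second intervals, for illustration only.
SAMPLE = """time,avg_disk_queue_length
09:00:00,1.5
09:00:05,14.2
09:00:10,22.8
09:00:15,19.1
09:00:20,1.1
09:00:25,6.0
09:00:30,0.9
"""

if __name__ == "__main__":
    print(find_queue_spikes(SAMPLE))
```

Matching the reported spike windows against the times users complain of lock-ups would show whether the slowdowns line up with the disk queue or with something else.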

Scanned properly for viruses and spyware. RAM has been tested.

My gut feeling is we have a duff RAID controller but I'd really like an expert to have an opinion on this.
Please don't post a reply with basic stuff in it, no matter how helpful you think you are being! We now need someone who _really_ knows what they are talking about.
DCMBS

I had a very similar issue some time ago.  It turned out that the RAID 5 was unable to cope with the high level of update transactions generated by the users, due to the overhead of calculating the redundancy info during write operations.  This company was a data processing company and updated lots of Access databases stored on the server.  Our solution at that time was to restructure their disks as two RAID 1 arrays.  This had a dramatic effect on the performance, as the write performance improved several fold.  Something similar could be happening here.
Incidentally, I thought the max RAM for 32-bit Windows is 4GB.
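The RAID 5 write overhead described above can be put into rough numbers with the usual write-penalty rule of thumb: a small random write costs about four disk I/Os on RAID 5 (read data, read parity, write data, write parity) against two on a mirror. The sketch below is a back-of-envelope estimate only. The 75 IOPS per disk and the 50% write mix are assumed figures, and the three-disk RAID 5 is an inference from the capacity in the question (about 1.82TB usable from 1TB drives with a hot-spare), not a confirmed layout:

```python
# Disk I/Os needed per host write, by RAID level (standard rule of thumb).
WRITE_PENALTY = {"raid1": 2, "raid10": 2, "raid5": 4}


def effective_iops(level, disks, per_disk_iops, write_fraction):
    """Estimate host-visible IOPS for an array under a given write mix."""
    raw = disks * per_disk_iops
    penalty = WRITE_PENALTY[level]
    # Each host read costs one disk I/O; each host write costs `penalty`.
    return raw / ((1 - write_fraction) + write_fraction * penalty)


if __name__ == "__main__":
    # Assumed workload: 75 IOPS per SATA disk, half of all I/O is writes.
    for level, disks in (("raid5", 3), ("raid10", 4)):
        print(level, round(effective_iops(level, disks, 75, 0.5)))
```

With those assumed inputs the estimate comes out at roughly 90 IOPS for the three-disk RAID 5 and 200 for a four-disk mirrored layout, which is in line with the "several fold" write improvement mentioned above. It says nothing about whether this particular server is actually write-bound.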
quadraspleen

ASKER

Hi there,

Very useful suggestion - thanks for that. We had had a similar idea of taking up a dedicated RAID controller and creating a new container and doing some acronis magic, but we weren't sure. How did you actually identify that that was the issue? Did you have any logging software? What gave you the final piece of the puzzle?

The Max RAM for 32-bit is limited to 2GB but can address up to 4 with the /3GB switch. I have been advised not to use that switch on SBS/Exchange boxes. Perhaps this is wrong?
Yeah, we had a lot of trouble diagnosing it.  The main symptoms we had were that when the server locked up the disk queue went through the roof, and as the queue came down the server started responding.  We tried initially just putting in a single large IDE disk and getting that to impersonate the RAID 5. This just blew the RAID 5 away for performance, so we came up with the idea of the RAID 1s.

All our servers have 4GB of RAM.  I only use the /3GB switch on database servers where I need as much RAM as I can get for the database engine.  It can have a negative effect as it limits the O/S to just 1GB.
A bit more info about memory.

You are basically right when you say that the /3GB switch should not be used on SBS machines. However, the machine can have up to 4GB installed.  It will divide this into 2GB kernel memory and 2GB application memory.

http://www.brianmadden.com/blogs/brianmadden/archive/2004/02/19/the-4gb-windows-memory-limit-what-does-it-really-mean.aspx
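The arithmetic behind those figures is short enough to write out. Note that the 2GB/2GB figure is a split of each process's 32-bit virtual address space rather than of the installed RAM. The function name below is purely illustrative:

```python
GIB = 2**30  # bytes in one gigabyte (binary)


def address_space_split(three_gb_switch=False):
    """Return the user/kernel split of a 32-bit virtual address space, in GB."""
    total = 2**32  # a 32-bit pointer can address 4GB in total
    user = (3 if three_gb_switch else 2) * GIB
    return {"user_gib": user // GIB, "kernel_gib": (total - user) // GIB}


if __name__ == "__main__":
    print("default:", address_space_split())
    print("/3GB   :", address_space_split(three_gb_switch=True))
```

This is why the /3GB switch squeezes the O/S: the extra 1GB handed to applications comes straight out of the kernel's half.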
Your problem is a single RAID 5 array.
RAID 5 isn't very fast to begin with.
Then you have the additional problem that Exchange is a highly transactional database. Every time something happens, Exchange writes to two locations at the same time. Add that to the poor performance of a RAID 5 array and you have a system that is thrashing itself into the ground.

Throwing memory at the problem is not going to help, because you have a major disk bottleneck. Realistically, only a complete redesign of the storage structure is going to improve matters. Anything else is basically "tinkering" around the edges.

Simon.
Thanks for the replies.

Simon, I have many customers with very similar setups, i.e. more than 10 users with SBS 2K3 and a RAID-5; some of them have way older and lower spec servers (older disks, too) than this one, with nowhere near the issues we are seeing here. Not even close.

I hear what you are saying, but having spoken to Mr. Supermicro tech guru today, he seems to think that the ESB2 doesn't like SATA-2 drives in RAID-5 format and has advised me to jumper them to restrict the transfer rate to SATA-1 (which would seem a retrograde step, perhaps, but if it fixes it...). He says he has seen this exact same spec and issues, with the fix just mentioned sorting it out. If it doesn't, we will be adding a dedicated RAID card and setting up a RAID-1+0 and ghosting the partitions.

I will keep the Q posted, if anyone has anything else to add, please feel free. Thanks for the replies, again.
The fact that you have other servers running in the same configuration doesn't mean it is the right thing to do. I call that the drink driver's excuse.
The simple fact is that Exchange is very hard on its storage; a single RAID array will always be a major bottleneck. If you have four disks then I would have two mirror arrays so the database and logs can be split.

Simon.
robocat

First, when running a high I/O server, you should always use SAS disks, because SATA is not suitable for such environments. SATA is never recommended for Exchange environments, even for small ones.

Second, you should always run a raid-5 controller with write back cache enabled. Check if your raid controller has cache memory on board and a (working) battery backup. If your raid controller lacks any of these, get a controller that does.
This will speed up write performance (and general server performance) *significantly*.

You should also look at the memory that the individual processes are using. Heavy paging will always kill your machine. Disable individual processes if needed to avoid paging as much as possible. Perhaps you're trying to do too much on a single server.


Hello again and thanks for the extra comments.

This is NOT an Exchange issue! Nor is it a RAID issue. I hear and agree with everything that is being said about RAID, but in this application it is not the issue. The server is not under enough load for it to be the RAID type failing us. It does it with no clients connected to the server, right after a clean reboot and with Exchange services disabled.  I also know we should, in a perfect world, be using SAS disks, but that is not the configuration we have, nor will we be changing it.

As I said earlier, and Simon, I do hear what you are saying, but changing the configuration would mean serious disruption to the end-user, not to mention a cash implication. We also have two very similar companies, with similar numbers of users, a very similar MDB size and similar numbers of emails going through the business each day, on _identical_ hardware, and one server does it and the other doesn't. So I can be pretty sure it is not the users/environment causing it, however suitable or not it may be. It has nothing to do with drink-driving, nor is it an excuse. It's just the way it is.

We have faulty hardware. We have had this confirmed to us by Supermicro and Intel, who have advised us to replace one of the drives (which, it turns out, is faulty) and then, with regard to Exchange, to do what we were going to do anyway, which was to stick a dedicated RAID card in there (with battery and memory, Robocat!) and create a separate drive/container for the Exchange DB and temp/log files. We really don't want to be migrating their whole system to ghosted drives unless we absolutely have to, as it will be very disruptive to all concerned. We will leave the system and data on the RAID-5 and migrate Exchange alone to the new container and see how it goes. If the system is still mullered, we shall re-evaluate.

Thanks for everyone's time.
Well an update on this: We have now changed the board and all the SATA cables and we still have an issue. The drive in port0 on the controller is the one that is churning, so we have now replaced that drive and are rebuilding the RAID. I will post results as and when...
This should be PAQed. Quadraspleen should post his resolution and accept it as the answer.  There is useful info here.