Solved

RAID 5 on SUSE LINUX 9.3 - Strange activity every 12 hours

Posted on 2006-11-08
Last Modified: 2010-04-27
I am having a problem with a RAID 5 array on SUSE Linux 9.3.
The server is a Dell PowerEdge 2850 with an LSI Logic PERC 4e/Di controller.
This server is used for storage and currently holds up to 300,000 small files.
The server does a lot of reads but not many writes. Every 12 hours (or so) the system runs very slowly for about 30 minutes; during this time CPU usage averages up to 70% and the hard drives run at full throttle.

This is the server configuration: two 64-bit Xeon CPUs, 1 TB of SCSI storage, SUSE Linux 9.3 (updated), Java 1.5.0, Apache Tomcat.
The average server load is usually about 4% of CPU time, but every 12 hours these strange spikes appear, and nothing changes in the process list.

The disks are reiserfs and were mounted with acl, user_xattr.
I also tried mounting them with noatime, data=journal, acl, user_xattr, but nothing changed.

I have the same software configuration on another server; the difference is that there is no RAID on that one, and it is running flawlessly.

Thank you in advance for your help.


Povas

Question by:Povas
13 Comments
 
LVL 16

Expert Comment

by:gurutc
ID: 17897817
Hi,

I've seen this type of issue before on my setup that is very similar to yours.  Are the array drives installed in the server or are they in a separate array cabinet?  If they are in a cabinet like a Powervault 220 it may be an issue with that enclosure power supply, a failing drive on its last leg, or a bad SCSI cable causing intermittent array inconsistency that the PERC controller is trying to fix.  All of these have happened to me.   Also, how is your storage partitioned?  If it's one big partition and it's full then there may not be enough free space for the OS to efficiently do its housekeeping chores.  Also, are you running any indexing for your web content?  If so, a cache, timeout, or scheduled index setting may be doing its work during these increased activity periods.  With 300,000 files this could take some time.  Also, if those files are in one logical partition and a relatively flat directory structure then indexing would really kill your cpu.

The overall disk activity looks like it's coming from the PERC controller though.  CPU utilization may rise due to slow disk i/o.  When these spikes occur check the cpu utilization by process and compare that to values observed during normal operation.  Also, while booting go into the PERC setup menu and check values and messages for the array components.  And finally, you need to run array housekeeping and consistency checking on a regular basis.  On my Linux Dell RAID boxes I usually do this from the PERC controller menu manually during a reboot.

Good Luck,
- gurutc
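
A minimal sketch of the "compare a spike against normal operation" advice above, assuming a 2.6 kernel that exposes `/proc/diskstats`. It takes two snapshots and reports per-device deltas. The function names and the choice of counters are illustrative, not something from this thread. One caveat: work that the controller does internally is probably not counted as reads issued by the OS, so during such a spike the sector counts may stay low while the busy time climbs.

```python
def parse_diskstats(text):
    """Return {device: (sectors_read, sectors_written, ms_doing_io)}.

    Only full-format lines (14 or more fields) are used. Older 2.6 kernels
    print shorter lines for partitions, and those are skipped.
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue
        name = fields[2]
        # Field positions: 5 = sectors read, 9 = sectors written,
        # 12 = milliseconds spent doing I/O.
        stats[name] = (int(fields[5]), int(fields[9]), int(fields[12]))
    return stats


def io_delta(before, after):
    """Per-device difference between two parse_diskstats() snapshots."""
    return {
        dev: tuple(a - b for a, b in zip(after[dev], before[dev]))
        for dev in after
        if dev in before
    }


def read_snapshot(path="/proc/diskstats"):
    """Read and parse the current counters from the kernel."""
    with open(path) as f:
        return parse_diskstats(f.read())
```

Calling `read_snapshot()` twice, a fixed number of seconds apart, once during a spike and once during a quiet period, and passing both results to `io_delta()` gives two sets of numbers to compare.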
 
LVL 88

Expert Comment

by:rindi
ID: 17897950
Also check for any cron jobs, maybe there is a virus scan or something scheduled for the times it runs high.
 

Author Comment

by:Povas
ID: 17898186
gurutc:
The array drives are installed in the server.
There is not one big partition, but 8 of them. I am not running any indexing on the servers for sure.

I will do the things you suggested (not right now because the servers are still in production).

rindi:

There is nothing in the cron, I checked that.
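
For anyone repeating the cron check, here is a rough sketch that lists every active entry across the usual crontab locations, so nothing is missed by looking at only one file. The paths in `CRON_GLOBS` are an assumption about a typical SUSE layout and the function names are made up for illustration. Scripts dropped into directories such as `/etc/cron.daily` are not crontab files and still need to be looked at by hand.

```python
import glob
import re

# Assumed locations; adjust for the distribution in use.
CRON_GLOBS = ["/etc/crontab", "/etc/cron.d/*", "/var/spool/cron/tabs/*"]

# Matches environment settings such as SHELL=/bin/sh or MAILTO = root.
ENV_LINE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*\s*=")


def active_cron_entries(crontab_text):
    """Return the schedule lines of a crontab, skipping blank lines,
    comments, and environment variable assignments."""
    entries = []
    for raw in crontab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if ENV_LINE.match(line):
            continue
        entries.append(line)
    return entries


def scan_crontabs(patterns=CRON_GLOBS):
    """Return {path: [entries]} for every readable file matching patterns."""
    found = {}
    for pattern in patterns:
        for path in glob.glob(pattern):
            try:
                with open(path, errors="replace") as f:
                    found[path] = active_cron_entries(f.read())
            except OSError:
                # Unreadable files and directories are skipped.
                continue
    return found
```

Run as root, `scan_crontabs()` should also pick up per-user crontabs that a plain `crontab -l` for one user would not show.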
 

Author Comment

by:Povas
ID: 17898339
What I did not specify before is this:

There is not one, but two servers with the same configuration, and they behave the same way (though not at the same times). These servers are used just for storage.

The partitions are about 20% full on average, so disk space is not the problem.

Is there something that I can modify in dellmgr to make things smoother?
 
LVL 88

Expert Comment

by:rindi
ID: 17898518
Are the users connected when that happens? Maybe they are running an AV scan on the network drives... (they shouldn't, of course).
 

Author Comment

by:Povas
ID: 17898539
There is no one connected to the servers. Everything is done by scripts, which I also tested with the same number of files on other servers, where they worked fine.
 
LVL 16

Expert Comment

by:gurutc
ID: 17898730
Ok,

First, I'd update the PERC firmware on the controllers.  The updates to SUSE and its drivers will be looking for the new firmware.  Also you should be able to install RAID monitoring in SUSE if all the driver stuff is right.  That way you can look at the controller activity during these episodes.  One likely cause of these issues is controller battery conditioning.  When this occurs most of the controller's performance options including caching are partially to completely disabled and the controller processor allocates threads to the conditioning leading to terrible array i/o.

- gurutc
 
LVL 30

Expert Comment

by:Duncan Meyers
ID: 17903741
It may be that SUSE is refreshing the slocate file database.
 

Author Comment

by:Povas
ID: 17904571
It is definitely not slocate.

I did something yesterday which seems to have had some effect.
I started dellmgr and ran a consistency check from there. After that check (which I did not let run to completion), the spikes have not reappeared so far (they used to appear every 12 hours or so).

Is there a way to set dellmgr to run a consistency check from cron?
How often?
How much I/O does a consistency check use?
 

Author Comment

by:Povas
ID: 17905090
Could it be something related to Patrol Read?

I found this on a website:

"What is interesting in Dell/LSI docs “Patrol Read” is positioned as lower overhead alternative to consistency check. What I’ve found out however is - it does not really catches errors well enough (as in this case) plus it has some strange performance problems - in certain cases I’ve seen it slowing down array to probably 20% of its capacity for 20-30min. Could be bug but Dell just told to disable Patrol read. "

I will disable Patrol Read and hope that this was the problem.
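
A rough way to check whether a change like this actually removes the spikes is to log the load average at regular intervals and look at the gaps between spike starts. The sketch below assumes samples of the form (Unix time, 1-minute load); the threshold value and all names are hypothetical. Very regular gaps that do not line up with anything in cron would point at a timer outside the OS, such as the controller.

```python
def read_loadavg(path="/proc/loadavg"):
    """Return the current 1-minute load average."""
    with open(path) as f:
        return float(f.read().split()[0])


def spike_starts(samples, threshold):
    """Return the timestamps at which load first reaches the threshold.

    samples is a list of (unix_time, load) pairs sorted by time. A run of
    consecutive samples at or above the threshold counts as one spike.
    """
    starts = []
    in_spike = False
    for timestamp, load in samples:
        above = load >= threshold
        if above and not in_spike:
            starts.append(timestamp)
        in_spike = above
    return starts


def intervals_hours(starts):
    """Return the gaps between consecutive spike starts, in hours."""
    return [(later - earlier) / 3600.0 for earlier, later in zip(starts, starts[1:])]
```

Collecting `(time.time(), read_loadavg())` every few minutes for several days before and after the change should be enough to see whether the 12-hour pattern is gone.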
 
LVL 16

Accepted Solution

by:
gurutc earned 500 total points
ID: 17905449
Hi,

Yes, absolutely, it could be the flawed Patrol Read module causing your issues.  With PERC controllers in Dell boxes running Linux you really have to tweak the heck out of things to make all of it work.  One thing to consider, although I can't find documentation to explain this, is upgrading to SUSE 10 or later.  Since I went from 9.x to 10, and now 10.1 OSS 64 bit, on my 2850s, things have been much smoother, with less weird driver and hardware flakiness.  I have much more native driver support in the kernel, and I/O to disk and network is way faster.

- gurutc
 

Author Comment

by:Povas
ID: 17908142
Yes. I am seriously considering upgrading to 10.1 OSS 64 bit, but for now I have disabled Patrol Read in the controller BIOS.

So far everything is fine, but it is too soon to say that things are really great.

I will keep you posted.

Thank you very much for your help so far.
 

Author Comment

by:Povas
ID: 17936616
It is finally done.

Disabling Patrol Read resolved it.
Thanks all.

I think the points should go to gurutc.