Solved

RAID 5 on SUSE LINUX 9.3 - Strange activity every 12 hours

Posted on 2006-11-08
Last Modified: 2010-04-27
I am having a problem with a RAID 5 array on SUSE Linux 9.3.
The server is a Dell PowerEdge 2850 with an LSI Logic PERC 4e/Di controller.
This server is used for storage and currently holds up to 300,000 small files.
The server does a lot of reads but not many writes. Every 12 hours or so, the system runs very slowly for about 30 minutes; during this time CPU usage averages up to 70% and the hard drives run at full throttle.

The server configuration is as follows: two 64-bit Xeon CPUs, 1 TB of SCSI storage, SUSE Linux 9.3 (updated), Java 1.5.0, Apache Tomcat.
Usually the average server load is about 4% of CPU time, but every 12 hours these strange spikes appear, and nothing changes in the process list.

The disks are reiserfs and were mounted acl, user_xattr.
I also tried to mount them with noatime, data=journal, acl, user_xattr but nothing changed.

I have the same software configuration on another server but the difference is that there is no RAID on that one - it is running flawlessly.

Thank you in advance for your help.


Povas

Question by:Povas
13 Comments
 
LVL 16

Expert Comment

by:gurutc
ID: 17897817
Hi,

I've seen this type of issue before on my setup, which is very similar to yours. Are the array drives installed in the server or are they in a separate array cabinet? If they are in a cabinet like a PowerVault 220, it may be an issue with that enclosure's power supply, a failing drive on its last legs, or a bad SCSI cable causing intermittent array inconsistency that the PERC controller is trying to fix. All of these have happened to me. Also, how is your storage partitioned? If it's one big partition and it's full, then there may not be enough free space for the OS to do its housekeeping chores efficiently. Also, are you running any indexing for your web content? If so, a cache, timeout, or scheduled index setting may be doing its work during these increased activity periods. With 300,000 files this could take some time. Also, if those files are in one logical partition and a relatively flat directory structure, then indexing would really kill your CPU.

The overall disk activity looks like it's coming from the PERC controller though.  CPU utilization may rise due to slow disk i/o.  When these spikes occur check the cpu utilization by process and compare that to values observed during normal operation.  Also, while booting go into the PERC setup menu and check values and messages for the array components.  And finally, you need to run array housekeeping and consistency checking on a regular basis.  On my Linux Dell RAID boxes I usually do this from the PERC controller menu manually during a reboot.
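
A minimal Python sketch of that per-process comparison, assuming a Linux `/proc` file system and Python 3. The function names `parse_stat_line`, `snapshot` and `busiest` are illustrative and do not come from any tool mentioned in this thread. It takes two snapshots of per-process CPU ticks and ranks processes by how much CPU they used in between, so a run during a spike can be compared with a run during normal operation:

```python
import os


def parse_stat_line(line):
    """Return (pid, name, cpu_ticks) parsed from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split around the last ')' rather than on whitespace.
    head, tail = line.rsplit(")", 1)
    pid_str, name = head.split("(", 1)
    fields = tail.split()
    # fields[0] is the process state (field 3 of the full line), so
    # utime (field 14) and stime (field 15) are fields[11] and fields[12].
    utime, stime = int(fields[11]), int(fields[12])
    return int(pid_str), name, utime + stime


def snapshot():
    """Map pid -> (name, cpu_ticks) for every process currently visible."""
    ticks = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as handle:
                pid, name, cpu = parse_stat_line(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were reading it
        ticks[pid] = (name, cpu)
    return ticks


def busiest(before, after, top=5):
    """Return the `top` processes by CPU ticks used between two snapshots."""
    deltas = []
    for pid, (name, cpu) in after.items():
        if pid in before:
            deltas.append((cpu - before[pid][1], pid, name))
    deltas.sort(reverse=True)
    return deltas[:top]
```

For example, take `before = snapshot()`, wait ten seconds, then call `busiest(before, snapshot())`. If the ranking during a spike looks the same as during normal operation, that would suggest the load is not coming from a userspace process.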

Good Luck,
- gurutc
 
LVL 87

Expert Comment

by:rindi
ID: 17897950
Also check for any cron jobs, maybe there is a virus scan or something scheduled for the times it runs high.
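
One way to review everything cron can run in a single pass is sketched below in Python 3. The locations listed are assumptions (`/var/spool/cron/tabs` is where SUSE normally keeps per-user crontabs; adjust for your system), and `active_entries`, `cron_files` and `report` are illustrative names. It strips comments, blank lines and variable assignments such as `SHELL=...`, leaving only the scheduled entries:

```python
import os
import re

# Assumed locations; adjust to match your distribution.
CRON_LOCATIONS = ["/etc/crontab", "/etc/cron.d", "/var/spool/cron/tabs"]

# Matches lines like SHELL=/bin/sh or MAILTO = root.
ENV_ASSIGNMENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*\s*=")


def active_entries(text):
    """Return the scheduled lines of a crontab, skipping comments,
    blank lines and environment variable assignments."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if ENV_ASSIGNMENT.match(line):
            continue
        entries.append(line)
    return entries


def cron_files(locations=CRON_LOCATIONS):
    """Expand files and directories into a flat list of crontab files."""
    files = []
    for location in locations:
        if os.path.isfile(location):
            files.append(location)
        elif os.path.isdir(location):
            try:
                names = sorted(os.listdir(location))
            except OSError:
                continue  # directory not readable without root
            for name in names:
                path = os.path.join(location, name)
                if os.path.isfile(path):
                    files.append(path)
    return files


def report(locations=CRON_LOCATIONS):
    """Print every active cron entry together with the file it came from."""
    for path in cron_files(locations):
        try:
            with open(path) as handle:
                entries = active_entries(handle.read())
        except OSError:
            continue  # file not readable without root
        for entry in entries:
            print(f"{path}: {entry}")
```

Run `report()` as root so the per-user crontabs are readable. Scripts dropped into directories such as `/etc/cron.hourly` or `/etc/cron.daily` are not covered here and need a separate look.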
 

Author Comment

by:Povas
ID: 17898186
gurutc:
The array drives are installed in the server.
There is not one big partition, but 8 of them. I am not running any indexing on the servers for sure.

I will do the things you suggested (not right now because the servers are still in production).

rindi:

There is nothing in the cron, I checked that.
 

Author Comment

by:Povas
ID: 17898339
What I did not specify before is this:

There is not one but two servers with the same configuration, and both behave the same way (though not at the same times). These servers are used just for storage.

The partitions are about 20% full on average, so disk space is not the problem.

Is there something that I can modify in dellmgr to make things smoother?
 
LVL 87

Expert Comment

by:rindi
ID: 17898518
Are the users connected when that happens? Maybe they are running an AV scan on the network drives... (they shouldn't, of course).
 

Author Comment

by:Povas
ID: 17898539
No one is connected to the servers. Everything is done by scripts, which I also tested with the same number of files on other servers, where they worked fine.
 
LVL 16

Expert Comment

by:gurutc
ID: 17898730
Ok,

First, I'd update the PERC firmware on the controllers.  The updates to SUSE and its drivers will be looking for the new firmware.  Also you should be able to install RAID monitoring in SUSE if all the driver stuff is right.  That way you can look at the controller activity during these episodes.  One likely cause of these issues is controller battery conditioning.  When this occurs most of the controller's performance options including caching are partially to completely disabled and the controller processor allocates threads to the conditioning leading to terrible array i/o.

- gurutc
 
LVL 30

Expert Comment

by:Duncan Meyers
ID: 17903741
It may be that SUSE is refreshing the slocate file database.
 

Author Comment

by:Povas
ID: 17904571
It is not the slocate for sure.

I did something yesterday which seems to have had some effect.
I started dellmgr and ran a consistency check from there. Since that check (which I did not let run to completion), the spikes have not reappeared, whereas before they occurred every 12 hours or so.

Is there a way to set dellmgr to run a consistency check from cron?
How often?
What is the amount of I/O used when checking for consistency?
 

Author Comment

by:Povas
ID: 17905090
Can it be something related to Patrol Read?

I found this on a website:

"What is interesting in Dell/LSI docs “Patrol Read” is positioned as lower overhead alternative to consistency check. What I’ve found out however is - it does not really catches errors well enough (as in this case) plus it has some strange performance problems - in certain cases I’ve seen it slowing down array to probably 20% of its capacity for 20-30min. Could be bug but Dell just told to disable Patrol read. "

I will disable Patrol Read and hope that this was the problem.
 
LVL 16

Accepted Solution

by:gurutc (earned 500 total points)
ID: 17905449
Hi,

Yes, absolutely, it could be the flawed Patrol Read module causing your issues. With PERC controllers in Dell boxes running Linux you really have to tweak the heck out of things to make all of it work. One thing to consider, although I can't find documentation to explain this, is to upgrade to SUSE 10 or later. Since I went from 9.x to 10 and now 10.1 OSS 64-bit on my 2850s, things have been much smoother, with less weird driver and hardware flakiness; I have much more native driver support in the kernel, and I/O to disk and network is way faster.

- gurutc
 

Author Comment

by:Povas
ID: 17908142
Yes. I am seriously considering upgrading to 10.1 OSS 64-bit, but for now I have disabled Patrol Read in the controller BIOS.

So far everything is fine, but it is too soon to say that the problem is really gone.

I will keep you posted.

Thank you very much for your help so far.
 

Author Comment

by:Povas
ID: 17936616
It is finally done.

Disabling the Patrol Read resolved it.
Thanks all.

I think the points should go to gurutc.