• Status: Solved

Extremely high iowait on an Openfiler 2.3 x86_64 machine

Hi,

I have a mysterious high iowait issue on an Openfiler 2.3 64-bit Linux box.
kernel: Linux san01.intoto.local 2.6.29.1-1.7.smp.gcc3.4.x86_64 #1 SMP Sat Apr 11 01:29:24 BST 2009 x86_64 x86_64 x86_64 GNU/Linux

Hardware specs of the box:
Intel S3000AH motherboard
Intel Core 2 Duo 6600, 2.66 GHz
6 GB memory
Areca 1220 8-port SATA RAID controller with 256 MB cache
8-disk RAID 6 data array
2 OS disks in RAID 1

Problem:
No matter what I/O happens on the machine, iowait goes sky high. It doesn't matter whether the I/O goes to the RAID array, to the disks connected to the SATA ports, or even to an external USB disk.

Whenever there is activity, the system load goes sky high, even though I have just one activity running at that moment. (An activity is an rsync file copy or a simple 4 GB dd file write, e.g. dd if=/dev/zero of=./bigfile count=65535 bs=65535.)
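A note on the dd test: without extra flags, dd writes through the Linux page cache, so part of the data may still be in RAM when dd reports its rate. The sketch below is one possible variant, not something anyone in this thread ran. The helper name write_test is hypothetical, and conv=fsync assumes GNU (or BusyBox) dd.

```shell
# Hypothetical helper (not from the thread): write COUNT blocks of BS bytes
# to FILE. conv=fsync makes dd flush the file to storage before it exits,
# so the reported rate includes the time needed to actually reach the disk.
write_test() {
    dd if=/dev/zero of="$1" bs="$2" count="$3" conv=fsync 2>&1
}
```

For example, write_test ./bigfile 1M 4096 would write 4 GiB and print dd's own summary line.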

Below is the output of the sar logging. I started an rsync run on a data set around 6:30 PM; after that, iowait goes to about 50% of the processor (which makes sense, since it's a single thread running on 1 of the 2 CPUs).
05:50:01 PM       CPU     %user     %nice   %system   %iowait     %idle
05:50:01 PM       all      0.00      0.00      0.03      0.00     99.97
06:00:01 PM       all      0.01      0.00      0.15      0.20     99.63
06:10:01 PM       all      0.01      0.00      0.03      0.05     99.91
06:20:01 PM       all      0.35      0.00      0.11      0.07     99.47
06:30:01 PM       all      1.01      0.00      1.06     36.22     61.72
06:40:01 PM       all      0.55      0.00      0.67     55.73     43.05
06:50:01 PM       all      0.36      0.00      0.62     44.97     54.05
07:00:01 PM       all      0.64      0.00      0.86     34.02     64.48
07:10:01 PM       all      0.53      0.00      0.77     46.62     52.08
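To pick the busy intervals out of a longer sar log, a small awk filter can help. This is only a sketch: the function name high_iowait is made up, and it assumes the exact column layout shown above (12-hour time plus an AM/PM column, so %iowait is the 7th field). With a 24-hour locale the field numbers shift by one.

```shell
# Hypothetical filter: read "sar -u" style lines on stdin and print the
# timestamp and %iowait of every "all" row whose %iowait exceeds LIMIT.
# Assumed fields: time, AM/PM, CPU, %user, %nice, %system, %iowait, %idle.
high_iowait() {
    awk -v limit="$1" '$3 == "all" && $7 + 0 > limit + 0 { print $1, $2, $7 }'
}
```

It could be used as sar -u | high_iowait 30 to list only the intervals above 30% iowait.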

I've set up another box with a 4-port SATA controller, a quad-core 2.66 GHz CPU and 4 GB of RAM, running the same version of Openfiler, and it has no issues at all; that box performs perfectly.

Can someone point me in the right direction? Is this a hardware issue or a software issue? What can I do to find out? Please advise.
 
nui-nl (Author) commented:
OK, the rsync task just finished updating the backup of the data.
The statistics show that the task took a little under 60 minutes.
rsync reports the following results:

sent 3985847564 bytes  received 100434 bytes  1124544.51 bytes/sec
total size is 415407438856

That means a data transfer rate of about 1 MB/sec for around 4 gigabytes of data. This does not look normal to me.
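The arithmetic behind that figure can be checked with a few lines of awk. This is just a sketch: the function names are made up, MB is taken as 10^6 bytes, and the elapsed time assumes rsync's reported rate is (sent + received) divided by run time.

```shell
# Convert a rate in bytes/sec to decimal MB/sec and Mbit/sec.
rate_summary() {
    awk -v bps="$1" 'BEGIN { printf "%.2f MB/s, %.2f Mbit/s\n", bps / 1e6, bps * 8 / 1e6 }'
}

# Elapsed minutes implied by the total bytes moved and the reported rate.
elapsed_minutes() {
    awk -v bytes="$1" -v bps="$2" 'BEGIN { printf "%.1f\n", bytes / bps / 60 }'
}

rate_summary 1124544.51                    # the rate rsync reported
elapsed_minutes 3985947998 1124544.51      # sent + received bytes
```

With the numbers above this gives roughly 1.12 MB/s (about 9 Mbit/s) and a run time of about 59 minutes, which matches the "little under 60 minutes" mentioned earlier.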
 
ai_ja_nai commented:
Are the kernels identical on the two boxes?
 
arnold commented:
Use iostat -xt 5 5.

I think your RAID controller is your bottleneck.
 
noci (Software Engineer) commented:
iowait means the CPU is idle while at least one task is waiting for outstanding I/O (not for a timer).
Idle means the processes are waiting for a different reason.

It doesn't mean that the system is computationally busy.
Try starting a set of compute-bound jobs: you will see that beyond a certain amount of load the iowait disappears and the time shows up as user time instead.
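For reference, the iowait percentage that sar and top report can be derived from two samples of the aggregate "cpu" line in /proc/stat. The sketch below is illustrative only: iowait_pct is a made-up name, and it assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal), summing only those eight counters.

```shell
# Hypothetical helper: given two samples of the "cpu ..." line from
# /proc/stat, print the share of elapsed CPU time spent in iowait.
# Assumed fields: cpu user nice system idle iowait irq softirq steal ...
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total[NR] = 0
          for (i = 2; i <= 9 && i <= NF; i++) total[NR] += $i
          iow[NR] = $6 }
        END { dt = total[2] - total[1]
              if (dt > 0) printf "%.1f\n", 100 * (iow[2] - iow[1]) / dt
              else print "0.0" }'
}
```

A possible way to use it: a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat); iowait_pct "$a" "$b". Running a compute-bound job at the same time should lower the number, since that time is then counted as user time.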

The real work that gets done during I/O is accounted as system (kernel) time.
By the way, 1 MB/sec = 8 Mb/sec (bytes vs. bits). The first is used for storage sizes, the latter for data communications.

rsync is special in the sense that it reads a lot from disk, then computes checksums, and then transfers the data if needed. If you want to transfer it all anyway, just skip the checksumming.
Try a raw copy to check transfer speeds.
Starting other compute-bound tasks will show you that the iowait decreases.

Transfer problems can occur when autonegotiation of a data link fails (full-duplex/half-duplex mismatches). For raw speed checks use NetIO:
http://www.ars.de/ars/ars.nsf/docs/netio
With NetIO you can verify the capacity of the path between the systems.
Also experiment with packet sizes using this tool.

10 Mbps  - allows 7~8 Mbps gross, data ~0.9-1 MB/sec
100 Mbps - allows 70~80 Mbps gross, data ~7-8 MB/sec
1 Gbps   - allows 200~400 Mbps gross, data ~40 MB/sec with normal Ethernet frames
           ~1 Gbps gross, data ~100 MB/sec using jumbo frames

 
diepes commented:
This sounds like hardware; my guess would be a faulty hard disk.

How to pinpoint it:
# iostat -m 10 -d -x

This should let you see whether a specific disk is very busy. I'm not sure it will work with hardware RAID; Linux may see only one block device.
I have used it on software RAID with a similar problem to find the problematic disk, and then looked at that disk with:
# smartctl --all /dev/sdb | grep -i health
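The iostat step above can be narrowed down to the devices that are actually saturated. This is a rough sketch: busy_devices is a made-up name, and it assumes that %util is the last column of iostat -d -x output, which should be checked against the installed sysstat version. Run iostat with LC_ALL=C so decimals use a dot.

```shell
# Hypothetical filter: read "iostat -d -x" output on stdin and print the
# name and %util of every device whose %util (assumed to be the last
# column) exceeds LIMIT. Header lines and purely numeric rows are skipped.
busy_devices() {
    awk -v limit="$1" 'NF > 2 && $1 !~ /^[0-9.]+$/ && $NF + 0 > limit + 0 { print $1, $NF }'
}
```

For example, LC_ALL=C iostat -m -d -x 10 3 | busy_devices 90 would list only the devices above 90% utilisation. Behind a hardware RAID controller this may still show just the single logical device.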



 
nui-nl (Author) commented:
Hi all,

Thanks for the suggestions. I finally (sort of) sorted out the issue.

It looks like the 'auto' mode for write-back caching was not working on the Areca controller: although the battery backup unit was installed and 100% charged, the write-back cache apparently was not being used. So I just set it to 'enabled', and now the machine can do multiple tasks without stuttering.

I'm not 100% sure this was the issue (I've also replaced the dual-core CPU with a quad-core), and load still seems to get quite high during a dd test run of 8 gigabytes. But the machine now stays responsive under load, and when I run other commands (a simple ls, but also another dd to another hardware RAID disk set) they carry on without a glitch :)


In regard to the questions:

@ai_ja_nai - Yes, the kernels were the same.
@noci - I know that high iowait does not necessarily mean something is wrong, but the strange thing was that the machine became very unresponsive while doing just a simple dd, with no other users or processes. The same applied to the rsync run I posted above, which ran on the SAME box with no network connection involved.
@diepes - I checked the SMART values of the disks using the RAID controller's web interface and they all seemed OK.