Extremely high iowait on a machine running Openfiler 2.3 x86_64

Posted on 2009-04-29
Last Modified: 2013-12-16

I have a mysterious high I/O wait issue with an Openfiler 2.3 64-bit Linux box.
kernel: Linux san01.intoto.local #1 SMP Sat Apr 11 01:29:24 BST 2009 x86_64 x86_64 x86_64 GNU/Linux

Hardware specs of the box:
Intel S3000AH motherboard
Intel Core 2 Duo 6600, 2.66 GHz
6 GB memory
Areca 1220 8-port SATA RAID controller with 256 MB cache
8-disk RAID 6 data array
2 OS disks in RAID 1

No matter what I/O happens on the machine, the iowait gets sky high. It doesn't matter whether the I/O is on the RAID array, on the disks connected to the SATA ports, or even on an external USB disk.

Whenever there is activity, the system load goes sky high, even with just one activity running. (An activity is an rsync file copy, or a simple 4 GB dd file write, e.g. dd if=/dev/zero of=./bigfile count=65535 bs=65535.)
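A sketch of the same dd write test, assuming GNU dd. With 6 GB of memory, much of a 4 GB write can sit in the page cache, so the rate dd prints may overstate what the disks sustain. The conv=fsync option makes dd flush the data to disk before it prints its summary. The helper name dd_write_test is hypothetical and not part of the original post.

```shell
# dd_write_test FILE SIZE_MIB  (hypothetical helper, not from the thread)
# Writes SIZE_MIB mebibytes of zeros to FILE and flushes them to disk
# before dd prints its throughput summary.
dd_write_test() {
    dd if=/dev/zero of="$1" bs=1M count="$2" conv=fsync
}

# Example (writes 4 GiB into the current directory):
#   dd_write_test ./bigfile 4096
```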

Below is output from the sar logging. I started an rsync run on a data set around 6:30 PM; after that, iowait goes to about 50% of CPU (which makes sense, since it's a single thread running on one of the two CPUs).
05:50:01 PM       CPU     %user     %nice   %system   %iowait     %idle
05:50:01 PM       all      0.00      0.00      0.03      0.00     99.97
06:00:01 PM       all      0.01      0.00      0.15      0.20     99.63
06:10:01 PM       all      0.01      0.00      0.03      0.05     99.91
06:20:01 PM       all      0.35      0.00      0.11      0.07     99.47
06:30:01 PM       all      1.01      0.00      1.06     36.22     61.72
06:40:01 PM       all      0.55      0.00      0.67     55.73     43.05
06:50:01 PM       all      0.36      0.00      0.62     44.97     54.05
07:00:01 PM       all      0.64      0.00      0.86     34.02     64.48
07:10:01 PM       all      0.53      0.00      0.77     46.62     52.08
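For a finer-grained view than sar's 10-minute averages, the same percentage can be estimated from two samples of the aggregate "cpu" line in /proc/stat. This is a minimal sketch: the helper name iowait_pct is hypothetical, and it assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal), counting only those first eight counters.

```shell
# iowait_pct LINE1 LINE2  (hypothetical helper)
# LINE1 and LINE2 are two "cpu ..." lines from /proc/stat taken some time apart.
# Prints the share of elapsed CPU time spent in iowait, as a percentage.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | LC_ALL=C awk '
        {
            total = 0
            # fields 2..9: user nice system idle iowait irq softirq steal
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            io = $6
        }
        NR == 1 { t0 = total; io0 = io }
        NR == 2 {
            if (total > t0) printf "%.2f\n", 100 * (io - io0) / (total - t0)
            else print "0.00"
        }'
}

# Example: sample five seconds apart
#   a=$(grep '^cpu ' /proc/stat); sleep 5; b=$(grep '^cpu ' /proc/stat)
#   iowait_pct "$a" "$b"
```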

I've set up another box with a 4-port SATA controller, a quad-core 2.66 GHz CPU and 4 GB of RAM, running the same version of Openfiler, and it has no issues at all; that box performs perfectly.

Can someone point me in the right direction? Is this a hardware issue or a software issue? What can I do to find out? Please advise.
Question by:nui-nl

    Author Comment

    OK, the rsync task just finished updating the backup of the data.
    The statistics show that the task took a little under 60 minutes.
    rsync shows the following results:

    sent 3985847564 bytes  received 100434 bytes  1124544.51 bytes/sec
    total size is 415407438856

    This means a data transfer rate of about 1 MB/sec for around 4 gigabytes of data. This does not look normal to me.
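The arithmetic behind that conclusion can be checked from rsync's own summary line. A small sketch, using awk for the floating-point math; the helper name rsync_rate is hypothetical.

```shell
# rsync_rate BYTES_SENT BYTES_PER_SEC  (hypothetical helper)
# Converts the numbers from rsync's summary into a duration and a MiB/s rate.
rsync_rate() {
    LC_ALL=C awk -v bytes="$1" -v rate="$2" 'BEGIN {
        printf "%.1f min at %.2f MiB/s\n", bytes / rate / 60, rate / 1048576
    }'
}

# Values from the rsync output above
rsync_rate 3985847564 1124544.51
```

This agrees with the run taking a little under 60 minutes at roughly 1 MB/sec.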

    Expert Comment

    Are the kernels identical on the two boxes?

    Expert Comment

    use iostat -xt 5 5

    I think your RAID controller is your bottleneck.

    Expert Comment

    iowait means the CPU is idle while a task has I/O outstanding (it is waiting for I/O, not for a timer).
    Idle means processes are waiting for some other reason.

    It doesn't mean that the system is computationally busy.
    Try starting a set of compute-bound jobs; you will see that beyond a certain load the iowait disappears and the time is counted as user instead.

    The real work that gets done during I/O is system (kernel) time.
    Btw, 1 MB/sec = 8 Mb/sec (bytes vs. bits). The first is used for storage sizes, the latter for data communications.

    rsync is special in the sense that it reads a lot from disk, then computes checksums, and only then transfers the data if needed. If you want to transfer it all anyway, just skip the checksumming.
    Try a raw copy to check transfer speeds.
    Starting other tasks that are compute-bound will show you that the iowait decreases.

    Transfer problems can occur when autonegotiation of a data link fails (full-duplex/half-duplex mismatches). For raw speed checks use NetIO.
    With NetIO you can verify the capacity of the path between the systems.
    Also experiment with packet sizes using this tool.

    10 Mbps  - allows 7~8 Mbps gross, data ~0.9-1 MB/sec
    100 Mbps - allows 70~80 Mbps gross, data ~7-8 MB/sec
    1 Gbps   - allows 200~400 Mbps gross, data ~40 MB/sec with normal Ethernet frames;
               ~1 Gbps gross, data ~100 MB/sec using jumbo frames
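The bits-versus-bytes conversion sets a hard upper bound for each link speed: dividing the line rate in Mbps by 8 gives the ceiling in MB/sec, before any protocol overhead. The figures quoted above are lower because they allow for that overhead. A short sketch; the helper name line_rate_ceiling is hypothetical.

```shell
# line_rate_ceiling MBPS  (hypothetical helper)
# Prints the theoretical maximum payload rate in MB/s for a link of MBPS
# megabits per second, ignoring all protocol overhead.
line_rate_ceiling() {
    LC_ALL=C awk -v mbps="$1" 'BEGIN { printf "%.2f MB/s\n", mbps / 8 }'
}

for rate in 10 100 1000; do
    printf '%s Mbps -> ' "$rate"
    line_rate_ceiling "$rate"
done
```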


    Expert Comment

    This sounds like hardware; my guess would be a faulty hard disk.

    How to pinpoint it:
    # iostat -m 10 -d -x

    This should let you see whether a specific disk is very busy. I'm not sure it will work with hardware RAID; Linux may see only one block device.
    I have used it on software RAID with a similar problem to find a problematic disk, and then looked at the disk with
    # smartctl --all /dev/sdb | grep -i health
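To pick out busy disks from the iostat output automatically, the extended report can be filtered on its last column, %util. This is a sketch under two assumptions: that %util is the final column of each device row, and that each device table starts with a line beginning with "Device". The helper name busy_devices and the default threshold of 90 are hypothetical.

```shell
# busy_devices [LIMIT]  (hypothetical helper)
# Reads `iostat -d -x` output on stdin and prints each device row whose
# %util (assumed to be the last column) is at or above LIMIT (default 90).
busy_devices() {
    LC_ALL=C awk -v limit="${1:-90}" '
        /^Device/ { in_table = 1; next }   # header row of a device table
        NF == 0   { in_table = 0; next }   # a blank line ends the table
        in_table && $NF + 0 >= limit { print $1, $NF }'
}

# Example: two reports, ten seconds apart
#   iostat -d -x 10 2 | busy_devices 90
```

On a hardware RAID controller this will typically show only the single logical device, as noted above.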


    Accepted Solution

    Hi all,

    Thanks for the suggestions. I finally (sort of) sorted out the issue.

    It looks like the 'auto' mode for write-back caching was not working on the Areca controller: although the battery backup unit was installed and 100% charged, the write-back cache apparently was not being used. So I set it to 'enabled', and now the machine can handle multiple tasks without stuttering.

    I'm not 100% sure this was the issue (I've also replaced the dual-core CPU with a quad-core), and load still seems to get quite high during a dd test run of 8 gigabytes. But the machine stays responsive under load, and when I run other commands (a simple ls, or another dd to another hardware RAID disk set) they go on and on without a glitch :)

    in regard to the questions:

    @ai_ja_nai - Yes the kernels were the same
    @noci - I know that high iowait does not necessarily mean something is wrong, but the strange thing was that the machine became very unresponsive while doing just a simple dd, with no other users or processes. (This was also the case with the rsync run I posted above, which ran on the SAME box with no network connection involved.)
    @diepes - I checked the SMART values of the disks using the RAID controller web interface and they all seemed OK.
