File transfer between SSD and RAID array uber-slow

It's really simple: this server has an SSD for the main drive, and a RAID array (mdadm) for storage.

Intermittently, it has a problem writing to the array: the write speeds slow down to the point where it's really not that usable.

I just finished testing the speed, and copying a 1G file from the SSD to the RAID array took almost an hour:

root@office:/tmp# rsync -vhr --progress test.img  /mnt/raid/
sending incremental file list
       1.07G 100%  301.99kB/s    0:57:52 (xfer#1, to-check=0/1)

sent 1.07G bytes  received 31 bytes  309.25K bytes/sec
total size is 1.07G  speedup is 1.0

Open in new window

The array seems to be fine:
root@office:/tmp# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 sdb[0] sdd[3] sdc[1]
      5860270080 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
unused devices: <none>

Open in new window

and syslog doesn't mention anything.

Where else should I be looking to troubleshoot this?

PS. About a month ago, we had to replace a switch after a thunderstorm. This problem started around that time as well. I can't tell if it is a failing motherboard or a controller, or something else that is broken. I would prefer to find a smoking gun before I advise the client to start replacing hardware...
LVL 32
Who is Participating?
I wear a lot of hats...

"The solutions and answers provided on Experts Exchange have been extremely helpful to me over the last few years. I wear a lot of hats - Developer, Database Administrator, Help Desk, etc., so I know a lot of things but not a lot about one thing. Experts Exchange gives me answers from people who do know a lot about one thing, in a easy to use platform." -Todd S.

Darrell PorterEnterprise Business Process ArchitectCommented:
Who is the manufacturer of the SSD drive?  What 3rd party drivers, if any, are being used by the SATA and RAID controllers?
The manufacturer of the SATA controller?  Is it on-board or bus-connected?
The make/model of server?
The make/model of RAID controller?
Speed and size of physical drives in RAID?
a 512KB chunk size with a 3-disk software RAID5?    

First try "time cp test.img /dev/null" and see how long it takes
then repeat with time cp /mnt/raid/test.img /dev/null

just to set a baseline
then repeat the copy using dd if=test.img of=/mnt/raid/test.img bs=64k  and see how quickly that test runs.

what are the timers?  At the very least chunk size is way to big unless the whole point of this is large block I/O.  rsync is also not the most efficient program to begin with.
DrDamnitAuthor Commented:
I was using rsync because it gave me a timer. "time" is a much better way of doing it.

Here are your timers. The SSD is fast as hell:
root@office:/tmp# time cp test.img /dev/null 

real	0m0.274s
user	0m0.004s
sys	0m0.268s

Open in new window

The regular copy from SSD to RAID array was still very slow. I gave up 9 minutes into the copy trying to copy it from the SSD to /mnt/raid:

root@office:/tmp# time cp test.img /mnt/raid/
real	9m4.918s
user	0m0.000s
sys	0m0.980s

root@office:/tmp# cd /mnt/raid/
root@office:/mnt/raid# ls -lah *.img
-rw-r--r-- 1 root root 564M Oct 13 21:36 test.img

Open in new window

DD seemed to be interesting. On one session I was running watch -n1 ls -lah so I can see the transfer rates. I removed /mnt/raid/test.img, and then ran the dd command. Almost instantly we jump to 391 MB, but then as it progressed, it got slower and slower. By the time I finished writing this paragraph, it was only up to 440M, and increasing about 1MB per 3 seconds.

64k bs seems to increase the write speed significantly; however, it doesn't change the fact that this should not take nearly this long. On this workstationg (the one I am typing on, not the server with the problem) copying a 1GB file from the SATA drive to my RAID array took only 6 seconds:

sudo time cp test.img /mnt/raid/
0.02 user
2.74 system
0:06.45 elapsed

(0avgtext+0avgdata 940maxresident)k
104inputs+2097152outputs (1major+295minor)pagefaults 0swaps

Open in new window

Back to server DD that I was talking about before, it has been at least a minute, and we have only progressed 40 MB to 482MB written. The more it writes, the slower it gets. In fact, it is so slow now, I am not going to continue watching. I am going to post this and check back later. It has gone a full minute without incrementing from 482MB copied to 483MB copied. ...and dd is still running.
Powerful Yet Easy-to-Use Network Monitoring

Identify excessive bandwidth utilization or unexpected application traffic with SolarWinds Bandwidth Analyzer Pack.

Well there is your problem.  You have large block I/o and software RAID5.   If you want write speed you need to add a HDD and go to RAID10.

Going to a hardware RAID controller won't help. You'll saturate the write cache
no matter how big the controller is.

Copying the 1GB file is a red herring. It didn't sync the file system.  All it is doing is copying to RAM and only part of it is on the HDD.  

Your numbers are typical for consumer class SATA disks.
DrDamnitAuthor Commented:
But, this works perfectly on give other setups that are basically the same (except they are non ssd, and this one is ssd). I've, this was working fine until a few weeks ago...
The problem is RAID5 write speed on crappy consumer class HDDs.  Your motherboard is probably saturating I/O to the HDD SATA ports as well.  Look at the block diagram for your motherboard and see what kind of bus architecure you have.  Are all the ports competing for the same   bandwidth?  

RAID5 is a pig, Cheap consumer class drives are awful.  Even vibration can cause 40% or more loss in I/O throughput.
DrDamnitAuthor Commented:
I'll have to look this up on the morning. The machine is 800 miles away.
DrDamnitAuthor Commented:
OK... dd is slow as *** too.

root@office:/tmp# time dd if=test.img of=/mnt/raid/test.img bs=64k
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB) copied, 1786.1 s, 601 kB/s

real	29m46.104s
user	0m0.008s
sys	0m1.404s

Open in new window

DrDamnitAuthor Commented:
Just notcied this in dmesg:

[2240187.903570] Call Trace:
[2240187.903571]  [<ffffffff8134f5b3>] ? schedule_timeout+0x2c/0xdb
[2240187.903574]  [<ffffffff8103720c>] ? test_tsk_need_resched+0xa/0x13
[2240187.903575]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903577]  [<ffffffff810420a0>] ? __cond_resched+0x1d/0x26
[2240187.903579]  [<ffffffff8134f1f9>] ? wait_for_common+0xa0/0x119
[2240187.903580]  [<ffffffff8103f6e2>] ? try_to_wake_up+0x197/0x197
[2240187.903582]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903583]  [<ffffffff81117691>] ? writeback_inodes_sb_nr+0x7c/0x82
[2240187.903585]  [<ffffffff8111c20f>] ? __sync_filesystem+0x4a/0x76
[2240187.903587]  [<ffffffff810fd36b>] ? iterate_supers+0x5e/0xab
[2240187.903588]  [<ffffffff8111c2b1>] ? sys_sync+0x29/0x57
[2240187.903590]  [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b
[2240187.903591] INFO: task sync:6315 blocked for more than 120 seconds.
[2240187.903594] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2240187.903597] sync            D ffff880116992780     0  6315   6301 0x00000000
[2240187.903599]  ffff880116992780 0000000000000086 ffff880100000001 ffff88011806b1c0
[2240187.903601]  0000000000013780 ffff8800c1bfffd8 ffff8800c1bfffd8 ffff880116992780
[2240187.903603]  ffff880116992780 00000001926d4400 ffff8800926d4400 7fffffffffffffff
[2240187.903604] Call Trace:
[2240187.903606]  [<ffffffff8134f5b3>] ? schedule_timeout+0x2c/0xdb
[2240187.903607]  [<ffffffff8103720c>] ? test_tsk_need_resched+0xa/0x13
[2240187.903609]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903610]  [<ffffffff810420a0>] ? __cond_resched+0x1d/0x26
[2240187.903612]  [<ffffffff8134f1f9>] ? wait_for_common+0xa0/0x119
[2240187.903613]  [<ffffffff8103f6e2>] ? try_to_wake_up+0x197/0x197
[2240187.903614]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903616]  [<ffffffff81117691>] ? writeback_inodes_sb_nr+0x7c/0x82
[2240187.903618]  [<ffffffff8111c20f>] ? __sync_filesystem+0x4a/0x76
[2240187.903619]  [<ffffffff810fd36b>] ? iterate_supers+0x5e/0xab
[2240187.903621]  [<ffffffff8111c2b1>] ? sys_sync+0x29/0x57
[2240187.903622]  [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b

Open in new window

DrDamnitAuthor Commented:
Turns out, there was a script that was running on cron, which executed the sync command. It was running every five minutes, and it was generating tens of zombie processes all trying to deal with the array. Kill the script, reboot, problem vanished.

Experts Exchange Solution brought to you by

Your issues matter to us.

Facing a tech roadblock? Get the help and guidance you need from experienced professionals who care. Ask your question anytime, anywhere, with no hassle.

Start your 7-day free trial
It's more than this solution.Get answers and train to solve all your tech problems - anytime, anywhere.Try it for free Edge Out The Competitionfor your dream job with proven skills and certifications.Get started today Stand Outas the employee with proven skills.Start learning today for free Move Your Career Forwardwith certification training in the latest technologies.Start your trial today

From novice to tech pro — start learning today.