File transfer between SSD and RAID array uber-slow

Posted on 2014-10-13
Last Modified: 2016-12-08
It's really simple: this server has an SSD for the main drive, and a RAID array (mdadm) for storage.

Intermittently, it has a problem writing to the array: the write speeds slow down to the point where it's really not that usable.

I just finished testing the speed, and copying a 1G file from the SSD to the RAID array took almost an hour:

root@office:/tmp# rsync -vhr --progress test.img  /mnt/raid/
sending incremental file list
       1.07G 100%  301.99kB/s    0:57:52 (xfer#1, to-check=0/1)

sent 1.07G bytes  received 31 bytes  309.25K bytes/sec
total size is 1.07G  speedup is 1.0

Open in new window

The array seems to be fine:
root@office:/tmp# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 sdb[0] sdd[3] sdc[1]
      5860270080 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
unused devices: <none>

Open in new window

and syslog doesn't mention anything.

Where else should I be looking to troubleshoot this?

PS. About a month ago, we had to replace a switch after a thunderstorm. This problem started around that time as well. I can't tell if it is a failing motherboard or a controller, or something else that is broken. I would prefer to find a smoking gun before I advise the client to start replacing hardware...
Question by:DrDamnit
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 6
  • 3
LVL 15

Expert Comment

ID: 40378279
Who is the manufacturer of the SSD drive?  What 3rd party drivers, if any, are being used by the SATA and RAID controllers?
The manufacturer of the SATA controller?  Is it on-board or bus-connected?
The make/model of server?
The make/model of RAID controller?
Speed and size of physical drives in RAID?
LVL 47

Expert Comment

ID: 40378345
a 512KB chunk size with a 3-disk software RAID5?    

First try "time cp test.img /dev/null" and see how long it takes
then repeat with time cp /mnt/raid/test.img /dev/null

just to set a baseline
then repeat the copy using dd if=test.img of=/mnt/raid/test.img bs=64k  and see how quickly that test runs.

what are the timers?  At the very least chunk size is way to big unless the whole point of this is large block I/O.  rsync is also not the most efficient program to begin with.
LVL 32

Author Comment

ID: 40379087
I was using rsync because it gave me a timer. "time" is a much better way of doing it.

Here are your timers. The SSD is fast as hell:
root@office:/tmp# time cp test.img /dev/null 

real	0m0.274s
user	0m0.004s
sys	0m0.268s

Open in new window

The regular copy from SSD to RAID array was still very slow. I gave up 9 minutes into the copy trying to copy it from the SSD to /mnt/raid:

root@office:/tmp# time cp test.img /mnt/raid/
real	9m4.918s
user	0m0.000s
sys	0m0.980s

root@office:/tmp# cd /mnt/raid/
root@office:/mnt/raid# ls -lah *.img
-rw-r--r-- 1 root root 564M Oct 13 21:36 test.img

Open in new window

DD seemed to be interesting. On one session I was running watch -n1 ls -lah so I can see the transfer rates. I removed /mnt/raid/test.img, and then ran the dd command. Almost instantly we jump to 391 MB, but then as it progressed, it got slower and slower. By the time I finished writing this paragraph, it was only up to 440M, and increasing about 1MB per 3 seconds.

64k bs seems to increase the write speed significantly; however, it doesn't change the fact that this should not take nearly this long. On this workstationg (the one I am typing on, not the server with the problem) copying a 1GB file from the SATA drive to my RAID array took only 6 seconds:

sudo time cp test.img /mnt/raid/
0.02 user
2.74 system
0:06.45 elapsed

(0avgtext+0avgdata 940maxresident)k
104inputs+2097152outputs (1major+295minor)pagefaults 0swaps

Open in new window

Back to server DD that I was talking about before, it has been at least a minute, and we have only progressed 40 MB to 482MB written. The more it writes, the slower it gets. In fact, it is so slow now, I am not going to continue watching. I am going to post this and check back later. It has gone a full minute without incrementing from 482MB copied to 483MB copied. ...and dd is still running.
Free Webinar: AWS Backup & DR

Join our upcoming webinar with experts from AWS, CloudBerry Lab, and the Town of Edgartown IT to discuss best practices for simplifying online backup management and cutting costs.

LVL 47

Expert Comment

ID: 40379156
Well there is your problem.  You have large block I/o and software RAID5.   If you want write speed you need to add a HDD and go to RAID10.

Going to a hardware RAID controller won't help. You'll saturate the write cache
no matter how big the controller is.

Copying the 1GB file is a red herring. It didn't sync the file system.  All it is doing is copying to RAM and only part of it is on the HDD.  

Your numbers are typical for consumer class SATA disks.
LVL 32

Author Comment

ID: 40379159
But, this works perfectly on give other setups that are basically the same (except they are non ssd, and this one is ssd). I've, this was working fine until a few weeks ago...
LVL 47

Expert Comment

ID: 40379161
The problem is RAID5 write speed on crappy consumer class HDDs.  Your motherboard is probably saturating I/O to the HDD SATA ports as well.  Look at the block diagram for your motherboard and see what kind of bus architecure you have.  Are all the ports competing for the same   bandwidth?  

RAID5 is a pig, Cheap consumer class drives are awful.  Even vibration can cause 40% or more loss in I/O throughput.
LVL 32

Author Comment

ID: 40379165
I'll have to look this up on the morning. The machine is 800 miles away.
LVL 32

Author Comment

ID: 40380143
OK... dd is slow as *** too.

root@office:/tmp# time dd if=test.img of=/mnt/raid/test.img bs=64k
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB) copied, 1786.1 s, 601 kB/s

real	29m46.104s
user	0m0.008s
sys	0m1.404s

Open in new window

LVL 32

Author Comment

ID: 40380153
Just notcied this in dmesg:

[2240187.903570] Call Trace:
[2240187.903571]  [<ffffffff8134f5b3>] ? schedule_timeout+0x2c/0xdb
[2240187.903574]  [<ffffffff8103720c>] ? test_tsk_need_resched+0xa/0x13
[2240187.903575]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903577]  [<ffffffff810420a0>] ? __cond_resched+0x1d/0x26
[2240187.903579]  [<ffffffff8134f1f9>] ? wait_for_common+0xa0/0x119
[2240187.903580]  [<ffffffff8103f6e2>] ? try_to_wake_up+0x197/0x197
[2240187.903582]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903583]  [<ffffffff81117691>] ? writeback_inodes_sb_nr+0x7c/0x82
[2240187.903585]  [<ffffffff8111c20f>] ? __sync_filesystem+0x4a/0x76
[2240187.903587]  [<ffffffff810fd36b>] ? iterate_supers+0x5e/0xab
[2240187.903588]  [<ffffffff8111c2b1>] ? sys_sync+0x29/0x57
[2240187.903590]  [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b
[2240187.903591] INFO: task sync:6315 blocked for more than 120 seconds.
[2240187.903594] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2240187.903597] sync            D ffff880116992780     0  6315   6301 0x00000000
[2240187.903599]  ffff880116992780 0000000000000086 ffff880100000001 ffff88011806b1c0
[2240187.903601]  0000000000013780 ffff8800c1bfffd8 ffff8800c1bfffd8 ffff880116992780
[2240187.903603]  ffff880116992780 00000001926d4400 ffff8800926d4400 7fffffffffffffff
[2240187.903604] Call Trace:
[2240187.903606]  [<ffffffff8134f5b3>] ? schedule_timeout+0x2c/0xdb
[2240187.903607]  [<ffffffff8103720c>] ? test_tsk_need_resched+0xa/0x13
[2240187.903609]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903610]  [<ffffffff810420a0>] ? __cond_resched+0x1d/0x26
[2240187.903612]  [<ffffffff8134f1f9>] ? wait_for_common+0xa0/0x119
[2240187.903613]  [<ffffffff8103f6e2>] ? try_to_wake_up+0x197/0x197
[2240187.903614]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903616]  [<ffffffff81117691>] ? writeback_inodes_sb_nr+0x7c/0x82
[2240187.903618]  [<ffffffff8111c20f>] ? __sync_filesystem+0x4a/0x76
[2240187.903619]  [<ffffffff810fd36b>] ? iterate_supers+0x5e/0xab
[2240187.903621]  [<ffffffff8111c2b1>] ? sys_sync+0x29/0x57
[2240187.903622]  [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b

Open in new window

LVL 32

Accepted Solution

DrDamnit earned 0 total points
ID: 40598017
Turns out, there was a script that was running on cron, which executed the sync command. It was running every five minutes, and it was generating tens of zombie processes all trying to deal with the array. Kill the script, reboot, problem vanished.

Featured Post

Simple, centralized multimedia control

Watch and learn to see how ATEN provided an easy and effective way for three jointly-owned pubs to control the 60 televisions located across their three venues utilizing the ATEN Control System, Modular Matrix Switch and HDBaseT extenders.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Suggested Solutions

Data center, now-a-days, is referred as the home of all the advanced technologies. In-fact, most of the businesses are now establishing their entire organizational structure around the IT capabilities.
Fine Tune your automatic Updates for Ubuntu / Debian
This video teaches viewers how to encrypt an external drive that requires a password to read and edit the drive. All tasks are done in Disk Utility. Plug in the external drive you wish to encrypt: Make sure all previous data on the drive has been …
This Micro Tutorial will teach you how to reformat your flash drive. Sometimes your flash drive may have issues carrying files so this will completely restore it to manufacturing settings. Make sure to backup all files before reformatting. This w…

733 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question