File transfer between SSD and RAID array uber-slow

Posted on 2014-10-13
Medium Priority
Last Modified: 2016-12-08
It's really simple: this server has an SSD for the main drive, and a RAID array (mdadm) for storage.

Intermittently, it has a problem writing to the array: the write speeds slow down to the point where it's really not that usable.

I just finished testing the speed, and copying a 1G file from the SSD to the RAID array took almost an hour:

root@office:/tmp# rsync -vhr --progress test.img  /mnt/raid/
sending incremental file list
       1.07G 100%  301.99kB/s    0:57:52 (xfer#1, to-check=0/1)

sent 1.07G bytes  received 31 bytes  309.25K bytes/sec
total size is 1.07G  speedup is 1.0

Open in new window

The array seems to be fine:
root@office:/tmp# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 sdb[0] sdd[3] sdc[1]
      5860270080 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
unused devices: <none>

Open in new window

and syslog doesn't mention anything.

Where else should I be looking to troubleshoot this?

PS. About a month ago, we had to replace a switch after a thunderstorm. This problem started around that time as well. I can't tell if it is a failing motherboard or a controller, or something else that is broken. I would prefer to find a smoking gun before I advise the client to start replacing hardware...
Question by:DrDamnit
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 6
  • 3
LVL 15

Expert Comment

ID: 40378279
Who is the manufacturer of the SSD drive?  What 3rd party drivers, if any, are being used by the SATA and RAID controllers?
The manufacturer of the SATA controller?  Is it on-board or bus-connected?
The make/model of server?
The make/model of RAID controller?
Speed and size of physical drives in RAID?
LVL 47

Expert Comment

ID: 40378345
a 512KB chunk size with a 3-disk software RAID5?    

First try "time cp test.img /dev/null" and see how long it takes
then repeat with time cp /mnt/raid/test.img /dev/null

just to set a baseline
then repeat the copy using dd if=test.img of=/mnt/raid/test.img bs=64k  and see how quickly that test runs.

what are the timers?  At the very least chunk size is way to big unless the whole point of this is large block I/O.  rsync is also not the most efficient program to begin with.
LVL 32

Author Comment

ID: 40379087
I was using rsync because it gave me a timer. "time" is a much better way of doing it.

Here are your timers. The SSD is fast as hell:
root@office:/tmp# time cp test.img /dev/null 

real	0m0.274s
user	0m0.004s
sys	0m0.268s

Open in new window

The regular copy from SSD to RAID array was still very slow. I gave up 9 minutes into the copy trying to copy it from the SSD to /mnt/raid:

root@office:/tmp# time cp test.img /mnt/raid/
real	9m4.918s
user	0m0.000s
sys	0m0.980s

root@office:/tmp# cd /mnt/raid/
root@office:/mnt/raid# ls -lah *.img
-rw-r--r-- 1 root root 564M Oct 13 21:36 test.img

Open in new window

DD seemed to be interesting. On one session I was running watch -n1 ls -lah so I can see the transfer rates. I removed /mnt/raid/test.img, and then ran the dd command. Almost instantly we jump to 391 MB, but then as it progressed, it got slower and slower. By the time I finished writing this paragraph, it was only up to 440M, and increasing about 1MB per 3 seconds.

64k bs seems to increase the write speed significantly; however, it doesn't change the fact that this should not take nearly this long. On this workstationg (the one I am typing on, not the server with the problem) copying a 1GB file from the SATA drive to my RAID array took only 6 seconds:

sudo time cp test.img /mnt/raid/
0.02 user
2.74 system
0:06.45 elapsed

(0avgtext+0avgdata 940maxresident)k
104inputs+2097152outputs (1major+295minor)pagefaults 0swaps

Open in new window

Back to server DD that I was talking about before, it has been at least a minute, and we have only progressed 40 MB to 482MB written. The more it writes, the slower it gets. In fact, it is so slow now, I am not going to continue watching. I am going to post this and check back later. It has gone a full minute without incrementing from 482MB copied to 483MB copied. ...and dd is still running.
WordPress Tutorial 2: Terminology

An important part of learning any new piece of software is understanding the terminology it uses. Thankfully WordPress uses fairly simple names for everything that make it easy to start using the software.

LVL 47

Expert Comment

ID: 40379156
Well there is your problem.  You have large block I/o and software RAID5.   If you want write speed you need to add a HDD and go to RAID10.

Going to a hardware RAID controller won't help. You'll saturate the write cache
no matter how big the controller is.

Copying the 1GB file is a red herring. It didn't sync the file system.  All it is doing is copying to RAM and only part of it is on the HDD.  

Your numbers are typical for consumer class SATA disks.
LVL 32

Author Comment

ID: 40379159
But, this works perfectly on give other setups that are basically the same (except they are non ssd, and this one is ssd). I've, this was working fine until a few weeks ago...
LVL 47

Expert Comment

ID: 40379161
The problem is RAID5 write speed on crappy consumer class HDDs.  Your motherboard is probably saturating I/O to the HDD SATA ports as well.  Look at the block diagram for your motherboard and see what kind of bus architecure you have.  Are all the ports competing for the same   bandwidth?  

RAID5 is a pig, Cheap consumer class drives are awful.  Even vibration can cause 40% or more loss in I/O throughput.
LVL 32

Author Comment

ID: 40379165
I'll have to look this up on the morning. The machine is 800 miles away.
LVL 32

Author Comment

ID: 40380143
OK... dd is slow as *** too.

root@office:/tmp# time dd if=test.img of=/mnt/raid/test.img bs=64k
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB) copied, 1786.1 s, 601 kB/s

real	29m46.104s
user	0m0.008s
sys	0m1.404s

Open in new window

LVL 32

Author Comment

ID: 40380153
Just notcied this in dmesg:

[2240187.903570] Call Trace:
[2240187.903571]  [<ffffffff8134f5b3>] ? schedule_timeout+0x2c/0xdb
[2240187.903574]  [<ffffffff8103720c>] ? test_tsk_need_resched+0xa/0x13
[2240187.903575]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903577]  [<ffffffff810420a0>] ? __cond_resched+0x1d/0x26
[2240187.903579]  [<ffffffff8134f1f9>] ? wait_for_common+0xa0/0x119
[2240187.903580]  [<ffffffff8103f6e2>] ? try_to_wake_up+0x197/0x197
[2240187.903582]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903583]  [<ffffffff81117691>] ? writeback_inodes_sb_nr+0x7c/0x82
[2240187.903585]  [<ffffffff8111c20f>] ? __sync_filesystem+0x4a/0x76
[2240187.903587]  [<ffffffff810fd36b>] ? iterate_supers+0x5e/0xab
[2240187.903588]  [<ffffffff8111c2b1>] ? sys_sync+0x29/0x57
[2240187.903590]  [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b
[2240187.903591] INFO: task sync:6315 blocked for more than 120 seconds.
[2240187.903594] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2240187.903597] sync            D ffff880116992780     0  6315   6301 0x00000000
[2240187.903599]  ffff880116992780 0000000000000086 ffff880100000001 ffff88011806b1c0
[2240187.903601]  0000000000013780 ffff8800c1bfffd8 ffff8800c1bfffd8 ffff880116992780
[2240187.903603]  ffff880116992780 00000001926d4400 ffff8800926d4400 7fffffffffffffff
[2240187.903604] Call Trace:
[2240187.903606]  [<ffffffff8134f5b3>] ? schedule_timeout+0x2c/0xdb
[2240187.903607]  [<ffffffff8103720c>] ? test_tsk_need_resched+0xa/0x13
[2240187.903609]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903610]  [<ffffffff810420a0>] ? __cond_resched+0x1d/0x26
[2240187.903612]  [<ffffffff8134f1f9>] ? wait_for_common+0xa0/0x119
[2240187.903613]  [<ffffffff8103f6e2>] ? try_to_wake_up+0x197/0x197
[2240187.903614]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903616]  [<ffffffff81117691>] ? writeback_inodes_sb_nr+0x7c/0x82
[2240187.903618]  [<ffffffff8111c20f>] ? __sync_filesystem+0x4a/0x76
[2240187.903619]  [<ffffffff810fd36b>] ? iterate_supers+0x5e/0xab
[2240187.903621]  [<ffffffff8111c2b1>] ? sys_sync+0x29/0x57
[2240187.903622]  [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b

Open in new window

LVL 32

Accepted Solution

DrDamnit earned 0 total points
ID: 40598017
Turns out, there was a script that was running on cron, which executed the sync command. It was running every five minutes, and it was generating tens of zombie processes all trying to deal with the array. Kill the script, reboot, problem vanished.

Featured Post

Percona Live Europe 2017 | Sep 25 - 27, 2017

The Percona Live Open Source Database Conference Europe 2017 is the premier event for the diverse and active European open source database community, as well as businesses that develop and use open source database software.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Finding original email is quite difficult due to their duplicates. From this article, you will come to know why multiple duplicates of same emails appear and how to delete duplicate emails from Outlook securely and instantly while vital emails remai…
This article aims to explain the working of CircularLogArchiver. This tool was designed to solve the buildup of log file in cases where systems do not support circular logging or where circular logging is not enabled
Learn several ways to interact with files and get file information from the bash shell. ls lists the contents of a directory: Using the -a flag displays hidden files: Using the -l flag formats the output in a long list: The file command gives us mor…
Connecting to an Amazon Linux EC2 Instance from Windows Using PuTTY.
Suggested Courses
Course of the Month8 days, 22 hours left to enroll

764 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question