Solved

File transfer between SSD and RAID array uber-slow

Posted on 2014-10-13
10
219 Views
Last Modified: 2016-12-08
It's really simple: this server has an SSD for the main drive, and a RAID array (mdadm) for storage.

Intermittently, it has a problem writing to the array: the write speeds slow down to the point where it's really not that usable.

I just finished testing the speed, and copying a 1G file from the SSD to the RAID array took almost an hour:

root@office:/tmp# rsync -vhr --progress test.img  /mnt/raid/
sending incremental file list
test.img
       1.07G 100%  301.99kB/s    0:57:52 (xfer#1, to-check=0/1)

sent 1.07G bytes  received 31 bytes  309.25K bytes/sec
total size is 1.07G  speedup is 1.0

Open in new window


The array seems to be fine:
root@office:/tmp# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 sdb[0] sdd[3] sdc[1]
      5860270080 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      
unused devices: <none>

Open in new window


and syslog doesn't mention anything.

Where else should I be looking to troubleshoot this?

PS. About a month ago, we had to replace a switch after a thunderstorm. This problem started around that time as well. I can't tell if it is a failing motherboard or a controller, or something else that is broken. I would prefer to find a smoking gun before I advise the client to start replacing hardware...
0
Comment
Question by:DrDamnit
[X]
Welcome to Experts Exchange

Add your voice to the tech community where 5M+ people just like you are talking about what matters.

  • Help others & share knowledge
  • Earn cash & points
  • Learn & ask questions
  • 6
  • 3
10 Comments
 
LVL 15

Expert Comment

by:WalkaboutTigger
ID: 40378279
Who is the manufacturer of the SSD drive?  What 3rd party drivers, if any, are being used by the SATA and RAID controllers?
The manufacturer of the SATA controller?  Is it on-board or bus-connected?
The make/model of server?
The make/model of RAID controller?
Speed and size of physical drives in RAID?
0
 
LVL 47

Expert Comment

by:David
ID: 40378345
a 512KB chunk size with a 3-disk software RAID5?    

First try "time cp test.img /dev/null" and see how long it takes
then repeat with time cp /mnt/raid/test.img /dev/null

just to set a baseline
then repeat the copy using dd if=test.img of=/mnt/raid/test.img bs=64k  and see how quickly that test runs.

what are the timers?  At the very least chunk size is way to big unless the whole point of this is large block I/O.  rsync is also not the most efficient program to begin with.
0
 
LVL 32

Author Comment

by:DrDamnit
ID: 40379087
I was using rsync because it gave me a timer. "time" is a much better way of doing it.

Here are your timers. The SSD is fast as hell:
root@office:/tmp# time cp test.img /dev/null 

real	0m0.274s
user	0m0.004s
sys	0m0.268s

Open in new window


The regular copy from SSD to RAID array was still very slow. I gave up 9 minutes into the copy trying to copy it from the SSD to /mnt/raid:

root@office:/tmp# time cp test.img /mnt/raid/
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C
real	9m4.918s
user	0m0.000s
sys	0m0.980s

root@office:/tmp# cd /mnt/raid/
root@office:/mnt/raid# ls -lah *.img
-rw-r--r-- 1 root root 564M Oct 13 21:36 test.img

Open in new window


DD seemed to be interesting. On one session I was running watch -n1 ls -lah so I can see the transfer rates. I removed /mnt/raid/test.img, and then ran the dd command. Almost instantly we jump to 391 MB, but then as it progressed, it got slower and slower. By the time I finished writing this paragraph, it was only up to 440M, and increasing about 1MB per 3 seconds.

<WhileIWasWaitingForDDToFinish>
64k bs seems to increase the write speed significantly; however, it doesn't change the fact that this should not take nearly this long. On this workstationg (the one I am typing on, not the server with the problem) copying a 1GB file from the SATA drive to my RAID array took only 6 seconds:

sudo time cp test.img /mnt/raid/
0.02 user
2.74 system
0:06.45 elapsed
42%CPU

(0avgtext+0avgdata 940maxresident)k
104inputs+2097152outputs (1major+295minor)pagefaults 0swaps

Open in new window


</WhileIWasWaitingForDDToFinish>
Back to server DD that I was talking about before, it has been at least a minute, and we have only progressed 40 MB to 482MB written. The more it writes, the slower it gets. In fact, it is so slow now, I am not going to continue watching. I am going to post this and check back later. It has gone a full minute without incrementing from 482MB copied to 483MB copied. ...and dd is still running.
0
Migrating Your Company's PCs

To keep pace with competitors, businesses must keep employees productive, and that means providing them with the latest technology. This document provides the tips and tricks you need to help you migrate an outdated PC fleet to new desktops, laptops, and tablets.

 
LVL 47

Expert Comment

by:David
ID: 40379156
Well there is your problem.  You have large block I/o and software RAID5.   If you want write speed you need to add a HDD and go to RAID10.

Going to a hardware RAID controller won't help. You'll saturate the write cache
no matter how big the controller is.

Copying the 1GB file is a red herring. It didn't sync the file system.  All it is doing is copying to RAM and only part of it is on the HDD.  

Your numbers are typical for consumer class SATA disks.
0
 
LVL 32

Author Comment

by:DrDamnit
ID: 40379159
But, this works perfectly on give other setups that are basically the same (except they are non ssd, and this one is ssd). I've, this was working fine until a few weeks ago...
0
 
LVL 47

Expert Comment

by:David
ID: 40379161
The problem is RAID5 write speed on crappy consumer class HDDs.  Your motherboard is probably saturating I/O to the HDD SATA ports as well.  Look at the block diagram for your motherboard and see what kind of bus architecure you have.  Are all the ports competing for the same   bandwidth?  

RAID5 is a pig, Cheap consumer class drives are awful.  Even vibration can cause 40% or more loss in I/O throughput.
0
 
LVL 32

Author Comment

by:DrDamnit
ID: 40379165
I'll have to look this up on the morning. The machine is 800 miles away.
0
 
LVL 32

Author Comment

by:DrDamnit
ID: 40380143
OK... dd is slow as *** too.

root@office:/tmp# time dd if=test.img of=/mnt/raid/test.img bs=64k
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB) copied, 1786.1 s, 601 kB/s

real	29m46.104s
user	0m0.008s
sys	0m1.404s

Open in new window

0
 
LVL 32

Author Comment

by:DrDamnit
ID: 40380153
Just notcied this in dmesg:

[2240187.903570] Call Trace:
[2240187.903571]  [<ffffffff8134f5b3>] ? schedule_timeout+0x2c/0xdb
[2240187.903574]  [<ffffffff8103720c>] ? test_tsk_need_resched+0xa/0x13
[2240187.903575]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903577]  [<ffffffff810420a0>] ? __cond_resched+0x1d/0x26
[2240187.903579]  [<ffffffff8134f1f9>] ? wait_for_common+0xa0/0x119
[2240187.903580]  [<ffffffff8103f6e2>] ? try_to_wake_up+0x197/0x197
[2240187.903582]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903583]  [<ffffffff81117691>] ? writeback_inodes_sb_nr+0x7c/0x82
[2240187.903585]  [<ffffffff8111c20f>] ? __sync_filesystem+0x4a/0x76
[2240187.903587]  [<ffffffff810fd36b>] ? iterate_supers+0x5e/0xab
[2240187.903588]  [<ffffffff8111c2b1>] ? sys_sync+0x29/0x57
[2240187.903590]  [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b
[2240187.903591] INFO: task sync:6315 blocked for more than 120 seconds.
[2240187.903594] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2240187.903597] sync            D ffff880116992780     0  6315   6301 0x00000000
[2240187.903599]  ffff880116992780 0000000000000086 ffff880100000001 ffff88011806b1c0
[2240187.903601]  0000000000013780 ffff8800c1bfffd8 ffff8800c1bfffd8 ffff880116992780
[2240187.903603]  ffff880116992780 00000001926d4400 ffff8800926d4400 7fffffffffffffff
[2240187.903604] Call Trace:
[2240187.903606]  [<ffffffff8134f5b3>] ? schedule_timeout+0x2c/0xdb
[2240187.903607]  [<ffffffff8103720c>] ? test_tsk_need_resched+0xa/0x13
[2240187.903609]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903610]  [<ffffffff810420a0>] ? __cond_resched+0x1d/0x26
[2240187.903612]  [<ffffffff8134f1f9>] ? wait_for_common+0xa0/0x119
[2240187.903613]  [<ffffffff8103f6e2>] ? try_to_wake_up+0x197/0x197
[2240187.903614]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903616]  [<ffffffff81117691>] ? writeback_inodes_sb_nr+0x7c/0x82
[2240187.903618]  [<ffffffff8111c20f>] ? __sync_filesystem+0x4a/0x76
[2240187.903619]  [<ffffffff810fd36b>] ? iterate_supers+0x5e/0xab
[2240187.903621]  [<ffffffff8111c2b1>] ? sys_sync+0x29/0x57
[2240187.903622]  [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b

Open in new window

0
 
LVL 32

Accepted Solution

by:
DrDamnit earned 0 total points
ID: 40598017
Turns out, there was a script that was running on cron, which executed the sync command. It was running every five minutes, and it was generating tens of zombie processes all trying to deal with the array. Kill the script, reboot, problem vanished.
0

Featured Post

Optimizing Cloud Backup for Low Bandwidth

With cloud storage prices going down a growing number of SMBs start to use it for backup storage. Unfortunately, business data volume rarely fits the average Internet speed. This article provides an overview of main Internet speed challenges and reveals backup best practices.

Question has a verified solution.

If you are experiencing a similar issue, please ask a related question

Concerto Cloud Services, a provider of fully managed private, public and hybrid cloud solutions, announced today it was named to the 20 Coolest Cloud Infrastructure Vendors Of The 2017 Cloud  (http://www.concertocloud.com/about/in-the-news/2017/02/0…
In this article we will learn how to backup a VMware farm using Nakivo Backup & Replication. In this tutorial we will install the software on a Windows 2012 R2 Server.
Learn how to get help with Linux/Unix bash shell commands. Use help to read help documents for built in bash shell commands.: Use man to interface with the online reference manuals for shell commands.: Use man to search man pages for unknown command…
Learn how to find files with the shell using the find and locate commands. Use locate to find a needle in a haystack.: With locate, check if the file still exists.: Use find to get the actual location of the file.:

729 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question