Solved

File transfer between SSD and RAID array uber-slow

Posted on 2014-10-13
10
195 Views
Last Modified: 2015-02-09
It's really simple: this server has an SSD for the main drive, and a RAID array (mdadm) for storage.

Intermittently, it has a problem writing to the array: the write speeds slow down to the point where it's really not that usable.

I just finished testing the speed, and copying a 1G file from the SSD to the RAID array took almost an hour:

root@office:/tmp# rsync -vhr --progress test.img  /mnt/raid/
sending incremental file list
test.img
       1.07G 100%  301.99kB/s    0:57:52 (xfer#1, to-check=0/1)

sent 1.07G bytes  received 31 bytes  309.25K bytes/sec
total size is 1.07G  speedup is 1.0

Open in new window


The array seems to be fine:
root@office:/tmp# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md0 : active raid5 sdb[0] sdd[3] sdc[1]
      5860270080 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      
unused devices: <none>

Open in new window


and syslog doesn't mention anything.

Where else should I be looking to troubleshoot this?

PS. About a month ago, we had to replace a switch after a thunderstorm. This problem started around that time as well. I can't tell if it is a failing motherboard or a controller, or something else that is broken. I would prefer to find a smoking gun before I advise the client to start replacing hardware...
0
Comment
Question by:DrDamnit
  • 6
  • 3
10 Comments
 
LVL 15

Expert Comment

by:WalkaboutTigger
Comment Utility
Who is the manufacturer of the SSD drive?  What 3rd party drivers, if any, are being used by the SATA and RAID controllers?
The manufacturer of the SATA controller?  Is it on-board or bus-connected?
The make/model of server?
The make/model of RAID controller?
Speed and size of physical drives in RAID?
0
 
LVL 47

Expert Comment

by:dlethe
Comment Utility
a 512KB chunk size with a 3-disk software RAID5?    

First try "time cp test.img /dev/null" and see how long it takes
then repeat with time cp /mnt/raid/test.img /dev/null

just to set a baseline
then repeat the copy using dd if=test.img of=/mnt/raid/test.img bs=64k  and see how quickly that test runs.

what are the timers?  At the very least chunk size is way to big unless the whole point of this is large block I/O.  rsync is also not the most efficient program to begin with.
0
 
LVL 32

Author Comment

by:DrDamnit
Comment Utility
I was using rsync because it gave me a timer. "time" is a much better way of doing it.

Here are your timers. The SSD is fast as hell:
root@office:/tmp# time cp test.img /dev/null 

real	0m0.274s
user	0m0.004s
sys	0m0.268s

Open in new window


The regular copy from SSD to RAID array was still very slow. I gave up 9 minutes into the copy trying to copy it from the SSD to /mnt/raid:

root@office:/tmp# time cp test.img /mnt/raid/
^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C^C
real	9m4.918s
user	0m0.000s
sys	0m0.980s

root@office:/tmp# cd /mnt/raid/
root@office:/mnt/raid# ls -lah *.img
-rw-r--r-- 1 root root 564M Oct 13 21:36 test.img

Open in new window


DD seemed to be interesting. On one session I was running watch -n1 ls -lah so I can see the transfer rates. I removed /mnt/raid/test.img, and then ran the dd command. Almost instantly we jump to 391 MB, but then as it progressed, it got slower and slower. By the time I finished writing this paragraph, it was only up to 440M, and increasing about 1MB per 3 seconds.

<WhileIWasWaitingForDDToFinish>
64k bs seems to increase the write speed significantly; however, it doesn't change the fact that this should not take nearly this long. On this workstationg (the one I am typing on, not the server with the problem) copying a 1GB file from the SATA drive to my RAID array took only 6 seconds:

sudo time cp test.img /mnt/raid/
0.02 user
2.74 system
0:06.45 elapsed
42%CPU

(0avgtext+0avgdata 940maxresident)k
104inputs+2097152outputs (1major+295minor)pagefaults 0swaps

Open in new window


</WhileIWasWaitingForDDToFinish>
Back to server DD that I was talking about before, it has been at least a minute, and we have only progressed 40 MB to 482MB written. The more it writes, the slower it gets. In fact, it is so slow now, I am not going to continue watching. I am going to post this and check back later. It has gone a full minute without incrementing from 482MB copied to 483MB copied. ...and dd is still running.
0
 
LVL 47

Expert Comment

by:dlethe
Comment Utility
Well there is your problem.  You have large block I/o and software RAID5.   If you want write speed you need to add a HDD and go to RAID10.

Going to a hardware RAID controller won't help. You'll saturate the write cache
no matter how big the controller is.

Copying the 1GB file is a red herring. It didn't sync the file system.  All it is doing is copying to RAM and only part of it is on the HDD.  

Your numbers are typical for consumer class SATA disks.
0
 
LVL 32

Author Comment

by:DrDamnit
Comment Utility
But, this works perfectly on give other setups that are basically the same (except they are non ssd, and this one is ssd). I've, this was working fine until a few weeks ago...
0
Do You Know the 4 Main Threat Actor Types?

Do you know the main threat actor types? Most attackers fall into one of four categories, each with their own favored tactics, techniques, and procedures.

 
LVL 47

Expert Comment

by:dlethe
Comment Utility
The problem is RAID5 write speed on crappy consumer class HDDs.  Your motherboard is probably saturating I/O to the HDD SATA ports as well.  Look at the block diagram for your motherboard and see what kind of bus architecure you have.  Are all the ports competing for the same   bandwidth?  

RAID5 is a pig, Cheap consumer class drives are awful.  Even vibration can cause 40% or more loss in I/O throughput.
0
 
LVL 32

Author Comment

by:DrDamnit
Comment Utility
I'll have to look this up on the morning. The machine is 800 miles away.
0
 
LVL 32

Author Comment

by:DrDamnit
Comment Utility
OK... dd is slow as *** too.

root@office:/tmp# time dd if=test.img of=/mnt/raid/test.img bs=64k
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB) copied, 1786.1 s, 601 kB/s

real	29m46.104s
user	0m0.008s
sys	0m1.404s

Open in new window

0
 
LVL 32

Author Comment

by:DrDamnit
Comment Utility
Just notcied this in dmesg:

[2240187.903570] Call Trace:
[2240187.903571]  [<ffffffff8134f5b3>] ? schedule_timeout+0x2c/0xdb
[2240187.903574]  [<ffffffff8103720c>] ? test_tsk_need_resched+0xa/0x13
[2240187.903575]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903577]  [<ffffffff810420a0>] ? __cond_resched+0x1d/0x26
[2240187.903579]  [<ffffffff8134f1f9>] ? wait_for_common+0xa0/0x119
[2240187.903580]  [<ffffffff8103f6e2>] ? try_to_wake_up+0x197/0x197
[2240187.903582]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903583]  [<ffffffff81117691>] ? writeback_inodes_sb_nr+0x7c/0x82
[2240187.903585]  [<ffffffff8111c20f>] ? __sync_filesystem+0x4a/0x76
[2240187.903587]  [<ffffffff810fd36b>] ? iterate_supers+0x5e/0xab
[2240187.903588]  [<ffffffff8111c2b1>] ? sys_sync+0x29/0x57
[2240187.903590]  [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b
[2240187.903591] INFO: task sync:6315 blocked for more than 120 seconds.
[2240187.903594] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2240187.903597] sync            D ffff880116992780     0  6315   6301 0x00000000
[2240187.903599]  ffff880116992780 0000000000000086 ffff880100000001 ffff88011806b1c0
[2240187.903601]  0000000000013780 ffff8800c1bfffd8 ffff8800c1bfffd8 ffff880116992780
[2240187.903603]  ffff880116992780 00000001926d4400 ffff8800926d4400 7fffffffffffffff
[2240187.903604] Call Trace:
[2240187.903606]  [<ffffffff8134f5b3>] ? schedule_timeout+0x2c/0xdb
[2240187.903607]  [<ffffffff8103720c>] ? test_tsk_need_resched+0xa/0x13
[2240187.903609]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903610]  [<ffffffff810420a0>] ? __cond_resched+0x1d/0x26
[2240187.903612]  [<ffffffff8134f1f9>] ? wait_for_common+0xa0/0x119
[2240187.903613]  [<ffffffff8103f6e2>] ? try_to_wake_up+0x197/0x197
[2240187.903614]  [<ffffffff8111c23b>] ? __sync_filesystem+0x76/0x76
[2240187.903616]  [<ffffffff81117691>] ? writeback_inodes_sb_nr+0x7c/0x82
[2240187.903618]  [<ffffffff8111c20f>] ? __sync_filesystem+0x4a/0x76
[2240187.903619]  [<ffffffff810fd36b>] ? iterate_supers+0x5e/0xab
[2240187.903621]  [<ffffffff8111c2b1>] ? sys_sync+0x29/0x57
[2240187.903622]  [<ffffffff81355452>] ? system_call_fastpath+0x16/0x1b

Open in new window

0
 
LVL 32

Accepted Solution

by:
DrDamnit earned 0 total points
Comment Utility
Turns out, there was a script that was running on cron, which executed the sync command. It was running every five minutes, and it was generating tens of zombie processes all trying to deal with the array. Kill the script, reboot, problem vanished.
0

Featured Post

Top 6 Sources for Identifying Threat Actor TTPs

Understanding your enemy is essential. These six sources will help you identify the most popular threat actor tactics, techniques, and procedures (TTPs).

Join & Write a Comment

Ever notice how you can't use a new drive in Windows without having Windows assigning a Disk Signature?  Ever have a signature collision problem (especially with Virtual Machines?)  This article is intended to help you understand what's going on and…
Lets start to have a small explanation what is VAAI(vStorage API for Array Integration ) and what are the benefits using it. VAAI is an API framework in VMware that enable some Storage tasks. It first presented in ESXi 4.1, but only after 5.x sup…
This video teaches viewers how to encrypt an external drive that requires a password to read and edit the drive. All tasks are done in Disk Utility. Plug in the external drive you wish to encrypt: Make sure all previous data on the drive has been …
This tutorial will walk an individual through the process of installing the necessary services and then configuring a Windows Server 2012 system as an iSCSI target. To install the necessary roles, go to Server Manager, and select Add Roles and Featu…

772 members asked questions and received personalized solutions in the past 7 days.

Join the community of 500,000 technology professionals and ask your questions.

Join & Ask a Question

Need Help in Real-Time?

Connect with top rated Experts

15 Experts available now in Live!

Get 1:1 Help Now