
HP P400 SAS RAID slow performance

Hello all, I have an issue with an HP server with a P400 controller, running Windows Server 2008 x64. In short: it is slow.

I get about 110 MB/s on a RAID10 of six 15K SAS drives, and about 80 MB/s on a single 15K SAS drive.

The battery is out now, but the cache is forced on in the settings for testing purposes. The read/write cache ratio is 50%/50%, the controller firmware is the latest, the drivers are the latest version, and all drives have the latest firmware installed.

I don't have a clue how to find out what is wrong with the controller. I asked the local HP guys, but they don't have a clue either. Has anyone had a similar issue?

Thanks!

Aaron TomoskyDirector, SD-WAN Solutions

Commented:
How are you testing, and what is the NIC speed?
110 MB/s is about the maximum a 1000 Mbit NIC will deliver.
We need to know your exact testing methodology. Where is the data coming from or going to? What program are you using to test? Is there a file system on the disks, and if so, how did you create the files, what sort of directory structure, and how big are the files?
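As a back-of-the-envelope check on the gigabit-NIC theory (my arithmetic, not from the thread; the 7% overhead figure is an assumed typical value for TCP/IP plus Ethernet framing):

```python
# Rough throughput ceiling for a gigabit NIC, to show where a ~110 MB/s
# plateau can come from. The overhead figure is an assumption, not a
# measured value for this server.

link_mbit = 1000                      # 1 GbE link speed, megabits/s
raw_mb_per_s = link_mbit / 8          # 125 MB/s before protocol overhead

overhead = 0.07                       # assumed framing/protocol overhead
effective_mb_per_s = raw_mb_per_s * (1 - overhead)

print(raw_mb_per_s)        # 125.0
print(effective_mb_per_s)  # roughly 116, close to the reported 110 MB/s
```

Which is why a benchmark that unknowingly runs over a network share can flatline right around the numbers reported above.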

I had thought that the write cache was turned off automatically when there was no battery (or when the battery charge fell below a certain level) -- are you sure it's enabled with no battery?
DavidPresident
Top Expert 2010

Commented:
Those numbers are quite reasonable for small-block-size I/O. Specifically, what is the benchmark, and what is the I/O size?

Author

Commented:
Testing is done with the HD Tune executable. It works reliably everywhere, including on this server. Before this slowdown, it was showing 400-600 MB/s. Also, one can feel that the system is not snappy; it is characteristically sluggish, as if the cache were off, while it isn't.

The cache does turn off automatically, but there is an option to leave it on. The symptoms are the same with the battery in or out.

Here is the IOPS result: RAID bench
DavidPresident
Top Expert 2010

Commented:
You are showing random I/O on the screen. There is NO WAY you were getting 400-600 MB/sec of random I/O on six 15K RPM drives.
Question: have you looked at the drive diags/SMART to see if perhaps you're starting to get lots of retries on disk access?

Author

Commented:
dlethe - with what app do you want me to test this?

SelfGovern - the drives are fine, at least it looks that way. I looked at the diags and found no problems there. We replaced two drives that were about to fail some time ago, but other than that all is fine.
DavidPresident
Top Expert 2010

Commented:
HD Tune is fine. The numbers shown on the screen are REASONABLE for random I/O if you configured the RAID with a large block size. With 6 drives, a much better choice would have been a 2-disk RAID1 for the O/S + swap + scratch table space, and a 4-disk RAID10 for everything else, all with a 64KB stripe size on the controller.

That is because SQL Server does native 64KB I/O. You want each disk to do 64KB I/O as well. You won't get this on a 6-drive RAID10, so no matter what you do, I/O won't be efficient.
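A toy striping model can illustrate the full-stripe arithmetic behind that layout advice. This is a sketch under my own simplifying assumptions, NOT the P400's actual internal mapping:

```python
# Toy striping model (an illustration, not HP's implementation) of why
# the number of data disks matters for 64KB I/O alignment.

STRIPE_KB = 64  # per-disk stripe (chunk) size, matching SQL Server I/O

def full_stripe_kb(data_disks):
    """Size of one full stripe across all data disks."""
    return STRIPE_KB * data_disks

def disk_for_offset(offset_kb, data_disks):
    """Which data disk a 64KB-aligned request at offset_kb lands on."""
    return (offset_kb // STRIPE_KB) % data_disks

# 4-drive RAID10 -> 2 data disks: the 128KB full stripe is a power of
# two, so 64KB request boundaries line up with stripe boundaries.
print(full_stripe_kb(2))  # 128

# 6-drive RAID10 -> 3 data disks: the 192KB full stripe is not a power
# of two, so runs of aligned requests straddle full-stripe boundaries.
print(full_stripe_kb(3))  # 192
```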

Author

Commented:
All fine, but I/O is hardware dependent; cache or no cache, it depends on HDD performance.

The problem is the slow transfer speed. 125 MB/s is laughable, and I was routinely getting 400-500 MB/s in HD Tune.
DavidPresident
Top Expert 2010

Commented:
You were not getting 400-500 MB/sec of random I/O with the block sizes on the screen on a 6-drive RAID10. It just isn't possible. Perhaps your earlier benchmark ran for such a short time, over such a tight range of physical blocks, that it was all cached I/O.

That would give the appearance of high numbers, but it wouldn't work that way in the real world with data.
Aaron TomoskyDirector, SD-WAN Solutions

Commented:
You can get 400-500 MB/s with sequential I/O...

Author

Commented:
I think we have a misunderstanding here. The TRANSFER speed is around 120 MB/s, and it was routinely 400-500. A sequential test was requested, so I provided it.

The measured transfer speeds have dropped, and it can be felt on the system.
Aaron TomoskyDirector, SD-WAN Solutions

Commented:
We definitely have a misunderstanding... "The TRANSFER speed" doesn't mean anything by itself.

Author

Commented:
However you put it, I am getting 4-5x less of something, and I can't figure out the reason for that.

Does HP have some controller diagnostic routines I could run?

The next steps I will try are booting the machine from another system and testing the array there. I will also try another PCIe slot.

System reinstallation is an option, but it doesn't seem like it would solve the issue, as the system is freshly installed as it is.
DavidPresident
Top Expert 2010

Commented:
Look, you did not have a 400-500% reduction given the numbers you have now. You had bad information if you thought this setup would give you 400-500 MB/sec routinely in the mix you described. There are multiple kinds of cache: read and write cache at the controller and at the O/S level.

The battery-backed cache helps with write caching. So do a 100% pure read test and let's work through this one step at a time. Do sequential reads, 64KB chunk size.

Author

Commented:
I am probably unclear; English is not my mother tongue.

I took a picture to describe the problem:

RAID
This is down 4-5x from what it should be, and from how it was before.
Aaron TomoskyDirector, SD-WAN Solutions

Commented:
Do you have the HP utility to show the array status? I believe it's here, but I'm not familiar with all the different choices...
http://h20564.www2.hp.com/hpsc/swd/public/readIndex?sp4ts.oid=1156881&swLangOid=8&swEnvOid=4064
DavidPresident
Top Expert 2010

Commented:
You are showing a large-block sequential read test at 358 MB/sec versus a random test that peaks at 58 MB/sec.

Based again on the differences between the I/O tests, your numbers are exactly what one would expect.
Aaron TomoskyDirector, SD-WAN Solutions

Commented:
An average of 120 MB/s for a large-block sequential read seems pretty weak to me. That 358 is a burst that doesn't even affect the 134 "max" value.

Author

Commented:
Aaron: yeah, 120 MB/s is cr*p. I get that from external USB 3.0 "Green" drives.

I just checked on two other servers: on one with 4 drives in RAID10 I got about 350 MB/s, and on the second with two drives in RAID1 I got an average of 171 MB/s (the top was 230).

The array status is fine; all green. The system has 6 drives + 1 hot spare + 1 independent drive. I don't see any anomalies in the management software.

RAID
I took the battery out yesterday to test, but turned the cache on regardless.

RAID
RAID
The thing is that I WAS getting 4-5x faster transfer speeds, and now I don't get that. And I don't have a clue why.

Author

Commented:
OK, I've done some rummaging around.

1. I took out the second SAS card and repositioned the P400 to SLOT 1. The measured difference is about +10 MB/s.
2. When I measured the standalone drive, it returned a strangely constant speed.

I think this is either a driver or a bus issue. I will try poking around, and will reinstall the system to test, if needed.

The burst speed is also weird; it is again at circa 350 MB/s, and it should be over 1 GB/s.


Any ideas welcomed.

Author

Commented:
After a lot of work today, I got 120 MB/s on a single HDD with a 100 MB/s average, and 150/130 MB/s peak/average on a 4-HDD RAID10. What I did, in essence, is change the PCIe slot and take out the other SAS card.

I also reinstalled the system, recreated the array several times, and tried RAID0 on 4 HDDs on a different SAS controller, and got only a 160 MB/s average out of it. The cache was on the whole time, on both controllers.

So - the problem is still there. As a comparison, this is what I get on another server - 4x 10K HDD in RAID10, from a powered-on, client-accepting terminal server, over TeamViewer - the periodic dips are mostly log writes:

RAID 10 4HDD 10k
In essence, here the controller speed itself is capping the array.
Aaron TomoskyDirector, SD-WAN Solutions

Commented:
Or maybe the port it's plugged into on the mainboard isn't giving it full bandwidth? According to this, it needs an x8 PCIe port:
http://www8.hp.com/h20195/v2/GetDocument.aspx?docname=c04111741

also, are you running current firmware?

Author

Commented:
One of the first things I did was check the firmware. I updated both the firmware on the card and the firmware on all drives to the latest versions. However, that didn't change anything.

As for the interface, yes, it is x8. I intend to try diagnosing the hardware with HP Array Diagnostics tomorrow, but I am not so sure that will help. I also left the machine running memtest during the night to see whether there is perhaps a memory problem.

Do you happen to know of a way I could test PCI Express bandwidth? What could cause such a slowdown? I haven't seen this before.

One thing that crossed my mind is to try installing the SAS controller in another computer and testing how it performs there. However, the server is on location, and the HDD enclosures might pose a problem.


Whatever diagnostics you can think of, just shoot :-) I am kind of out of options. I am also on the line with the HP service center, but they too seem clueless.
Aaron TomoskyDirector, SD-WAN Solutions

Commented:
If possible, I'm a big fan of moving things around to play 20 questions. So if you move the drives and controller to another box and it's fast, it's the box; if it's not, it's the controller. If you can put a different controller in the original box with the original drives, that's another good test.
Distinguished Expert 2019

Commented:
If the hardware hasn't changed, then it may be a software setting. Under Device Manager > disk properties > Policies tab there's a disk write-cache setting (which has no effect, as the controller ignores it) plus the flush-to-disk option; you may get a big performance boost if you turn off the cache-flushing option. (It's written as a double negative - try with it ticked, assuming you have a UPS.)

Author

Commented:
OK, will do. I think I have some W7 computers that have x8 ports available.

Will write back when I test; probably tomorrow, though I will need some time (expect an update tomorrow evening).

Author

Commented:
andyalder - you mean this (this is on another server)?

adaptec.png
Tried it, but it didn't work - the controller does not accept changes to caching settings through Windows. Is there something else you had in mind? (What you describe is close to the first thing everyone mentions when bad RAID performance comes up; however, I have checked all the caching settings I could think of at least ten times.)

Is there anything else I could check?
Distinguished Expert 2019

Commented:
Yes, the greyed-out setting at the bottom. In order to change it, you have to tick the "enable disk write caching" box, even though that has no effect other than un-greying the tickbox below it. Unless HP have changed the driver recently, both boxes can be ticked.

Author

Commented:
Well, I suppose the driver has changed, as the boxes can't be ticked. I also tried different drivers, but it was the same.

The system is currently memtesting, so it is off-line and I can't test right now, but I will.

Author

Commented:
Update:

- Tried booting the array from another machine. It worked; here are the results:

4hdd-r10-from-another-system
4hdd-R10-another-machine,-safe-mode
another-machine,-2-hdd-r0-left,-1-hdd-right
- I couldn't boot from the other controller; the computer did not recognize it.

- It seems to me that the card might be defective, although I am not completely sure.

- Can anyone point me to the specifications for the 146GB 10K HP SAS drives? Maybe mixing drive revisions causes this (I have three different revisions). I base this on measurements of single drives. A few days ago I measured a single drive from the system and got this:

one-hdd,-few-days-ago
- As you can see, the difference is huge: 100 vs. 70 MB/s average. Of course, this could be due to a faulty controller, but if we presume the controller is not the issue, then this could be the culprit (hence the request for the HDD specifications, and most importantly the sustained transfer rate figures).
DavidPresident
Top Expert 2010

Commented:
Are you by any chance benching the same logical volume that you are booted from? If not, are you booted from another volume that is also on the SAME controller? If so, then all your numbers are suspect, because your synthetic load is not the ONLY thing the controller is doing.

As I maintain ... I see nothing wrong here other than misconceptions about benchmarking.

Author

Commented:
It is absolutely irrelevant whether the system is booted from the array or not, especially on systems with a lot of RAM, or whether the same controller is used. Server cards have discrete, dedicated processing units and RAM, and an internal throughput that is rarely surpassed even by the fastest arrays or combinations of arrays. On small, CPU-dependent controllers, the total throughput can suffer if the channels are saturated with data transfers from multiple drives, but that is not the rule. On servers, the difference in measurement should be under a few percent, and it is not something I would pay much attention to. The trends are what matter, and they are easily read from the transfer graphs produced.

My issue here is with controller giving me SEVERAL TIMES slower data transfer than it should.

If you don't have anything constructive to suggest for testing, I would like to kindly ask you to move away to another thread.
DavidPresident
Top Expert 2010

Commented:
mrmut - it is ABSOLUTELY relevant. You have a limited number of data lanes, both internally between the controller and the disks, and between the PCI bus and the controller. With this many disks, they compete for shared resources.
Distinguished Expert 2019
Commented:
You may want to check the disk firmware against the drive matrix here - http://h20564.www2.hp.com/hpsc/doc/public/display?docId=emr_na-c00305257

You can also run an ADU report and upload it so we can check there are no problems with the disks.

HP don't list IOPS or MB/s, but they do list seek times, so you can calculate IOPS for small transfers from the QuickSpecs - http://www8.hp.com/h20195/v2/GetDocument.aspx?docname=c04111744
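The seek-time-to-IOPS calculation mentioned above can be sketched like this. The seek values below are placeholders of my own, not HP's figures; substitute the numbers the QuickSpecs list for the exact drive model:

```python
# Estimate single-drive random IOPS from average seek time plus
# rotational latency. Seek times here are assumed placeholders -
# replace them with the QuickSpecs values for your drive.

def est_iops(avg_seek_ms, rpm):
    rot_latency_ms = 60000 / rpm / 2   # average half-rotation, in ms
    return 1000 / (avg_seek_ms + rot_latency_ms)

print(round(est_iops(3.9, 10000)))  # placeholder 10K drive: ~145 IOPS
print(round(est_iops(3.4, 15000)))  # placeholder 15K drive: ~185 IOPS
```

Multiply by the number of spindles that service reads for a rough array-level ceiling on small random transfers.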

Did you try pinching the cache module and battery from the other machine?

Author

Commented:
That would have been true 15 years ago, with a RAID10 of 6-8 10K Ultrastars on an Ultra-160 controller.

The P400 is only capped by the bus, which peaks at 2 GB/s; you have 1.2 GB/s per 4-lane SAS port, one port per cage of 4 drives, two cages.
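The arithmetic behind those ceilings, with my assumed figures (PCIe 1.x at 250 MB/s usable per lane, 3Gb/s SAS with 8b/10b encoding) stated explicitly:

```python
# Bandwidth ceilings relevant to a P400-class controller. Both rates
# below are assumptions about the generation of the links involved.

pcie_lane_mb = 250                    # PCIe 1.x payload rate per lane, MB/s
pcie_x8_gb = pcie_lane_mb * 8 / 1000  # host-side cap for an x8 card
print(pcie_x8_gb)                     # 2.0 GB/s

sas_lane_gbit = 3                              # 3G SAS line rate per lane
payload_gbit = sas_lane_gbit * 4 * 8 / 10      # 4-lane wide port, 8b/10b
sas_wide_port_gb = payload_gbit / 8            # bits -> bytes
print(sas_wide_port_gb)                        # 1.2 GB/s per 4-drive cage
```

Either ceiling sits far above the ~120 MB/s being measured, which is the author's point: the bus should not be the bottleneck here.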

Author

Commented:
andyalder - thanks.

Will try syncing the drive firmwares tomorrow, if possible. I updated the drives to the latest available firmware, but as there are three drive revisions, I have different firmware versions on some of them.

Will upload the ADU report tomorrow. Can't do it now; the system is off.

I haven't tried swapping the cache module (although I would have liked to). I also tried the other controller, but the system refused to recognize it.

Author

Commented:
Syncing the firmwares didn't seem like a good idea when I wanted to do that today; it was not advised on the SUM DVD, due to specific problems with older firmwares. Still, I decided to reflash *everything* that was possible to reflash from the HP SUM DVD, and it seems that has worked.

The system is currently running a torture test on the 6-HDD RAID10 (several hours now), and everything seems to work fine. The system again runs as if it were installed on a big, multi-part, redundant SSD.

So... what caused the problem? I am not sure. We will still have to see whether it remains stable, but in general I believe (and hope) it should.

If any more problems happen with the server, I won't have anything else to do but send the entire machine to the service center.

[I will select andyalder's answer as the solution, as his suggestions proved most helpful in solving this nightmare.]
Distinguished Expert 2019

Commented:
Thanks. Any cure for piles?

Author

Commented:
Sorry, what do you mean by "piles"? I don't understand.

Author

Commented:
BTW, the server has been using 100% of all resources (CPU, RAM, HDD) for 3:30 now. I almost feel sorry for doing this; I don't like running such tests on any computer.
Distinguished Expert 2019

Commented:
Googling shows that HP fixed a performance issue at about firmware level 5.2, although the release notes don't mention anything; that is about 7-year-old firmware, though.

Author

Commented:
Well, the thing is that I had that firmware on the controller before. I did update it to 7.24(b), but it seems something didn't work as it should have.

The server seems fine. It has been at 100% load for ~20 hours now, and it is still snappy.

Author

Commented:
UPDATE: Everything works fine. :-) I haven't had a single issue with the machine, and I will be putting it into production tomorrow.