HP P400 SAS RAID slow performance

Hello all, I have an issue with an HP server with a P400 controller running Windows Server 2008 x64. In short, it is slow.

I get around 110 MB/s on a RAID 10 of six 15K SAS drives, and around 80 MB/s on a single 15K SAS drive.

The battery is out at the moment, but the cache is forced on in the settings for testing purposes. The read/write cache split is 50%/50%, the controller firmware is the latest, the drivers are the latest version, and all drives have the latest firmware installed.

I don't have a clue how to find out what is wrong with the controller. I asked the local HP guys, but they don't have a clue either. Has anyone had a similar issue?

Thanks!
mrmut Asked:

Aaron Tomosky (SD-WAN Simplified) Commented:
How are you testing, and what is the NIC speed?
110 MB/s is about the maximum a 1000 Mbit NIC will do.
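For a rough sense of why ~110 MB/s points at gigabit Ethernet rather than the disks, here is a back-of-the-envelope sketch (the ~10% overhead figure is an assumption for TCP/IP plus SMB framing, not a measured value):

```python
# Rough gigabit Ethernet throughput estimate (illustration only).
line_rate_mbit = 1000                    # 1 GbE line rate
raw_mb_per_s = line_rate_mbit / 8        # 125 MB/s before protocol overhead
overhead = 0.10                          # assumed combined protocol overhead
practical_mb_per_s = raw_mb_per_s * (1 - overhead)

print(f"Theoretical payload: {raw_mb_per_s:.0f} MB/s")
print(f"Typical real-world copy: ~{practical_mb_per_s:.0f} MB/s")  # ~110-115 MB/s
```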
0
Thomas Rush Commented:
We need to know your exact testing methodology.  Where is the data coming from or going to?  What program are you using to test?  Is there a file system on the disks, and if so, how did you create the files, what sort of directory structure, and how big are the files?

I had thought that the write cache was turned off automatically when there was no battery (or the battery charge fell below a certain level) -- are you sure it's enabled with no battery?
0
David (President) Commented:
Those numbers are quite reasonable for small-block-size I/O. Specifically, what is the benchmark and what is the I/O size?
0
mrmut (Author) Commented:
Testing is done with the HD Tune executable. It works reliably everywhere, including on this server. Before this slowdown it was showing 400-600 MB/s. Also, you can feel that the system is not snappy; it is characteristically sluggish, as if the cache were off, while it isn't.

The cache does turn off automatically, but there is an option to leave it on. The symptoms are the same with the battery in or out.

Here is the IOPS result: RAID bench
0
David (President) Commented:
You are showing random I/O on the screen. There is NO WAY you were getting 400-600 MB/sec of random I/O on six 15K rpm drives.
0
Thomas Rush Commented:
Question: Have you looked at drive diags/SMART and seen if perhaps you're starting to get lots of retries on disk access?
0
mrmut (Author) Commented:
dlethe - what app do you want me to test this with?

SelfGovern - the drives are fine, or at least it looks that way. I looked at the diags and found no problems there. We replaced two drives that were about to fail some time ago, but other than that everything is fine.
0
David (President) Commented:
HD Tune is fine. The numbers shown on the screen are REASONABLE for random I/O if you configured the RAID with a large block size. With six drives, a much better choice would have been a 2-disk RAID 1 for the OS + swap + scratch table space, and a 4-disk RAID 10 for everything else, all with a 64KB stripe size on the controller.

That is because SQL Server does native 64KB I/O. You want each disk to do 64KB I/O as well. You won't get this on a 6-drive RAID 10, so no matter what you do, I/O won't be efficient.
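As a rough sketch of the alignment argument (my own illustration, not an HP or Microsoft sizing rule; the 64 KB and 512 KB request sizes are the commonly cited SQL Server extent and read-ahead sizes, assumed here):

```python
# Toy model: with a 64 KB strip, a RAID 10 full stripe is
# (data spindles) x 64 KB. Power-of-two request sizes divide evenly into a
# 4-drive set (128 KB full stripe) but not into a 6-drive set (192 KB).
STRIP_KB = 64
REQUEST_SIZES_KB = (64, 512)   # assumed SQL Server extent and read-ahead sizes

for drives in (4, 6):
    data_spindles = drives // 2
    full_stripe_kb = data_spindles * STRIP_KB
    fits = {io: (full_stripe_kb % io == 0 or io % full_stripe_kb == 0)
            for io in REQUEST_SIZES_KB}
    print(f"{drives}-drive RAID 10: full stripe {full_stripe_kb} KB, "
          f"even division by request size: {fits}")
```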
0
mrmut (Author) Commented:
That's all fine, but I/O is hardware dependent - cache or no cache, it depends on HDD performance.

The problem is the slow transfer speed. 125 MB/s is laughable, and I was routinely getting 400-500 MB/s in HD Tune.
0
David (President) Commented:
You were not getting 400-500 MB/sec of random I/O with the block sizes on the screen on a 6-drive RAID 10. It just isn't possible. Perhaps before, your benchmark ran for such a short time and over such a tight range of physical blocks that it was all cached I/O.

That would give the appearance of high numbers, but it wouldn't work that way in the real world with data.
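To illustrate how a short run over a tight block range could end up being served from controller cache, here is a purely hypothetical example (the 256 MB cache size is an assumption about the installed module; the run length and span are made up):

```python
# Hypothetical short benchmark run that fits entirely in controller cache.
cache_mb = 256          # assumed P400 cache module size
test_span_mb = 200      # hypothetical tight range of blocks the test touches
run_seconds = 10        # hypothetical short run
reported_mb_per_s = 500

data_read_mb = reported_mb_per_s * run_seconds
print(f"Data read during the run: {data_read_mb} MB")
print(f"Whole test span fits in cache: {test_span_mb <= cache_mb}")
# After the first pass the controller can answer from RAM, so the reported
# MB/s reflects cache speed rather than what the spindles can sustain.
```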
0
Aaron Tomosky (SD-WAN Simplified) Commented:
You can get 400-500 MB/s of sequential I/O...
0
mrmut (Author) Commented:
I think we have a misunderstanding here. The TRANSFER speed is around 120 MB/s, and it was routinely 400-500. The sequential test was requested, so I provided it.

The measured transfer speeds have dropped, and you can feel it on the system.
0
Aaron Tomosky (SD-WAN Simplified) Commented:
We definitely have a misunderstanding... "The TRANSFER speed" doesn't mean anything.
0
mrmut (Author) Commented:
However you put it, I have 400-500% less of something, and I can't figure out the reason for it.

Does HP have some controller diagnostic routines I could run?

The next steps I will try are booting the machine from another system and testing the array there. I will also try another PCIe slot.

System reinstallation is an option, but it doesn't seem like that would solve the issue, as the system is freshly installed as it is.
0
David (President) Commented:
Look, you did not have a 400-500% reduction given the numbers you have now. You had bad info if you thought this setup would give you 400-500 MB/sec routinely in the mix you described. There are multiple kinds of cache: read and write cache at the controller and at the OS level.

The battery helps with the write cache. So do a 100% pure read test and let's work through this one thing at a time. Do sequential reads with a 64KB chunk size.
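A minimal sketch of such a test (my own, not an HP or HD Tune procedure): read an existing large file front to back in 64 KB chunks and report the throughput. The path is a placeholder, and the Windows file cache will still help unless the file is much larger than RAM.

```python
import time

CHUNK = 64 * 1024                     # 64 KB reads, as suggested above
PATH = r"D:\bench\testfile.bin"       # hypothetical large test file

def sequential_read_mb_per_s(path, chunk=CHUNK):
    """Read the whole file sequentially in fixed-size chunks and time it."""
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:
        while True:
            buf = f.read(chunk)
            if not buf:
                break
            total_bytes += len(buf)
    elapsed = time.perf_counter() - start
    return (total_bytes / (1024 * 1024)) / elapsed

if __name__ == "__main__":
    print(f"Sequential 64 KB reads: {sequential_read_mb_per_s(PATH):.1f} MB/s")
```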
0
mrmut (Author) Commented:
I am probably unclear; English is not my mother tongue.

I took a picture to describe the problem:

RAID
This is 4-5x below what it should be, and below what it was before.
0
Aaron Tomosky (SD-WAN Simplified) Commented:
Do you have the HP utility that shows the array status? I believe it's here, but I'm not familiar with all the different choices...
http://h20564.www2.hp.com/hpsc/swd/public/readIndex?sp4ts.oid=1156881&swLangOid=8&swEnvOid=4064
0
David (President) Commented:
You are showing a large-block sequential read test at 358 MB/sec vs. a random test that peaks at 58 MB/sec.

Based again on the differences between the I/O tests, your numbers are exactly what one would expect.
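For a sense of scale, here is a rough random-I/O estimate for 15K SAS spindles (the seek and latency figures below are typical published values I am assuming, not HP QuickSpecs numbers):

```python
# Back-of-envelope random read estimate for a 6-drive RAID 10.
avg_seek_ms = 3.5                  # assumed average read seek for a 15K SAS drive
rotational_latency_ms = 2.0        # half a revolution at 15,000 rpm
iops_per_disk = 1000 / (avg_seek_ms + rotational_latency_ms)   # ~180 IOPS

disks = 6                          # RAID 10 reads can be serviced by any member
block_kb = 64
array_iops = iops_per_disk * disks
random_mb_per_s = array_iops * block_kb / 1024

print(f"~{iops_per_disk:.0f} IOPS per disk, ~{array_iops:.0f} IOPS for the array")
print(f"~{random_mb_per_s:.0f} MB/s at {block_kb} KB random reads")  # same ballpark as the 58 MB/s peak
```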
0
Aaron Tomosky (SD-WAN Simplified) Commented:
An average of 120 MB/s for a large-block sequential read seems pretty weak to me. That 358 is a burst that doesn't even affect the 134 "max" value.
0
mrmut (Author) Commented:
Aaron: yeah, 120 MB/s is cr*p. I get that from external USB 3.0 Green drives.

I just checked on two other servers; on one with 4 drives in RAID 10 I got about 350 MB/s, and on the second with two drives in RAID 1 I got an average of 171 (the top was 230).

The array status is fine; all green. The system has 6 drives + 1 hot spare + 1 independent drive. I don't see any anomalies in the management software.

RAID
I took the battery out yesterday to test, but turned the cache on regardless.

RAID
RAID
The thing is that I WAS getting 4-5x faster transfer speeds, and now I don't get that. And I don't have a clue why.
0
mrmut (Author) Commented:
OK, I've done some poking around.

1. I took out the second SAS card and moved the P400 to slot 1. The measured difference is about +10 MB/s.
2. When I measured the standalone drive, it returned a strangely constant speed.

I think this is either a driver or a bus issue. I will keep poking around, and will reinstall the system to test if needed.

The burst speed is also weird; it is again at circa 350 MB/s, and it should be over 1 GB/s.


Any ideas welcomed.
0
mrmut (Author) Commented:
After a lot of work today, I got 120 MB/s peak with a 100 MB/s average on a single HDD, and 150/130 peak/average on a 4-HDD RAID 10. In essence, what I did was change the PCIe slot and take out the other SAS card.

I also reinstalled the system, recreated the array several times, and tried RAID 0 on 4 HDDs on a different SAS controller, and got only a 160 MB/s average out of it. The cache was on the whole time, on both controllers.

So - the problem is still there. For comparison, this is what I get on another server - 4 x 10K HDD in RAID 10, measured on a powered-on, client-accepting TS, over TeamViewer - the periodic dips are mostly log writes:

RAID 10 4HDD 10k
In essence, here the controller itself is capping the array.
0
Aaron Tomosky (SD-WAN Simplified) Commented:
Or maybe the port it's plugged into on the mainboard isn't giving it full bandwidth? According to this it needs an x8 PCIe port:
http://www8.hp.com/h20195/v2/GetDocument.aspx?docname=c04111741

Also, are you running current firmware?
0
mrmut (Author) Commented:
One of the first things I did was check the firmware. I updated both the firmware on the card and the firmware on all drives to the latest versions. However, that didn't change anything.

As for the interface, yes, it is x8. I intend to try diagnosing the hardware with HP Array Diagnostics tomorrow, but I am not so sure that will help. I also left the machine running memtest overnight to see if there is perhaps a memory problem.

Do you happen to know a way to test PCI Express bandwidth? What could cause such a slowdown? I haven't seen this before.

One thing that came to mind is to install the controller in another computer and test how it performs there. However, the server is on location, and the HDD enclosures might pose a problem.

Whatever diagnostics you can think of, just shoot :-) I am pretty much out of options. I am also in contact with the HP service center, but they seem clueless too.
0
Aaron Tomosky (SD-WAN Simplified) Commented:
If possible, I'm a big fan of moving things around to play 20 questions. So if you move the drives and controller to another box and it's fast, it's the box; if it's not, it's the controller. If you can put a different controller in the original box with the original drives, that's another good test.
0
andyalder Commented:
If the hardware hasn't changed, then it may be a software setting. Under Device Manager, on the disk properties Policies tab, there's a disk write cache setting (which has no effect, as the controller ignores it) plus the flush-to-disk option; you may get a big performance boost if you turn the cache flushing option off. (It's written as a double negative - try with it ticked, assuming you have a UPS.)
0
mrmut (Author) Commented:
OK, will do. I think I have some W7 computers that have x8 ports available.

I will write back when I test; probably tomorrow, though I will need some time (expect an update tomorrow evening).
0
mrmut (Author) Commented:
andyalder - you mean this (this is on another server)?

adaptec.png
I tried it, but it didn't work - the controller does not accept changes to its caching settings through Windows. Is there something else you had in mind? (What you describe is close to the first thing everyone mentions when bad RAID performance comes up; however, I have checked every caching setting I could think of at least ten times.)

Is there anything else I could check?
0
andyalder Commented:
Yes, the greyed-out setting at the bottom. In order to change it you have to tick the "enable disk write caching" box, even though that has no effect other than un-greying the tickbox below it. Unless HP have changed the driver recently, both boxes can be checked.
0
mrmut (Author) Commented:
Well, I suppose the driver has changed, as the boxes can't be ticked. I also tried different drivers, but it was the same.

The system is currently running memtest, so it is offline and I can't test right now, but I will.
0
mrmut (Author) Commented:
Update:

   I tried booting the array from another machine. It worked; here are the results:

4hdd-r10-from-another-system
4hdd-R10-another-machine,-safe-mode
another-machine,-2-hdd-r0-left,-1-hdd-right
   I couldn't boot from the other controller; the computer did not recognize it.

   It seems to me that the card might be defective, although I am not completely sure.

   Can anyone point me to specifications for the 146 GB 10K HP SAS drives? Maybe the mixing of drive revisions causes this (I have three different revisions). I base this on measurements of single drives. A few days ago I measured a single drive from the system and got this from it:

one-hdd,-few-days-ago
   As you can see, the difference is huge: 100 vs 70 MB/s average. Of course, this could be due to a faulty controller, but if we presume the controller is not the issue, then this could be the culprit (hence the request for HDD specifications, and most importantly the sustained transfer rate figures).
0
David (President) Commented:
Are you by any chance benchmarking the same logical volume that you booted from? If not, are you booted from another volume that is also on the SAME controller? If so, then all your numbers are suspect, because your synthetic load is not the ONLY thing the controller is doing.

As I maintain ... I see nothing wrong here other than misconceptions about benchmarking.
0
mrmut (Author) Commented:
It is absolutely irrelevant whether the system is booted from it or not, especially on systems with a lot of RAM, or whether the same controller is used. Server cards have discrete, dedicated processing units and RAM, and internal throughput that is rarely surpassed even by the fastest arrays or combinations of arrays. On small, CPU-dependent controllers the total throughput can suffer if the channels are saturated with data transfers from multiple drives, but that is not the rule. On servers the difference in measurement should be under a few percent, and is not something I would pay much attention to. The trends are what matter, and they are easy to read from the transfer graphs produced.

My issue here is the controller giving me SEVERAL TIMES slower data transfer than it should.

If you don't have anything constructive to suggest for testing, I would kindly ask you to move on to another thread.
0
David (President) Commented:
mrmut - it is ABSOLUTELY relevant. You have a limited number of data lanes, both internally between the controller and the disks, and between the PCI bus and the controller. With this many disks, they compete for shared resources.
0
andyalder Commented:
You may want to check the disk firmware against the drive matrix here - http://h20564.www2.hp.com/hpsc/doc/public/display?docId=emr_na-c00305257

You can also run an ADU report and upload it so we can check there are no problems with the disks.

HP doesn't list IOPS or MB/s, but they do list seek times, so you can calculate IOPS for small transfers from the QuickSpecs - http://www8.hp.com/h20195/v2/GetDocument.aspx?docname=c04111744

Did you try pinching the cache module and battery from the other machine?
0

mrmut (Author) Commented:
That would have been true 15 years ago, with a RAID 10 of 6-8 10K Ultrastars on an Ultra-160 controller.

The P400 is only capped by the bus, which peaks at 2 GB/s; you have 1.2 GB/s per SAS wide port, one port per cage of 4 drives, two cages.
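A quick link-budget check of that claim (approximate, generic figures; the per-drive sustained rate is an assumption, not taken from HP specs):

```python
# Approximate bandwidth budget for the configuration described above.
pcie_x8_gen1_mb_s = 8 * 250        # PCIe 1.x x8 slot: ~2,000 MB/s
sas_wide_port_mb_s = 4 * 300       # x4 SAS 1.0 wide port per cage: ~1,200 MB/s

drives_per_cage = 4
per_drive_seq_mb_s = 120           # assumed sustained rate of one 15K SAS drive
cage_demand_mb_s = drives_per_cage * per_drive_seq_mb_s

print(f"Per cage the drives can stream ~{cage_demand_mb_s} MB/s")
print(f"Per-cage SAS link: {sas_wide_port_mb_s} MB/s, PCIe slot: {pcie_x8_gen1_mb_s} MB/s")
# Neither link is close to saturation, so ~120 MB/s array reads should not be
# a bus limitation.
```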
0
mrmut (Author) Commented:
andyalder - thanks.

I will try syncing the drive firmware tomorrow, if possible. I updated the drives to the latest available firmware, but as there are three drive revisions, I have different firmware versions on some of them.

I will upload the ADU report tomorrow. I can't do it now; the system is off.

I haven't tried borrowing the cache module and battery yet (although I would like to). I also tried the other controller, but the system refused to recognize it.
0
mrmut (Author) Commented:
Syncing the firmware didn't seem like a good idea when I wanted to do it today; it was not advised on the SUM DVD, due to specific problems with older firmware. Still, I decided to reflash *everything* that could be reflashed from the HP SUM DVD, and it seems that has worked.

The system is currently running a torture test on the 6-HDD RAID 10 (several hours now), and everything seems to work fine. The system again runs as if it were installed on a big, multi-part, redundant SSD.

So... what caused the problem? I am not sure. We still have to see whether it will stay stable, but in general I believe (and hope) it will.

If any more problems happen with this server, I won't have anything left to do but send the entire machine to the service center.

[I will select andyalder's answer as the solution, as his suggestions proved most helpful in solving this nightmare.]
0
andyalder Commented:
Thanks. Any cure for piles?
0
mrmut (Author) Commented:
Sorry, what do you mean by "piles"? I don't understand.
0
mrmut (Author) Commented:
BTW, the server has been using 100% of all resources (CPU, RAM, HDD) for 3:30 now. I almost feel sorry for doing this; I don't like running such tests on any computer.
0
andyalder Commented:
Googling shows that HP fixed the performance issue at about firmware level 5.2, although the release notes don't mention anything; that's about 7-year-old firmware, though.
0
mrmut (Author) Commented:
Well, the thing is that I had that firmware on the controller before. I did update it to 7.24(b), but it seems something didn't work as it should have.

The server seems fine. It has been at 100% load for ~20 hours now, and it is still snappy.
0
mrmut (Author) Commented:
UPDATE: Everything works fine. :-) I haven't had a single issue with the machine, and I will be putting it into production tomorrow.
0