Avatar of jcks
jcks

asked on

Storage Server hard disk slowness issue

I have an HP ProLiant DL185 G5 storage server running Windows Storage Server 2003. It has 12 TB of disk, with RAID 1 on one array and RAID 6 on the other. The first array holds the OS; the second array is used as a secondary drive for backups.

This storage server is used as a network backup destination for our PCs (180 PCs). We use Symantec Backup Exec System Recovery 2010 to take daily backups spread out through the night. I stagger the backups and limit the size of the files created to ease the load on the server. Everything was running OK until the other day, when I noticed pretty much all of the backups had failed.

Users were reporting a Windows "Delayed Write Failed" message, which states that Windows was unable to save all data for the backup file to the network share on this particular server. Seeing a lot of these, I logged onto the server and it was very slow, although CPU/memory usage was fine. So I ran HD Tune and my transfer rate was averaging 0.1 MB/sec with 1.7% CPU usage. I killed the backup jobs, rebooted the server and ran HD Tune again, and got about a 75 MB/sec average. I let the backups run again last night and got the same thing: backup failures and a very slow HD transfer rate.

The person who set this server up is no longer with the company, so I sort of inherited it. I'm not familiar with what troubleshooting steps I need to take now. I did check the properties of the disk controller to make sure it was set to DMA transfer mode if available. The built-in HP Array utilities don't show any alarms or alerts. I'm kind of at a loss as to what to do next.
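For anyone wanting to repeat the transfer-rate check on a schedule, for example before, during and after the backup window, here is a minimal sketch in Python. It is not what HD Tune does (HD Tune benchmarks the raw disk); it only times a synced write of a temporary file. The function name and the target folder are assumptions for illustration, so point it at a folder on the backup volume.

```python
import os
import tempfile
import time


def write_throughput_mb_s(directory, size_mb=16, chunk_mb=1):
    """Write about size_mb of data to a temp file in `directory`, sync it, return MB/s."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    chunks = size_mb // chunk_mb
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as handle:
            for _ in range(chunks):
                handle.write(chunk)
            handle.flush()
            # Ask the OS to push the data to the device, not just its cache.
            os.fsync(handle.fileno())
        elapsed = time.perf_counter() - start
    finally:
        os.remove(path)
    written_mb = chunks * chunk_mb
    return written_mb / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Hypothetical target: replace with a folder on the backup volume.
    target = tempfile.gettempdir()
    print(f"{write_throughput_mb_s(target):.1f} MB/s")
```

Logging the result every few minutes should show whether the rate collapses only once the backup jobs start, or also while the server is idle.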

Avatar of Member_2_231077
Member_2_231077

Latest firmware and drivers for the controller? I think it's a P800 in that model but same package for all controllers anyway.
Hello,

Are there any other batch jobs running that you aren't aware of?
Sorry for the previous post. I meant to say: check to see if there are other batch jobs running.
Avatar of jcks

ASKER

Haven't installed latest firmware/drivers yet. I usually subscribe to the theory of it ain't broke don't mess with it. Since this was working fine, I never updated any firmware. You are correct, it is a P800.

re: batch jobs, nothing else is set to run on this machine outside of these backup jobs, which are driven by the Symantec software itself.

I'll try the firmware/drivers idea first and report back.
Avatar of jcks

ASKER

Well, I didn't even have time to update anything yet. I got into work to check on the server and it's now not seen on the network. I normally remote into it to work on it, but it's offline now. Going into the server room I see that the power lights are all green, but it's definitely not up, as I cannot ping it or anything. Just my luck, the KVM switch isn't working either, so I will have to attach a monitor physically to see what's happening. I'll do that tomorrow. Sigh.
Avatar of jcks

ASKER

Finally was able to get my KVM working. All I saw on the screen was a flashing cursor. Rebooted the storage server and was greeted with the following message:

Slot 1 HP Smart Array P800 Controller

1792 - Slot 1 Drive Array - Valid Data Found in Array Accelerator
Data will automatically be written to drive array

1779 - Slot 1 Drive Array - Replacement drive(s) detected OR previously failed drive(s) now appear to be operational:
Port 3I: Box 1: Bays 1,2,3,7,9
Logical drive(s) disabled due to possible data loss.
Select "F1" to continue with logical drive(s) disabled
Select "F2" to accept data loss and to re-enable logical drive(s)


Hitting F1 goes back to the flashing cursor and nothing happens. Selecting F2 brings back up my storage server. This is exactly what happened a few weeks ago and I didn't look into it any further since selecting F2 seemed to make everything ok.  But now twice in 3 weeks is an issue.  

After booting up I went into Event Viewer and there's a bunch of errors from 6/9/11, when the server went offline. There's also a status alert in the HP Array Configuration Utility. This alert has been there for a while and I'm not sure if it's related or not.
I'll post a summary of the event logs shortly.
Avatar of jcks

ASKER

Here's a small sampling of the event errors I received:

Event Type:      Warning
Event Source:      Cissesrv
Event Category:      None
Event ID:      24607
Date:            6/8/2011
Time:            10:06:32 AM
User:            N/A
Computer:      HPSTORAGE01
Description:
The event information received from array controller P800 located in server slot 1 was of an unknown or unrecognized class.

An excerpt of the controller message is as follows: Inconsistent stripe, LDrv=0 LBA=0x0003E1300-0x0003E13FF.
------------------------------------------

Event Type:      Warning
Event Source:      Cissesrv
Event Category:      None
Event ID:      24685
Date:            6/9/2011
Time:            4:10:57 PM
User:            N/A
Computer:      HPSTORAGE01
Description:
Array controller P800 located in server slot 1 has reported an uncorrectable read error during surface analysis operations for logical drive 3. A media error was encountered that is not correctable due to media errors on other physical drive(s) belonging to this logical volume. The uncorrectable media defects are between logical block address 11252035840 and logical block address 11252036095. The host will be unable to read some blocks between this address range until the blocks are overwritten. Capacity expansion operations must be avoided while the blocks are unreadable.
--------------------------------------

Event Type:      Error
Event Source:      Cissesrv
Event Category:      None
Event ID:      24595
Date:            6/9/2011
Time:            4:23:27 PM
User:            N/A
Computer:      HPSTORAGE01
Description:
A drive failure notification has been received for the SATA physical drive located in bay 3.  This drive can be found in box 1 which is connected to port 3I of the array controller P800 located in server slot 1.  The failure reason received from the HP Smart Array firmware is: TIMEOUT.
----------------------------------

Event Type:      Error
Event Source:      Disk
Event Category:      None
Event ID:      11
Date:            6/9/2011
Time:            4:24:10 PM
User:            N/A
Computer:      HPSTORAGE01
Description:
The driver detected a controller error on \Device\Harddisk2\DR2.
-------------------------------------

Event Type:      Error
Event Source:      Cissesrv
Event Category:      None
Event ID:      24606
Date:            6/9/2011
Time:            4:24:29 PM
User:            N/A
Computer:      HPSTORAGE01
Description:
Logical drive 3 configured on array controller P800 located in server slot 1 returned a fatal error during a read/write request from/to the volume.  

 Logical block address 6293266, block count 8 and command 32 were taken from the failed logical I/O request.  

 Array controller P800 located in server slot 1 is also reporting that the last physical drive to report a fatal error condition (associated with this logical request), is located on bus 0 and ID 16.
---------------------------------------------

Event Type:      Error
Event Source:      Cissesrv
Event Category:      None
Event ID:      24595
Date:            6/9/2011
Time:            4:25:59 PM
User:            N/A
Computer:      HPSTORAGE01
Description:
A drive failure notification has been received for the SATA physical drive located in bay 7.  This drive can be found in box 1 which is connected to port 3I of the array controller P800 located in server slot 1.  The failure reason received from the HP Smart Array firmware is: TIMEOUT.
---------------------------------------------

Event Type:      Error
Event Source:      Cissesrv
Event Category:      None
Event ID:      24595
Date:            6/9/2011
Time:            4:27:32 PM
User:            N/A
Computer:      HPSTORAGE01
Description:
A drive failure notification has been received for the SATA physical drive located in bay 9.  This drive can be found in box 1 which is connected to port 3I of the array controller P800 located in server slot 1.  The failure reason received from the HP Smart Array firmware is: TIMEOUT
-------------------------------------------

Event Type:      Error
Event Source:      Cissesrv
Event Category:      None
Event ID:      24600
Date:            6/9/2011
Time:            4:27:34 PM
User:            N/A
Computer:      HPSTORAGE01
Description:
Logical drive 3 of array controller P800 located in server slot 1 has encountered a status change from:  

Status: INTERIM RECOVERY MODE  
to  
Status: FAILED
-----------------------------------------------

Event Type:      Warning
Event Source:      HpCISSs2
Event Category:      None
Event ID:      129
Date:            6/9/2011
Time:            4:22:11 PM
User:            N/A
Computer:      HPSTORAGE01
Description:
Reset to device, \Device\RaidPort0, was issued.
-----------------------------------------



There's a bunch like these, all around the time the server went down. From what I can tell it looks like it's telling me multiple drives are failing. But are they really? I can't imagine that many failing at once. Plus the lights on the front of each drive are green. I did have a drive failure months ago, but that one showed an amber/orange light on the failed drive and I replaced it.

So are all these things related to my initial issue of slowing down?
It's quite likely that's the reason for it to slow down and crash. It could be power related, or just the disk backplane or the controller; it's not a single disk, since so many are dropping out at once. Is it still under warranty?
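To check the "lots dropping out" pattern against a full event log export rather than a small sample, a rough sketch like the following could tally the drive failure notifications by bay and reason. It assumes the Cissesrv description wording matches the Event ID 24595 entries quoted above; the function names and the regular expression are illustrative, not part of any HP tool.

```python
import re
from collections import Counter

# Matches the wording of the Event ID 24595 descriptions quoted in this thread.
FAILURE_RE = re.compile(
    r"physical drive located in bay (\d+)\..*?firmware\s+is:\s+(\w+)",
    re.DOTALL,
)


def tally_drive_failures(log_text):
    """Count drive failure notifications per (bay, reason) pair."""
    return Counter(
        (int(bay), reason) for bay, reason in FAILURE_RE.findall(log_text)
    )


def summarize(log_text):
    """Return the sorted list of affected bays and the distinct failure reasons."""
    tally = tally_drive_failures(log_text)
    bays = sorted({bay for bay, _ in tally})
    reasons = sorted({reason for _, reason in tally})
    return bays, reasons


# Sample text rebuilt from the three notifications posted above (bays 3, 7, 9).
TEMPLATE = (
    "A drive failure notification has been received for the SATA physical "
    "drive located in bay {bay}.  This drive can be found in box 1 which is "
    "connected to port 3I of the array controller P800 located in server "
    "slot 1.  The failure reason received from the HP Smart Array firmware "
    "is: TIMEOUT."
)
SAMPLE = "\n".join(TEMPLATE.format(bay=b) for b in (3, 7, 9))

if __name__ == "__main__":
    bays, reasons = summarize(SAMPLE)
    print("Bays reporting failures:", bays)
    print("Failure reasons:", reasons)
```

Several bays all reporting the same TIMEOUT reason within a few minutes is the pattern that points at something shared (power, backplane, controller) rather than at individual disks.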
Avatar of jcks

ASKER

No, it is not. For some reason the guy who bought it didn't get the extended warranty from CDW. How can I narrow down the issue here? I agree that it's not the disks themselves, since so many of them are reported as failed, and all because of timeouts.
ASKER CERTIFIED SOLUTION
Avatar of Member_2_231077
Member_2_231077

This solution is only available to members.
Avatar of jcks

ASKER

That is true. Unfortunately this is serving as the backup destination server for my enterprise backups. I'm working offsite now, but I'll try to post some logs later. Thanks for the reply. So even if my OS is up and running okay, there could still be something wrong with the hardware somewhere?
99% sure it's hardware, but some of the HP storage servers have a couple of small disks in a rear-mounted tray for the OS, so that may keep going even if the disk subsystem at the front craps out.
Avatar of jcks

ASKER

Here are the zipped reports. report.zip is from the ADU. The next one is the ACU Diagnostics report. Not sure if they say the same thing or not. This is all pretty foreign to me at this point. I appreciate you looking at them, even if it might not help.
report.zip
report-fd86423a-00000728-0000000.zip
That's a horrible log :(

No sign of the controller or backplane expander crashing, just the disks falling asleep. It certainly looks like a lack of power to the disks.

The data on logical drive 1 will have to be backed up, then the logical drive deleted, recreated and restored. I don't think there's a way to make it skip past the parity mismatch.
Avatar of jcks

ASKER

Horrible as in lack of information to troubleshoot, or horrible because lots of things are wrong? I'm thinking the former.

Okay, forgive me if these are dumb questions, as I am relatively new to working on a server like this.

- Oddly enough, the parity message has disappeared today. But it was reported on logical drive 3, not 1. Was the suggestion to delete logical drive 1 based on this parity message?

- If I still need to delete it, logical drive 1 is my OS. Should I take an image backup of it and save it to an external drive or something? To delete the logical drive, do I just go into ACU?
Horrible as in lots wrong. You're bound to have some minor data corruption, but you won't even know which backup to restore.

"Logical Drive 1 Consistency Check Failed" was what was worrying me, I'd like to see what a full image backup does, not sure if it will succeed or not.
Avatar of jcks

ASKER

Oh... I see. I thought you were talking about the previously posted error message about parity on logical drive 3. My apologies.

So bad logs huh :(  

So after looking at the logs are you thinking it's not hardware (like the backplane) related?

There's no data on logical drive 1, just the OS. All my data resides on logical drive 3. I am confused on how to delete logical drive 1 as normally I would use Windows Disk Management, but logical drive 1 is my OS/system so I can't delete it while I'm in Windows. Sorry man, I'm usually pretty good with computers, but this is all new to me.

Can't be anything else but hardware - time to start swapping bits or replace the box. Let's see how the image goes before thinking of having to delete and recreate the logical drive.
Avatar of jcks

ASKER

Ok. I'll work on that tomorrow. Thanks
Avatar of jcks

ASKER

I got approval from the boss to buy a new box. Should I just do that and forget this? Or will the consistency issue follow to a new box if I use the same drives? I was under the impression from one of your previous posts that I could just pull out all these drives, plug them into a new box and be off and running. Is this the case, or did seeing the logs change that?

I didn't have time to try and back it up today. Hopefully time permits tomorrow.
Avatar of jcks

ASKER

My attempts to back up the OS failed. Not sure why, but the backup program would lock up before I even had a chance to start. I use Ghost 2003 via boot disk. It normally runs clean and doesn't require the machine being imaged to boot to the OS (that's why I prefer it). I tried 6 times to get it running and it locked up every time. I find this odd because the HDDs aren't even being accessed yet when this happens.

Anyway, does this mean it's for sure a hardware issue? I was hoping you (andyalder) could follow up with my last questions. You've been helpful and no one else is chiming in :)   Thanks
A new box with a compatible Smart Array controller is what I'd do too. I think there's a 12-bay DL180 G6 storage server, although I can't find it on HP's website. Your reseller will be able to help more there.

I think your backup software locking up is something else, although if it's on the server rather than a stand-alone CD it may be corrupt.
Avatar of jcks

ASKER

Ok. Thanks for your help man. I'm going through a discussion with my CDW contact now to get a replacement.