  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1508

external storage array crashes server

I need help determining why a Windows 2003 file server (an HP ML110 G3 with attached eSATA storage) has crashed on each of the past two nights with system error messages that cite problems with the external storage array.

The external storage device is model "CFI B4043ER" 4-Bay Port Multiplier.  It connects to an add-on eSATA card in the server. The storage drives are two 500GB Western Digital RE3 WD5002ABYS drives configured as software RAID 1 mirror. (These mirrored drives hold only data files, not OS files.)

After the first crash (the server shuts down and fails to auto-restart), I used Windows Disk Management and saw that the mirrored drives were no longer synchronized. Then I checked the System Event Viewer and found this sequence of errors:

Event Source:      Disk
Event ID:      15
The device, \Device\Harddisk2, is not ready for access yet.

Event Source:      dmio
Event ID:      30
dmio: Harddisk2 write error at block 6160455: status 0xc000009d

LDM - I/O interrupted, cabling and disk status need to be checked
(This event indicates that Logical Disk Manager encountered I/O failures on a disk. Possible causes include: A hardware failure that prevents communication with a disk (for example, a loose cable, a loose disk controller card, or a cable failure). Unexpected removal of a disk. Uncorrectable bad sectors on a disk.)

Event Source:      dmio
Event ID:      4
dmio: write error on Plex Volume1-02 of volume Volume1 offset 6160392 length 8

Event Source:      dmio
Event ID:      5
dmio: Plex Volume1-02 detached from volume Volume1

Event Source:      dmio
Event ID:      5
dmio: Disk2-01 Subdisk detached from volume Volume1-02

Event Source:      dmio
Event ID:      31
dmio: Harddisk2 write error at block 976772618 due to disk removal

Event Source:      Application Popup
Application popup: Windows - FT Orphaning : A disk that is part of a fault-tolerant volume can no longer be accessed.

Event Source:      Si3132r5
The device, \Device\Scsi\Si3132r51, did not respond within the timeout period.

Event Source:      LDM
Event ID:      500
Detached plex Volume1-02 in volume Volume1

Event Source:      PlugPlayManager
Event ID:      257
Timed out sending notification of target device change to window of "LDM Service"
(The specified program is taking too long to respond to notifications about the device. This can decrease system performance. User Action: Restart the program. If this does not improve the response time, consider upgrading to a newer version of the program)
 
Event Source:      Disk
Description: The driver detected a controller error on \Device\Harddisk1.

Event Source:      Si3132r5
Event ID:      9
The device, \Device\Scsi\Si3132r51, did not respond within the timeout period.
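
For reference, the numbers in these entries can be cross-checked with a little arithmetic. The sketch below is only an illustration: it assumes 512-byte sectors (not stated in the log), and the helper names `describe_status` and `block_to_gb` are made up for this example. It decodes the NTSTATUS value from Event ID 30, compares the disk block with the volume offset from Event ID 4, and converts the block from Event ID 31 into a position on the drive.

```python
# Assumption: 512-byte sectors; the event log does not state the sector size.
SECTOR_BYTES = 512

# Values copied from the dmio events above.
disk_block = 6160455       # Event ID 30: write error at this disk block
volume_offset = 6160392    # Event ID 4: offset within Plex Volume1-02
removal_block = 976772618  # Event ID 31: write error "due to disk removal"

# Only the one NTSTATUS code that appears in the log is listed here.
NTSTATUS_NAMES = {
    0xC000009D: "STATUS_DEVICE_NOT_CONNECTED",
}


def describe_status(code: int) -> str:
    """Return the symbolic name for a known NTSTATUS code (hypothetical helper)."""
    return NTSTATUS_NAMES.get(code, f"unknown status 0x{code:08X}")


def block_to_gb(block: int) -> float:
    """Convert a sector number to a byte position in decimal gigabytes."""
    return block * SECTOR_BYTES / 1e9


start_gap = disk_block - volume_offset

print(describe_status(0xC000009D))            # STATUS_DEVICE_NOT_CONNECTED
print(start_gap)                              # 63
print(round(block_to_gb(removal_block), 1))   # 500.1
```

Read this way, the first write error reports the device as not connected rather than a data error on the media, the 63-sector gap is consistent with a volume that starts 63 sectors into the disk, and the last failed write lands at the very end of a 500GB drive. None of that proves the cause, but it is more suggestive of a link, controller, or enclosure problem than of bad sectors.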

--------------------------------
The first day this happened, I ran a simple WD disk scan, discovered nothing unusual, and simply resynchronized the two mirrored disks. Today, I removed one of the disks (it had "!" on it in Windows Disk Management), attached it to another computer and ran "HDD Regenerator 2011," which detected no errors. I then inserted the drive into a different bay in the external array and again resynchronized the mirror.

I guess my next move will be to remove and scan the second drive. And if the problem occurs again, I can swap the old eSATA cable for a new one to see if the cable is a problem. But I'm stumped beyond that. I don't know how to test the external storage device itself or the eSATA add-on card in the HP server.

Any suggestions? Thanks!
Asked by: sfschool
 
Robin CM (Senior Security and Infrastructure Engineer) commented:
Check that your drivers and firmware are up to date.
You could also try running a disk stress-testing tool, something like http://www.nu2.nu/bst/
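
A stress pass does not have to write anything to the volume. As a minimal sketch of a read-only alternative (this is not how the tool linked above works, and the function name `read_pass` is invented for this example), the script below reads each file named on the command line from start to finish and reports throughput and any I/O error. It assumes the files live on the mirrored volume; reads served from the operating system's file cache will not exercise the drives, so large files that have not been opened recently are the better choice.

```python
import sys
import time


def read_pass(path, chunk_bytes=1024 * 1024):
    """Read a file sequentially without writing to it.

    Returns (bytes_read, seconds_elapsed, error), where error is None on success.
    """
    total = 0
    error = None
    start = time.monotonic()
    try:
        with open(path, "rb") as handle:
            while True:
                chunk = handle.read(chunk_bytes)
                if not chunk:
                    break
                total += len(chunk)
    except OSError as exc:
        # A timeout or device error on the storage path surfaces here.
        error = exc
    return total, time.monotonic() - start, error


if __name__ == "__main__":
    for target in sys.argv[1:]:
        size, seconds, problem = read_pass(target)
        rate = size / seconds / 1e6 if seconds > 0 else 0.0
        status = "OK" if problem is None else f"FAILED: {problem}"
        print(f"{target}: {size} bytes in {seconds:.1f}s ({rate:.1f} MB/s) {status}")
```

Running it repeatedly against a few large files while watching the System event log is one low-risk way to see whether sustained reads alone are enough to provoke the timeouts.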
 
sfschool (Author) commented:
1. The drivers and firmware already were up-to-date.
2. Right now, everything is working OK. Yesterday afternoon I moved one of the disks to a different bay on the storage device (we have two disks and four bays, so there was room). That is the only change I've made so far.
3. This morning I started Bart's Stuff Test on the mirrored drives, but I halted the test after about 3 minutes because I wasn't sure whether I would do more harm than good in a production environment.
Questions:
1. Did the Bart's Stuff Test write data to the drive that I now should try to delete? If so, how do I delete it?
2. In reading other posts on this board, I saw that genius member "dlethe" once said "the best, most reliable software to test bad hardware is nothing.  To get the most reliable results for testing hardware is to get a test board specifically designed to test hardware."
Would a test board work in my case? If so, please be specific.
3. As it seems that no one but "robincm" wants to tackle my issue, might anyone at least speculate on what could cause the System Event log messages I listed above? Specifically:

Event Source:      Disk
Event ID:      15
The device, \Device\Harddisk2, is not ready for access yet.

Event Source:      dmio
Event ID:      30
dmio: Harddisk2 write error at block 6160455: status 0xc000009d

Event Source:      dmio
Event ID:      4
dmio: write error on Plex Volume1-02 of volume Volume1 offset 6160392 length 8

Event Source:      dmio
Event ID:      5
dmio: Plex Volume1-02 detached from volume Volume1

Event Source:      dmio
Event ID:      5
dmio: Disk2-01 Subdisk detached from volume Volume1-02

Event Source:      dmio
Event ID:      31
dmio: Harddisk2 write error at block 976772618 due to disk removal

Event Source:      Application Popup
Application popup: Windows - FT Orphaning : A disk that is part of a fault-tolerant volume can no longer be accessed.

Event Source:      LDM
Event ID:      500
Detached plex Volume1-02 in volume Volume1

Event Source:      Disk
Description: The driver detected a controller error on \Device\Harddisk1.

Event Source:      Si3132r5
Event ID:      9
The device, \Device\Scsi\Si3132r51, did not respond within the timeout period.
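
One way to make a list like this easier to reason about is to turn it into records and count which devices are named. The following is a rough sketch, not a general event log parser: it assumes entries are pasted as plain text separated by blank lines, exactly as above, and the names `parse_events` and `devices_by_count` are hypothetical. The sample holds four of the entries from the list.

```python
import re
from collections import Counter

# Four entries copied from the list above, in the same pasted-text layout.
SAMPLE = r"""Event Source:      Disk
Event ID:      15
The device, \Device\Harddisk2, is not ready for access yet.

Event Source:      dmio
Event ID:      30
dmio: Harddisk2 write error at block 6160455: status 0xc000009d

Event Source:      Disk
Description: The driver detected a controller error on \Device\Harddisk1.

Event Source:      Si3132r5
Event ID:      9
The device, \Device\Scsi\Si3132r51, did not respond within the timeout period.
"""


def parse_events(text):
    """Split pasted event text into dicts with source, id, and message."""
    events = []
    for chunk in re.split(r"\n\s*\n", text.strip()):
        event = {"source": None, "id": None, "message": ""}
        message_lines = []
        for line in chunk.splitlines():
            source = re.match(r"Event Source:\s*(.+)", line)
            event_id = re.match(r"Event ID:\s*(\d+)", line)
            if source:
                event["source"] = source.group(1).strip()
            elif event_id:
                event["id"] = int(event_id.group(1))
            else:
                message_lines.append(line.strip())
        event["message"] = " ".join(message_lines)
        if event["source"]:
            events.append(event)
    return events


def devices_by_count(events):
    """Count how often each disk or controller name appears in the messages."""
    counts = Counter()
    for event in events:
        counts.update(re.findall(r"Harddisk\d+|Si3132r5\d+", event["message"]))
    return counts


if __name__ == "__main__":
    parsed = parse_events(SAMPLE)
    for name, count in devices_by_count(parsed).most_common():
        print(name, count)
```

Even on this small sample the tally names Harddisk2, Harddisk1, and the Si3132r5 controller. Errors spread across both mirrored disks and the controller point more toward something they share (the eSATA card, the cable, or the enclosure) than toward a single failing drive, although the log alone cannot confirm that.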
 
Robin CM (Senior Security and Infrastructure Engineer) commented:
Bart's Stuff Test should remove all the test data when you stop it.
The event log messages are basically self-explanatory. The last one is interesting, as it seems to indicate that the controller itself locked up.
I installed 5 x WD2002FYPS (RE4) drives hanging off a 3Ware 3DM 2 controller. These started randomly going offline to the point where the RAID5 was nearly destroyed. I contacted 3ware support and was told about a WD firmware update for the drives, not listed on the WD website, which took the drives from v04.05G04 to 04.05G05. I emailed WD support and was sent the update a day or so later. So despite what the site says, there clearly are sometimes firmware updates produced for drives. As of 15th Nov 2009 they started shipping all new drives of that model with the new firmware. In your case, strange that the drives should have only just started doing this, but who knows... Perhaps contact WD and see if there is anything.
Being a G3 I'm assuming the server itself is out of warranty, otherwise it'd be worth contacting HP about the issue too. Is there not some kind of hardware diagnostic available from HP? I think the ML110 is a fairly basic model so possibly not, but worth checking.
If the drives only have data on them, you could buy a new controller card and new drives and transfer the data onto them. That would not be so easy if they also contained the OS.
 
sfschool (Author) commented:
Thanks very much for responding and being my lone source of support. Your ideas gave me room to explore further. I haven't found one "smoking gun" or solution; but the problem has ceased for the time being, and I'll be better prepared if the trouble returns.