Failing hard disk?

Hi,

I'm running CentOS on an old HP DL140 1U server.

Before installing CentOS, I added a brand new WD Black 1TB drive for data. The OS is installed on a 160GB drive that was already in the server.

Three months on, the WD drive has started to develop problems. CentOS keeps putting it into read-only mode, and I have to power the server down completely to get the drive working again. A standard reboot doesn't seem to help, and when I do try rebooting, I get a BIOS warning that the drive is faulty.

However, when I power the server down fully, it boots with no problem.

Here is the log from CentOS:

Jul 10 00:47:58 backupserver kernel: end_request: I/O error, dev sdb, sector 614323705
Jul 10 00:47:58 backupserver kernel: end_request: I/O error, dev sdb, sector 614325649
Jul 10 00:47:58 backupserver kernel: end_request: I/O error, dev sdb, sector 1430681601
Jul 10 00:47:58 backupserver kernel: end_request: I/O error, dev sdb, sector 1430681705
Jul 10 00:47:58 backupserver kernel: end_request: I/O error, dev sdb, sector 1430683593
Jul 10 00:47:58 backupserver kernel: end_request: I/O error, dev sdb, sector 1430685545
Jul 10 00:47:58 backupserver kernel: JBD: Detected IO errors while flushing file data on sdb1
Jul 10 00:47:58 backupserver kernel: EXT3-fs (sdb1): error: ext3_journal_start_sb: Detected aborted journal
Jul 10 00:47:58 backupserver kernel: EXT3-fs (sdb1): error: remounting filesystem read-only
Jul 10 00:49:49 backupserver named[1118]: zone keshercomms.com/IN: refresh: non-authoritative answer from master 37.128.190.75#53 (source 0.0.0.0#0)
Jul 10 00:52:01 backupserver named[1118]: zone averwood.co.uk/IN: serial number (2012051611) received from master 37.128.190.75#53 < ours (2012051803)
Jul 10 00:52:15 backupserver named[1118]: zone essaproperties.co.uk/IN: serial number (2012051611) received from master 37.128.190.75#53 < ours (2012051803)
Jul 10 00:53:31 backupserver named[1118]: zone cdd.uk.com/IN: serial number (2012051615) received from master 37.128.190.75#53 < ours (2012051803)
Jul 10 00:53:48 backupserver ntpd[4315]: synchronized to 178.79.150.93, stratum 3
Jul 10 00:54:45 backupserver named[1118]: dumping master file: tmp-CNUzHVGo43: open: permission denied
Jul 10 00:55:22 backupserver named[1118]: zone tamid.co.uk/IN: serial number (2012051615) received from master 37.128.190.75#53 < ours (2012051803)
Jul 10 00:55:30 backupserver kernel: __ratelimit: 27 callbacks suppressed
Jul 10 00:55:30 backupserver kernel: sd 1:0:1:0: [sdb] Unhandled error code
Jul 10 00:55:30 backupserver kernel: sd 1:0:1:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 10 00:55:30 backupserver kernel: sd 1:0:1:0: [sdb] CDB: Read(10): 28 00 6f f6 3e c1 00 00 08 00
Jul 10 00:55:30 backupserver kernel: __ratelimit: 27 callbacks suppressed
Jul 10 00:55:30 backupserver kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #58695697 offset 0
Jul 10 00:55:53 backupserver kernel: sd 1:0:1:0: [sdb] Unhandled error code
Jul 10 00:55:53 backupserver kernel: sd 1:0:1:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 10 00:55:53 backupserver kernel: sd 1:0:1:0: [sdb] CDB: Read(10): 28 00 28 01 ff 19 00 00 08 00
Jul 10 00:55:53 backupserver kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #20971528 offset 0
Jul 10 00:56:00 backupserver kernel: sd 1:0:1:0: [sdb] Unhandled error code
Jul 10 00:56:00 backupserver kernel: sd 1:0:1:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 10 00:56:00 backupserver kernel: sd 1:0:1:0: [sdb] CDB: Read(10): 28 00 00 00 6e e1 00 00 08 00
Jul 10 00:56:00 backupserver kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #11 offset 0
Jul 10 00:56:11 backupserver named[1118]: zone jemsmanchester.co.uk/IN: serial number (2012051615) received from master 37.128.190.75#53 < ours (2012051803)
Jul 10 00:56:32 backupserver kernel: sd 1:0:1:0: [sdb] Unhandled error code
Jul 10 00:56:32 backupserver kernel: sd 1:0:1:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 10 00:56:32 backupserver kernel: sd 1:0:1:0: [sdb] CDB: Read(10): 28 00 33 d2 b4 d1 00 00 08 00
Jul 10 00:56:32 backupserver kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #27164879 offset 0
Jul 10 00:57:01 backupserver kernel: sd 1:0:1:0: [sdb] Unhandled error code
Jul 10 00:57:01 backupserver kernel: sd 1:0:1:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 10 00:57:01 backupserver kernel: sd 1:0:1:0: [sdb] CDB: Read(10): 28 00 28 02 be b1 00 00 08 00
Jul 10 00:57:01 backupserver kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #20971822 offset 0
Jul 10 00:57:01 backupserver kernel: sd 1:0:1:0: [sdb] Unhandled error code
Jul 10 00:57:01 backupserver kernel: sd 1:0:1:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 10 00:57:01 backupserver kernel: sd 1:0:1:0: [sdb] CDB: Read(10): 28 00 6f f5 be e9 00 00 08 00
Jul 10 00:57:01 backupserver kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #58695700 offset 0
Jul 10 00:57:01 backupserver kernel: sd 1:0:1:0: [sdb] Unhandled error code
Jul 10 00:57:01 backupserver kernel: sd 1:0:1:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 10 00:57:01 backupserver kernel: sd 1:0:1:0: [sdb] CDB: Read(10): 28 00 28 08 41 49 00 00 08 00
Jul 10 00:57:01 backupserver kernel: EXT3-fs error (device sdb1): ext3_get_inode_loc: unable to read inode block - inode=20989173, block=83951697
Jul 10 00:57:01 backupserver kernel: sd 1:0:1:0: [sdb] Unhandled error code
Jul 10 00:57:01 backupserver kernel: sd 1:0:1:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 10 00:57:01 backupserver kernel: sd 1:0:1:0: [sdb] CDB: Read(10): 28 00 28 5c 3f b1 00 00 08 00
Jul 10 00:57:01 backupserver kernel: EXT3-fs error (device sdb1): ext3_get_inode_loc: unable to read inode block - inode=21160397, block=84639774
Jul 10 00:57:01 backupserver kernel: sd 1:0:1:0: [sdb] Unhandled error code
Jul 10 00:57:01 backupserver kernel: sd 1:0:1:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 10 00:57:01 backupserver kernel: sd 1:0:1:0: [sdb] CDB: Read(10): 28 00 33 d9 ff 29 00 00 08 00
Jul 10 00:57:01 backupserver kernel: EXT3-fs error (device sdb1): ext3_find_entry: reading directory #27181070 offset 0
Jul 10 00:57:01 backupserver kernel: sd 1:0:1:0: [sdb] Unhandled error code
Jul 10 00:57:01 backupserver kernel: sd 1:0:1:0: [sdb] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
Jul 10 00:57:01 backupserver kernel: sd 1:0:1:0: [sdb] CDB: Read(10): 28 00 28 54 40 89 00 00 08 00
Jul 10 00:57:01 backupserver kernel: EXT3-fs error (device sdb1): ext3_get_inode_loc: unable to read inode block - inode=21144443, block=84574265

Is this a hardware problem (faulty hard disk), or is it something software-related (since power cycling seems to fix it temporarily)?

I've tried running fsck, which doesn't show any problems at all.

Also, is the WD Black OK in a server that runs 24/7? Someone on IRC commented that I'm having these problems because it's a desktop hard disk and therefore shouldn't be used in a server.

If it's not OK, what SATA hard disk should I be looking at for a server?
The server only supports SATA.

Thanks
Dan

David

First, you need a new HDD, as you have corruption, read errors, drive timeouts, and no doubt a few other problems that don't show up in the list. (Plus, a 160GB disk drive in that machine pretty much guarantees it is way past its expected life.)

Get yourself any "server class" or enterprise class SATA drive. The WD Black is not designed for 24x7x365. If you like WD, look at the RE3 and RE4 series.

The WD Black is not suitable, but if you use the md driver to mirror it, then at least you won't be too badly off.

The WD Black is designed for an annual duty cycle of 2400 hours of light-to-medium use, so you can do the math.
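
For anyone who wants the math spelled out, here is a minimal sketch. It takes the 2400 hours/year figure from the post at face value (it has not been checked against a WD datasheet) and compares it with a server that is powered on around the clock. The function names are just illustrative.

```python
# Back-of-the-envelope check of the duty-cycle figure quoted above.
# Assumption: the 2400 hours/year number is taken from the post as-is,
# not verified against a WD datasheet.

HOURS_PER_YEAR = 24 * 365      # continuous 24/7 operation: 8760 hours
QUOTED_DUTY_HOURS = 2400       # annual duty cycle quoted in the post


def duty_fraction(quoted_hours, hours_per_year=HOURS_PER_YEAR):
    """Share of a full year that the quoted duty cycle covers."""
    return quoted_hours / hours_per_year


def overuse_factor(quoted_hours, hours_per_year=HOURS_PER_YEAR):
    """How many times the quoted duty cycle a 24/7 server runs per year."""
    return hours_per_year / quoted_hours


if __name__ == "__main__":
    print(f"Quoted duty cycle covers {duty_fraction(QUOTED_DUTY_HOURS):.1%} of a year")
    print(f"24/7 use is {overuse_factor(QUOTED_DUTY_HOURS):.2f}x the quoted figure")
```

If the quoted figure is right, a machine that never powers down runs the drive for roughly three and a half times the hours it was rated for each year.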

SOLUTION

jamietoner

athomsfere

I would also agree with it being a bad drive, most likely. It could also be the controller, but the drive is more likely.

As for the drive, I have not been impressed with WD Blacks recently. I am having a disproportionate number of them fail in laptops and desktops at work. Luckily they have the 5-year warranty; otherwise I would be buying Blues or Seagates / Samsungs at this point.

DanJourno

ASKER

Thanks for the comments. I'm going to go out and buy a WD RE4 and see if that holds up better. Will send the WD Black back and try to get a refund.

Thanks for the advice.

David

It isn't the controller. The CDBs are quite clear it is the disk. Bad controllers can't possibly cause these types of read errors, not on a SATA drive.
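
To make the CDB lines in the log easier to read, here is a rough sketch that decodes them. It assumes the standard 10-byte READ(10) layout (opcode 0x28, a 4-byte big-endian logical block address in bytes 2-5, and a 2-byte transfer length in bytes 7-8); `decode_read10` is a hypothetical helper name, not part of any tool. It only shows which blocks were being requested, the failure itself is reported on the separate `Result:` line.

```python
# Decode the "CDB: Read(10): ..." hex strings from the kernel log.
# Assumption: standard READ(10) layout, as described in the lead-in.

def decode_read10(cdb_hex):
    """Return (lba, block_count) for a READ(10) CDB given as a hex string."""
    cdb = bytes.fromhex(cdb_hex)          # whitespace between bytes is ignored
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length
    return lba, blocks


if __name__ == "__main__":
    # Two CDBs copied from the log above.
    for cdb in ("28 00 6f f6 3e c1 00 00 08 00",
                "28 00 00 00 6e e1 00 00 08 00"):
        lba, blocks = decode_read10(cdb)
        print(f"{cdb} -> LBA {lba}, {blocks} blocks")
```

Both requests are for 8 blocks, which would be one 4 KiB filesystem block if the drive uses 512-byte sectors.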

DanJourno

ASKER

On a related subject, if I go with two WD RE4s, how easy would it be to set up software RAID 1 in CentOS?

Would I have to rebuild the OS, or can I copy the OS from the 160GB drive to one of the RE4 drives and then activate RAID 1?

David

If it is a server, and the budget limits you to a choice between one RE4 and a pair of WD Blacks, then personally I would go with the pair of Blacks, as long as you configure the md driver for software RAID1 and disable the write cache.

Statistically speaking, the pair of Blacks will give you better overall read performance and similar write performance, but the real difference is that you will have better protection against both bad blocks and catastrophic data loss.

If this is heavy usage 24x7x365, then personally, I'd say neither a single RE4 nor a pair of Blacks is suitable. You need a pair of RE4s.

DanJourno

ASKER

A pair of RE4s is fine. The budget is there if it's required to provide good data protection.

My question was: if I wanted to install 2 x RE4s, would I have to rebuild the OS from scratch? Or is there a way to copy the OS from the current 160GB drive onto the RE4s?

SOLUTION

David

DanJourno

ASKER

OK, thanks. I'm not sure there are SATA connections available for 3 drives; I think there are only 2.

I'll check in the morning.

Thanks for your help.

SOLUTION

David

DanJourno

ASKER

I may have found the problem. I was checking the specs to see how many SATA sockets there are.

The server specs say that it supports a maximum of 2 x 500GB.
The drive is a 1TB drive, and the usage is currently just over the 500GB mark.

Could it be a hardware limitation that's causing CentOS to believe the drive is corrupt?

Is there any way to cheaply upgrade the server to allow for larger drives? Or is it a case of replacing the server?
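
One quick way to look at the 500GB idea is to convert the failing sector numbers from the log into byte offsets. This is only a sketch: it assumes 512-byte logical sectors and a decimal 500 GB limit (10^9 bytes per GB), neither of which is confirmed anywhere in this thread, and the helper names are made up for illustration.

```python
# Convert the "I/O error, dev sdb, sector N" numbers from the log into offsets.
# Assumptions: 512-byte logical sectors, and "500GB" meaning 500 * 10**9 bytes.

SECTOR_BYTES = 512
LIMIT_BYTES = 500 * 10**9

# Sector numbers copied from the end_request lines in the log above.
FAILING_SECTORS = [614323705, 614325649, 1430681601,
                   1430681705, 1430683593, 1430685545]


def offset_gb(sector, sector_bytes=SECTOR_BYTES):
    """Byte offset of a sector on the disk, in decimal gigabytes."""
    return sector * sector_bytes / 10**9


def beyond_limit(sector, limit_bytes=LIMIT_BYTES, sector_bytes=SECTOR_BYTES):
    """True if the sector lies past the assumed 500 GB boundary."""
    return sector * sector_bytes > limit_bytes


if __name__ == "__main__":
    for s in FAILING_SECTORS:
        side = "above" if beyond_limit(s) else "below"
        print(f"sector {s}: {offset_gb(s):.1f} GB ({side} 500 GB)")
```

Under those assumptions the logged errors fall on both sides of the 500 GB mark (around 314 GB and around 732 GB). That is just arithmetic on the logged numbers; it does not by itself confirm or rule out a controller limitation.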

DanJourno

ASKER

P.S. The BIOS lists the drive as a 1TB drive.

SOLUTION