Smart Event for bad HD

I have a friend who has a desktop with two hard drives running RAID 1. I was told they got a "SMART event" error message saying that a hard drive was going to fail.

Since there are two drives, how do I know which one is going bad? Once I find the bad drive, can I unplug it, plug in a new one, and boot normally?

Thanks

TheTechEase

ASKER

This is the log I just found from Intel

System Information

Kit Installed: 6.0.0.1022
Kit Install History: 6.0.0.1022
Shell Version: 6.0.0.1022

OS Name: Microsoft Windows XP Professional
OS Version: 5.1.2600 Service Pack 3 Build 2600
System Name: LAURA
System Manufacturer: Intel Corporation
System Model: DQ965GF
Processor: Intel(R) Pentium(R) D CPU 2.80GHz
BIOS Version/Date: Intel Corp. CO96510J.86A.5844.2007.0302.0258, 03/02/2007

Language: ENU

Intel(R) RAID Technology

Intel RAID Controller: Intel(R) ICH8R/DO/DH SATA RAID Controller
Number of Serial ATA ports: 6

RAID Option ROM Version: 6.1.0.1002
Driver Version: 6.0.0.1022
RAID Plug-In Version: 6.0.0.1022
Language Resource Version of the RAID Plug-In: 6.0.0.1022
Create Volume Wizard Version: 6.0.0.1022
Language Resource Version of the Create Volume Wizard: 6.0.0.1022
Create Volume from Existing Hard Drive Wizard Version: 6.0.0.1022
Language Resource Version of the Create Volume from Existing Hard Drive Wizard: 6.0.0.1022
Modify Volume Wizard Version: 6.0.0.1022
Language Resource Version of the Modify Volume Wizard: 6.0.0.1022
Delete Volume Wizard Version: 6.0.0.1022
Language Resource Version of the Delete Volume Wizard: 6.0.0.1022
ISDI Library Version: 6.0.0.1022
Event Monitor User Notification Tool Version: 6.0.0.1022
Language Resource Version of the Event Monitor User Notification Tool: 6.0.0.1022
Event Monitor Version: 6.0.0.1022

Array_0000
Status: No active migration(s)
Hard Drive Write Cache Enabled: Yes
Size: 149 GB
Free Space: 0 GB
Number of Hard Drives: 2
Hard Drive Member 1: WDC WD800JD-22LSA0
Hard Drive Member 2: WDC WD800JD-22LSA0
Number of Volumes: 1
Volume Member 1: XP-PRO SP2

XP-PRO SP2
Status: Normal
System Volume: Yes
Volume Write-Back Cache Enabled: No
RAID Level: RAID 1 (mirroring)
Size: 74.5 GB
Number of Hard Drives: 2
Hard Drive Member 1: WDC WD800JD-22LSA0
Hard Drive Member 2: WDC WD800JD-22LSA0
Parent Array: Array_0000

Hard Drive 0
Usage: Array member
Status: SMART event
Device Port: 0
Device Port Location: Internal
Current Serial ATA Transfer Mode: Generation 2
Model: WDC WD800JD-22LSA0
Serial Number: WD-WMAM9JJ66130
Firmware: 06.01D06
Native Command Queuing Support: No
Hard Drive Write Cache Enabled: Yes
Size: 74.5 GB
Number of Volumes: 1
Volume Member 1: XP-PRO SP2
Parent Array: Array_0000

Hard Drive 1
Usage: Array member
Status: Normal
Device Port: 1
Device Port Location: Internal
Current Serial ATA Transfer Mode: Generation 2
Model: WDC WD800JD-22LSA0
Serial Number: WD-WMAM9JE48837
Firmware: 06.01D06
Native Command Queuing Support: No
Hard Drive Write Cache Enabled: Yes
Size: 74.5 GB
Number of Volumes: 1
Volume Member 1: XP-PRO SP2
Parent Array: Array_0000

Unused Port 0
Device Port: 2
Device Port Location: Internal

Unused Port 1
Device Port: 3
Device Port Location: Internal

Unused Port 2
Device Port: 4
Device Port Location: Internal

Unused Port 3
Device Port: 5
Device Port Location: Internal

SOLUTION

David

TheTechEase

ASKER

I did find Intel's software and posted the log report above.

David

The above is more of a configuration report; it is not an event log. Nothing here indicates any SMART errors.

TheTechEase

ASKER

Ok

If one drive does go bad will the computer still boot and work like normal with just one drive? Do I "need" to replace the bad drive?

David

Yes, that is one of the benefits of RAID 1 with a controller. One isolated error is of little concern, but just in case, it is always a good idea to have a backup. Nothing is infallible. The RAID controller you have is low-end, maybe a $5.00 chip. Compare that to a $500 RAID controller with battery backup, its own processor, and about 10x more firmware. So you should be wary.

ASKER CERTIFIED SOLUTION

johnb6767

CrystalMethod

If you're using an onboard RAID controller, you should be able to set the SATA controller to "non-RAID". Then boot from a WD diagnostic CD and test one drive, then the other. It's important that you do not let the system boot from the hard drives; disable booting from the hard drive in the BIOS if you have to. The diagnostic will list the serial number of each drive. Note down the serial number (s/n) of the defective drive, then pull the hard drives out and check the serial numbers against the labels to determine which is the failed unit.

nobus

I agree with John. The problem will be how to find disk 0, imo.
If unsure, take a disk, connect it to another PC, and run the disk diagnostic on it; then do the same with the other one.
Then you know their status: http://support.wdc.com/download/#diagutils

David

After re-reading: the report says Hard Drive 0 is the one with the SMART event, and its serial number is WD-WMAM9JJ66130.
That is the disk. The serial number is on the label of the HDD.
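For anyone who would rather not scan the report by eye, here is a minimal Python sketch that pulls the flagged drive out of the text report. It assumes the report has been saved as plain text in the same layout as the log posted above (blank-line-separated sections, "Key: Value" lines); the function name is made up for illustration and is not part of Intel's software.

```python
import re


def find_smart_event_drives(report):
    """Return drive name, port, and serial for each 'Hard Drive N' section
    whose Status field reads 'SMART event'."""
    flagged = []
    # Sections are separated by blank lines; drive sections start with "Hard Drive N".
    for section in re.split(r"\n\s*\n", report.strip()):
        lines = [ln.strip() for ln in section.splitlines() if ln.strip()]
        if not lines or not re.fullmatch(r"Hard Drive \d+", lines[0]):
            continue
        fields = {}
        for ln in lines[1:]:
            key, sep, value = ln.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if fields.get("Status", "").lower() == "smart event":
            flagged.append({
                "drive": lines[0],
                "port": fields.get("Device Port"),
                "serial": fields.get("Serial Number"),
            })
    return flagged


# Excerpt of the report posted in this thread.
sample_report = """Hard Drive 0
Usage: Array member
Status: SMART event
Device Port: 0
Serial Number: WD-WMAM9JJ66130

Hard Drive 1
Usage: Array member
Status: Normal
Device Port: 1
Serial Number: WD-WMAM9JE48837
"""

flagged = find_smart_event_drives(sample_report)
print(flagged)
```

The serial number it reports is the one to match against the label on the physical drive.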

nobus

good eyes - dlethe !

ocanada_techguy

Experts here identified which one, but not why, via the config tool clues. FYI a free S.M.A.R.T. specific tool you could also try is: http://www.beyondlogic.org/solutions/smart/smart.htm

It depends somewhat on what the S.M.A.R.T. error or errors are. If there was overheating, possibly due to failing bearings, the drive would definitely be near end-of-life. Often it's simply a case of many bad tracks/sectors/blocks; it's "normal" for bad tracks to occur over time. Some drives (enterprise grade) have elaborate, extensive bad-track handling; some (consumer) have simple handling. Generally, the drive has a "spare" area set aside. When a bad block occurs, the logic on the drive detects it, sets it aside, tries if possible to read the contents of the block "one last time", and moves the content to one of the spare blocks, all under the auspices of the logic board on the drive itself. It does "record" these events in the S.M.A.R.T. info. The SMART errors are SET when a threshold is crossed: either a) the drive is about to run out of spare blocks, or b) too many reallocations happen in a short period of time, which indicates a more serious problem or imminent failure. During the next boot POST (self test) the computer BIOS will then pause to say "uh oh, SMART errors, drive may soon fail".
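To make the two trip conditions concrete, here is a toy Python model of them. It is only a sketch: real drive firmware is vendor-specific, and the class name, the spare-area size, and every threshold value below are assumptions invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SpareBlockTracker:
    """Toy model of a drive's bad-block bookkeeping (all numbers hypothetical)."""
    spare_total: int = 100            # assumed size of the spare area, in blocks
    low_spare_margin: int = 10        # condition a): trip when this few spares remain
    burst_limit: int = 5              # condition b): trip if this many remaps ...
    burst_window_hours: float = 24.0  # ... happen within this many hours
    remap_times: list = field(default_factory=list)

    def record_remap(self, hour):
        """Record that a bad block was moved to the spare area at the given hour."""
        self.remap_times.append(hour)

    def spares_left(self):
        return self.spare_total - len(self.remap_times)

    def tripped(self, now):
        """True if either threshold condition is met at time `now` (in hours)."""
        running_out = self.spares_left() <= self.low_spare_margin
        recent = [t for t in self.remap_times if now - t <= self.burst_window_hours]
        bursting = len(recent) >= self.burst_limit
        return running_out or bursting


# A drive that loses a block now and then stays healthy ...
drive = SpareBlockTracker()
for hour in (0, 1000, 2000):
    drive.record_remap(hour)
healthy_before = drive.tripped(now=2000)

# ... but a burst of remaps inside one day trips condition b).
for hour in (2001, 2002, 2003, 2004):
    drive.record_remap(hour)
tripped_after = drive.tripped(now=2004)

print(healthy_before, tripped_after)
```

The point of the model is that a single remapped block is routine, while either a nearly empty spare area or a sudden cluster of remaps is what sets the flag.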

Just to clarify, some tools can deal with individual drives, BUT when the drives are part of a RAID set, not so much. CrystalMethod has suggested turning RAID off, but that is dangerous. Crystal warns to turn it off only to use a bootable diag tool, and do NOT boot the hard drives with the OS expecting the RAID configuration, or else your RAID set could end up "broken" or worse. nobus suggests a safer approach: remove your disk(s) and attach them to a completely different machine strictly for the purposes of running diag tools. Not everyone has multiple machines at the ready, of course.

It is suggested that you "swap out" the bad drive for a new/good replacement. If you're using a RAID level with redundancy, the drive contents will be rebuilt from the mirror or stripe, depending on the RAID configuration. For people who don't have redundancy but have drives in simple mode, a back up, verify the backup, replace, and then restore from backup approach could be used.
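Conceptually, a RAID 1 rebuild is just a one-way, block-by-block copy from the surviving member onto the replacement. The Python sketch below shows that idea on in-memory buffers; it is not how the Intel controller is implemented, and the function name and block size are assumptions for illustration.

```python
def rebuild_mirror(source, target, block_size=4096):
    """Copy every block from the healthy member onto the replacement.

    `source` is the surviving drive's contents (bytes), `target` is the
    replacement (bytearray). Returns the number of blocks copied.
    """
    if len(target) < len(source):
        raise ValueError("replacement drive is smaller than the surviving member")
    copied = 0
    for offset in range(0, len(source), block_size):
        block = source[offset:offset + block_size]
        # The replacement's old contents are simply overwritten.
        target[offset:offset + len(block)] = block
        copied += 1
    return copied


healthy = bytes(range(256)) * 40          # 10240 bytes standing in for the good drive
replacement = bytearray(len(healthy))     # blank replacement of the same size
blocks_copied = rebuild_mirror(healthy, replacement)
print(blocks_copied, bytes(replacement) == healthy)
```

Note the size check: a replacement drive has to be at least as large as the surviving member, which is worth confirming before buying one.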

Another approach could be to clone the drive to a new drive, and then remove the suspect drive, leaving the new cloned drive to take its place. HOWEVER, typical third-party tools, especially ones you "boot" independently, expect to deal with the disks directly as raw ATA devices, not through the RAID drivers or protocols, and expect the disk to contain regular OS volumes that can be seen as folders and files, NOT the scrambled content of a RAID stripe set. Mirror sets are never "re-joined"; they're always re-built (one overwrites the other). That leaves either the backup/restore approach OR, ideally, a well-crafted RAID (sub)BIOS with its own utilities that will provide some or all of the capabilities you need to replace one drive with another. Some RAID system utilities are easier to work with than others.

Theoretically, the drive with the errors (depending on whether it's simply bad tracks/blocks/sectors) can be "recuperated". Some might be tempted to try recovery tools on the disk and not even bother replacing it. HOWEVER, the fact that it's part of a RAID adds complexity. Hard disks are so inexpensive that your first, best option is to go about replacing the suspect drive. After removal, you could try HDDRegenerator or SpinRite to recuperate the drive. SpinRite is great for being able to read data off bad sectors, and as a PREVENTATIVE tool it can set aside sectors that are slightly bad before they go bad, thus saving your data before the bad tracks happen. On the other hand, for recuperating a drive that has reached/exceeded thresholds, HDDRegenerator in particular is said to be able to re-do the bad tracking, create a whole new "spare" area, and reset the S.M.A.R.T. settings.

One more thing.... BEFORE you put a new replacement drive in, some experts recommend you "scrub" it first. Many drives come "pre-formatted" but it is a good idea to format AND check for bad blocks. You can then re-erase and put it into RAID for clone/replace reconstruction, having already checked it for bad sectors at least once.
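The "check for bad blocks" step boils down to writing a known pattern to every block and reading it back. Here is a small Python sketch of that idea run against a simulated drive. It is destructive by design, the names are invented for illustration, and on real hardware the manufacturer's diagnostic is the better tool, since a simple script like this can be fooled by operating system caching.

```python
import io


def scan_for_bad_blocks(device, size, block_size=4096, pattern=0xAA):
    """Destructively write a test pattern to each block, read it back,
    and return the offsets of blocks that do not read back correctly."""
    bad = []
    for offset in range(0, size, block_size):
        length = min(block_size, size - offset)
        expected = bytes([pattern]) * length
        device.seek(offset)
        device.write(expected)
        device.seek(offset)
        if device.read(length) != expected:
            bad.append(offset)
    return bad


class FlakyDisk(io.BytesIO):
    """Simulated drive: any read starting at `bad_offset` comes back as zeros."""

    def __init__(self, size, bad_offset):
        super().__init__(bytes(size))
        self.bad_offset = bad_offset

    def read(self, n=-1):
        pos = self.tell()
        data = super().read(n)
        return bytes(len(data)) if pos == self.bad_offset else data


good_result = scan_for_bad_blocks(io.BytesIO(bytes(16384)), 16384)
flaky_result = scan_for_bad_blocks(FlakyDisk(16384, bad_offset=8192), 16384)
print(good_result, flaky_result)
```

A drive that comes back with an empty list here has at least been exercised once end to end before it is trusted with the rebuild.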

Did I mention, have a backup and verify it? Regularly, but especially before a major undertaking. Perhaps even have Windows make a system-state backup as well.
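"Verify" can be as simple as comparing checksums of the originals against the copies. The Python sketch below does that for a file-level backup; it assumes the backup is a plain folder copy (not an image or a proprietary archive), and the function names are made up for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files do not fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_backup(source_dir, backup_dir):
    """Return relative paths that are missing from the backup or differ in content."""
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = backup_dir / rel
        if not dst.is_file() or file_digest(src) != file_digest(dst):
            problems.append(str(rel))
    return problems


# Small demonstration with throwaway folders.
with tempfile.TemporaryDirectory() as tmp:
    src, dst = Path(tmp, "src"), Path(tmp, "backup")
    src.mkdir()
    dst.mkdir()
    (src / "a.txt").write_text("hello")
    (dst / "a.txt").write_text("hello")           # good copy
    (src / "b.txt").write_text("data")
    (dst / "b.txt").write_text("dat4")            # corrupted copy
    (src / "c.txt").write_text("only in source")  # never backed up
    demo_problems = verify_backup(src, dst)

print(demo_problems)
```

An empty list means every file in the source has a byte-identical copy in the backup; anything listed needs to be copied again before the drive swap.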