SMART Event for bad HD

I have a friend who has a desktop with two hard drives running RAID 1. I was told they got a "SMART event" error message that said the HD was going to fail.

Since there are two drives, how do I know which one is going bad? If I find the bad HD, can I unplug it, plug in a new one, and boot normally?

Thanks
TheTechEaseAsked:

TheTechEaseAuthor Commented:
This is the log I just found from Intel


System Information

Kit Installed: 6.0.0.1022
Kit Install History: 6.0.0.1022
Shell Version: 6.0.0.1022

OS Name: Microsoft Windows XP Professional
OS Version: 5.1.2600 Service Pack 3 Build 2600
System Name: LAURA
System Manufacturer: Intel Corporation
System Model: DQ965GF
Processor: Intel(R) Pentium(R) D CPU 2.80GHz
BIOS Version/Date: Intel Corp. CO96510J.86A.5844.2007.0302.0258, 03/02/2007

Language: ENU



Intel(R) RAID Technology

Intel RAID Controller: Intel(R) ICH8R/DO/DH SATA RAID Controller
Number of Serial ATA ports: 6
 
RAID Option ROM Version: 6.1.0.1002
Driver Version: 6.0.0.1022
RAID Plug-In Version: 6.0.0.1022
Language Resource Version of the RAID Plug-In: 6.0.0.1022
Create Volume Wizard Version: 6.0.0.1022
Language Resource Version of the Create Volume Wizard: 6.0.0.1022
Create Volume from Existing Hard Drive Wizard Version: 6.0.0.1022
Language Resource Version of the Create Volume from Existing Hard Drive Wizard: 6.0.0.1022
Modify Volume Wizard Version: 6.0.0.1022
Language Resource Version of the Modify Volume Wizard: 6.0.0.1022
Delete Volume Wizard Version: 6.0.0.1022
Language Resource Version of the Delete Volume Wizard: 6.0.0.1022
ISDI Library Version: 6.0.0.1022
Event Monitor User Notification Tool Version: 6.0.0.1022
Language Resource Version of the Event Monitor User Notification Tool: 6.0.0.1022
Event Monitor Version: 6.0.0.1022
 
Array_0000
Status: No active migration(s)
Hard Drive Write Cache Enabled: Yes
Size: 149 GB
Free Space: 0 GB
Number of Hard Drives: 2
Hard Drive Member 1: WDC WD800JD-22LSA0
Hard Drive Member 2: WDC WD800JD-22LSA0
Number of Volumes: 1
Volume Member 1: XP-PRO SP2
 
XP-PRO SP2
Status: Normal
System Volume: Yes
Volume Write-Back Cache Enabled: No
RAID Level: RAID 1 (mirroring)
Size: 74.5 GB
Number of Hard Drives: 2
Hard Drive Member 1: WDC WD800JD-22LSA0
Hard Drive Member 2: WDC WD800JD-22LSA0
Parent Array: Array_0000
 
Hard Drive 0
Usage: Array member
Status: SMART event
Device Port: 0
Device Port Location: Internal
Current Serial ATA Transfer Mode: Generation 2
Model: WDC WD800JD-22LSA0
Serial Number: WD-WMAM9JJ66130
Firmware: 06.01D06
Native Command Queuing Support: No
Hard Drive Write Cache Enabled: Yes
Size: 74.5 GB
Number of Volumes: 1
Volume Member 1: XP-PRO SP2
Parent Array: Array_0000
 
Hard Drive 1
Usage: Array member
Status: Normal
Device Port: 1
Device Port Location: Internal
Current Serial ATA Transfer Mode: Generation 2
Model: WDC WD800JD-22LSA0
Serial Number: WD-WMAM9JE48837
Firmware: 06.01D06
Native Command Queuing Support: No
Hard Drive Write Cache Enabled: Yes
Size: 74.5 GB
Number of Volumes: 1
Volume Member 1: XP-PRO SP2
Parent Array: Array_0000
 
Unused Port 0
Device Port: 2
Device Port Location: Internal
 
Unused Port 1
Device Port: 3
Device Port Location: Internal
 
Unused Port 2
Device Port: 4
Device Port Location: Internal
 
Unused Port 3
Device Port: 5
Device Port Location: Internal





DavidPresidentCommented:
Whatever mechanism reported the SMART error should provide a disk identifier. The bad news is that without it you can't tell which disk it is, short of hooking the drives up to another computer and running diagnostics. The good news is that both disks are still functioning at the moment.

Intel has a Windows-based utility on its website for this motherboard that shows the health and configuration of the logical and physical drives. Install it, and you will soon know which disk is suspect.

You *should* be able to just install the replacement disk and let it rebuild the RAID 1. Before doing that, though, I would disable the write cache. Your system is under stress, and an enabled write cache can cause data loss in the event of a failure or power loss.
TheTechEaseAuthor Commented:
I did find Intel's software and posted the log report above.
DavidPresidentCommented:
The above is more of a configuration report; it is not an event log. Nothing here indicates any SMART errors.
TheTechEaseAuthor Commented:
Ok

If one drive does go bad, will the computer still boot and work normally with just one drive? Do I "need" to replace the bad drive?
DavidPresidentCommented:
Yes, that is one of the benefits of RAID 1 with a controller. One isolated error is of little concern, but it is always a good idea to have a backup just in case. Nothing is infallible. The RAID controller you have is low-end, maybe a $5.00 chip. Compare that to a $500 RAID controller with battery backup, its own processor, and about 10x more firmware. So you should be wary.
johnb6767Commented:
Hard Drive 0
Usage: Array member
Status: SMART event

vs.....

Hard Drive 1
Usage: Array member
Status: Normal

Looks like drive 0 to me.....
CrystalMethodCommented:
If you're using an onboard RAID controller, you should be able to set the SATA controller to "non-RAID". Then boot from a WD diagnostic CD and test one drive, then the other. It's important that you do not let the system boot from the hard drives; disable booting from the hard drive in the BIOS if you have to. The diagnostic will list the serial number of each drive. Note down the serial number of the defective drive, then pull the hard drives out and compare the serial numbers against the labels to determine which is the failed unit.
nobusCommented:
I agree with John. The problem will be how to find disk 0, IMO.
If unsure, take one disk, connect it to another PC, and run the disk diagnostic on it; then do the same with the other one.
Then you know their status: http://support.wdc.com/download/#diagutils
DavidPresidentCommented:
After re-reading: it says Hard Drive 0, the one with the SMART event, has serial number WD-WMAM9JJ66130.
That is the disk. The serial number is on the label of the HDD.
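
For longer reports, the same lookup can be done mechanically. The following is a minimal sketch, assuming the report is available as plain text laid out like the one posted above ("Hard Drive N" headings followed by "Key: Value" lines, with sections separated by blank lines). The function name find_flagged_drives is made up for illustration and is not part of any Intel tool.

```python
import re


def find_flagged_drives(report):
    """Return the fields of each 'Hard Drive N' section whose Status is not 'Normal'."""
    drives = []
    current = None
    for raw in report.splitlines():
        line = raw.strip()
        if re.fullmatch(r"Hard Drive \d+", line):
            # Start of a physical-drive section
            current = {"Name": line}
            drives.append(current)
        elif not line:
            # A blank (or whitespace-only) line ends the current section
            current = None
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return [d for d in drives if d.get("Status") != "Normal"]


# Excerpt of the report posted earlier in this thread
SAMPLE = """\
Hard Drive 0
Usage: Array member
Status: SMART event
Device Port: 0
Serial Number: WD-WMAM9JJ66130

Hard Drive 1
Usage: Array member
Status: Normal
Device Port: 1
Serial Number: WD-WMAM9JE48837
"""

for drive in find_flagged_drives(SAMPLE):
    print(drive["Name"], "|", drive["Status"], "|", drive["Serial Number"])
```

The serial number it prints is the one to match against the label on the physical drive.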
nobusCommented:
good eyes - dlethe !
ocanada_techguyCommented:
Experts here identified which one, but not why, via the config tool clues. FYI, a free S.M.A.R.T.-specific tool you could also try is: http://www.beyondlogic.org/solutions/smart/smart.htm

It depends somewhat on what the S.M.A.R.T. error is. If there was overheating, possibly due to failing bearings, the drive would definitely be near end-of-life. Often it's simply a case of many bad tracks/sectors/blocks; it's "normal" for bad tracks to occur over time. Some drives (enterprise grade) have elaborate, extensive bad-track handling; some (consumer) have simple handling. Generally, the drive has a "spare" area set aside. When a bad track occurs, the logic on the drive detects it, sets it aside, tries if possible to read the contents of the block "one last time", and moves the content to one of the spare blocks, all under the auspices of the logic board on the drive itself. The drive does "record" these events in the S.M.A.R.T. info. Should a threshold be crossed, either a) the drive is about to run out of spare blocks, or b) too many reallocations happen in a short period of time, indicating a more serious problem or imminent failure, then the SMART error is SET, and during the next boot POST (self test) the computer BIOS will pause to say "uh oh, SMART errors, drive may soon fail".
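
The two trigger conditions described above could be sketched roughly as follows. This is an illustration only, not how any particular drive firmware works: the class and function names, the sample numbers, and the rate_limit parameter are all hypothetical. The one detail taken from the S.M.A.R.T. convention is that an attribute counts as failed once its normalized value drops to or below the vendor threshold (attribute 5 is the reallocated sector count).

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    attr_id: int
    name: str
    value: int      # normalized health value, higher is healthier
    threshold: int  # vendor-defined failure threshold
    raw: int        # raw counter, e.g. number of reallocated sectors


def attribute_failed(attr):
    """Condition (a): the normalized value has dropped to or below the threshold."""
    return attr.value <= attr.threshold


def reallocating_too_fast(raw_history, rate_limit):
    """Condition (b): the raw count jumped by more than rate_limit between samples.

    raw_history holds raw counts sampled at regular intervals, oldest first.
    rate_limit is a made-up tuning knob for this sketch.
    """
    return any(later - earlier > rate_limit
               for earlier, later in zip(raw_history, raw_history[1:]))


# Made-up sample values, not readings from the drives in this thread
worn = SmartAttribute(5, "Reallocated_Sector_Ct", value=120, threshold=140, raw=950)
healthy = SmartAttribute(5, "Reallocated_Sector_Ct", value=200, threshold=140, raw=0)

print(attribute_failed(worn), attribute_failed(healthy))
print(reallocating_too_fast([0, 2, 3, 60], rate_limit=20))
```

A real S.M.A.R.T. tool reads these values from the drive itself; the sketch only shows the comparison that turns them into a warning.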

Just to clarify, some tools can deal with individual drives, BUT when the drives are part of a RAID set, not so much. CrystalMethod has suggested turning RAID off, but that is dangerous. Crystal warns to turn it off only to use a bootable diag tool and do NOT boot the hard drives with the OS expecting the RAID configuration, or else your RAID could go all to heck and end up "broken" or worse. nobus suggests a safer approach: using a completely different machine and attaching your disk(s) over there strictly for the purposes of running diag tools. Not everyone has multiple machines at the ready, of course.

It is suggested that you "swap out" the bad drive for a new/good replacement.  Depending on the RAID configuration, the drive contents will be rebuilt from mirror or stripe if you're using a raid level with redundancy.  For people who don't have the redundancy but have drives in simple mode, a backup, verify backup, replace and then restore from backup approach could be used.

Another approach could be to clone the drive to a new drive, and then remove the suspect drive, leaving the new cloned drive to take its place. HOWEVER, typical third-party tools, especially ones you "boot" independently, expect to deal with the disks directly as raw ATA and not through the RAID drivers or protocols. They expect the disk to contain regular OS volumes that can be seen as folders and files, NOT the scrambled content of a RAID stripe set. Mirror sets are never "re-joined"; they're always re-built (one overwrites the other). That leaves either the backup/restore approach OR, ideally, a well-crafted RAID (sub)BIOS with its own utilities that provides some or all of the capabilities you need to replace one drive with another. Some RAID system utilities are easier to work with than others.

Theoretically, the drive with the errors (depending on whether it's simply bad tracks/blocks/sectors) can be "recuperated". Some might be tempted to try these tools on your disk and not even bother replacing it. HOWEVER, the fact that it's part of a RAID adds complexity. Hard disks are so inexpensive that your first, best option is to go about replacing the suspect drive. After removal, you could try HDDRegenerator or SpinRite to recuperate the drive. SpinRite is great for being able to read data off bad sectors, and as a PREVENTATIVE tool it can set aside sectors that are slightly bad before they go bad, thus saving your data before the bad tracks happen. On the other hand, for recuperating a drive that has reached/exceeded thresholds, HDDRegenerator in particular is said to be able to re-do the bad tracking, create a whole new "spare" area, and reset the S.M.A.R.T. settings.

One more thing.... BEFORE you put a new replacement drive in, some experts recommend you "scrub" it first.  Many drives come "pre-formatted" but it is a good idea to format AND check for bad blocks.  You can then re-erase and put it into RAID for clone/replace reconstruction, having already checked it for bad sectors at least once.

Did I mention, have a backup and verify it?  Regularly, but especially before a major undertaking.  Perhaps even have Windows make a system-state backup as well.
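
As one way to do the "verify it" part, here is a minimal sketch, assuming the backup is a plain file-by-file copy of a folder (not a compressed or proprietary backup image, which would need the backup program's own verify option). The function names and the example paths in the usage note are made up for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir, backup_dir):
    """Return relative paths of files that are missing from the backup or differ in content."""
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = backup_dir / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            problems.append(str(rel))
    return problems
```

Calling verify_backup(r"C:\Documents", r"E:\Backup\Documents") (hypothetical paths) returns an empty list when every source file has an identical copy in the backup.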