Solved

Critical Impact Alert <Bad Block on HD>

Posted on 2010-08-16
Last Modified: 2012-05-10
Hello and thank you for your time...

One of my clients is running MS Small Business Server 2003 SP2. Our monitoring agent has reported:
"bad block on the Hard Drive".
My question is: what is the best way to run a disk repair utility?
Should I run it from MS-DOS? (If so, what command should I use: /c /r for check and repair?)
Or should I run it from within Windows, or some other way?
Question by:loshdog
16 Comments
 

Expert Comment

by:buckobilly
I've always run them through Windows Check Disk and have never had a problem.

I check the "fix errors" box and let it go.  I would also perform a full backup beforehand, using a product like Ghost or something similar.
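
For anyone who prefers the command line to the check box, here is a minimal Python sketch of the equivalent chkdsk invocations. The helper name `build_chkdsk_command` is made up for illustration; the only real pieces are the chkdsk switches, where /f fixes file-system errors and /r additionally locates bad sectors and recovers readable data (/r implies /f). Keep in mind that on a hardware RAID set chkdsk only sees the logical volume the controller presents, so it cannot tell you which physical drive is at fault.

```python
def build_chkdsk_command(volume, surface_scan=False):
    """Return the argument list for a chkdsk run on a volume such as "D:".

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    if len(volume) != 2 or not volume[0].isalpha() or volume[1] != ":":
        raise ValueError("expected a drive letter such as 'D:', got %r" % (volume,))
    # /f fixes file-system errors; /r also scans for bad sectors and implies /f.
    flag = "/r" if surface_scan else "/f"
    return ["chkdsk", volume.upper(), flag]


# Show the command that would be typed at an elevated prompt.
print(" ".join(build_chkdsk_command("d:", surface_scan=True)))
```

The list could be handed to `subprocess.run` on the server itself, but on a system volume chkdsk will normally ask to schedule the check for the next reboot, so plan for downtime either way.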
 
LVL 23

Expert Comment

by:Dr. Klahn
Do not try to repair a drive that is throwing bad blocks if it is being used in a server.  Replace it.  When a drive starts showing bad blocks, it means that it is unable to revector the block in question, and deterioration is well underway.
 

Author Comment

by:loshdog
Hello and thank you...

They have a RAID 5 configuration using 4 drives. There are two partitions: C: for the OS and D: for storage.
How would I determine which drive is going bad?
 
LVL 14

Expert Comment

by:athomsfere
The reporting tool should be able to tell you which channel the bad drive is on, and I agree with the above poster. If the drive is starting to show bad blocks, replace it, as it will fail soon even if you bypass those bad blocks now.
 

Author Comment

by:loshdog
Where would I find the reporting tool? I checked Event Viewer, but it did not mention anything about HD bad sectors.
 
LVL 14

Expert Comment

by:athomsfere
What exactly was used to get this from your original post?

Our monitoring agents has reported:
"bad block on the Hard Drive".

 

Author Comment

by:loshdog
We have a SAAZ agent installed on the server which monitors all different aspects. It reports to our Network Operation Center. The center then forwards the message or ticket to a tech or owner.

Hope this answers your question...
 
LVL 14

Expert Comment

by:athomsfere
Can you attach the full message?

It should say something like Hard Drive 0, channel 1... or at worst can we get a screen cap from the app?

 

Author Comment

by:loshdog
Thank you.. Screen Shot below...


Untitled-1.jpg
 
LVL 14

Accepted Solution

by:
athomsfere earned 225 total points
Wow, not a very helpful message! It gives the site and lots of decent stuff, but not what is actually failing!

Next hope would be that the server has a decent RAID controller with hot swap and LEDs to warn if a drive is dead... but a bad block probably won't throw a status indicator...

I hate to recommend this for a server, but you might try SpeedFan to check the SMART errors. It's free, but it is something you must install.

http://www.almico.com/sfdownload.php
 
LVL 23

Assisted Solution

by:Dr. Klahn
Dr. Klahn earned 150 total points
Agree with athomsfere.  That message is not helpful.

You need to see the SMART status for the individual drives to see if there is a problem and on which drive it exists.  A good RAID controller and its management software can display this information -- check the system, find the management software, and look at the SMART status for each drive.
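
To make "look at the SMART status for each drive" a little more concrete, here is a small Python sketch. It assumes you have already copied the raw values off whatever screen the controller software shows; the drive labels, the sample numbers, and the function name `suspect_drives` are all invented for illustration. The attribute IDs are the ones usually tied to sector trouble: 5 (reallocated sectors), 197 (current pending sectors), and 198 (offline uncorrectable sectors).

```python
# SMART attribute IDs commonly associated with failing sectors.
SECTOR_ATTRIBUTES = {
    5: "Reallocated Sectors Count",
    197: "Current Pending Sector Count",
    198: "Offline Uncorrectable Sector Count",
}


def suspect_drives(smart_by_drive):
    """Map each drive label to its sector-related attributes with nonzero raw values."""
    suspects = {}
    for drive, attributes in smart_by_drive.items():
        hits = [
            (name, attributes[attr_id])
            for attr_id, name in SECTOR_ATTRIBUTES.items()
            if attributes.get(attr_id, 0) > 0
        ]
        if hits:
            suspects[drive] = hits
    return suspects


# Hypothetical raw values, typed in by hand from the management software.
readings = {
    "slot 0": {5: 0, 197: 0, 198: 0},
    "slot 1": {5: 0, 197: 0, 198: 0},
    "slot 2": {5: 12, 197: 3, 198: 0},
    "slot 3": {5: 0, 197: 0, 198: 0},
}

for drive, hits in suspect_drives(readings).items():
    print(drive, hits)
```

A nonzero raw value is a reason to look harder at that drive, not proof of failure; vendors encode raw values differently, so treat this as a first filter only.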

If the management software is missing or doesn't show SMART status for individual drives, there are a couple of other things to try, depending on your cash and courage.

---- "The easy way" -- If the system was purchased as a standard configuration from Dell, HP, or whoever, call the pay-per-call tech support line, explain the problem, and have them walk you through the method to check drive failure status.

---- "For manly men" -- Take the system down, LABEL ALL THE DRIVES IN PLACE, and remove the drives.  Take them to a workstation.  Connect ONE drive to an unused drive port on the workstation and start the workstation.  Ignore all messages about unknown drive formats, etc.  Install PassMark Disk Checkup (free) on the workstation.

http://www.passmark.com/products/diskcheckup.htm

Run Disk Checkup and look at the SMART status for the drive in question.  If it is clean, shut down the system, install the next drive, and continue checking SMART status until the offending drive is found.

---- "Time on your hands" -- Depends on the RAID set truly being RAID 5, on having a long time to find the problem, and on only one drive being bad.  Obtain a replacement drive and swap out the first drive in the RAID set.  Rebuild the set.  Run the system as usual.  If the error messages don't return "after some reasonable time," you've found the bad drive.  If they do return, put the original drive back, let the set rebuild, and swap out the next drive.  Continue swapping drives one at a time until the problem goes away.
 
LVL 11

Assisted Solution

by:ocanada_techguy
ocanada_techguy earned 125 total points
You'd likely have to shut the system down for this.  The RAID controller should have an "extended BIOS" which "hooks" into the standard BIOS of the booting machine.  While booting, there will be a brief message on screen saying to press Ctrl-__ or F__ or something similar to enter the RAID card's built-in utilities.  There you should find simple text menus, including status information, scanning the drives for bad sectors, and reconstructing sets if a drive is replaced (as opposed to just relying on hot-swapping).
Better RAID controllers will let you hot-swap, that is, change out the bad drive without shutting down.  For those, there are usually some utilities added under Windows (or whichever OSes the card supports) for administering the RAID card and finding out which drive is bad.  As suggested, some systems also have indicator lights, and possibly have the drives installed in hot-swappable bays/rail-kits.  Even on hot-swap systems, you should still be able to do it the "cold swap" way.
You should find manuals and step-by-step guides on the support website of the RAID controller's manufacturer, or the system manufacturer's if the controller was bundled, and/or as suggested use their telephone support to guide you.
 

Author Closing Comment

by:loshdog
Wow... Thank you all for your input.

Last time I booted into the controller card I remember seeing a SMART tab. I will check that first and see if it provides me with any useful info. If that does not work,
I will go to the Dell website and see what type of controller card it has and whether it's hot-swappable. After gaining that info, I will take the "manly man" way: remove the drives, connect them to a workstation, and scan them that way. This will be time consuming...

Once again thank you all...
 
LVL 14

Expert Comment

by:athomsfere
For future reference, I mentioned SpeedFan because it lets you check the SMART status without real downtime, and sometimes that matters more than not installing little freeware apps. It does allow you to check each drive as well.
 
LVL 11

Expert Comment

by:ocanada_techguy
I'd be MOST interested in what the RAID controller status says about the drives.  There's also very likely a health monitoring cleanup process you can invoke.

S.M.A.R.T. is OK, but it's not a perfect way to diagnose.  For example, in a recent Q&A, HDD Sentinel says the drive is poor, but a S.M.A.R.T. reader shows "ok":  http://www.experts-exchange.com/Storage/Hard_Drives/Q_26407063.html#a33457490
 http://www.passmark.com/forum/showthread.php?t=1723

Basically, different RAID cards handle bad sectors differently, to say nothing of the myriad RAID interpretations and implementations.  Some enterprise-class ECC systems WILL automatically badtrack, that is, automatically set aside bad sectors and remap them to "spare" sectors on-the-fly, whereas some RAID cards just leave the bad sector and read the data from the redundant sister/twin/brother, whether that's the mirror or parity or dual parity.  The problem with the latter is that those spots are half at-risk: if the same sector goes bad on a sister, then unless it's "dual-parity" you're in trouble (that's how dual-parity came to be "invented").  It's not great that the card might NOT "do" anything about the bad sector on a drive UNTIL you run the occasional housekeeping that you're supposed to do with the RAID controller utility (which, we'll assume in this case, has not been done for some indeterminate length of time).  This might be your situation, and the RAID is warning that there are quite a few unhandled bad sectors that need dealing with.

Do NOT, do NOT handle bad sectors on the drives "directly individually" by connecting them elsewhere and doing low-level bad-sectoring/badtracking.  Only use the RAID controller utilities for that.  Otherwise, you can completely corrupt your stripe/mirror/parity.  So unless you want to rebuild your RAID arrays from scratch and recover from the last known good backup, don't.

See, many RAID cards do NOT handle bad sectors the same way as single-drive operation (which is what you'd see measured on each drive's S.M.A.R.T.), and they differ in another surprising way.  Briefly put, when "preparing" the drives, many RAID cards set aside more space than the normal "spare sectors" for badsectoring.  The reason: when there's a bad sector in one spot on one of the drives in a "set", then EITHER automatically or else as part of the RAID "cleanup" maintenance process discussed above, many RAIDs set aside that sector on ALL the drives in the set if so much as one of them has it bad.  That way the card can keep ALL the drives' sectoring/blocking mirror/stripe/parity TABLES "in sync" so-to-speak, marching in lock-step, rather than dealing with odd exceptions for each drive individually.  But the result is that, whether it's two, three, four, five or six drives in a set, a bad sector on one is "set aside" on them all, so the spare area has to be much bigger because it's going to get used up much quicker.  Thus, a RAID controller might be raising errors because the "volume" is almost out of spare sectors, and yet the drives seem fine when connected directly and checked individually.
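
Here is a rough Python sketch of why lock-step handling eats the spare area so much faster than per-drive remapping. It is a toy model, not any particular controller's behaviour: the drive names, the sector numbers, and the function `spares_used` are all made up for illustration.

```python
def spares_used(bad_sectors_by_drive, lock_step):
    """Return how many spare sectors each drive consumes.

    lock_step=False: each drive remaps only its own bad sectors.
    lock_step=True:  a sector that is bad on any drive is set aside on every drive.
    """
    if lock_step:
        shared = set().union(*bad_sectors_by_drive.values())
        return {drive: len(shared) for drive in bad_sectors_by_drive}
    return {drive: len(bad) for drive, bad in bad_sectors_by_drive.items()}


# Hypothetical 4-drive set with a handful of bad sectors scattered across it.
bad = {
    "drive0": {10, 11},
    "drive1": {42},
    "drive2": set(),
    "drive3": {7, 8, 9},
}

print(spares_used(bad, lock_step=False))
print(spares_used(bad, lock_step=True))
```

With per-drive remapping the worst drive here uses 3 spares and the clean drive uses none; in lock-step every drive, including the clean one, uses 6. That matches the situation described above, where the volume runs short of spares while each drive looks healthy on its own.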

Would it be better if bad sectors were relocated to spares on an individual disk basis? Yep.  Some do, but many DON'T, preferring to simplify constant operation and performance.  The resulting "different behaviour" seems like a complication, but it's not.  Yes, it is slightly less efficient, but these are after all supposed to be a "redundant array of inexpensive disks" (or independent; both meanings are used).

 
LVL 11

Expert Comment

by:ocanada_techguy
Also, just hot-swapping drives does NOT "reclaim" what had been bad sectors on every disk in the set and somehow make them good sectors again, not once they've been set aside by the RAID.  So you could put in a perfectly good disk with no bad sectors, and as soon as the RAID has reconstructed the set, for all intents and purposes the drive already has a whackload of bad sectors set aside.  On the other hand, a hot-swap will help IF and only IF the bad sectors haven't been set aside yet (with automatic handling it'll be too late; with manual handling you may not have done it yet), BUT ONLY IF you happen to be swapping out the drive with the bad sectors in question.  BUT, BUT let's say sector 1234567 is good on the drive you exchange out, and bad on one of the sisters.  Then the RAID may not be able to reconstruct the data on sector 1234567, because on its twin that sector is unreadable and hasn't been remapped yet, and you just pulled the drive that had the other copy of the data for 1234567.  Mind you, if there is parity or dual-parity then the reconstruction should nevertheless be successful thanks to reverse engineering 1234567 from the extra redundancy.
So you see ultimately, the maintenance should be done (unless it's automatic).  It's definitely best to read the status conditions before any actions, and it's typically more preferable to do the maintenance before a swap out rather than not.
And ultimately, at some point, the array may have to be rebuilt because no amount of hot-swapping is going to make the spares area bigger if it's almost full.  (unless your RAID will do that maybe if you "extend" the volume by replacing with larger drives)
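
For the curious, the "reverse engineering from the extra redundancy" can be sketched in a few lines of Python. This is the textbook single-parity (RAID 5 style) XOR scheme with invented block contents, not any specific controller's on-disk layout; real cards rotate parity across drives and add their own metadata.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a hypothetical 4-drive set: three data blocks plus one parity block.
data_blocks = [b"\x10\x20\x30\x40", b"\x0a\x0b\x0c\x0d", b"\xff\x00\xff\x00"]
parity_block = xor_blocks(data_blocks)

# Pretend the drive holding data_blocks[1] is pulled: XOR of the survivors rebuilds it.
survivors = [data_blocks[0], data_blocks[2], parity_block]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_blocks[1])
```

The catch is that single parity can only make up for one missing block per stripe. If a second block in the same stripe is unreadable during the rebuild, the XOR no longer has enough inputs, which is the scenario dual parity is meant to cover.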
