diallo0024

Openfiler - Can not initialize disk

My environment consists of:

1 Openfiler v2.3 server (clean install - no upgrades performed to bring up-to-date)
1 Dell T105 server w/ free ESXi 4 installed
Windows 7 Ultimate virtual machine
1 iSCSI Target for Windows backup (200GB)
1 iSCSI Target for file storage (1.5TB - external WD My Book) (*Yes, I know this is large.  It is used for media storage.)

************

I have two separate iSCSI Target IQNs for two different initiators.  I tried to connect the Windows 7 virtual machine to the Openfiler target.  Initially, it connected fine and I was able to initialize the disk (via Disk Management) and assign a drive letter to begin its use.  All was fine.

One day the drive hosting the media (1.5TB) was rebooted by the surge protector it was plugged into.  Now the Windows 7 VM will not show the drive letter for the iSCSI target.  The drive (in Disk Management) reads "unallocated".  When I attempt to initialize the disk (the only option available to me), I get a "Data Error - Cyclic Redundancy Check" message and it doesn't allow me to initialize it.  The disk is still showing actual size/space remaining, so I believe the data is not lost.  Is there a way to get this disk to be recognized?  And what is causing this issue?  This is the second time it has happened...  Luckily, I have a backup, but I would still like to prevent this from happening in the future.  Thanks for everyone's help!
David

Looks like a catastrophic drive failure.  Showing the drive space remaining does not involve performing full media checks.  Go to the WD site and download/run diagnostics.  Be prepared for the worst, however.
diallo0024

ASKER

Thanks for the reply, dlethe!  After doing some more digging late last night, I discovered a couple of things:

1.  The actual version of openfiler is version 2.3 (with all updates and patches as of 3/1/2010).
2.  The drive still works fine when plugged into a Windows 7 machine (physical, not virtual machine)

I'm not exactly sure, but I think the drive is still good (relatively new...purchased this year 2010).  I may have to rebuild openfiler (which is what I did to resolve the issue last time).  I was just trying to avoid this step.  

While I believe the problem resides in openfiler, I'm open to suggestions on resolving this issue and preventing it from recurring.

Thanks again for your reply!  Any other assistance is greatly appreciated.
It's not supposed to happen; but, about 4 times each year, I see a drive that was rudely shut down in the midst of a sector write.  When this happens, the checksum (CRC) for that sector is wrong, resulting in a CRC error.
The first thing to do is to confirm that as the problem.  Go get the free version of http://www.hdtune.com , install it, run it, and select the WD.  Choose the Health tab and inspect the Reallocated Sector value.  If it is anything but zero, there may be more serious things going on.  Next, run the error scan and expect it to take several hours.  You should get at least one red box.  If you get more than two or three, there is something more serious going on; but, if it is a very low number, post back with the result and we'll roll up our sleeves and fix it.
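To illustrate the partial-write failure described above, here is a minimal sketch. It assumes a 512-byte sector and uses `zlib.crc32` as a stand-in for the checksum; real drives use their own ECC schemes in firmware, so this only models the idea, not any vendor's implementation. The helper names are hypothetical.

```python
import zlib

SECTOR_SIZE = 512  # assumed logical sector size


def checksum(data: bytes) -> int:
    """Stand-in for the per-sector check value (assumption: CRC-32)."""
    return zlib.crc32(data)


def torn_write(old: bytes, new: bytes, bytes_written: int) -> bytes:
    """Simulate power loss after only `bytes_written` bytes of `new` landed."""
    return new[:bytes_written] + old[bytes_written:]


def sector_is_valid(data: bytes, stored_crc: int) -> bool:
    """A read succeeds only if the data still matches its stored check value."""
    return checksum(data) == stored_crc


old = bytes([0xAA]) * SECTOR_SIZE   # contents before the write
new = bytes([0x55]) * SECTOR_SIZE   # contents the host wanted to write
old_crc = checksum(old)
new_crc = checksum(new)

# Power is cut 200 bytes into the write: the sector is half old, half new.
damaged = torn_write(old, new, bytes_written=200)

print(sector_is_valid(new, new_crc))      # a completed write verifies
print(sector_is_valid(damaged, new_crc))  # torn sector fails against the new CRC
print(sector_is_valid(damaged, old_crc))  # and against the old one
```

The torn sector matches neither check value, which is why the drive reports a CRC error on read even though the media itself may be physically fine.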
Thanks for the suggestions, DavisMcCarn.  I will try what you've recommended and post the results when the run is complete.

By the way...   Is this problem that you see (about 4 times each year) related to the drive, openfiler, or windows?
I disagree with Davis' assumption that the reallocated sector count is an indicator of "more serious things going on".  The parameter indicates that a disk drive had a bad block, and that the block has been reallocated.  This is a normal thing for a disk drive to do.  Modern high-density disk drives have tens of thousands of spare sectors, and are designed to do this.

RAID 1/5/6 controllers reallocate sectors on a regular basis, and this is a design point.  When you use Windows to format/initialize a disk drive, it checks for bad sectors and tells the disk to reallocate unreadable ones.  This is all a natural thing to do.

Now you should keep an eye on this value, and if it jumps from 4 to 400 in a month, then try to get a warranty replacement.  But if you have a dozen or so, then it is nothing of concern, as it represents well under 0.00001% of a 1.5TB drive.
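A quick sketch puts a number on the "dozen or so" case and shows one way to flag the "4 to 400 in a month" kind of jump. It assumes 512-byte sectors and a 1.5TB (1.5e12 byte) drive; the growth thresholds are hypothetical defaults for illustration, not vendor guidance.

```python
def reallocated_fraction_percent(reallocated_sectors: int,
                                 drive_bytes: float,
                                 sector_bytes: int = 512) -> float:
    """Percentage of the drive's capacity covered by reallocated sectors."""
    return 100.0 * reallocated_sectors * sector_bytes / drive_bytes


def growth_is_alarming(previous_count: int,
                       current_count: int,
                       factor: int = 10,
                       min_increase: int = 50) -> bool:
    """Hypothetical rule: flag a large absolute AND relative jump between checks."""
    increase = current_count - previous_count
    return (increase >= min_increase
            and current_count >= factor * max(previous_count, 1))


# A dozen reallocated 512-byte sectors on a 1.5TB drive.
print(reallocated_fraction_percent(12, 1.5e12))  # about 4.1e-07 percent

print(growth_is_alarming(4, 400))  # the jump described above
print(growth_is_alarming(4, 12))   # slow drift, nothing of concern
```

The point of tracking two readings rather than one is that the trend, not the absolute count, is what suggests a failing drive.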

A probable reason why DavisMcCarn sees disks shut down in the midst of a sector write is that he is using consumer-class disks rather than enterprise drives.  One of the reasons why enterprise disks cost 3x more than consumer-class disks is that their error recovery algorithms and ECC hardware/reserved block topology are tweaked to ensure 5-10X faster recovery of bad blocks.  If a disk can't remap fast enough, then some RAID controllers will think the disk is unreliable and shut it down.

These days, there ought to be enough stored power in the drive to finish the write operation prior to shutting down.  That is an engineering problem that could only be fixed by the drive manufacturers.
Having the power rudely go out, though, is at least forgivable.
Half of those I see with this problem occur when Windoze spontaneously reboots or freezes and those are unforgivable.  The drive should ignore any and all commands while it is writing so it can finish the block/sector.
dlethe,
What I have done for 34 years now is fix what was brought to me and data recovery started in the late 70's on cassette tapes or floppies.  Too many of the drives I see had started screaming about SMART failures weeks before they finally fell off the cliff.
It is just not practical to do this, Davis.  Neither battery circuitry nor an electrolytic capacitor can provide enough power to do this without busting out of the physical footprint of the disk drive.  (Well, a specially designed battery that is flat and covers the top of the HDD may be possible, but then you would have to change the batteries every few years, and pay a big premium.)  That is why people buy a UPS.

But you are still missing the point.  The drive is being kicked off because, in the eyes of the controller, the disk is unreliable.  The pending write is the one that timed out due to an unacceptably long sector rewrite.  If you have SCSI/SAS/FC disks, then this is tunable via a mode page editor.  SATA/ATA disks generally don't provide such configurability.  That is why RAID manufacturers spend big bucks certifying disks.  This is also why the tier 1/2 NAS/SAN appliance vendors often have special firmware built for disk drives.  They hardcode such settings.
(FYI - worked for 20+ years for RAID manufacturers, designing firmware/configurators, etc  ... so I speak with experience here).
As for SMART errors, we are not discussing SMART; that is something different.  Reallocated sector counts are part of the vendor/product-specific algorithm that can trigger a S.M.A.R.T. alert.  In and of themselves, sector errors will not ALWAYS trigger this.  I have NDAs with Seagate, had an NDA with WD, so believe me, I know the internals of some algorithms, and I cannot get specific due to the NDAs.

But I certainly agree with you, if you get a S.M.A.R.T. error, then the prudent thing to do is replace the drive.  But if the reallocated sector count increases, this will not necessarily trigger the S.M.A.R.T. alert.  
Okay....so I ran the HD Tune software.  Interestingly, I didn't get any data to appear on the Health tab for any of my external drives.  And I did a quick scan for errors....no red blocks.  The "slow" scan is still running.  Looking at the conversations taking place, should I be approaching this differently?
No; the first order of business is still ruling out a hardware error, and yes, HDTune often has issues with retrieving the SMART data on some drive subsystems.
Let it finish the long error scan and, if it comes up clean, we'll want to look at some partition recovery/repair utilities.  Luckily, NTFS keeps a duplicate table at the end of the drive so that path is often pretty painless.
Unless you are at least using HDTune Pro, you won't get much in the area of diagnostics.  What you fail to realize is that S.M.A.R.T. tells you whether the disk is in a degraded condition.  It will NOT pick up bad blocks that have not had a read attempt.

Reallocated sector counts are only applicable to reallocated sectors, not bad sectors which you could very well have.

Again, I have been writing diagnostics and have NDAs with drive manufacturers.  There is a great deal of difference between using diagnostics and writing them, and being able to talk to Seagate engineers and have discussions about such things one on one.

Your diagnostic program needs to invoke ATA_OP_READ_VERIFY or ATA_OP_READ_VERIFY_EXT, opcode 0x40 or 0x42 respectively, depending on the ATA level of compliance and block count.  From what I saw reading the specs of HDTune and HDTune Pro, neither uses these opcodes.  HDTune is a nice, cute package, but professionals don't use it.  HDTune Pro is better, but still very much a consumer product.
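As a rough sketch of the "depending on ATA level and block count" choice above: the 28-bit READ VERIFY SECTOR(S) command (0x40) can address LBAs up to 2^28 - 1 and at most 256 sectors per command, while the EXT variant (0x42) uses 48-bit LBAs and up to 65536 sectors. The function below only picks an opcode; it does not talk to a drive, and its name and the `drive_supports_lba48` parameter are hypothetical.

```python
ATA_OP_READ_VERIFY = 0x40      # READ VERIFY SECTOR(S), 28-bit LBA
ATA_OP_READ_VERIFY_EXT = 0x42  # READ VERIFY SECTOR(S) EXT, 48-bit LBA

MAX_LBA28 = (1 << 28) - 1
MAX_COUNT28 = 256       # 8-bit count field, where 0 means 256
MAX_LBA48 = (1 << 48) - 1
MAX_COUNT48 = 65536     # 16-bit count field, where 0 means 65536


def pick_read_verify_opcode(start_lba: int,
                            sector_count: int,
                            drive_supports_lba48: bool) -> int:
    """Choose the verify opcode that can express this request."""
    if start_lba < 0 or sector_count < 1:
        raise ValueError("start_lba must be >= 0 and sector_count >= 1")
    last_lba = start_lba + sector_count - 1

    # Prefer the 28-bit command when the whole request fits in it.
    if last_lba <= MAX_LBA28 and sector_count <= MAX_COUNT28:
        return ATA_OP_READ_VERIFY

    if not drive_supports_lba48:
        raise ValueError("request needs 48-bit addressing the drive lacks")
    if last_lba > MAX_LBA48 or sector_count > MAX_COUNT48:
        raise ValueError("request exceeds the 48-bit command limits")
    return ATA_OP_READ_VERIFY_EXT


print(hex(pick_read_verify_opcode(0, 8, True)))              # low LBA, small count
print(hex(pick_read_verify_opcode(2_000_000_000, 8, True)))  # beyond the 28-bit range
```

With 512-byte sectors the 28-bit limit is reached at roughly 137GB, so on a 1.5TB drive most of the surface can only be verified with the EXT opcode.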
The usefulness of HDTune is that it will not write (in the free version) and will let you know if there are any bad blocks, even though they are sometimes CRC errors created by a partial write.
The point is, running any repair utility on a drive with a chunk of sequential error blocks (which usually results from a physical shock) or a large number of errors is, 9 times out of 10, going to finish off any hope of recovering the data.
If, on the other hand, no read errors are found, we can be fairly confident that fixing the error which is causing his drive to state it is not formatted most probably won't make matters worse.
Or, if only one or two errors are found, we can either force a rewrite of those sectors or see what normal repair utilities will do.
Yes, there are numerous other utilities for examining hard disk drives; but, they are a digression from our goal.
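The decision rule laid out above can be sketched as a small function. The input is assumed to be the list of block numbers an error scan flagged as unreadable; the threshold of two isolated errors is taken from the "one or two errors" wording, and the function and label names are hypothetical.

```python
def triage(bad_blocks: list[int], max_isolated: int = 2) -> str:
    """Map error-scan results to the next step suggested in the thread."""
    count = len(set(bad_blocks))  # ignore duplicate reports of the same block
    if count == 0:
        # Clean scan: repairing the partition most probably won't make matters worse.
        return "repair-partition"
    if count <= max_isolated:
        # One or two errors: force a rewrite of those sectors, or try normal repair tools.
        return "rewrite-or-repair"
    # A run of sequential bad blocks or a large number of errors:
    # running repair utilities risks finishing off the data.
    return "do-not-repair"


print(triage([]))                        # repair-partition
print(triage([1_234_567]))               # rewrite-or-repair
print(triage([500, 501, 502, 503, 504])) # do-not-repair
```

Keeping the threshold as a parameter makes the judgment call explicit rather than burying it in the logic.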
Seagate and WD drives, BTW, build a list of pending reallocations and don't actually perform the operation until that block is next written.
ASKER CERTIFIED SOLUTION
diallo0024
