Mirrored drive freezing system and constantly resyncing

I have a drive that is constantly freezing the system. When I reboot the system it does a resync that takes forever. It renders the server almost useless. I am running Windows Server 2003.

The error I get is: "The device, \Device\Ide\IdePort2, did not respond within the timeout period."
I also get this warning: "The driver has detected that device \Device\Ide\IdePort2 has old or out-of-date firmware. Reduced performance may result."

Can I turn off mirroring until the new drive arrives, to improve performance?

How do I know which drive is bad?

Where can I get updated firmware?

Is this RAID0 or RAID1?
If it is RAID0, you do not want to turn it off.
If it is RAID1, just unplug one drive and try to boot. Then try the other one.

What is the make and model of the motherboard, if this is onboard RAID?
If not, what is the make and model of the RAID controller?
Random timeouts are classic examples of bad blocks.  It is not the firmware.  Your disk needs to be replaced.  It is throwing out errors faster than they can be repaired.  The drive is not long for this world.  
Have you tried turning it off and then on again?

zwig2002Author Commented:
Not sure which RAID level this is running. I did not build the server. How would I find that out?

Unfortunately, it locked up on me again and I have no access at all. I will have to manually reboot in the morning.

How do I know which drive is on IdePort2?

Do you think the controller is bad also?

The motherboard is a D915GAG; it does not support onboard RAID.
It is an Intel controller.

What size are the drives? If the volume the PC sees is the size of one of them, then it is RAID1. If it is double the size, then it is RAID0.
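The size rule of thumb can be written as a small check. This is a sketch that assumes a two-disk array and that the volume size and member sizes are given in the same unit; a 200 GB drive shows up in Windows as roughly 186 GB because Windows counts in binary units, so mixing the two will throw the comparison off. The function name and the 5% tolerance are illustrative choices, not part of any vendor tool.

```python
def infer_raid_level(volume_size: float, member_sizes: list[float], tolerance: float = 0.05) -> str:
    """Guess whether a two-disk array is mirrored or striped from the size the OS reports.

    volume_size and member_sizes must use the same unit (e.g. both in GB).
    """
    smallest = min(member_sizes)
    # A mirror (RAID1) exposes the capacity of a single member.
    if abs(volume_size - smallest) <= tolerance * smallest:
        return "RAID1 (mirror)"
    # A stripe (RAID0) exposes the combined capacity of all members.
    combined = smallest * len(member_sizes)
    if abs(volume_size - combined) <= tolerance * combined:
        return "RAID0 (stripe)"
    return "unknown"


print(infer_raid_level(200, [200, 200]))  # RAID1 (mirror)
print(infer_raid_level(400, [200, 200]))  # RAID0 (stripe)
```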

Chances are, it's a hard drive issue.
zwig2002Author Commented:
The drives are both 200GB and they are mirrored.
Then you can pull one of the drive cables and try to boot. Try this with both drives and you will know which one is the faulty drive. When in the Intel manager, look for which drive is on which port. That way you will know which drive to clone when you put in the new drive.
zwig2002Author Commented:
Do I need the exact same drive (make and model) to replace the broken drive?
Best practice is ordinarily the same make/model, but in your case, a better practice would be to get a quality drive. Go with something more than one of those $50 cheapo consumer drives, and get an enterprise-class drive designed for 24x7 use.
As dlethe said, if you can't find the same make/model, get the same size.
zwig2002Author Commented:
I am onsite; the motherboard is a D915GAG.

It is a desktop SATA board. There is no separate controller, so my guess is that either it is integrated or this is a software RAID.

zwig2002Author Commented:
Both drives boot. How do I know which one is bad?
Here is a script that will display everything you need to know about the disk drives you have.  

zwig2002Author Commented:
There is a new error with the single drive running: event ID 11, "The driver detected a controller error on \Device\Harddisk0."

Any thoughts?

Could this be a controller issue and not a drive issue?
It could be, but since we established the drive needs replacing, then this could just be a message relating to that.  
If both boot, they may not be bad. Pull one of them and run the diagnostic from the appropriate manufacturer. The easiest way is to use a USB-to-IDE/SATA adapter like:
Run the program from your PC or laptop.
WD drive = WDDiag

Seagate drive = SeaTools

If the drives test ok, other options are:
1. Bad Windows driver
2. Bad motherboard; look for bulging capacitors.

zwig2002Author Commented:
Last night I got a bunch of disk errors from one disk. Today I booted with the other disk and I am getting errors as well. Either both disks are bad or it is the controller.
Disk errors like bad sectors or just bad data?  If data, this should be expected as data written to one disk will be written to the other.
Have you run a test from the manufacturer on the drive(s)?
zwig2002Author Commented:
They do not make a server 2003 driver as this is a desktop board.

Can I use the XP Pro driver?
RAID is handled by the motherboard and not the operating system.
The "driver" to update would be the BIOS. More than likely, this is not the problem.
Have you tested the drives?  Did they pass or marked defective?
zwig2002Author Commented:
The same exact errors in the system log as stated above. On both drives.
You need to test the drives before I can guide you further.
Look, when the first drive became unstable it started freezing up. This disk that you have takes too long to remap a sector, so the rebuild gives up.   This results in data corruption and inconsistency between the 2 drives.   A cheap unqualified drive will work "fine" ... UNTIL there is a problem.  You are seeing it in action.  
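The mechanism described here, where the drive keeps retrying a marginal sector longer than the controller is willing to wait, can be modeled in a few lines. This is a toy model, not how any particular controller is implemented: the 8-second controller timeout and the 7-second error-recovery cap in the usage lines are assumed numbers chosen for illustration. Drives built for RAID typically let you cap error recovery (WD markets this as TLER), while many desktop drives keep retrying much longer.

```python
from typing import List, Optional, Tuple


def simulate_rebuild(
    recovery_times_s: List[float],
    controller_timeout_s: float,
    erc_limit_s: Optional[float] = None,
) -> Tuple[Optional[int], List[int]]:
    """Walk the sectors of a rebuild and report where the drive would be dropped.

    recovery_times_s[i] is how long the drive would spend reading (and retrying)
    sector i. erc_limit_s models a drive with error-recovery control that gives up
    on a sector after that many seconds and reports it as unreadable instead.
    Returns (first sector where the controller times out, or None,
             sectors the drive reported as unreadable).
    """
    unreadable: List[int] = []
    for sector, needed in enumerate(recovery_times_s):
        gives_up_early = erc_limit_s is not None and needed > erc_limit_s
        elapsed = erc_limit_s if gives_up_early else needed
        if elapsed > controller_timeout_s:
            # The drive is still retrying when the controller's patience runs out.
            return sector, unreadable
        if gives_up_early:
            # The drive reported the error in time, so the array keeps running.
            unreadable.append(sector)
    return None, unreadable


sectors = [0.01, 0.01, 30.0, 0.01]  # sector 2 is marginal and needs 30 s of retries
print(simulate_rebuild(sectors, controller_timeout_s=8.0))                   # (2, [])
print(simulate_rebuild(sectors, controller_timeout_s=8.0, erc_limit_s=7.0))  # (None, [2])
```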

Firmware will not fix this, and a Windows command will not fix this. Both disks are now throwing out errors that effectively say, "Hey, just letting you know that I lost some more data of yours." When you get such errors during a rebuild, the data is gone forever.

The combination of controller + disk drives is NOT compatible when a drive gets remapped.  That is root cause and you are going to have to replace something.  One of the drives probably needs to be replaced as well.
zwig2002Author Commented:
I flashed the latest BIOS version and I am still receiving the same errors.

The disk passed the quick test.

Replace the motherboard?
Maybe not. Run the extended test on the drive.
Can you see any blown or bulging capacitors?
If you replace the motherboard, it must have the same chipset or your drives won't work.
The disk will pass that sort of test. Heck, it can have a few bad blocks, and will pass anything but a full media test.  I am getting redundant now.  You have my expert opinion.  Good luck to you.

By the way, is this a critical server?  If so, look at replacing with hardened server hardware.
zwig2002Author Commented:
I am testing one drive now on an XP machine in a USB drive dock.

How can I view the data on this disk from an XP machine? I need to copy some files.
zwig2002Author Commented:
Visually the MOBO looks fine. There are no blown or bulging caps.
With the USB drive dock, the drive should just show up as another drive. Just copy and paste.

If the drive doesn't show up, how is it listed in Disk Management? (Right-click My Computer, click Manage, then click Disk Management.) What are the Type and File System?
zwig2002Author Commented:
The single drive that I was running stopped booting. I put the other drive in and it boots with no problem.

I ran the extended test on the drive that stopped booting and it passed.

This is a tough problem to troubleshoot. I am not sure where to go with this.

It appears that the drive has some data corruption.  
If the drive tested ok, then the drive is fine.  It is the data on it that has an issue.

I assume that this is the drive from which you could not read any data when it was attached to the XP system.
I would format this drive and then rebuild the RAID. It will act just as if you had replaced it with a new drive.

Did the other drive also test ok?
When attached to the XP system, can you see data on it?
Do you have a good full backup of the system?  If not, I would do that immediately.
zwig2002Author Commented:
You cannot read a mirrored server drive on an XP machine. This is based on other EE posts.

Let's not forget that the same errors come up on both drives, even when each is run on its own. The drive that won't boot now ran fine for a week.

I came in this morning and the server was down; it will not get past "Applying computer settings."

I replaced it with the second drive and it booted right up. I am not sure how long this one will last, though.

Should I replace the motherboard or not? If so, do I need the exact same board?

You need the same chipset.  Getting the same board will ensure that you have that.
Have you tested the memory?
Run memtest.
zwig2002Author Commented:
I have replaced the motherboard which resolved one issue.

I have found a bad block on the second drive. My issue is that the bad block must be in the boot partition. I cannot boot to the mirrored disk. It only boots to the corrupted disk. I installed a new drive and was able to mirror all the data except the boot partition. It will not mirror. Probably because of the bad block. What are my options?
zwig2002Author Commented:
Also, the original two drives are S.M.A.R.T. drives.
zwig2002Author Commented:
I tried running Windows Backup to create an ASR image, and it failed.

What is the best way to create an image and copy it to the new drive?
How did you find the bad block?  What did you use?
Put the source disk on a non-RAID controller (a USB enclosure will be fine). Using a binary editor, find the physical block that corresponds to logical block 0 when the disk is behind the RAID controller. Copy that block to a data file. Dismount, put the target disk in the same enclosure, run the binary editor, and paste the block at the same logical block number on the new drive. By writing data to that block, the disk will automatically remap the bad sector using the new data you supplied.

You can do all of the above from Windows; all you need is a USB enclosure. There are plenty of freeware binary editors to choose from that support raw physical block I/O.
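The copy-and-paste step with the binary editor amounts to reading one sector by its block number and writing it back at the same block number on the other disk. Here is a minimal sketch of that, assuming 512-byte sectors; the helper names are hypothetical, and the demo runs on in-memory images so nothing real gets overwritten. On Windows a whole disk is typically opened as `open(r"\\.\PhysicalDrive1", "r+b", buffering=0)` from an Administrator prompt; double-check the drive number first, since writing to the wrong disk destroys data, and Windows may refuse writes to a disk with mounted volumes, which is why the steps say to dismount.

```python
import io
from typing import BinaryIO

SECTOR_SIZE = 512  # assumption: the disk uses 512-byte logical sectors


def read_block(disk: BinaryIO, lba: int, sector_size: int = SECTOR_SIZE) -> bytes:
    """Read exactly one sector at logical block address lba."""
    disk.seek(lba * sector_size)
    data = disk.read(sector_size)
    if data is None or len(data) != sector_size:
        raise OSError(f"short read at LBA {lba}")
    return data


def write_block(disk: BinaryIO, lba: int, data: bytes, sector_size: int = SECTOR_SIZE) -> None:
    """Overwrite exactly one sector at lba.

    A drive typically reallocates a pending bad sector when new data is written to it.
    """
    if len(data) != sector_size:
        raise ValueError("data must be exactly one sector long")
    disk.seek(lba * sector_size)
    if disk.write(data) != sector_size:
        raise OSError(f"short write at LBA {lba}")
    disk.flush()


# Dry run on in-memory images before touching real disks.
source = io.BytesIO(b"".join(bytes([i]) * SECTOR_SIZE for i in range(4)))  # sector i holds byte i
target = io.BytesIO(bytes(4 * SECTOR_SIZE))                                # blank target
saved = read_block(source, 2)   # step 1: copy the block out ("save to a data file")
write_block(target, 2, saved)   # step 2: paste it at the same LBA on the target
print(read_block(target, 2) == saved, read_block(target, 1) == bytes(SECTOR_SIZE))  # True True
```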

zwig2002Author Commented:
The new motherboard is posting errors that disk 0 has a bad block.

I then used the WD utility to scan the SMART HDD, and it failed.
zwig2002Author Commented:
Unfortunately, when I put the disk in a USB dock, XP cannot read the data. It sees the disk, but that is it, because it is a Server 2003 mirrored volume.
Know your benchmarks... If block 0 is bad, does the motherboard stop looking? Maybe blocks 0 through n are bad? Same with the WD utility. That is the danger of not knowing the internals. Maybe the entire disk is bad instead of just block 0.

Maybe the HDD is formatted to a sector size other than 512 bytes, so it thinks the block is bad when the problem is actually correctable. I do not know; I am just suggesting you should not make any assumptions beyond block 0 being unreadable.
I never said you could mount it or see the file system.   You have RAID metadata at the beginning, not a filesystem header.  A binary editor that works on the physical disk, rather than a mounted file system is necessary.   You asked how to clone block #0, this will let you clone block #0.  

But you probably meant you wanted to clone logical block #0 within the RAID1 array. That is not the same as physical block 0. You will have to find the metadata markers. It is difficult to talk somebody through all of this, and I hesitate to go further because there are just too many things going on; I really need to see the raw blocks myself, the logs, and everything else to advise.
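One way to look for such markers is to scan the raw disk for a file system header and see where it actually lands. The sketch below looks for NTFS boot sectors, which carry the OEM ID `NTFS    ` at byte 3 and the `55 AA` signature at bytes 510-511. It is illustrative only: where the RAID metadata sits (start or end of the disk, and how large it is) depends on the RAID implementation, which is not known here. The difference between where a boot sector turns up physically and where the partition table says the partition starts is the shift introduced by any metadata in front of it. NTFS also keeps a backup copy of the boot sector at the end of the volume, so a full scan can report two hits per volume. On a failing disk an unreadable sector will raise `OSError`, which this sketch simply lets propagate.

```python
import io
from typing import BinaryIO, List

NTFS_OEM_ID = b"NTFS    "     # bytes 3-10 of an NTFS boot sector
BOOT_SIGNATURE = b"\x55\xaa"  # bytes 510-511 of a boot sector


def find_ntfs_boot_sectors(disk: BinaryIO, max_sectors: int, sector_size: int = 512) -> List[int]:
    """Scan the first max_sectors physical sectors and return those that look like NTFS boot sectors."""
    hits: List[int] = []
    disk.seek(0)
    for lba in range(max_sectors):
        sector = disk.read(sector_size)
        if not sector or len(sector) < sector_size:
            break  # reached the end of the disk or image
        if sector[3:11] == NTFS_OEM_ID and sector[510:512] == BOOT_SIGNATURE:
            hits.append(lba)
    return hits


# Synthetic 10-sector "disk" with a fake NTFS boot sector at physical LBA 6.
image = bytearray(10 * 512)
boot = 6 * 512
image[boot + 3:boot + 11] = NTFS_OEM_ID
image[boot + 510:boot + 512] = BOOT_SIGNATURE
print(find_ntfs_boot_sectors(io.BytesIO(bytes(image)), max_sectors=10))  # [6]
```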

You have had both physical and logical damage and corruption, and it is just dangerous to advise one message at a time without running decent diagnostics to assess parity health, unreadable blocks, file system/partition headers, and the controller metadata. For me or anybody else to get you going, it is going to require hands-on work. Without raw hex dumps from the right software using an appropriate controller, the ability to run appropriate diagnostics, and an assessment of the RAID1 health, I can't tell you what needs to be done.

My gut feeling is that you had a basic drive failure and an inconsistent mirror (due to never running checks/restores to ensure the halves matched), so now you have both file system corruption and a broken RAID. The other things just muddied it up.
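A consistency check on a two-disk mirror boils down to comparing the halves block by block. The sketch below does that on two disk images (shown with in-memory stand-ins); it assumes both images start at the same offset, i.e. any RAID metadata sits in the same place on both disks, and it only reports where they differ. It does not decide which copy is correct. The function name is hypothetical.

```python
import io
from typing import BinaryIO, List


def mismatched_sectors(disk_a: BinaryIO, disk_b: BinaryIO, sector_size: int = 512) -> List[int]:
    """Compare two halves of a mirror sector by sector and list the LBAs that differ."""
    mismatches: List[int] = []
    disk_a.seek(0)
    disk_b.seek(0)
    lba = 0
    while True:
        a = disk_a.read(sector_size)
        b = disk_b.read(sector_size)
        if not a and not b:
            break  # both images exhausted
        if a != b:
            mismatches.append(lba)  # includes sectors present on only one image
        lba += 1
    return mismatches


half_a = io.BytesIO(bytes(512) * 2 + b"\x01" * 512 + bytes(512))
half_b = io.BytesIO(bytes(512) * 2 + b"\x02" * 512 + bytes(512))
print(mismatched_sectors(half_a, half_b))  # [2]
```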

You need to either call in a pro to get this going, or put in a spare drive, rebuild the RAID1 so it is healthy, then perform a standard Windows recovery and live with the damage, or pay maybe $2000 to somebody like Ontrack.

zwig2002Author Commented:
I didn't say that block 0 was bad. I said that disk 0 has a bad block. I do not know which block is bad.
zwig2002Author Commented:
Can you recommend software to use?
For consumer software, Runtime.org's RAID Reconstructor plus NTFS recovery software, used together, MAY be able to get things back in order, but only with some hand-holding. Even then, you will need a non-RAID controller and one scratch disk. Neither of these products is designed to provide information that could reveal the root cause, but you are way beyond needing your curiosity satisfied. You need your data back.

Even then, you would end up with a volume that would probably have to be recovered with standard Windows NTFS recovery techniques (i.e., boot from the CD-ROM and start a recovery that rebuilds the O/S, registry, etc.; then you have to reinstall your apps). But at least the "data" would still be there, mostly.

Have you backed up your data yet?

What was the fail code that WD gave you?
zwig2002Author Commented:
WD diag: Raw Read Error Rate id=1 value= 182 threshold =51 worst =1 warranty=1
Data is backed up but I am trying to back up an image so I do not have to reinstall windows, all programs and recreate the domain from scratch.
zwig2002Author Commented:
Does that mean it is under warranty?
zwig2002Author Commented:
Why do you suppose the second RAID drive will not boot? It booted for a few days on its own and then it stopped booting. The disk tested good. Any thoughts? This is one big headache!
The diagnostic does not factor in when you bought it, and whether or not you bought it retail where it comes with a warranty. Only WD can tell you if a warranty applies.

Plus, it does not make a lot of sense that the Raw Read Error Rate (RRER) is even a problem. If the value goes BELOW the threshold, it is a problem. If your value were UNDER 51, then you would have something. I interpret this as meaning a value of 1 can be exchanged for a warranty replacement, possibly anything between 2 and 50 as well, but 182 is of no concern.
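The threshold rule can be written down following the convention smartmontools documents for its WHEN_FAILED column: an attribute is failing now when its normalized value is at or below the threshold, and failed in the past when the worst recorded value is at or below it. Under that convention the current value of 182 is fine, but the posted worst value of 1 is below the threshold of 51, which may be why the WD tool reports the SMART check as failed. How WD's own utility combines these numbers is not documented in this thread, so treat this as one interpretation rather than WD's logic.

```python
def smart_attribute_status(value: int, worst: int, threshold: int) -> str:
    """Classify a normalized SMART attribute the way smartmontools' WHEN_FAILED column does."""
    if value <= threshold:
        return "FAILING_NOW"  # the current normalized value has crossed the threshold
    if worst <= threshold:
        return "In_the_past"  # fine now, but the worst recorded value crossed it at some point
    return "OK"  # smartmontools shows "-" in this case


# The Raw Read Error Rate reading posted above: value=182, worst=1, threshold=51
print(smart_attribute_status(value=182, worst=1, threshold=51))  # In_the_past
```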
It's file system damage, not hardware damage. Erase all your DLL files and you will pass every diagnostic; you just won't be able to boot Windows.
zwig2002Author Commented:
Is there a way to just repair that disk?
Your best chance is to run SpinRite

The $89.00 is a great investment for this product.
SpinRite will make SOME blocks readable. It is file system agnostic. It does not even know which blocks your operating system uses, nor can it repair file system damage. You require a repaired file system.

Not only that, but you have no idea (nobody does, not without inspecting both disks) about the state of the RAID1.   Did one disk run degraded?  You could be repairing stale blocks on a disk that does not have even current data.

Sorry, to fix it properly, you need the whole deal.  Parity testing/repair, file system repair, windows O/S recovery.  Unless you know for sure what block(s) are even valid then you are wasting your time, and can even make things worse.
zwig2002Author Commented:
chkdsk gets to 18% and fails. I will try SpinRite.
FYI, at this point you have completely blown the opportunity for a guided recovery where disk A uses data from disk B to repair unreadable blocks, and vice versa. With that approach you might well have lost only a tiny amount of data. At this point, all you can do is try SpinRite and hope for the best on each disk, recover both manually, and see which one is best.
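The "guided recovery" idea, where each half of the mirror patches the other's unreadable blocks, looks roughly like this. It is a simplified sketch that assumes you already have sector-aligned images of both disks that line up at the same offsets; sectors are represented here as short byte strings (real ones are 512 bytes) with None marking an unreadable sector. Preferring disk A when both copies are readable but disagree is an arbitrary placeholder, because, as noted above, one disk may hold stale data from running degraded, so those conflicts need a human decision.

```python
from typing import List, Optional, Tuple

Sector = Optional[bytes]  # None marks a sector that could not be read


def merge_mirror(disk_a: List[Sector], disk_b: List[Sector]) -> Tuple[List[Sector], List[int], List[int]]:
    """Build one copy from two damaged halves of a RAID1 mirror.

    Returns (merged sectors, LBAs unreadable on both disks, LBAs where the disks disagree).
    """
    merged: List[Sector] = []
    lost: List[int] = []
    conflicts: List[int] = []
    for lba, (a, b) in enumerate(zip(disk_a, disk_b)):
        if a is None and b is None:
            lost.append(lba)  # no good copy anywhere: this data is gone
            merged.append(None)
        elif a is None:
            merged.append(b)  # fill A's hole from B
        elif b is None:
            merged.append(a)  # fill B's hole from A
        else:
            if a != b:
                conflicts.append(lba)  # one side may be stale
            merged.append(a)  # placeholder policy: prefer disk A when both are readable
    return merged, lost, conflicts


a = [b"boot", None, b"data-new", None]
b = [b"boot", b"mft", b"data-old", None]
merged, lost, conflicts = merge_mirror(a, b)
print(merged)     # [b'boot', b'mft', b'data-new', None]
print(lost)       # [3]
print(conflicts)  # [2]
```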

zwig2002Author Commented:
WD extended drive test results on failing hdd.

Model Number: WDC WD2000JD-22HBC0
Unit Serial Number: WD-WCALL1705103
Firmware Number: 08.02D08
Capacity: 200.05 GB
Test Result: PASS
Test Time: 23:04:03, April 27, 2010
The SMART test failed, but the disk passed hardware diagnostics. The disk is currently fine.
zwig2002Author Commented:
So what do you suggest. At this point I have no idea what is going on with this system.
Here is a little background on SMART

Did you run SpinRite?
At this point, because you kicked off chkdsk, the best you can hope for is to install a fresh copy of Windows on another HD, then mount one of these disks and try to copy over the files you need as best you can. Alternatively, if you know how to run a Windows recovery CD, do that. The third option is to hire somebody. There is really no way to walk you through it at this point. Had you run the Runtime software to repair the RAID, you could have fixed the file system (or would at least have been better off).
zwig2002Author Commented:
Thank you for your help. I tried all your solutions, but I had to replace the server altogether. Both drives were corrupt.
