RAID 10, Unrecoverable medium errors during recovery

A while back a hard drive failed on our Exchange server. The server is an Intel S5500BC board running Intel Embedded RAID Technology II. The RAID is configured as a RAID 10 with four drives, creating two spans and one virtual drive. All of the drives are the same 250 GB Seagate Barracuda ES.2 drives running at the same speed. Drive 2 failed and was replaced. It is on the same span as drive 3. The OS is Windows Server 2008 R2 Standard 64-bit.

I RMA'd the failed drive, and when I got a new one I put it into the RAID and let Windows boot up, and the drive began to rebuild. The rebuild got about halfway through fine but then started getting media errors. The rebuild did complete, though, and it says the RAID is optimal.

I use Symantec Backup Exec 2010 for my server backups. After the RAID rebuilt, full system backups would not run. They would encounter media errors. I will paste an example below.

Controller ID:  0   Unrecoverable medium error during recovery:   PD   0:3      Location   0x16c4a1c4
Controller ID:  0   Unrecoverable medium error during recovery:   PD   0:2      Location   0x16c4a1c4

I noticed I could back up everything except where the Exchange store and database sit, which is on a different partition than the OS. There are other user files on that partition that I was able to back up. I called Microsoft about the huge number of Exchange logs I had, and they had me create a new database and get rid of the old log files because we believed several of them were corrupt.

I cleared a lot of log files and am now able to get full system backups to run. These backups take a lot longer to complete now, though. They take 3 1/2 hours where they used to take about half an hour. I also get a bunch of the same type of media errors at several different disk locations, but always on disks 2 and 3:

Controller ID:  0   Unrecoverable medium error during recovery:   PD   0:2      Location   0x16c4a1c4
Controller ID:  0   Unrecoverable medium error during recovery:   PD   0:2      Location   0x16c4a1c4

Symantec Backup Exec says the full system backups are completing, though. These errors concern me, and I wonder if there are other files on the disk that may be corrupt and need to be fixed or deleted in order to stop these media errors and get the backup to perform like it used to. The RAID is managed by the Intel Embedded Technology II Web Console. I have run consistency checks on the virtual drive. They fix some errors but report that others cannot be fixed. Interestingly, the media error count on drive 3 resets in the web console after a reboot. I would really like to find a way to fix this.

Download SeaTools and test the affected drives with the standalone diagnostic on a different system.

I don't know what the embedded controller is on your board, but if it is in the ICH family, it's a software RAID controller, which should be avoided in a mission-critical server.

I would stick with a separate hardware RAID controller from LSI or Adaptec.

Based on the information that you have provided, you have a double fault within the array. This means that there is not enough information within the stripe to read the data within the stripe. As you can see, the same logical block location is unreadable on both drive 2 and drive 3.

Although you are getting a good backup, the data within any stripe that has a double fault or puncture will not be accessible at all.

The fix, unfortunately, at some point (the sooner the better) is to ensure you have all good drives with no media errors, recreate the array and reload your system.
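
To make the double-fault reasoning concrete, here is a minimal Python sketch (not something the controller or the Intel web console provides) that scans log lines like the ones quoted in the question and reports any location flagged on both members of a mirror. It assumes that `PD 0:3` means enclosure 0, slot 3, and that drives 2 and 3 form one mirrored span as described in the question. The function name `find_double_faults` is made up for illustration.

```python
import re
from collections import defaultdict

# Matches the controller log lines quoted in this thread.
# Assumption: "PD 0:3" is read as enclosure 0, slot 3.
LINE_RE = re.compile(
    r"Unrecoverable medium error during recovery:\s*"
    r"PD\s+(\d+):(\d+)\s+Location\s+(0x[0-9a-fA-F]+)"
)


def find_double_faults(log_text, mirror_pair):
    """Return sorted locations reported unreadable on every drive in mirror_pair."""
    drives_by_location = defaultdict(set)
    for match in LINE_RE.finditer(log_text):
        _enclosure, slot, location = match.groups()
        drives_by_location[int(location, 16)].add(int(slot))
    wanted = set(mirror_pair)
    return sorted(
        loc for loc, drives in drives_by_location.items() if wanted <= drives
    )


log = """
Controller ID:  0   Unrecoverable medium error during recovery:   PD   0:3      Location   0x16c4a1c4
Controller ID:  0   Unrecoverable medium error during recovery:   PD   0:2      Location   0x16c4a1c4
"""

# Hypothetical mirror membership: drives 2 and 3 share a span.
for loc in find_double_faults(log, mirror_pair=(2, 3)):
    print(hex(loc))  # prints 0x16c4a1c4
```

A location that shows up for only one drive of the pair can still be read from its mirror partner. A location that shows up for both is the case described above, where neither copy of the data is readable.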
ercasey (Author) commented:
This will probably sound weird and I may be completely wrong, but I have been going through the files on the drives and trying to identify where the corruption is. I know that the OS partition and the part of the next partition where Exchange sits are backing up with no problems, so I have been going through the rest of the drive. I am finding files that seem to have had all ownership stripped from them and cannot be opened at all. When I go into the properties of these files, re-assign ownership and set the security permissions, I am then able to open them. Following that method I was able to get a full system backup last night, and it didn't produce any media errors.

Charlestasse, when you say reload, do you mean reload the OS and set up all the roles and everything else all over again, or are you talking about a restore from backup? With these errors occurring I'm not completely confident in those backups right now.

I run full system backups once a week and incremental backups every weekday other than Monday. The full system backups are the ones that produce the errors. But like I said, the full system backup ran with no problems last night. The file size of the backups looks accurate, and I would like to believe these backups are solid.

Have you run a chkdsk on the affected drives?
Yes, I am referring to a full system reload / restore from backup.
It is possible that chkdsk can fix some file-level damage; however, it cannot fix a corrupted stripe.

ercasey (Author) commented:
Do you know of any tools I can use to diagnose which disks are having errors? How about a way to test these backups? This server is the most critical one in our organization. My fear is that even if I get good drives and start to restore from backup, the restore may fail and then I would be left with nothing. Right now the server is up and running, and everything seems to be functioning properly except for the occasional failed backup.

I did have to select the option in Symantec Backup Exec system recovery to ignore bad sectors during copy.

We have had a few hard drives fail this year alone. The hard drives in this server come from a company called Equus. We get them through our reseller, who gets them from Equus. The drives we get when we RMA drives are certified re-manufactured hard drives. I will be contacting Equus to see if it is always their policy to send repaired replacement drives instead of new drives.

So I'm not sure if the problem is that we are getting bad replacement drives, if there is a problem with the embedded controller, or if there is a problem on the board. If anyone knows how I could isolate the problem here I would be grateful.
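
On the question of testing the backups: one common approach is to do a trial restore to a scratch location (a spare disk or another machine) and compare the restored files against the live ones. The Python sketch below is only an illustration of that idea, not a Backup Exec feature. The function names `sha256_of` and `compare_trees` are hypothetical, and it assumes both trees are reachable as ordinary directories.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(original_root, restored_root):
    """Return (missing, mismatched, unreadable) lists of relative paths."""
    original_root = Path(original_root)
    restored_root = Path(restored_root)
    missing, mismatched, unreadable = [], [], []
    for source in sorted(original_root.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_root)
        restored = restored_root / relative
        if not restored.is_file():
            missing.append(relative.as_posix())
            continue
        try:
            if sha256_of(source) != sha256_of(restored):
                mismatched.append(relative.as_posix())
        except OSError:
            # A read failure here may point at a bad sector under that file.
            unreadable.append(relative.as_posix())
    return missing, mismatched, unreadable
```

Files that changed after the backup was taken will show up as mismatched even when the backup is fine, and files held open by a running service, such as a mounted Exchange database, may not be readable this way, so the lists are a starting point for investigation rather than a verdict.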
ercasey (Author) commented:
Thank you for the help. You were right: I ended up getting another server and doing a restore to it.