darnitol500

Server 2003 Corrupting MFT, Corrupting Files, how should I proceed?

I have a server that is corrupting files, especially files related to SQL.

The server is running Server 2003 Enterprise R2 on a Xeon E5405 w/ 4GB RAM, with an Intel mobo using the onboard SATA RAID. It has 4 Seagate 1TB HDDs: 3 in a RAID 5, plus a hot spare.

It has had a couple of forced shutdowns by a well-meaning but misinformed maintenance man, and was showing some files that could not be deleted or changed due to corruption. We first noticed the corruption on our backup reports. I backed up everything imaginable, even dcpromoing their Exchange server in order to have a good backup of AD in case the server needed recovery. I ran chkdsk, which complained of MFT corruption as well as a plethora of cross-linked files and the like. After rebooting and running a second chkdsk, which came back clean, the server booted fine and "sfc /scannow" ran without incident. I fixed some broken DCOM settings as well as some service permissions, and the server seemed to be running as well as ever, with clean backup reports for two nights.

Then it started corrupting files again.

So on 2/26 and 2/27 we received:

Backup started on 2/26/2011 at 1:48 AM.
Warning: Unable to open "C:\WINDOWS\assembly\NativeImages_v2.0.50727_32\PresentationBuildTa#\20ef773b20f6ce721ae60e5c2c2e8f80\PresentationBuildTasks.ni.dll" - skipped.
Reason: The file or directory is corrupted and unreadable.

Then on 2/28 it became:

Backup started on 2/28/2011 at 1:28 AM.
Warning: Unable to open "C:\WINDOWS\assembly\NativeImages_v2.0.50727_32\PresentationBuildTa#\20ef773b20f6ce721ae60e5c2c2e8f80\PresentationBuildTasks.ni.dll" - skipped.
Reason: The file or directory is corrupted and unreadable.


Error: Could not access portions of directory C:\WINDOWS\inf\01F\.NET Data Provider for SqlServer.
You may not have permission to open the file, or the directory may be missing or damaged.
Please contact the owner or administrator.

Warning: Unable to open "C:\WINDOWS\inf\01F\.NET Data Provider for SqlServer" - skipped.
Reason: The file or directory is corrupted and unreadable.
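
To see whether the corruption is spreading from night to night, the skipped paths can be counted per backup run. The following is a minimal Python sketch, assuming the report is plain text in the format quoted above; the function name `skipped_by_run` and the `SAMPLE_REPORT` variable are illustrative and not part of any backup tool.

```python
import re

# Patterns for the two report line shapes quoted in this thread.
START_RE = re.compile(r"^Backup started on (\d{1,2}/\d{1,2}/\d{4}) at")
SKIP_RE = re.compile(r'^Warning: Unable to open "(.+)" - skipped\.$')

# Sample text copied from the reports above (raw string keeps backslashes).
SAMPLE_REPORT = r'''Backup started on 2/26/2011 at 1:48 AM.
Warning: Unable to open "C:\WINDOWS\assembly\NativeImages_v2.0.50727_32\PresentationBuildTa#\20ef773b20f6ce721ae60e5c2c2e8f80\PresentationBuildTasks.ni.dll" - skipped.
Reason: The file or directory is corrupted and unreadable.
Backup started on 2/28/2011 at 1:28 AM.
Warning: Unable to open "C:\WINDOWS\assembly\NativeImages_v2.0.50727_32\PresentationBuildTa#\20ef773b20f6ce721ae60e5c2c2e8f80\PresentationBuildTasks.ni.dll" - skipped.
Reason: The file or directory is corrupted and unreadable.
Error: Could not access portions of directory C:\WINDOWS\inf\01F\.NET Data Provider for SqlServer.
Warning: Unable to open "C:\WINDOWS\inf\01F\.NET Data Provider for SqlServer" - skipped.
Reason: The file or directory is corrupted and unreadable.
'''


def skipped_by_run(report_text):
    """Map each backup start date to the list of paths skipped in that run."""
    runs = {}
    current = None
    for raw_line in report_text.splitlines():
        line = raw_line.strip()
        start = START_RE.match(line)
        if start:
            current = start.group(1)
            runs.setdefault(current, [])
            continue
        skipped = SKIP_RE.match(line)
        if skipped and current is not None:
            runs[current].append(skipped.group(1))
    return runs


if __name__ == "__main__":
    for date, paths in skipped_by_run(SAMPLE_REPORT).items():
        print(date, len(paths), "skipped")
```

A rising count across runs, as between 2/26 and 2/28 here, suggests the damage is ongoing rather than left over from the earlier forced shutdowns.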

Errors in the System Log (these only seem to occur while the backup is running):

Event ID: 9  Source: MegaSR

The device, \Device\Scsi\MegaSR1, did not respond within the timeout period.

Event ID: 55  Source: NTFS

The file system structure on the disk is corrupt and unusable. Please run the chkdsk utility on the volume \Device\HarddiskVolume2.


Opinions?


Since this will likely involve having to reload their server, does anyone have any nuggets of wisdom or astounding timesavers for reloading the OS without hosing AD and my Data after the inevitable reformat?






Thanks!
Davis McCarn

Unless there are disk errors or the MegaRaid utility reports health problems, the most common cause of disk corruption is flaky RAM.  Get the free ISO from http://www.memtest86.com and boot it to run the memory test.  Let it run for at least a few hours.
ASKER CERTIFIED SOLUTION
pgm554
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
darnitol500

ASKER

Tell me more about this 30 percent chance of failure during rebuild, please. Is it a logical or mechanical issue?  Do you have any external references that provide additional details?

This is not a RAID rebuild situation, but this sounds like important info nonetheless.
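
One common way to estimate rebuild risk is to treat it as the chance of hitting at least one unrecoverable read error (URE) while reading every surviving drive in full. The sketch below works through that arithmetic under loud assumptions: a URE rate of 1 per 10^14 bits read (a figure often quoted on consumer SATA spec sheets, not taken from this thread) and independent errors. It is not necessarily how the 30 percent figure was derived, since that derivation is not shown here.

```python
import math


def rebuild_ure_probability(surviving_drives, drive_bytes, ure_per_bit):
    """Probability of at least one URE while reading all surviving drives.

    Assumes each bit read fails independently with probability ure_per_bit.
    """
    bits_read = surviving_drives * drive_bytes * 8
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


if __name__ == "__main__":
    # RAID 5 of three 1TB drives: a rebuild reads the two survivors in full.
    p = rebuild_ure_probability(
        surviving_drives=2,
        drive_bytes=1e12,   # nominal 1TB (decimal)
        ure_per_bit=1e-14,  # ASSUMED spec-sheet value, not measured
    )
    print(f"Estimated chance of a URE during rebuild: {p:.1%}")
```

Under this model the risk grows with the amount of data that must be read, so larger drives or wider arrays push the estimate up. It says nothing about mechanical failure of a second drive during the rebuild, which is a separate risk.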
Seagate st310034as 7200 rpm SATA HDDs
Also, download an eval copy of Backup Exec 2010 System Restore and create an image out to a USB drive.

Use the restore disk to test on another system using the restore anywhere option and see if your corruption goes away.

If so, I would rebuild the RAID using RAID 6 and restore the image to the new config.
These are excellent resources, pgm554!

Thank you for going above and beyond and "teaching a man to fish."

I am busy shifting operations to an alternate server while I troubleshoot this one. I will likely be replacing the mobo RAID with a 3Ware controller (a recurring theme from other forums is that the Intel mobo RAID is not known for reliability), and I will be reusing the existing drives to construct two RAID 1 volumes. I know those Seagate drives are consumer drives, but the customer cannot afford Enterprise-class drives.  They would have afforded them if they knew how much the cheap ones really cost over the long run!

Before I do that, however, I will be running diagnostics against the RAM, RAID, and MOBO to see if there are any overt failures.

I will keep you posted!

The failure is snowballing, and I'm glad I'm on top of it. This error began appearing in this morning's Application log:

Source:  ESENT  Event ID:  508

wins (3672) A request to write to the file "C:\WINDOWS\system32\wins\j50tmp.log" at offset 0 (0x0000000000000000) for 4096 (0x00001000) bytes succeeded, but took an abnormally long time (91 seconds) to be serviced by the OS. This problem is likely due to faulty hardware. Please contact your hardware vendor for further assistance diagnosing the problem.
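
If these slow-I/O events keep arriving, it can help to pull out the file, request size, and latency so the trend is easy to see. This is a rough Python sketch that assumes the message text has been exported from the Application log as plain strings in the wording quoted above; the "read from" variant in the pattern is an assumption, and `parse_slow_io` is an illustrative name.

```python
import re

# Matches the ESENT 508 wording quoted above. The "read from" alternative
# is an ASSUMPTION about how a slow-read message would be phrased.
SLOW_IO_RE = re.compile(
    r'request to (?P<op>read from|write to) the file "(?P<path>[^"]+)" '
    r'at offset (?P<offset>\d+) \(0x[0-9A-Fa-f]+\) '
    r'for (?P<size>\d+) \(0x[0-9A-Fa-f]+\) bytes succeeded, '
    r'but took an abnormally long time \((?P<seconds>\d+) seconds\)'
)

# Message text copied from the event above.
SAMPLE_EVENT = r'''wins (3672) A request to write to the file "C:\WINDOWS\system32\wins\j50tmp.log" at offset 0 (0x0000000000000000) for 4096 (0x00001000) bytes succeeded, but took an abnormally long time (91 seconds) to be serviced by the OS.'''


def parse_slow_io(message):
    """Return op, path, bytes and seconds from a slow-I/O message, or None."""
    match = SLOW_IO_RE.search(message)
    if match is None:
        return None
    return {
        "op": match.group("op"),
        "path": match.group("path"),
        "bytes": int(match.group("size")),
        "seconds": int(match.group("seconds")),
    }


if __name__ == "__main__":
    print(parse_slow_io(SAMPLE_EVENT))
```

A 4096-byte write taking 91 seconds points at the storage path (drive or controller) rather than the file system itself, which fits the MegaSR timeout events seen during backups.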

I'm 75% certain it's a failing drive, and 25% certain that it is a failing controller.  Both are going to be ordered!
I can't find Spinpoint F1s; are Spinpoint F3s acceptable?
I have analyzed the hard drive space requirements and found that 1TB is far more space than needed: the server OS plus its data is under 100GB.

SO:

I have ordered the following:

3ware 9650SE-4LPML + its 4-drive breakout cable
4 x Western Digital RE4 WD2503ABYX 250GB 7200 RPM

I will create two RAID 1 volumes, 1 for the OS, and 1 for the data.
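
As a sanity check on the sizing, the usable capacity of each layout can be worked out directly. The sketch below uses the standard capacity rules for RAID 1, 5, and 6 with nominal (decimal) drive sizes, and ignores formatting overhead; `usable_gb` is an illustrative helper, not a controller utility.

```python
def usable_gb(level, drives, drive_gb):
    """Nominal usable capacity in GB for a single RAID volume."""
    if level == 1:
        if drives < 2:
            raise ValueError("RAID 1 needs at least 2 drives")
        return drive_gb  # every drive holds a full copy
    if level == 5:
        if drives < 3:
            raise ValueError("RAID 5 needs at least 3 drives")
        return (drives - 1) * drive_gb  # one drive's worth of parity
    if level == 6:
        if drives < 4:
            raise ValueError("RAID 6 needs at least 4 drives")
        return (drives - 2) * drive_gb  # two drives' worth of parity
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    print("Old array, RAID 5 of 3 x 1000GB:", usable_gb(5, 3, 1000), "GB")
    print("New OS volume, RAID 1 of 2 x 250GB:", usable_gb(1, 2, 250), "GB")
    print("New data volume, RAID 1 of 2 x 250GB:", usable_gb(1, 2, 250), "GB")
    print("Alternative, RAID 6 of 4 x 250GB:", usable_gb(6, 4, 250), "GB")
```

With under 100GB of OS plus data in total, each 250GB mirror leaves comfortable headroom even if everything ended up on one volume.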

I am planning on using the Backup Exec 2010 System Restore for the move.
Found the issue: one of the drives in the existing array had bad blocks, but they were never reported to the controller (cheap Intel RAID). I had to figure it out by running a consistency check and watching the lights on the HDDs; the flaky one stayed almost constantly illuminated as it ruminated on its bad sectors.  The in-OS consistency checks never reported an error, so I had to run the check from the BIOS instead.

Since I can recover the existing system I'm going to replace the failed drive, get it going soundly, then migrate to the new (more reliable) array.

The Symantec solution is crippled unless I buy it, so I'm going to set aside the repaired array and use ASR + NTBackup for the restore to the new array.
There is an excellent article on Tech Republic regarding RAID 6 and why we shouldn't jump on it too quickly.

http://www.techrepublic.com/blog/datacenter/raid-6-do-you-really-want-it/119