Avatar of Jsmply
Jsmply

asked on

Keep getting lots of "The system failed to flush data to the transaction log. Corruption may occur" errors in Windows 7 Pro

Hi Experts,

In our Windows 7 Pro system (relatively new), we keep getting this error: "The system failed to flush data to the transaction log. Corruption may occur." Google, Microsoft, and EE searches haven't yielded much help based on the details of the event. When it occurs, it occurs SEVERAL times in a row for an hour or so. Then it just stops for several hours, then starts again. We don't see much correlation with what's happening at those times. Can anyone shed any light? Thx. The full text of the error is below:
Log Name:      System
Source:        Ntfs
Date:          5/12/2010 3:34:37 PM
Event ID:      57
Task Category: (2)
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      mc1
Description:
The system failed to flush data to the transaction log. Corruption may occur.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Ntfs" />
    <EventID Qualifiers="32772">57</EventID>
    <Level>3</Level>
    <Task>2</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2010-05-12T07:39:52.469080600Z" />
    <EventRecordID>9095</EventRecordID>
    <Channel>System</Channel>
    <Computer>mc1</Computer>
    <Security />
  </System>
  <EventData>
    <Data>
    </Data>
    <Binary>0000000001000000020000003900048000000000100000C000000000000000000000000000000000</Binary>
  </EventData>
</Event>
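
The `<Binary>` payload in the event above can be unpacked for a little more detail than the description gives. A minimal sketch in Python, assuming the payload follows the `IO_ERROR_LOG_PACKET` layout from the Windows driver headers (an assumption; nothing in the thread confirms it). The helper name `decode_io_error_packet` is made up for this example.

```python
import struct

# <Binary> value copied from the event XML above, split into 4-byte groups.
BINARY_HEX = "".join([
    "00000000", "01000000", "02000000", "39000480", "00000000",
    "100000C0", "00000000", "00000000", "00000000", "00000000",
])

# Assumed IO_ERROR_LOG_PACKET layout: little-endian, 40 bytes before dump data.
PACKET_FORMAT = "<BBHHHHxxIIIIIq"
FIELDS = (
    "major_function_code", "retry_count", "dump_data_size",
    "number_of_strings", "string_offset", "event_category",
    "error_code", "unique_error_value", "final_status",
    "sequence_number", "io_control_code", "device_offset",
)


def decode_io_error_packet(hex_text):
    """Unpack the fixed-size header of the event's binary payload."""
    raw = bytes.fromhex(hex_text)
    size = struct.calcsize(PACKET_FORMAT)
    values = struct.unpack(PACKET_FORMAT, raw[:size])
    return dict(zip(FIELDS, values))


packet = decode_io_error_packet(BINARY_HEX)
event_id = packet["error_code"] & 0xFFFF   # low word of the error code
qualifiers = packet["error_code"] >> 16    # high word of the error code
print("event id:", event_id, "qualifiers:", qualifiers)
print("final status: 0x%08X" % packet["final_status"])
```

If the layout assumption holds, the low and high words of the error code match the `EventID` (57) and `Qualifiers` (32772) in the XML, and the final status comes out as 0xC0000010, which the Windows headers list as STATUS_INVALID_DEVICE_REQUEST. The payload does not say which volume was involved.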


ASKER CERTIFIED SOLUTION
Avatar of Gregor Lambert
Gregor Lambert
This solution is only available to members.
Avatar of Jsmply
Jsmply

ASKER

Thanks. So to clarify, even if the USB drives are set up to allow you to remove them without that option, you would get this error? Also, how long would you keep getting this error for?
There seem to be a few different scenarios that cause this issue, so it's hard to put your finger on which one it is. But if something was attempting to access the drive and the drive was removed, it may occur several times in the event log. You'd need to perform some testing with the Event Viewer up. Also have a read below, it may give you some more info; e.g., Acronis is a known cause for this event ID, which I have seen mentioned on several websites.
http://www.eventid.net/display.asp?eventid=57&eventno=2197&source=Ftdisk&phase=1
This is at the bottom of the page in the above link; I guess it's worth a try:

I/O requests issued by the file system to the disk subsystem might not have been completed successfully.

User Action
If this message appears frequently, run Chkdsk to repair the file system.

To repair the file system
Save any unsaved data and close any open programs.
Restart the computer.
The volume is automatically checked and repaired when you restart the computer.
Alternatively, you can run the Chkdsk tool from the command prompt without shutting down the computer first. Click Start, click Run, and then type cmd. At the command prompt, type “chkdsk /R /X Drive”. Chkdsk runs and automatically repairs the volume. Repeat this step for each volume on the disk. If the following message appears, type Y. “Cannot lock current drive. Chkdsk cannot run because the volume is in use by another process. Would you like to schedule this volume to be checked the next time the system restarts?” The next time the computer is started, Chkdsk will automatically run.
It's also worth running chkdsk on the removable drive that you are using after you've done the system drive.
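
The "repeat this step for each volume" part can be scripted. A small sketch in Python, assuming it is run from an elevated prompt on the Windows machine; the volume letters and the helper names `build_chkdsk_command` and `check_volumes` are placeholders for illustration, and the volume is placed first as in chkdsk's documented usage. By default it only prints the commands it would run.

```python
import subprocess


def build_chkdsk_command(volume):
    """Return the argument list for a repair pass on one volume, e.g. 'D:'."""
    # /R locates bad sectors and recovers readable data; /X forces a dismount first.
    return ["chkdsk", volume, "/R", "/X"]


def check_volumes(volumes, dry_run=True):
    """Build (and, if dry_run is False, run) chkdsk for each volume in turn."""
    commands = [build_chkdsk_command(volume) for volume in volumes]
    if not dry_run:
        for command in commands:
            # chkdsk may ask to schedule the check for the next restart
            # if the volume is in use, so leave stdin attached.
            subprocess.run(command, check=False)
    return commands


# Hypothetical volume letters; substitute the ones on the machine.
for command in check_volumes(["C:", "E:"]):
    print(" ".join(command))
```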
Avatar of Jsmply

ASKER

I don't think it's the filesystem on drive C, we have been getting the issue since the machine was freshly setup, and after a complete format.  
Still, it may be worth a try. I would also check the mainboard manufacturer's site for updated SATA controller drivers, and maybe check whether there is a BIOS update and what it fixes relative to the version your mainboard is using.
Avatar of Jsmply

ASKER

Is this relating to the same error though?  That one mentions "ftdisk" as the source.  This one mentions "ntfs" as the source.  
Yeah, I'm sure they are the same issue. I'd perform the chkdsk on the system drive; from everything I've been reading, it's an issue with ftdisk and ntfs that raises this event ID.

And performing these tasks won't cause any damage.

Explanation
NTFS could not write data to the transaction log. This could affect the ability of NTFS to stop or roll back the operations for which the transaction data could not be written. NTFS could not write data because of one or more of the following reasons:

I/O requests issued by the file system to the disk subsystem might not have been completed successfully.
Avatar of Jsmply

ASKER

Thanks.  We will run that soon when we are on-site.  Question, from the description, it does not sound like it will cause a problem does it?  It is however filling the event log.
The only issue, at a guess, could be a loss of data or damage to the partition table and file system if left to continue in its current state. Once you have performed the chkdsk, also run the SFC /SCANNOW command to make sure there are no damaged system files, as that may also be part of the problem.
Avatar of Jsmply

ASKER

Both came back clean.  This particular machine runs a raid array and even under Windows XP (before converted to Windows 7) it logged some odd events in the log.  Possibly a driver glitch?
Yeah, I'd say so. It may even require a BIOS update.
Avatar of Jsmply

ASKER

Haven't been able to find a useful solution yet. Anyone have any experience with this error actually causing problems? It doesn't seem to interfere with any operations and all other tests show no issues.
Is the BIOS on the PC up to date, and are you using up-to-date RAID drivers?
Avatar of Jsmply

ASKER

We will have to verify, I believe so.
Avatar of Jsmply

ASKER

Bios is up to date, updated the latest Intel chipset today also.  Any other possible causes?  We are still getting the messages but the system is stable.
Avatar of David
Your disk drive (either physical, or logical on a RAID controller) is timing out.  If this is a single disk, then you most likely have some bad blocks, and the timeout is due to the 15-30 secs or so it takes to perform a deep cycle recovery from a read error.

chkdsk won't fix this UNLESS you also tick the scan/repair bad blocks option. As for the I-just-formatted-so-there-shouldn't-be-any-errors ... nope. Drives have tens of thousands of spares. The manufacturers expect this. You should too. Also, statistically you have a higher probability of having these defects show up when the drive is new.

If you have the standard cheapo-desktop drives like seagate 'cudas, then you are just looking for trouble.   Their unrecovered error bit rate is basically about the same as the number of bits on a TB drive.

If you want reliable data, get a reliable disk drive. You probably have a lemon. Download the manufacturer's freebie diagnostics, look at the unrecovered sector count, also look at the S.M.A.R.T. status, and perform a media verify (if this feature is available). You can pay $$ for diagnostics, but hopefully the freebie (from wdc.com, seagate.com, or whatever) is sufficient to identify and confirm this problem.
Avatar of Jsmply

ASKER

Hi Dlethe, thanks for the reply.  Chkdsk was run with the scan/repair bad blocks.  It took about 1.5 hours but returned 0 bad sectors.  These drives are behind a raid controller though so I believe this is expected to find 0 failures.   The manufacturer diagnostics come back with no errors at all.

We've spoken with Dell, Microsoft, and Seagate. Dell makes the computer, Microsoft makes the OS, and Seagate makes the external hard drives we use on this machine. All three point the finger at someone else.

Dell's diagnostics checked the raid controller and the hitachi internal drives.  They all come back fine.  Seatools checked the external drives (three different drives, all new) and they all come back fine.  The system is stable and is used daily and runs backups, etc . . . but these "warnings" still show up in the event log and no one can find why.  

Any ideas on where to go next?
The problem is that chkdsk does NOT return an error unless the block is unrecoverable. Based on 1.5 hrs, it is obvious the disk performed internal block recovery. This proves you had recoverable errors, unless the HDD had a lot of application I/O adding load. Run it again with similar load and compare times.

There are some commercial tools that can actually report specifics of the bad and recovered blocks, but probably not a need now. The disk recovered and rewrote all the bad ones. Problem solved IF another check finishes significantly faster.

You should get decent enterprise disks. They cost a lot more money and the performance gain is nominal, but they have 100X more ECC and many of them auto-repair these recoverable blocks in the background. These are some of the reasons why you should pay the extra money for the premium drives.
Avatar of Jsmply

ASKER

Well, chkdsk will never find bad sectors on a RAID 1 array, will it? That's presumably handled at the RAID level. Chkdsk found 0 bad sectors, just cleaned up a few indexes. Regardless, the warnings are still being logged in the event viewer. Chkdsk always takes about the same time on this machine if run to check for bad sectors, about 1.5 hours. From research and experience, that's about right for a 500 GB drive (actually two 500 GB drives in a RAID 1 array, so chkdsk still sees it as one 500 GB drive).
Well, then this is different, I didn't see you had a RAID controller.   (Actually chkdsk will interact with a RAID controller a little differently).  Specifically what is make/model of RAID controller and make/model of disk drive(s).  Be specific.   I have a good idea what it is but will research to confirm suspicion before I respond.
Avatar of Jsmply

ASKER

It's a Dell Precision 1500 with the built in raid option.  Don't recall the raid properties off hand, would have to verify.  It's just the standard raid offering on this model with Intel Storage Matrix utility as the diagnostic/admin tool.  The hard drives are hitachi 500 GB drives.  Again, would have to verify the exact model.  
I need exact models, might as well add firmware revision to the list of requirements, seeing how you may have to reboot to get it (unless you installed the windows-based matrix mgr software).
Avatar of Jsmply

ASKER

Will get that asap for you. Curious as to what you're thinking.
OK, read this article I wrote.  

https://www.experts-exchange.com/articles/Storage/Misc/Disk-drive-reliability-overview.html

If these are the consumer-class Hitachi drives, then they will do the deep recovery, which is usually 15-30 secs. While the disk is in the recovery cycle, the disk will not do any I/O. The Matrix controllers can somewhat compensate by routing I/O to the other drive, but only a few requests. Then it hangs. Google "TLER" and "Matrix controller".

Buy an enterprise class drive, and they will abort a deep recovery after only 1-2 secs max.  The controller will remap the bad block and move on before it causes any significant I/O penalty.  

Intel does NOT certify any non-enterprise-class drives for use with the Matrix controllers. This is one of the reasons. The server/enterprise/RAID class disks are designed to give up quickly to let the RAID controller handle them. The desktop drives just hang because they figure you are only using one disk and don't ever want to lose a file, and don't have a mirrored copy or even a backup.


Avatar of Jsmply

ASKER

Thanks. Will get the exact models soon. This is the default configuration straight from Dell. Based on your article, if I'm following you, the warnings may be normal under this config?

It may be noteworthy that when this same machine was under Windows XP (first two weeks of ownership) it did NOT throw the same error. However, it did report event ID 51, which is a paging error warning, whenever the external hard drives were connected. That's why Dell was thinking this is Windows 7's version of that error and not related to the RAID. Spent quite a while chasing that event 51 error on XP and concluded it was "normal" based on talks with Seagate and MS.
Dell unfortunately offers both types of disks, so model number is everything.  A rule-of-thumb, if this is a  rackmount system, then they will only offer enterprise class storage (never seen anything but those drives, let's say) , but if it is a desktop, you'll get desktop class drives.

There are also some known issues with several versions of the Matrix firmware & drivers, so go to the dell site (not Intel, because they won't have the firmware that matches your motherboard .. usually), and make sure they are current. Same goes for drivers, so you can do that in the interim.  Also download the windows-based utility and install it, as it can be used to monitor for a RAID1 drive failure.

The warnings are normal, in the sense it is a common problem, but it still means data loss. Get disks that are designed to "give up" and let the RAID controller handle errors & retries, and you don't have these problems. You will also have far fewer of them. The improved ECC error rate and background scrubbing features common in the enterprise disks also ensure far fewer recoverable and unrecoverable blocks to begin with.
Avatar of Jsmply

ASKER

Thanks. It's a tower machine, not a rack. Unfortunately, it's a charity with a limited budget so enterprise class equipment isn't available right now. So when you say "The warnings are normal, in the sense it is a common problem, but it still means data loss" do you mean that the disks are going to literally lose the data stored on them because it's a common problem with this model/config . . . or just a common "problem" with this setup but still usable?

The Intel drivers are up to date via Dell (this week) and the Windows utility is installed and showing the raid array is healthy.
Avatar of Jsmply

ASKER

Okay got the info you requested (as much as is available via software, can't view information outside of Windows without being on-site):

Hard drives are in a Raid 1 and are both
Model: ST3500418AS
Firmware: CC45

Intel(R) ICH8R/ICH9R/ICH10R/DO/5 Series/3400 Series SATA RAID Controller
Intel(R) 5 Series/3400 Series Chipset Family USB Enhanced Host Controller - 3B34
Intel(R) 5 Series/3400 Series Chipset Family USB Enhanced Host Controller - 3B3C
Those drives will not work properly with the Matrix family UNLESS you disable the RAID and go with windows dynamic disks and RAID1.   The deep recovery timeout is greater than what the ICHxR array is prepared to wait for.  

Yes, yes, I know.  Lots of people use them, and why would Dell sell it, and it has worked well until now.  Heard it all.  Bottom line is that it doesn't matter.  The TLER recovery timing is too long, so it will lock up and cause the problem.  If you go with Windows native software RAID, then that will wait longer because it is designed to do that in order to be compatible with low-cost disk drives such as this.

BTW, those disks are going for $39 whole dollars now.  Think about it.
Dumping the hardware-based RAID and making sure that the disks are in standard ATA, non-RAID mode will work, but you will need to do a full backup & recovery, as the Matrix controller carves out some metadata at the beginning of the disk. When you disable the RAID controller, those extra blocks show up starting at block 0.
That disk is also rated for 2400 power-on hours per year duty cycle.  As for losing data, when the O/S tells you that "I/O requests issued by the file system to the disk subsystem might not have been completed successfully." ... then what does that tell you?

If the O/S says it doesn't know if the data is getting read or written correctly, then it is a pretty safe bet that it isn't.  
Avatar of Jsmply

ASKER

Thanks Dlethe.  One quick question though.  It seems when the external hard drives are removed, these errors do NOT occur.  That's why Dell and MS seem to both say the warning is coming from the interaction between the system and the external drives.  Any ideas there?  
Avatar of Jsmply

ASKER

Okay Dlethe. Hear us out here. New theory. Remember above we said that Dell and MS say that it has something to do with the interaction between the system and the external drives? A colleague mentioned seeing a similar error (event 57 but from ftdisk instead of ntfs as the source) on a Windows XP machine recently. We took a look at that machine just now. What's the common factor? Norton Ghost 15 . . . in fact, that machine experienced that error for the very first time 15 days ago at 4:15 PM. Norton Ghost 15 was installed that same day at around 3:50 PM . . . and the very first thing done was define the external drives as a backup destination.

Admittedly this is a totally different direction than the RAID drivers. But the other machine that is throwing the exact same error (except ftdisk as the source instead of ntfs) has no RAID, a basic single internal HDD setup with a SEAGATE external drive (like the system we have been troubleshooting), and first saw the error when Ghost was installed. Keep in mind Ghost keeps accessing the external drive as a backup destination the whole time it's turned on (the second you remove the drive, Ghost pops up and says some recovery points are not available).

It's a bit of a wild goose chase at this point, but can you hypothesize any correlation?  To troubleshoot, we will stop Ghost running on the machine in question for 12 hours or so and see if the errors stop.  
Avatar of Jsmply

ASKER

Okay, did some more digging. Literally looking through every machine that runs Ghost (or a Symantec/Veritas variation) that we can look at. Have found the error on several machines running the software. Have also found a few machines that don't have the error. What's the similarity? Well, all of the machines that DO have the error are machines that run Ghost with multiple external hard drive destinations. Not all drives are present all the time, but Ghost is always reporting what destinations are available or not.

Again, this might be a wild goose chase and far off the original Raid issue we were chasing.  But so far, the theory has proved true.  We will have to see what happens when the Ghost services are stopped for 12 hours on the machine this thread is about.

Look forward to your thoughts tomorrow. Thanks again.
Avatar of Jsmply

ASKER

Btw, the thread above keeps saying Ghost, but it is also referring to Symantec/Veritas equivalents of the software.
Avatar of Jsmply

ASKER

Logs clear for over an hour so far since stopping the service.
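
To check this from the log rather than by eye, the event 57 entries can be counted per hour and compared against the time the service was stopped. A rough sketch in Python, assuming the System log entries are available as individual `<Event>` XML strings in the same format as the one in the original question; `count_event_57_per_hour` is a made-up helper name, and the sample is a trimmed copy of that event.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from datetime import datetime

NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}


def count_event_57_per_hour(event_xml_strings):
    """Count Ntfs event 57 entries per hour (hours are in UTC, as logged)."""
    counts = Counter()
    for text in event_xml_strings:
        system = ET.fromstring(text).find("e:System", NS)
        provider = system.find("e:Provider", NS).get("Name")
        event_id = system.find("e:EventID", NS).text
        if provider != "Ntfs" or event_id != "57":
            continue
        stamp = system.find("e:TimeCreated", NS).get("SystemTime")
        # Keep whole seconds only; the log carries more fractional digits
        # than strptime accepts.
        when = datetime.strptime(stamp[:19], "%Y-%m-%dT%H:%M:%S")
        counts[when.replace(minute=0, second=0)] += 1
    return counts


SAMPLE = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Ntfs" />
    <EventID Qualifiers="32772">57</EventID>
    <TimeCreated SystemTime="2010-05-12T07:39:52.469080600Z" />
  </System>
</Event>"""

for hour, count in sorted(count_event_57_per_hour([SAMPLE]).items()):
    print(hour.isoformat(), count)
```

An hour with a zero count after the service was stopped, following hours with many entries, would support the theory. The filter only looks at the Ntfs source, so the ftdisk variant seen on the XP machine would need its own check.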
Avatar of Jsmply

ASKER

I believe we are correct, the log is clear so far. Please see this website: http://support.microsoft.com/kb/938940

It seems to support the theory here. The drives are optimized for quick removal, but Norton Ghost seems to lock the drives in use when its service is running.

It would be helpful if Windows event viewer would let you know WHICH drive the ntfs error was for. Wouldn't you think that would be useful info? Regardless, it's pointing away from being the internal drives.
SOLUTION
This solution is only available to members.
Avatar of Jsmply

ASKER

Well, we have identified what causes the problem.  We haven't solved how to stop it yet though, perhaps a new thread might be more appropriate?  

https://www.experts-exchange.com/questions/26267675/Cannot-safely-remove-USB-drives-when-Norton-Ghost-is-installed.html?anchorAnswerId=33015248#a33015248
Yep, another thread.  It sucks that ghost is architected this way, it doesn't have to be.  But I don't think there can be a work-around that doesn't involve programming changes. This can't possibly be anything other than a bug.  There is no reason for ghost to interfere with disk I/O on devices it is not directly using ..

So a ghost solution outside of checking for a patch, is to see if you can set some runtime options to exclude what you are not using, and to include only the source disk.   Perhaps if a drive is not specifically excluded (or included), the lazy developer locks it, then looks at the runtime parameters for the files it needs.
Avatar of Jsmply

ASKER

It seems that if it's ever used as a backup destination, Ghost locks it. We have even tried "logging off" Windows to see if perhaps it's just because the GUI is running and displaying info on destinations. No dice, if the service is running, the drive is "in use." Supposedly it's being escalated to Norton's Senior Engineers.
Avatar of Jsmply

ASKER

Well, I believe we solved it. The culprit was actually two-fold. We were running into the same situation on two different machines (one being Windows XP and one being Windows 7). Turns out, Ghost behaves differently on each. On Windows XP, it does indeed lock the external drives as long as the service is running. This was causing the event ID 57 errors when removing them. On the Windows 7 machine (that this thread is based on), Ghost will not lock the drives when it runs alone. However, the machine was running Carbonite also. Now, both companies said there is no problem running side by side as long as they run at different times and don't overlap. This doesn't appear to be the case. With Carbonite not installed (we set up a test machine to make sure), VSS shuts down when idle, Ghost then shuts down its services, and the drive can be unlocked, thus stopping the event 57s. The issue now becomes whether Carbonite and Ghost can truly run side by side (even if scheduled differently), but that's a new thread. Thx all! Glad to see the hardware is in good shape!
Avatar of Jsmply

ASKER

pacsadminwannabe actually had it right in the beginning (just because of a different reason than normal safe removal). Spreading some love to dlethe also for all his/her help. Thx all!