jeff1946

asked on

ESE Read Errors on Exchange priv1.stm Database During NTBackup

The context is SBS Standard 2003 R2 SP2 running on a Dell PE 2950 w/4 GB RAM and a RAID 1 volume on a PERC 5/i controller. All the MS updates, Dell drivers & firmware are up-to-date. The RAID volume is divided into C: and D: partitions, both of which have plenty of free space.

The SBS backup tool performs a full backup every night. From time to time (approx 10 times in the last 45 days) the backup fails with errors like these in the backup log:

The 'Microsoft Information Store' returned 'Error returned from an ESE function call (d).
' from a call to 'HrESEBackupRead()' additional data '-'The 'Microsoft Information Store' returned 'Error returned from an ESE function call (d).
' from a call to 'HrESEBackupRead()' additional data '-'

(Kinda garbled, but that's what it says.)

The app event log contains 10-20 repetitions of errors like this:

Event Type:      Error
Event Source:      ESE
Event Category:      Logging/Recovery
Event ID:      478
Date:            10/26/2009
Time:            11:47:35 PM
User:            N/A
Computer:      DC
Description:
Information Store (4980) The streaming page read from the file "D:\Exchsrvr\MDBDATA\priv1.stm" at offset 1370796032 (0x0000000051b4b000) for 4096 (0x00001000) bytes failed verification due to a page checksum mismatch.  The expected checksum was 3956519118 (0x00000000ebd3b0ce) and the actual checksum was 3956519116 (0x00000000ebd3b0cc).  The read operation will fail with error -613 (0xfffffd9b).  If this condition persists then please restore the database from a previous backup.

Followed by one like this:

Event Type:      Error
Event Source:      ESE
Event Category:      Logging/Recovery
Event ID:      217
Date:            10/26/2009
Time:            11:47:48 PM
User:            N/A
Computer:      DC
Description:
Information Store (4980) First Storage Group: Error (-613) during backup of a database (file D:\Exchsrvr\MDBDATA\priv1.stm). The database will be unable to restore.

Or sometimes the 478 errors will be followed by one of these:

Event Type:      Error
Event Source:      ESE
Event Category:      Logging/Recovery
Event ID:      493
Date:            11/1/2009
Time:            12:00:59 AM
User:            N/A
Computer:      DC
Description:
Information Store (4692) A read operation on the file "D:\Exchsrvr\MDBDATA\priv1.stm" at offset 2083717120 (0x000000007c330000) for 65536 (0x00010000) bytes failed 1 times over an interval of 0.141 seconds before finally succeeding.  More specific information on these failures was reported previously.  Transient failures such as these can be a precursor to a catastrophic failure in the storage subsystem containing this file.

In which case the backup does not fail (event 217 is not logged).

This looks like a hardware problem, except that the array passes the PERC's consistency check, CHKDSK D: /F /R doesn't show anything worse than "minor inconsistencies", and ESEUTIL /G reports no problems. Furthermore, these ESE errors only ever occur during NTBackup and only on the priv1.stm database. The offset is different each time. (That is, on any given backup there may be multiple 478s and sometimes a 493, all at the same offset, but on another backup the errors will all refer to a different offset.)
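
One cheap check on the logged numbers is to see how far apart the expected and actual checksums are. The sketch below parses the Event 478 description quoted above and reports the raw block index within the file (offset divided by read size, which is not necessarily ESE's own logical page number) and the XOR of the two checksums. The regular expression and the `summarize_478` helper are illustrative names, not part of any Exchange tooling.

```python
import re

# Description text of one Event ID 478, copied from the event log above.
EVENT_478 = (
    'Information Store (4980) The streaming page read from the file '
    '"D:\\Exchsrvr\\MDBDATA\\priv1.stm" at offset 1370796032 (0x0000000051b4b000) '
    'for 4096 (0x00001000) bytes failed verification due to a page checksum mismatch.  '
    'The expected checksum was 3956519118 (0x00000000ebd3b0ce) and the actual '
    'checksum was 3956519116 (0x00000000ebd3b0cc).'
)

# Hypothetical pattern written for this message layout only.
PATTERN = re.compile(
    r'at offset (?P<offset>\d+) .*? for (?P<length>\d+) .*?'
    r'expected checksum was (?P<expected>\d+) .*?'
    r'actual checksum was (?P<actual>\d+)',
    re.DOTALL,
)


def summarize_478(description):
    """Extract offset and checksums from a 478 description and compare them."""
    m = PATTERN.search(description)
    if m is None:
        raise ValueError("not an ESE 478 checksum-mismatch description")
    offset = int(m["offset"])
    length = int(m["length"])
    diff = int(m["expected"]) ^ int(m["actual"])
    return {
        "block_index": offset // length,        # raw 4 KB block within the file
        "aligned": offset % length == 0,        # read starts on a block boundary
        "xor": diff,                            # bits that differ between checksums
        "bits_differing": bin(diff).count("1"),
    }


if __name__ == "__main__":
    info = summarize_478(EVENT_478)
    print(info["block_index"], hex(info["xor"]), info["bits_differing"])
```

For the event quoted above, the two checksums differ in exactly one bit (XOR of 0x2) at block 334667. If the streaming-file checksum is a simple XOR-style checksum (an assumption here, not something the log confirms), a one-bit difference would be consistent with a single flipped bit somewhere on the read path, which leans toward memory, controller, or disk rather than logical corruption. It is a hint, not proof.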

The server is generally quite stable, neither the OS nor Exchange has ever crashed, and users do not complain of any lost or corrupted emails.

Does anyone know what's happening here? Can anyone suggest a strategy for narrowing down the problem to either hardware or software?

Thanks in advance for your attention!
ASKER CERTIFIED SOLUTION
Julian123

jeff1946

ASKER

Doesn't ESEUTIL /G do a "dry run" of ESEUTIL /P and/or ESEUTIL /K?
/G does the repair portion. If you already did /G and it reported no errors then you can skip /P
Excuse me, I mean "/P does the repair portion. " If you already did /G and it reported no errors then you can skip /P
I knew what you meant 8^) . But ESEUTIL /G also reports that it's validating the checksums, so I'm not sure that ESEUTIL /K would do any good either ... ???
Microsoft has unfortunately never been clear that /G does the same checks as /K. If you are looking to fully rule out Exchange database corruption as the cause, running /K will help you be 100% sure so you can focus on other areas (such as hardware).
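
For reference, a checksum pass could be scripted roughly as follows. This is a minimal sketch under several assumptions: the store is dismounted first, `eseutil` is reachable on the PATH (it normally lives in the Exchange `bin` folder), and `priv1.edb` sits next to the `priv1.stm` named in the event log. The thread only mentions the .stm file, so the .edb path is a guess. To my understanding, running /K against the .edb on Exchange 2003 also verifies the matching streaming file, but check the console output to confirm.

```python
import subprocess

# Assumption: priv1.edb lives beside the priv1.stm shown in the event log.
DATABASE = r"D:\Exchsrvr\MDBDATA\priv1.edb"


def build_checksum_command(database, eseutil="eseutil"):
    """Return the argument list for an ESEUTIL checksum (/K) pass."""
    return [eseutil, "/k", database]


def run_checksum(database, eseutil="eseutil"):
    """Run the checksum pass and return (exit_code, console_output)."""
    result = subprocess.run(
        build_checksum_command(database, eseutil),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout + result.stderr
```

Calling `run_checksum(DATABASE)` during a maintenance window and saving the output gives a record to compare against the offsets logged in the 478 events.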
Dell gave me different advice to the same end: make ExMerge backups of all the mailboxes, "dialtone" the Exchange database, then repopulate it from the ExMerge PST backups. I've scheduled a maintenance window tonight to do it.
I did as Dell suggested and there haven't been any more errors in over a week. The problem was always intermittent, so the results so far are not really definitive, but they are certainly hopeful. I'm going to close this question, but if the error rears its ugly head again, I'll re-open it.
Julian123's solution probably would have worked too, but Dell's solution was simpler.