Exchange 2010 - Disk IO errors causing database issues on both servers in DAG

Edman555
Edman555 used Ask the Experts™
on
We are in the process of migrating exchagne 2003 to 2010.  We purchased two new IBM x3650 m3 servers, installed Win Server 2008r2 enterprise, installed exchange 2010, configured DAG.  Everything is happy, the new servers are even sending out e-mail.
Added integration with Cisco Unity 7, everything still working fine.
I started migrating a handful of users, and saw the the process stopped halfway though a user.  I checked and the secondary database says failed and suspended.  Right click on it and say resume, and it re-synches.  Users again start moving from the old 2003 to the new 2010 servers.  This happens 3 or 4 times in the process of moving 10 users.  And then I notice that it's happening throughout the day as well, sometimes to the active database, and then it fails over to the backup.

IN event logs, I'm getting a ton of IO errors.  I open a case with IBM, they check it over and say there is no hardware issue.  I'm inclined to believe them because i'm getting errors on both servers.  How can this be?  These are some of the errors I"m getting in the application event log.

Log Name:      Application
Source:        MSExchange Mailbox Replication
Date:          5/16/2011 3:32:13 PM
Event ID:      1007
Task Category: Service
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      PGTEXCHANGE1.pgt.com
Description:
The Mailbox Replication service was unable to determine the set of active mailbox databases on a mailbox server.
Mailbox server: PGTEXCHANGE1.pgt.com
Error: MapiExceptionNetworkError: Unable to make admin interface connection to server. (hr=0x80040115, ec=-2147221227)
Diagnostic context:
    ......
    Lid: 10648   dwParam: 0x6D9      Msg: EEInfo: Generating component: 2
    Lid: 14744   dwParam: 0x6D9      Msg: EEInfo: Status: 1753
    Lid: 9624    dwParam: 0x6D9      Msg: EEInfo: Detection location: 501
    Lid: 13720   dwParam: 0x6D9      Msg: EEInfo: Flags: 0
    Lid: 11672   dwParam: 0x6D9      Msg: EEInfo: NumberOfParameters: 4
    Lid: 8856    dwParam: 0x6D9      Msg: EEInfo: prm[0]: Unicode string: ncacn_ip_tcp
    Lid: 8856    dwParam: 0x6D9      Msg: EEInfo: prm[1]: Unicode string: PGTEXCHANGE1.pgt.com
    Lid: 12952   dwParam: 0x6D9      Msg: EEInfo: prm[2]: Long val: -1988875570
    Lid: 12952   dwParam: 0x6D9      Msg: EEInfo: prm[3]: Long val: 382312662
    Lid: 24060   StoreEc: 0x80040115
    Lid: 23746  
    Lid: 31938   StoreEc: 0x80040115
    Lid: 19650  
    Lid: 27842   StoreEc: 0x80040115
    Lid: 20866  
    Lid: 29058   StoreEc: 0x80040115
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="MSExchange Mailbox Replication" />
    <EventID Qualifiers="32772">1007</EventID>
    <Level>3</Level>
    <Task>1</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2011-05-16T19:32:13.000000000Z" />
    <EventRecordID>2987756</EventRecordID>
    <Channel>Application</Channel>
    <Computer>PGTEXCHANGE1.pgt.com</Computer>
    <Security />
  </System>
  <EventData>
    <Data>PGTEXCHANGE1.pgt.com</Data>
    <Data>MapiExceptionNetworkError: Unable to make admin interface connection to server. (hr=0x80040115, ec=-2147221227)
Diagnostic context:
    ......
    Lid: 10648   dwParam: 0x6D9      Msg: EEInfo: Generating component: 2
    Lid: 14744   dwParam: 0x6D9      Msg: EEInfo: Status: 1753
    Lid: 9624    dwParam: 0x6D9      Msg: EEInfo: Detection location: 501
    Lid: 13720   dwParam: 0x6D9      Msg: EEInfo: Flags: 0
    Lid: 11672   dwParam: 0x6D9      Msg: EEInfo: NumberOfParameters: 4
    Lid: 8856    dwParam: 0x6D9      Msg: EEInfo: prm[0]: Unicode string: ncacn_ip_tcp
    Lid: 8856    dwParam: 0x6D9      Msg: EEInfo: prm[1]: Unicode string: PGTEXCHANGE1.pgt.com
    Lid: 12952   dwParam: 0x6D9      Msg: EEInfo: prm[2]: Long val: -1988875570
    Lid: 12952   dwParam: 0x6D9      Msg: EEInfo: prm[3]: Long val: 382312662
    Lid: 24060   StoreEc: 0x80040115
    Lid: 23746  
    Lid: 31938   StoreEc: 0x80040115
    Lid: 19650  
    Lid: 27842   StoreEc: 0x80040115
    Lid: 20866  
    Lid: 29058   StoreEc: 0x80040115</Data>
  </EventData>
</Event>


Log Name:      Application
Source:        MSExchangeRepl
Date:          5/16/2011 3:02:40 PM
Event ID:      4082
Task Category: Service
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PGTEXCHANGE1.pgt.com
Description:
The replication network manager encountered an error while monitoring events. Error: Microsoft.Exchange.Cluster.Replay.AmClusterApiException: An Active Manager operation failed. Error: An error occurred while attempting a cluster operation. Error: Cluster API '"OpenCluster(PGTEXCHANGE1.pgt.com) failed with 0x6d9. Error: There are no more endpoints available from the endpoint mapper"' failed.. ---> System.ComponentModel.Win32Exception: There are no more endpoints available from the endpoint mapper
   --- End of inner exception stack trace ---
   at Microsoft.Exchange.Cluster.Replay.NetworkManager.DriveMapRefresh()
   at Microsoft.Exchange.Cluster.Replay.NetworkManager.TryDriveMapRefresh()
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="MSExchangeRepl" />
    <EventID Qualifiers="49156">4082</EventID>
    <Level>2</Level>
    <Task>1</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2011-05-16T19:02:40.000000000Z" />
    <EventRecordID>2987707</EventRecordID>
    <Channel>Application</Channel>
    <Computer>PGTEXCHANGE1.pgt.com</Computer>
    <Security />
  </System>
  <EventData>
    <Data>Microsoft.Exchange.Cluster.Replay.AmClusterApiException: An Active Manager operation failed. Error: An error occurred while attempting a cluster operation. Error: Cluster API '"OpenCluster(PGTEXCHANGE1.pgt.com) failed with 0x6d9. Error: There are no more endpoints available from the endpoint mapper"' failed.. ---&gt; System.ComponentModel.Win32Exception: There are no more endpoints available from the endpoint mapper
   --- End of inner exception stack trace ---
   at Microsoft.Exchange.Cluster.Replay.NetworkManager.DriveMapRefresh()
   at Microsoft.Exchange.Cluster.Replay.NetworkManager.TryDriveMapRefresh()</Data>
  </EventData>
</Event>


Log Name:      Application
Source:        Application Error
Date:          5/16/2011 3:02:36 PM
Event ID:      1005
Task Category: (100)
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PGTEXCHANGE1.pgt.com
Description:
Windows cannot access the file C:\Windows\Cluster\clussvc.exe for one of the following reasons: there is a problem with the network connection, the disk that the file is stored on, or the storage drivers installed on this computer; or the disk is missing. Windows closed the program Microsoft Failover Cluster Service because of this error.

Program: Microsoft Failover Cluster Service
File: C:\Windows\Cluster\clussvc.exe

The error value is listed in the Additional Data section.
User Action
1. Open the file again. This situation might be a temporary problem that corrects itself when the program runs again.
2. If the file still cannot be accessed and
      - It is on the network, your network administrator should verify that there is not a problem with the network and that the server can be contacted.
      - It is on a removable disk, for example, a floppy disk or CD-ROM, verify that the disk is fully inserted into the computer.
3. Check and repair the file system by running CHKDSK. To run CHKDSK, click Start, click Run, type CMD, and then click OK. At the command prompt, type CHKDSK /F, and then press ENTER.
4. If the problem persists, restore the file from a backup copy.
5. Determine whether other files on the same disk can be opened. If not, the disk might be damaged. If it is a hard disk, contact your administrator or computer hardware vendor for further assistance.

Additional Data
Error value: C0000185
Disk type: 3
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Application Error" />
    <EventID Qualifiers="49152">1005</EventID>
    <Level>2</Level>
    <Task>100</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2011-05-16T19:02:36.000000000Z" />
    <EventRecordID>2987696</EventRecordID>
    <Channel>Application</Channel>
    <Computer>PGTEXCHANGE1.pgt.com</Computer>
    <Security />
  </System>
  <EventData>
    <Data>C:\Windows\Cluster\clussvc.exe</Data>
    <Data>Microsoft Failover Cluster Service</Data>
    <Data>C0000185</Data>
    <Data>3</Data>
  </EventData>
</Event>

Log Name:      Application
Source:        ESE
Date:          5/16/2011 10:33:07 AM
Event ID:      482
Task Category: General
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PGTEXCHANGE1.pgt.com
Description:
edgetransport (804) Transport Mail Database: An attempt to write to the file "C:\Program Files\Microsoft\Exchange Server\V14\TransportRoles\data\Queue\trn.log" at offset 3878400 (0x00000000003b2e00) for 512 (0x00000200) bytes failed after 0 seconds with system error 1117 (0x0000045d): "The request could not be performed because of an I/O device error. ".  The write operation will fail with error -1022 (0xfffffc02).  If this error persists then the file may be damaged and may need to be restored from a previous backup.

For more information, click http://www.microsoft.com/contentredirect.asp.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="ESE" />
    <EventID Qualifiers="0">482</EventID>
    <Level>2</Level>
    <Task>1</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2011-05-16T14:33:07.000000000Z" />
    <EventRecordID>2987335</EventRecordID>
    <Channel>Application</Channel>
    <Computer>PGTEXCHANGE1.pgt.com</Computer>
    <Security />
  </System>
  <EventData>
    <Data>edgetransport</Data>
    <Data>804</Data>
    <Data>Transport Mail Database: </Data>
    <Data>C:\Program Files\Microsoft\Exchange Server\V14\TransportRoles\data\Queue\trn.log</Data>
    <Data>3878400 (0x00000000003b2e00)</Data>
    <Data>512 (0x00000200)</Data>
    <Data>-1022 (0xfffffc02)</Data>
    <Data>1117 (0x0000045d)</Data>
    <Data>The request could not be performed because of an I/O device error. </Data>
    <Data>0</Data>
  </EventData>
</Event>
Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Are you using an iSCSI SAN for the mailbox databases?

Author

Commented:
No, we have five 10k SAS drives locally on the machine.  two in RAID1 for the OS and exchange logs, and three in raid5 for the exchange database.  RAID controller is IBM ServeRAID m1015 SAS/SATA Controller.  I have updated the firmware and am using the latest drivers.
Ok, well that rules that out :)

Have you run perfmon on the server to gather some performance metrics?

Processor Queue Length, Threads, Logical Disk metrics etc.
Success in ‘20 With a Profitable Pricing Strategy

Do you wonder if your IT business is truly profitable or if you should raise your prices? Learn how to calculate your overhead burden using our free interactive tool and use it to determine the right price for your IT services. Start calculating Now!

Author

Commented:
Avg. Disk Queue Length gradually increases until it's maxed out, then this error happens


Log Name:      Application
Source:        ESE
Date:          5/17/2011 1:48:48 PM
Event ID:      482
Task Category: General
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PGTEXCHANGE1.pgt.com
Description:
Information Store (4184) PowerUsers: An attempt to write to the file "D:\powerusers\E03.log" at offset 179200 (0x000000000002bc00) for 512 (0x00000200) bytes failed after 0 seconds with system error 1117 (0x0000045d): "The request could not be performed because of an I/O device error. ".  The write operation will fail with error -1022 (0xfffffc02).  If this error persists then the file may be damaged and may need to be restored from a previous backup.

For more information, click http://www.microsoft.com/contentredirect.asp.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="ESE" />
    <EventID Qualifiers="0">482</EventID>
    <Level>2</Level>
    <Task>1</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2011-05-17T17:48:48.000000000Z" />
    <EventRecordID>2988869</EventRecordID>
    <Channel>Application</Channel>
    <Computer>PGTEXCHANGE1.pgt.com</Computer>
    <Security />
  </System>
  <EventData>
    <Data>Information Store</Data>
    <Data>4184</Data>
    <Data>PowerUsers: </Data>
    <Data>D:\powerusers\E03.log</Data>
    <Data>179200 (0x000000000002bc00)</Data>
    <Data>512 (0x00000200)</Data>
    <Data>-1022 (0xfffffc02)</Data>
    <Data>1117 (0x0000045d)</Data>
    <Data>The request could not be performed because of an I/O device error. </Data>
    <Data>0</Data>
  </EventData>
</Event>
This may be service affecting, but you can try a utility called Atto disk benchmark.

Sounds to me like a problem with the raid controller, possibly drivers?

Author

Commented:
I'll give it a whirl.  I verified with IBM's tech support that I have the latest firmware and drivers.
how many users do you have currently moved?

Author

Commented:
10 active users, and 20 other system related mailboxes.  
Can you run the Exchange best practises analyzer and see if you get any warnings or problems??  

This definately sounds hardware / driver related to me, especially with 10 active users.  Exchange 2010 is very lean on the number of IOPS required.

Author

Commented:
BPA gave a few warnings.  Two orphaned mailboxes, but those were recent due to failed user moves (on test accounts thankfully).


ExBPA.expex.201105171553505028.d.xml

Author

Commented:
Yea, this is a server issue, not an exchange issue.  We had a simmilar IO issue just doing a backup.  This also caused the exchange databases to dismount.

Log Name:      System
Source:        volsnap
Date:          5/18/2011 8:55:39 AM
Event ID:      14
Task Category: None
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      PGTEXCHANGE1.pgt.com
Description:
The shadow copies of volume C: were aborted because of an IO failure on volume C:.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="volsnap" />
    <EventID Qualifiers="49158">14</EventID>
    <Level>2</Level>
    <Task>0</Task>
    <Keywords>0x80000000000000</Keywords>
    <TimeCreated SystemTime="2011-05-18T12:55:39.174658600Z" />
    <EventRecordID>8009</EventRecordID>
    <Channel>System</Channel>
    <Computer>PGTEXCHANGE1.pgt.com</Computer>
    <Security />
  </System>
  <EventData>
    <Data>\Device\HarddiskVolumeShadowCopy13</Data>
    <Data>C:</Data>
    <Data>C:</Data>
    <Binary>0000000003003000000000000E0006C009000000850100C005000000000000000000000000000000</Binary>
  </EventData>
</Event>

Author

Commented:
Still having this issue.  Server performs just fine until exchange is installed.  Just to verify that's it's not me, I grabbed a desktop computer and installed using the same media, and that exchange server has no issues.  I added it to the DAG, and every morning it has all of the databases on it becuase the two real servers fail at some point throughout the night.

Been working with IBM's support, but can't seem to figure out what is happening.  Any assistance would be appreciated.

Author

Commented:
And the solution was, the Ibm 1015m controller just wouldn't work in this configuration.  We replaced both servers with 5015 controllers and it works perfectly, even without the cache enabled.  I don't believe it was a performance issue, I think it just had to do with running exchange DAG with local drives attached to the 1015M controller.  

Author

Commented:
RAID controller was the problem.  The fix was to just replace it with a different controller on both servers.

Commented:
Have you check the Exchange server disk performance....you can use " ExPerfWiz" PowerShell base script has all required disk parameters for Exchange.
 .\experfwiz.ps1 -duration 02:00:00 -interval 5 -filepath D:\Logs

Script output file you can import into PAL to have batter understanding of server performance. Accordingly take further decision.

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial