We have a bizarre case: a batch of email that was queued on a fault-tolerant server during Exchange downtime has since been delivered, but the users can't see any of it, even though the Message Tracking Center confirms delivery to the Information Store. The sequence of events follows.
Yesterday at 11am, we received a hardware error on a Dell PowerEdge 2650 running a 5-drive RAID array. The server runs Windows Server 2003 Standard and Exchange 2003 Enterprise Edition and is the only server in the domain, so it is also the DC. From the time the hardware error was reported, email stopped working, and our mail started queuing at our fault tolerance service provider, Postini. Over the next 24 hours, we did the following:
1. Rebooted the server, which confirmed that the array was degraded with what appeared to be the loss of a hard disk;
2. Waited for the array to rebuild for approximately 5 hours;
3. Once completed, we performed a consistency check, shut down the server, re-seated the drives and rebooted;
4. When the server rebooted, the RAID messages in the BIOS confirmed that the RAID was once again back in optimal state. We confirmed that all hardware was reporting as healthy and no error messages existed.
5. We immediately noticed that the operating system was extremely sluggish, and that the server would take approximately 50 minutes to get past the "Applying Computer Settings" window during boot.
6. Finally, at 11pm we called the Microsoft escalation team, and they worked on the problem from 11pm through 7:30am this morning, starting with the applications group (to fix the services that were dying -- anything that I tried to start or stop would freeze the machine), then moving to the IIS group, since IIS turned out to be the application responsible for the severe delay in the boot sequence.
Having fixed IIS, I logged into Postini and, seeing that mail appeared to be operational, started to unspool the queued messages to our mail server. However, I noticed that no one seemed to be receiving email; instead, people were getting an error message in Outlook that read (0x80004005) 'The Operation Failed'. When we traced the messages in Exchange Message Tracking, we could clearly see that all of them had been routed to the right people, each with the status that it was delivered to the information store.
As a result of this error, we continued to work with Microsoft (the Exchange team this time) from 7:30am until 11am, when we were able to resolve the problem. It turned out to be a problem with the Default Global Address List, which we finally fixed by working through a series of KB articles and EE articles. At that time, mail started *actually* flowing again. However, we are now working to get back the email that was delivered -- and still shows up as having been delivered -- to the information store. Basically, we are unable to access any email from the time the hard drive errored out yesterday through the time when mail started flowing successfully again, a period of 24 hours.
We have tried almost everything imaginable. We ran a backup at about 2pm yesterday, as soon as we could, but of course mail was being held at Postini at that time, so the backup does not contain the missing messages. Here's what we are thinking of doing next, unless someone happens to have seen this before. Microsoft reports that they have not.
1. Back up the information store and try to restore it to a different server;
2. Use ExMerge to attempt to export all mailboxes to PSTs to see if that helps;
3. Run ESEUTIL to see if a defrag would help at all.
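For option 3, here is a minimal sketch of the order in which we would expect to run ESEUTIL: the read-only checks (`/mh` header dump, then `/g` integrity check) before any `/d` offline defrag. The install and database paths are assumptions based on a default Exchange 2003 layout, not our confirmed paths, and the function name `eseutil_plan` is just for illustration. The script only prints the command lines; it does not run anything. The store would need to be dismounted first, and an offline defrag needs free disk space roughly equal to the size of the database.

```python
from pathlib import PureWindowsPath

# Assumption: default Exchange 2003 install locations; adjust to the real ones.
ESEUTIL = PureWindowsPath(r"C:\Program Files\Exchsrvr\bin\eseutil.exe")
DATABASE = PureWindowsPath(r"C:\Program Files\Exchsrvr\MDBDATA\priv1.edb")


def eseutil_plan(database, include_defrag=False):
    """Return ESEUTIL command lines, read-only checks first.

    Nothing is executed here; the caller decides whether to run them.
    """
    db = str(database)
    plan = [
        [str(ESEUTIL), "/mh", db],  # dump the header (shutdown state, log info)
        [str(ESEUTIL), "/g", db],   # integrity check; does not modify the database
    ]
    if include_defrag:
        # Offline defrag rewrites the database, so it goes last and is opt-in.
        plan.append([str(ESEUTIL), "/d", db])
    return plan


if __name__ == "__main__":
    for cmd in eseutil_plan(DATABASE, include_defrag=True):
        # Quote any argument containing spaces so the line can be pasted into cmd.exe.
        print(" ".join(f'"{part}"' if " " in part else part for part in cmd))
```

Keeping the defrag behind an explicit flag reflects the idea that we would want the header dump and integrity results in hand, and a fresh backup taken, before touching the database.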
My only concern is that we are grabbing blindly at ideas that seem pretty far-fetched and are not based on best practices or personal experience. Both we and Microsoft are simply stumped.
Thank you for your collective help!