Windows 2008 R2 cluster + Exchange DAG node issues after reboot

Hi guys,

We have been running Exchange 2010 in a DAG setup using 2 servers (Windows 2008 R2 Enterprise) in a cluster for over a year now.
We now have a problem: we rebooted one of the 2 nodes in this Windows 2008 R2 SP1 / Exchange 2010 DAG setup, and since the reboot the node is failing to join the cluster!

Event logs reported issues with not being able to see the file share witness server, but in Failover Cluster Manager, under Cluster Core Resources, both the cluster name - DAG and the witness server are showing as online! The node itself, under Nodes, is unavailable.

If someone could please help, that would be great. This is a production server, so we are keen to get it working again.

We have not attempted to reboot the other node in case we end up in a worse situation than we are now.

Many thanks

Jim
macleandata asked:

Larry Larmeu, Principal Consultant, commented:
Have you tried removing the node from the DAG and adding it back?

Also make sure you can ping/browse the file share witness server from the affected node.
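
For anyone following along, the checks above could be run from the affected node roughly as follows. This is only a sketch: it assumes the Exchange Management Shell on Exchange 2010 SP1, and `FSW01` and the share path are hypothetical placeholders, not values taken from this thread.

```powershell
# Hypothetical witness server and share; substitute your own values.
$witnessServer = 'FSW01'
$witnessShare  = '\\FSW01\DAG1.example.com'

# Basic reachability of the witness server from this node.
Test-Connection -ComputerName $witnessServer -Count 2

# Can this node actually open the witness share?
Test-Path -Path $witnessShare

# What Exchange itself reports for the witness and the operational members.
Get-DatabaseAvailabilityGroup -Identity DAG1 -Status |
    Format-List Name, Servers, OperationalServers, WitnessServer, WitnessDirectory, WitnessShareInUse
```

Comparing `Servers` with `OperationalServers` should show whether Exchange agrees with Failover Cluster Manager about which node is down.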
macleandata, Author, commented:
Hi,

Thanks for your prompt response.

We can ping/browse the file share, but we have not yet removed the node from the DAG. To be honest, as our knowledge of clusters and DAGs is limited, we did not want to make the situation any worse! So to clarify: from EMC, remove each failed copy from each database (7 in total), then under Manage Database Availability Group Membership remove the node?

Thanks

Jim
Larry Larmeu, Principal Consultant, commented:
Yes, that would be the process to remove the node from the cluster. It should not affect your other node, which is running the active copies. Before you do that, you may want to try flipping your primary and secondary witness servers, or specifying a new witness server, to see if that helps.
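
In shell terms, the two options discussed here might look roughly like the sketch below. It is not a tested procedure: `FSW02` and the witness directory are hypothetical, `MHMEXCH20` is the failed node as named later in this thread, and `-WhatIf` is left on so nothing changes until it is removed.

```powershell
# Option 1: point the DAG at a different witness server (hypothetical names).
Set-DatabaseAvailabilityGroup -Identity DAG1 -WitnessServer FSW02 `
    -WitnessDirectory 'C:\DAGWitness\DAG1' -WhatIf

# Option 2: remove the passive copies hosted on the failed node...
Get-MailboxDatabaseCopyStatus -Server MHMEXCH20 |
    Where-Object { $_.Status -ne 'Mounted' } |
    ForEach-Object { Remove-MailboxDatabaseCopy -Identity $_.Name -WhatIf }

# ...and then remove the node from the DAG.
Remove-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer MHMEXCH20 -WhatIf
```

The `Where-Object` filter is there only as a guard against touching a mounted (active) copy by mistake.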
macleandata, Author, commented:
OK thanks, we do not currently have a secondary witness server set up; it is something we were looking to do. I'll try this first, then remove the node and add it back in.
Thanks
macleandata, Author, commented:
Hi,

OK, neither option worked. When removing the node, we got this error:

Summary: 1 item(s). 0 succeeded, 1 failed.
Elapsed time: 00:00:08


MHMEXCH20
Failed

Error:
There was a problem changing the quorum model for database availability group DAG1. Error: An Active Manager operation failed. Error: An error occurred while attempting a cluster operation. Error: Cluster API '"SetClusterQuorumResource() failed with 0x1725. Error: A quorum of cluster nodes was not present to form a cluster"' failed..
Click here for help... http://technet.microsoft.com/en-US/library/ms.exch.err.default(EXCHG.140).aspx?v=14.1.285.0&t=exchgf1&e=ms.exch.err.Ex7B51A5

Warning:
The operation wasn't successful because an error was encountered. You may find more details in log file "C:\ExchangeSetupLogs\DagTasks\dagtask_2012-09-05_16-05-42.023_remove-databaseavailabiltygroupserver.log".


Exchange Management Shell command attempted:
Remove-DatabaseAvailabilityGroupServer -MailboxServer 'MHMEXCH20' -Identity 'DAG1'

Elapsed Time: 00:00:09


Thanks

Jim
macleandata, Author, commented:
Just to update you:

Just looked into this error and found this site - http://exchangeserverpro.com/unable-remove-failed-server-dag-exchange-server-2010

We have managed to remove the node from the DAG successfully now; we are just adding it back in.
Larry Larmeu, Principal Consultant, commented:
Try this from the shell:

Remove-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer MHMEXCH20 -ConfigurationOnly
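
For context, `-ConfigurationOnly` removes the server from the DAG object in Active Directory without going through the cluster, so the cluster itself may still list the node afterwards. A commonly described follow-up is to evict the stale node from a surviving member. The sketch below assumes the FailoverClusters PowerShell module that ships with Windows Server 2008 R2; whether this step was needed here is not confirmed anywhere in this thread.

```powershell
# Run on the surviving DAG member.
Import-Module FailoverClusters

# See which nodes the cluster still knows about, and their state.
Get-ClusterNode | Format-Table Name, State -AutoSize

# Evict the failed node from the cluster if it is still listed.
Remove-ClusterNode -Name MHMEXCH20 -Force
```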
Larry Larmeu, Principal Consultant, commented:
Ah - beat me to it.
macleandata, Author, commented:
Good old Google ;-) OK, thanks.

OK, adding the node failed :-(

The error is below, and I have attached the log file mentioned in it. Not sure whether you could look through it to see if you spot anything that sticks out as an issue:


MHMEXCH20
Failed

Error:
A server-side database availability group administrative operation failed. Error: The operation failed. CreateCluster errors may result from incorrectly configured static addresses. Error: An error occurred while attempting a cluster operation. Error: Cluster API '"AddClusterNode() (MaxPercentage=100) failed with 0x5b4. Error: This operation returned because the timeout period expired"' failed. [Server: MHMEXCH21.themovefactory.com]

An Active Manager operation failed. Error: An error occurred while attempting a cluster operation. Error: Cluster API '"AddClusterNode() (MaxPercentage=100) failed with 0x5b4. Error: This operation returned because the timeout period expired"' failed..

This operation returned because the timeout period expired
Click here for help... http://technet.microsoft.com/en-US/library/ms.exch.err.default(EXCHG.140).aspx?v=14.1.285.0&t=exchgf1&e=ms.exch.err.ExC9C315

Warning:
The operation wasn't successful because an error was encountered. You may find more details in log file "C:\ExchangeSetupLogs\DagTasks\dagtask_2012-09-05_16-14-16.997_add-databaseavailabiltygroupserver.log".


Exchange Management Shell command attempted:
Add-DatabaseAvailabilityGroupServer -MailboxServer 'MHMEXCH20' -Identity 'DAG1'



Many thanks

Jim
logfile.log
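
The hex codes in the two DAG task errors are ordinary Win32 error codes (0x1725 is 5925 decimal, 0x5b4 is 1460). As a small sketch, they can be turned back into text on any Windows PowerShell prompt; the messages should correspond to the ones quoted in the errors above.

```powershell
# Translate the Win32 error codes from the DAG task errors into text.
foreach ($code in 0x1725, 0x5b4) {
    $message = ([System.ComponentModel.Win32Exception]$code).Message
    '0x{0:X} ({0}): {1}' -f $code, $message
}
```

This does not diagnose anything by itself, but it confirms that the first failure was a quorum problem and the second a plain timeout while adding the node to the cluster.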
Larry Larmeu, Principal Consultant, commented:
Looking at the log and doing some research, it looks like most people's recommendation for this error is to dissolve the DAG and create a new DAG with a different name. It seems the cluster configuration has some kind of corruption.
macleandata, Author, commented:
;-( OK, that sounds quite drastic. Do you have any suggestions for a clean dissolve? We also have a problem at the moment where we can't back up either, as Backup Exec 2012 only sees the DAG.

If you're able to assist in any way, that would be appreciated.

Thanks for your help so far, really is appreciated!

Jim
Larry Larmeu, Principal Consultant, commented:
Why are you not able to back up the working node?
Simon Butler (Sembee), Consultant, commented:
What is the situation with the DAG at the moment? Does Exchange still see the DAG? Does it see the members?
You need to get the members out, which means you need to remove the database copies.
Once you have got the DAG out then you can recreate it.

A lot of these problems are due to DNS issues, where the DAG name doesn't resolve correctly.

Simon.
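
A rough outline of the dissolve-and-recreate sequence described above, as a sketch only. `DB01`, `DAG2`, `FSW01`, the witness directory and the IP address are hypothetical placeholders; `MHMEXCH20` and `MHMEXCH21` are the two servers named in this thread. Take a good backup before trying anything like this.

```powershell
# 1. Remove any remaining passive database copies (repeat per database).
Remove-MailboxDatabaseCopy -Identity 'DB01\MHMEXCH20'

# 2. Remove the last member, then the now-empty DAG.
Remove-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer MHMEXCH21
Remove-DatabaseAvailabilityGroup -Identity DAG1

# 3. Create a new DAG under a different name (hypothetical witness and IP).
New-DatabaseAvailabilityGroup -Name DAG2 -WitnessServer FSW01 `
    -WitnessDirectory 'C:\DAGWitness\DAG2' `
    -DatabaseAvailabilityGroupIpAddresses 192.0.2.10

# 4. Add both servers back as members.
Add-DatabaseAvailabilityGroupServer -Identity DAG2 -MailboxServer MHMEXCH21
Add-DatabaseAvailabilityGroupServer -Identity DAG2 -MailboxServer MHMEXCH20

# 5. Re-create the passive copies (repeat per database); this triggers a reseed.
Add-MailboxDatabaseCopy -Identity DB01 -MailboxServer MHMEXCH20
```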
macleandata, Author, commented:
Backup Exec fails when backing up the DAG selection, and if you drill into the physical server, you only see 3 folders: Address, messagemanager and replay. However, I have just restarted the BE agent on the active DB server and I'm now seeing database locations and log files. Not the information store though, which I can now see on the node that is now out of the DAG. So I'll back up the DB and logs tonight; at least we will have a DB copy of some sort.

If we were to reboot the active node, how would this affect the DBs and the DAG once rebooted? I know a DAG can withstand a single node failure (with the witness and second node live), but I'm not sure whether a reboot of the remaining node would cause problems.

Wading through info on dissolving the DAG and rebuilding it!!

Jim
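
Before rebooting the remaining node, it may help to look at what the cluster currently believes about quorum and membership. The sketch below assumes the FailoverClusters module on Windows Server 2008 R2 plus the Exchange Management Shell, run on the active node; it only reads state and changes nothing.

```powershell
Import-Module FailoverClusters

# Which quorum model is in use, and which resource acts as the witness?
Get-ClusterQuorum | Format-List Cluster, QuorumResource, QuorumType

# Which nodes does the cluster still list, and are they up?
Get-ClusterNode | Format-Table Name, State -AutoSize

# Exchange's own view of the DAG.
Get-DatabaseAvailabilityGroup -Identity DAG1 -Status |
    Format-List Name, OperationalServers, PrimaryActiveManager, WitnessShareInUse
```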
Larry Larmeu, Principal Consultant, commented:
Are you sure your DAG networks are set up correctly?  Can you ping the DAG cluster IP addresses and ping DAG1?
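
A minimal sketch of those checks, assuming they are run from the Exchange Management Shell on one of the DAG members:

```powershell
# Does the DAG name resolve, and to which address?
[System.Net.Dns]::GetHostAddresses('DAG1')

# Does the DAG cluster name answer pings?
Test-Connection -ComputerName DAG1 -Count 2

# How are the DAG networks defined (subnets, interfaces, MAPI vs replication)?
Get-DatabaseAvailabilityGroupNetwork -Identity DAG1 |
    Format-List Name, Subnets, Interfaces, ReplicationEnabled, MapiAccessEnabled
```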
macleandata, Author, commented:
Hi both,

Yes, DAG1 resolves to an IP and pings. Looking at the ARP table, it is as expected: we are pinging the active server.

Simon - the current status of the DAG is that Exchange can see the DAG and the only node now remaining. We removed the failing node, after first deleting its DB copies.

With regard to the backup, I would have expected the DAG selection to continue to back up the remaining node, which has the active DBs on it. That it doesn't points to an issue with the DAG.

Thanks

Jim
Larry Larmeu, Principal Consultant, commented:
I hate to tell you to proceed without a backup.  Do you have a way of doing a snapshot or something like that before you proceed?
macleandata, Author, commented:
Still looking for another way to back up the live node using BE2012, but for tonight the only thing I can see is to back up the .edb and log files, just so we at least have (I hope) something if anything should go wrong!
macleandata, Author, commented:
Small update: I can now see the information store in the DAG in BE2012 :-) which is looking at the active node, phew, so at least we can now get a clean backup and the logs will be flushed.

Will spend time today diagnosing the current DAG to see if we can re-introduce the failed node, although it's not like we are adding a cleanly rebuilt node; it has been a part of the DAG.

Thanks so far for everyone's help. If anyone has any other ideas, please share them; as a last resort we will rebuild the DAG, I guess.

macleandata, Author, commented:
Managed to eventually break the DAG and rebuild it.