We help IT Professionals succeed at work.

HELP !!! Exchange 2010 SP1 DAG + cluster broken

Hi,

We are in panic mode! Our configuration is (was):
- all Exchange servers running on VMware ESXi hosts,
- 2 CAS servers, W2K8 SP1,
- 4 mailbox servers, W2K8 SP1,
- 1 DAG, 3 databases, copied across the 4 mailbox servers, each server having 1 iSCSI drive for data and 1 iSCSI drive for logs,
- 2 SANs running Nexenta to store the iSCSI drives.

We tried to add a new member to the DAG. It completely crashed the DAG, or in fact the cluster beneath it, we suppose...

We have stopped/started the mailbox servers several dozen times, but nothing comes back. We are now stuck at the iSCSI level, where we are unable to reactivate the drives:
- on each mailbox server, the persistent list of iSCSI LUNs/devices is not consistent with the list of drives in "Disk Management". The latter sees more drives than it is supposed to (it lists LUNs belonging to another mailbox server), and all drives are "Offline" and "Reserved". We tried to bring them online with diskpart and to clear the attributes, but it always failed.
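For reference, the diskpart sequence we tried on each greyed-out disk was roughly the following (the disk number is just an example; it varies per server):

```powershell
# Run in an elevated PowerShell prompt on the affected mailbox server.
# This feeds diskpart a script equivalent to what we typed interactively.
@"
list disk
select disk 2
attributes disk clear readonly
online disk
"@ | diskpart
```

It reported the attribute change as successful but the disk went straight back to "Reserved".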

We have temporarily been able to access each .EDB file and each log folder, and copied all of it to a NAS.

Of course, we cannot use our BackupExec backups, as the DAG and the mailbox servers are all inconsistent.

Any ideas? How can we check the disks and remount them correctly? How can we make the cluster consistent again?

Maen Abu-Tabanjeh, Network Administrator, Network Consultant
Top Expert 2011

Commented:
Install the following latest updates to fix known issues with network services:

   979101 The command "netsh interface ipv4 dump" does not export the subnet mask setting in Windows 7, in Windows Server 2008 R2, in Windows Server 2008, and in Windows Vista

   http://support.microsoft.com/default.aspx?scid=kb;EN-US;979101

   981889 A Windows Filtering Platform (WFP) driver hotfix rollup package is available for Windows Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2

   http://support.microsoft.com/default.aspx?scid=kb;en-US;981889


I suggest adjusting the default heartbeat settings with the following commands.

If the cluster nodes are in the same subnet:

cluster /prop SameSubnetDelay=2000 (The default value is 1000 milliseconds; we could set it to 2000 milliseconds.)

cluster /prop SameSubnetThreshold=10 (The default value is 5; we could set it to 10.)

If your cluster nodes are on separate subnets, adjust the following values instead:

cluster /prop CrossSubnetDelay=2000

cluster /prop CrossSubnetThreshold=10

Check that all the network devices are configured correctly and use the same settings (speed, duplex, etc.).

Other suggestions to try:
- Disable TCP Chimney in the operating system by running netsh int tcp set global chimney=disabled
- Run netsh int tcp set global rss=disabled on both members.
- Disable NetDMA by following the steps in http://support.microsoft.com/kb/951037
- Disable IPv6 on both DAG members.
- Disable Checksum Offload on all the Windows Server 2008 machines via the properties of the LAN card.

Author

Commented:
Thanks !

For your first point, I have read the two links, but I see no corresponding event IDs in any of the event viewers. In fact, the cluster service sometimes stops on one mailbox server, but the cluster service on the other mailbox servers does not generate any errors.

I have applied the settings from your second point. Regarding the hotfixes, there are no download links...

Current situation, on the 4 mailbox servers, mbox1 to mbox4:
- mbox1 is dead (Windows 2008 dead), but it was removed from the DAG before dying,
- mbox2 to mbox4 have started and the cluster service is running,
- dag1 contains mbox2, 3 and 4 and sees them as online,
- mbox3 has 5 iSCSI devices mounted instead of 3 (2 of the mounted ones belong to mbox2),
- those 2 are mounted under drive letters not managed by us, and the drives contain a file named $UpgDrv$,
- mbox2 and mbox4 have no iSCSI devices mounted; they are still greyed out and "Reserved",
- Disk Management on mbox2 and mbox4 lists several devices, including some that do not belong to them.
Maen Abu-Tabanjeh, Network Administrator, Network Consultant
Top Expert 2011

Commented:
There is a hotfix; just look on the page, at the top, under the heading:

Hotfix Download Available
View and request hotfix downloads

Click on "View and request hotfix downloads" and they will send it to you by email.
Maen Abu-Tabanjeh, Network Administrator, Network Consultant
Top Expert 2011

Commented:
Same idea. Your problem sounds like a hotfix is needed; I hope it will work after the hotfix is applied. Good luck, and keep me up to date.
Commented:
Before you do anything, do not follow Jordannet's suggestions. Do not disable TCP Chimney on Windows Server 2008 or later, as this will generate huge CPU loads.

This article states that you should not change the TCP Offload or Chimney settings on 2008 or R2:

http://blogs.technet.com/b/exchange/archive/2011/11/14/time-to-revisit-recommendations-around-windows-networking-enhancements-usually-called-microsoft-scalable-networking-pack.aspx

Back to your issue.

You need to establish the cluster first of all. As you currently have 4 nodes, where is your file share witness? An even number of nodes needs an FSW to maintain quorum. The best place for the FSW would be a CAS or Hub server.
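A quick way to check, and if needed set, the witness from the Exchange Management Shell would be something like this (the DAG name, server name and directory here are placeholders for your own):

```powershell
# Show where the witness currently lives and whether it is in use
Get-DatabaseAvailabilityGroup -Identity DAG1 -Status |
    Format-List Name, WitnessServer, WitnessDirectory

# Point the witness at a CAS/Hub server (server name and path are examples)
Set-DatabaseAvailabilityGroup -Identity DAG1 `
    -WitnessServer CAS1 -WitnessDirectory C:\FSW_DAG1
```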

Secondly, the fact that your disks are not online means that the store service on each node will fail, which in turn causes the cluster to fail. You need the same data paths to each database and its log files on every server; I usually achieve this through the use of mount points.

I would create an M: drive, for example, and create folders in it for the databases.

M:\DB01
M:\DB02

etc., and ensure this is replicated on each server.

If you do decide to make changes to the storage paths, you can use Move-DatabasePath with the -ConfigurationOnly switch. This command just updates the location of the files and does not move any data. http://technet.microsoft.com/en-us/library/bb124742.aspx

Remember, each server needs identical storage paths for the databases it holds a copy of; even though the storage type may differ, the paths must be the same. So review the nodes and ensure each node has the storage it should have.
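As a sketch, updating a database's configured paths without touching the files would look like this (database name and paths are examples; the database must be dismounted first):

```powershell
# Dismount the database before changing its paths
Dismount-Database -Identity DB01 -Confirm:$false

# -ConfigurationOnly rewrites the paths in the config only; no files are moved,
# so you must have already placed the .edb and logs at the new location yourself
Move-DatabasePath -Identity DB01 `
    -EdbFilePath M:\DB01\DB01.edb -LogFolderPath M:\DB01 `
    -ConfigurationOnly

Mount-Database -Identity DB01
```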

I will monitor this.

Commented:
Also, the clustering hotfix is for Windows 2008 R2, not Windows 2008, and it addresses network issues, not storage issues.

http://blogs.technet.com/b/exchange/archive/2011/11/20/recommended-windows-hotfix-for-database-availability-groups-running-windows-server-2008-r2.aspx

Author

Commented:
Thanks Radweld.

After a mere 24 hours on it, I have a fairly stable situation. It seems my cluster is out of sync or something like that, surely due to iSCSI/network problems.

Before I removed MBOX1 from the DAG, I had 4 servers in it, so the witness directory was/is still on our first CAS server, CAS1. Now that I have removed MBOX1, DAG1 contains 3 servers but still has problems. After a lot of manipulation, I managed to get MBOX3 as master for 2 databases and MBOX4 for 1 database, with iSCSI working on MBOX3 (and the drives online!). "Cluster Management" decided that only MBOX3 is the current server for the cluster. So I was able to run "Mount-Database -Force" on 2 of them, and "Move-ActiveMailboxDatabase" with all the "Skip...Checks" parameters set to $true.
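For the record, the commands I ran were along these lines (database and server names are mine; yours will differ):

```powershell
# Force-mount a database whose copy status was suspect
Mount-Database -Identity DB01 -Force

# Activate a copy on MBOX3, skipping the health/lag/client-experience checks
Move-ActiveMailboxDatabase DB02 -ActivateOnServer MBOX3 `
    -SkipClientExperienceChecks -SkipHealthChecks -SkipLagChecks `
    -MountDialOverride:BestEffort
```

Obviously the skip switches are a last resort; they got me mounted, but the copies will need to be reseeded or verified afterwards.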

Now, everything is online, including my Public Folder database, but the cluster is still ill, with a lot of drive issues. I think I will create VMDK/standard disks instead of iSCSI, create copies on them (same letters, I agree), and move the active copies to them. I don't know at which moment I can kill/empty the DAG or regenerate it.

I'll also check the "Move-DatabasePath" syntax to see if it could have saved me at some point.

I hope I have finished to restore all that damned #@EXCH{#...

Thanks again !

The hotfix for DAG is now deployed on all MBOX servers.

I am also going to revert the network settings (TCP offload + Chimney).

Commented:
If your VMware host is also using the SAN, then the performance loss incurred by using VMDKs is really minimal, and I would always do it that way.

Incidentally, when you have more than two copies of a database you don't need to maintain log isolation; it's actually best practice to co-locate your logs with the database. For a single server not in a DAG, you still need to isolate your logs.

Three cluster nodes do not need the FSW, provided that you evicted the dead node. If the failed node is still part of the cluster, it still has a vote. Either way, you have an FSW on a CAS or Hub server, which is the correct thing to do.

I am wondering if the iSCSI networks are being used by the cluster for replication? This could cause all sorts of issues, and you should ensure that the iSCSI network is exempt from client connections and cluster communication.
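A quick way to check this (network names here are examples; run on a DAG member with the FailoverClusters module available):

```powershell
Import-Module FailoverClusters

# List cluster networks and their roles:
# 0 = not used by the cluster (what you want for iSCSI),
# 1 = cluster communication only, 3 = cluster and client
Get-ClusterNetwork | Format-List Name, Role

# Take the iSCSI network out of cluster use
(Get-ClusterNetwork "iSCSI Network").Role = 0

# On the Exchange side, make sure the DAG network mapped to iSCSI
# is not enabled for replication
Set-DatabaseAvailabilityGroupNetwork -Identity DAG1\DAGNetwork02 `
    -ReplicationEnabled:$false -IgnoreNetwork:$true
```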

The fact that LUN associations were messed up sounds like the root cause of the issue. Once each node has the correct LUNs and the correct databases and log files, that should be all that's needed.

The Move-DatabasePath command is useful if you want to move the data outside of Exchange and then just use the -ConfigurationOnly switch to update the config.

If you have more than a few databases, I would always use mount points, as stated, since they make life so much easier.

Author

Commented:
Thanks Radweld for the new information; I'll check it tomorrow.

It seems I still have problems, but I don't know where. Incoming messages are accepted by the CAS server but are not delivered to users. I don't know why... Users received the mails that came in during the outage, but since the restart, nothing more. I am not very fluent in Exchange logs, but SMTP mails seem to be tracked by the CAS in message tracking as "STOREDRIVER,DELIVER". But we don't see them in OWA or Outlook...

Author

Commented:
OK, found it... I was short on free space on the CAS server, due to the great number of incoming mails during the day. As soon as I freed up more space, delivery resumed...

Thanks again. I'll check your other points. Now I am going to sleep...

Author

Commented:
Gives me confidence... Thanks for the advice.