MiniBazzer

vSphere 4 Failover Cluster Shutdown Problem

I have recently started at an organisation which has a fairly new virtualised installation. I have taken over from someone else who is in the process of moving to another department in the organisation, so knowledge can be obtained as to the history of the system.

I have a system which consists of two clusters - one Production and one Disaster Recovery. Each cluster is based on its own set of system hardware - separate Blade Centres, SANs etc. I can provide more information on this, if it helps to resolve the issue. :)

My issue is this...

We had a requirement to shut down the Disaster Recovery hardware over the weekend due to some power work that was occurring. As the system as a whole is in a state where it is running on Production, I thought that we would be able to take the DR system down without much of a hiccup. Okay, the system wouldn't be visible in the Client and vCenter might complain about not being able to communicate. We also run DoubleTake, so thought that it might also complain about not being able to replicate. But generally we expected the system to run without much of an issue.

However, this was not the case. Shortly after shutting down DR, we started getting calls about machines being unable to connect to the system. Rebooting seemed to fix this, so we thought all was well. About an hour or so later, more machines were experiencing issues and even our own machines couldn't connect. We decided to reboot the network switches but to no avail. We then brought the network switches back online in the DR system, thinking that if there was a network config issue in the switches, this would resolve the issue. But no. The last thing to do was bring the BladeCentre back online and boot up the servers. Once everything was back up and running again (after a few minutes) the issue resolved itself.

This therefore would suggest that there is some sort of issue with the vSphere Cluster being down that caused our network outage problems, but we're puzzled as to how. If someone is able to shed any light, I would be most appreciative.

Cheers, Simon
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

We really need more information on how it's all set up to assist.

What virtual machines are on which environment? When you shut down DR, did DR have VMs that people needed to connect to?
MiniBazzer

ASKER

Thank you for your speedy reply!

I'm a bit new to this, so please ask what you need to know. I'll start by answering what you have asked so far.

The DR system has a number of VMs which are for the DoubleTake failover system. DoubleTake monitors the VMs on the Production system using a heartbeat and, in the event of a VM going down, DoubleTake fires up the failover VM. The other VM that is on DR is a second Domain Controller VM which runs Active Directory. That will sync with the Domain Controller on the Prod system when both are operational. No user would have been connected to a VM on the DR system as it is for use in the event of a disaster only.

I hope that helps, even if just a little!
So when the DR system was shut down, what VMs were not available?

Is the DR system completely separate?

What does the DR system consist of, and how many ESX hosts? Is the vCenter server physical or virtual?

"machines being unable to connect to the system"

What machines, and what were they trying to connect to?

Is it not that the Domain Controller was unavailable, and it provides DNS and WINS for your network?

If you shut down a domain controller, you may have some connection issues whilst it is unavailable. What roles does it have?

Can you move VMs from production to DR, and from DR to Production? (via vMotion)
The VMs on the DR that were not available would have been:

SDI-VSQL04 (Copy of SDI-VSQL01) - used by DoubleTake
SDI-VSQL05 (Copy of SDI-VSQL02) - used by DoubleTake
SDI-VSQL06 (Copy of SDI-VSQL03) - used by DoubleTake
SDI-VDC02 (Domain Controller)

The DR system has 4 hosts (4 Blade Servers). The vCenter Server is a Virtual Server and exists on the Production cluster.

The machines are manufacturing devices which are computer controlled. Basically they read and write data to SQL Databases. They log on to the network as per a normal Windows machine.

I'm not sure about what role the SDI-VDC02 holds - I will check and post that up shortly. I can move VMs around from Production to DR and back again using vMotion.
So it was users that reported issues?

"machines being unable to connect to the system"

What machines, and what were they trying to connect to?

SDI-VSQL04 (Copy of SDI-VSQL01) - used by DoubleTake
SDI-VSQL05 (Copy of SDI-VSQL02) - used by DoubleTake
SDI-VSQL06 (Copy of SDI-VSQL03) - used by DoubleTake

As all the above are copies, these should not be accessed directly, I would think!

But... if you shut down a DC (SDI-VDC02, the Domain Controller), it's likely this could cause issues.

You would have been better to move it to Production via vMotion (rather than shut it down) when you've got a scheduled outage!
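
As a rough illustration of the advice above, here is a minimal Python sketch that lists which VMs on a cluster should be moved before a planned outage. It works on a hand-written inventory only and does not talk to vCenter; the `VM` class, the `vms_to_migrate` helper and the role labels are hypothetical names made up for this example, while the VM names are the ones given in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    name: str
    cluster: str
    role: str  # illustrative label, not something vSphere reports


def vms_to_migrate(inventory, cluster, critical_roles):
    """Return names of VMs on `cluster` whose role is in `critical_roles`."""
    return [vm.name for vm in inventory
            if vm.cluster == cluster and vm.role in critical_roles]


# DR inventory as described in this thread; the role labels are assumptions.
INVENTORY = [
    VM("SDI-VSQL04", "DR", "doubletake-replica"),
    VM("SDI-VSQL05", "DR", "doubletake-replica"),
    VM("SDI-VSQL06", "DR", "doubletake-replica"),
    VM("SDI-VDC02", "DR", "domain-controller"),
]

if __name__ == "__main__":
    for name in vms_to_migrate(INVENTORY, "DR", {"domain-controller"}):
        print(f"vMotion {name} to Production before the outage")
```

Which roles count as critical is a judgement call for each site; here only the domain controller is treated that way, since the replica VMs are not accessed directly.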
Yes, it was the users that reported the issues. They connect to SDI-VSQL01 - 03.

Ahh, okay. SDI-VDC02 has two roles: Active Directory Domain Services and DNS Server. Also, when I logged onto the server it had a box up asking why the server had shut down unexpectedly. Are you thinking that the most likely reason for our issues is because the second DC went down? I guess it could have caused sync issues?
pschakravarthi

Sounds like the problem is with the DC. Can you please check the event logs on your machines (or on the client machines that were reported as not working)?
I'll take a stab in the dark. As has been mentioned it sounds very much like a DC related issue.
Roles etc.
Also I'd check the DCs are replicating correctly.

My 2 bob's worth.
To follow up on pschakravarthi's questions...

calls about machines being unable to connect to the system

What do you mean they couldn't connect? What were the errors that were being seen and what events were being logged in the event logs? What applications, if any, were affected when they couldn't connect?
Don't shut down your DC; you should really vMotion it, you have the technology.

If their clients were using DNS, it's possible they couldn't resolve servers and services.

Which DNS server is this, primary or secondary?

Did servers SDI-VSQL01 - 03 shut down?

I think you need to make a Visio map or drawing of what you have, and of what you must do before shutting down next time you schedule an outage. It's a good idea to have a DC at a DR site for just-in-case situations, but be careful shutting it down, as DCs play multiple roles.
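
The DNS point above can be checked from an affected client with a few lines of Python. This is only a sketch under assumptions: `unresolved_hosts` is a hypothetical helper, it uses whatever DNS servers the client is already configured with, and the short server names are the ones mentioned in this thread (they may need the domain suffix appending).

```python
import socket


def unresolved_hosts(hostnames, resolver=socket.getaddrinfo):
    """Return the hostnames that fail to resolve with the given resolver."""
    failed = []
    for host in hostnames:
        try:
            resolver(host, None)  # port is irrelevant for a name lookup
        except socket.gaierror:
            failed.append(host)
    return failed


if __name__ == "__main__":
    servers = ["SDI-VSQL01", "SDI-VSQL02", "SDI-VSQL03"]
    for host in unresolved_hosts(servers):
        print(f"cannot resolve {host}")
```

Running it on a client both before and during an outage would show whether name resolution, rather than the SQL servers themselves, is what breaks. The `resolver` parameter exists only so the function can be exercised without a live network.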
The DNS Server that we shut down was a secondary server; the primary is on the Production cluster. It's an interesting point that you raise about the second Domain server. I thought that the idea behind having the second server was that if you lose the primary one, the secondary can take over as they are synchronized. I thought that there wouldn't be any interruption to the service.

I don't know the finer details about Active Directory and DNS etc, just enough to set stuff up and look after it. What concerns me at the moment is how the system would behave in the event that a real failover situation arose. If the DC on the Production cluster failed, would the DC on the DR cluster be able to support the system? There is the potential that we would not have time to move the DC from the Production cluster before it failed.

This topic may now be heading more into the Active Directory services domain, rather than VMware, but it still ties in due to the virtualisation...
No, Active Directory does not work like that (that was the old days, when we had Primary and Backup Domain Controllers). It's a common misconception that there is a primary and a secondary Domain Controller.

Active Directory uses a multi-master replication model, meaning clients can register their records with any available Active Directory domain controller and have access to resources within the Active Directory NTDS.DIT database.

Also, one of your Active Directory servers may be offering DNS and WINS services, which means that if you turn it off, those DNS and WINS services will not be available.

If you just power off or shut down a DC, YOU WILL experience issues.

Again, it's a good idea to have a DC offsite for DR purposes, but if you were to lose your main site DC, you would then have to use another procedure to seize the roles that were running on the failed/missing DC. You would still have a copy of the AD database, which is important and is why it's at the DR location. But next time do not shut it down; vMotion it (if possible).

this URL also gives you some details, as to what roles a DC has

http://www.anas.co.in/2009/04/active-directory-understanding-fsmo.html
Thanks Hanccocka, I'm starting to get a better understanding of how Active Directory actually works; the Primary/Secondary misconception seems quite popular.

For info, SDI-VDC01 currently runs the Active Directory Domain Services, DNS Server and DHCP Server. SDI-VDC02 currently runs Active Directory Domain Services and DNS Server. From what I understand this is a typical configuration when running two Domain Controllers.

Just to clarify my understanding (in layman's terms!): if a client has registered with VDC02, they will access the domain via this DC. Although this information will be replicated with VDC01, in the event of VDC02 failing, the client would not be able to access the domain until the client roles were transferred from the failed DC to the operational DC.

Thanks again Hanccocka!
What you also need to understand is what AD roles your DCs are performing!

Well, if clients established a connection with a DC which is no longer there, they will have issues!

Unless they restart, and then they should establish a connection with another DC, but it also depends on which servers hold which AD roles! AD is very complex. Best advice: do NOT turn them off or shut them down unless you really, really have to!
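
To see from a client which DCs are actually answering, a small TCP check along these lines may help. It is a sketch, not a full health check: `port_open` and `DC_PORTS` are hypothetical names, the DC hostnames are the ones used in this thread, and an open port only shows that something is listening, not that AD is healthy.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and failed lookups
        return False


# Well-known TCP ports a domain controller that also runs DNS listens on.
DC_PORTS = {"DNS": 53, "Kerberos": 88, "LDAP": 389}

if __name__ == "__main__":
    for dc in ["SDI-VDC01", "SDI-VDC02"]:
        for service, port in DC_PORTS.items():
            state = "open" if port_open(dc, port) else "unreachable"
            print(f"{dc} {service} ({port}/tcp): {state}")
```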
Using the website that you kindly posted a link to, I have been able to find out that the VDC01 AD server currently holds the RID, PDC and Infrastructure "roles" or "operations masters".
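
For planning purposes, the role information above can be written down and queried with a short Python sketch like the one below. The five role names are the standard FSMO roles; `roles_lost` is a hypothetical helper, and the holders of the two forest-wide roles were not stated in this thread, so they are left as unknown rather than guessed.

```python
FSMO_ROLES = (
    "Schema Master",
    "Domain Naming Master",
    "RID Master",
    "PDC Emulator",
    "Infrastructure Master",
)


def roles_lost(role_holders, offline_dcs):
    """Return the FSMO roles whose holder is in `offline_dcs`, in FSMO_ROLES order."""
    offline = set(offline_dcs)
    return [role for role in FSMO_ROLES if role_holders.get(role) in offline]


# Holders as reported in this thread; None means "not stated here".
ROLE_HOLDERS = {
    "Schema Master": None,
    "Domain Naming Master": None,
    "RID Master": "SDI-VDC01",
    "PDC Emulator": "SDI-VDC01",
    "Infrastructure Master": "SDI-VDC01",
}

if __name__ == "__main__":
    for dc in ("SDI-VDC01", "SDI-VDC02"):
        print(dc, "->", roles_lost(ROLE_HOLDERS, [dc]) or "no known FSMO roles")
```

On these figures, taking SDI-VDC02 offline removes no known operations master, whereas losing SDI-VDC01 would take three roles with it, which is the case the role-seizing procedure mentioned earlier is for.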

Hmm, seems like good advice. The only thing I need to figure out now is what changes I need to make so that in the event that the primary system fails, the secondary system will automatically take over with practically no downtime. We run a 24x7 production facility and therefore downtime is not an option!!

Thanks again Hanccocka.
ASKER CERTIFIED SOLUTION
Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)
This solution is only available to members.
Brilliant. Thanks for your help Hanccocka. That pretty much solves the problem. Discovered another issue, but that's another problem!

Ah, yes! HA... Something that I have just started looking at. We have vMotion, so using the HA bit of it shouldn't be a problem.

THANK YOU!!
Great member, very helpful and able to communicate well. I thought his solution was well explained and also educational.