David Williamson

Backup Hyper-V virtual machines for DHCP and DFS boot fine but won't do their jobs

I had a Hyper-V host that wouldn't boot, but I had some backups of the VMs on another machine that I fired up. Two of the VMs are DCs; one is the DHCP server and the other the DFS root. Once the backup VMs booted, I was able to ping both of them. Things seemed normal until I realized that users were not getting DHCP addresses. I tried clearing the ARP cache on our switches, thinking that those machines couldn't find the DHCP server, but that didn't work. I even reset one workstation's NICs, but that didn't work either. Nothing I tried would cause the machine to get an IP assigned. Only after I assigned a manual IP did that workstation seem to be back to normal.

I also noticed that some of our DFS mapped drives didn't work either, even though the DFS root was up and running and I could ping it.

In the meantime, I was able to get the original Hyper-V host up and running again, so I shut down the backup VMs and started up the original VMs. DHCP and DFS started working immediately! Can someone help me understand why this happened, and why the backup VM wouldn't answer DHCP requests? Or why the DFS root didn't seem to work either? It's kind of useless to have backup VMs if they won't do the jobs they're supposed to...

I am fully up and running with the original Hyper-V host and all the original VMs once more, so the fire has been put out for now. But I'm worried that the next time I actually need those backup VMs to work, I'll be in the same situation once more. Help?
Paul MacDonald

These "backup VMs" are different machines than the "primary VMs", or are they backups of the primaries?  

In either case, it may be that the "backup VMs" weren't up-to-date with the state of the network when you started them. DHCP servers in a domain have to be authorized to operate and it may be that your "backup" DHCP servers were not. The DFS issue may have a similar root cause, if the "backup" DCs didn't have the current DFS topology.

ASKER

They are backups of the primaries, so as far as the network is concerned, they are the same machines (or should be, at least). The backups happen 2-3 times daily, so they should have been quite current. I'm using Quest Rapid Recovery, which is an image-based backup system. RR automatically keeps the backup VMs updated throughout the day (they call it "virtual standby"). The backup VMs are off, but RR keeps their VHDs updated each time a new backup of the live machine happens.

Could it have something to do with the MAC address of the backup VM? That is why I cleared the ARP cache on my switch, thinking that it may have had a stale record, but that made no difference. I was able to ping them anyway, so it doesn't seem like that was it.
"Could it have something to do with the MAC address of the backup VM?"
It's possible, but given the "backup VM" is identical to the live VM, I would expect them to have the same MAC address.

How positive are you these "backup VMs" are (or were) current?  Even being a couple days old might cause problems, though I would expect the problems to go away pretty quickly.  Another question is, were the date and time on the "backup VMs" current once they booted?

It's probably more trouble than it's worth at this point, but I'd like to see a Wireshark capture of the D.O.R.A. (Discover, Offer, Request, Acknowledge) packets between a client and the backup DHCP server. This is an unusual problem and you may run into it again some day.
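As a rough illustration of what to look for in such a capture, here is a minimal Python sketch that takes the DHCP message types seen for one client (in capture order) and reports the first D.O.R.A. step that never shows up. The numeric values are the standard DHCP message type codes (option 53); the function name `first_missing_dora_step` and the idea of feeding it a hand-built list are assumptions for illustration, not something taken from this thread or from Wireshark itself.

```python
from enum import IntEnum
from typing import Iterable, Optional


class DhcpMessageType(IntEnum):
    """DHCP message types as carried in option 53."""
    DISCOVER = 1
    OFFER = 2
    REQUEST = 3
    DECLINE = 4
    ACK = 5
    NAK = 6
    RELEASE = 7
    INFORM = 8


# The normal lease handshake, in order.
DORA = [
    DhcpMessageType.DISCOVER,
    DhcpMessageType.OFFER,
    DhcpMessageType.REQUEST,
    DhcpMessageType.ACK,
]


def first_missing_dora_step(
    observed: Iterable[DhcpMessageType],
) -> Optional[DhcpMessageType]:
    """Return the first handshake step that never appears in order.

    Returns None when the full Discover/Offer/Request/Ack sequence is present.
    """
    position = 0
    for message in observed:
        if position < len(DORA) and message == DORA[position]:
            position += 1
    return DORA[position] if position < len(DORA) else None


# Hypothetical capture: the client keeps broadcasting Discover and hears nothing.
capture = [DhcpMessageType.DISCOVER] * 3
print(first_missing_dora_step(capture))
```

If only Discovers appear, the server never offered a lease, which would fit the "not authorized" theory above. If the handshake stops after Offer or Request, the problem lies somewhere else in the exchange.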
I did not happen to check the date/time, but I could boot them back up without connecting them to the network to see what they say. I'd have to wait until after hours to get them back on the network to do a packet capture.
As far as the MAC address goes, does that come from the VM or from the actual hardware that the VM is using (which is different, of course)? I suppose I could make sure the backup has the same one via the Hyper-V settings in the network area.
Hyper-V MAC addresses come from a pool on the host.  You can assign a fixed MAC address, but this shouldn't be necessary.
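For comparing the MAC addresses of an original and a copy, a small Python sketch like the following might help. It normalizes the different notations (dashes, colons, dots) before comparing, and checks for the 00-15-5D prefix that Hyper-V uses for dynamically assigned addresses by default. The function names are hypothetical, and the prefix check assumes the host's MAC address range has not been changed from the default in Virtual Switch Manager.

```python
import string

# Default prefix of Hyper-V's dynamic MAC address pool (Microsoft OUI).
HYPER_V_OUI = "00155D"


def normalize_mac(mac: str) -> str:
    """Strip separators and upper-case a MAC address, e.g. '00-15-5d-..' -> '00155D..'."""
    digits = "".join(ch for ch in mac if ch not in "-:.").upper()
    if len(digits) != 12 or not all(ch in string.hexdigits for ch in digits):
        raise ValueError(f"not a valid MAC address: {mac!r}")
    return digits


def looks_like_hyper_v_dynamic_mac(mac: str) -> bool:
    """True if the address falls under the default Hyper-V dynamic prefix."""
    return normalize_mac(mac).startswith(HYPER_V_OUI)


def same_mac(first: str, second: str) -> bool:
    """Compare two MAC addresses regardless of separator style or case."""
    return normalize_mac(first) == normalize_mac(second)


# Hypothetical addresses for an original VM and its copy.
print(same_mac("00-15-5d-01-02-03", "00:15:5D:01:02:03"))
print(looks_like_hyper_v_dynamic_mac("00-15-5D-01-02-03"))
```

A copy that boots on a different host with a dynamic MAC will normally draw a new address from that host's pool, so the two values are worth comparing before assuming they match.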
Still trying to work through this. I tried moving another machine to a new Hyper-V host, but was also having network issues. I made sure the copy VM had the same MAC address, disconnected the original's network, then connected the copy's network. The copy seems to be working fine; from it I can ping everywhere, do nslookups against the domain controller, etc., but I cannot ping the copy from other machines. I'm not even sure how that works! From the copy I can browse the internet, the file server, everything. But I can't ping it. The switches somehow don't know where that IP is even though it has the same MAC address as the original. We have two Cisco Catalyst switches and I cleared the ARP table on both of them, still nothing.

This makes having VM copies sort of useless if they can't be reached :-)
"The switches somehow don't know where that IP..."
Switches use the MAC address to forward frames, and if you can go from the VM to a web page, then packets are travelling in both directions.
It's possible a firewall rule exists on the copy that suppresses ICMP responses.  That would explain this behavior.
When you ping the VM's hostname, it resolves to the correct IP address?  Does it make a difference if you ping the IP address instead?
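The point about switches and MAC addresses can be sketched with a toy model. The Python class below is a deliberately simplified learning switch, not a description of how Cisco Catalyst hardware is implemented: it records which port each source MAC was last seen on and uses that table to pick the outgoing port. The class and method names are made up for illustration.

```python
from typing import Dict, Optional


class LearningSwitch:
    """Toy layer-2 switch: learns MAC -> port from the source of each frame."""

    def __init__(self) -> None:
        self.mac_table: Dict[str, int] = {}

    def forward(self, in_port: int, src_mac: str, dst_mac: str) -> Optional[int]:
        """Learn the sender's port, then return the port for the destination.

        Returns None when the destination is unknown, in which case a real
        switch would flood the frame out of every other port.
        """
        self.mac_table[src_mac] = in_port
        return self.mac_table.get(dst_mac)


switch = LearningSwitch()
# A workstation on port 1 pings the VM before the VM has sent anything.
print(switch.forward(in_port=1, src_mac="workstation", dst_mac="vm"))
# The VM, now behind port 7 on the new host, sends any frame at all.
print(switch.forward(in_port=7, src_mac="vm", dst_mac="workstation"))
# From here on, frames for the VM go to port 7.
print(switch.forward(in_port=1, src_mac="workstation", dst_mac="vm"))
```

In this model, any outbound traffic from the copy is enough for the switch to learn its new location, which is why working outbound browsing makes a stale switch entry an unlikely explanation.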

"This makes having VM copies sort of useless if they can't be reached :-)"
I agree!
It does resolve the host name, but there's no difference between pinging the name or the IP. There are some other services on that server (website and printers) and those are not accessible when the copy VM is online.

As far as a firewall rule goes, there shouldn't be anything because the copy should be just that, a copy. The original doesn't block ICMP, so the copy shouldn't. I will double-check, but I'm confident the domain firewall is disabled.

One of the things I didn't mention is that the copy VM is on a different host. The original VM is on Server 2008 R2 and the copy is on Server 2016. I've just started using 2016, so could there be something I'm missing about how the newer version of Hyper-V works? Maybe the virtual switches behave differently, or have VLANs predefined or something odd like that?
OK, some progress! I disconnected the original VM and fired up the copy. Still no network access to it from the outside, while on the machine itself everything appeared normal: web browsing, file server browsing, pinging anything, etc. So I tried physically plugging and unplugging cables, disabling virtual adapters, whatever I could think of, with no effect. Next, I decided to run netsh int ip reset and netsh winsock reset on the copy VM, then rebooted. Voilà! I could now ping the machine normally, the sites it hosts came up, and everything went to 100% functional.

Nearest I can figure is that when the copy VM came online and installed a new virtual NIC (because it was on different physical hardware), something got jacked up in the TCP/IP stack, which allowed it to do everything normally, but nothing could get in to it from the outside (except replies to what it initiated itself first). Doing the reset seems to have cleared it, bringing it back to normal.

That's one for the record books, never encountered it before...
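For anyone who wants to script the two resets mentioned above as part of bringing a standby copy online, here is a minimal Python sketch. The two netsh commands are the ones from this thread; the wrapper function `reset_network_stack` and its `dry_run` flag are assumptions added for illustration. It defaults to a dry run, since the real commands need an elevated prompt on the Windows guest and a reboot afterwards.

```python
import subprocess
from typing import List

# The two resets run on the copy VM in this thread.
RESET_COMMANDS: List[List[str]] = [
    ["netsh", "int", "ip", "reset"],
    ["netsh", "winsock", "reset"],
]


def reset_network_stack(dry_run: bool = True) -> List[str]:
    """Run (or, by default, just list) the TCP/IP and Winsock resets.

    Returns the command lines in the order they are (or would be) executed.
    """
    for command in RESET_COMMANDS:
        if not dry_run:
            # check=True raises CalledProcessError if netsh reports failure.
            subprocess.run(command, check=True)
    return [" ".join(command) for command in RESET_COMMANDS]


# Dry run: only shows what would be executed.
for line in reset_network_stack():
    print(line)
```

Calling `reset_network_stack(dry_run=False)` would execute the commands; the VM still has to be rebooted afterwards, as it was here.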