Detecting when a VM is vMotioned

Here's the problem: we have a series of RHEL and Windows VMs hosted in a VMware DRS/HA cluster. As 'guests' of the hosted environment, we have no access to the actual ESXi hosts, nor can we use the vCenter console client to see the ESXi hosts or the VMs on them. In the past few weeks, a number of our Windows and RHEL VMs have hit heavy performance issues lasting from one second to as many as ten, similar to what happens when a VM gets vMotioned. Our provider says that nothing like that has happened; however, when we ask for an actual vMotion 'test' to be performed in real time, the result we observe is just like the earlier incidents.

All of our RHEL and Windows VMs have the VMware Tools installed on them. On the RHEL VMs, we have been toying around with the VMware Guest SDK, but it's rather limited on documentation and examples.

So, from within a RHEL or Windows guest VM, is it possible to monitor and/or detect when it is vMotioned? If so, are there any script or code examples available? We were planning on using something like SolarWinds or Nagios to alert us when such an event is detected.
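One Guest SDK avenue that may be worth testing here: the SDK documents a session ID (VMGuestLib_GetSessionId) that is described as changing when the VM moves to another host via vMotion, and also on suspend/resume or snapshot revert. The sketch below is only a rough illustration, not a tested monitoring plugin. It assumes the installed VMware Tools exposes that value through `vmware-toolbox-cmd stat sessionid` (check your Tools version) and treats the output as an opaque string. The function names are made up for this example.

```python
import subprocess
import time


def read_session_id(cmd=("vmware-toolbox-cmd", "stat", "sessionid")):
    """Return the session ID string printed by VMware Tools, or None on failure.

    Assumption: the installed Tools version supports 'stat sessionid'.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=10, check=True
        )
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip() or None


def session_changes(readings):
    """Yield (index, old, new) wherever consecutive successful readings differ."""
    previous = None
    for index, current in enumerate(readings):
        if current is None:
            continue  # a failed read is not evidence of a migration
        if previous is not None and current != previous:
            yield index, previous, current
        previous = current


def watch(interval_seconds=5, reader=read_session_id):
    """Poll forever and print a line each time the session ID changes."""
    previous = None
    while True:
        current = reader()
        if current is not None:
            if previous is not None and current != previous:
                stamp = time.strftime("%Y-%m-%d %H:%M:%S")
                print(f"{stamp} session changed: {previous} -> {current}")
            previous = current
        time.sleep(interval_seconds)
```

Calling `watch()` from a small service, or wrapping `read_session_id()` in a Nagios check that stores the last value, would be one way to wire this up. A change only says the session changed, so a suspend/resume or snapshot revert would look the same as a vMotion.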

Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

If the vSphere environment your VMs reside in is under DRS or HA, your machines can be removed from the DRS environment. (HA, High Availability, is only used when a host goes down, so you would have 1-2 minutes of outage before the VM powered up on another host.)

However, under DRS control, the VM would be moved to another host automatically using the vMotion technique.

DRS rules could be set up to lock your machines to a host; however, your provider may not want to do this, because the benefit of DRS is to load balance memory and CPU across all hosts.

As for detecting vMotion inside the guest: it's maybe typical to lose one or two pings whilst monitoring the VM externally, but this is not always true.

You could maybe check if the processor has changed, or changed speed, if they are using different processors in the fam.

Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

Check the Event Logs for VSS entries caused by the VMware Tools VSS Sync Driver, because a snapshot is created prior to vMotion and merged afterwards.

Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

It's possible that what you are seeing is the snapshot process: create snapshot, move VM, merge snapshot.

If you had a very busy, I/O-intensive VM, this is possible, as you've proved with a manual test; it can be datastore related.

Anyway, this is beside the point here: it's not your hardware, and you require excellent service.

Ask for your VMs to be removed from the DRS pool.

Michael Worsham

ASKER

Talked with the hosting provider. As per their cautious response, taking our VMs out of the DRS pool is not an option our Infrastructure lead is willing to take. We also checked the Event Logs on the Windows server, but found no VSS entries or anything of that nature.

So the original question stands: from within a RHEL or Windows guest VM, is it possible to monitor and/or detect when it is vMotioned?

Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

It's difficult. You could monitor CPU % Ready, but you would need a baseline, plus a reading taken while a vMotion event occurs, to know what CPU % Ready looks like, because it should increase while the vMotion event is happening.
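To illustrate the baseline idea, here is a minimal sketch in Python. It deliberately leaves out how the samples are collected, since CPU % Ready is a host-side counter and whatever in-guest proxy you use (stolen or ready time from the Guest SDK, for instance) is an assumption on your part. The function names, the 3-sigma threshold, and the sample values are all hypothetical.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Return (average, standard deviation) for samples from quiet periods."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to build a baseline")
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, sigmas=3.0):
    """Return True if value sits more than `sigmas` deviations above average."""
    average, spread = baseline
    return value > average + sigmas * spread


# Hypothetical readings (percent) taken while nothing unusual was happening.
quiet = [1.0, 1.2, 0.8, 1.1, 0.9]
baseline = build_baseline(quiet)
```

With a baseline like this, a reading taken during the provider's vMotion 'test' tells you roughly how large the spike is, and later readings can be compared against both numbers.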

As you have proved that a vMotion event causes you issues, how does this benefit you?

I think it's very poor service of the hosting provider. I would be considering moving.

Michael Worsham

ASKER

We are on a government contract, pre-paid for the next 5 years so moving is not an option. The host is the only FIPS 140-2 compliant cloud/virtualization provider.

Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

Well, time for your IT Manager and your Account Manager to have words if it's affecting service, and look at the SLA and Contract that you have agreeded to.

PsiCop

"...[F]rom within a RHEL [...] Guest VM, is it possible to monitor and/or detect when it is vMotioned?"

I've explored this very issue, with RHEL v4 and v5 on ESX v3.5 and ESX v4. I have yet to find a way that will work. The UUID doesn't change, nor does there seem to be anything that changes in information reported by dmidecode. I suppose, if the config option in the VM were set to Automatic and there was a one-in-a-billion clash, the MAC address of the NIC(s) could change - but that's likely to cause a great deal more problems than it solves. With modern versions of the memconf.pl script, you can detect the underlying CPU architecture, but since all the CPUs in the ESX Cluster have to be the same, that doesn't help.

I've toyed with the idea (but never even come close to actual implementation) of using Nagios and a combination of active checks (e.g. some sort of PING) plus passive checks on the host (e.g. some sort of check where the host says "I'm here") to detect when a vMotion probably occurred. Not sure how to work out the timing, and of course a "Host Down" situation or loss of network connectivity would create a false positive.
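A rough Python sketch of how that active-plus-passive correlation might look. This is not a Nagios plugin, and nothing here was ever implemented as part of the idea above: the function names, the 15-second blip limit, and the 60-second grace window are hypothetical values picked only for illustration. It takes external ping results and the times the guest reported "I'm here", and labels each ping gap.

```python
def find_outages(pings):
    """Group failed pings into (start, end) gaps.

    pings: time-ordered list of (timestamp_seconds, ok).
    end is the first successful ping after the gap, or None if still failing.
    """
    outages = []
    start = None
    for timestamp, ok in pings:
        if not ok and start is None:
            start = timestamp
        elif ok and start is not None:
            outages.append((start, timestamp))
            start = None
    if start is not None:
        outages.append((start, None))
    return outages


def classify(outage, heartbeats, max_blip=15.0, grace=60.0):
    """Label one ping gap using the guest's passive check-ins.

    A short gap followed promptly by a heartbeat is only *consistent with*
    a vMotion; a brief network glitch would look identical.
    """
    start, end = outage
    if end is None:
        return "host down or network loss"
    if end - start > max_blip:
        return "outage too long for a vMotion"
    resumed = any(end <= beat <= end + grace for beat in heartbeats)
    return "probable vMotion" if resumed else "network loss"
```

For example, `classify((1, 3), [0, 5, 65])` labels a two-second gap followed by a check-in at t=5 as a probable vMotion. The false-positive problem is still there; all the sketch does is narrow it to short gaps where the guest comes straight back.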

"We are on a government contract, pre-paid for the next 5 years so moving is not an option."

Isn't outsourcing wonderful? Of course, outsourcing is about lazy management.

PsiCop

"[...] look at the SLA and Contract that you have agreeded [sic] to."

Yeah, especially the clause that says "You'll get what we give you, and you'll like it."

Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

"CPU architecture, but since all the CPUs in the ESX Cluster have to be the same"

This is your best chance of discovery, and the above is not necessarily true: if they've enabled EVC mode, they could be using different types of CPUs in the cluster, which guest VMs will see. So if the hosts have a Xeon 5600 or a Xeon 5500, for example, this should be detectable in the guest.

It's generally good practice to use the same servers in the cluster, but servers fail and get replaced, upgraded, or uplifted.
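For the RHEL side, a minimal Python sketch of the "has the processor changed" check, comparing the `model name` lines of /proc/cpuinfo between two readings. A big caveat, and an assumption you would have to test: a running Linux guest normally identifies its CPUs at boot, so the reported model may not change after a live migration, and EVC may mask differences further. The function names and the sample model strings are illustrative only.

```python
def cpu_models(cpuinfo_text):
    """Return the set of distinct 'model name' values in /proc/cpuinfo text."""
    models = set()
    for line in cpuinfo_text.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip() == "model name":
            models.add(value.strip())
    return models


def read_cpu_models(path="/proc/cpuinfo"):
    """Read and parse the CPU model names from a Linux guest."""
    with open(path, encoding="utf-8") as handle:
        return cpu_models(handle.read())


def model_changed(previous, current):
    """Return True only if both readings are non-empty and they differ."""
    return bool(previous) and bool(current) and previous != current


# Illustrative snapshots; real files contain many more fields per processor.
before = "processor\t: 0\nmodel name\t: Intel(R) Xeon(R) CPU X5570 @ 2.93GHz\n"
after = "processor\t: 0\nmodel name\t: Intel(R) Xeon(R) CPU X5650 @ 2.67GHz\n"
```

Storing the last result of `read_cpu_models()` and comparing it on each check gives a cheap test, which only helps if the hosts really do present different CPUs to the guest.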

PsiCop

I seem to recall that, at least at one time, the CPUs in an ESX Cluster all had to be the same flavor, down to the CPUID level. That may have been an older ESX version (v3.0 or perhaps before that), so I won't argue hanccocka's point.

So, if each ESX in the cluster had a detectably-different physical CPU flavor, that might work. But only if that were the case, and for it to be even slightly reliable you'd have to be in the loop about new ESX cluster members - and they'd need different CPU flavors from the rest. Tall order, probably too tall to expect some outsourcer to provide.

The other way to do it is through some API to the VIC/vSphere - assuming the vendor makes that interface available. A cruder way, if you can access it, would be to code something that logged into the web interface every so often and scraped the host data for each VM - again, that interface might not be available to you.

PsiCop

If the people who want you to be able to detect the migration event are the same ones who made the decision to outsource, just tell them "No, we can't, but we could if we owned the VMware infrastructure instead of it being outsourced"

That's what they get for being lazy.

ASKER CERTIFIED SOLUTION

Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

Crunched

"They would have to have some very good granular permissions in place to allow not changing anything, something we've not seen as yet!"

Err, this kind of access is super simple to set up. But if this solution is outsourced, the virtual infrastructure (ESXi hosts, vCenter Server, etc.) is probably all behind a firewall and would not be accessible to a vSphere client across the internet/WAN.

Regarding moving your VMs out of DRS: of course the hosting provider would not want this. If you tie your VM to a single ESXi host, that host could never be put into maintenance mode when they needed to (which is quite common). Also, as mentioned before, the load on that server wouldn't be balanced, and customers would then likely complain about high load/low RAM.

What are you actually trying to achieve by monitoring whether the VM has been vMotioned?

Andrew Hancock (VMware vExpert PRO / EE Fellow/British Beekeeper)

VMs suffer service issues when they are vMotioned!

Michael Worsham

ASKER

As per our guidelines with our government entity, they want 99.9% application uptime. As it is right now, when either a Windows or RHEL VM is vMotioned, the hosting provider doesn't notify us that something has happened to our environment until four hours later or the next day. If/when our WebLogic application takes a performance hit and knocks off a number of users, the government entity thinks this is unforgivable and expects us, the support team, to make heads or tails of what is going on with the environment, while our provider gives us only limited insight into what is really going on (root cause analysis, impact documentation, resolution steps, etc.).

Yes, we know this is bad hosting support, but the original contract is still in effect. A new one is being created/reviewed at this time and isn't scheduled for release until next year sometime -- our architecture and application integration team didn't think that far ahead into the future.

Long story short -- we are looking to have our cake and eat it too.

Crunched

Ah yep, the old developers blaming infrastructure and infrastructure blaming developers..

Michael Worsham

ASKER

Actually when it comes to our outsourced hosted virtualized environment (Terremark), their services and support are actually better than that of places like Amazon and Rackspace. The problem is the government and architectural team leads didn't watch what they were originally asking for, so when the time came for us to actually do higher-end technical work, things then started to fall into the cracks.

PsiCop

"The problem is the government and architectural team leads didn't watch what they were originally asking for, so when the time came for us to actually do higher-end technical work, things then started to fall into the cracks."

Which is inevitable, because the people making the decision to outsource are not the people who actually understand what is being outsourced.