We have about 20 NetWare 6.5 SP5 servers that were all running fine on ESX 2.02 on old Dell servers. We recently installed 4 new Dell servers with much beefier CPUs and 4x the amount of RAM. We created a new LUN on the FC SAN and created a VMFS3 datastore on it. We then brought up a brand new, clean Virtual Center server on new hardware (the new version of Update 2, past the license bug screwup) and imported the new VI3.5 Update 2 (new version) ESX servers. Everything worked perfectly. We then migrated the Novell guests from the old ESX 2.02 boxes to the new 3.5U2 boxes. Again, this process went perfectly. The NetWare guests had their virtual hardware upgraded (right-click menu selection in VC) prior to powering on. Once they were up and running, the new version of VMWTOOLS was installed. The guests were then rebooted.
OK, so far, so good - everything worked as advertised and is happy...then the ugly starts.
Timesync started drifting. A lot. Hours off in a 2 day period. Timesync was xntpd with an external server, and in accordance with the old VMware 2.02 time issue the timesync flag in the vmx file was set to True to sync with the host. In an effort to resolve the timesync issue, the flag was changed back to false (all guests powered off, then back on). Various options were tried: the NTP local host source with and without the fudge statement, the ESX hosts themselves as sources, and even going back to the legacy timesync.nlm. The best results so far have been with timesync.nlm set to exchange a single packet, with a polling time of 20 and a radius of 8000. Even at that, the guests stay within the 8000ms for about 24 hours and then break free into never never land (an hour off in 28 hours after a reboot is common).
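To put the "an hour off in 28 hours" figure in perspective, here is a rough sketch of the arithmetic in Python. The function names are made up for illustration, and it assumes the drift is constant over the whole period, which it clearly is not here since the guests hold sync for roughly the first 24 hours. Treat the result as an average, not a measurement.

```python
def drift_ppm(offset_seconds, elapsed_hours):
    """Average clock drift in parts per million over the elapsed period."""
    elapsed_seconds = elapsed_hours * 3600.0
    return offset_seconds / elapsed_seconds * 1_000_000


def seconds_to_exceed(radius_ms, ppm):
    """Seconds an uncorrected clock drifting at `ppm` needs to move radius_ms."""
    return (radius_ms / 1000.0) / (ppm / 1_000_000)


# One hour (3600 s) of error accumulated over 28 hours of uptime.
rate = drift_ppm(3600, 28)
print(round(rate))                           # average drift in ppm
print(round(seconds_to_exceed(8000, rate)))  # seconds to cover an 8000 ms radius
```

Averaged over the 28 hours that works out to about 3.6% (roughly 35,700 ppm). At that average rate an uncorrected clock would cover the 8000ms radius in under four minutes, so once sync is lost the clock is moving far faster than small periodic corrections would be expected to pull back.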
Now, a related problem is that once the timesync breaks, regardless of the timesync method used, CPU utilization hits 100% and stays there. Once the CPU hits 100%, the process with the highest thread count in the Kernel Processors monitor screen is always the following:
SERVER.NLM SyncClock Proc Tag
Communications is perfect on these guests - all can resolve both short name and FWN fine (though the time sources in both timesync.cfg files (the legacy sys:\etc one and the monitor-created sys:\system one) as well as in ntp.conf use IP addresses). None of the servers (ESX or guests) have shown a single communications problem - the timesync debug screen shows continual successful updates, with adjustments running into the tens of thousands of ms. The physical switches show no errors.
eDirectory is healthy. DSREPAIR is clean in both an advanced local repair and full unattended. DSTRACE shows nothing out of the ordinary even with all options enabled.
I can find no logical reason why time should be off. eDir looks good.
So the big questions are:
1) What is the cause and what is the result? Is timesync driving CPU to 100%, or is CPU being at 100% preventing time from syncing?
2) How do I get time to sync?
Anyone seen this before?