Illuvitar

asked on

Novell 6.5 SP5 Guests will not sync time and have 100% CPU utilization on VI3.5

We have about 20 NetWare 6.5 SP5 servers that were all running fine on ESX 2.02 on old Dell servers.  We recently installed 4 new Dell servers with much beefier CPUs and 4x the amount of RAM.  Created a new LUN on the FC SAN, created a datastore in VMFS3....  Brought up a brand new, clean Virtual Center server on new hardware (the new version of Update 2, past the license bug screwup) and imported the new VI3.5 Update 2 (new version) ESX servers.  Everything worked perfectly.  Then migrated the NetWare guests from the old ESX 2.02 boxes to the new 3.5U2 boxes.  Again, this process went perfectly.  The NetWare guests had their virtual hardware upgraded (right-click menu selection in VC) prior to powering on.  Once up and running, the new version of VMWTOOLS was installed.  Guests were then rebooted.

OK, so far, so good - everything worked as advertised and is happy...then the ugly starts.

Timesync started drifting.  A lot.  Hours off in a 2 day period.  Timesync was xntpd with an external server, and in accordance with the old VMware 2.02 time issue the timesync flag in the vmx file was set to True to sync with host.  In an effort to resolve the timesync issue the flag was changed back to false (all guests powered off then back on), and various options were tried: the NTP local host source with and without the fudge statement, the ESX hosts themselves as sources, even going back to the legacy timesync.nlm.  The best results so far have been with timesync.nlm set to exchange a single packet, with a polling time of 20 and a radius of 8000.  Even at that they stay within the 8000ms for about 24 hours then break free into never never land (an hour off in 28 hours after a reboot is common).
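To put the numbers above in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes a constant drift rate, which is a simplification (the post describes the clock holding for about 24 hours and then breaking away), and it assumes the polling value of 20 is in seconds. The function names are made up for illustration.

```python
# Rough drift arithmetic. ASSUMPTION: the drift rate is constant,
# which the thread itself suggests is not strictly true.

def drift_rate(offset_seconds: float, elapsed_seconds: float) -> float:
    """Seconds of clock error accumulated per second of real time."""
    return offset_seconds / elapsed_seconds


def seconds_until_outside_radius(radius_ms: float, rate: float) -> float:
    """Time for an uncorrected clock to drift past the sync radius."""
    return (radius_ms / 1000.0) / rate


# Figures from the post: about one hour off after 28 hours of uptime.
rate = drift_rate(offset_seconds=3600, elapsed_seconds=28 * 3600)

# ASSUMPTION: the polling value of 20 is in seconds.
drift_per_poll_ms = rate * 20 * 1000

# Radius of 8000 ms, as configured in the post.
time_to_exceed = seconds_until_outside_radius(8000, rate)

print(f"average drift: {rate:.2%}")
print(f"drift per 20 s poll: {drift_per_poll_ms:.0f} ms")
print(f"uncorrected time to exceed 8000 ms: {time_to_exceed:.0f} s")
```

Averaged over the 28 hours, that is a drift of roughly 3.6%, or about 714 ms per 20-second poll. Since the guests reportedly held within the radius for the first 24 hours, the actual drift after the breakaway must have been much faster than this average.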

Now, a related problem is that once the timesync breaks - regardless of the timesync method used - CPU utilization hits 100% and stays there.  Once CPU hits 100%, the process with the highest thread count in the Kernel Processors monitor screen is always the following:

SERVER.NLM SyncClock Proc Tag

Communications is perfect on these guests - all can resolve both short name and FWN fine (though the time sources in both timesync.cfg files (the legacy sys:\etc one and the monitor-created sys:\system one) as well as in the ntp.conf files use IP addresses).  None of the servers (ESX or guests) have shown a single communications problem - the timesync debug screen shows continual successful updates and adjustments into the tens of thousands of ms.  The physical switches show no errors.

eDirectory is healthy.  DSREPAIR is clean in both an advanced local repair and full unattended.  DSTRACE shows nothing out of the ordinary even with all options enabled.

I can find no logical reason why time should be off.  eDir looks good.

So the big questions are:

1) What is the cause and what is the result?  Is timesync driving CPU to 100%, or is CPU being at 100% preventing time from syncing?

2) How do I get time to sync?

Anyone seen this before?
SOLUTION
Naga Bhanu Kiran Kota
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Illuvitar

ASKER

The vmx files are currently configured with the tools.syncTime = FALSE setting and have been for about a week.  Now, I have honestly NOT tried unloading both xntpd and timesync.nlm and changing the vmx file to tools.syncTime = TRUE.  May be worth a try.  Will let you know.
SOLUTION
If I unload xntpd and timesync.nlm then how exactly will eDirectory recognize that time is in sync?  Obviously, regardless of how accurately the server tracks time, if eDir does not recognize this fact then the partition replicas will not stay in sync.  If you run a timesync report in dsrepair with both these NLMs unloaded, what does it look like?
SOLUTION
The clock runs fast on the guests.  NTP is configured on the ESX hosts using Tick and Tock (192.5.41.40 and 41).  I had seen that thread - and we originally tried the vmx setting at false and used NTP on the guests, all pointing to an external source.  That worked for about 24 hours; after that even unloading and reloading xntpd with the -S slam option didn't work, and we had to reboot each guest.  At the current moment we have xntpd and timesync UNLOADED and the vmx option set to TRUE.  While time is now staying in sync, the challenge is that in dsrepair each server just shows dashes...so I am not 100% convinced that eDir replicas are staying in sync.  If time holds I will do some further testing.
There's some very promising info waiting for you here -> http://kb.vmware.com/kb/1003613

Lars
Larstr - As mentioned in the original post, I tried very short poll times and huge radius settings (8000 instead of the lower 2000 in that tid) with no luck.
ASKER CERTIFIED SOLUTION
giovannicoa - That is exactly the track I was on.  In fact, it turns out, too much.  I have resolved the issue and here is the synopsis:

1) The physical host servers are brand new and have 4x the RAM of the previous hosts, and since we did not have any new VMs, I bumped RAM high - like 6GB - without thinking about it.  Again, these are NetWare 6.5 SP5 machines, scheduled to be migrated to the SLES platform in a month or so.  During the boot process they only recognized just shy of 4GB (expected for a 32-bit OS).  Again, I didn't think much of it since these are going to SLES, which will see the larger amount of RAM.  Just figured it would react like a physical box with too much RAM - ignore it.  This was NOT the case.  At another engineer's suggestion I reduced the RAM to the max the box could recognize (honestly, just to show him it would not make a difference).  Well, was I wrong.  The CPU utilization used to follow a pattern of <5% for 3-4 hours, then bounce to 100% and stay there until reboot.  With the lower amount of RAM it now never exceeds the 5% mark.

2) Timesync was drifting due to the CPU utilization.  Now that CPU is in check due to RAM correction, the time stays in sync with either the XNTPD or TIMESYNC method.  

3) I have found that it is best to ALWAYS disable the sync with host feature.

4) I have found that vmwtool.nlm makes a huge difference and should always be installed and running.

5) I have confirmed my suspicion that unloading timesync and unloading xntpd is a bad idea.  While the boxes will technically stay in sync, eDirectory does not recognize this fact and the obits climbed at a steady rate with dstrace showing numerous errors.  Not good for a healthy tree.  I do not know the specific mechanics of how partition replicas determine if all other replicas of a given partition are in sync in order to remain in continuity, but it definitely appears that with both timesync methods disabled this process is hindered if not completely broken.

6) Through testing, and although this is not the Novell nor VMware recommended method, I have found that using the legacy timesync model with a very low polling rate (10-20) and a higher radius (6000-8000), with configured sources obviously turned on and DEFINITELY using the :123 option to force NTP (which runs over UDP port 123), gives the best results as far as a DSREPAIR timesync report is concerned.  Again, make sure your VMX does not have the timesync via vmwtools set to true.
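As a rough illustration of points 1 and 3 above, the following Python sketch scans the text of a .vmx file for the two settings that mattered in this thread. It relies on the standard `memsize` (in MB) and `tools.syncTime` keys. The function names, the 4096 MB threshold, and the sample text are assumptions made for the example and do not come from the posts; the thread only says the guests recognized "just shy of 4GB".

```python
import re

# Matches lines such as: memsize = "6144"
_VMX_LINE = re.compile(r'^\s*([\w.:\-]+)\s*=\s*"(.*)"\s*$')


def parse_vmx(text: str) -> dict:
    """Parse key = "value" pairs from .vmx text; keys are lowercased."""
    settings = {}
    for line in text.splitlines():
        match = _VMX_LINE.match(line)
        if match:
            settings[match.group(1).lower()] = match.group(2)
    return settings


def check_netware_guest(text: str, max_mem_mb: int = 4096) -> list:
    """Return warnings based on this thread's findings (illustrative only).

    max_mem_mb is a HYPOTHETICAL threshold; pick whatever the guest
    actually reports at boot.
    """
    settings = parse_vmx(text)
    warnings = []
    mem = int(settings.get("memsize", "0"))
    if mem > max_mem_mb:
        warnings.append(f"memsize is {mem} MB, above {max_mem_mb} MB")
    if settings.get("tools.synctime", "FALSE").upper() == "TRUE":
        warnings.append("tools.syncTime is TRUE (sync with host enabled)")
    return warnings


# Made-up sample resembling the misconfigured guests described above.
sample = 'memsize = "6144"\ntools.syncTime = "TRUE"\n'
for warning in check_netware_guest(sample):
    print(warning)
```

This only reads text and reports; it does not modify any file, and changes to a vmx still need the guest powered off and back on, as noted earlier in the thread.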

At this point I am considering the case closed - root of the problem was RAM amount being incorrectly configured and for some reason in a VM environment this just totally hosed things up.  I am not sure how to award points so I will select multiple solutions that matched my real world findings - as that is the most fair way I can think of.  Thanks for the pointers!
I have awarded points based on how much information the provided suggestion gave me that was useful in solving my problem.  Unfortunately the root of the problem was RAM config, which I had not provided in the original description and never would have thought to be an issue.  I hope this is a fair way of doing this.  Thanks again!
Hi Illuvitar,

Thanks for the points. But I wanted to thank you for giving a clearer explanation of how you resolved the issue, because it is very rare for an asker to summarize the case.

I have not done much on NetWare from the perspective of DSREPAIR and replication to think about timesync.

But one ground rule that I learned when virtualizing NetWare is that timesync and other factors increase the CPU usage of the base machine, and VMware Tools takes care of those symptoms.

Good to know your experience, and I think this question will be a very comprehensive solution on time sync on virtual machines.

bhanu