  • Status: Solved
  • Priority: Medium
  • Security: Public
  • Views: 1581

Windows Terminal Server 2003 64-bit / XenApp 5 freezes intermittently

Guys,

The following problem has been bothering me for some weeks now.

Terminal server (TS): W2K3 64-bit, XenApp 5 rollup 6, current HW (HP blade 460cG7, 2 sockets, 6 cores each, 192 GB => restricted to 6 cores in BIOS, restricted to ~32 GB by the Windows boot loader switch /MAXMEM=32000). Attached to a SAN (Fibre Channel) and Gigabit Ethernet.

Application mix mostly Office 2003 / Acrobat reader / IE 8 / etc. User profiles & user data stored on a Windows 2008R2 fileserver. Users work with active and open PST files on the fileserver (FS). Some have more than one PST opened in parallel. Size up to several GB.

TS & FS tuned following KB324446 and http://blogs.citrix.com/2010/10/21/smb-tuning-for-xenapp-and-file-servers-on-windows-server-2008/

BIOS is up to date; BIOS settings follow the guidelines for "low latency applications". XenApp CPU management is active.

The same FS is accessed by all servers in this silo. As long as the load is spread over enough TS, each below ~60 sessions, every individual TS works fine. When a server is pushed up to 75-80 sessions, it becomes unusable even though some free resources (CPU, memory) remain. As soon as the number of sessions drops below 60 again, things are fine again.

Users experience a "sluggish" server with "bad" performance (keystrokes arrive delayed, etc.). No data loss.

The observation is that the TS starts to freeze intermittently for 1-10 sec. All user sessions are affected at the same time. This is reproducible on any server in the same silo. Most of them are virtualized; one (the HW described above) runs without a hypervisor to exclude ESX as the culprit.

Perfmon shows gaps in its graphs whenever a freeze happens. EdgeSight (basic version only) gives no clue.
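
Since the gaps in the Perfmon graphs mark the freezes, one way to put numbers on them is to log counters to CSV at a fixed interval and look for holes between consecutive sample timestamps. The following is only a minimal sketch: the timestamp layout is an assumption about the CSV export, the function name `find_gaps` is made up for illustration, and the sample timestamps are invented, not taken from the servers in this thread.

```python
from datetime import datetime, timedelta

# Assumed layout of the timestamp column in a Perfmon CSV export.
TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M:%S.%f"


def find_gaps(timestamps, interval_s=1.0, tolerance_s=0.5):
    """Return (start, end, seconds) for each pause longer than interval + tolerance."""
    limit = timedelta(seconds=interval_s + tolerance_s)
    parsed = [datetime.strptime(t, TIMESTAMP_FORMAT) for t in timestamps]
    gaps = []
    for prev, cur in zip(parsed, parsed[1:]):
        delta = cur - prev
        if delta > limit:
            gaps.append((prev, cur, delta.total_seconds()))
    return gaps


# Invented sample data: a 1-second log with one hole, as a freeze would leave.
samples = [
    "03/15/2011 10:00:00.000",
    "03/15/2011 10:00:01.000",
    "03/15/2011 10:00:02.000",
    "03/15/2011 10:00:09.500",
    "03/15/2011 10:00:10.500",
]

for start, end, seconds in find_gaps(samples):
    print(f"gap of {seconds:.1f} s between {start.time()} and {end.time()}")
```

The resulting list of gap start times can then be lined up against session counts or other counters to see what was happening just before each freeze.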

I have already looked at the obvious; the event log shows nothing specific, and honestly I'm stuck now.

Working assumption for the time being: a Windows resource gets exhausted. Until it becomes available again, all requests are queued and then processed afterwards.

What should be monitored, and with which tool?
Which settings should be checked?
Any other ideas?

Thx. in adv.

Joe
Asked by: universal_dilettant
1 Solution
 
basraj Commented:
Where do the users connect from? Did you check whether this happens for LAN and external users at the same time? I would also suspect a networking issue somewhere between the user devices and the terminal servers, since it affects all users at once. When the problem occurs, please run ping and tracert from the end devices to the server. This might give a clue.

You can also verify latency using the Citrix CMC tool if this happens only for a subset of users. Please also involve your network admin here. Generally, when a connection drops, the session may freeze and then work fine again once the connection is back. I experienced this in my environment in the past.

Ayman Bakr, Senior Consultant, Commented:
It may well be that your XA servers simply benchmark at a maximum of 60 users per server; that would not be bad, depending on the nature of usage. In our environment we have benchmarked it at 10 users per server!!

However, if you want to pinpoint which resource is being exhausted and causing the queuing, look at the following counters in addition to the Processor and Memory counters:

Paging File - % Usage: sustained high values indicate that you have too little RAM.

Physical Disk - Avg. Disk Queue Length: should usually be below 2, I believe (a higher number is a clear indication of disk congestion).

Physical Disk - Avg. Disk sec/Read: the average should be around 20 ms, with spikes no higher than 50 ms. Higher values indicate congestion when reading data from the SAN.

Physical Disk - Avg. Disk sec/Write: same as above.

System - Processor Queue Length: a number higher than 3 usually means that the processors are insufficient or heavily overloaded.

Network Interface - Bytes Total/sec: should not exceed 70% of the total bandwidth of the interface.
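
The rules of thumb above can be turned into a small check that flags which averaged counters are out of range. This is only a sketch: the dictionary keys and the function name `exceeded` are made up for illustration, the sample values are hypothetical, and the limits are the guideline figures quoted above rather than hard limits. Paging file usage is left out because no numeric limit was given for it.

```python
# Guideline limits quoted above (rules of thumb, not hard limits).
THRESHOLDS = {
    "avg_disk_queue_length": 2,     # per disk
    "avg_disk_sec_read_ms": 20,     # average, in milliseconds
    "avg_disk_sec_write_ms": 20,    # "same as above"
    "processor_queue_length": 3,
    "network_utilization_pct": 70,  # percent of interface bandwidth
}


def exceeded(averages, thresholds=THRESHOLDS):
    """Return the counters whose average is above its guideline limit."""
    return {
        name: value
        for name, value in averages.items()
        if value > thresholds.get(name, float("inf"))
    }


# Hypothetical averages, for illustration only.
sample = {
    "avg_disk_queue_length": 0.8,
    "avg_disk_sec_read_ms": 35,
    "avg_disk_sec_write_ms": 5,
    "processor_queue_length": 12,
    "network_utilization_pct": 30,
}

print(exceeded(sample))
```

Counters without a configured limit are never flagged, so the same function can be fed a full counter dump without extra filtering.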

universal_dilettant (Author) Commented:
Thanks for your hints.

@basraj:
Users are on WAN connections, accessing the silo (with 10+ similar servers) from different locations. However, nothing has been reported from the network side, and only users on this very server are affected. So for the moment I rule out the network as the culprit.

@mutawadi:
Paging File % usage: 30% to 35%. Server got plenty of RAM.

Physical Disk - Avg Disk Queue Length: The 3 "local" disks (system, applications, pagefile) are below 1 always.

Physical Disk - Avg. Disk sec/Read: avg. below 2 ms

Physical Disk - Avg. Disk sec/Write: avg. around 5 ms

System - Processor Queue Length: I think between 2 and 5 per processor still indicates a healthy situation. The average stays clearly below 30, with spikes up to 200. I can't say whether this is coincidence, related to the server freezes, or just caused by momentary high demand for CPU.

Network Interface - Bytes Total/Sec: below 30%

@all
Most of the "disk access" is by SMB to a W2K8 R2 fileserver. This FS has not indicated any error condition. What do you think about increasing SMB tuning values?
MaxMpxCt = 8192
MaxWorkItems = 32768
I will give it a try tonight.
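
For reference, a change like this can be scripted with the standard `reg add` command. The sketch below only builds the command lines and does not touch the registry. It assumes both values are DWORDs under the LanmanServer parameters key; the helper name `reg_add_command` is made up for illustration.

```python
# Assumption: both values live under the LanmanServer parameters key.
KEY = r"HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters"

# Values proposed in the post above.
SETTINGS = {"MaxMpxCt": 8192, "MaxWorkItems": 32768}


def reg_add_command(name, value, key=KEY):
    """Build a 'reg add' command line that sets a DWORD value (/f overwrites silently)."""
    return f'reg add "{key}" /v {name} /t REG_DWORD /d {value} /f'


commands = [reg_add_command(name, value) for name, value in SETTINGS.items()]
for command in commands:
    print(command)
```

Printing the commands first makes it easy to review them, and to export the existing values as a backup, before running anything on a production server.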
 
Ayman Bakr, Senior Consultant, Commented:
Hope this will help, though personally I don't think it will.

All the counters seem fine. However, I would check at what times the processor queue length spikes occur and whether they correlate with the number of users exceeding 60.

Coralon Commented:
From the sound of it, I'd guess you have overloaded your file server, not the CTX servers.
If you have a lot of redirections, that can have a very dramatic effect on the CTX servers. Many of the redirected folders are polled for changes, and this can have a very noticeable effect. The Desktop folder is a prime example.

My suggestion would be to break out your home directories & profile redirection folders onto different servers.

Coralon

universal_dilettant (Author) Commented:
@Coralon:
Can you say what should be monitored on the file server? I'm monitoring CPU / memory / network traffic / disk queue length. The latter is below 1 on average, sometimes spiking to 10 for 1 sec.
What else should be monitored?

@all:
Last night I changed
  MaxMpxCt = 8192
  MaxWorkItems = 32768
on both terminal server and file server.
"Redirector / Current Commands" on the terminal servers grows linearly with the number of users; now, with 47 users, it stays around 105 to 110.
BUT: absolutely 0 user complaints. The terminal server appears smoother (yes, with 47 users I'm below my target number, but before the change we were experiencing (very short) stalls on the terminal servers even at 40-45 users). So this seems promising.

I'm even thinking of increasing MaxMpxCt / MaxWorkItems to their respective max values.

However, I'd appreciate feedback on what needs to be monitored on the file server.

Cheers,
Joe

Ayman Bakr, Senior Consultant, Commented:
Mainly you will need to look at the PhysicalDisk - Avg. Disk Queue Length.

Also try disabling opportunistic locking on the file server by setting the following registry value to 0, and see whether this helps:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters\EnableOplocks

To read more about it see this KB article:
http://support.microsoft.com/kb/296264

Coralon Commented:
You've got to be careful disabling the oplocks.. you may fix your TS performance at the cost of your app performance..

On your file server watch your %disk time, %disk idle time, along with what you listed and the Avg. Disk Queue Length as Mutawadi mentioned.  You might even look at SplitIO.  This sounds really tough though..

How full is your SAN?  (grasping at straws here..)  I've heard of cases of poor performance with some SANs when they get too full..

Coralon

universal_dilettant (Author) Commented:
@all:

I was monitoring file servers all day yesterday. Nothing obvious.

%disk time was mostly around 10. Only once, it peaked to 300 for 1sec and directly dropped afterwards (strange for a percentage value, but according to some MS papers that is possible)

avg disk queue was around 2, with 1 sec peaks up to 10 (once I saw 16)

I have not played with oplocks yet.

None of the peaks above could be correlated with the CTX freezes.


However, the symptoms remained on the terminal server.
What looks very strange is the "Redirector / Current Commands" figure. It increases linearly with the number of users on the CTX server; each user contributes around 2.5 current commands. During the test yesterday, with 80+ users on one of the CTX servers, it jumped above 200. At the same time, the server was freezing quite frequently.
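
The linear relationship makes for a quick back-of-the-envelope estimate. The sketch below uses the rough figure of 2.5 current commands per user; the function names are made up for illustration, and the earlier reading of roughly 110 commands at 47 users is included to show that the observed ratio is only approximately 2.5.

```python
COMMANDS_PER_USER = 2.5  # rough figure from the observation above


def estimated_current_commands(users, per_user=COMMANDS_PER_USER):
    """Estimate 'Redirector / Current Commands' for a given session count."""
    return users * per_user


def observed_ratio(current_commands, users):
    """Commands per user derived from one observed reading."""
    return current_commands / users


for users in (47, 60, 80):
    print(users, "users ->", estimated_current_commands(users), "current commands")

# Earlier reading in this thread: about 105 to 110 commands with 47 users.
print("observed ratio:", round(observed_ratio(110, 47), 2))
```

Comparing such an estimate with whatever ceiling applies on the redirector side would show at which session count the server starts queuing requests.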


Last night I put the silo back into a stable condition, with avg. 60 users per server. No complaints from users, no obviously frozen sessions so far.
This may indicate that the SMB tuning I applied has pushed the limit up.

Joe

universal_dilettant (Author) Commented:
SOLUTION
=========

Finally I opened a case with MS. They recommended the hotfixes / patches listed below => this did not change much.
In addition, KB978243, which is superseded by KB2280732, was recommended => this solved the issue.

Joe
 

Win32k.sys, Gdi32.dll
http://support.microsoft.com/kb/958690
MS09-006 Vulnerabilities in Windows Kernel could allow remote code execution
09-Feb-2009, dual mode (SP1, SP2), WS03 all platforms

Win32k.sys, Gdi32.dll
http://support.microsoft.com/kb/960228
You receive a Stop error when you click the arrow to scroll down an application pop-up menu on a Windows Server 2003 SP1 or SP2-based computer
21-Nov-2008, dual mode (SP1, SP2), WS03 all platforms

Win32k.sys, Gdi32.dll, Winlogon.exe, Licdll.dll
http://support.microsoft.com/kb/946633
The "Font smoothing" feature has no effect in Windows Server 2003 terminal sessions
Contains: gdi32.dll, winlogon.exe, licdll.dll
22-Apr-2008, dual mode (SP1, SP2), post MS08-025 19-Mar-2008

Termdd.sys
http://support.microsoft.com/default.aspx?scid=kb;EN-US;956438
A Windows Server 2003-based or Windows Server 2008-based terminal server stops accepting new connections, and existing connections stop responding
02-Feb-2009, dual mode (SP1, SP2)

Termdd.sys
http://support.microsoft.com/default.aspx?scid=kb;EN-US;935987
The "File Control Byte/sec" counter of the System performance object displays values that are larger than expected on a computer that is running a 64-bit version of Windows Server 2003
16-Apr-2007, dual mode (SP1, SP2)

Termsrv.dll, Imagehlp.dll
http://support.microsoft.com/default.aspx?scid=kb;EN-US;958476
RDP clients and ICA clients cannot connect to a Windows Server 2003-based terminal server after hotfix 938759 is applied to the server
19-Nov-2008, dual mode (SP1, SP2)

Termsrv.dll
http://support.microsoft.com/default.aspx?scid=kb;EN-US;930045
A Windows Server 2003-based computer stops responding when you shut down the computer in a remote console session
14-Nov-2007, dual mode (SP1, SP2)

Winsrv.dll
http://support.microsoft.com/kb/948928
Stop error message when multiple console applications are opened and closed within a short time frame on a Windows Server 2003-based computer:

universal_dilettant (Author) Commented:
see solution