ollybuba

SBS 2011 Standard Locks Up

I have a Windows SBS 2011 Standard server that has worked great since we installed it a few months ago.  Now at different times of the day it seems to freeze up.  We noticed it in a program we use that accesses data on the shares hosted on that server.  It used to happen only every hour on the hour and last for about 10-15 seconds.  Now it freezes up randomly throughout the day, still for only 10-15 seconds at a time.  I have also noticed it when I RDP into the server: it occasionally freezes my screen for a few seconds and then is fine.

This server is one of two virtuals being hosted on Windows Server 2008 R2.  The other virtual is also Windows 2008 R2.  The SBS server has all but the most recent Patch Tuesday's updates applied to it and only has SP1.  It is running as our primary domain controller, Exchange, and SharePoint server.

Does anyone have any ideas as to why the server seems to be freezing or locking up randomly?

Thank you!
Cliff Galiher

Could be memory or disk I/O issues. Can you tell us a little more about the disk setup, VM resources, and any 3rd party apps on the SBS VM?
Shadow Copy or VSS aware application running backups?  Antivirus settings?

I have seen Shadow Protect and Trend Micro causing issues on SBS 2011,  but not as often as you have indicated.
Member_2_4984608

You could start off by monitoring Task Manager and Resource Monitor when the freezes occur - do you see any memory, disk or CPU spikes?
ollybuba

ASKER

I have Cobian Backup 11 and Carbonite running on the server.  Neither of these is set to back up during the business day, which is when we notice it the most.  They both use VSS.  Forefront Client Security is also set not to run during the day, only at 2am.  The only other 3rd party program that is on the server is CrystalDiskMark.

There are 3 hard drives that are Western Digital Caviar Black 2 TB SATA III 7200 RPM 64 MB Cache, WD1002FAEX running in a RAID 5. This VM also has 32 GB of RAM.

I will monitor the Task Manager and Resource Monitor to watch for abnormal spikes when the freezes occur.
Disks...  that's not a particularly fast disk array to run two VMs on - look at disk queue length and IOPS in resource monitor.
The highest Response Time (ms)  I've seen in the 5 minutes I've been watching the disk activity on a file was 415.  Again that was the highest.  Some of them are 0 and 1 and so on... The person that set this server up had these drives configured as a RAID 1 with a hot spare and in the last month I switched them to RAID 5 to try and improve performance.  I do have benchmarks on both the RAID 1 configuration and RAID 5. If you would like to view them let me know.
I agree that the disk configuration is not well suited for two VMs. I suspect most of your problems stem from there. When benchmarking, do not do so from within the VM. You must take your performance measurements from the host to get accurate results.
Test results from my current RAID 5 configuration.

Host, virtuals off (5 test runs)

                        Read (MB/s)    Write (MB/s)
Sequential Read/Write   163.2          24.66
Random 512KB            56.06          23.35
Random 4KB              0.795          0.6235
Random 4KB (QD=32)      2.209          0.649

Host, virtuals on (5 test runs)

                        Read (MB/s)    Write (MB/s)
Sequential Read/Write   155.6          16.65
Random 512KB            21.67          20.935
Random 4KB              0.4805         0.642
Random 4KB (QD=32)      1.9055         0.5925
That doesn't tell anybody much. Not you, not me. Raw performance isn't the problem. The problem is what is happening to those disks when under load. You tested with VMs off. And testing only goes so far. Fire up the VMs. Then watch the performance counters on the host. Especially disk queue counters. I suspect you'll find that the VMs are creating enough traffic that the physical disks are lagging to keep up, and that causes perceived delays within the VM.

SAS 10k is really a minimum for 2 VMs, and keep the host on different spindles. RAID1 for the host and a separate RAID5 for the VHDs is not an uncommon configuration. And if a VM handles high volume writes, like a busy SQL server, putting its VHD(s) on separate disks is not a bad idea. You even see tiered storage at the very high end.

The whole point is that there is a baseline and then paths upwards even for small deployments. And 7200 SATA rarely makes the grade for heavy random reads and writes (and since multiple VMs are running, the reads and writes are always pretty random from the host perspective.) So you take a pretty hefty performance hit.
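For anyone wanting to follow the "watch the performance counters on the host" advice, here is a minimal sketch, not a definitive tool. It assumes the counters were logged to CSV with something like `typeperf "\PhysicalDisk(_Total)\Avg. Disk Queue Length" "\PhysicalDisk(_Total)\Avg. Disk sec/Read" -si 1 -sc 300 -f CSV -o disk.csv`. The function name `summarize_counters`, the host name `HOST`, and the sample rows are made up for illustration and are not measurements from this server.

```python
import csv
import io
from statistics import mean


def summarize_counters(csv_text):
    """Return {counter_name: (min, mean, max)} for a typeperf-style CSV.

    Assumes column 0 is the timestamp and every other column is one counter.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, samples = rows[0], rows[1:]
    summary = {}
    for col in range(1, len(header)):
        values = []
        for row in samples:
            try:
                values.append(float(row[col]))
            except (ValueError, IndexError):
                continue  # blank or missing sample, skip it
        if values:
            summary[header[col]] = (min(values), mean(values), max(values))
    return summary


# Hypothetical log excerpt (illustrative numbers only).
SAMPLE = r'''"(PDH-CSV 4.0)","\\HOST\PhysicalDisk(_Total)\Avg. Disk Queue Length","\\HOST\PhysicalDisk(_Total)\Avg. Disk sec/Read"
"11/13/2012 10:00:01.000","0.5","0.010"
"11/13/2012 10:00:02.000","2.5","0.030"
"11/13/2012 10:00:03.000"," ","0.020"
'''

for name, (low, avg, high) in summarize_counters(SAMPLE).items():
    print(f"{name}: min={low:.3f} avg={avg:.3f} max={high:.3f}")
```

To use it on a real log, read the file produced by typeperf and pass its text to `summarize_counters`. Sustained high maxima across a whole business day are more telling than a single spike.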
I watched the average disk queue on both the host machine and the virtual.  The virtual server's queue ranged from 0-15.043.  The majority of the time it wasn't over 5.  The host server ranged from 0-5.47 and most of the time it was around 1-2.  The users didn't necessarily notice the slowdown when the disk queue was high on the virtual server, but they did when the processor time was above 50%.  Could this percentage be high because it can't access the disks fast enough?
OK, it doesn't look like you have any serious disk thrashing going on from the VM's perspective, so maybe it's not the disks.  (I'm not some sort of mystic remote hardware diagnosis wizard, it just seemed like a plausible theory :))

Another common culprit is third party software - are you running antivirus on the SBS or host?  Mailscanner?  Mail archiver? Line of business applications?

Flaky network switch?
Virtual SBS Programs

Carbonite
Cobian Backup 11 Gravity
CrystalDiskMark 3.0.1c
Microsoft .NET Framework 4 Client Profile
Microsoft .NET Framework 4 Extended
Microsoft Exchange Server 2010
Microsoft Filter Pack 2.0
Microsoft Forefront Client Security Antimalware Service
Microsoft Forefront Client Security State Assessment Service
Microsoft Report Viewer Redistributable 2008 SP1
Microsoft Server Speech Platform Runtime (x64)
Microsoft Server Speech Recognition Language – TELE (en-US)
Microsoft SharePoint Foundation 2010
Microsoft SQL Server 2008 Analysis Services ADOMD.NET
Microsoft SQL Server 2008 R2 (64-bit)
Microsoft SQL Server 2008 Native Client
Microsoft SQL Server 2008 R2 Policies
Microsoft SQL Server 2008 R2 Setup (English)
Microsoft SQL Server 2008 Setup Support Files
Microsoft SQL Server Browser
Microsoft SQL Server Compact 3.5 SP2 ENU
Microsoft SQL Server Compact 3.5 SP2 Query Tools ENU
Microsoft SQL Server VSS Writer
Microsoft Sync Framework Runtime v1.0 (x64)
Microsoft Visual C++ 2010 x64 Redistributable – 10.0.30319
Microsoft Visual Studio Tools for Applications 2.0 – ENU
SQL 2008 R2 Reporting Services SharePoint 2010 Add-in
Windows Server Update Services 3.0 SP2
Windows Small Business Server 2011 Standard


Host Programs

CrystalDiskMark 3.0.1c
Java 7 Update 9
Microsoft .NET Framework 4 Client Profile
Microsoft .NET Framework 4 Extended
Microsoft Forefront Endpoint Protection
PowerAlert Local Software
Hmm, none of the usual culprits there really.

IIRC Carbonite was specifically implicated in SBS2011 issues in another Experts Exchange question recently - obviously you want to be backed up, but are you able to test with all Carbonite services disabled?  I'll see if I can find the question...
I have tried disabling all services but it automatically restarts them so I'm not sure if maybe I'm missing one.
How the hell does it automatically restart disabled services?  That's not normal is it?
Yes, when I disabled it and then stopped it, it stayed stopped.  Before, I just stopped it and it would restart.  My bad.  It has shown a decrease in CPU usage on our virtual machine and I will continue to review it today.
Would this be a sign of possible thrashing if it is coming from the host server or is it just working hard?
Possible-Thrashing.jpg
Doesn't look that bad to me.  If your server's locking up due to disk load I'd expect sustained high disk queue lengths, and more than 2 or 3 operations queued.

I hijacked the below from another Experts-Exchange post by someone far more experienced and knowledgeable than I:

It's best not to look at Queue Length in isolation (although 375 looks a bit grim...). You also need to take a look at Response Time (Physical Disk -> Avg Disk sec/Read and Write) and IOPS (I/O Per Second) (Physical Disk -> Disk Reads/Sec and Writes/Sec).
If Queue length is high AND Response Time is high AND IOPS is high, then you have a disk performance problem.

The rules of thumb:
 - Response time should be less than 20 ms in a perfect world, and once it starts climbing to 40 ms and beyond, you're going to start seeing performance degradation.

 - Expect 3 to 4 queued I/Os per disk in a RAID set.

 - Expect 180 IOPS per 15K rpm SAS drive, 120 IOPS per 10K rpm SAS drive and 60 - 80 IOPS per 7200rpm SATA drive. This is the steady state performance, and drives will typically hit 2 to 2.5 times that figure per drive before you see a performance impact. So: if you have 4 10K SAS drives in a RAID 5 group, you'd expect between 480 IOPS and 1200 IOPS no worries. Once you got above 1200 IOPS, though, things are going to grind to a halt. If, on the other hand, you've got a mirrored pair of 2TB SATA drives, you're looking at 80 to 200 IOPS. Ouch.
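The arithmetic in those rules of thumb can be sketched as below. This is an illustration under stated assumptions, not a sizing tool: the function names are hypothetical, the thresholds (4 queued I/Os per disk, 40 ms, 2.5 times steady state) are one reading of the figures quoted above, and the calculation follows the post's simple per-drive multiplication, so it ignores RAID write penalties and controller cache.

```python
def expected_iops(per_drive_iops, drives, burst_factor=2.5):
    """Return (steady_state, ceiling) IOPS for a group of identical drives.

    Steady state is per-drive IOPS times drive count; the ceiling applies
    the quoted "2 to 2.5 times" burst figure (upper bound by default).
    """
    steady = per_drive_iops * drives
    return steady, steady * burst_factor


def looks_disk_bound(queue_len, response_ms, iops, drives, per_drive_iops):
    """True only if queue length AND response time AND IOPS are all high.

    Thresholds are assumptions taken from the rules of thumb above.
    """
    queue_high = queue_len > 4 * drives          # expect 3 to 4 queued I/Os per disk
    latency_high = response_ms >= 40             # degradation from about 40 ms
    _, ceiling = expected_iops(per_drive_iops, drives)
    iops_high = iops > ceiling
    return queue_high and latency_high and iops_high


# The worked example from the post: four 10K SAS drives at 120 IOPS each.
print(expected_iops(120, 4))

# Hypothetical readings for three 7200 rpm SATA drives at 80 IOPS each.
print(looks_disk_bound(queue_len=20, response_ms=55, iops=700,
                       drives=3, per_drive_iops=80))
```

The first call reproduces the 480 to 1200 IOPS range given for the four-drive SAS example. The second shows why a high queue length alone is not treated as proof of a disk bottleneck: all three conditions have to hold.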

Tracking down the source of the high queue depths is going to be tricky. Anything that's waiting on I/O is likely to have low processor utilisation, so Task Manager won't be a huge amount of help. Process Monitor from Sysinternals might be a good start for debugging (http://technet.microsoft.com/en-us/sysinternals/bb545046), but you may just have to pony up some cash for more disk drives, depending on what you find.
I'm not sure if you can interpret these results any better than I can.  The server doesn't seem too bad today so maybe it was a bad day to test.
Servers-I-O-Test-11-13-12.xls
Hi ollybuba, the figures look fine, I retract my statement about the disks.

How's it going with carbonite disabled?
It helps maybe a little on the SBS server.  The % Processor Time will still get high (over 55-60) at times and its average is about 27-28 I'd say.  Our program also locks up when we hit at or above 60%.  Is there any way with Resource Monitor to see what would make it spike?  Or would I just have to use the program from the link that you provided above?  The host machine really doesn't change with the disabling of Carbonite.
ASKER CERTIFIED SOLUTION
Member_2_4984608

When I watched for a spike these are some of the results I got for the top 5 processes.

System Idle Process                       50
Microsoft.Exchange.Search.ExSearch.exe    25
CarboniteService.exe                      22
lsass.exe                                 02
explorer.exe                              00

System Idle Process                       59
Microsoft.Exchange.Search.ExSearch.exe    25
CarboniteService.exe                      13
msexchangerepl.exe                        01
lsass.exe                                 01

System Idle Process                       46
Microsoft.Exchange.Search.ExSearch.exe    25
CarboniteService.exe                      21
lsass.exe                                 03
System                                    03

System Idle Process                       65
Microsoft.Exchange.Search.ExSearch.exe    25
CarboniteService.exe                      09
lsass.exe                                 01
explorer.exe                              01

You get the pattern....
Looks perfectly normal...  The implication is that the problem is a faulty program / process rather than resource exhaustion.  To clarify, can you confirm that with Carbonite services disabled you still experience freezing / locking up?
The only freezing or locking up that the users experience now is right around every hour.  The random freezing seems to have stopped.  I watched the processor again and the vds.exe process seems to top the CPU out at 90-100% utilization.  Any ideas with that one?
vds.exe on the host or SBS high CPU usage?

All I could find on Google is this post with pretty much the exact same symptoms (but no resolution): "Server 2008 slowdown every hour, on the hour. VDS.exe using an entire CPU core"

What make is the server / RAID card?
It's a slowdown on the virtual SBS 2011.  This is all running on an Intel Modular Server.
Buggered if I know, I'm afraid.  If the problem is as regular and predictable as you describe, the next step would be to investigate/disable any hourly scheduled tasks  (on my SBS2011 I have 2 hourly scheduled tasks: Database One Copy Alert (Exchange DAG check) and a bloody Google update task).  Also (obv.) check event logs around the time of the problem.

Why the Virtual Disk Service on your VM is hammering the vCPU I dunno though.  Does Hyper-V do something every hour that confuses the VHDs?  (I haven't used Hyper-V)
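One way to confirm that the freezes really do line up with the top of the hour, before chasing scheduled tasks, is to note when each freeze happens and measure how far it falls from the nearest hour. The sketch below is only an illustration: the function names, the 2 minute tolerance, and the timestamps are all made up, not taken from this server.

```python
from datetime import datetime


def minutes_from_hour(ts):
    """Distance in minutes from the nearest top of the hour (0 to 30)."""
    minutes = ts.minute + ts.second / 60
    return min(minutes, 60 - minutes)


def hourly_fraction(timestamps, tolerance_min=2.0):
    """Fraction of events within tolerance_min of the top of an hour."""
    if not timestamps:
        return 0.0
    near = sum(1 for ts in timestamps
               if minutes_from_hour(ts) <= tolerance_min)
    return near / len(timestamps)


# Hypothetical freeze times noted by users (illustrative only).
freezes = [
    datetime(2012, 11, 13, 9, 0, 10),
    datetime(2012, 11, 13, 10, 0, 5),
    datetime(2012, 11, 13, 10, 59, 50),
    datetime(2012, 11, 13, 13, 27, 0),
]

print(hourly_fraction(freezes))
```

A fraction close to 1 points towards something scheduled hourly, which is where the scheduled task list and the event logs around those times become the natural next place to look.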
ollybuba, did you eventually get the freezing to stop?