Solved

Win 2008 R2 VM BSoD after high CPU/IO utilization during an AV scan

Posted on 2014-09-04
Last Modified: 2014-09-23
We have a VM with guest OS Win 2008 R2 which BSoD'ed after it had been
under high CPU/IO load (CPU utilization of about 50-65%) for more than an hour.

The real-time Trend Micro Deep Security anti-malware scan process
was identified as the main CPU/IO consumer. According to TrendM's diagnostic
logs, it was scanning a folder with several huge zip files (each 0.5 to
1.5 GB in size) when the VM BSoD'ed: refer to the attached blue screen.

TrendM analysed the memory crash dump and didn't find any of its
code in the memory dump.

Q1:
Any idea what the root cause is, based on the attached blue screen?
I may need to convert the Event Viewer logs to CSV format and sanitize
them before I can attach them here.
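
For the sanitizing step mentioned above, a minimal Python sketch could look like the following. It assumes the Event Viewer log has already been exported to CSV, and it only masks IPv4-looking strings plus a caller-supplied list of sensitive names. The function names and the sample hostname are hypothetical, and the IP pattern is deliberately crude, so the output should still be reviewed by eye before posting.

```python
import csv
import re

# Crude IPv4 matcher; an assumption for illustration, not a strict validator.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def sanitize_text(text, secrets):
    """Mask IPv4-looking strings and any caller-supplied sensitive names."""
    text = IPV4.sub("x.x.x.x", text)
    for index, secret in enumerate(secrets, start=1):
        # Case-insensitive so FILESRV01 and filesrv01 are both masked.
        text = re.sub(re.escape(secret), f"REDACTED{index}", text,
                      flags=re.IGNORECASE)
    return text


def sanitize_csv(src, dst, secrets):
    """Copy CSV rows from src to dst, sanitizing every cell."""
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        writer.writerow([sanitize_text(cell, secrets) for cell in row])
```

Typical use would be to open the exported file and a new output file with `newline=""` and pass both to `sanitize_csv` along with the hostnames, domain names and user names to hide.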

Q2:
Can't attach the large memory dump as it may contain sensitive info.
I don't have MS support service.  Is there any other way I can get
the memory dump analysed?

Q3:
Someone told me it's a common occurrence that a VM (on the VMware
hypervisor) under high CPU/resource utilization will BSoD at some
point.  Has anyone encountered this, or is there a KB that documents it?
Question by:sunhux

Author Comment

by:sunhux
http://esupport.trendmicro.com/solution/en-US/1096120.aspx

The above URL contains a statement indicating a BSOD when
scanning a large number of files (quoted below).  How can I determine whether my BSOD is
due to this scenario?  Our VM has 8GB RAM.

With DSVA’s 4GB minimum memory requirement, BSOD sometimes occurs during continuous scanning of a large number of files. This is an issue with the VMware vShield Endpoint Driver. It has been reported to VMware for further investigation.
 

 
LVL 34

Accepted Solution

by:
Seth Simmons earned 500 total points
> Any idea what is the root cause based on the attached blue screen?

A stop 0x24 is NTFS_FILE_SYSTEM.  How many files are we dealing with here?
Have you run chkdsk manually on that volume, or has the system run it automatically on boot?

> Is there any other way I can get the memory dump analysed?

You can try WhoCrashed or use DumpChk.

WhoCrashed Introduction
http://www.resplendence.com/whocrashed

DumpChk
http://msdn.microsoft.com/en-us/library/windows/hardware/ff542776(v=vs.85).aspx

> Anyone encounters this or has a KB to show this?

I personally haven't seen it.  At one place where I worked there was an XP guest that had a process causing a CPU spike.  It was like that for months (the owner didn't bother with it) but it never crashed; it just triggered CPU usage alerts in vCenter.

> How can I isolate if my situation is similar to what's described in the above links' cases?

That first article is nearly 2 years old, so it may have been fixed by now.  The VMware article refers to a different stop code, so it probably doesn't apply in your case.

What version of VMware are you using?  The article did cite the fact that the issue was resolved with an ESX 5.0 patch.
 

Author Comment

by:sunhux
> How many files are we dealing with here?
I can't access the VM, but I wouldn't be surprised if each of the huge
zip files contains hundreds of thousands of files.

We are not vSphere 5.0 Update 1
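
On the question of how many files the zips contain: once copies of the archives are available on a test machine, a small Python sketch like the one below could count the entries without extracting anything. This is only an illustration; `zip_summary` is a hypothetical helper, and the optional CRC check reads every member, so it will be slow on 0.5 to 1.5 GB archives.

```python
import zipfile


def zip_summary(path, check_crc=False):
    """Return entry count and total uncompressed size for one zip archive.

    path may be a filename or an open binary file object.
    With check_crc=True, every member is read and CRC-checked; the name of
    the first bad member is returned, or None if all members pass.
    """
    with zipfile.ZipFile(path) as archive:
        files = [info for info in archive.infolist() if not info.is_dir()]
        return {
            "entries": len(files),
            "uncompressed_bytes": sum(info.file_size for info in files),
            "first_bad_member": archive.testzip() if check_crc else None,
        }
```

A badly damaged archive may fail to open at all and raise `zipfile.BadZipFile`, which is itself useful information when checking for a corrupt zip.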
 

Author Comment

by:sunhux
How could we determine if TrendM's deep security is creating lots of temporary
files while scanning those huge zip files?

No, we did not run chkdsk on that E: volume.

Ever since we disabled the AV scan for that drive, we have not had any high CPU
or BSoD.  Perhaps I should ask my customer to send me those huge zips and
let me attempt to scan them on a test VM with Deep Security, as that customer's
VM is production.
 

Author Comment

by:sunhux
I can't run WhoCrashed on that Prod VM: if I copy the memory dump
over to my laptop, how can I make WhoCrashed read this memory dump
and analyse it?  Or is there any other tool that could do this?
 
LVL 34

Expert Comment

by:Seth Simmons
Have you tried contacting Trend support?
I've used a number of their products over the years, but not Deep Security.

You can install the SDK, which includes dumpchk, on your notebook.
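
As a rough sketch of analysing a copied dump offline: the Debugging Tools for Windows that come with the SDK also include the console debugger `cdb.exe`, which can open a dump with `-z` and run `!analyze -v` non-interactively. The Python wrapper below only assembles and runs that command line. The dump location, the local symbol cache folder `C:\symbols` and the helper names are assumptions, and `cdb.exe` must be on the PATH or given by full path.

```python
import subprocess

# Microsoft public symbol server with a local cache folder (the cache path
# C:\symbols is an assumption; any writable folder works).
MS_SYMBOLS = r"srv*C:\symbols*https://msdl.microsoft.com/download/symbols"


def build_cdb_command(dump_path, cdb_exe="cdb.exe", symbol_path=MS_SYMBOLS):
    """Build a cdb command line that opens a dump, runs !analyze -v, and quits."""
    return [cdb_exe, "-z", dump_path, "-y", symbol_path, "-c", "!analyze -v; q"]


def analyze_dump(dump_path, **kwargs):
    """Run cdb against the dump and return its text output."""
    result = subprocess.run(build_cdb_command(dump_path, **kwargs),
                            capture_output=True, text=True, check=False)
    return result.stdout
```

For example, `print(analyze_dump(r"C:\dumps\MEMORY.DMP"))` on the laptop would print the same kind of bugcheck analysis without touching the production VM. The laptop needs Internet access the first time so the symbols can be downloaded.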
 

Author Comment

by:sunhux
One of my colleagues ran an analyser against the dump and got the following.
I don't know how to interpret it, so I've appended it below.

So what's the root cause, and how do we go about fixing it permanently
(i.e. a preventive measure)?

========================================================

BugCheck 24, {1904fb, fffff88001fce4b8, fffff88001fcdd10, fffff88001435bc4}

Probably caused by : Ntfs.sys ( Ntfs!NtfsAddToWorkqueInternal+84 )

Followup: MachineOwner
---------

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

NTFS_FILE_SYSTEM (24)
    If you see NtfsExceptionFilter on the stack then the 2nd and 3rd
    parameters are the exception record and context record. Do a .cxr
    on the 3rd parameter and then kb to obtain a more informative stack
    trace.
Arguments:
Arg1: 00000000001904fb
Arg2: fffff88001fce4b8
Arg3: fffff88001fcdd10
Arg4: fffff88001435bc4

Debugging Details:
------------------


EXCEPTION_RECORD:  fffff88001fce4b8 -- (.exr 0xfffff88001fce4b8)
ExceptionAddress: fffff88001435bc4 (Ntfs!NtfsAddToWorkqueInternal+0x0000000000000084)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 0000000000000000
   Parameter[1]: 00000000000014a8
Attempt to read from address 00000000000014a8

CONTEXT:  fffff88001fcdd10 -- (.cxr 0xfffff88001fcdd10)
rax=0000000000000002 rbx=0000000000000000 rcx=fffffa80cb53a510
rdx=fffffa80cb53a510 rsi=fffffa80cc7634c0 rdi=0000000000000000
rip=fffff88001435bc4 rsp=fffff88001fce6f0 rbp=00000000000014a8
 r8=0000000000000000  r9=0000000000000001 r10=fffffa80c150ee80
r11=fffffa80cce57090 r12=0000000000000000 r13=0000000017fd0900
r14=fffff80001d11900 r15=fffffa80cb53a510
iopl=0         nv up ei pl zr na po nc
cs=0010  ss=0018  ds=002b  es=002b  fs=0053  gs=002b             efl=00010246
Ntfs!NtfsAddToWorkqueInternal+0x84:
fffff880`01435bc4 817d00e8030000  cmp     dword ptr [rbp],3E8h ss:0018:00000000`000014a8=????????
Resetting default scope

DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

PROCESS_NAME:  System

CURRENT_IRQL:  0

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at 0x%08lx referenced memory at 0x%08lx. The memory could not be %s.

EXCEPTION_PARAMETER1:  0000000000000000

EXCEPTION_PARAMETER2:  00000000000014a8

READ_ADDRESS:  00000000000014a8

FOLLOWUP_IP:
Ntfs!NtfsAddToWorkqueInternal+84
fffff880`01435bc4 817d00e8030000  cmp     dword ptr [rbp],3E8h

FAULTING_IP:
Ntfs!NtfsAddToWorkqueInternal+84
fffff880`01435bc4 817d00e8030000  cmp     dword ptr [rbp],3E8h

BUGCHECK_STR:  0x24

LAST_CONTROL_TRANSFER:  from fffff8800152954e to fffff88001435bc4

STACK_TEXT:  
fffff880`01fce6f0 fffff880`0152954e : 00000000`00000000 00000000`00000000 fffffa80`cce57300 fffffa80`cce57090 : Ntfs!NtfsAddToWorkqueInternal+0x84
fffff880`01fce760 fffff800`01a8aa77 : fffff800`01c524f0 fffff8a0`17fd09a0 fffff8a0`17fd0900 fffffa80`cb53a510 : Ntfs! ?? ::NNGAKEGL::`string'+0xdf4e
fffff880`01fce7a0 fffff800`01be89f1 : fffff8a0`17fd09a0 00000000`00000000 fffff880`01fceb20 fffff8a0`17fd09a0 : nt!FsRtlpRemoveAndCompleteWaitingIrp+0x123
fffff880`01fce800 fffff800`01e26091 : fffff8a0`17fd09a0 fffffa80`cce573a0 fffff8a0`15d8bb40 00000000`00000000 : nt!FsRtlpAcknowledgeOplockBreakByCacheFlags+0x7d1
fffff880`01fce8e0 fffff880`014b21d6 : fffff8a0`15d8bc70 fffff880`00000000 00000000`00090240 00000000`00000001 : nt! ?? ::NNGAKEGL::`string'+0x39b53
fffff880`01fce970 fffff880`014ebf8e : fffffa80`df915e40 fffffa80`cce57090 fffffa80`00000000 fffff880`00000000 : Ntfs!NtfsOplockRequest+0x176
fffff880`01fcea00 fffff880`014ec33d : fffffa80`df915e40 00000000`00000000 fffff880`01fceb20 00000000`00000000 : Ntfs!NtfsUserFsRequest+0x7e
fffff880`01fcea40 fffff880`01270bcf : fffff880`01fceb90 fffffa80`cce57090 fffff880`01fceb00 fffffa80`df915e40 : Ntfs!NtfsFsdFileSystemControl+0x13d
fffff880`01fceae0 fffff880`0129095e : fffffa80`d00dc7e0 00000000`00000001 fffffa80`d00dc700 fffffa80`cce57090 : fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted+0x24f
fffff880`01fceb70 fffff880`03c19b49 : fffffa80`cce573a0 00000000`00000000 fffffa80`c7a48760 01cfc691`5f03db20 : fltmgr!FltpFsControl+0xee
fffff880`01fcebd0 fffff880`03c0e5ec : fffffa80`d00dc7e0 01cfc691`5f03db20 fffff800`01c7c200 fffffa80`c7a48760 : srv2!Smb2LeaseAcknowledge+0x239
fffff880`01fcec30 fffff800`01ae0261 : fffff880`03c07000 fffff800`01c7c280 fffffa80`c14ef040 fffffa80`00000003 : srv2! ?? ::FNODOBFM::`string'+0x418a
fffff880`01fcecb0 fffff800`01d74bae : 00000318`878b4822 fffffa80`c14ef040 00000000`00000080 fffffa80`c14df890 : nt!ExpWorkerThread+0x111
fffff880`01fced40 fffff800`01ac78c6 : fffff880`009bf180 fffffa80`c14ef040 fffff880`009ca0c0 3070ba0f`00000080 : nt!PspSystemThreadStartup+0x5a
fffff880`01fced80 00000000`00000000 : fffff880`01fcf000 fffff880`01fc9000 fffff880`01fce9e0 00000000`00000000 : nt!KiStartSystemThread+0x16


SYMBOL_STACK_INDEX:  0

SYMBOL_NAME:  Ntfs!NtfsAddToWorkqueInternal+84

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: Ntfs

IMAGE_NAME:  Ntfs.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  5167f5fc

STACK_COMMAND:  .cxr 0xfffff88001fcdd10 ; kb

FAILURE_BUCKET_ID:  X64_0x24_Ntfs!NtfsAddToWorkqueInternal+84

BUCKET_ID:  X64_0x24_Ntfs!NtfsAddToWorkqueInternal+84

Followup: MachineOwner
---------

0: kd> lmvm Ntfs
start             end                 module name
fffff880`01432000 fffff880`015d4000   Ntfs       (pdb symbols)          downstreamstore\ntfs.pdb\0842A8FED1C5463FB4078078781F5C622\ntfs.pdb
    Loaded symbol image file: Ntfs.sys
    Image path: \SystemRoot\System32\Drivers\Ntfs.sys
    Image name: Ntfs.sys
    Timestamp:        Fri Apr 12 19:54:36 2013 (5167F5FC)
    CheckSum:         001A27D8
    ImageSize:        001A2000
    File version:     6.1.7601.18127
    Product version:  6.1.7601.18127
    File flags:       0 (Mask 3F)
    File OS:          40004 NT Win32
    File type:        3.7 Driver
    File date:        00000000.00000000
    Translations:     0409.04b0
    CompanyName:      Microsoft Corporation
    ProductName:      Microsoft® Windows® Operating System
    InternalName:     ntfs.sys
    OriginalFilename: ntfs.sys
    ProductVersion:   6.1.7601.18127
    FileVersion:      6.1.7601.18127 (win7sp1_gdr.130412-0013)
    FileDescription:  NT File System Driver
    LegalCopyright:   © Microsoft Corporation. All rights reserved.
 
LVL 34

Expert Comment

by:Seth Simmons
Everything points to the file system driver (ntfs.sys), so it could be something with the zip files.
Perhaps it is coming across a zip file that is corrupt or that contains a large number of files/folders?
If you're able to set up a test environment, it's worth trying to reproduce.
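
To make the "points to ntfs.sys" reading concrete, here is a small Python sketch that uses only values copied from the dump output posted above. It checks that the faulting address lies inside the Ntfs module range reported by `lmvm Ntfs` and lists the distinct modules on the crashing thread's stack. The helper names are made up for illustration, and the frame list is a hand-picked subset of STACK_TEXT.

```python
def parse_addr(text):
    """Convert a WinDbg-style address such as fffff880`01432000 to an int."""
    return int(text.replace("`", ""), 16)


def stack_modules(frames):
    """Return the distinct module names from 'module!symbol' frames, in order."""
    seen = []
    for frame in frames:
        module = frame.split("!", 1)[0]
        if module not in seen:
            seen.append(module)
    return seen


# Values copied from the posted analysis (lmvm Ntfs and the exception record).
NTFS_START = parse_addr("fffff880`01432000")
NTFS_END = parse_addr("fffff880`015d4000")
FAULTING_IP = parse_addr("fffff88001435bc4")

# Hand-picked subset of the STACK_TEXT frames from the same output.
FRAMES = [
    "Ntfs!NtfsAddToWorkqueInternal+0x84",
    "nt!FsRtlpRemoveAndCompleteWaitingIrp+0x123",
    "nt!FsRtlpAcknowledgeOplockBreakByCacheFlags+0x7d1",
    "Ntfs!NtfsOplockRequest+0x176",
    "fltmgr!FltpFsControl+0xee",
    "srv2!Smb2LeaseAcknowledge+0x239",
    "nt!ExpWorkerThread+0x111",
]

fault_in_ntfs = NTFS_START <= FAULTING_IP < NTFS_END
offset_in_ntfs = FAULTING_IP - NTFS_START
modules = stack_modules(FRAMES)
```

The fault is inside Ntfs.sys, and the named frames belong to Ntfs, nt, fltmgr and srv2, which is consistent with TrendM not finding its own code on this stack. That alone does not prove a filter driver played no part. The srv2!Smb2LeaseAcknowledge frame suggests the thread was handling an SMB lease/oplock acknowledgement at the time, which may be worth raising with the vendors.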
 

Author Comment

by:sunhux
Not that easy to clone that VM as it's huge with a few RDM LUNs.
We don't have the luxury of getting so much storage.

I'll post the event viewer logs in 20 hours
 
LVL 34

Expert Comment

by:Seth Simmons
any update on this?