mbkitmgr (Australia) asked:

Running out of options to troubleshoot an IIS based app

I am keeping this generic, so that someone may suggest something 'out of the box' that leads to a technique/tool/approach I haven't considered.

Scenario
I have a client that uses a 3rd-party IIS-based application running from an internal server, with an MS SQL database underneath.  The SW vendor has had problems with it randomly stalling the server for the last 2 years (and is still stuck trying to resolve it), and it took them 18 months to admit it was affecting other customers too >:(

On occasion it loads up the OS to the point that ASR cuts in and reboots the server, which is handy (but not desirable) given the OS becomes totally inaccessible.  There are times (like today) when I get enough notice that it's having a moment that I can log in and gather information on what is going on.  I usually go to IIS Manager and stop/start the respective app pool.

I have used
  1. Event log information
  2. Perfmon to gather a large amount of data on the W3WP process and subordinates.
  3. IIS to list currently active worker processes and their request-state details for the vendor (a rough capture script is sketched below the list).
  4. A dependency trace on W3WP.exe and the relevant application-specific DLLs to weed out missing/outdated dependencies.
  5. Some of Idera's SQL analysis tools to check on what the DB is "hearing" and doing.
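
For reference, item 3 can also be grabbed from a script at the moment the app stalls - a rough sketch using appcmd from Python (the output file name and the 30-second elapsed filter are just examples, not what I actually ran):

```python
"""Rough sketch: snapshot IIS worker processes and long-running requests
with appcmd.exe (IIS 7+).  Paths and thresholds are illustrative only."""
import subprocess
from datetime import datetime

APPCMD = r"C:\Windows\System32\inetsrv\appcmd.exe"   # default appcmd location

def appcmd(args):
    """Run appcmd with the given arguments and return its text output."""
    result = subprocess.run([APPCMD] + args, capture_output=True, text=True)
    return result.stdout + result.stderr

if __name__ == "__main__":
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    with open(f"iis_snapshot_{stamp}.txt", "w") as log:
        log.write("=== Worker processes ===\n")
        log.write(appcmd(["list", "wp"]))              # one line per w3wp.exe, with its app pool
        log.write("\n=== Requests running longer than 30 s ===\n")
        log.write(appcmd(["list", "requests", "/elapsed:30000"]))  # currently executing requests
```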

So, throw me something I haven't thought of that may be worth a try.  The SW vendor is a decorated MSFT developer, so I expect they've got access to, or have used, tools from MSFT that are much more sophisticated, but I am hoping that, as in times before, I can throw the dev some detail that helps get it some way towards being diagnosed and then rectified.
ASKER CERTIFIED SOLUTION
Gary Patterson, CISSP (United States)

mbkitmgr (ASKER)

Many many thanks to those who have contributed.  

I am the IT contractor that provides support to the client; I specialise in MSFT OSs and VMware, and I don't have much exposure to IIS-based matters.  My aim has been to find other ways to support/encourage/chastise the SW vendor into making sure they are doing everything possible to rectify the issue.  The dependency trace was one thing they hadn't thought of.

I have spoken with the SW vendor a number of times over the last 24 hours (making sure I have their attention) and raised the items mentioned in your responses, asking whether they have tried some of the methods suggested.  The vendor has had to go back to his developer for answers and hasn't got back to me in 16 hours >:(

Sadly, as I warned them, the repeated ASRs on the server appear to have finally taken their toll - after the last crash of the app, pretty much every activity on the same server is having some sort of problem.

Summary
The app is used by a small group of people (six) internally to manage contracts, and to update the management of those contracts on a daily basis.  The vendor says it's written in .NET.  The OS is Windows Server 2008 R2 and it uses SQL Server Express 2012.
The attached screen capture shows the only bit of information I could get before the ASR took over.

The generics of my question
I have on many occasions found applications of any type where the SW vendor has been stuck trying to sort out an issue.  My aim is to see if there is anything else I can do.
  • One example was where a dependency trace showed the client-side code was calling an IE4-related DLL that did not exist on the most recent OSs at the time.  We learnt the vendor was developing the SW on Windows 2000 while we were on Windows 8 waiting to go to 10.  I dug a little further on the server side of things and it was a "dog's breakfast" - a mess.  $250k later, the CEO of the customer pulled the pin and reclaimed all the costs of the project back from the vendor, plus an undisclosed sum for losses in productivity.
  • MSFT were stuck when we had an issue with SMS 2.0 on a Windows 2000 domain.  After a week I spent time digging and found that the SID data for domain users was getting too long, which was affecting Kerberos, which in turn affected SMS 2.0.  MSFT had the fix to us that day (loved Gold support).  They admitted they would never have thought of that being the cause.
It seems peculiar that a recovery option which presumably triggers first - a restart of the service - does not remedy the situation, and that only a reboot does.
It is likely that the wrong thing is being looked at.

If there is a memory leak, a restart of the web publishing service (W3SVC) or the app pool should clear the issue and release the memory.

Memory saturation that builds up over time can be remedied, or at least managed, once the underlying issue is determined.

Presumably the issue arose after many/several years of use.

Please confirm the server configuration/performance settings are set for best performance, prioritizing resources (CPU/memory) for programs.

How is the page file managed - is Windows managing the paging?
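
A rough way to check both of those from a script - a sketch only, using psutil plus the registry (the PagingFiles value name is from memory, so verify it before relying on it):

```python
"""Rough sketch: report physical memory, page-file (swap) usage, and the
configured paging files.  Assumes Windows and the psutil package."""
import psutil
import winreg

# Overall memory and commit/swap usage (rough equivalents of the Memory counters).
vm = psutil.virtual_memory()
sw = psutil.swap_memory()
print(f"Physical: {vm.total/2**30:.1f} GiB total, {vm.available/2**30:.1f} GiB available")
print(f"Swap/commit: {sw.total/2**30:.1f} GiB total, {sw.used/2**30:.1f} GiB used ({sw.percent}%)")

# Paging-file configuration - value name from memory, verify before relying on it.
KEY = r"SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management"
try:
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY) as k:
        paging_files, _ = winreg.QueryValueEx(k, "PagingFiles")
        print("PagingFiles:", paging_files or "(empty - likely system managed)")
except OSError as exc:
    print("Could not read paging-file configuration:", exc)
```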
Charlie Arehart

Again, a failed request trace on the URLs listed in the worker-process screenshot should give you the answers you need.

It will tell you at what step in the processing of the requests things have stopped. That should help you focus attention, rather than be left feeling "out of options" again.

BTW, I don't see the connection between this IIS problem and the ASRs. I've run Windows 2008 servers for a decade and never saw one. I'd think those must indicate some OS-level problem rather than an IIS one. But unless those or the Windows event log offer more detail on what is triggering the ASRs, I would still focus first on the failed request trace (FRT) to see what it may tell you.
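
If it helps to skim the FRT output once you have it, here is a rough sketch that summarises the FREB XML files - the default log folder and the attribute names are from memory, so check them against your server:

```python
"""Rough sketch: summarise IIS failed request trace (FREB) logs so the slow or
stuck URLs stand out.  Log path and attribute names are assumptions - verify locally."""
import glob
import xml.etree.ElementTree as ET

# Default FREB output folder for the first site; adjust the W3SVC id / drive as needed.
LOG_DIR = r"C:\inetpub\logs\FailedReqLogFiles\W3SVC1"

for path in sorted(glob.glob(LOG_DIR + r"\fr*.xml")):
    try:
        root = ET.parse(path).getroot()      # root element carries the summary attributes
    except ET.ParseError:
        continue
    print(path,
          root.get("url"),
          "status", root.get("statusCode"),
          "timeTaken(ms)", root.get("timeTaken"))
```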
I understand that you're looking for a punch list of "things I can capture", but, speaking as a guy who deals with performance problems regularly, "burying the vendor" in a bunch of random data just isn't that helpful.

Targeted, iterative troubleshooting and focused logging are.  Troubleshooting a problem like this is an iterative process. You gather some information, analyze it, and then, based on the results of that analysis, determine "next steps".

Based on the info provided so far:

Windows Server 2008R2, SQL Server Express 2012, .NET application running in IIS - randomly stalling servers, sometimes resulting in ASR.  Also, from your comments, it appears that you have some indication when a stall is about to happen, so I assume this means that performance degrades as you get closer to a halt.

These symptoms are a good fit for a memory leak.  Personally, I'd look for that first.  Memory leaks are easy to diagnose up close, and harder to do at a distance.  Up close, you just need to monitor overall system memory usage, and process memory usage over time.  How much time depends on how "fast" the leak is.

For "faster" leaks, I generally just use the Process Explorer tool ( https://docs.microsoft.com/en-us/sysinternals/downloads/process-explorer ) to spot check private memory usage on the suspect process from time to time.  If it just goes up and up, and never back down, you have a memory leak in that process.  I usually sort by the "Private Bytes" column, and look for processes that tend to just go up over time.  Leaking applications will drift toward the top of the list, the longer they are running and in use.

For slower leaks, you might want to do a daily check on the suspect process - making a note of the Private Bytes usage each time - or set up a performance monitor collection and capture a few days of usage, or until the next outage (a rough polling sketch follows the counter list below).  Capture:

Memory\Available Bytes
Memory\Committed Bytes

For the suspect process(es):

Process\Private Bytes
Process\Page File Bytes
(While you're at it, grab Process\Handle Count, too)
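
If leaving Perfmon collecting is awkward, something along these lines does roughly the same job - a sketch only, assuming the psutil package is available; the CSV columns are my own naming, not Perfmon's:

```python
"""Rough sketch: poll w3wp.exe memory and handle counts to a CSV so a slow
leak shows up as a steady climb.  Assumes Windows, psutil, and admin rights."""
import csv
import time
from datetime import datetime
import psutil

INTERVAL_SECONDS = 300  # sample every 5 minutes; tune to how fast the leak seems

with open("w3wp_memory_log.csv", "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["time", "pid", "private_bytes", "pagefile_bytes", "handles", "avail_sys_bytes"])
    while True:
        avail = psutil.virtual_memory().available
        for proc in psutil.process_iter(["name"]):
            if (proc.info["name"] or "").lower() != "w3wp.exe":
                continue
            try:
                mem = proc.memory_info()
                writer.writerow([
                    datetime.now().isoformat(timespec="seconds"),
                    proc.pid,
                    getattr(mem, "private", mem.rss),   # Private Bytes on Windows
                    getattr(mem, "pagefile", mem.vms),  # Page File Bytes on Windows
                    proc.num_handles(),                 # Handle Count (Windows only)
                    avail,
                ])
            except (psutil.NoSuchProcess, psutil.AccessDenied):
                pass
        f.flush()
        time.sleep(INTERVAL_SECONDS)
```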

You don't mention how long between problems.  Is it typically hours, days, weeks or months, on average?  Undiagnosed memory leaks are often of the "slower" variety, so memory utilization just creeps up slowly over time (but possibly accelerating dramatically at times if the application has an uneven usage pattern with big peaks and valleys in usage).  Eventually bad things happen to the process, or to the system as a whole.

If you have a memory leak, there are additional diagnostic steps that you can follow to help isolate it for the developer.

Bear in mind a few things about memory leaks:

  • Leaks can be in vendor-supplied code, OS code, or third-party tools and utilities used directly or indirectly by the application.  This could  account for why the issue crops up in some shops but not others.
  • Once identified, and while working through a memory leak problem with a vendor, you may be able to alleviate symptoms by doing scheduled application restarts.  This application has a very small user base, which means there is a good chance that there is some time every day where it could be restarted.
  • If the average time between failures is more than a day, an overnight scheduled restart might at least alleviate the symptoms, or most of them, for your users.
  • If the average time between failures is less than a day, scheduled or on-demand restarts might still be a better solution than waiting for a crash.  If possible, coordinate with your users and agree to an automated restart schedule.  Alternately, give a supervisor or power user (if you have one) a script to restart the service on demand when they notice it getting laggy (see the sketch after this list).
  • Leaks in an application can often be isolated to a single line of code.  So how fast an application leaks can be very dependent on usage patterns.  Maybe you go weeks without a problem, but then a user goes through a period where they are heavily using a leaking function, and you go from good to bad in just a few minutes or hours.  This is one reason why they can be tough to isolate and identify.
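
For the on-demand restart idea above, a minimal sketch a power user could run (or that Task Scheduler could run overnight).  It must be run elevated, and the app pool name below is a placeholder - use the real one from IIS Manager:

```python
"""Rough sketch: recycle a single IIS application pool on demand.
The pool name is a placeholder - substitute the real one.  Run elevated."""
import subprocess

APPCMD = r"C:\Windows\System32\inetsrv\appcmd.exe"
APP_POOL = "ContractsAppPool"  # placeholder - check the actual name in IIS Manager

result = subprocess.run(
    [APPCMD, "recycle", "apppool", f"/apppool.name:{APP_POOL}"],
    capture_output=True, text=True,
)
print(result.stdout or result.stderr)
```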

If you diagnose a memory leak (or if you don't) and want some help with next steps, post back.
mbkitmgr, did you ever resolve things? 3 of us offered substantial diagnostic options for you. Did anything work out?
We solved the issue another way: by ditching the product altogether.

I spoke with their developers and came to the conclusion that they had little interest in diagnosing or resolving the issue.  From there I spoke with my client, and they decided the time was right for a change.

Thanks to all of you for your input.  Sorry for the delayed response - I am a sole trader supporting approx. 300 PCs and 30+ servers of different flavours across 20-odd clients.  I had to wait till I was back on the case to update this.