Solved

How to find out which process was using system resources when a system was found hung?

Posted on 2014-01-23
Last Modified: 2014-02-04
Hi Experts,

Many times I have found systems in a hung state, and at that point I was unable to execute any command to see what exactly was keeping the resources busy.

Can someone please help me understand how we can find out which processes were responsible for making the system busy?

Any help will be highly appreciated.

Thanks,
SA
Question by:Sandy
10 Comments
 
LVL 8

Expert Comment

by:Surrano
ID: 39802467
If this is a recurring problem, then the best approach would be to save some diagnostic information every N minutes. I'd include:
- top
- iotop
- iftop
- lsof
So write a cron job that saves this output, e.g. every 10 minutes, and keeps the results of the last 30 runs. After a hang, you'll have 5 hours to log in and find out who and why.
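A minimal sketch of such a job, in Python, under a few assumptions: the directory /var/tmp/hang-snapshots and the function names are made up for illustration, the batch-mode flags shown are the ones commonly documented for these tools (older iftop builds may lack text mode), and iotop and iftop usually need root. A tool that is missing or times out is recorded as a note instead of aborting the run.

```python
import subprocess
import time
from pathlib import Path

# Hypothetical defaults; adjust to your own system.
SNAPSHOT_DIR = Path("/var/tmp/hang-snapshots")
KEEP = 30  # 30 runs at 10-minute intervals is roughly 5 hours of history

# Commands to capture in non-interactive mode (flags are assumptions
# about the installed versions; check the local man pages).
COMMANDS = {
    "top": ["top", "-b", "-n", "1"],
    "iotop": ["iotop", "-b", "-n", "1"],
    "iftop": ["iftop", "-t", "-s", "5"],
    "lsof": ["lsof", "-n"],
}


def capture(cmd, timeout=60):
    """Run one command and return its output, or a note if it failed."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, errors="replace", timeout=timeout
        )
        return result.stdout + result.stderr
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"[{' '.join(cmd)} failed: {exc}]\n"


def prune(directory, keep):
    """Delete the oldest snapshot files so that at most `keep` remain."""
    snapshots = sorted(directory.glob("snapshot-*.txt"))
    excess = snapshots[:-keep] if keep > 0 else snapshots
    for old in excess:
        old.unlink()
    return [p.name for p in excess]


def take_snapshot(directory=SNAPSHOT_DIR, commands=COMMANDS, keep=KEEP):
    """Write one timestamped snapshot file, then drop the oldest ones."""
    directory.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")  # sorts chronologically by name
    path = directory / f"snapshot-{stamp}.txt"
    with path.open("w") as out:
        for name, cmd in commands.items():
            out.write(f"===== {name} =====\n")
            out.write(capture(cmd))
            out.write("\n")
    prune(directory, keep)
    return path
```

A small script that calls take_snapshot() can then be scheduled from cron with a `*/10 * * * *` entry. Keeping the files on a local disk means they survive a reboot, although a hang that blocks disk writes may also block the snapshot itself.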

Another possibility would be quotas on CPU usage and the like, but that would need at least some hints about what was going on, and even then my expertise falls short there...
 
LVL 13

Author Comment

by:Sandy
ID: 39802479
Thanks Surrano. This can be done if the issue recurs, but I am more concerned about what happened to the system in the past :(

TY/SA
 
LVL 8

Expert Comment

by:Surrano
ID: 39802631
Well, then search your system for files that were modified during the hang... or any time later.
- Log files, e.g. syslog, wtmpx, etc.
- Core files or other crash-dump-like artifacts
The mere identity of these files, or their contents, may give you some hints.
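As a rough sketch of that search, assuming you know approximately when the hang started and ended, something like the following Python lists regular files whose modification time falls inside a window. The function name and the example directory are hypothetical; unreadable paths are silently skipped.

```python
import os
import stat
from datetime import datetime


def files_modified_between(root, start, end):
    """Return (path, mtime) pairs for regular files under `root`
    whose modification time lies within [start, end], oldest first."""
    lo, hi = start.timestamp(), end.timestamp()
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is not accessible
            if stat.S_ISREG(info.st_mode) and lo <= info.st_mtime <= hi:
                hits.append((path, datetime.fromtimestamp(info.st_mtime)))
    hits.sort(key=lambda item: item[1])
    return hits
```

For example, files_modified_between("/var/log", datetime(2014, 1, 22, 2, 0), datetime(2014, 1, 22, 4, 0)) would list whatever was written under /var/log in that two-hour window (the dates here are placeholders). Passing a very late end time covers the "or any time later" case.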

If collection of system activity records (sar) was turned on then you can check things like amount of device I/O, semaphores, etc. E.g. if you find that there was a peak on one of the logical volumes that contains a database index file for table X you should check the use cases that access that particular table / index. If that device contains only Apache logs then you should check your webserver. Etc.
 
LVL 13

Author Comment

by:Sandy
ID: 39802641
Definitely, I can search the application-specific logs for that, but I want to be sure exactly which process caused this, because many times we as sysadmins are not allowed to look into app logs, and sometimes they are written in a way that a sysadmin cannot understand.

TY/SA
 
LVL 8

Expert Comment

by:Surrano
ID: 39802785
I wasn't thinking about apps, more about the OS / OEM level. If you find something there, it would give you the leverage to explain why you need to look into the app logs, or to ask the app support people to have a look, based on a strong suspicion that the app caused the hang, and let them prove that it did not.
 
LVL 13

Author Comment

by:Sandy
ID: 39831632
Is there any other possible way?
 
LVL 8

Expert Comment

by:Surrano
ID: 39831770
Can you inspect the quality of water in a flow in a retrospective way? Currently it has no cyanide, but did it have cyanide one hour / day / year ago? Not unless cyanide left some telltale sign. It's up to you to find those signs, and, especially, up to you to develop proactive measures like generating those signs automatically if the problem occurs.
 
LVL 13

Author Comment

by:Sandy
ID: 39831800
Instead of "you", as you mentioned, I believe in "us"...

I am trying to work out a preventive approach from those signs, but I am seeking your expert advice on whether we can make it better. I hope I'm not bothering you :)

TY/SA
 
LVL 8

Accepted Solution

by:Surrano
ID: 39831856
I didn't mean you personally; I meant the generic "you" in English ("man" in German)... So *we* are looking for a needle in a haystack. This is similar to what I do for a living, but there I usually have more information than simply "something crashed". Even then, I'm so familiar with *our* systems that I know all the OS/OEM/APP layers inside out.

As a next step I'd ask for a broad set of symptoms, such as:
- /var/log/messages and syslog
- /var/adm/sa/sa* files
- Webserver logs
- Database logs
- Application logs
- lsof output
- iostat output (with various flags)
And if that still doesn't help, then login data and other descriptive information about the time and duration of the hang. If you can provide these within the limitations of EE (which I doubt), then I could help with the analysis. Otherwise, all you can do is capture all these symptoms, e.g. every hour, until the next crash. Always keep at least the last 24 hours, and then you'll be able to compare.

As a rule of thumb, performance is relative. A hang can be assessed only if you can compare the performance data collected during/immediately before the hang to a baseline collected during normal operation.
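To make that rule of thumb concrete, here is a hedged Python sketch of one way to do the comparison: it flags any metric whose value near the hang lies more than a chosen number of standard deviations from the mean of the baseline samples. The metric names, the threshold of 3 and the function name are illustrative assumptions; how the numbers are extracted from sar or iostat output is left out.

```python
from statistics import mean, stdev


def deviations(baseline, observed, threshold=3.0):
    """Compare one observation per metric against baseline samples.

    `baseline` maps a metric name to a list of values from normal operation;
    `observed` maps a metric name to the value seen around the hang.
    Returns {metric: z_score} for metrics more than `threshold`
    standard deviations away from the baseline mean.
    """
    flagged = {}
    for metric, samples in baseline.items():
        if metric not in observed or len(samples) < 2:
            continue  # nothing to compare, or too few samples for a deviation
        mu, sigma = mean(samples), stdev(samples)
        if sigma == 0:
            continue  # perfectly flat baseline; a z-score is undefined
        z = (observed[metric] - mu) / sigma
        if abs(z) > threshold:
            flagged[metric] = round(z, 1)
    return flagged
```

With a baseline such as {"iowait_pct": [2, 3, 2, 3, 2, 3], "load1": [1.0, 1.1, 0.9, 1.0]} and an observation of {"iowait_pct": 60, "load1": 1.05}, only iowait_pct is flagged. A simple z-score assumes the baseline is reasonably stable; for metrics with strong daily cycles, comparing against the same hour on a normal day is likely a better baseline.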
 
LVL 13

Author Closing Comment

by:Sandy
ID: 39831897
Thanks for your help Surrano...  Vielen Dank