How can I find out which process was using system resources when a system was found hung?

Hi Experts,

Many times I have found systems in a hung state, and at that point I was unable to execute any command to see what exactly was keeping the resources busy.

Can someone please help me understand how we can find out which processes were responsible for making the system busy?

Any help will be highly appreciated.

Thanks,
SA
Sandy asked:
 
Surrano (System Engineer) commented:
I didn't mean "you" personally; I meant the generic "you" of English ("man" in German). So *we* are looking for a needle in a haystack. This is similar to what I do for a living as an employee, but there I usually have more information than simply "something crashed". Even then, I'm so familiar with *our* systems that I know all the OS/OEM/APP layers inside out.

As a next step I'd ask for a broad set of symptoms, such as:
- /var/log/messages and syslog
- /var/adm/sa/sa* files
- Webserver logs
- Database logs
- Application logs
- lsof output
- iostat output (with various flags)
And if that still doesn't help, then login data and other descriptive information about the time and duration of the hang. If you can provide these within the limitations of EE (which I doubt), then I could help with the analysis. Otherwise, all you can do is capture all these symptoms, e.g. every hour, until the next crash. Always keep at least the last 24 hours, and then you'll be able to compare.

As a rule of thumb, performance is relative. A hang can be assessed only if you can compare the performance data collected during/immediately before the hang to a baseline collected during normal operation.
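
To make the baseline idea concrete, here is a minimal sketch in Python. It assumes you have already extracted a few numeric metrics from your collected data; the metric names, the sample values, and the threshold of 3 standard deviations are all made up for illustration and are not from any particular tool.

```python
from statistics import mean, stdev


def flag_anomalies(baseline, suspect, threshold=3.0):
    """Return metrics whose suspect value lies more than `threshold`
    sample standard deviations above the baseline mean."""
    flagged = {}
    for metric, samples in baseline.items():
        if metric not in suspect or len(samples) < 2:
            continue  # nothing to compare against
        mu, sigma = mean(samples), stdev(samples)
        if sigma == 0:
            continue  # constant baseline: a z-score is undefined
        z = (suspect[metric] - mu) / sigma
        if z > threshold:
            flagged[metric] = round(z, 1)
    return flagged


# Hypothetical numbers: samples from normal operation vs. the last
# sample captured before the hang.
baseline = {
    "load_avg": [0.8, 1.1, 0.9, 1.0, 1.2],
    "disk_busy_pct": [12, 15, 11, 14, 13],
}
before_hang = {"load_avg": 1.1, "disk_busy_pct": 97}

print(flag_anomalies(baseline, before_hang))
```

With these sample values only the disk metric stands out, which would point you at I/O rather than CPU. The same comparison works for any metric you can reduce to a number.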
 
Surrano (System Engineer) commented:
If this is a recurring problem, then the best approach would be to save some diagnostic information every N minutes. I'd include:
- top
- iotop
- iftop
- lsof
So write a cron job that saves this, e.g. every 10 minutes, and keeps the results of the last 30 runs. After a hang, you'll have 5 hours to log in and find out who and why.
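
A rough sketch of such a collector in Python, under some assumptions: the command flags are the Linux batch-mode ones (`top -b -n 1`, `iotop -b -n 1`, `lsof -n`) and may differ on your platform, the file naming scheme is my own, and iftop is left out here; add it the same way once you have confirmed its non-interactive options on your system.

```python
#!/usr/bin/env python3
"""Save a system snapshot and keep only the most recent runs."""
import shutil
import subprocess
import sys
import time
from pathlib import Path

KEEP_RUNS = 30  # 30 runs at 10-minute intervals = 5 hours of history

# Assumed Linux batch-mode invocations; adjust for your OS.
COMMANDS = {
    "top": ["top", "-b", "-n", "1"],
    "iotop": ["iotop", "-b", "-n", "1"],
    "lsof": ["lsof", "-n"],
}


def collect_snapshot(directory, commands):
    """Run each command and write all output into one timestamped file."""
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / time.strftime("snapshot-%Y%m%d-%H%M%S.txt")
    with target.open("w") as out:
        for name, argv in commands.items():
            out.write(f"===== {name} =====\n")
            if shutil.which(argv[0]) is None:
                out.write("(not installed)\n")
                continue
            try:
                result = subprocess.run(
                    argv, capture_output=True, text=True, timeout=60
                )
                out.write(result.stdout)
            except (subprocess.TimeoutExpired, OSError) as exc:
                out.write(f"(failed: {exc})\n")
    return target


def prune_snapshots(directory, keep):
    """Delete all but the `keep` newest snapshots; return the removed paths."""
    # Timestamped names sort chronologically, so the oldest come first.
    snapshots = sorted(directory.glob("snapshot-*.txt"))
    stale = snapshots[:-keep] if keep > 0 else snapshots
    for path in stale:
        path.unlink()
    return stale


# Runs only when an output directory is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    snapshot_dir = Path(sys.argv[1])
    collect_snapshot(snapshot_dir, COMMANDS)
    prune_snapshots(snapshot_dir, KEEP_RUNS)
```

Saved as, say, /usr/local/bin/hang_snapshot.py (a hypothetical path), it could be scheduled with a crontab entry like `*/10 * * * * /usr/bin/python3 /usr/local/bin/hang_snapshot.py /var/tmp/hang-snapshots`. Note that iotop usually needs root, so the job would have to run from root's crontab for that section to be populated.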

Another possibility would be quotas on CPU usage and the like, but that would need at least some hints about what was going on, and even then my expertise falls short there...
 
Sandy (Author) commented:
Thanks Surrano. This can be done if the issue recurs, but I am more concerned about what happened on the system in the past :(

TY/SA
 
Surrano (System Engineer) commented:
Well, then dig through your system for files that were modified during the hang... or any time later.
- Log files e.g. syslog, wtmpx, etc.
- Core files or other crash dump-like stuff
The mere identity of these files, or their contents, may give you some hints.
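
As a sketch of that search, the following Python snippet walks a directory tree and lists entries whose modification time falls inside a given window. The function name and the command-line form are my own invention; the root directory and the timestamps in the usage example are placeholders for your actual hang window.

```python
#!/usr/bin/env python3
"""List files modified within a time window (e.g. around a hang)."""
import os
import sys
from datetime import datetime


def files_modified_between(root, start, end):
    """Yield (mtime, path) for non-directory entries under `root`
    whose modification time lies in [start, end]."""
    lo, hi = start.timestamp(), end.timestamp()
    # os.walk silently skips directories it cannot read.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable
            if lo <= mtime <= hi:
                yield datetime.fromtimestamp(mtime), path


# Usage: find_modified.py ROOT START END   (ISO timestamps, local time)
if __name__ == "__main__" and len(sys.argv) == 4:
    root = sys.argv[1]
    start = datetime.fromisoformat(sys.argv[2])
    end = datetime.fromisoformat(sys.argv[3])
    for mtime, path in sorted(files_modified_between(root, start, end)):
        print(mtime.isoformat(sep=" ", timespec="seconds"), path)
```

For example, `python3 find_modified.py /var 2024-01-15T02:00 2024-01-15T04:00` would list everything under /var touched in that two-hour window, oldest first. To cover "any time later" as well, pass the current time as END.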

If collection of system activity records (sar) was turned on, then you can check things like the amount of device I/O, semaphores, etc. E.g. if you find that there was a peak on one of the logical volumes that contains a database index file for table X, you should check the use cases that access that particular table/index. If that device contains only Apache logs, then you should check your webserver. Etc.
 
Sandy (Author) commented:
Definitely, I can search the application-specific logs for that, but I want to be sure exactly which process caused this, because many times we as sysadmins are not allowed to look into app logs, and sometimes those are written in language that a sysadmin cannot understand.

TY/SA
 
Surrano (System Engineer) commented:
I wasn't thinking about apps, more the OS/OEM level. If you find something there, it would give you the leverage to explain why you need to look into the app logs, or to ask the app support people to have a look based on a strong suspicion that the app caused the hang, and let them prove that it did not.
 
Sandy (Author) commented:
Is there any other possible way?
 
Surrano (System Engineer) commented:
Can you inspect the quality of flowing water retrospectively? Currently it has no cyanide, but did it have cyanide one hour / day / year ago? Not unless the cyanide left some telltale sign. It's up to you to find those signs and, especially, up to you to develop proactive measures, like generating those signs automatically if the problem occurs.
 
Sandy (Author) commented:
Instead of "you", as you mentioned, I believe in "us"...

I am trying to work out preventive measures from those signs, but I am seeking expert advice (yours) on whether we can do better. I hope I'm not bothering you :)

TY/SA
 
Sandy (Author) commented:
Thanks for your help Surrano...  Vielen Dank
Question has a verified solution.