HP-UX Superdome: 100% CPU and Memory Utilization

Hi

I am running an HP-UX 11.x Superdome with 4 CPUs and 16 GB of memory in cluster mode (2 Superdomes in a cluster: one running the app server and one running the DB server). 4 GB of RAM is for Oracle. I am running a banking app, and the database is Oracle 9i. I also use Apache and Resin.

There are about 150 logged-in users at peak. Memory and CPU utilization hit the roof and hover between 85% and 100%. How should I go about diagnosing and sorting this out? For this kind of load, it looks absurd.

Furthermore, the Resin part is a bit confusing: Resin also appears to be slow at startup. Are there any standard guidelines for configuring and installing Resin on HP-UX 11.x?

Regards

Surya
(Asked by suryapadma)
tfewster commented:
To start from basics:
For CPU utilisation, run `top` to see the heaviest CPU users. Trace the "CPU hog" processes back to their parents, and decide whether the CPU usage is "reasonable".

Also look at the load averages `top` shows you. If possible, post the `top` output.
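Tracing a CPU hog back to its parents is easier with a short script than by reading ps output by eye. The following is a minimal Python sketch. It assumes input in the form of `UNIX95= ps -eo pid,ppid,pcpu,comm`, with a header line first whose exact text may differ. It picks the busiest process and follows its ppid links upward. The sample output is made up.

```python
# Sketch: find the busiest process and walk its parent chain.
# Input is assumed to look like the output of
#   UNIX95= ps -eo pid,ppid,pcpu,comm
# (first line is the header, whose exact text may differ).


def parse_ps(text):
    """Return {pid: (ppid, pcpu, comm)} parsed from pid,ppid,pcpu,comm output."""
    procs = {}
    for line in text.strip().splitlines()[1:]:  # skip the header line
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, ppid, pcpu, comm = fields
        procs[int(pid)] = (int(ppid), float(pcpu), comm.strip())
    return procs


def ancestry(procs, pid):
    """Follow ppid links from pid upwards; stop at an unknown parent or a loop."""
    chain = []
    seen = set()
    while pid in procs and pid not in seen:
        seen.add(pid)
        ppid, _pcpu, comm = procs[pid]
        chain.append((pid, comm))
        pid = ppid
    return chain


# Made-up sample output, for illustration only.
SAMPLE = """\
  PID  PPID %CPU COMMAND
    1     0  0.0 init
 2101     1  0.1 httpd
 2150  2101 95.3 java
"""

if __name__ == "__main__":
    procs = parse_ps(SAMPLE)
    hog = max(procs, key=lambda p: procs[p][1])
    for pid, comm in ancestry(procs, hog):
        print(pid, comm)
```

With the sample, the chain runs java (2150), then httpd (2101), then init (1). That tells you which service started the hog.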

To check process memory usage, do:
UNIX95= ps -eo uid,pid,ppid,pcpu,state,sz,vsz,time,comm |sort -rnk7 |more
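The same ps output can also be totalled per command. That shows whether, for example, all the java processes together account for most of the memory. Below is a minimal Python sketch for the nine-column ps command above. The ranking is the equivalent of `sort -rnk7`, which sorts on the VSZ column. Summing VSZ can double-count memory that several processes share, so treat the totals as an upper bound. The sample lines are made up.

```python
from collections import defaultdict

# Sketch: rank processes by the VSZ column of
#   UNIX95= ps -eo uid,pid,ppid,pcpu,state,sz,vsz,time,comm
# and total VSZ per command name. Shared memory may be counted more than once.


def parse_ps_mem(text):
    """Parse the nine-column ps output above into a list of dicts."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header line
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue
        _uid, pid, ppid, pcpu, _state, _sz, vsz, _time, comm = fields
        rows.append({
            "pid": int(pid),
            "ppid": int(ppid),
            "pcpu": float(pcpu),
            "vsz": int(vsz),
            "comm": comm.strip(),
        })
    return rows


def top_by_vsz(rows, n=5):
    """Return the n rows with the largest VSZ (what `sort -rnk7` does)."""
    return sorted(rows, key=lambda r: r["vsz"], reverse=True)[:n]


def vsz_by_command(rows):
    """Sum VSZ per command name, largest first."""
    totals = defaultdict(int)
    for r in rows:
        totals[r["comm"]] += r["vsz"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Made-up sample output, for illustration only.
SAMPLE = """\
  UID   PID  PPID %CPU S    SZ    VSZ     TIME COMMAND
  102  2150  2101 95.3 R 20480  81920 01:02:03 java
  102  2151  2101 40.0 S 16384  65536 00:40:10 java
  101  3001     1  2.0 S  4096  16384 00:01:00 oracle
"""

if __name__ == "__main__":
    rows = parse_ps_mem(SAMPLE)
    for r in top_by_vsz(rows, 2):
        print(r["pid"], r["comm"], r["vsz"])
    print(vsz_by_command(rows))
```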

Of course, the App and DB systems will have very different profiles and will need tuning differently.

Start `sar` logging so you can build up a picture of resource usage over a few days.
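Here is one way to collect the data. `sar -o /tmp/sar.data 300 288` records one sample every 5 minutes for 24 hours into a binary file; the path is only an example. `sar -u -f /tmp/sar.data` prints the CPU report back, with %usr, %sys, %wio and %idle columns. The minimal Python sketch below averages that report. It also lists the intervals that look CPU-saturated or I/O-bound. The thresholds are arbitrary illustration values and the sample report is made up.

```python
import re

# Sketch: summarise `sar -u` output (%usr %sys %wio %idle per interval).
# The thresholds below are arbitrary illustration values, not HP guidance.

TIME_RE = re.compile(r"^\d{2}:\d{2}:\d{2}$")


def parse_sar_u(text):
    """Return one dict per timestamped data line; header/Average lines are skipped."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5 or not TIME_RE.match(fields[0]):
            continue
        try:
            usr, sys_, wio, idle = (float(x) for x in fields[1:])
        except ValueError:
            continue  # the column-header line, e.g. "08:00:00 %usr %sys %wio %idle"
        samples.append({"time": fields[0], "usr": usr, "sys": sys_, "wio": wio, "idle": idle})
    return samples


def summarize(samples, busy_idle=10.0, wio_limit=20.0):
    """Average each column and list intervals that look CPU-saturated or I/O-bound."""
    if not samples:
        raise ValueError("no sar -u samples found")
    n = len(samples)
    avg = {k: sum(s[k] for s in samples) / n for k in ("usr", "sys", "wio", "idle")}
    busy = [s["time"] for s in samples if s["idle"] < busy_idle]
    io_bound = [s["time"] for s in samples if s["wio"] > wio_limit]
    return avg, busy, io_bound


# Made-up sample report, for illustration only.
SAMPLE = """\
HP-UX dbhost B.11.11 U 9000/800    11/17/04

08:00:00    %usr    %sys    %wio   %idle
08:20:00      62      25       8       5
08:40:00      70      22       5       3
09:00:00      30      10      40      20

Average       54      19      18       9
"""

if __name__ == "__main__":
    avg, busy, io_bound = summarize(parse_sar_u(SAMPLE))
    print(avg)
    print("idle < 10%:", busy)
    print("wio > 20%:", io_bound)
```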
yuzh commented:

Follow tfewster's advice and use top or ps to find out which processes are eating up all your system resources.

It is possible to see a box use up all its CPU and RAM; it depends on what it is doing.
For example, some of my boxes are running neural network simulations at the moment, and in most cases 99.99% of CPU time and 100% of RAM are in use.

Find out the cause of the problem first, and then decide what to do: a hardware upgrade, splitting the workload across more boxes, or, if it is a software bug, applying patches.

gheist commented:
Try `vmstat 1 10`
to identify whether the bottleneck is I/O wait or raw CPU power.
Depending on the result, there are many ways to improve things before buying expensive hardware.

Then try `sync ; sync ; vmstat 1 10`.
If this improves the wait time, your system is simply swapping; teach your apps to use less memory.

etc etc
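Reading vmstat for memory pressure mostly means watching the `po` (page-out) column. It also means comparing the run queue `r` with the number of CPUs, which is 4 on this box. As far as I know, HP-UX's vmstat cpu section shows only us/sy/id, with no separate wait column, so `sar -u`'s %wio is the better source for I/O wait. The minimal Python sketch below locates columns by header name rather than by position. It skips the first data line, which on most systems reports averages since boot. The sample output is made up.

```python
# Sketch: read `vmstat 1 10` output and count intervals with page-outs (po > 0)
# or with more runnable processes (r) than CPUs. Columns are located by header
# name, so the same code copes with vmstat layouts that add or drop columns.


def parse_vmstat(text):
    """Return one {column: int} dict per data line."""
    header, width, rows = None, 0, []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":
            # Later duplicates overwrite earlier ones, so "sy" ends up meaning
            # the cpu (system time) column rather than the faults (syscalls) one.
            header = {name: i for i, name in enumerate(fields)}
            width = len(fields)
            continue
        if header and len(fields) == width and all(f.isdigit() for f in fields):
            rows.append({name: int(fields[i]) for name, i in header.items()})
    return rows


def diagnose(rows, ncpu, skip_first=True):
    """Summarise paging and run-queue pressure over the sampled intervals."""
    data = rows[1:] if skip_first else rows  # first line is usually since-boot averages
    if not data:
        raise ValueError("no vmstat samples to analyse")
    return {
        "samples": len(data),
        "paging_samples": sum(1 for r in data if r["po"] > 0),
        "runqueue_over_ncpu": sum(1 for r in data if r["r"] > ncpu),
        "avg_idle": sum(r["id"] for r in data) / len(data),
    }


# Made-up sample output, for illustration only.
SAMPLE = """\
         procs           memory                   page                              faults       cpu
    r     b     w      avm    free   re   at    pi   po    fr   de    sr     in     sy    cs  us sy id
    2     0     0   300000   20000    1    0     0    0     0    0     0    900   4000   300  40 10 50
    6     1     0   310000    1500    5    2     3   12    40    0   120   1500   9000   700  80 15  5
    7     2     0   312000    1200    6    3     4   20    55    0   180   1600   9500   750  82 14  4
    3     0     0   305000    9000    1    0     0    0     0    0     0   1100   6000   400  60 12 28
"""

if __name__ == "__main__":
    print(diagnose(parse_vmstat(SAMPLE), ncpu=4))
```

Page-outs in two of the three counted intervals would point at memory pressure rather than a pure CPU problem.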

tfewster commented:
Oops, just noticed this response from Surya that was posted as feedback instead of a comment:
(is anyone familiar with configuring/tuning Resin?)

      Author: suryapadma
      Date: 11/17/2004 02:48PM GMT
      I have collected the stats using Glance and also OV, and I have them in a file as well. I need some help to drill down further into the problem and get it sorted out. I also have Resin 2.0.5 running, and it is also hogging memory and possibly CPU. Resin takes a long time to come up, and it fails during a cluster switchover. Do you have any input on installing and tuning Resin on HP-UX? I can send you the log files if you can give me an ID to send them to.

      Regards

      Surya
sirjb commented:
Hi suryapadma,

From reading your posts, I see 4 questions:
1. High CPU utilization in general
2. High memory utilization in general
3. The Resin application is using a lot of CPU
4. Cluster switchover fails with Resin

Performance issues are not easy and will not be solved simply by running one or two tools.
First rule: if no one is complaining about the performance, don't try to do or change too much.
(You complained about Resin, which I assume is running on the second Superdome, but you didn't complain about the performance of the Oracle database server.)

Second:
High CPU utilization is not necessarily a performance issue. To put it simply, if a process gets all the resources it demands (enough memory, I/O from disk, etc.), it will use up to 100% CPU to run (calculate) and try to finish as soon as possible. If a process starts up, runs at 100% CPU for a few seconds or minutes and then disappears, that sounds OK to me. But if a single process keeps running at 100% CPU for a long time, the cause may be a software fault or something else.
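The distinction between a short burst and a process that stays pegged can be checked mechanically from repeated snapshots. The minimal Python sketch below assumes a list of snapshots taken at a fixed interval, each mapping pid to %CPU. It flags pids that stay at or above a threshold for several consecutive snapshots. Running `UNIX95= ps -eo pid,pcpu` every 10 seconds is one assumed way to collect them. The threshold, run length and sample values are arbitrary illustrations.

```python
# Sketch: separate short CPU bursts from processes that stay pegged.
# `samples` is a list of snapshots, each mapping pid -> %CPU, taken at a fixed
# interval (for example by running `UNIX95= ps -eo pid,pcpu` every 10 seconds;
# that sampling scheme is an assumption, any per-process sampler will do).
# threshold and min_run are arbitrary illustration values.


def sustained_hogs(samples, threshold=90.0, min_run=5):
    """Return {pid: longest_run} for pids at or above `threshold` %CPU
    for at least `min_run` consecutive snapshots."""
    current = {}   # pid -> length of its ongoing run
    longest = {}   # pid -> longest run seen so far
    for snap in samples:
        # End the run of any pid that dropped below the threshold or vanished.
        for pid in list(current):
            if snap.get(pid, 0.0) < threshold:
                del current[pid]
        # Extend (or start) runs for pids that are busy in this snapshot.
        for pid, pcpu in snap.items():
            if pcpu >= threshold:
                current[pid] = current.get(pid, 0) + 1
                longest[pid] = max(longest.get(pid, 0), current[pid])
    return {pid: run for pid, run in longest.items() if run >= min_run}


# Made-up snapshots: pid 100 stays busy, pid 200 only bursts.
SAMPLES = [
    {100: 99.0, 200: 100.0},
    {100: 99.0, 200: 100.0},
    {100: 99.0, 200: 5.0},
    {100: 99.0, 200: 100.0},
    {100: 99.0, 200: 100.0},
    {100: 99.0},
]

if __name__ == "__main__":
    print(sustained_hogs(SAMPLES))
```

Only pid 100 is reported here. Pid 200 hits 100% twice but never stays there, which matches the "runs for a while, then disappears" case.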
High memory utilization can be OK too. Check whether you have enough (or too much) swap, and check swap utilization, page-out frequency, I/O on the swap devices, etc.
As long as you have enough swap, NO PAGE-OUTS!!!, and especially when no users are complaining, you don't have to worry! :)
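Swap usage can be read from `swapinfo -tam`. In that command, -t adds a total line, -a shows all swap types, and -m reports sizes in MB. The minimal Python sketch below assumes the column layout shown in its made-up sample, so check it against your own output. It reports overall swap use and how much space is in use on the swap devices themselves. The 80% limit is an arbitrary illustration value.

```python
# Sketch: read `swapinfo -tam` output (sizes in MB) and report overall swap use
# plus how much space is in use on the swap devices themselves.
# The column layout is assumed from the sample; the 80% limit is arbitrary.


def parse_swapinfo(text):
    """Return (device_rows, total_row) from swapinfo -tam output."""
    devices, total = [], None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind = fields[0]
        if kind == "total" and len(fields) >= 5:
            total = {"avail_mb": int(fields[1]), "used_mb": int(fields[2]),
                     "pct_used": int(fields[4].rstrip("%"))}
        elif kind in ("dev", "fs") and len(fields) >= 5:
            devices.append({"type": kind, "avail_mb": int(fields[1]),
                            "used_mb": int(fields[2]), "name": fields[-1]})
    return devices, total


def swap_report(devices, total, limit_pct=80):
    """Summarise overall swap use and device usage."""
    if total is None:
        raise ValueError("no 'total' line found")
    return {
        "pct_used": total["pct_used"],
        "over_limit": total["pct_used"] > limit_pct,
        "device_used_mb": sum(d["used_mb"] for d in devices),
    }


# Made-up sample output, for illustration only.
SAMPLE = """\
             Mb      Mb      Mb   PCT  START/      Mb
TYPE      AVAIL    USED    FREE  USED   LIMIT RESERVE  PRI  NAME
dev        4096     512    3584   13%       0       -    1  /dev/vg00/lvol2
reserve       -    2048   -2048
memory     3072     400    2672   13%
total      7168    2960    4208   41%       -       0    -
"""

if __name__ == "__main__":
    print(swap_report(*parse_swapinfo(SAMPLE)))
```

A non-zero device figure that keeps growing, together with vmstat page-outs, is the situation to worry about.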

Third: collect all relevant information!
Some ideas:
Use Glance to check global CPU and memory, but also disk I/O and swap utilization. Check the size of global shared memory, the number of open files (nfile), page-outs, system calls, and global wait states / threads. Check single-process CPU/memory/I/O, activity per CPU (maybe one CPU is broken?!), activity in the file systems (LVM), physical volumes (PV), NFS (if any), and network utilization. All of this can be done and viewed within Glance!
Using Glance, you can determine whether the CPU time is spent in system mode (kernel) or user mode (outside the kernel).
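The per-CPU and user-versus-system checks reduce to simple arithmetic once you have the figures from Glance or sar. In the minimal Python sketch below, the per-CPU busy percentages are typed in by hand and are made up, and the 30-point tolerance is an arbitrary illustration value. A CPU that sits idle while its siblings are saturated may indicate a problem with that processor, or work tied to particular CPUs. A large kernel share suggests overhead such as system calls or paging rather than application work.

```python
from statistics import mean

# Sketch: two quick checks on figures read off Glance (or sar).
# Per-CPU busy percentages are typed in by hand here; the 30-point tolerance
# is an arbitrary illustration value.


def odd_cpus(busy_by_cpu, tolerance=30.0):
    """Return CPU ids whose busy % is far from the average of all CPUs."""
    avg = mean(busy_by_cpu.values())
    return sorted(cpu for cpu, busy in busy_by_cpu.items() if abs(busy - avg) > tolerance)


def kernel_share(usr_pct, sys_pct):
    """Fraction of busy CPU time spent in system (kernel) mode."""
    busy = usr_pct + sys_pct
    return sys_pct / busy if busy else 0.0


if __name__ == "__main__":
    print(odd_cpus({0: 95.0, 1: 97.0, 2: 96.0, 3: 2.0}))  # CPU 3 stands out
    print(kernel_share(60.0, 40.0))
```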

If the performance issue is temporary and only happens within a certain time frame, use PerfView (MeasureWare) to collect data and graph it to find a trend!

Check syslog for any unusual warnings.
Check EMS for any hardware warnings.
Use netstat to check for any hanging open connections.
Check the kernel setup (kmtune): check the values of maxusers, maxfiles, maxdsiz, maxssiz, maxtsiz, max_thread_proc, nfile, nproc, nkthread, etc.
Use lsof (list open files) to track down what a process (the one constantly claiming 100% CPU / memory) is actually doing.
It's not standard HP software, but you can get it from http://hpux.connect.org.uk/hppd/hpux/Sysadmin/lsof-4.73/man.html
Check HP's latest quality packs. I would advise installing the hardware and quality packs from about half a year ago. (Don't install the newest one unless you have no choice, for example to solve a specific problem. A quality pack that is at least 6 months old provides enough stability, and it's almost 100% safe to install with no headache.)
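Comparing kernel tunables against a required minimum is a simple table lookup once the values have been copied from kmtune. The minimal Python sketch below does that comparison. HYPOTHETICAL_MINIMUMS holds placeholder numbers only; take the real values for your Oracle release from Oracle's HP-UX installation documentation or Metalink. Tunables defined as formulas (for example in terms of nproc or maxusers) must be entered as their evaluated numeric value. The "current" values in the usage section are made up.

```python
# Sketch: compare kernel tunables (as read from `kmtune`) against a table of
# minimums. HYPOTHETICAL_MINIMUMS holds placeholder numbers only; take the real
# values for your Oracle release from Oracle's HP-UX install guide / Metalink.

HYPOTHETICAL_MINIMUMS = {
    "maxdsiz": "0x40000000",
    "max_thread_proc": "256",
    "nproc": "2048",
    "nfile": "8192",
}


def to_int(value):
    """Accept decimal ("4096") or hex ("0x10000000") strings."""
    v = value.strip()
    return int(v, 16) if v.lower().startswith("0x") else int(v)


def below_minimum(current, minimums):
    """Return {name: (current_or_None, wanted)} for tunables that fall short."""
    short = {}
    for name, wanted in minimums.items():
        want = to_int(wanted)
        if name not in current:
            short[name] = (None, want)
        elif to_int(current[name]) < want:
            short[name] = (to_int(current[name]), want)
    return short


if __name__ == "__main__":
    # Values a user might copy from kmtune output (made up for illustration).
    current = {"maxdsiz": "0x10000000", "max_thread_proc": "512", "nproc": "4200"}
    for name, (have, want) in below_minimum(current, HYPOTHETICAL_MINIMUMS).items():
        print(f"{name}: have {have}, want at least {want}")
```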

So far, these checks are only useful if the performance issue is related to the HP-UX OS.
If you have an issue with Oracle, grab a DBA and ask them to do the troubleshooting together with you, then try the following:
Check the Oracle setup.
Does the HP kernel conform to the requirements of your particular Oracle version? Check Metalink for it.
Check Oracle's global logs.
Check the listener logs.
Check other things like data fragmentation in the datafiles, buffer cache hit rates, the number of concurrent users, etc.
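The classic buffer cache hit ratio uses three v$sysstat counters: 1 - physical reads / (db block gets + consistent gets). Those counters are cumulative since instance startup, so the minimal Python sketch below computes the ratio over the difference between two snapshots taken some minutes apart. The snapshot values are made up. The ratio is only a rough indicator; a DBA will look at it together with the wait events.

```python
# Sketch: classic Oracle buffer cache hit ratio from three v$sysstat counters,
# computed over an interval (the counters are cumulative since instance start).
#   SELECT name, value FROM v$sysstat
#   WHERE name IN ('db block gets', 'consistent gets', 'physical reads');


def interval_stats(before, after):
    """Difference of two v$sysstat snapshots taken some minutes apart."""
    return {name: after[name] - before[name] for name in before}


def buffer_cache_hit_ratio(stats):
    """1 - physical reads / logical reads, where logical = db block gets + consistent gets."""
    logical = stats["db block gets"] + stats["consistent gets"]
    if logical == 0:
        raise ValueError("no logical reads in this interval")
    return 1.0 - stats["physical reads"] / logical


# Made-up snapshot values, for illustration only.
T0 = {"db block gets": 1000, "consistent gets": 9000, "physical reads": 500}
T1 = {"db block gets": 3000, "consistent gets": 29000, "physical reads": 2500}

if __name__ == "__main__":
    print(round(buffer_cache_hit_ratio(interval_stats(T0, T1)), 4))
```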

Do the same for other applications like Resin. Read the installation docs for Resin, for example:
http://www.sia12.net/hp_docs/getting.started  (good one)
http://www.caucho.com/download/resin-install.pdf
http://cocoon.apache.org/2.1/installing/#Installing+on+Resin+2.x
To begin with, check the base configuration.

Anyway, there is too much to mention.

About your specific problems:
1 and 2. First collect all the information described above to determine whether it's a global issue or a single process / application, and whether it's hardware-related or software-related only. Is it a constant problem, or does it only happen between 09:00 and 17:00? Gather more information and post it here; we will help you.

3. Resin is slow / using a lot of CPU.
Give us the results of checking the configuration, and show us more of your HP-UX configuration, such as the kernel settings.
Run lsof on the busy process. Maybe it's related to Apache. Start the application with no user load and see if the problem still exists (try starting it without MC/ServiceGuard).
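The lsof field-output mode is easier to post-process than its normal columns. `lsof -p <pid> -F pcftn` prints one field per line, each prefixed by a letter: p for PID, c for command, f for file descriptor, t for type and n for name. The minimal Python sketch below groups those lines per process and counts open files by type. It also lists network endpoints, which helps when checking whether the busy process is tied up in Apache connections. The sample, including the /opt/resin paths, is made up.

```python
from collections import Counter

# Sketch: summarise `lsof -p <pid> -F pcftn` output, where each line starts with
# a field letter: p = PID, c = command, f = file descriptor, t = type, n = name.


def parse_lsof_fields(text):
    """Return {pid: {"command": str, "files": [{"fd", "type", "name"}, ...]}}."""
    procs, pid, current = {}, None, None
    for line in text.splitlines():
        if not line:
            continue
        tag, value = line[0], line[1:]
        if tag == "p":
            pid = int(value)
            procs[pid] = {"command": None, "files": []}
            current = None
        elif tag == "c" and pid is not None:
            procs[pid]["command"] = value
        elif tag == "f" and pid is not None:
            current = {"fd": value, "type": None, "name": None}
            procs[pid]["files"].append(current)
        elif tag == "t" and current is not None:
            current["type"] = value
        elif tag == "n" and current is not None:
            current["name"] = value
    return procs


def file_types(proc):
    """Count open files by lsof type (REG, DIR, CHR, IPv4, ...)."""
    return Counter(f["type"] for f in proc["files"])


# Made-up sample for a busy java process, for illustration only.
SAMPLE = """\
p2150
cjava
f0
tCHR
n/dev/null
fcwd
tDIR
n/opt/resin
f5
tIPv4
n*:8080
f6
tIPv4
n10.0.0.1:8080->10.0.0.2:51234
f7
tREG
n/opt/resin/log/stdout.log
"""

if __name__ == "__main__":
    proc = parse_lsof_fields(SAMPLE)[2150]
    print(proc["command"], file_types(proc))
    for f in proc["files"]:
        if f["type"] in ("IPv4", "IPv6"):
            print("socket:", f["name"])
```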

4. Cluster switchover fails. You have to show us the entire MC/ServiceGuard configuration.
MCSG is very powerful but needs to be configured carefully: a single small mistake and it will fail to start.
Please collect the ServiceGuard error log, together with the node / package configuration, and post it here!


Just go and try things out, post your questions in detail, and we'll see if we can help more.
If you'd like to learn more about HP-UX / Oracle setups, I have a few cookbooks ready for you.
Post your mail address and I can mail them to you, but they are big, a couple of MB each :)) FAT cookbooks.

Hope this points you in the right direction!!

Good luck

JB
tfewster commented:
Points to sirjb for good generic info on investigating performance problems. Without detailed feedback it would be impossible to diagnose the real problem, but sirjb's suggestions will go a long way toward finding it.

Incidentally, I'd be interested in seeing JB's "cookbooks". I have a few links to HP-UX performance troubleshooting docs (some of which require an account on HP's ITRC and, naturally, there is a great deal of overlap...):
HP-UX Performance Cookbook: http://www.interex.org/pubcontent/enterprise/nov01/grumann.jsp
Basics of Poor System Performance: http://www4.itrc.hp.com/service/cki/docDisplay.do?docLocale=en_US&docId=200000063203050
Determining the cause of system performance problems: http://www4.itrc.hp.com/service/cki/docDisplay.do?docLocale=en_US&docId=200000063210674
System Performance Tuning Guide: http://www4.itrc.hp.com/service/cki/docDisplay.do?docLocale=en_US&docId=200000073346765
sirjb commented:
Hi,

The link to the HP-UX Performance Cookbook posted by tfewster is good, BUT that document is outdated.
The original and latest version is:

http://h21007.www2.hp.com/dspp/files/unprotected/devresource/Docs/TechPapers/UXPerfCookBook.pdf  by Stephen Ciullo and Doug Grumann 27th May 2003
This is the latest one!

To tfewster:
I have some e-books on specific subjects like MC/SG, and LVM in combination with SAN storage such as XPs, VAs, etc.
What are you looking for, or in what area?
PS: thanks for the comment

Regards,

JB
gheist commented:
The most likely problem is the JVMs; you have to renice and tune them if they interfere with normal processes (check process accounting, etc.).
tfewster commented:
JB - Thanks for the link to the updated Performance Tuning pdf - I searched the ITRC for "cookbook" and found an MC/Serviceguard one as well, but if you have anything specific on tuning HP-UX for Oracle I'd be interested. My email address is in my profile (but posting a link would be better, so everyone can see it ;-)
