Solved

Load at 5, no CPU, I/O, or swap in use

Posted on 2010-08-19
Last Modified: 2013-12-06
Hi,

We are currently running CentOS 5 update 4 on a Dell R910 server (16 cores, 32 threads with hyperthreading) with 64GB of memory. It is the main Oracle 11g DB server for one of our customers and is attached to an MD 3000 storage array. The load average is around 5, but there is no swap in use, the CPUs are mostly idle, and there is no I/O wait. We have Oracle Data Guard turned on in transactional mode. I've checked everything I can think of, and there are no Oracle processes running that would cause a spike. Does anyone have any ideas as to what to check next?

I have another R910 configured the same way and do not see any issues with the 3 databases running on that server. The load there is at 0.5.

Thanks
Question by:mw-hosting
20 Comments
 
LVL 26

Expert Comment

by:arober11
Stop as many of the daemons as you can on the server, e.g. Oracle and Apache, then check the load. If it is still high, try a reboot; otherwise bring the services back one at a time and monitor the impact on the load. If a culprit is identified, let us know.
 
LVL 29

Expert Comment

by:Michael W
Do you have SELinux and auditd enabled?
 

Author Comment

by:mw-hosting
SELinux is not running.

auditd is running.
 
LVL 29

Expert Comment

by:Michael W
Can you post a chkconfig list, just to see if you have any services enabled/running that you don't need?

"chkconfig --list"

 

Author Comment

by:mw-hosting
All running services from chkconfig --list are the same between both R910s. I guess we are going to have to shut down the Oracle database and see if the issue is with that. That's all we have running on it besides a few out-of-the-box CentOS services.
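
One way to confirm that the two servers really have the same services enabled is to save the `chkconfig --list` output on each machine and diff the files. This is a minimal sketch, assuming both outputs have already been saved to files; the function name `compare_services` and the file names in the usage note are hypothetical.

```shell
# Compare two saved `chkconfig --list` outputs.
# Prints the differing lines in diff format, or "identical" if there are none.
compare_services() {
    a=$(mktemp) && b=$(mktemp) || return 2
    # Sort first so that ordering differences do not show up as changes.
    sort "$1" > "$a"
    sort "$2" > "$b"
    if diff "$a" "$b"; then
        echo "identical"
    fi
    rm -f "$a" "$b"
}
```

For example, run `chkconfig --list > /tmp/services-$(hostname).txt` on each server, copy one file across, and then call `compare_services /tmp/services-hostA.txt /tmp/services-hostB.txt`.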



 
LVL 29

Expert Comment

by:Michael W
Have you tried a basic 'top' to see which process(es) are peaking?
 

Author Comment

by:mw-hosting
Sure did. There is very little activity on the server. I do occasionally see the oracle processes, scsi_eh_3 and hald-addon-stor.

CPU is at most 3% (oracle).

These processes also appear on our other R910, so at least the last two appear to be normal.

 
LVL 29

Expert Comment

by:Michael W
I can tell you that the mcstrans service should be put into a 'stopped'/'off' state. Our Oracle DBA discovered that it sometimes causes CPU spikes when running.

Also, some unnecessary services can be shut down and uninstalled: pcsc-lite (PCSC Crypto Card detection), smartmontools (SMART drive monitoring), bluez-utils (Bluetooth). If you aren't using SELinux, make sure that setroubleshootd is also disabled (or, even better, uninstalled).
 

Author Comment

by:mw-hosting
SELinux and the firewall are also disabled. The odd thing is that we don't see any CPU spikes at all; only the load is high.



 
LVL 29

Expert Comment

by:Michael W
It could be that your page cache and slab cache have peaked.

Try this (as root):

echo 3 > /proc/sys/vm/drop_caches
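
To see whether the caches were actually large, it can help to read the `Cached` and `Slab` fields of `/proc/meminfo` before and after the command. This is a minimal sketch; the function name `cache_sizes` is hypothetical. Running `sync` first is commonly recommended, because `drop_caches` only frees clean cache entries.

```shell
# Print the page cache and slab sizes in kB from a meminfo-style file
# (defaults to /proc/meminfo).
cache_sizes() {
    awk '$1 == "Cached:" { cached = $2 }
         $1 == "Slab:"   { slab = $2 }
         END { printf "cached_kb=%d slab_kb=%d\n", cached, slab }' "${1:-/proc/meminfo}"
}
```

A before/after comparison as root could then look like `cache_sizes; sync; echo 3 > /proc/sys/vm/drop_caches; cache_sizes`.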
 

Author Comment

by:mw-hosting
Tried that, still high load.

Maybe it is the hardware, but we have Dell's OpenManage installed and there are no alerts there.
 
LVL 29

Expert Comment

by:Michael W
Can you post a screenshot of a 'top' output?
 

Author Comment

by:mw-hosting
Screen shot of top attached
screenshot.png
 
LVL 12

Expert Comment

by:hfraser
The top command shows an average across the 16 cores (which actually appear as 32 processors because they're hyperthreaded). If you hit the "1" key, top will display the states for each of the processors. You might find that 5 or 6 of the processors are actually busy, while the rest are idle.

The load average is a sampled measurement of the run queue. On a single-core machine, a load average of 1.0 means there was one runnable process when the sample was taken (about every 5 seconds). A value of 2.0 means there were 2 runnable processes, which of course means the CPU is overloaded. But on a dual-core system, a value of 2 typically means each of the cores has a single runnable process.

All this simply means that a load average of 5 on a 16-core machine means it's very lightly loaded (less than a third of its capacity). The rule of thumb is to start worrying at 70% of capacity, which translates to 0.7*16, or a load average of 11.2.

So use top with the separate processor stats to see what the system really looks like. Also, keep in mind that the load average is a sampled value, and may not translate to how the system performs. The general wisdom is that the absolute number isn't as important as the change in value, which is a flag that something's happening.
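
The arithmetic above can be sketched as two small shell helpers. The function names are hypothetical, and the 70% default is simply the rule of thumb quoted above, not a hard limit.

```shell
# Load per core: load average divided by the number of cores.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.4f\n", load / cores }'
}

# Load average at which a machine with the given number of cores reaches
# a given fraction of its capacity (default 0.7).
load_threshold() {
    awk -v cores="$1" -v frac="${2:-0.7}" 'BEGIN { printf "%.1f\n", cores * frac }'
}
```

With the figures from this thread, `load_per_core 5 16` gives 0.3125 and `load_threshold 16` gives 11.2. On a live system the inputs could come from `cut -d' ' -f1 /proc/loadavg` and `grep -c ^processor /proc/cpuinfo`; note that the latter counts logical processors, so it would report 32 on this hyperthreaded machine.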
 

Author Comment

by:mw-hosting
I would accept that, but the other R910 server (same configuration) is sitting at a load of 0.5 while running 3 databases. I had already pressed 1 in top to show each CPU, and usage is still only about 3% on a single core, if that.
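
On Linux the load average counts not only runnable tasks but also tasks in uninterruptible sleep (state `D`, usually waiting on I/O), so a high load with idle CPUs often points to a few processes sitting in that state. This is a minimal sketch for picking them out of `ps` output; the function name `load_contributors` is hypothetical.

```shell
# Filter `ps -eo state,pid,comm` output (read from stdin) down to the tasks
# that count toward the Linux load average: running (R) or
# uninterruptible sleep (D). The header line is dropped because its
# first field is "S".
load_contributors() {
    awk '$1 ~ /^[RD]/'
}
```

Usage would be `ps -eo state,pid,comm | load_contributors`, repeated a few times because the `D` state is often short-lived. The `ps` process itself normally shows up as `R`.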



 
LVL 3

Expert Comment

by:ckhsu1977
You mentioned you were going to shut down Oracle to see if the load drops. Did that happen?
 
LVL 12

Expert Comment

by:hfraser
Download a copy of iotop (or use iostat) to see if there's a lot of I/O, particularly swapping or faulting. Your system almost certainly isn't paging memory out, because it has lots of free memory, but let's check to be sure.
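
If neither tool is installed, the kernel's cumulative swap counters in `/proc/vmstat` give a rough answer. This is a minimal sketch with a hypothetical function name; `pswpin` and `pswpout` count pages swapped in and out since boot, so the numbers only matter if they grow between two readings.

```shell
# Print the cumulative swap-in and swap-out page counters from a
# vmstat-style file (defaults to /proc/vmstat).
swap_counters() {
    awk '$1 == "pswpin"  { in_pages = $2 }
         $1 == "pswpout" { out_pages = $2 }
         END { printf "pswpin=%d pswpout=%d\n", in_pages, out_pages }' "${1:-/proc/vmstat}"
}
```

Running `swap_counters; sleep 10; swap_counters` and getting the same numbers twice suggests no swapping is happening. Where sysstat is available, `iostat -x 5 3` shows per-device utilization over three 5-second intervals.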
 

Author Comment

by:mw-hosting
I determined this past weekend that one of the HAL daemon processes was causing the issue. Is there anything I can look at to determine what may have caused the issue with that process?
 
LVL 29

Expert Comment

by:Michael W
While the process is running, you can attach 'strace' to it to see which system calls it is making.
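
A minimal sketch for finding the HAL processes to attach to. The helper name `hald_pids` is hypothetical, and it assumes the processes have command names beginning with `hald`, as with the `hald-addon-stor` entry seen in top earlier in the thread.

```shell
# Print the PIDs of processes whose command name starts with "hald",
# given `ps -eo pid,comm` output on stdin.
hald_pids() {
    awk '$2 ~ /^hald/ { print $1 }'
}
```

Usage would be `ps -eo pid,comm | hald_pids`, followed as root by `strace -f -tt -T -p PID` for a timestamped trace of each call, or `strace -c -p PID` for a per-syscall summary that is printed when the trace is interrupted with Ctrl-C.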

 

Accepted Solution

by:mw-hosting
It was the haldaemon that was taking up the load.

Has anyone come across this before?
