Load at 5, no CPU, I/O or swap in use

Hi,

We are currently running CentOS 5 update 4 on a Dell R910 server (16 cores, 32 with hyperthreading) with 64GB of memory. It is the main Oracle 11g DB server for one of our customers and is attached to an MD 3000 storage array. The load average is around 5, but we see no swap in use, the CPUs are pretty much idle, and there is no I/O wait. We have Oracle Data Guard turned on in transactional mode. I've checked everything I can think of, and there are no Oracle processes running that would cause a spike. Does anyone have any ideas as to what to check next?

I have another R910 configured the same way and do not see any issues with the 3 databases running on that server. The load is at .5.

Thanks
mw-hosting Asked:
 
mw-hosting (Author) Commented:
It was the haldaemon that was taking up the load.

Anyone come across this before?
 
arober11 Commented:
Stop as many of the daemons as you can on the server, e.g. Oracle and Apache, then check the load. If it is still high, try a reboot. Otherwise, bring the services back one at a time and monitor the impact on the load. If a culprit is identified, let us know.
 
Michael Worsham (Infrastructure / Solutions Architect) Commented:
Do you have SELinux and auditd enabled?

 
mw-hosting (Author) Commented:
SELinux is not running.

auditd is running.
 
Michael Worsham (Infrastructure / Solutions Architect) Commented:
Can you post the chkconfig list, just to see if you have any services you don't need enabled/running?

"chkconfig --list"
 
mw-hosting (Author) Commented:
All running services from chkconfig --list are the same on both R910s. I guess we are going to have to shut down the Oracle database and see if the issue is with that. That's all we have running on it besides a few out-of-the-box CentOS services.
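For anyone wanting to make the same comparison mechanically, here is a minimal sketch. It assumes the output of chkconfig --list has already been saved to a file on each server and copied to one place; diff_services and the file names are hypothetical, not part of chkconfig.

```shell
# Sketch: report services whose chkconfig state differs between two saved listings.
# diff_services is a hypothetical helper name; pass it two files produced with
#   chkconfig --list > some-file
diff_services() {
    # Sort each listing so that ordering differences are not reported as changes.
    # The sorted copies are written next to the inputs with a .sorted suffix.
    sort "$1" > "$1.sorted"
    sort "$2" > "$2.sorted"
    # diff prints nothing and returns 0 when the two listings match.
    diff "$1.sorted" "$2.sorted"
}
```

Calling diff_services with the two saved listings prints only the lines that differ, so an empty result would back up the "same on both" observation.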
 
Michael Worsham (Infrastructure / Solutions Architect) Commented:
Have you tried a basic 'top' to see which process(es) are peaking?
 
mw-hosting (Author) Commented:
Sure did. There is very little activity on the server. On occasion I do see the oracle processes, scsi_eh_3 and hald-addon-stor.

CPU is at most 3% (oracle).

These processes also appear on our other R910, so at least the last two appear to be normal.
 
Michael Worsham (Infrastructure / Solutions Architect) Commented:
I can tell you that the mcstrans service should be put into a 'stopped'/'off' state. Our Oracle DBA discovered that it sometimes causes CPU spikes when running.

Also, some unnecessary services that can be shut down and uninstalled: pcsc-lite (PCSC crypto card detection), smartmontools (SMART drive monitoring), bluez-utils (Bluetooth). If you aren't using SELinux, make sure that setroubleshootd is also disabled (or even better, uninstalled).
 
mw-hosting (Author) Commented:
SELinux and the firewall are also disabled. The odd thing is we don't see any CPU spikes at all, just a high load.
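One thing that could be checked when the load is high but the CPUs are idle: on Linux the load average also counts tasks in uninterruptible sleep (state D), which use no CPU. The sketch below filters ps output down to those tasks. filter_dstate is a hypothetical helper name, and nothing in this thread confirms that such tasks were present on this server.

```shell
# Sketch: keep only the rows of "ps" output whose state column starts with D
# (uninterruptible sleep). filter_dstate is a hypothetical helper name.
filter_dstate() {
    # Field 1 is the state column, e.g. "D", "Ds" or "D<".
    # The header row ("STAT ...") does not start with D, so it is dropped too.
    awk '$1 ~ /^D/ { print }'
}

# Intended use on a live system (run it a few times, since states change quickly):
#   ps -eo stat,pid,comm | filter_dstate
```

If the same process names keep showing up in that list, they are reasonable candidates for what is inflating the load figure.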
 
Michael Worsham (Infrastructure / Solutions Architect) Commented:
It could be that your page cache and slab cache have peaked.

Try this (as root):

echo 3 > /proc/sys/vm/drop_caches
 
mw-hosting (Author) Commented:
Tried that, still high load.

Maybe it is the hardware, but we have Dell's OpenManage installed and there are no alerts there.
 
Michael Worsham (Infrastructure / Solutions Architect) Commented:
Can you post a screenshot of the 'top' output?
 
mw-hosting (Author) Commented:
Screenshot of top attached: screenshot.png
 
Hugh Fraser (Consultant) Commented:
The top command shows an average across the 16 cores (which actually appear as 32 processors because they're hyperthreaded). If you hit the "1" key, top will display the states for each of the processors. You might find that 5 or 6 of the processors are actually busy, while the rest are idle.

The load average is a sampled measurement of the run queue. On Linux the count also includes processes in uninterruptible sleep (state D, usually waiting on I/O), so the load can be high while the CPUs are idle. On a single-core machine, a load average of 1.0 means there was one runnable process when the sample was taken (about every 5 seconds). A value of 2.0 means there were 2 runnable processes, which of course means the CPU is overloaded. But on a dual-core system, a value of 2 typically means each of the cores has a single runnable process.

In other words, a load average of 5 on a 16-core machine indicates a very lightly loaded system (less than a third of its capacity). The rule of thumb is to start worrying at 70% of capacity, which translates to .7*16, or a load average of 11.2.

So use top with the separate processor stats to see what the system really looks like. Also, keep in mind that the load average is a sampled value and may not translate to how the system performs. The general wisdom is that the absolute number isn't as important as a change in the value, which is a flag that something's happening.
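The 70% rule of thumb above can be sketched as a small check. The helper names load_threshold and load_status are hypothetical, and the 0.7 factor is only the rule of thumb quoted in this comment, not a fixed limit.

```shell
# Sketch: compare a load average against the 70%-of-cores rule of thumb.
# load_threshold and load_status are hypothetical helper names.

# Print 0.7 * <cores>: the load average at which the rule of thumb says to worry.
load_threshold() {
    awk -v cores="$1" 'BEGIN { printf "%.1f\n", 0.7 * cores }'
}

# Print "high" if <load> exceeds 0.7 * <cores>, otherwise "ok".
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (load + 0 > 0.7 * cores) print "high"; else print "ok"
    }'
}

# Figures from this thread: a load of 5 on 16 cores.
load_threshold 16   # prints 11.2
load_status 5 16    # prints ok
```

On a live system the inputs could come from /proc/loadavg (first field is the 1-minute average) and from counting the processor entries in /proc/cpuinfo. Note that the latter counts hyperthreads, so it would report 32 rather than 16 on this server.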
 
mw-hosting (Author) Commented:
I would accept that, but the other R910 server (same configuration) is sitting at a load of .5 and runs 3 databases, compared to one on this server. I had pressed 1 in top to show each CPU, and it is still only at 3% on a single core, if that.
 
ckhsu1977 Commented:
You mentioned you were going to shut down Oracle to see if the load drops. Did that happen?
 
Hugh Fraser (Consultant) Commented:
Download a copy of iotop (or use iostat) to see if there's a lot of I/O, particularly swapping or page faulting. Your system almost certainly isn't paging out, because it has lots of free memory, but let's check to be sure.
 
mw-hosting (Author) Commented:
I determined this past weekend that one of the hal daemon processes was causing the issue. Is there anything I can look at to determine what caused the problem with that process?
 
Michael Worsham (Infrastructure / Solutions Architect) Commented:
While the process is running, you can use 'strace' to see which system calls it is making.
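A minimal sketch of attaching strace to a running process by name follows. trace_by_name is a hypothetical helper, and it assumes pgrep and strace are installed and that it is run as root. The process name to pass is whatever top shows, for example hald-addon-stor as seen earlier in this thread.

```shell
# Sketch: attach strace to the oldest running process with an exact name match.
# trace_by_name is a hypothetical helper name.
trace_by_name() {
    # pgrep -o picks the oldest match, -x requires the whole name to match.
    pid=$(pgrep -o -x "$1") || { echo "no process named $1" >&2; return 1; }
    # -f follows child processes, -tt prints timestamps, -p attaches to the PID.
    strace -f -tt -p "$pid"
}

# Intended use (stop tracing with Ctrl-C):
#   trace_by_name hald-addon-stor
```

A call that repeats at a steady interval in the trace would be a hint about what the process keeps doing, though interpreting it depends on the device or file involved.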
Question has a verified solution.
