I've been struggling with this problem for many months now. One of our workhorse servers is seeing very regular load average "spikes", where the box basically freezes for a couple of seconds, and when it unfreezes the load average has roughly doubled.
For example, our Linux box generally runs at a load average of 2-3. About every 5 minutes (it seems fairly regular), the box just hangs (I can't send any command to it and nothing on screen refreshes), and after about 5 seconds it unhangs with the load average at 5-6.
My current theory is that it is some sort of disk I/O bottleneck: something hogs the disk, other processes block waiting on it, and the load average climbs (on Linux that metric counts processes stuck in uninterruptible I/O wait as well as those waiting to run). Using sar, I have correlated the hangs with 100% disk I/O utilization.
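To test that theory, one thing I could do is list the processes sitting in uninterruptible sleep (state D) while a hang is in progress. This is only a rough sketch, assuming a Linux /proc filesystem; the function names parse_stat and blocked_processes are my own, not from any existing tool.

```python
import glob


def parse_stat(text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST ')'.
    left, right = text.rsplit(")", 1)
    pid, comm = left.split("(", 1)
    return int(pid), comm, right.split()[0]


def blocked_processes(proc_root="/proc"):
    """List (pid, command) for every process currently in state D."""
    blocked = []
    for path in glob.glob(proc_root + "/[0-9]*/stat"):
        try:
            with open(path) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or the line was malformed
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)
```

Calling blocked_processes() once a second and logging the result with a timestamp should show which processes pile up in state D around each spike. It won't say who is causing the I/O, only who is stuck behind it.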
We have a lot of things that run on the system, mostly proprietary software that we built. I've tried to find the specific culprit for this, but have not been able to.
Two things would help me diagnose the problem: a utility that shows which processes are driving the load average up ('top' isn't cutting it), or something that shows which processes are doing the most disk writing.
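To make the second wish concrete, here is a minimal sketch of what I'm after. It assumes the kernel has per-task I/O accounting enabled and exposes /proc/<pid>/io with a write_bytes field, and that the script runs as root (otherwise it can only read my own processes). The names parse_io, sample_write_bytes and top_writers are hypothetical helpers I made up for illustration.

```python
import glob


def parse_io(text):
    """Turn the contents of /proc/<pid>/io into a {field: int} dict."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = int(value)
    return fields


def sample_write_bytes(proc_root="/proc"):
    """Return {pid: write_bytes} for every process whose io file is readable."""
    sample = {}
    for path in glob.glob(proc_root + "/[0-9]*/io"):
        try:
            with open(path) as f:
                io = parse_io(f.read())
        except (OSError, ValueError):
            continue  # process exited, or not readable without root
        pid = int(path.split("/")[-2])
        sample[pid] = io.get("write_bytes", 0)
    return sample


def top_writers(before, after, limit=10):
    """Rank pids by bytes written between two samples, biggest first."""
    deltas = [(after[pid] - before[pid], pid) for pid in after if pid in before]
    deltas = [d for d in deltas if d[0] > 0]
    return sorted(deltas, reverse=True)[:limit]
```

The idea would be to take one sample_write_bytes() snapshot, sleep a few seconds across a spike, take another, and pass both to top_writers() to get (bytes_written, pid) pairs. Processes that start and exit between the two snapshots are missed, so short-lived writers would need a shorter interval.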
I'm ssh'ing to the box, so I'm not directly on it, if that makes a difference.
If anyone has anything that could help me, I'd consider giving you my first born. Or tons of points. Whatever you'd prefer.