Trading Memory for CPU in a Linux application (without modifying the application, that is)

markdolar
markdolar used Ask the Experts™
on
Hello Experts,

I have here a linux based application that was recently moved onto a faster server with lots of memory and fast disks.  The application, written in C, basically performs the following steps:

Read data from a file
allocate memory for the data
add the data to a linked list
Repeat

When I run the application on the older server and monitor the CPU performance, I see it spending about 20% of its time in IO Wait and 15% in kernel mode.  When I move the application to the new fast server, it spends about 1% in IO Wait mode and 40% in kernel mode.  While this is sort of a good thing, if I run about 16 of these I can max out the CPUs my new dual quad core server.   I need to do other things on this server and this gets in the way.  When I analyze where the application is spending its time, on the old server it is mostly read and on the new server it is mostly malloc.  On the new server there is lots of free memory and on the old server memory is more limited.  In top, I can observe the memory allocation going up as the linked list is built.  The list can be very long and it can take several minutes.  

The obvious path is to rewrite the application so that the memory is allocated once at the beginning of data retrieval.  The list is not open-ended, so the amount of memory required can be determined in advance (or at least close enough).   However, changing the application will take some time and I would like to see if there is a way to tune the Linux OS to optimize performance of this application in its current state.

I can tweak system configuration paramenters, user account parameters, hardware, process priorities and stuff like that.    I could probably even patch the kernel if that would help.  The one thing I can't easily do is modify the application.  

I realize that by changing the priorities of the processes, I can "buy back" some of the throttling effect being provided by the slower I/O subsystem on the older machine.   That has the net effect of extending the time it takes to build the linked list, so doesn't count as a valid answer to this question.  What I am looking for is an approach that reduces the total time the application is spending in kernel mode.

Thank you in advance.

Comment
Watch Question

Do more with

Expert Office
EXPERT OFFICE® is a registered trademark of EXPERTS EXCHANGE®
Top Expert 2009
Commented:
Why not take the simple route and limit the number of concurrent threads / processes of this program? You are running 16 processes so thats 4 per CPU on avg. Sounds like a bit much if you are concerned with overall system performance.

You could definitely rewrite with a large allocation, and/or even try the Hoard allocator as a drop-in replacement. No changes are needed, just link it in place of the regular malloc.

http://www.cs.umass.edu/~emery/hoard/

Author

Commented:
Thanks for the response.   The system load as it exists currently is heavily tilted towards system CPU with malloc as call where the process is spending its time.  The code is not multi-threaded; it's actually a multi-user timesharing system.   I'm taking a closer look at Hoard to see if it will address this issue.  Thanks again.

Commented:
A lot of I/O wait and kernel mode may be a sign of paging in/out because of short memory. In this case one-time allocation may not solve the problem.
11/26 Forrester Webinar: Savings for Enterprise

How can your organization benefit from savings just by replacing your legacy backup solutions with Acronis' #CyberProtection? Join Forrester's Joe Branca and Ryan Davis from Acronis live as they explain how you can too.

Author

Commented:
Thanks for the comment.  I don't appear to have a lot of I/O wait based on the CPU information.  I am in the process of gathering I/O statistics and memory statistics, but the initial runs indicate no issues here.   Unfortunately, it takes a little time to gather the statistics because the system is controlling a real-time process and the resources that provide input into the system in real-time must be scheduled.

But, this brings up the question of how to best manage memory in Linux.  In general, memory managers can potentially have one of three mutually exclusive scenarios to manage against.  These are:

1.  Everything that is on the system right now should run as fast as possible.  There will be little or no additional load placed on the system at any random point in time.  A real-time process control system works in this fashion.   Once the system is running. demand remains constant.

2. Load will constantly change at random points in time.  All of a sudden, a bunch of new processes will show up or disappear.  Resources should be kept in reserve for this possibility.  An online ticketing system or event management system might be this type of system.

3. Load will change, but in a non-random or somewhat predictable fashion.  This could be, for example, a database application where updates and reports are run simultaneously.  Data entry people might be only entering transactions, but queries against the entire database are run by known processes at random points within a fixed time window (like a workday).

My system is more like number 3 above.  Although it's doing process control, the values on the input sensors can trigger a query of the data.   So, I have this fairly constant stream of input data and associated calculation, mixed with random events where the database is read completely into memory so the application can build a linked list (remember, I can't change the application).

So, let's say I just add more memory to the system (it's already at 8GB and the whole database is only 150MB).  On what basis am I predicting that this memory is going to be used more efficiently if the application has to slog through all these malloc statements?  How do I know, beyond any noticed changes in the kernel mode CPU, that it is being used any more efficiently?   Will I just end up with more free memory in my top display?

Thanks again for the comment.  Looking forward to more information if possible.
Top Expert 2009
Commented:
The critical key is to use no more than your real RAM for all involved, or as soon as you go over the  8GB limit for kernel + programs  you start swapping. Not quite that simple, since the kernel will page inactive things out so you'll effectively get more than 8GB but the point is, use real RAM, not virtual RAM.

Do you have any reason to suspect that you are actually using more than 8GB total?

Use vmstat and top to verify that you are not over the physical RAM threshold and into swapping.

I'm back to the original thinking that you could have millions of tiny little mallocs where you might be better off reworking that piece to use large contiguous chunks of RAM. Also, a linked list begs the quesiton: How are you accessing the items in that list? Hopefully only linearly. Otherwise the linked list itself should be replaced with a tree of hash structure for many random accesses.

Author

Commented:
Finally got another test run -

When I look at the memory utilization on the system during peak load, only about 1-2GB appears to be actively used.  Almost 6GB is in cache and no swapping is occurring.  No swapping is being observed and the swap file is not growing.  I think the "millions of mallocs" comment is at least partially correct, though there are several dozen C programs in this system and each one would have to be analyzed to see if malloc was the culprit in all cases.  

The linked list is accessed in a linear fashion, so at least that part is ok.

Keeping in mind that modifying the application is not an option, at this point, the options look like:

1. See if Hoard will improve things by reducing the demand created by the malloc processing.
2. Add processors.  An 8-core or even 16-core system will likely handle the millions of mallocs better.
3. Find something for this 6GB of cache memory to do, besides eliminate I/O wait and move the bottleneck to CPU.  What would that be?



Commented:
Well, 40% of time in kernel space is a lot.
It can come from, for example, system calls: what does "strace" tell you?

Author

Commented:
strace tells me the process is spending about 10% malloc and 9% in munmap

I managed to get the system to thrash by increasing the load.  CPU pegged at 100% for several minutes and user mode slowly but steadily dropped.  I am beginning to suspect it has something to do with task management.  When I stopped nearly all application processing, the kernel load dropped to near zero (as expected) but the task list remained high - around 250 tasks, almost all sleeping.   As I start to ramp up the user demand, the kernel mode starts to go up again.  

I'm wondering if there are a bunch of hung tasks out there that are taking some critical resource.   As I ramp up the user load, the system has a progressively more difficult time allocating this resource and starts to thrash.  Could that be a possibility?  If so, what should I do?  I assume I try to find and kill these tasks...

Author

Commented:
I have some output from a system trace on a calc routine on the system with the performance problem while the system is under load:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total          
 time   seconds   seconds    calls  ms/call  ms/call  name    
 67.43     14.97    14.97                             read
 13.24     17.91     2.94                             write
  7.52     19.58     1.67                             lseek
  2.57     20.15     0.57                             __fcntl_nocancel
  1.26     20.43     0.28                             time
  0.68     20.58     0.15                             _int_malloc
  0.41     20.67     0.09                             times
  0.36     20.75     0.08                             _dl_close

On a slower system with a much lighter load, a similar transaction generates the following result:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total          
 time   seconds   seconds    calls  ms/call  ms/call  name    
 62.21      1.35     1.35                             read
 15.67      1.69     0.34                             lseek
  2.76      1.75     0.06                             gfield
  1.38      1.78     0.03                             memset
  1.38      1.81     0.03                             strcmp
  1.38      1.84     0.03                             table_is_current
  0.92      1.86     0.02                             dbfget
  0.92      1.88     0.02                             dbfval
  0.92      1.90     0.02                             fldesc
  0.92      1.92     0.02                             recphys
  0.92      1.94     0.02                             strbcmp
  0.92      1.96     0.02                             write

It looks like this discrepancy might point us at the problem.  If I understand self seconds correctly, I am using a lot more processing time on the big fast system then the slow small system to do a set of read operations.  How do I find out why these reads are taking so many cycles?


Commented:
I don't see currently any use of gprof. But the heavy usage of munmap tells that there is a real problem with a number of memory allocations/deallocations. GNU malloc uses mmap() for blocks of large size, thus deallocating them using munmap() call.
From this I can conclude that you can optimize memory allocation/deallocation pattern alot.

Author

Commented:
I agree that the memory allocation pattern can be optimized a lot.  However, I don't think that's what is  bringing this system to its knees.  2 reasons why I think this:

1.  The system ran better on the older, lower capacity hardware under the same load.
2.  The number of processes running like the profile listed above (with reads at 14.97 seconds) is much higher than the query processes.  

The indicators point towards a bottleneck when accessing a shared resource of some kind.   This is supposition of course, but since modifying the application is outside the scope of changes I can make, I need to keep looking.

The application was written at a time when the machine it ran on had very little memory and slow disk.  For these reasons, the approach towards memory allocation in the app makes sense.  On a system with fast I/O and lots of memory, it makes sense to change it, but since the application is made up of a variety of programs, changing one may not solve the problem.

Author

Commented:
I apologize - I realized as I read through this post that the definition of the problem is changing in nature and moving away from the original note.   So, to return and respond directly to JIEXA:

How can I optimize the memory allocation/deallocation pattern?

 
Commented:
I don't know whether you can rewrite the application's part we're talking about (the one that uses a lot of munmap() calls).
The older and slower machine probably had different malloc implementation. Also, if the allocations themselves became bigger (say, on old machine you've allocated blocks of 512kb, and now you allocate blocks of 2Mb), even the same GNU malloc implementation can use mmap instead of sbrk (and of course it can be slower even on the new machine).

So, if you have access to the source code of your application and can change its allocations to be smaller (below 512k) and/or less frequent (for example, by writing your own layer of functions for memory management, like global "operator new/[]" for C++, it's a probable option.

Another one is to take some different malloc implementation (like latest one from Doug Lea, htmalloc, Hoard etc), create its shared library and override malloc etc for your application like:
   export LD_PRELOAD=/my/malloc/impl.so ; /the/allocation/application "with some params" -given "to it"

Author

Commented:
All good advice - thanks.

Had to award some points for the "millions of mallocs" line since I'm using it.  Hoard is a good suggestion, though it's designed for multithreaded processes, which this is not.  But it or a different malloc implementation is a good suggestion.

Do more with

Expert Office
Submit tech questions to Ask the Experts™ at any time to receive solutions, advice, and new ideas from leading industry professionals.

Start 7-Day Free Trial