markdolar (United States of America)

asked:

Trading Memory for CPU in a Linux application (without modifying the application, that is)

Hello Experts,

I have here a Linux-based application that was recently moved onto a faster server with lots of memory and fast disks. The application, written in C, basically performs the following steps:

Read data from a file
Allocate memory for the data
Add the data to a linked list
Repeat
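
In other words, the shape of the loop is roughly this (a simplified sketch for illustration, not the actual application code; the record size and file name are assumptions):

/* Hypothetical sketch of the loop described above: one malloc per record. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct node {
    char data[256];               /* assumed fixed-size record */
    struct node *next;
};

int main(void)
{
    FILE *fp = fopen("data.db", "rb");          /* file name is illustrative */
    if (!fp)
        return 1;

    struct node *head = NULL, *tail = NULL;
    char buf[256];

    while (fread(buf, sizeof buf, 1, fp) == 1) {
        struct node *n = malloc(sizeof *n);     /* one allocation per record */
        if (!n)
            break;
        memcpy(n->data, buf, sizeof buf);
        n->next = NULL;
        if (tail) tail->next = n; else head = n;
        tail = n;
    }
    fclose(fp);
    /* ... application work on the list ... */
    return 0;
}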

When I run the application on the older server and monitor CPU performance, I see it spending about 20% of its time in I/O wait and 15% in kernel mode. When I move the application to the new fast server, it spends about 1% in I/O wait and 40% in kernel mode. While this is sort of a good thing, if I run about 16 of these I can max out the CPUs on my new dual quad-core server. I need to do other things on this server and this gets in the way. When I analyze where the application is spending its time, on the old server it is mostly in read and on the new server it is mostly in malloc. On the new server there is lots of free memory; on the old server memory is more limited. In top, I can observe the memory allocation going up as the linked list is built. The list can be very long and building it can take several minutes.

The obvious path is to rewrite the application so that the memory is allocated once at the beginning of data retrieval.  The list is not open-ended, so the amount of memory required can be determined in advance (or at least close enough).   However, changing the application will take some time and I would like to see if there is a way to tune the Linux OS to optimize performance of this application in its current state.
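
If I did rewrite it, the idea would be something like the following sketch, which allocates one block up front based on the file size and threads the list through it (the record size and names are assumptions, not the real code):

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

#define RECSZ 256   /* assumed fixed record size */

struct node { char data[RECSZ]; struct node *next; };

struct node *build_list(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return NULL;

    size_t nrec = (size_t)st.st_size / RECSZ;        /* record count known up front */
    struct node *pool = malloc(nrec * sizeof *pool); /* one allocation for the whole list */
    if (pool == NULL)
        return NULL;

    FILE *fp = fopen(path, "rb");
    if (fp == NULL) { free(pool); return NULL; }

    size_t i = 0;
    while (i < nrec && fread(pool[i].data, RECSZ, 1, fp) == 1) {
        pool[i].next = &pool[i + 1];                 /* links point into the same block */
        i++;
    }
    fclose(fp);

    if (i == 0) { free(pool); return NULL; }
    pool[i - 1].next = NULL;                         /* terminate at the actual count */
    return pool;
}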

I can tweak system configuration parameters, user account parameters, hardware, process priorities and stuff like that. I could probably even patch the kernel if that would help. The one thing I can't easily do is modify the application.

I realize that by changing the priorities of the processes, I can "buy back" some of the throttling effect being provided by the slower I/O subsystem on the older machine. That has the net effect of extending the time it takes to build the linked list, so it doesn't count as a valid answer to this question. What I am looking for is an approach that reduces the total time the application spends in kernel mode.

Thank you in advance.

SOLUTION by mrjoltcola (United States of America)
[Solution text available to Experts Exchange members only.]
markdolar (Asker):

Thanks for the response. The system load as it currently stands is heavily tilted towards system CPU, with malloc being the call where the process spends its time. The code is not multi-threaded; it's actually a multi-user timesharing system. I'm taking a closer look at Hoard to see if it will address this issue. Thanks again.
A lot of I/O wait and kernel-mode time may be a sign of paging in and out because of a memory shortage. In that case, one-time allocation may not solve the problem.
Thanks for the comment.  I don't appear to have a lot of I/O wait based on the CPU information.  I am in the process of gathering I/O statistics and memory statistics, but the initial runs indicate no issues here.   Unfortunately, it takes a little time to gather the statistics because the system is controlling a real-time process and the resources that provide input into the system in real-time must be scheduled.

But, this brings up the question of how to best manage memory in Linux.  In general, memory managers can potentially have one of three mutually exclusive scenarios to manage against.  These are:

1. Everything that is on the system right now should run as fast as possible. There will be little or no additional load placed on the system at any random point in time. A real-time process control system works in this fashion. Once the system is running, demand remains constant.

2. Load will constantly change at random points in time.  All of a sudden, a bunch of new processes will show up or disappear.  Resources should be kept in reserve for this possibility.  An online ticketing system or event management system might be this type of system.

3. Load will change, but in a non-random or somewhat predictable fashion.  This could be, for example, a database application where updates and reports are run simultaneously.  Data entry people might be only entering transactions, but queries against the entire database are run by known processes at random points within a fixed time window (like a workday).

My system is more like number 3 above.  Although it's doing process control, the values on the input sensors can trigger a query of the data.   So, I have this fairly constant stream of input data and associated calculation, mixed with random events where the database is read completely into memory so the application can build a linked list (remember, I can't change the application).

So, let's say I just add more memory to the system (it's already at 8GB and the whole database is only 150MB). On what basis am I predicting that this memory is going to be used more efficiently if the application has to slog through all these malloc calls? How do I know, beyond any observed changes in kernel-mode CPU, that it is being used more efficiently? Will I just end up with more free memory in my top display?
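
One way I could measure this without touching the application would be to run it under a small wrapper that reports the user/system CPU split and peak resident set (a sketch, similar in spirit to /usr/bin/time, not production code):

#include <stdio.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 2;
    }

    pid_t pid = fork();
    if (pid == 0) {
        execvp(argv[1], &argv[1]);   /* run the application unmodified */
        _exit(127);
    }

    int status;
    waitpid(pid, &status, 0);

    struct rusage ru;
    getrusage(RUSAGE_CHILDREN, &ru); /* resource totals for the finished child */
    printf("user   %ld.%06ld s\n", (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
    printf("system %ld.%06ld s\n", (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    printf("maxrss %ld kB\n", ru.ru_maxrss);
    return 0;
}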

Thanks again for the comment.  Looking forward to more information if possible.
SOLUTION
[Solution text available to Experts Exchange members only.]
Finally got another test run -

When I look at memory utilization on the system during peak load, only about 1-2GB appears to be actively used. Almost 6GB is in cache, no swapping is occurring, and the swap file is not growing. I think the "millions of mallocs" comment is at least partially correct, though there are several dozen C programs in this system and each one would have to be analyzed to see whether malloc is the culprit in all cases.

The linked list is accessed in a linear fashion, so at least that part is ok.

Keeping in mind that modifying the application is not an option, at this point the alternatives look like:

1. See if Hoard will improve things by reducing the demand created by the malloc processing.
2. Add processors.  An 8-core or even 16-core system will likely handle the millions of mallocs better.
3. Find something useful for this 6GB of cache memory to do, besides eliminating I/O wait and moving the bottleneck to the CPU. What would that be? (One possibility is sketched below.)
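
For option 3, one possibility (sketched here as a separate helper, so the application itself stays untouched; the file name is an assumption) is to warm the page cache before the query runs, so the application's read() calls are served from memory:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    const char *path = (argc > 1) ? argv[1] : "database.dbf";  /* illustrative name */

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

    /* Ask the kernel to pull the whole file into the page cache ahead of time,
       so the application's subsequent read() calls hit memory instead of disk. */
    posix_fadvise(fd, 0, st.st_size, POSIX_FADV_WILLNEED);

    close(fd);
    return 0;
}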



Well, 40% of time in kernel space is a lot.
It can come from, for example, system calls: what does "strace" tell you?
strace tells me the process is spending about 10% of its time in malloc and 9% in munmap.

I managed to get the system to thrash by increasing the load. The CPU pegged at 100% for several minutes and user mode slowly but steadily dropped. I am beginning to suspect it has something to do with task management. When I stopped nearly all application processing, the kernel load dropped to near zero (as expected) but the task list remained high - around 250 tasks, almost all sleeping. As I start to ramp up user demand, the kernel-mode time starts to go up again.

I'm wondering if there are a bunch of hung tasks out there that are holding some critical resource. As I ramp up the user load, the system has a progressively more difficult time allocating this resource and starts to thrash. Could that be a possibility? If so, what should I do? I assume I should try to find and kill these tasks...
I have some output from a system trace of a calc routine, taken on the system with the performance problem while it was under load:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total          
 time   seconds   seconds    calls  ms/call  ms/call  name    
 67.43     14.97    14.97                             read
 13.24     17.91     2.94                             write
  7.52     19.58     1.67                             lseek
  2.57     20.15     0.57                             __fcntl_nocancel
  1.26     20.43     0.28                             time
  0.68     20.58     0.15                             _int_malloc
  0.41     20.67     0.09                             times
  0.36     20.75     0.08                             _dl_close

On a slower system with a much lighter load, a similar transaction generates the following result:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total          
 time   seconds   seconds    calls  ms/call  ms/call  name    
 62.21      1.35     1.35                             read
 15.67      1.69     0.34                             lseek
  2.76      1.75     0.06                             gfield
  1.38      1.78     0.03                             memset
  1.38      1.81     0.03                             strcmp
  1.38      1.84     0.03                             table_is_current
  0.92      1.86     0.02                             dbfget
  0.92      1.88     0.02                             dbfval
  0.92      1.90     0.02                             fldesc
  0.92      1.92     0.02                             recphys
  0.92      1.94     0.02                             strbcmp
  0.92      1.96     0.02                             write

It looks like this discrepancy might point us at the problem. If I understand "self seconds" correctly, I am using a lot more processing time on the big fast system than on the small slow system to do a set of read operations. How do I find out why these reads are taking so many cycles?
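
One experiment that might explain it: if the reads are small, the cost is dominated by per-syscall overhead rather than by the data itself. A quick micro-benchmark sketch (the file name is an assumption) to compare system time for small versus large read sizes:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

/* System CPU time consumed by this process so far, in seconds. */
static double sys_seconds(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

static void read_all(const char *path, size_t chunk)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); exit(1); }

    char *buf = malloc(chunk);
    double t0 = sys_seconds();
    while (read(fd, buf, chunk) > 0)
        ;                              /* just consume the file */
    printf("chunk %6zu bytes: %.3f s system time\n", chunk, sys_seconds() - t0);

    free(buf);
    close(fd);
}

int main(int argc, char *argv[])
{
    const char *path = (argc > 1) ? argv[1] : "database.dbf";  /* illustrative */
    read_all(path, 64);        /* many small syscalls */
    read_all(path, 65536);     /* far fewer syscalls  */
    return 0;
}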


I don't currently see any use of gprof. But the heavy usage of munmap suggests there is a real problem with the number of memory allocations/deallocations. GNU malloc uses mmap() for large blocks, and therefore deallocates them with the munmap() call.
From this I conclude that you can optimize the memory allocation/deallocation pattern a lot.
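
If the code itself cannot be touched, one thing worth trying is tuning glibc's allocator from the outside. A sketch of a tiny preloaded library whose constructor raises the mmap threshold, so large blocks stay on the main heap instead of going through mmap()/munmap() (the threshold values are illustrative; glibc also honors the MALLOC_MMAP_THRESHOLD_ and MALLOC_TRIM_THRESHOLD_ environment variables, which need no code at all):

/* Runs before main() of the unmodified application when preloaded. */
#include <malloc.h>

__attribute__((constructor))
static void tune_malloc(void)
{
    /* Keep blocks up to 1 MB on the main heap (glibc's default threshold is
       128 kB), and do not return freed heap memory to the kernel too eagerly. */
    mallopt(M_MMAP_THRESHOLD, 1024 * 1024);
    mallopt(M_TRIM_THRESHOLD, 16 * 1024 * 1024);
}

/* Build (illustrative): gcc -shared -fPIC -o tune_malloc.so tune_malloc.c
   Run:                  LD_PRELOAD=./tune_malloc.so ./application          */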
I agree that the memory allocation pattern can be optimized a lot. However, I don't think that's what is bringing this system to its knees. Two reasons why I think this:

1. The system ran better on the older, lower-capacity hardware under the same load.
2. The number of processes running with a profile like the one listed above (with reads at 14.97 seconds) is much higher than the number of query processes.

The indicators point towards a bottleneck when accessing a shared resource of some kind. This is supposition, of course, but since modifying the application is outside the scope of changes I can make, I need to keep looking.

The application was written at a time when the machine it ran on had very little memory and a slow disk. For those reasons, the approach towards memory allocation in the app made sense. On a system with fast I/O and lots of memory it makes sense to change it, but since the application is made up of a variety of programs, changing one may not solve the problem.
I apologize - as I read through this post, I realize that the definition of the problem has been changing and moving away from the original note. So, to return and respond directly to JIEXA:

How can I optimize the memory allocation/deallocation pattern?

 
ASKER CERTIFIED SOLUTION
[Solution text available to Experts Exchange members only.]
All good advice - thanks.

Had to award some points for the "millions of mallocs" line since I'm using it. Hoard is a good suggestion, though it's designed for multithreaded processes, which this is not. Still, trying it or a different malloc implementation is worthwhile.
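
For reference, allocators like Hoard are dropped in through symbol interposition with LD_PRELOAD, so the application binary never changes. A toy sketch of the mechanism, which only counts calls while forwarding to the real allocator (a real replacement would supply its own allocator behind the same symbols):

#define _GNU_SOURCE           /* for RTLD_NEXT */
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

static void *(*real_malloc)(size_t) = NULL;
static void  (*real_free)(void *)   = NULL;
static unsigned long n_malloc, n_free;

static void init(void)
{
    /* Look up the next definition of malloc/free in link order (glibc's). */
    real_malloc = dlsym(RTLD_NEXT, "malloc");
    real_free   = dlsym(RTLD_NEXT, "free");
}

void *malloc(size_t size)
{
    if (!real_malloc) init();
    n_malloc++;
    return real_malloc(size);
}

void free(void *ptr)
{
    if (!real_free) init();
    n_free++;
    real_free(ptr);
}

__attribute__((destructor))
static void report(void)
{
    fprintf(stderr, "malloc calls: %lu, free calls: %lu\n", n_malloc, n_free);
}

/* Build (illustrative): gcc -shared -fPIC -o count_malloc.so count_malloc.c -ldl
   Run:                  LD_PRELOAD=./count_malloc.so ./application             */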