Memory allocation for a large list - WITHOUT swap - how to figure out what I can allocate?
Posted on 2005-05-16
I'm looking for a way to determine how much memory is available to a process before it begins to use swap. I want to allocate only as many table/linked list entries as I can without utilizing swap space.
I realize this is an odd request.
The reason behind it is that the program has two modes of behavior.
In normal mode, the VM subsystem does its thing, and the 10% of entries that are referenced frequently are the ones that remain in the working set.
But then, say, every 5 minutes, the process needs to walk the entire structure (it's a top-n and purge idle process in fact). So if entries are swapped out, this causes a page-swap storm as each page is brought in for the walk, only to cause another page to be swapped out. By the end of the walk, the 10% I really need are probably gone and so they have to be paged back in.
Then in 10-20 seconds things settle down ... until the next walk.
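One way to put a number on the storm described above is to count major page faults around the walk. This is only a minimal sketch in Python, assuming a Unix host where the standard `resource` module is available; `major_faults` and `walk_and_count_faults` are hypothetical names, and the real process would do the equivalent with getrusage() from C.

```python
import resource


def major_faults():
    """Major page faults (those that needed disk I/O) for this process so far."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_majflt


def walk_and_count_faults(entries, visit):
    """Visit every entry and return the number of major faults the walk incurred."""
    before = major_faults()
    for entry in entries:
        visit(entry)
    return major_faults() - before


if __name__ == "__main__":
    # Hypothetical stand-in for the real table: a plain list of integers.
    table = list(range(1000))
    touched = []
    faults = walk_and_count_faults(table, touched.append)
    print("major faults during walk:", faults)
```

A count near zero would suggest the table still fits in RAM; a count that grows with the table size would suggest the walk is pulling pages back in from swap.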
The 10% isn't always the same 10% - so I can't set up a multi-level structure. It's a locality of reference thing - if I reference entry 1002 now, I'll probably reference it again soon. If I haven't referenced entry 1001 for a while, I probably won't. That's why the VM does a pretty ok job until the walk starts... But there is enough variability that you can't figure out if something belongs in L1 or L2 or L3. And there is NOTHING identifiable about an individual entry - the cause of active/inactive is external to my process.
Other complicating facts:
(0) Each 'entry' is actually composed of multiple malloc() items, so there isn't a fixed size. All are relatively small and so come out of sbrk. I can adjust thresholds to make some come out of mmap but why bother? Once I allocate enough, some malloc() implementations will mmap another large region for smaller suballocations anyway. I cannot control the malloc() on the system - it's probably the glibc default, but it doesn't have to be.
(1) The process is soft real time.
This means the walk needs to finish in a few seconds so that other processes can run on time.
(2) There may be other processes running and the system is probably configured with swap space.
What this means is that the basic memory size information the system provides is NOT equivalent to how much my process can allocate.
(3) Other processes could suddenly change their memory usage, forcing my table/linked list entries to be paged out.
Yes, I know that this whole question assumes that the rest of the system is pretty stable. That's a valid assumption - we recommend that only our process runs on the host, for precisely this reason. But even if other processes are running, they tend to settle down into their own vm size and resident/working set. So after the system is up, if I could allocate 100MB of memory w/o starting to page, then I probably can always allocate 100MB of memory.
I'm willing to live with this - I can retune this size every hour or so. A little swapping won't kill me, the problem is that swapping EVERYTHING in makes the walk take so long that other near-real-time processes don't run.
(4) It would be nice if this worked across Unixes, but I can live with a Linux only solution.
(5) I'm well aware of vmstat, free, etc. and their associated /proc files. And mallinfo() and getrlimit() etc.
But these don't tell me how much I can increase my RSS (working set) before starting to swap, because Linux grabs otherwise free memory for buffers and cache. And re-read #2, above...
Here, for example, is the output from free, /proc/meminfo and vmstat, with the process having a small number of entries and pretty much idle:
             total       used       free     shared    buffers     cached
Mem:        839756     216580     623176          0      32020      79520
-/+ buffers/cache:     105040     734716
Swap:      1012084          0    1012084
$ cat /proc/meminfo
MemTotal: 839756 kB
MemFree: 619400 kB
Buffers: 32432 kB
Cached: 81708 kB
SwapCached: 0 kB
Active: 116016 kB
Inactive: 67136 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 839756 kB
LowFree: 619400 kB
SwapTotal: 1012084 kB
SwapFree: 1012084 kB
Dirty: 8 kB
Writeback: 0 kB
Mapped: 76600 kB
Slab: 30968 kB
CommitLimit: 1431960 kB
Committed_AS: 304720 kB
PageTables: 1152 kB
VmallocTotal: 180216 kB
VmallocUsed: 3600 kB
VmallocChunk: 174580 kB
Hugepagesize: 2048 kB
$ vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0      0 619272  32616  82304    0    0    13    15  505    16 27  0 72  1
So it looks like I could allocate 600 something MB before swapping. Can I really? No!
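For reference, the 600-something MB figure is roughly MemFree + Buffers + Cached from the listings above. The following is a minimal Python sketch of that arithmetic, not a recommended estimator; `parse_meminfo` and `naive_headroom_kb` are hypothetical names, and SAMPLE just repeats four lines from the /proc/meminfo output above.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict mapping field name to its number."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def naive_headroom_kb(fields):
    """MemFree + Buffers + Cached: the optimistic 'free' figure, in kB."""
    return fields["MemFree"] + fields["Buffers"] + fields["Cached"]


SAMPLE = """\
MemTotal: 839756 kB
MemFree: 619400 kB
Buffers: 32432 kB
Cached: 81708 kB
"""

if __name__ == "__main__":
    # On a live Linux host, read open("/proc/meminfo").read() instead of SAMPLE.
    print(naive_headroom_kb(parse_meminfo(SAMPLE)), "kB")
```

This is at best an upper bound: it says nothing about how much of the cache the kernel will actually give up before it starts paging out process memory, or about what other processes will claim.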
(6) Don't bother suggesting a caching strategy. Because of the realities of the data it won't work (I cannot tell which is an active entry - all I could do is migration - and the virtual memory manager does a far, far better job and does it automatically).
(7) Right now, what I do is check every hour or so whether the system-wide swap usage has grown. If so, I set the limit to about 95% of what I am presently using. That actually works OK, but I would prefer a less empirical solution...
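The hourly check in (7) can be sketched as follows. This is a minimal Python illustration, assuming that "swap space has grown" means swap in use (SwapTotal - SwapFree from /proc/meminfo) and that the process tracks its own usage in kB; `swap_in_use_kb` and `retune_limit` are hypothetical names, and the 200000 and 100000 kB figures are invented for the example.

```python
def swap_in_use_kb(swap_total_kb, swap_free_kb):
    """System-wide swap currently in use, in kB."""
    return swap_total_kb - swap_free_kb


def retune_limit(prev_swap_kb, cur_swap_kb, my_usage_kb, limit_kb, percent=95):
    """Return the allocation limit to use until the next check.

    If swap use grew since the previous check, clamp the limit to
    `percent` of what the process is using now; otherwise keep it.
    """
    if cur_swap_kb > prev_swap_kb:
        return my_usage_kb * percent // 100  # integer math avoids float rounding
    return limit_kb


if __name__ == "__main__":
    # Swap values from the listing above: nothing is swapped out yet.
    idle = swap_in_use_kb(1012084, 1012084)
    print(retune_limit(0, idle, 100000, 200000))  # limit unchanged
    print(retune_limit(0, 512, 100000, 200000))   # swap grew, so clamp
```

The sketch only ever lowers the limit when swap grows; how and when to raise it again is left open, as it is in the description above.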