Link to home
Start Free TrialLog in
Avatar of bstrauss3
bstrauss3

asked on

Memory allocation for a large list - WITHOUT swap - how to figure out what I can allocate?

I'm looking for a way to determine how much memory is available to a process before it begins to use swap.  I want to allocate only as many table/linked list entries as I can without utilizing swap space.

I realize this is an odd request.

The reason behind it is that the program has two modes of behavior.

   In normal mode, obviously the VM subsystem does it's thing and the 10% of entries which are being referenced frequently are the ones that remain in the working set.

  But then, say, every 5 minutes, the process needs to walk the entire structure (it's a top-n and purge idle process in fact).  So if entries are swapped out, this causes a page-swap storm as each page is brought in for the walk, only to cause another page to be swapped out.  By the end of the walk, the 10% I really need are probably gone and so they have to be paged back in.

Then in 10-20 seconds things settle down ... until the next walk.

The 10% isn't always the same 10% - so I can't set up a multi-level structure.  It's a locality of reference thing - if I reference entry 1002 now, I'll probably reference it again soon.  If I haven't referenced entry 1001 for a while, I probably won't.  That's why the VM does a pretty ok job until the walk starts... But there is enough variablity that you can't figure out if something belongs in L1 or L2 or L3.  And there is NOTHING identifiable about an individual entry - the cause of active/inactive is external to my process.

Other complicating facts:

(0) Each 'entry' is actually composed of multiple malloc() items, so there isn't a fixed size.  All are relatively small and so come out of sbrk.  I can adjust thresholds to make some come out of mmap but why bother?  Once I allocate enough, some malloc() implementations will mmap another large region for smaller suballocations anyway.  I can not control the malloc() on the system - it's probably the glibc default, but it doesn't have to be.


(1) The process is soft real time.

This means the walk needs to finish in a few seconds so that other processes can run on time.


(2) There may be other processes running and the system is probably configured with swap space.  

What this means is that the basic memory size information the system provides is NOT equivalent to how much my process can allocate.


(3) Other processes could suddenly change their memory usage, forcing my table/linked list entries to be paged out.

Yes, I know that this whole question assumes that the rest of the system is pretty stable.  That's a valid assumption - we recommend that only our process runs on the host, for precisely this reason.  But even if other processes are running, they tend to settle down into their own vm size and resident/working set.  So after the system is up, if I could allocate 100MB of memory w/o starting to page, then I probably can always allocate 100MB of memory.

I'm willing to live with this - I can retune this size every hour or so.  A little swapping won't kill me, the problem is that swapping EVERYTHING in makes the walk take so long that other near-real-time processes don't run.


(4) It would be nice if this worked across Unixes, but I can live with a Linux only solution.


(5) I'm well aware of vmstat, free, etc. and their associated /proc files.  And mallinfo() and getrlimit() etc.

But these don't tell you how much I can increase my RSS (working set) before starting to swap, because of Linux's grab memory buffering strategy.  And re-read #2, above...

Here for example is the output from free (/proc/meminfo), with the process having a small number of entries and pretty much idle:

$ free
                     total        used         free     shared    buffers     cached
Mem:        839756     216580     623176          0      32020      79520
-/+ buffers/cache:     105040     734716
Swap:      1012084          0    1012084

$ cat /proc/meminfo
MemTotal:       839756 kB
MemFree:        619400 kB
Buffers:         32432 kB
Cached:          81708 kB
SwapCached:          0 kB
Active:         116016 kB
Inactive:        67136 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       839756 kB
LowFree:        619400 kB
SwapTotal:     1012084 kB
SwapFree:      1012084 kB
Dirty:               8 kB
Writeback:           0 kB
Mapped:          76600 kB
Slab:            30968 kB
CommitLimit:   1431960 kB
Committed_AS:   304720 kB
PageTables:       1152 kB
VmallocTotal:   180216 kB
VmallocUsed:      3600 kB
VmallocChunk:   174580 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB

$ vmstat
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0      0 619272  32616  82304    0    0    13    15  505    16 27  0 72  1


So it looks like I could allocate 600 something MB before swapping.  Can I really? No!

(6) Don't bother suggesting a caching strategy.  Because of the realities of the data it won't work (I can not tell which is an active entry - all I could do is migration - and the virtual memory manager does a far, far better job and does it automatically).

(7) Right now, what I have is to check every hour or so and figure out if the system-wide amount of swap space has grown.  If so, I set the limit to about 95% of what I am presently using.  That actually works OK, but I would prefer a less empirical solution...

-----Burton

ASKER CERTIFIED SOLUTION
Avatar of NovaDenizen
NovaDenizen

Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
Avatar of bstrauss3
bstrauss3

ASKER

NovaDenizen:

mlock() in an interesting idea.  In order for the call to succeed, I have to have allocated the memory via mmap.   And it has to happen before I shed root privs.  Both of those I can finesse around.  So I would have to start by mmap() and mlock() until I no longer get MNOMEM.  Of course, this would pretty much kill any subsequent process that tried to run on the box.  And I wouldn't be able to alter my usage.  I'll have to think about that for a while...

Interesting idea, but not really an answer to the ? "how much can I allocate".   Let's see what other answers pop up.

-----Burton
Talmash:

No.  All top does is to echo the values from /proc.   Please read my (5) in the question.

-----Burton
I'm not 100% familiar with mlock(), but I don't think mmap is a prerequisite for calling it.  At any rate, you would have to implement your own buffer pool and cache within the mlock()ed area.  

You could also do something like this.  It mlock()s memory gradually up to a limit.  Call it every time you malloc a buffer.  You have to hold on to root privileges, though:
#define MLOCK_LIMIT_KB 512
#define PAGE_ADDR(ptr) ((ptr) & 0xfffff000)
void *extend_mlock(void *ptr) {
    static unsigned long long last_mlock = 0;  // is this right?  not sure what the minimum should be
    unsigned long long ptr_l = (unsigned long long) ptr;
    if (reslong > last_mlock && reslong < (MLOCK_LIMIT_KB << 10)) {
         mlock(last_mlock, PAGE_ADDR(ptr_l + 4095) - last_mlock);
         last_mlock = PAGE_ADDR(ptr_l + 4095);
    }
}

Another alternative is mlockall().  Call mlockall(MCL_FUTURE) and your entire process will stay in ram.  You can drop root privileges right after the call.  You have to be responsible and make sure your process doesn't use up all the ram, though.

silly question: how can you enshure that your process (or parts of it) are not swapped out?
If you cannot enshure that ('cause you're not root) why do you care about your data/tables then?

For user space processes you may use mmap, but keep in mind that this may cause disk IO.
Probably shared memory can solve your problem for the data part of your process (sorry no experiance with this)
Silly answer: Read my original question ... it's a performance issue.

Better answer:  If I walk the structure and it's in memory it completes a whale of a lot faster than if it's so big much of it gets swapped out.

So, I'm trying to offer a trade-off - store less data but have better performance.  Then the user makes the decision - if you need all the data, well, you don't enable the ceiling option.  If performance is more important - set the ceiling.


Remember how an LRU VM works - infrequently used pages get aged out to slower storage.  With my reference pattern it's almost guaranteed that as I bring in rarely used entries for the walk, the frequently used ones I really need will be swapped out, because at that instant in time they are the least recently referenced ones.  Say my total size is 1G and my working set 100M - I'll end up causing 1.1GB of swapping...

After the walk ends, the frequently used entries are brought back in for the normal process.  I can't cache because *I* can't identify the frequently used ones - it's external.  The only characteristic of frequently used entries are that they are frequently used.   So while you would normally setup a manual caching process, why bother - the Virtual Memory Manager already does this.

It works great for the 59s of each minute which are the "normal" process, but that 1s of walking all entries is a killer...

-----Burton
>  Silly answer: Read my original question ... it's a performance issue.
> .. allocate ..  as I can without utilizing swap space.

hmm, it might be a perfomance issue for you as programmer/user
But the OS (in particular the implementation of the kernel's task switcher and memory manager) desides which process gets swapped out, the programmer cannot prevend this (exceptions see comments above).

Said this, my question again: how can you enshure that your process (or parts of it) are not swapped out?
If you solved this, you can simply use any algorithm storing your most important data on the stack.
Or do I miss something?
"how can you enshure [sic] that your process (or parts of it) are not swapped out?"  That's basically what I'm asking, but as "how much malloc() can I do", vs. tje historical question, "What is my resident set size"?

More broadly, yes you are missing something - which I said in my original question - there is no way to tell what's important.  

You can tell what WAS important, but that's only past information.  If I use that information, I'm as likely to make a bad choice as a good one.  And what was good at t-30 may not be good now at t  (it was probably a good choice at t-31 and t-29, but ... )

Think of it as Brownian motion (drunk's walk) - eventually you get from point a to point b, but a priori I can't say where you are going to be.  At any point in time you are likely to be near where you were in the recent past, but not always (one good lurch moves you a long way as it were).

That's exactly the locality of reference that a Least Recently Used (LRU) algorithm depends upon, so in general the OS' VMM does a good job of keeping important data in memory and swapping out the less important stuff.

Then, every minute or five, I have a process which checks EVERY entry.  This causes them all to be swapped in  AND distorts the LRU data.  After the process completes, we go back to the 'steady state' and the VMM is again doing a good job.

But, if I offer the opportunity to only store enough data so that I don't swap, this periodic process runs much faster - fast enough so that it completes in the available window and I don't lose data.  It's both quantity  (limit-x < x) of data to process AND the per item time, since I don't have the 1000s of cycles to swap in a page.

If I can figure out how much data I can store within the available working set, I can offer the user the ability to deploy his/her knowledge of what's important to keep the number of entries within limit-x.

While typically other processes distort any such measurement of resident set, available space, etc, in this narrow case, I can insist that our process be the only one on the box (except for the OS stuff itself).  So it's valid to say that if I can allocate xMB w/o swap, that's what I want to use.

-----Burton

> .. store enough data so that I don't swap ..
If "I" here means *you* (as programmer or the program itself), then you make the wropng descission/assumtion:
Again: the OS/kermel desides when and how to swap in/out, not the process!

> I can insist that our process be the only one on the box
ok, if no kernel around this process, then that's ok. But ...

> .. (except for the OS stuff itself).
that's the culprit.
Even a system doing nothing except "idleing around" and switching to your process task ometinmes may decide to swap when the task is idle for xxx seconds/minutes/hours/days/etc.. That's what my "silly question" is about: how can you enshure that this will not happen?
I believe that your algorithm is correct, but you need to enshure that the process (or parts of it) cannot be swapped, and that's a root task (if even possible).
AH: I know all these things ... 20+ years of coding.  On OSes you may never of heard of unless you're another grey beard.  Back before Jolt cola was invented and when real coders lived within 64K w/ overlays... And coded MFT mods for the SHARE tapes (which I've done).  Trust me, I really do understand how a VM works.

My point - assumption - is that a unix box running no applications reaches a steady state.  Sure there's lots of stuff that perturbs it (cron, logrotate, et al), but over say 5m or 10m, it is still a pretty steady state.  Definition of perturbation is that the system returns to that state pretty reliabily.  And, sure, if something changes that steady state, I'm in trouble.

But assuming that steady state, if I can increase my resident set from 48MB to say 128MB w/o starting the system swapping, that's much memory is pretty much going to be available to me at all times.

It's not a simple question - that's why I came here after a couple of weeks of research.  I know that...

BUT THAT IS THE QUESTION I ASKED...

Please try to answer my question... "I'm looking for a way to determine how much memory is available to a process before it [the system] begins to use swap."

TIA!

-----Burton
SOLUTION
Link to home
membership
This solution is only available to members.
To access this solution, you must be a member of Experts Exchange.
Start Free Trial
rusage() returns the ulimit setting, e.g.

$ ulimit -m
unlimited

This is (1) advisory (2) valued only if ulimit set it and (3) not related to the actual RSS the kernel will permit.

All this does is limit what you can request be locked via madvise().

Same kinds of things for getrlimit().

As I've said, I've read vmstat et al - all they do it sscanf() from the /proc files.  

Problem with the /proc files is that they report what has happened.  Not capabilities.  With /proc, the best you can do is see to compare values over time.

at t-30 I was using 30MB and there was 0 swap in use.  At t I'm using 45MB and there is 10MB of swap in use.  So I can probably use about 35MB w/o paging.  That's what I had coded before asking the question...


BUZZ... not quite, but thanks for playing.

-----Burton
Well this seems to have petered out...  I'm disappointed, but not surprised, as a lot of research didn't find a simple answer myself.

Anyway, gang, thanks for playing.

Both NovaDenizen and ahoffmann had some interesting thoughts.  Upon reflection mlock() probably would work - but I would have to allocate before I shed root and manage the memory using my own malloc() clone.

So I'll be splitting points 60/40

Thanks!

-----Burton