Visualizing Heap Stress Under Linux

June 2009

Contents

Introduction

Inside the Linux Heap Allocator

Instrumenting the Allocator

Overhead

Interpreting the Results

Past and Future

Attachments


Introduction

Every manufacturer of an embedded device wants to know how much memory should be included in the device when it ships. Memory isn't free, and embedded devices often have tight cost margins. Too much memory is a waste, but too little will result in poor performance. Since the end user of an embedded device typically can't upgrade its memory, the decision has to be made once and for all at the time of manufacture.

Heap allocation and deallocation under Linux is an extremely dynamic process: even during periods of apparent idleness the heap allocator is busy responding to allocation requests. Allocation is also a highly “crafted” process, developed and refined in layers over many years and implementing a host of interlocking memory management strategies. The resulting system is complex and difficult to summarize in an easily understood form. This article introduces a simple, low-overhead method of visualizing one aspect of that system, an aspect we refer to as heap stress.

Inside the Linux Heap Allocator

One simple and quite intuitive way to judge the level of pressure the heap allocator is feeling is to instrument some of the stages it goes through in its efforts to satisfy allocation requests. As of this writing Linux kernel version 2.6.30 has been released, and the heap allocator lives in the file mm/page_alloc.c. Unlike many Linux kernel source files, this one is fairly well endowed with useful comments. About a third of the way through is a function named __alloc_pages_internal(). The header comment associated with it says simply: “This is the 'heart' of the zoned buddy allocator.”

Here it is that all run-time heap allocation requests are ultimately satisfied. It's not a very large or complicated function, but a detailed analysis of it is not the purpose of this article. Instead, step back and look at the larger structure of the function. It may not be immediately obvious, but a fair description of it might read something like this:

"For each desired memory page, __alloc_pages_internal() calls get_page_from_freelist(). If this succeeds the allocated page is returned. If it fails, the allocator may loosen its allocation criteria or initiate strategies to free up currently allocated pages, and then try again. It keeps doing this until it succeeds or ultimately fails."

The heap isn't just a homogeneous loaf of memory from which pages are carved like slices of bread – true to its “crafted” nature, the heap has a good deal of internal structure, and the allocator consults a number of criteria which control what parts and how much of the heap may be parceled out under given circumstances. The very first call on get_page_from_freelist() occurs almost at the top of the function, and it looks like this:

page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order, zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET);

This is the default set of arguments passed to get_page_from_freelist(). For the purposes of this article, the first and last arguments are of most interest. The first, gfp_mask, is a set of “action modifiers” – control bits which implement some of the allocator's memory management strategies. A list of these modifier bits may be found in linux/include/linux/gfp.h. This value is partly passed in, but it is also partly overridden by __alloc_pages_internal(), even in this first call.

The last argument is alloc_flags. This too is a set of flag bits, defined locally, which explicitly specify acceptable watermark levels and allocation “effort”. All of these control bits, plus other information such as the memory “zone” from which allocation must be performed, determine which memory pages may be allocated and which may not.
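For reference, here are the alloc_flags bits as we read them in the 2.6.30 mm/page_alloc.c. This is an abridged copy, worth re-checking against your own kernel tree before relying on the exact values:

/* Abridged from 2.6.30's mm/page_alloc.c: the locally defined
 * alloc_flags bits consulted by get_page_from_freelist(). */
#define ALLOC_NO_WATERMARKS  0x01  /* don't check watermarks at all */
#define ALLOC_WMARK_MIN      0x02  /* use pages_min watermark */
#define ALLOC_WMARK_LOW      0x04  /* use pages_low watermark */
#define ALLOC_WMARK_HIGH     0x08  /* use pages_high watermark */
#define ALLOC_HARDER         0x10  /* try to alloc harder */
#define ALLOC_HIGH           0x20  /* __GFP_HIGH set */
#define ALLOC_CPUSET         0x40  /* check for correct cpuset */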

By default, the criteria are fairly tight, yet even so, the requested memory page can often be obtained. If so, the function uses a goto to jump directly to got_pg, where the successfully allocated page is returned to the requester. But if the page cannot be allocated, the allocator does two things:

- It wakes up the kswap daemon, which begins scavenging for unused pages that can be freed back to the heap.

- It again calls get_page_from_freelist(), this time passing a looser set of allocation criteria, as determined by adjustments to the control bits.

If this allocation request fails, the allocator loosens its criteria still further and tries a third time. If that attempt fails, it continues on with a fourth and a fifth until, in the final extremity, it has to give up and return failure. A quick search for calls on get_page_from_freelist() reveals five different references, each time with successively looser allocation criteria, and each time followed by a jump to got_pg if the allocation attempt succeeds.
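To make that ladder concrete, here is a rough map of the five call sites as we read the 2.6.30 source. The details are simplified, and the flags attributed to each level are our reading rather than gospel:

/*
 * The five get_page_from_freelist() call sites in 2.6.30's
 * __alloc_pages_internal(), simplified:
 *
 *   1. ALLOC_WMARK_LOW|ALLOC_CPUSET   -- the default criteria
 *   2. ALLOC_WMARK_MIN, plus ALLOC_HIGH/ALLOC_HARDER for callers
 *      that qualify                   -- after waking kswapd
 *   3. ALLOC_NO_WATERMARKS            -- only for callers allowed
 *      to ignore watermarks entirely (e.g. PF_MEMALLOC)
 *   4. the level-2 flags again, retried after synchronous direct
 *      reclaim via try_to_free_pages()
 *   5. ALLOC_WMARK_HIGH|ALLOC_CPUSET  -- a last-chance retest on
 *      the path that otherwise invokes the OOM killer
 */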

Instrumenting the Allocator

We can take advantage of the heap allocator's architecture to install instrumentation tracking the depth to which the allocator must descend before an individual allocation request is satisfied. To do this, we declare a set of global counters:

unsigned long long heap_alloc_count[6];

The first five of these counters are incremented immediately before each jump to got_pg, recording the levels at which allocation succeeded. For completeness, the sixth counter is incremented immediately after the nopage label, which is where the flow of control arrives after the ultimate allocation failure.
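The changes to page_alloc.c then amount to a single increment at each of those points. A minimal sketch, assuming the 2.6.30 layout (the attached page_alloc.c.frag shows the complete version):

/* Near the top of mm/page_alloc.c.  Plain increments can race on
 * SMP; for a rough stress gauge we accept an occasional lost count
 * rather than pay for atomic operations on every allocation. */
unsigned long long heap_alloc_count[6];

/* At each of the five success points, e.g. the first: */
    page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask,
            order, zonelist, high_zoneidx,
            ALLOC_WMARK_LOW|ALLOC_CPUSET);
    if (page) {
        heap_alloc_count[0]++;      /* level-1 success */
        goto got_pg;
    }

/* ...heap_alloc_count[1] through [4] at the remaining four call
 * sites, and finally: */
nopage:
    heap_alloc_count[5]++;          /* outright failure */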

We now have an ongoing record of the heap allocator's activities, expressed as a count of allocation successes at each level. We will finish the job by adding a new virtual file to the proc filesystem, /proc/allocinfo. Reading this file will return a string containing six numeric values on a single line:

4373 516 3 0 0 0

These represent the values of the six heap_alloc_count counters. Reading /proc/allocinfo not only returns these counter values, it also zeros them, making it possible for a tester to acquire allocation information over a particular time-span of interest: simply read the virtual file immediately before performing the desired test function, perform it, and then read the virtual file immediately afterward.
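The attached allocinfo.c contains the actual implementation; the sketch below shows the idea using the 2.6.30-era read_proc convention. The function and variable names here are ours for illustration and need not match the attachment:

#include <linux/proc_fs.h>
#include <linux/init.h>

extern unsigned long long heap_alloc_count[6];  /* in mm/page_alloc.c */

/* Format the six counters on one line, then zero them so the next
 * read covers only the interval since this one. */
static int allocinfo_read_proc(char *page, char **start, off_t off,
                               int count, int *eof, void *data)
{
    int i, len;

    len = sprintf(page, "%llu %llu %llu %llu %llu %llu\n",
                  heap_alloc_count[0], heap_alloc_count[1],
                  heap_alloc_count[2], heap_alloc_count[3],
                  heap_alloc_count[4], heap_alloc_count[5]);
    for (i = 0; i < 6; i++)
        heap_alloc_count[i] = 0;
    *eof = 1;
    return len;
}

static int __init proc_allocinfo_init(void)
{
    create_proc_read_entry("allocinfo", 0, NULL,
                           allocinfo_read_proc, NULL);
    return 0;
}
module_init(proc_allocinfo_init);

Note that the read-and-zero semantics make reads destructive: two programs polling /proc/allocinfo at the same time will steal counts from one another, so a test setup should have exactly one reader.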

Overhead

It would be nice to present a completely passive measurement system, one whose use imposed no cost, but unfortunately the information we want is not recorded in any other form. Our recording method is inexpensive, demanding no more than the incrementing of a single in-memory 64-bit integer counter per allocation request. It isn't zero-cost, but using it, even on 32-bit hardware, shouldn't greatly distort the behaviour of any but the most extremely CPU-bound systems.

Interpreting the Results

It is probably impossible to characterize any system as complex as the Linux heap allocator in only six integers, and the method described in this article does not claim to do so. What we do claim is that, with a little practice, it's surprising how effectively these “level of allocation success” counts manage to convey a kind of high-level gestalt. Have a look at the numbers you get during operations typical for the platform you are studying. Get a feel for what normal operation looks like, then try stressing the system and see how the numbers change. One interesting data point is the set of allocations seen during system boot.

Below is a set of descriptions, one for each of the six values in an allocinfo report. They are intended to help a user interpret such reports, but it is important to remember that these descriptions are themselves interpretations, drastic and incomplete summaries of complicated states:

1. Count of allocation requests which were satisfied using default criteria. This is the ideal allocation transaction.

2. Count of allocation requests which were satisfied after one failure. Because of that failure the kswap daemon was awakened to begin background scavenging of unused or low-use pages for return to the heap, and allocation criteria were lowered according to the needs of the caller. This is still normal operation, but a high ratio of level 2 to level 1 allocations may suggest that the allocator is eating into its reserves too fast for the kswap daemon to keep up.

3. Count of allocation requests which were satisfied only after allocation criteria were set to their lowest levels. Allocations satisfied at this level suggest a heap which is under considerable stress, drawing deeply from its reserves to meet the caller's needs.

4. Count of allocation requests which were satisfied only by synchronous scavenging, overriding the kswap daemon's normal background scavenging operations. Allocation requests which couldn't wait have already failed, and even the more patient requesters were delayed. The heap allocator is under extreme stress, on the verge of failure.

5. Count of allocation requests which were satisfied only because at the last moment some other process, also starved of memory, finally died and its memory resources were released back to the heap. This is a last-chance, unlikely-to-succeed desperation measure.

6. Count of allocation failures. The heap allocator has been unable, by any means at its disposal, to supply the required memory. This doesn't necessarily indicate a system crash, as the kernel and some key drivers maintain private memory “pools” against just this possibility, but a Linux system which is seeing heap allocation failures is in a very bad way. This indicates the ultimate level of pressure on the heap allocator.

Past and Future

Obviously the instrumentation described in this article is wholly dependent on the structure of the heap allocator. As the allocator's design changes over time, it may not be possible to instrument hypothetical future allocators in exactly the same ways as we have instrumented the current version. The number of “success” points may change, leading to more or fewer than six values – in the most extreme case, the allocator may be completely redesigned in some way which does not lend itself to this kind of evaluation.

Much more likely, and harder to evaluate, the interpretation of the various allocation criteria may change from version to version, leading to subtle but possibly quite significant differences in allocator behaviour – and hence, to the interpretation of our instrumentation of that behaviour. For all of these reasons, we do not recommend this measurement method for comparison across different versions of the heap allocator without careful study of those differences, and a clear understanding of their consequences.

Nevertheless, and admitting all of these perfectly valid qualifiers, this method of recording and representing heap stress still seems useful, inexpensive and easily installed on a variety of different kernels. EQware has certainly used this method on other 2.6 kernels, and indeed it would have been possible to do something very much like this even in 2.4 kernels. It seems probable that the heap allocator will maintain its basic structure for some time to come, and that this “heap stress” instrumentation will continue to have value.

Attachments

Attached below are three files containing sections of code which together implement the instrumentation described above.

- page_alloc.c.frag

From the 2.6.30 version of linux/mm/page_alloc.c, this file contains only the __alloc_pages_internal() function.

- allocinfo.c

A new file, intended to go into linux/fs/proc/allocinfo.c, providing the /proc/allocinfo interface.

- Makefile

The Makefile from linux/fs/proc, modified so that our new allocinfo.c is compiled and linked into the kernel.





Copyright ©2009 by EQWARE Engineering Inc.