Nicolas van Kempen, Emery D. Berger
In this work, we revisit the Reconsidering Custom Memory Allocation paper[^1]. That widely-cited paper (selected as a Most Influential Paper[^2]) demonstrated that custom memory allocators often do not provide significant performance benefits over a good general-purpose memory allocator.
That paper compared the original programs (with custom allocators) against versions that used general-purpose allocators (either directly or via a "shim" that redirected individual object allocations to `malloc`/`free`). However, their approach started with a clean-slate heap (no prior allocations); this does not account for long-running applications, where heap fragmentation is inevitable. We show that, for fragmented heaps, custom memory allocators can significantly outperform modern general-purpose allocators.
To simulate the effect of long-running execution on the heap, we present a novel methodology we call littering. Littering allows us to measure the benefits of custom allocators without unfairly favoring `malloc`/`free`. Unlike starting from a clean-slate heap, littering simulates the conditions of a long-running program, where fragmentation is inevitable. Littering works by first running a short pre-conditioning step before starting the program, allocating and freeing objects.
More precisely, littering works in two phases:
- Detection: We run the program as usual, but with a detector library (`LD_PRELOAD=libdetector.so <program>`) that keeps track of every allocated object's size, binning sizes to obtain a rough size distribution of the objects used by the program; it produces this distribution as output to be consumed in the littering phase. The detector also tracks several allocation statistics, including the mean, min/max, and, most importantly, `MaxLiveAllocations`. As an example, here is the size class distribution recorded by the detector for `boxed-sim`.
- Littering: With the statistics and histogram in hand, we next run littering (`LD_PRELOAD=liblitterer.so <program>`). Littering allocates `LITTER_MULTIPLIER * MaxLiveAllocations` objects following the recorded size distribution, (optionally) shuffles the addresses of allocated objects, and frees a fraction `1 - LITTER_OCCUPANCY` of them. We use the same `malloc` as the program, so the program starts with the heap in a littered state. There are a few tunable parameters, which can be provided as environment variables:
  - `LITTER_SEED`: Seed to be used for the random number generator. Random by default.
  - `LITTER_OCCUPANCY`: Fraction of objects to keep on the heap. Value must be between 0 and 1.
  - `LITTER_NO_SHUFFLE`: Set to 1 to disable shuffling and free the last allocated objects first.
  - `LITTER_SLEEP`: Sleep the given number of seconds after littering, but before starting the program. Disabled by default.
  - `LITTER_MULTIPLIER`: Multiplier on the number of objects to allocate. Default is 20.
The diagram below illustrates the effect of littering on the heap. With a fresh heap, the allocator is usually able to pack allocations into contiguous memory, yielding much better locality and cache performance throughout the program. Fragmentation, however, whether natural or induced by littering, forces the allocator to return non-contiguous addresses for successive allocations, which tends to degrade performance.
All data was obtained on a Thinkpad P15s Gen 2 with an Intel i7-1165G7 processor. All results below were obtained with `LITTER_MULTIPLIER = 1`, varying the occupancy. A multiplier of 1 is approximately equivalent to letting the program run once, keeping a random fraction of the objects it allocated on the heap, and running it again.
We evaluate littering on the four benchmarks from the original paper we were able to obtain and run.
| Benchmark | Custom allocation method | Description | Input |
|---|---|---|---|
| `197.parser` | custom | English sentence parser from SPEC2000 | `ref.in` |
| `boxed-sim` | freelist | Polyhedral approximated balls bouncing in a box | `-n 31 -s 311317` |
| `mudlle` | region | MUD interpreter | `time.mud` |
| `175.vpr` | region | FPGA placement & routing from SPEC2000 | train placement |
Notice that because we run on modern hardware, a number of these benchmarks complete very quickly. When possible, we increase the input size so benchmarks run longer, allowing us to provide more accurate measurements. We were not able to increase mudlle's input; that benchmark typically takes less than a second.
- `197.parser`: This benchmark uses a custom stack-like allocator: it initially allocates a large chunk of memory, then uses a bump pointer for each `malloc` call. If an object is freed, it is marked, and the pointer moves back to the end of the last unmarked object. In the shim, we replace `xalloc` with the usual `malloc`.
- `boxed-sim`: This benchmark uses a free list for each type of object. In the shim, we simply disable the free lists and replace the checks for whether an object is reusable with a standard `malloc`.
- `mudlle`: A typical arena allocator. Citing `calloc-heap.c`: "Based on the concept of blocks: you allocate memory from a block, and can free the whole block at one go. Individual deallocations are not possible." We shim this system by simply keeping allocated objects on a linked list, and calling `free` on each object in the list when a block is freed.
- `175.vpr`: A mix between a standard arena allocator and a "stack-like" allocator similar to the one used in `197.parser`. "Chunks" are allocated, with allocations going into the current chunk (or a new one is created). Objects can be freed all at once by saving the chunk pointer before starting bulk allocations and passing it back later.
We present a series of graphs for each benchmark, using the following versions of glibc (the Linux default allocator), jemalloc, and mimalloc.

| Allocator | Version |
|---|---|
| glibc | 2.31 |
| jemalloc | 5.3.0 |
| mimalloc | 2.0.6 |
For each benchmark, the elapsed-time graph is normalized to the elapsed time of the original program (running with its custom allocator) without littering. We also generate graphs separating events counted with `perf`. We assign events to inside or outside of `malloc`/`free` based on which shared object (library) each sample comes from (e.g., `libjemalloc.so`).
We run each benchmark without littering, with both the custom allocation enabled and the shim.
| Benchmark | custom | glibc malloc | jemalloc | mimalloc |
|---|---|---|---|---|
| `197.parser` | 37.62s | 49.97s (1.33x) | 43.94s (1.17x) | 40.41s (1.07x) |
| `boxed-sim` | 17.14s | 17.78s (1.04x) | 17.19s (1.00x) | 17.95s (1.05x) |
| `mudlle` | 0.0900s | 0.1226s (1.36x) | 0.1022s (1.14x) | 0.0960s (1.07x) |
| `175.vpr` | 33.30s | 33.86s (1.02x) | 33.27s (1.00x) | 35.23s (1.06x) |
[^1]: Emery D. Berger, Benjamin G. Zorn, and Kathryn S. McKinley. 2002. Reconsidering custom memory allocation. In Proceedings of the 17th ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications (OOPSLA '02). Association for Computing Machinery, New York, NY, USA, 1–12. https://doi.org/10.1145/582419.582421
[^2]: Most Influential OOPSLA Paper Award. https://www.sigplan.org/Awards/OOPSLA