Release docs 0.7 (#693)
Co-authored-by: Nathaniel Wesley Filardo <VP331RHQ115POU58JFRLKB7OPA0L18E3@cmx.ietfng.org>
mjp41 and nwf authored Nov 28, 2024
1 parent cd91793 commit 4272126
Showing 5 changed files with 2,845 additions and 23 deletions.
13 changes: 8 additions & 5 deletions .gitignore
# conventional build dirs
/release*/
/debug*/
/build*/
/cmake-build-*/
/out*/

/tidy.fail

# cmake intermediate files
/CMakeFiles/

# vscode dirs
.vscode/
31 changes: 13 additions & 18 deletions docs/combininglock.md
Then it can notify (`HEAD`) the newly joined thread that it can execute the remaining
Finally, it must notify the last element it ran the lambda for that its work is `DONE`.
The `DONE` notification must be done after the read of `next`, as signalling `DONE` can cause the memory for the node to be freed from the stack.
## Full Implementation
The [actual implementation](../src/snmalloc/ds/combininglock.h) in the `snmalloc` codebase is a little more complex as it uses a fast path lock for the uncontended case.
This uses fewer atomic operations in the common case where the lock is not contended,
and is a standard technique for queue locks.
The implementation is also optionally integrated with OS primitives for waiting, such as `futex` or `waitonaddress`.
This means that, if enabled, a thread will spin for a while and then back off to the OS waiting primitive.
This leads to better tail latency in the case where the lock is highly contended.
We observed that the OS waiting primitive can be negatively impacted by virtualization, so we provide a flag to disable it: setting `SNMALLOC_ENABLE_WAIT_ON_ADDRESS` to `OFF` disables the OS waiting primitive.
The PR adding this feature is [#685](https://github.com/microsoft/snmalloc/pull/685/).
## Performance Results
We used a machine with 72 hardware threads.
The benchmark causes all the threads to synchronise on starting their first allocation.
This means all 72 threads are contending on the lock at the same time to get their allocator initialised.
We ran this benchmark with a standard spin-lock (0.7.0-spin) and with the new combining lock (0.7.0):
![Combining lock performance](combininglockperf.svg)
As we can see from the benchmark, the combining lock is significantly faster than the spin lock in this highly contended scenario, taking only 60% of the time to complete the initialisation.
This benchmark is not at all representative of most scenarios; it stresses a worst-case behaviour of the system.
## Conclusion
The combining lock can be surfaced in C++ with a really simple API that just takes a lambda to execute while holding the lock.
It was really easy to integrate into the `snmalloc` codebase and has provided a significant performance improvement in a highly contended micro-benchmark.
130 changes: 130 additions & 0 deletions docs/release/0.7/README.md
# Release 0.7.0

The latest release of `snmalloc` has a few interesting features that are worth discussing in more detail.
Primarily, this release focuses on improving the performance of `snmalloc` in a few key areas.
But we have also added some features that enable new security mechanisms to be built on top of `snmalloc`.

## BatchIt

The main addition in the 0.7 release is the integration of the `BatchIt` algorithm from the paper

> [BatchIt: Optimizing Message-Passing Allocators for Producer-Consumer Workloads: An Intellectual Abstract](https://dl.acm.org/doi/10.1145/3652024.3665506)
> Nathaniel Wesley Filardo, Matthew J. Parkinson

As the title suggests, the paper considers producer-consumer workloads, wherein producer threads allocate memory for messages that get passed to consumer threads, which free their received messages.
For workloads that do significant amounts of such message passing, the `snmalloc` implementation prior to this release could be sub-optimal.
In particular, the old algorithm could suffer many cache misses while returning the deallocated messages to the producer threads' allocators.

BatchIt proposes adding a small consumer-side (that is, deallocator-side) cache, allowing messages destined for the same slab of memory to be batched together.
This results in smaller message queues within `snmalloc` and gives much better cache locality when handling messages.

We developed [a micro-benchmark](../../../src/test/perf/msgpass/msgpass.cc) that simulates a producer-consumer workload with back-pressure sending a fixed number of messages per producer.
We then measure the time taken to process all the messages with different numbers of producer and consumer threads.
`msgpass-1` has a single producer and a single consumer, `msgpass-2` has two producers and two consumers, and so on.

![Graph of BatchIt performance](./snmalloc-msgpass.svg)

The results show a significant improvement on the producer-consumer workload.
As the number of threads increases, the cache becomes less effective, because each producer can send to all the other consumers;
in the `msgpass-8` case, each of the 8 producers can talk to each of the 8 consumers.

The [paper](https://dl.acm.org/doi/10.1145/3652024.3665506) contains many more results; we have just given you a taste of the improvement here.

## Start-up performance

Benchmarking by a potential customer showed that `snmalloc` was slower than some other allocators when starting up with a lot of threads.
During start-up, we use a lock to ensure that certain tasks are only performed once.
However, when starting a lot of threads, this lock can become a bottleneck.

To address this, we analysed what was being done while holding the lock.
We found several things that were causing more time to be spent inside the lock than necessary.
Overall, we improved the start-up time of `snmalloc` in high-thread-count scenarios as follows:

We have a particularly tough benchmark for testing [startup time](../../../src/test/perf/startup/startup.cc).
We used a machine with 72 hardware threads.
The benchmark causes all the threads to synchronise on starting their first allocation.
This means all 72 threads are contending on the lock at the same time to get their allocator initialised.
The results are shown in the graph below.

![Performance graph for startup times](./perf-startup.svg)

Here 0.6.2 is the previous release of `snmalloc`, and 0.7 is the current release.
We use `spin` to mean that the combining lock is not using OS level waiting, but is spinning instead.
We use `sec` to mean that `snmalloc` has been compiled with the security checks enabled.

The results show that the 0.7 release is significantly faster than the 0.6.2 release.
The improvements are smaller in the `sec` case as there are more interactions with the OS to set up disjoint address spaces for the meta-data and the object-data.
The benchmarks were run on an Azure VM with 72 hardware threads; virtualization seems to make the `futex` system call costly, which is why the `spin` version is faster.

The rest of this section details some improvements to get those results.

### Combining Lock

The most interesting feature was the combining lock.
This uses ideas from the Flat Combining work to provide a C++ lock that can be used to reduce the number of cache misses during lock contention.
You can read more about that in [combininglock.md](../combininglock.md).

### DO_DUMP and DONT_DUMP

To understand what each part of memory is used for, `snmalloc` allocates a pagemap.
This is normally 1/1024 of the address space, and is very sparsely populated.
If we get a core dump while running snmalloc, and the platform does not compress the core dump, then the pagemap can be very large.
To address this, `snmalloc` judiciously uses `madvise` to tell the kernel that it does not need to dump the pagemap.
However, we were also applying this to other structures that were not as large.

We found that the additional `madvise` calls were taking a noticeable amount of time.
To address this, we refactored the use of `DO_DUMP` and `DONT_DUMP` to apply only to the pagemap, and not to the other, smaller over-allocations.

See [#665](https://github.com/microsoft/snmalloc/pull/665).

### Lazy initialization of the buddy allocator

The backend of `snmalloc` uses a buddy allocator to manage the large ranges of memory.
This stores power-of-two-sized and -aligned blocks of memory, and consolidates them when possible.
To reduce the number of system calls, `snmalloc` typically requests a large range of memory from the OS.
However, we found that faulting in the pages for the buddy allocator was taking a noticeable amount of time.
This led us to refactor the buddy allocator to lazily initialize the structures it needs, reducing the number of initial faults.

See [#665](https://github.com/microsoft/snmalloc/pull/665).

### Miscellaneous

There were other small changes that were made to reduce the number of times the lock had to be held, e.g. [#639](https://github.com/microsoft/snmalloc/pull/639).


## Custom meta-data

We have been designing a new feature in `snmalloc` on top of which new security features can be built.
The key idea is to allow the allocator to be built with an optional data structure that can be used to store meta-data about every allocation.
In `snmalloc` we have a pagemap that stores 16 bytes of data for each 16KiB of memory.
This stores three things for each chunk of memory:
* The size class of allocations in the chunk of memory;
* The owning allocator of the chunk of memory; and
* A pointer to additional meta-data.

The additional meta-data in `snmalloc` is information such as how many allocations are free, and various free-lists for that chunk.
The meta-data can be shared between adjacent chunks of memory, which provides us with variable-sized slabs of memory.

The additional meta-data size in `snmalloc` 0.6 was fixed, and under a cache line in most configurations.
In snmalloc 0.7, we have made this meta-data size configurable.
This allows developers to build new security features on top of snmalloc.

For instance, building snmalloc with the following definition of `Alloc` will allow you to store a 64-bit counter for each allocation:
```cpp
using Alloc = snmalloc::LocalAllocator<snmalloc::StandardConfigClientMeta<
  ArrayClientMetaDataProvider<std::atomic<size_t>>>>;
```

This does not affect the underlying alignment of the allocations.
It also only increases the size of the meta-data by the required additional meta-data size.
It does not increase the size of the pagemap.

We have built a simple example inspired by Google's `miracle_ptr` that uses this feature to provide reference counting for all allocations, but out-of-band.
See [miracle_ptr](../../../src/test/func/miracle_ptr/miracle_ptr.cc) for our current experiment.
We are still experimenting with this feature, and would love to hear your feedback.

## Conclusion

The 0.7 release addresses a few awkward performance issues in `snmalloc`, and provides an interesting platform to develop new security features on top of `snmalloc`.
Happy allocating!