Miser–A Dynamically Loadable Memory Allocator for Multi-Threaded Applications

by Barry Tannenbaum

by Barry Tannenbaum

(This is a follow-up to our earlier post on multicore storage allocation.)

 

While working with an early Cilk++ adopter, it quickly became apparent that the default memory allocator shipped with the Windows C Run-Time Library can be a bottleneck in a multithreaded application. The Windows memory allocator has a single lock which it uses to serialize access to its internal structures. While this is safe, it proved to cause a serious loss of parallelism in the customer’s application.

There are lots of memory allocators available, both open source and commercially. We looked at using a number of them didn't find any that had the combination of features we were looking for. Ultimately we chose to write our own. Since the customer's program was a Windows application, we initially wrote concentrated on the Windows implementation.

Features borrowed from Hoard

We decided to build the new allocator inspired by the principles in Hoard: A Scalable Memory Allocator for Multithreaded Applications by Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe and Paul R. Wilson. Hoard is designed to minimize lock contention between threads, false sharing and memory drift. Detailed information about Hoard can be found at the Hoard website.

We named our new memory allocator "Miser" to play on its relationship to Hoard. We have implemented Miser for both Linux and Windows, and this post describes data structures common to both, several Windows-specific implementation details, and several Linux-specific notes at the end.

Superblocks

The basic unit of allocation in Miser is a "superblock." A superblock holds a header, an array to hold the size of each requested block of memory and an array of same-sized memory blocks, or bins, which can be used to satisfy memory requests. The bins are powers of 2; 8 bytes through 256 bytes. Studies have shown that this will satisfy over 98% of memory requests. Requests for larger blocks are forwarded to the C Run-Time Library. The array of bins is placed against the back of the superblock, so memory blocks returned to the user are always naturally aligned.

Global and Per-Thread Pools

Miser, like Hoard, features a private pool of memory for each thread. Because the pool is thread-private, it can be accessed without locking. A pool is made up of superblocks which are kept in a roughly sorted order based on how much space is available in each superblock. The use of the per-thread pools is fundamental to improving application performance:

  • Per-thread pools allow Miser to satisfy most memory requests without locking.
  • Per-thread pools limit false-sharing. Unless a block is passed from one thread to another, all use of the memory accessed through a cache-line should be from one thread.

The global pool is not owned by any thread. If a per-thread pool cannot satisfy a request, it will take a superblock from the global pool. If the global pool is empty, Miser will allocate more memory from the operating system. As memory is released to a per-thread pool, it will contribute superblocks to the global pool to prevent memory drift.

Additional Features

While the features inspired by Hoard provided a good foundation, we had the following additional requirements for Miser:

  • It must be able to be loaded dynamically. This allows a component shipped as a shared library to use Miser instead of requiring that the application load it at startup.
  • It must be able to recognize blocks that it had not allocated, and pass those to the system C Run-Time Library.
  • It must be thread-safe and fast.
  • It must be able to simultaneously support the multiple versions of the C Run-Time Library that may be in use by an application.

Loading Miser dynamically

The format for a Windows module, either an executable or Dynamic-Link Library (DLL) is defined by the Windows Portable Executable File Format. Almost every Windows module imports functionality from some other module. These dependencies are described by a pair of tables; the Import Name Table and the Import Address Table. When a module is loaded into memory, the loader will "snap" the addresses in the Import Address Table to the entrypoint in the destination module. All calls to that function from the module will be made indirectly through this cell.

When Miser is loaded into an application, it will go through each of the modules in the application and redirect the entries in the Import Address Table for each of the C Run-Time Library memory allocation functions to its own entrypoint. This technique was first described by Matt Pietrek in Peering Inside the PE: A Tour of the Win32 Portable Executable File Format. Since the modification of the IAT entry is a 32-bit write, it’s an atomic operation; the IAT entry is either the C Run-Time Library address or the Miser address.

Once Miser has been loaded into a process, it can never be unloaded. While Miser could restore the original values of the C Run-Time Library functions in the IATs, the C Run-Time Library functions cannot handle memory allocated by Miser.

Recognizing blocks not allocated by Miser

When Miser loads into the application, the application may have already allocated memory. This is no way for Miser to replace those blocks with blocks from its own resources. Since Miser redirects all calls to free() and _msize() to itself, it must be able to recognize memory that has been allocated by the C Run-Time Library and pass those calls to the appropriate version of the C Run-Time Library.

The Windows OS provides a number of user-mode APIs to manage memory. One of the most fundamental is VirtualAlloc() which allocates memory in 64K blocks on 64K boundaries. Miser will subdivide these into 4K superblocks, each of which starts on a 4K boundary. Not coincidentally, this is the page size for the 32-bit x86 Windows OS. The first thing in each superblock is a header which contains a "magic number" and other control information. So by masking off the low 12 bits of a memory address, we should find the header for the superblock containing the memory block. There are two checks to validate a superblock.

  1. The superblock header contains the correct "magic number."
  2. The superblock header contains a pointer to the per-thread pool owning the superblock and an index into an array of pointers to the superblocks owned by the per-thread pool. The entry in the array should match the address of the superblock.

If either of these tests fails, Miser will pass the call off to the Windows C Run-Time Library.

An additional benefit of being able to detect memory allocated by the C Run-Time Library and pass the request on is that Miser can pass any requests for more than 256 bytes off to the C Run-Time Library.

Supporting multiple versions of the C Run-Time Library

Each module loaded into an application can be built against a different version of the C Run-Time Library. In addition to the side-by-side support built into the OS, recent versions of the C Run-Time Library have had their version number in the DLL name. Miser supports all versions of the C Run-Time Library since Visual Studio .NET 2002. Since each version of the C Run-Time Library has its own heap, Miser keeps track of which version of the C Run-Time Library it has hooked and will forward calls to the appropriate version.

Speed and thread-safety

Thread-safety and speed are intimately tied. The more the code needs to take out a lock, the greater the chance that it will have to wait for some other thread. For most operations, Miser does not need to lock; it can usually satisfy a request from resources already available in the per-thread pool. Miser must take out a lock in the following circumstances:

  1. Accessing the global pool – only one thread may be allowed to access anything in the global pool at any time. Any thread that attempts to access the global pool must lock it first, serializing access.
  2. Freeing a block allocated from another per-thread pool – Unlike Hoard, Miser does not maintain a lock in the superblock header. Blocks to be freed in some other per-thread pool are placed on that per-thread pool's remote free queue using interlocked instructions. The remote free queue is drained before attempting to allocate a block in the hopes that a freed block may satisfy the request.
  3. Supporting _msize() for a block allocated by another thread - The _msize() function allows a Windows application to query the C Run-Time Library about the size of a block of memory. The value returned is the size in bytes that was specified when the block was allocated. If the block was allocated by Miser in the current thread, we can safely access the information in a superblock owned by the per-thread pool. However if the superblock is owned by another thread’s per-thread pool, then we must lock the per-thread pool before we attempt to validate the superblock against its superblock array.
  4. Maintaining the list of superblocks – Each per-thread pool maintains an array of the superblocks that it has allocated. The header for each superblock has a pointer to the per-thread pool that owns it, as well as an index into the per-thread pool superblock array. These are used to validate that a memory address is from a block that Miser allocated. Because another thread may access the array of superblocks to satisfy an _msize() call, the array of superblocks must be locked before it is updated.

One important thing to note is that while Miser supports Cilk++ well, it is not tied in any way to the Cilk++ compiler or runtime. It should serve equally well for a multithreaded application using any of the threading technologies available.

Miser Limitations on Windows

  • Miser does not support any of the C Run-Time Library calls which allocate memory on an address boundary.
  • Miser assumes program correctness. It does not put any padding between blocks to attempt to detect writes beyond the bounds of the allocated block. Allocated blocks are butted directly against each other; Miser keeps the allocation size in a separate structure to allow it to maintain natural alignment in it's bin arrays.
  • Because of its dependence on the Import Address Table to hook into a module, Miser can only be used with modules that use the DLL forms of the C Run-Time Library. Modules that link against the static C Run-Time Library do not go through the IAT, so Miser cannot hook into them.

Performance

The following is an example of performance gains achievable with Miser, from an earlier blog post (Multicore-enabling the N-Queens Problem Using Cilk++).

Miser on Linux

Although Miser was initially written for Windows, since it performed better than Hoard in some of our tests, it was ported to Linux. The back-end code remained basically identical except for various system calls (e.g., changing calls to VirtualAlloc() to mmap(), etc.).

On the front-end, the mechanism for accessing the functions was rewritten. The Miser library exports malloc(), free(), calloc(), etc., and linking a binary with it will cause the loader to look for it at runtime. Additionally, preloading it by setting
LD_LIBRARY_PATH causes objects to find Miser's allocators before those of the C runtime library. Miser still uses the system
malloc() for large requests and it finds them in the default library using dlsym().

In Linux, if Miser is loaded after a module has resolved the location of allocator functions, Miser does not hook itself in. You can still access Miser's malloc() using dlsym() and getting "malloc" from the library but, as in Windows, memory allocated with Miser must be freed with Miser. The system free() doesn't know what to do with Miser memory.

转自http://software.intel.com/en-us/articles/miser-a-dynamically-loadable-memory-allocator-for-multi-threaded-applications/

你可能感兴趣的:(Miser–A Dynamically Loadable Memory Allocator for Multi-Threaded Applications)