by Barry Tannenbaum
by Barry Tannenbaum
(This is a follow-up to our earlier post on multicore storage allocation.)
While working with an early Cilk++ adopter, it quickly became apparent that the default memory allocator shipped with the Windows C Run-Time Library can be a bottleneck in a multithreaded application. The Windows memory allocator has a single lock which it uses to serialize access to its internal structures. While this is safe, it proved to cause a serious loss of parallelism in the customer’s application.
There are lots of memory allocators available, both open source and commercially. We looked at using a number of them didn't find any that had the combination of features we were looking for. Ultimately we chose to write our own. Since the customer's program was a Windows application, we initially wrote concentrated on the Windows implementation.
We decided to build the new allocator inspired by the principles in Hoard: A Scalable Memory Allocator for Multithreaded Applications by Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe and Paul R. Wilson. Hoard is designed to minimize lock contention between threads, false sharing and memory drift. Detailed information about Hoard can be found at the Hoard website.
We named our new memory allocator "Miser" to play on its relationship to Hoard. We have implemented Miser for both Linux and Windows, and this post describes data structures common to both, several Windows-specific implementation details, and several Linux-specific notes at the end.
The basic unit of allocation in Miser is a "superblock." A superblock holds a header, an array to hold the size of each requested block of memory and an array of same-sized memory blocks, or bins, which can be used to satisfy memory requests. The bins are powers of 2; 8 bytes through 256 bytes. Studies have shown that this will satisfy over 98% of memory requests. Requests for larger blocks are forwarded to the C Run-Time Library. The array of bins is placed against the back of the superblock, so memory blocks returned to the user are always naturally aligned.
Miser, like Hoard, features a private pool of memory for each thread. Because the pool is thread-private, it can be accessed without locking. A pool is made up of superblocks which are kept in a roughly sorted order based on how much space is available in each superblock. The use of the per-thread pools is fundamental to improving application performance:
The global pool is not owned by any thread. If a per-thread pool cannot satisfy a request, it will take a superblock from the global pool. If the global pool is empty, Miser will allocate more memory from the operating system. As memory is released to a per-thread pool, it will contribute superblocks to the global pool to prevent memory drift.
While the features inspired by Hoard provided a good foundation, we had the following additional requirements for Miser:
The format for a Windows module, either an executable or Dynamic-Link Library (DLL) is defined by the Windows Portable Executable File Format. Almost every Windows module imports functionality from some other module. These dependencies are described by a pair of tables; the Import Name Table and the Import Address Table. When a module is loaded into memory, the loader will "snap" the addresses in the Import Address Table to the entrypoint in the destination module. All calls to that function from the module will be made indirectly through this cell.
When Miser is loaded into an application, it will go through each of the modules in the application and redirect the entries in the Import Address Table for each of the C Run-Time Library memory allocation functions to its own entrypoint. This technique was first described by Matt Pietrek in Peering Inside the PE: A Tour of the Win32 Portable Executable File Format. Since the modification of the IAT entry is a 32-bit write, it’s an atomic operation; the IAT entry is either the C Run-Time Library address or the Miser address.
Once Miser has been loaded into a process, it can never be unloaded. While Miser could restore the original values of the C Run-Time Library functions in the IATs, the C Run-Time Library functions cannot handle memory allocated by Miser.
When Miser loads into the application, the application may have already allocated memory. This is no way for Miser to replace those blocks with blocks from its own resources. Since Miser redirects all calls to free()
and _msize()
to itself, it must be able to recognize memory that has been allocated by the C Run-Time Library and pass those calls to the appropriate version of the C Run-Time Library.
The Windows OS provides a number of user-mode APIs to manage memory. One of the most fundamental is VirtualAlloc()
which allocates memory in 64K blocks on 64K boundaries. Miser will subdivide these into 4K superblocks, each of which starts on a 4K boundary. Not coincidentally, this is the page size for the 32-bit x86 Windows OS. The first thing in each superblock is a header which contains a "magic number" and other control information. So by masking off the low 12 bits of a memory address, we should find the header for the superblock containing the memory block. There are two checks to validate a superblock.
If either of these tests fails, Miser will pass the call off to the Windows C Run-Time Library.
An additional benefit of being able to detect memory allocated by the C Run-Time Library and pass the request on is that Miser can pass any requests for more than 256 bytes off to the C Run-Time Library.
Each module loaded into an application can be built against a different version of the C Run-Time Library. In addition to the side-by-side support built into the OS, recent versions of the C Run-Time Library have had their version number in the DLL name. Miser supports all versions of the C Run-Time Library since Visual Studio .NET 2002. Since each version of the C Run-Time Library has its own heap, Miser keeps track of which version of the C Run-Time Library it has hooked and will forward calls to the appropriate version.
Thread-safety and speed are intimately tied. The more the code needs to take out a lock, the greater the chance that it will have to wait for some other thread. For most operations, Miser does not need to lock; it can usually satisfy a request from resources already available in the per-thread pool. Miser must take out a lock in the following circumstances:
_msize()
for a block allocated by another thread - The _msize()
function allows a Windows application to query the C Run-Time Library about the size of a block of memory. The value returned is the size in bytes that was specified when the block was allocated. If the block was allocated by Miser in the current thread, we can safely access the information in a superblock owned by the per-thread pool. However if the superblock is owned by another thread’s per-thread pool, then we must lock the per-thread pool before we attempt to validate the superblock against its superblock array. _msize()
call, the array of superblocks must be locked before it is updated. One important thing to note is that while Miser supports Cilk++ well, it is not tied in any way to the Cilk++ compiler or runtime. It should serve equally well for a multithreaded application using any of the threading technologies available.
The following is an example of performance gains achievable with Miser, from an earlier blog post (Multicore-enabling the N-Queens Problem Using Cilk++).
Although Miser was initially written for Windows, since it performed better than Hoard in some of our tests, it was ported to Linux. The back-end code remained basically identical except for various system calls (e.g., changing calls to VirtualAlloc()
to mmap()
, etc.).
On the front-end, the mechanism for accessing the functions was rewritten. The Miser library exports malloc()
, free()
, calloc()
, etc., and linking a binary with it will cause the loader to look for it at runtime. Additionally, preloading it by settingLD_LIBRARY_PATH
causes objects to find Miser's allocators before those of the C runtime library. Miser still uses the systemmalloc()
for large requests and it finds them in the default library using dlsym()
.
In Linux, if Miser is loaded after a module has resolved the location of allocator functions, Miser does not hook itself in. You can still access Miser's malloc()
using dlsym()
and getting "malloc" from the library but, as in Windows, memory allocated with Miser must be freed with Miser. The system free()
doesn't know what to do with Miser memory.
转自http://software.intel.com/en-us/articles/miser-a-dynamically-loadable-memory-allocator-for-multi-threaded-applications/