Allocate Memory Interleaved between NUMA Nodes
Note
To enable this feature, set the TBB_PREVIEW_NUMA_ALLOCATION macro to 1. When available and enabled,
the feature-test macro TBB_HAS_NUMA_ALLOCATION is defined.
Description
A well-known method to improve performance on NUMA systems is to interleave memory between several NUMA nodes. There are two parameters that control the interleaving: the set of NUMA nodes across which memory is allocated and the chunk size used for interleaving. The first parameter allows users to select a subset of NUMA nodes, which may be desirable if a parallel algorithm uses only part of the available NUMA nodes. The second parameter controls the granularity of interleaving, which may be desirable to optimize for specific access patterns.
Allocated memory is not split or cached; it is returned to the system immediately upon deallocation.
Under Linux*, the API uses the libnuma library, which must be available at runtime. If the library is not
available, the allocation functions fall back to standard memory allocation. On Windows*, the API uses
functionality available starting from Microsoft* Windows* 10 / Microsoft* Windows* Server 2016; on older
versions of Microsoft* Windows*, the allocation functions also fall back to standard memory allocation.
Note
By default, Docker blocks the move_pages system call, which is used for interleaved memory
allocation. For allocation to succeed, this syscall must be unblocked.
API
Header
#define TBB_PREVIEW_NUMA_ALLOCATION 1
#include <oneapi/tbb/numa_allocation.h>
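The note at the top of this page says that TBB_HAS_NUMA_ALLOCATION is defined when the feature is available and enabled. The following is a minimal sketch of guarding calls on that macro, so the same source also builds where the preview feature is missing. The helper names allocate_buffer and free_buffer are hypothetical, and the std::calloc fallback is an assumption made for this sketch, not part of oneTBB.

```cpp
// Request the preview feature before including the header.
#define TBB_PREVIEW_NUMA_ALLOCATION 1

// __has_include (C++17) lets this sketch compile even where the header is absent.
#if defined(__has_include)
#  if __has_include(<oneapi/tbb/numa_allocation.h>)
#    include <oneapi/tbb/numa_allocation.h>
#  endif
#endif

#include <cstddef>
#include <cstdlib>

// Hypothetical helper: returns zero-initialized memory, NUMA-interleaved
// when the feature-test macro reports that the preview API is usable.
inline void* allocate_buffer(std::size_t bytes) {
#if defined(TBB_HAS_NUMA_ALLOCATION)
    return oneapi::tbb::allocate_numa_interleaved(bytes);
#else
    return std::calloc(1, bytes);  // plain zeroed memory, no interleaving
#endif
}

// Hypothetical helper: releases memory obtained from allocate_buffer().
// `bytes` must be the value passed to allocate_buffer().
inline void free_buffer(void* ptr, std::size_t bytes) {
#if defined(TBB_HAS_NUMA_ALLOCATION)
    oneapi::tbb::deallocate_numa_interleaved(ptr, bytes);
#else
    (void)bytes;  // the standard allocator does not need the size
    std::free(ptr);
#endif
}
```

Both branches hand back zeroed memory, so callers can treat the result the same way regardless of which path was compiled in.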
Synopsis
namespace oneapi {
namespace tbb {
    inline void* allocate_numa_interleaved(size_t bytes,
                                           const std::vector<tbb::numa_node_id>& nodes,
                                           size_t bytes_per_chunk = 0);
    inline void* allocate_numa_interleaved(size_t bytes, size_t bytes_per_chunk = 0);
    inline void deallocate_numa_interleaved(void* ptr, size_t bytes);
} // namespace tbb
} // namespace oneapi
Functions
-
void *allocate_numa_interleaved(size_t bytes, const std::vector<tbb::numa_node_id> &nodes, size_t bytes_per_chunk = 0)
Returns: A pointer to the allocated memory, interleaved between the specified NUMA nodes in chunks of bytes_per_chunk. In case of allocation failure or invalid arguments, returns nullptr.
Requirements: bytes must be non-zero, nodes must not be empty, and bytes_per_chunk must be a multiple of the system page size.
If nodes contains a NUMA node ID more than once, each occurrence independently participates in the interleaving order, which allows flexible load balancing between nodes. If bytes_per_chunk is zero, the system page size is used. The allocated memory contains zeros and is aligned to the system page size.
-
void *allocate_numa_interleaved(size_t bytes, size_t bytes_per_chunk = 0)
Same as the above, but allocates memory interleaved across all available NUMA nodes.
-
void deallocate_numa_interleaved(void *ptr, size_t bytes)
Deallocates memory allocated by allocate_numa_interleaved.
Requirements: ptr must have been allocated by allocate_numa_interleaved and not yet deallocated, and bytes must equal the size passed to the corresponding allocation call. Otherwise, the behavior is undefined.
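To illustrate how repeated node IDs can act as weights, here is a small model of the chunk-to-node mapping. It assumes chunks are handed to the entries of nodes in round-robin order; the text above only says that each occurrence participates independently, so the exact order is an assumption. The names node_id, node_for_offset, and chunks_on_node are hypothetical, and the code does not call oneTBB.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for tbb::numa_node_id so the sketch compiles without oneTBB.
using node_id = int;

// Returns the node that would back the byte at `offset`, ASSUMING chunks are
// assigned to the entries of `nodes` in round-robin order.
inline node_id node_for_offset(std::size_t offset,
                               const std::vector<node_id>& nodes,
                               std::size_t bytes_per_chunk) {
    assert(!nodes.empty() && bytes_per_chunk != 0);
    const std::size_t chunk_index = offset / bytes_per_chunk;
    return nodes[chunk_index % nodes.size()];
}

// Counts how many of the chunks covering `bytes` land on `target`
// under the same round-robin assumption.
inline std::size_t chunks_on_node(std::size_t bytes,
                                  const std::vector<node_id>& nodes,
                                  std::size_t bytes_per_chunk,
                                  node_id target) {
    std::size_t count = 0;
    for (std::size_t off = 0; off < bytes; off += bytes_per_chunk) {
        if (node_for_offset(off, nodes, bytes_per_chunk) == target)
            ++count;
    }
    return count;
}
```

Under this model, passing nodes = {0, 0, 1} places two of every three chunks on node 0, which is one way to bias memory toward a node that does more of the work.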
Examples
The code below provides a simple example with direct use of the allocated memory as a NUMA-interleaved array.
#define TBB_PREVIEW_NUMA_ALLOCATION 1
#include <oneapi/tbb/numa_allocation.h>
#include <oneapi/tbb/parallel_for.h>
#include <cstddef>

int main() {
    const std::size_t array_size = 10LLU * 1024 * 1024;
    // Interleave the array across all available NUMA nodes, one page per chunk.
    double* ptr = static_cast<double*>(
        oneapi::tbb::allocate_numa_interleaved(array_size * sizeof(double)));
    if (!ptr)
        return -1;
    oneapi::tbb::parallel_for(std::size_t(0), array_size, [=](std::size_t i) {
        ptr[i] = static_cast<double>(i);
    });
    // The size must match the one passed to the allocation call.
    oneapi::tbb::deallocate_numa_interleaved(ptr, array_size * sizeof(double));
}
In the following example, interleaved memory is wrapped in tbb::memory_pool. This amortizes
allocation overhead and makes it possible to construct a container that uses interleaved NUMA memory.
#define TBB_PREVIEW_MEMORY_POOL 1
#define TBB_PREVIEW_NUMA_ALLOCATION 1
#include <oneapi/tbb/numa_allocation.h>
#include <oneapi/tbb/memory_pool.h>
#include <oneapi/tbb/parallel_for.h>
#include <array>
#include <cstddef>
#include <vector>

class numa_interleaved_provider {
    static constexpr std::size_t page_size = 4 * 1024;
public:
    // Guarantee that each allocation is a multiple of the system page size,
    // so allocate_numa_interleaved() requirements are satisfied.
    typedef std::array<char, page_size> value_type;

    numa_interleaved_provider() {}

    // Like std::allocator<T>::allocate, these functions take a number of
    // objects, each of size sizeof(value_type).
    void* allocate(std::size_t num_of_objects) {
        return oneapi::tbb::allocate_numa_interleaved(num_of_objects * sizeof(value_type));
    }
    void deallocate(void* ptr, std::size_t num_of_objects) {
        oneapi::tbb::deallocate_numa_interleaved(ptr, num_of_objects * sizeof(value_type));
    }
};

int main() {
    // The memory pool requests memory in big chunks, slices them internally,
    // and caches freed memory, so it may improve performance for many small
    // allocations and for scenarios that reuse objects.
    oneapi::tbb::memory_pool<numa_interleaved_provider> pool;

    oneapi::tbb::parallel_for(0, 1024*1024, [&pool](std::size_t) {
        // Temporary arrays allocated from the pool will reside in different
        // NUMA domains for better overall memory throughput.
        // As the pool caches the memory, on average it is faster than
        // allocate_numa_interleaved()/deallocate_numa_interleaved().
        double* ptr = (double*)pool.malloc(10*1000*sizeof(double));
        // ...
        pool.free(ptr);
    });

    // std::vector uses interleaved NUMA memory
    using pool_allocator_t = oneapi::tbb::memory_pool_allocator<double>;
    std::vector<double, pool_allocator_t> values(pool_allocator_t{pool});
    values.push_back(3.14);
}
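The provider above hard-codes a 4 KiB page size, which does not hold on every platform. When choosing a non-default bytes_per_chunk, the page size can be queried at run time and the chunk size rounded up to a multiple of it. The sketch below assumes POSIX sysconf or the Win32 GetSystemInfo call; the helper names system_page_size and round_up_to_page are hypothetical.

```cpp
#include <cstddef>
#if defined(_WIN32)
#  include <windows.h>
#else
#  include <unistd.h>
#endif

// Queries the page size from the operating system.
inline std::size_t system_page_size() {
#if defined(_WIN32)
    SYSTEM_INFO info;
    GetSystemInfo(&info);
    return static_cast<std::size_t>(info.dwPageSize);
#else
    const long size = sysconf(_SC_PAGESIZE);
    // Assumption: fall back to 4 KiB if the query fails.
    return size > 0 ? static_cast<std::size_t>(size) : 4096;
#endif
}

// Rounds `bytes` up to the nearest multiple of `page` (page must be non-zero).
inline std::size_t round_up_to_page(std::size_t bytes, std::size_t page) {
    return (bytes + page - 1) / page * page;
}
```

For example, round_up_to_page(desired_chunk, system_page_size()) yields a value that meets the multiple-of-page-size requirement for bytes_per_chunk. A result of zero keeps the documented default, which is the system page size.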