Allocate Memory Interleaved between NUMA Nodes

Note

To enable this feature, define the TBB_PREVIEW_NUMA_ALLOCATION macro as 1 before including the oneTBB headers. When the feature is available and enabled, the feature-test macro TBB_HAS_NUMA_ALLOCATION is defined.

Description

A well-known method to improve performance on NUMA systems is to interleave memory between several NUMA nodes. There are two parameters that control the interleaving: the set of NUMA nodes across which memory is allocated and the chunk size used for interleaving. The first parameter allows users to select a subset of NUMA nodes, which may be desirable if a parallel algorithm uses only part of the available NUMA nodes. The second parameter controls the granularity of interleaving, which may be desirable to optimize for specific access patterns.
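The effect of the two parameters can be pictured with a small model. The sketch below is not oneTBB code: node_for_offset is a hypothetical helper, plain int stands in for the node ID type, and it assumes that chunks are assigned to the listed nodes in round-robin order. The actual placement is decided by the library and the operating system.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of oneTBB): returns the node expected to hold
// the byte at `offset`, assuming chunks of `bytes_per_chunk` bytes are
// assigned to `nodes` in round-robin order.
int node_for_offset(std::size_t offset,
                    const std::vector<int>& nodes,
                    std::size_t bytes_per_chunk) {
    assert(!nodes.empty() && bytes_per_chunk != 0);
    const std::size_t chunk_index = offset / bytes_per_chunk;
    return nodes[chunk_index % nodes.size()];
}
```

Under this model, with nodes {0, 2} and a 4096-byte chunk, bytes 0 to 4095 map to node 0, bytes 4096 to 8191 map to node 2, and the pattern then repeats.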

Allocated memory is not split or cached. It is released immediately upon deallocation.

On Linux*, the API uses the libnuma library, which must be available at runtime. If the library is not available, the allocation functions fall back to standard memory allocation. On Windows*, the API uses functionality available starting from Microsoft* Windows* 10 / Microsoft* Windows* Server 2016; on older versions of Microsoft* Windows*, the allocation functions also fall back to standard memory allocation.

Note

By default, the Docker environment blocks the move_pages system call, which is used for interleaved memory allocation. For allocation to succeed, this system call must be unblocked.

API

Synopsis

namespace oneapi {
    namespace tbb {
        inline void* allocate_numa_interleaved(size_t bytes,
                                               const std::vector<tbb::numa_node_id>& nodes,
                                               size_t bytes_per_chunk = 0);

        inline void* allocate_numa_interleaved(size_t bytes, size_t bytes_per_chunk = 0);

        inline void deallocate_numa_interleaved(void* ptr, size_t bytes);
    } // namespace tbb
} // namespace oneapi

Functions

void *allocate_numa_interleaved(size_t bytes, const std::vector<tbb::numa_node_id> &nodes, size_t bytes_per_chunk = 0)

Returns: A pointer to the allocated memory interleaved between the specified NUMA nodes in chunks of bytes_per_chunk. In case of allocation failure or invalid arguments, returns nullptr.

Requirements: bytes must be non-zero, nodes must not be empty, and bytes_per_chunk must be a multiple of the system page size.

If nodes contains a NUMA node ID more than once, each occurrence participates independently in the interleaving order. This allows flexible load balancing between nodes. If bytes_per_chunk is zero, the system page size is used. The allocated memory contains zeros and is aligned to the system page size.
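To illustrate how repeated node IDs can act as weights, the following sketch estimates how many bytes each node receives. It is a model under stated assumptions, not the library's implementation: bytes_per_node is a hypothetical function, plain int stands in for the node ID type, the page size is passed in explicitly, and chunks are assumed to follow the order of nodes in round-robin fashion.

```cpp
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical model (not oneTBB code): estimates the number of bytes placed
// on each node, assuming chunks follow `nodes` in round-robin order.
// Returns an empty map if the documented requirements are not met.
std::map<int, std::size_t> bytes_per_node(std::size_t bytes,
                                          const std::vector<int>& nodes,
                                          std::size_t bytes_per_chunk,
                                          std::size_t page_size) {
    std::map<int, std::size_t> share;
    // Mirror the requirements: non-zero size, non-empty node list,
    // chunk size that is a multiple of the page size.
    if (bytes == 0 || nodes.empty() || page_size == 0 ||
        bytes_per_chunk % page_size != 0) {
        return share;
    }
    // A zero chunk size selects the system page size.
    const std::size_t chunk = bytes_per_chunk == 0 ? page_size : bytes_per_chunk;
    std::size_t slot = 0;
    for (std::size_t offset = 0; offset < bytes; offset += chunk) {
        const std::size_t remaining = bytes - offset;
        share[nodes[slot]] += remaining < chunk ? remaining : chunk;
        slot = (slot + 1) % nodes.size();  // duplicates get extra turns
    }
    return share;
}
```

For example, under this model a 12-page allocation with nodes {0, 0, 1} and the default chunk size places 8 pages on node 0 and 4 pages on node 1, so listing a node twice gives it twice the share.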

void *allocate_numa_interleaved(size_t bytes, size_t bytes_per_chunk = 0)

Same as the overload above, but allocates memory interleaved across all available NUMA nodes.

void deallocate_numa_interleaved(void *ptr, size_t bytes)

Deallocates memory allocated by allocate_numa_interleaved.

Requirements: ptr must be a pointer returned by allocate_numa_interleaved that has not yet been deallocated, and bytes must equal the value passed to the corresponding allocation call. Otherwise, the behavior is undefined.
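Because deallocation needs the original size, one possible approach is to keep the pointer and the size together in an owning object. The sketch below is a hypothetical wrapper, not part of oneTBB. So that it builds without the library, it uses stand-in functions demo_allocate and demo_deallocate (assumed names) that follow the signatures from the synopsis; in real code they would be replaced with oneapi::tbb::allocate_numa_interleaved and oneapi::tbb::deallocate_numa_interleaved.

```cpp
#include <cstddef>
#include <cstdlib>
#include <utility>

// Stand-ins with the same shape as the documented functions (assumption:
// used only so the sketch compiles without oneTBB).
inline void* demo_allocate(std::size_t bytes, std::size_t /*bytes_per_chunk*/ = 0) {
    return bytes ? std::calloc(1, bytes) : nullptr;  // zero-filled, like the real API
}
inline void demo_deallocate(void* ptr, std::size_t /*bytes*/) { std::free(ptr); }

// Hypothetical RAII owner: remembers the size so that deallocation always
// receives the same value that was used for allocation.
class interleaved_buffer {
public:
    explicit interleaved_buffer(std::size_t bytes)
        : ptr_(demo_allocate(bytes)), bytes_(ptr_ ? bytes : 0) {}
    ~interleaved_buffer() { reset(); }

    interleaved_buffer(const interleaved_buffer&) = delete;
    interleaved_buffer& operator=(const interleaved_buffer&) = delete;

    interleaved_buffer(interleaved_buffer&& other) noexcept
        : ptr_(std::exchange(other.ptr_, nullptr)),
          bytes_(std::exchange(other.bytes_, 0)) {}
    interleaved_buffer& operator=(interleaved_buffer&& other) noexcept {
        if (this != &other) {
            reset();
            ptr_ = std::exchange(other.ptr_, nullptr);
            bytes_ = std::exchange(other.bytes_, 0);
        }
        return *this;
    }

    void* data() const { return ptr_; }        // nullptr if allocation failed
    std::size_t size() const { return bytes_; }

private:
    void reset() {
        if (ptr_) {
            demo_deallocate(ptr_, bytes_);
            ptr_ = nullptr;
            bytes_ = 0;
        }
    }
    void* ptr_;
    std::size_t bytes_;
};
```

With such a wrapper, each successful allocation is released exactly once with the matching size, and copying is disabled so that two owners can never free the same pointer.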

Examples

The code below provides a simple example with direct use of the allocated memory as a NUMA-interleaved array.

#define TBB_PREVIEW_NUMA_ALLOCATION 1

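#include <oneapi/tbb/numa_allocation.h>
#include <oneapi/tbb/parallel_for.h>

#include <cstddef>

int main() {
    const std::size_t array_size = 10LLU * 1024 * 1024;
    // Interleave the array across all available NUMA nodes, one page per chunk.
    double* ptr = static_cast<double*>(
        oneapi::tbb::allocate_numa_interleaved(array_size * sizeof(double)));
    if (!ptr)
        return -1;
    oneapi::tbb::parallel_for(std::size_t(0), array_size, [=](std::size_t i) {
        ptr[i] = static_cast<double>(i);
    });

    // The size must match the value passed to allocate_numa_interleaved().
    oneapi::tbb::deallocate_numa_interleaved(ptr, array_size * sizeof(double));
}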

In the following example, interleaved memory is wrapped in tbb::memory_pool. This amortizes allocation overhead and makes it possible to construct a container that uses interleaved NUMA memory.

#define TBB_PREVIEW_MEMORY_POOL 1
#define TBB_PREVIEW_NUMA_ALLOCATION 1

#include <oneapi/tbb/numa_allocation.h>
#include <oneapi/tbb/memory_pool.h>
#include <oneapi/tbb/parallel_for.h>

#include <array>
#include <cstddef>
#include <vector>

class numa_interleaved_provider {
    // Assumes a 4 KiB system page size.
    static constexpr std::size_t page_size = 4 * 1024;
public:
    // Guarantee that each allocation is a multiple of the system page size,
    // so allocate_numa_interleaved() requirements are satisfied.
    typedef std::array<char, page_size> value_type;
    numa_interleaved_provider() {}
    // Like std::allocator<T>::allocate, these functions expect the number of
    // objects of the same size as sizeof(value_type).
    void* allocate(std::size_t num_of_objects) {
        return oneapi::tbb::allocate_numa_interleaved(num_of_objects * sizeof(value_type));
    }
    void deallocate(void* ptr, std::size_t num_of_objects) {
        oneapi::tbb::deallocate_numa_interleaved(ptr, num_of_objects * sizeof(value_type));
    }
};

int main() {
    // The memory pool requests memory in big chunks, slices them internally,
    // and caches memory, so it may improve performance for many small
    // allocations and for scenarios with object reuse.
    oneapi::tbb::memory_pool<numa_interleaved_provider> pool;

    oneapi::tbb::parallel_for(std::size_t(0), std::size_t(1024*1024), [&pool](std::size_t) {
        // Temporary arrays allocated from the pool will reside in different
        // NUMA domains for better overall memory throughput.
        // As the pool caches the memory, on average it is faster than
        // allocate_numa_interleaved()/deallocate_numa_interleaved().
        double* ptr = (double*)pool.malloc(10*1000*sizeof(double));
        // ...
        pool.free(ptr);
    });

    // std::vector uses interleaved NUMA memory
    using pool_allocator_t = oneapi::tbb::memory_pool_allocator<double>;
    std::vector<double, pool_allocator_t> values(pool_allocator_t{pool});
    values.push_back(3.14);
}