Radix Sort#

radix_sort Function Templates#

The radix_sort function sorts data using the radix sort algorithm. The sorting is stable, preserving the relative order of equal elements. Both in-place and out-of-place overloads are provided. Out-of-place overloads do not alter the input sequence.

The functions implement a Onesweep* 1 algorithm variant.

Note

radix_sort is currently available for Intel® Arc™ B-Series and Intel® Data Center GPU Max Series. The Intel® oneAPI DPC++/C++ Compiler 2025.1.0 or greater is required, and the Unified Runtime adapter over Level Zero must be used (which is the default adapter for Intel GPUs). For more information, please refer to DPC++ Device Selection.

A synopsis of the radix_sort function is provided below:

// defined in <oneapi/dpl/experimental/kernel_templates>

namespace oneapi::dpl::experimental::kt::gpu {

// Sort in-place
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
          typename KernelParam, typename Iterator>
sycl::event
radix_sort (sycl::queue q, Iterator first, Iterator last,
            KernelParam param = {}); // (1)

template <bool IsAscending = true, std::uint8_t RadixBits = 8,
          typename KernelParam, typename Range>
sycl::event
radix_sort (sycl::queue q, Range&& r, KernelParam param = {}); // (2)

// Sort out-of-place
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
          typename KernelParam, typename Iterator1,
          typename Iterator2>
sycl::event
radix_sort (sycl::queue q, Iterator1 first, Iterator1 last,
            Iterator2 first_out, KernelParam param = {}); // (3)

template <bool IsAscending = true, std::uint8_t RadixBits = 8,
          typename KernelParam, typename Range1, typename Range2>
sycl::event
radix_sort (sycl::queue q, Range1&& r, Range2&& r_out,
            KernelParam param = {}); // (4)
}

Template Parameters#

Name

Description

bool IsAscending

The sort order. Ascending: true; Descending: false.

std::uint8_t RadixBits

The number of bits to sort for each radix sort algorithm pass.

Parameters#

Name

Description

q

The SYCL* queue where kernels are submitted.

  • first, last (1),

  • r (2),

  • first, last, first_out (3),

  • r, r_out (4).

The sequences to apply the algorithm to. Supported sequence types:

param

A kernel_param object.

Type Requirements:

  • The element type of sequence(s) to sort must be a C++ integral or floating-point type other than bool with a width of up to 64 bits.

Note

Current limitations:

  • Number of elements to sort must not exceed 230.

  • RadixBits can only be 8.

  • param.workgroup_size can be 512 or 1024.

Return Value#

A sycl::event object representing the status of the algorithm execution.

Usage Examples#

In-Place Example#

// possible build and run commands:
//    icpx -fsycl radix_sort.cpp -o radix_sort -I /path/to/oneDPL/include && ./radix_sort

#include <cstdint>
#include <iostream>
#include <sycl/sycl.hpp>

#include <oneapi/dpl/experimental/kernel_templates>

namespace kt = oneapi::dpl::experimental::kt;

int main()
{
   std::size_t n = 6;
   sycl::queue q{sycl::gpu_selector_v};
   std::uint32_t* keys = sycl::malloc_shared<std::uint32_t>(n, q);

   // initialize
   keys[0] = 3, keys[1] = 2, keys[2] = 1, keys[3] = 5, keys[4] = 3, keys[5] = 3;

   // sort
   auto e = kt::gpu::radix_sort<false, 8>(q, keys, keys + n, kt::kernel_param<10, 512>{}); // (1)
   e.wait();

   // print
   for(std::size_t i = 0; i < n; ++i)
      std::cout << keys[i] << ' ';
   std::cout << '\n';

   sycl::free(keys, q);
   return 0;
}

Output:

5 3 3 3 2 1

Out-of-Place Example#

// possible build and run commands:
//    icpx -fsycl radix_sort.cpp -o radix_sort -I /path/to/oneDPL/include && ./radix_sort

#include <cstdint>
#include <iostream>
#include <sycl/sycl.hpp>

#include <oneapi/dpl/experimental/kernel_templates>

namespace kt = oneapi::dpl::experimental::kt;

int main()
{
   std::size_t n = 6;
   sycl::queue q{sycl::gpu_selector_v};
   std::uint32_t* keys = sycl::malloc_shared<std::uint32_t>(n, q);
   std::uint32_t* keys_out = sycl::malloc_shared<std::uint32_t>(n, q);

   // initialize
   keys[0] = 3, keys[1] = 2, keys[2] = 1, keys[3] = 5, keys[4] = 3, keys[5] = 3;

   // sort
   auto e = kt::gpu::radix_sort<false, 8>(q, keys, keys + n, keys_out, kt::kernel_param<10, 512>{}); // (3)
   e.wait();

   // print
   for(std::size_t i = 0; i < n; ++i)
      std::cout << keys[i] << ' ';
   std::cout << '\n';
   for(std::size_t i = 0; i < n; ++i)
      std::cout << keys_out[i] << ' ';
   std::cout << '\n';

   sycl::free(keys, q);
   sycl::free(keys_out, q);
   return 0;
}

Output:

3 2 1 5 3 3
5 3 3 3 2 1

Memory Requirements#

The algorithm uses global and local device memory (see SYCL 2020 Specification) for intermediate data storage. For the algorithm to operate correctly, there must be enough memory on the device. If there is not enough global device memory, a std::bad_alloc exception is thrown. The behavior is undefined if there is not enough local memory. The amount of memory that is required depends on input data and configuration parameters, as described below.

Global Memory Requirements#

Global memory is used for copying the input sequence(s) and storing internal data such as radix value counters. The amount used depends on many parameters; below is an upper bound approximation:

Nkeys + C * Nkeys

where the sequence with keys takes Nkeys space, and the additional space is C * Nkeys.

The value of C depends on param.data_per_workitem, param.workgroup_size and RadixBits. For param.data_per_workitem set to 10, param.workgroup_size to 512, and RadixBits to 8, C is typically less than 1. Doubling either param.data_per_workitem or param.workgroup_size leads to a halving of C.

Local Memory Requirements#

Local memory is used for reordering keys within a work-group, and for storing internal data such as radix value counters. The amount used depends on many parameters; below is an upper bound approximation:

Nkeys_per_workgroup + C

where Nkeys_per_workgroup is the amount of memory to store keys. C is some additional space for storing internal data.

Nkeys_per_workgroup is equal to sizeof(key_type) * param.data_per_workitem * param.workgroup_size, C does not exceed 4KB.