Radix Sort By Key#
radix_sort_by_key Function Templates#
The radix_sort_by_key function sorts keys using the radix sort algorithm, applying the same order to the corresponding values.
The sorting is stable, preserving the relative order of elements with equal keys.
Both in-place and out-of-place overloads are provided. Out-of-place overloads do not alter the input sequences.
The functions implement a variant of the Onesweep* 1 algorithm.
Note
radix_sort_by_key is currently available for Intel® Arc™ B-Series and Intel® Data Center GPU Max Series.
The Intel® oneAPI DPC++/C++ Compiler 2025.1.0 or greater is required, and the Unified Runtime adapter over
Level Zero must be used. This is the default adapter for Intel GPUs. For more information, please refer to
DPC++ Device Selection.
A synopsis of the radix_sort_by_key function is provided below:
// defined in <oneapi/dpl/experimental/kernel_templates>
namespace oneapi::dpl::experimental::kt::gpu {
// Sort in-place
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
typename KernelParam, typename Iterator1, typename Iterator2>
sycl::event
radix_sort_by_key (sycl::queue q, Iterator1 keys_first, Iterator1 keys_last,
Iterator2 values_first, KernelParam param = {}); // (1)
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
typename KernelParam, typename KeysRng, typename ValuesRng>
sycl::event
radix_sort_by_key (sycl::queue q, KeysRng&& keys,
ValuesRng&& values, KernelParam param = {}); // (2)
// Sort out-of-place
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
typename KernelParam, typename KeysIterator1,
typename ValuesIterator1, typename KeysIterator2,
typename ValuesIterator2>
sycl::event
radix_sort_by_key (sycl::queue q, KeysIterator1 keys_first,
KeysIterator1 keys_last, ValuesIterator1 values_first,
KeysIterator2 keys_out_first, ValuesIterator2 values_out_first,
KernelParam param = {}); // (3)
template <bool IsAscending = true, std::uint8_t RadixBits = 8,
typename KernelParam, typename KeysRng1, typename ValuesRng1,
typename KeysRng2, typename ValuesRng2>
sycl::event
radix_sort_by_key (sycl::queue q, KeysRng1&& keys, ValuesRng1&& values,
KeysRng2&& keys_out, ValuesRng2&& values_out,
KernelParam param = {}); // (4)
}
Template Parameters#
| Name | Description |
|---|---|
| IsAscending | The sort order. Ascending: true; descending: false. |
| RadixBits | The number of bits to sort for each radix sort algorithm pass. |
Parameters#
| Name | Description |
|---|---|
| q | The SYCL* queue where kernels are submitted. |
| keys_first, keys_last, values_first, keys_out_first, values_out_first (overloads 1, 3); keys, values, keys_out, values_out (overloads 2, 4) | The sequences to apply the algorithm to. The usage examples below pass sycl::buffer objects; other sequence types are supported as well. |
| param | A kernel_param object. |
Type Requirements:
The element type of the sequence(s) to sort must be a C++ integral or floating-point type other than bool, with a width of up to 64 bits.
Note
Current limitations:
- The number of elements to sort must not exceed 2^30.
- RadixBits can only be 8.
- param.workgroup_size can be 512 or 1024.
Return Value#
A sycl::event object representing the status of the algorithm execution.
Usage Examples#
In-Place Example#
// possible build and run commands:
// icpx -fsycl radix_sort_by_key.cpp -o radix_sort_by_key -I /path/to/oneDPL/include && ./radix_sort_by_key
#include <cstdint>
#include <iostream>
#include <sycl/sycl.hpp>
#include <oneapi/dpl/experimental/kernel_templates>
namespace kt = oneapi::dpl::experimental::kt;
int main()
{
std::size_t n = 6;
sycl::queue q{sycl::gpu_selector_v};
sycl::buffer<std::uint32_t> keys{sycl::range<1>(n)};
sycl::buffer<char> values{sycl::range<1>(n)};
// initialize
{
sycl::host_accessor k_acc{keys, sycl::write_only};
k_acc[0] = 3, k_acc[1] = 2, k_acc[2] = 1, k_acc[3] = 5, k_acc[4] = 3, k_acc[5] = 3;
sycl::host_accessor v_acc{values, sycl::write_only};
v_acc[0] = 'r', v_acc[1] = 'o', v_acc[2] = 's', v_acc[3] = 'd', v_acc[4] = 't', v_acc[5] = 'e';
}
// sort
auto e = kt::gpu::radix_sort_by_key<true, 8>(q, keys, values, kt::kernel_param<10, 512>{}); // (2)
e.wait();
// print
{
sycl::host_accessor k_acc{keys, sycl::read_only};
for(std::size_t i = 0; i < n; ++i)
std::cout << k_acc[i] << ' ';
std::cout << '\n';
sycl::host_accessor v_acc{values, sycl::read_only};
for(std::size_t i = 0; i < n; ++i)
std::cout << v_acc[i] << ' ';
std::cout << '\n';
}
return 0;
}
Output:
1 2 3 3 3 5
s o r t e d
Out-of-Place Example#
// possible build and run commands:
// icpx -fsycl radix_sort_by_key.cpp -o radix_sort_by_key -I /path/to/oneDPL/include && ./radix_sort_by_key
#include <cstdint>
#include <iostream>
#include <sycl/sycl.hpp>
#include <oneapi/dpl/experimental/kernel_templates>
namespace kt = oneapi::dpl::experimental::kt;
int main()
{
std::size_t n = 6;
sycl::queue q{sycl::gpu_selector_v};
sycl::buffer<std::uint32_t> keys{sycl::range<1>(n)};
sycl::buffer<std::uint32_t> keys_out{sycl::range<1>(n)};
sycl::buffer<char> values{sycl::range<1>(n)};
sycl::buffer<char> values_out{sycl::range<1>(n)};
// initialize
{
sycl::host_accessor k_acc{keys, sycl::write_only};
k_acc[0] = 3, k_acc[1] = 2, k_acc[2] = 1, k_acc[3] = 5, k_acc[4] = 3, k_acc[5] = 3;
sycl::host_accessor v_acc{values, sycl::write_only};
v_acc[0] = 'r', v_acc[1] = 'o', v_acc[2] = 's', v_acc[3] = 'd', v_acc[4] = 't', v_acc[5] = 'e';
}
// sort
auto e = kt::gpu::radix_sort_by_key<true, 8>(q, keys, values, keys_out, values_out,
kt::kernel_param<10, 512>{}); // (4)
e.wait();
// print
{
sycl::host_accessor k_acc{keys, sycl::read_only};
for(std::size_t i = 0; i < n; ++i)
std::cout << k_acc[i] << ' ';
std::cout << '\n';
sycl::host_accessor v_acc{values, sycl::read_only};
for(std::size_t i = 0; i < n; ++i)
std::cout << v_acc[i] << ' ';
std::cout << "\n\n";
sycl::host_accessor k_out_acc{keys_out, sycl::read_only};
for(std::size_t i = 0; i < n; ++i)
std::cout << k_out_acc[i] << ' ';
std::cout << '\n';
sycl::host_accessor v_out_acc{values_out, sycl::read_only};
for(std::size_t i = 0; i < n; ++i)
std::cout << v_out_acc[i] << ' ';
std::cout << '\n';
}
return 0;
}
Output:
3 2 1 5 3 3
r o s d t e
1 2 3 3 3 5
s o r t e d
Memory Requirements#
The algorithm uses global and local device memory (see SYCL 2020 Specification)
for intermediate data storage. For the algorithm to operate correctly, there must be enough memory on the device.
If there is not enough global device memory, a std::bad_alloc exception is thrown.
The behavior is undefined if there is not enough local memory.
The amount of memory that is required depends on input data and configuration parameters, as described below.
Global Memory Requirements#
Global memory is used for copying the input sequence(s) and storing internal data such as radix value counters. The amount used depends on many parameters; below is an upper bound approximation:
Nkeys + Nvalues + C * Nkeys
where the sequence with keys takes Nkeys space, the sequence with values takes Nvalues space, and the additional space is C * Nkeys.
The value of C depends on param.data_per_workitem, param.workgroup_size, and RadixBits.
For param.data_per_workitem set to 10, param.workgroup_size to 512, and RadixBits to 8,
C is typically less than 1.
Doubling either param.data_per_workitem or param.workgroup_size leads to a halving of C.
Local Memory Requirements#
Local memory is used for reordering key-value pairs within a work-group, and for storing internal data such as radix value counters. The amount used depends on many parameters; below is an upper bound approximation:
Nkeys_per_workgroup + Nvalues_per_workgroup + C
where Nkeys_per_workgroup and Nvalues_per_workgroup are the amounts of memory to store keys and values, respectively. C is some additional space for storing internal data.
- Nkeys_per_workgroup equals sizeof(key_type) * param.data_per_workitem * param.workgroup_size.
- Nvalues_per_workgroup equals sizeof(value_type) * param.data_per_workitem * param.workgroup_size.
- C does not exceed 4KB.
Recommended Settings for Best Performance#
The general advice is to choose kernel parameters based on performance measurements and profiling information. The initial configuration may be selected according to these high-level guidelines:
- When the number of elements to sort N is small (e.g., less than ~1M), utilizing all available compute cores may improve performance. Experiment with creating enough work chunks to feed all Xe-cores 2 on a GPU: param.data_per_workitem * param.workgroup_size ≈ N / xe_core_count, in addition to the next configuration. The optimal settings may differ between hardware, depending on the cost of inter-work-group synchronization.
- When the number of elements to sort is large (e.g., more than ~1M), maximizing the number of elements processed by a work-group, which equals param.data_per_workitem * param.workgroup_size, reduces synchronization overheads between work-groups and usually benefits the overall performance.
The following table provides starting points for param.data_per_workitem and param.workgroup_size
for large input sizes. This configuration performs well when sorting 2^28 std::uint32_t keys and
values uniformly distributed over the range [0, UINT32_MAX]:

| Platform | param.data_per_workitem | param.workgroup_size |
|---|---|---|
| Intel Data Center GPU Max | 14 | 1024 |
| Intel Arc B-Series | 5 | 512 |
When tuning your own parameters, these param.data_per_workitem values are a good starting point and can be treated as upper bounds to test during experimentation. For smaller inputs, a lower param.data_per_workitem generally performs better.
Tip
Optimal parameters may differ by the data type and the entropy of the underlying data, so it is important that this is reflected in your tuning experiments.
- Avoid setting too large param.data_per_workitem and param.workgroup_size values by ensuring that the Memory Requirements are satisfied.
- Maximizing param.data_per_workitem generally improves performance, as long as private memory usage does not exceed the available register capacity.
- Large performance drops as param.data_per_workitem increases indicate excessive register spilling. Ahead-of-Time (AOT) Compilation may be used to emit warnings when this occurs.
- 1
Andy Adinets and Duane Merrill (2022). Onesweep: A Faster Least Significant Digit Radix Sort for GPUs. https://arxiv.org/abs/2206.01784.
- 2
The Xe-core term is described in the oneAPI GPU Optimization Guide. Check the number of cores in the device specification, such as Intel Arc B580 specification.