Pass Data to Algorithms#
For an algorithm to access data, it is important that the used execution policy matches the data storage type. The following table shows which execution policies can be used with various data storage types.
Data Storage | Device Policies | Host Policies
---|---|---
sycl::buffer | Yes | No
Device-allocated unified shared memory (USM) | Yes | No
Shared and host-allocated USM | Yes | Yes
std::vector with a USM allocator | Yes | Yes
std::vector with data in host memory | See Use std::vector | Yes
Other data in host memory | No | Yes
When using the standard-aligned (or host) execution policies, oneDPL supports data being passed to its algorithms as specified in the C++ standard (C++17 for algorithms working with iterators, C++20 for parallel range algorithms), with known restrictions and limitations.
According to the standard, the calling code must prevent data races when using algorithms with parallel execution policies.
Note
Implementations of std::vector<bool> are not required to avoid data races for concurrent modifications of vector elements. Some implementations may pack multiple bool elements into a bitfield, which makes concurrent modification of distinct elements unsafe. For this reason, it is recommended to use std::vector<bool> only as read-only input with the standard-aligned execution policies.
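For example, with one of the standard-aligned policies provided in oneapi::dpl::execution, a std::vector<bool> can still be used safely as read-only input while results are written to a container whose elements can be modified concurrently without races. A minimal sketch (container sizes and variable names are illustrative):
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <vector>
int main(){
    // std::vector<bool> serves as read-only input; results go to std::vector<int>,
    // whose distinct elements are safe to modify concurrently
    std::vector<bool> mask(1000, true);
    std::vector<int> flags(1000);
    oneapi::dpl::transform(oneapi::dpl::execution::par_unseq,
                           mask.cbegin(), mask.cend(), flags.begin(),
                           [](bool b) { return b ? 1 : 0; });
    return 0;
}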
The following subsections describe proper ways to pass data to an algorithm invoked with a device execution policy.
Use oneapi::dpl::begin and oneapi::dpl::end Functions#
oneapi::dpl::begin
and oneapi::dpl::end
are special helper functions that
allow you to pass SYCL buffers to parallel algorithms. These functions accept
a SYCL buffer and return an object of an unspecified type that provides the following API:
It satisfies the CopyConstructible and CopyAssignable C++ named requirements and is comparable with operator== and operator!=.
It supports the following valid expressions: a + n, a - n, and a - b, where a and b are objects of the type and n is an integer value. The effect of those operations is the same as for a type that satisfies the LegacyRandomAccessIterator C++ named requirement.
It provides the get_buffer method, which returns the buffer passed to the begin and end functions.
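For instance, the returned objects support the listed iterator-like arithmetic and expose the underlying buffer. A minimal sketch (buffer size and variable names are illustrative):
sycl::buffer<int> buf{sycl::range<1>(1000)};
auto first = oneapi::dpl::begin(buf);
auto last = oneapi::dpl::end(buf);
auto n = last - first;            // number of elements in the buffer
auto mid = first + n / 2;         // same effect as for a random access iterator
auto same_buf = mid.get_buffer(); // the buffer passed to begin/end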
The begin
and end
functions can take SYCL 2020 deduction tags and sycl::no_init
as arguments
to explicitly control which access mode should be applied to a particular buffer when submitting
a SYCL kernel to a device:
sycl::buffer<int> buf{/*...*/};
auto first_ro = oneapi::dpl::begin(buf, sycl::read_only);
auto first_wo = oneapi::dpl::begin(buf, sycl::write_only, sycl::no_init);
auto first_ni = oneapi::dpl::begin(buf, sycl::no_init);
To use the functions, add #include <oneapi/dpl/iterator>
to your code. For example:
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/iterator>
#include <random>
#include <sycl/sycl.hpp>
int main(){
std::vector<int> vec(1000);
std::generate(vec.begin(), vec.end(), std::minstd_rand{});
sycl::buffer<int> buf{ vec.data(), vec.size() };
auto buf_begin = oneapi::dpl::begin(buf);
auto buf_end = oneapi::dpl::end(buf);
oneapi::dpl::sort(oneapi::dpl::execution::dpcpp_default, buf_begin, buf_end);
return 0;
}
Use std::vector#
You can use iterators to an ordinary std::vector
with data in host memory, as shown in the following example:
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <random>
#include <vector>
int main(){
std::vector<int> vec( 1000 );
std::generate(vec.begin(), vec.end(), std::minstd_rand{});
oneapi::dpl::sort(oneapi::dpl::execution::dpcpp_default, vec.begin(), vec.end());
return 0;
}
In this case a temporary SYCL buffer is created, the data is copied to this buffer, and it is processed according to the algorithm semantics. After processing on a device is complete, the modified data is copied from the temporary buffer back to the host container.
Note
For parallel range algorithms with device execution policies, the use of an ordinary std::vector is not supported.
While convenient, direct use of an ordinary std::vector
can lead to unintended copying between the host
and the device. We recommend working with SYCL buffers or with USM to reduce data copying.
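For instance, data placed in device-allocated USM can be passed to an algorithm directly through pointers. A minimal sketch (the queue is taken from the policy so the allocation and the kernels use the same device; sizes and names are illustrative):
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <sycl/sycl.hpp>
int main(){
    auto policy = oneapi::dpl::execution::dpcpp_default;
    sycl::queue q = policy.queue();
    int* data = sycl::malloc_device<int>(1000, q);
    // The data stays in device memory; the algorithms make no host <-> device copies
    oneapi::dpl::fill(policy, data, data + 1000, 7);
    oneapi::dpl::sort(policy, data, data + 1000);
    sycl::free(data, q);
    return 0;
}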
Note
For specialized memory algorithms that begin or end the lifetime of data objects, that is,
uninitialized_*
and destroy*
families of functions, the data to initialize or destroy should be accessible on the device without extra copying. Therefore, with device execution policies, these algorithms cannot be used with data stored in host memory.
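For example, oneapi::dpl::uninitialized_fill can be used with a device policy when the storage is USM. A minimal sketch, assuming the memory algorithms are available through the <oneapi/dpl/memory> header:
#include <oneapi/dpl/execution>
#include <oneapi/dpl/memory>
#include <sycl/sycl.hpp>
int main(){
    auto policy = oneapi::dpl::execution::dpcpp_default;
    sycl::queue q = policy.queue();
    // Shared USM is accessible on the device, as these algorithms require
    int* data = sycl::malloc_shared<int>(1000, q);
    oneapi::dpl::uninitialized_fill(policy, data, data + 1000, 42);
    oneapi::dpl::destroy(policy, data, data + 1000);
    sycl::free(data, q);
    return 0;
}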
You can also use std::vector
with a sycl::usm_allocator
, as shown in the following example.
Make sure that the allocator and the execution policy use the same SYCL queue:
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <random>
#include <vector>
#include <sycl/sycl.hpp>
int main(){
const int n = 1000;
auto policy = oneapi::dpl::execution::dpcpp_default;
sycl::usm_allocator<int, sycl::usm::alloc::shared> alloc(policy.queue());
std::vector<int, decltype(alloc)> vec(n, alloc);
std::generate(vec.begin(), vec.end(), std::minstd_rand{});
// Recommended to use USM pointers:
oneapi::dpl::sort(policy, vec.data(), vec.data() + vec.size());
/*
// Iterators for USM allocators might require extra copying - not a recommended method
oneapi::dpl::sort(policy, vec.begin(), vec.end());
*/
return 0;
}
For std::vector with a USM allocator, we recommend using std::vector::data() in combination with std::vector::size() as shown in the example above, rather than iterators to
std::vector
. That is because for some implementations of the C++ Standard Library it might not
be possible for oneDPL to detect that iterators are pointing to USM-allocated data. In that
case the data will be treated as if it were in host memory, with an extra copy made to a SYCL buffer.
Retrieving USM pointers from std::vector
as shown guarantees no unintended copying.
Use Range Views#
For parallel range algorithms with device execution policies,
place the data in USM or a USM-allocated std::vector
, and pass it to an algorithm
via a device-copyable range or view object such as std::ranges::subrange
or std::span
.
Note
Use of std::ranges::views::all
is not supported for algorithms with device execution policies.
These data ranges, as well as supported range adaptors and factories, may be combined into data transformation pipelines that can also be used with parallel range algorithms. For example:
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <random>
#include <vector>
#include <span>
#include <ranges>
#include <functional>
#include <sycl/sycl.hpp>
int main(){
const int n = 1000;
auto policy = oneapi::dpl::execution::dpcpp_default;
sycl::queue q = policy.queue();
int* d_head = sycl::malloc_host<int>(n, q);
std::generate(d_head, d_head + n, std::minstd_rand{});
sycl::usm_allocator<int, sycl::usm::alloc::shared> alloc(q);
std::vector<int, decltype(alloc)> vec(n, alloc);
oneapi::dpl::ranges::copy(policy,
std::ranges::subrange(d_head, d_head + n) | std::views::transform(std::negate{}),
std::span(vec));
oneapi::dpl::ranges::sort(policy, std::span(vec));
sycl::free(d_head, q);
return 0;
}
Use oneDPL Iterators#
You can use the iterators defined in the <oneapi/dpl/iterator> header in the oneapi::dpl namespace:
counting_iterator
zip_iterator
transform_iterator
discard_iterator
permutation_iterator
The Iterators section describes them in detail and provides usage examples.
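As a quick illustration (a minimal sketch; variable names are illustrative), counting_iterator generates values on the fly, so an algorithm can run on a device without any backing data storage:
#include <oneapi/dpl/execution>
#include <oneapi/dpl/numeric>
#include <oneapi/dpl/iterator>
int main(){
    auto policy = oneapi::dpl::execution::dpcpp_default;
    // Sum of the integers 0..999, computed on the device with no data storage at all
    auto sum = oneapi::dpl::reduce(policy,
                                   oneapi::dpl::counting_iterator<int>(0),
                                   oneapi::dpl::counting_iterator<int>(1000));
    return sum == 499500 ? 0 : 1;
}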
If you want to use these iterators in combination with your own custom iterators, make sure to follow the guidelines in the Use Custom Iterators section.
Use Custom Iterators#
You can create your own iterators that can be used as input to oneDPL algorithms.
These custom iterators must meet the following requirements:
They must be random access iterators.
They must be SYCL device-copyable.
They must be indirectly device accessible if they point to data that is accessible on a device.
Note
If a custom iterator is not indirectly device accessible, the algorithm will create a temporary memory buffer and copy the data from the iterator to this buffer before processing on a device. This may lead to performance degradation, and also requires that the data be accessible on the host to be copied into the temporary memory buffer.
The term indirectly device accessible means that the data referenced by the iterator
can be accessed from within a SYCL kernel (that is, on a device).
oneDPL determines this by examining the return type of the free function
is_onedpl_indirectly_device_accessible(It)
, where It
is the iterator type.
If the return type is std::true_type
, the iterator is considered indirectly device accessible.
This function is defined in <oneapi/dpl/iterator>
in the oneapi::dpl
namespace.
To make a custom iterator indirectly device accessible (assume its type is It
),
define an overload of is_onedpl_indirectly_device_accessible
that accepts an argument of type It
and returns std::true_type
.
This overload must be visible through argument-dependent lookup.
If it is found, the trait oneapi::dpl::is_indirectly_device_accessible_v<It>
,
which is also defined in <oneapi/dpl/iterator>
, evaluates to true
.
The example below shows how to define such an overload:
#include <oneapi/dpl/execution>
#include <oneapi/dpl/iterator>
#include <oneapi/dpl/numeric>
#include <oneapi/dpl/algorithm>
#include <iterator>
#include <cstddef>
#include <iostream>
#include <sycl/sycl.hpp>
template <typename It>
class strided_iterator
{
public:
using value_type = typename std::iterator_traits<It>::value_type;
using difference_type = typename std::iterator_traits<It>::difference_type;
using iterator_category = std::random_access_iterator_tag;
using reference = typename std::iterator_traits<It>::reference;
using pointer = typename std::iterator_traits<It>::pointer;
strided_iterator(It ptr, difference_type stride): ptr(ptr), stride(stride) {}
reference operator*() const { return *ptr; }
pointer operator->() const { return ptr; }
reference operator[](difference_type n) const { return *(*this + n);}
strided_iterator& operator++() { ptr += stride; return *this; }
strided_iterator& operator--() { ptr -= stride; return *this; }
strided_iterator& operator+=(difference_type n) { ptr += n * stride; return *this;}
strided_iterator& operator-=(difference_type n) { ptr -= n * stride; return *this; }
strided_iterator operator+(difference_type n) const { return strided_iterator(ptr + n * stride, stride);}
strided_iterator operator-(difference_type n) const { return strided_iterator(ptr - n * stride, stride); }
difference_type operator-(const strided_iterator& other) const {return (ptr - other.ptr) / stride; }
bool operator==(const strided_iterator& other) const { return ptr == other.ptr; }
bool operator!=(const strided_iterator& other) const { return ptr != other.ptr; }
bool operator<(const strided_iterator& other) const { return ptr < other.ptr; }
bool operator>(const strided_iterator& other) const { return ptr > other.ptr; }
bool operator<=(const strided_iterator& other) const { return ptr <= other.ptr; }
bool operator>=(const strided_iterator& other) const { return ptr >= other.ptr; }
// Another way to make this iterator indirectly device accessible
// friend oneapi::dpl::is_indirectly_device_accessible<It> is_onedpl_indirectly_device_accessible(strided_iterator) { return {}; }
private:
It ptr;
difference_type stride;
};
// Make strided_iterator indirectly device accessible when it wraps an indirectly device accessible type
template <typename It>
auto is_onedpl_indirectly_device_accessible(strided_iterator<It>) -> oneapi::dpl::is_indirectly_device_accessible<It>;
int main() {
sycl::queue q{};
const int n = 10;
int* d_head = sycl::malloc_device<int>(n, q);
// Fill the memory with values from 0 to 9
oneapi::dpl::copy(oneapi::dpl::execution::make_device_policy<class copy_kernel>(q),
oneapi::dpl::counting_iterator<int>(0),
oneapi::dpl::counting_iterator<int>(n),
d_head);
// Reduce every second element, 5 elements in total: 0, 2, 4, 6, 8
strided_iterator<int*> stride2(d_head, 2);
auto res = oneapi::dpl::reduce(oneapi::dpl::execution::make_device_policy<class reduce_kernel>(q),
stride2, stride2 + 5);
// is_indirectly_device_accessible_v: 1
// result: 20
std::cout << "is_indirectly_device_accessible_v: "
<< (oneapi::dpl::is_indirectly_device_accessible_v<strided_iterator<int*>>) << std::endl;
std::cout << "result: " << res << std::endl;
sycl::free(d_head, q);
return 0;
}
The example above uses oneapi::dpl::is_indirectly_device_accessible<It> as the return type of the overload, where It is int*. For pointer types such as int*, oneDPL defines this trait to evaluate to std::true_type, assuming that pointers refer to USM-allocated data.
oneDPL also automatically treats the following entities as indirectly device accessible:
Iterators to std::vector with a USM allocator.
Objects returned by the oneapi::dpl::begin and oneapi::dpl::end functions.
counting_iterator and discard_iterator.
zip_iterator, transform_iterator, and permutation_iterator, if their underlying iterators are indirectly device accessible.
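A minimal sketch of checking the trait for some of these entities (the results follow from the rules listed above):
#include <oneapi/dpl/iterator>
int main(){
    // Raw pointers are assumed to refer to USM-allocated data,
    // so a zip_iterator over them is also indirectly device accessible
    using zip_t = oneapi::dpl::zip_iterator<int*, float*>;
    static_assert(oneapi::dpl::is_indirectly_device_accessible_v<zip_t>);
    static_assert(oneapi::dpl::is_indirectly_device_accessible_v<oneapi::dpl::counting_iterator<int>>);
    return 0;
}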
For more information, refer to the Iterators section of the oneDPL specification.