parallel_for
Suppose you want to apply a function Foo to each element of an array, and it is safe to process each element concurrently. Here is the sequential code to do this:
void SerialApplyFoo( float a[], size_t n ) {
    for( size_t i=0; i!=n; ++i )
        Foo(a[i]);
}
The iteration space here is of type size_t and goes from 0 to n-1. The template function oneapi::tbb::parallel_for breaks this iteration space into chunks and distributes the chunks across worker threads, so that different chunks can run concurrently. The first step in parallelizing this loop is to convert the loop body into a form that operates on a chunk. The form is an STL-style function object, called the body object, in which operator() processes a chunk. The following code declares the body object.
#include "oneapi/tbb.h"

using namespace oneapi::tbb;

// Foo is assumed to be declared elsewhere.
class ApplyFoo {
    float *const my_a;
public:
    // Processes one chunk [r.begin(), r.end()) of the iteration space.
    void operator()( const blocked_range<size_t>& r ) const {
        float *a = my_a;
        for( size_t i=r.begin(); i!=r.end(); ++i )
            Foo(a[i]);
    }
    ApplyFoo( float a[] ) :
        my_a(a)
    {}
};
The using directive in the example enables you to use the library identifiers without having to write out the namespace prefix oneapi::tbb before each identifier. The rest of the examples assume that such a using directive is present.
Note the argument to operator(). A blocked_range<T> is a template class provided by the library. It describes a one-dimensional iteration space over type T. The parallel_for template function works with other kinds of iteration spaces too. The library provides blocked_range2d, blocked_range3d, and blocked_nd_range for multidimensional spaces. You can define your own spaces as explained in Advanced Topic: Other Kinds of Iteration Spaces.
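To make the chunk-based contract concrete, the following is a minimal sketch that uses only the standard library. IndexRange, DoubleEach, RunInChunks, and DoubledInChunks are hypothetical names introduced for illustration: IndexRange is a stand-in for blocked_range<size_t>, doubling stands in for Foo, and RunInChunks is a serial driver, not how parallel_for schedules work. It only shows that a body written against a half-open chunk produces the same result no matter how the iteration space is cut up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for blocked_range<size_t>: a half-open chunk [begin, end).
class IndexRange {
    std::size_t my_begin, my_end;
public:
    IndexRange(std::size_t b, std::size_t e) : my_begin(b), my_end(e) { assert(b <= e); }
    std::size_t begin() const { return my_begin; }
    std::size_t end() const { return my_end; }
};

// Body object in the same shape as ApplyFoo; doubling stands in for Foo.
class DoubleEach {
    float* const my_a;
public:
    explicit DoubleEach(float a[]) : my_a(a) {}
    void operator()(const IndexRange& r) const {
        float* a = my_a;
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            a[i] *= 2.0f;
    }
};

// Hypothetical serial driver: walks [0, n) in chunks of at most `chunk`
// elements and hands each chunk to the body. A parallel driver could hand
// the same chunks to different threads instead.
template <typename Body>
void RunInChunks(std::size_t n, std::size_t chunk, const Body& body) {
    assert(chunk > 0);
    for (std::size_t b = 0; b < n; b += chunk) {
        std::size_t e = (n - b < chunk) ? n : b + chunk;
        body(IndexRange(b, e));
    }
}

// Applies DoubleEach to a copy of the input, three elements at a time.
std::vector<float> DoubledInChunks(std::vector<float> v) {
    RunInChunks(v.size(), 3, DoubleEach(v.data()));
    return v;
}
```

Because each chunk touches a disjoint set of elements, the order in which chunks run does not affect the result, which is the property that makes the loop safe to parallelize.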
An instance of ApplyFoo needs member fields that remember all the local variables that were defined outside the original loop but used inside it. Usually, the constructor for the body object will initialize these fields, though parallel_for does not care how the body object is created. Template function parallel_for requires that the body object have a copy constructor, which is invoked to create a separate copy (or copies) for each worker thread. It also invokes the destructor to destroy these copies. In most cases, the implicitly generated copy constructor and destructor work correctly. If they do not, it is almost always the case (as usual in C++) that you must define both to be consistent.
Because the body object might be copied, its operator() should not modify the body. Otherwise the modification might or might not become visible to the thread that invoked parallel_for, depending upon whether operator() is acting on the original or a copy. As a reminder of this nuance, parallel_for requires that the body object’s operator() be declared const.
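The effect of copying can be seen in a small standard-library sketch. CountInBody, CountInCaller, and RunOnCopy are hypothetical names used only for illustration; RunOnCopy imitates one thing parallel_for is allowed to do, namely run a copy of the body rather than the original. The sketch is single-threaded, so it says nothing about synchronization: in a real parallel loop, chunks that write to the same location would also need to coordinate.

```cpp
#include <cassert>
#include <cstddef>

// Body that keeps its result in its own member. Updates made by a copy
// never reach the original object.
struct CountInBody {
    std::size_t count = 0;
    void operator()(std::size_t begin, std::size_t end) {  // non-const: mutates the body
        count += end - begin;
    }
};

// Body that writes through a pointer to state owned by the caller, in the
// same way ApplyFoo writes through my_a. Copies share the same target.
class CountInCaller {
    std::size_t* const my_count;
public:
    explicit CountInCaller(std::size_t* count) : my_count(count) {}
    void operator()(std::size_t begin, std::size_t end) const {  // const: body unchanged
        *my_count += end - begin;
    }
};

// Hypothetical driver that runs a copy of the body over [0, n).
template <typename Body>
void RunOnCopy(const Body& body, std::size_t n) {
    Body copy(body);
    copy(0, n);
}
```

After RunOnCopy, a CountInBody object passed in still reports a count of zero, whereas a counter reached through CountInCaller reflects the work done. Declaring operator() const rules out the first pattern at compile time.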
The operator() of ApplyFoo loads my_a into a local variable a. Though not necessary, there are two reasons for doing this in the example:
Style. It makes the loop body look more like the original.
Performance. Sometimes putting frequently accessed values into local variables helps the compiler optimize the loop better, because local variables are often easier for the compiler to track.
Once you have the loop body written as a body object, invoke the template function parallel_for, as follows:
#include "oneapi/tbb.h"

void ParallelApplyFoo( float a[], size_t n ) {
    parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a));
}
The blocked_range constructed here represents the entire iteration space from 0 to n-1, which parallel_for divides into subspaces for each processor. The general form of the constructor is blocked_range<T>(begin,end,grainsize). The T specifies the value type. The arguments begin and end specify the iteration space STL-style as a half-open interval [begin,end). The argument grainsize is explained in the Controlling Chunking section. The example omits the third argument and so uses the default grainsize of 1, because by default parallel_for applies a heuristic that works well with the default grainsize.
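As a rough illustration of how a half-open interval and a grainsize interact, here is a simplified standard-library model. HalfOpenRange, SplitToGrain, and Chunks are hypothetical names, and the strategy of always halving until no piece exceeds grainsize is an assumption chosen for clarity. It is not a description of the default heuristic, which may stop splitting earlier.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified model of a splittable half-open range [begin, end).
struct HalfOpenRange {
    std::size_t begin, end, grainsize;
    std::size_t size() const { return end - begin; }
    // A range may be split only while it is larger than grainsize.
    bool is_divisible() const { return size() > grainsize; }
};

// Recursively halves r until no piece is larger than grainsize,
// appending the resulting pieces to `leaves` in index order.
void SplitToGrain(const HalfOpenRange& r, std::vector<HalfOpenRange>& leaves) {
    assert(r.begin <= r.end && r.grainsize > 0);
    if (!r.is_divisible()) {
        if (r.size() > 0) leaves.push_back(r);  // skip empty ranges
        return;
    }
    std::size_t mid = r.begin + r.size() / 2;
    SplitToGrain({r.begin, mid, r.grainsize}, leaves);
    SplitToGrain({mid, r.end, r.grainsize}, leaves);
}

// Returns the pieces of [0, n) produced for the given grainsize.
std::vector<HalfOpenRange> Chunks(std::size_t n, std::size_t grainsize) {
    std::vector<HalfOpenRange> leaves;
    SplitToGrain({0, n, grainsize}, leaves);
    return leaves;
}
```

For example, Chunks(10, 3) yields [0,2), [2,5), [5,7), and [7,10). Each piece ends exactly where the next begins, so every index from 0 to 9 is covered once, which is what the half-open convention is meant to guarantee.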