parallel_for

Suppose you want to apply a function Foo to each element of an array, and it is safe to process each element concurrently. Here is the sequential code to do this:

void SerialApplyFoo( float a[], size_t n ) {
    for( size_t i=0; i!=n; ++i )
        Foo(a[i]);
}

The iteration space here is of type size_t, and goes from 0 to n-1. The template function oneapi::tbb::parallel_for breaks this iteration space into chunks and can run the chunks concurrently on separate threads. The first step in parallelizing this loop is to convert the loop body into a form that operates on a chunk. That form is an STL-style function object, called the body object, whose operator() processes one chunk. The following code declares the body object.

#include "oneapi/tbb.h"

using namespace oneapi::tbb;

class ApplyFoo {
    float *const my_a;
public:
    void operator()( const blocked_range<size_t>& r ) const {
        float *a = my_a;
        for( size_t i=r.begin(); i!=r.end(); ++i )
           Foo(a[i]);
    }
    ApplyFoo( float a[] ) :
        my_a(a)
    {}
};

The using directive in the example enables you to use the library identifiers without having to write out the namespace prefix oneapi::tbb before each identifier. The rest of the examples assume that such a using directive is present.

Note the argument to operator(). A blocked_range<T> is a template class provided by the library. It describes a one-dimensional iteration space over type T. The template function parallel_for works with other kinds of iteration spaces too. The library provides blocked_range2d, blocked_range3d, and blocked_nd_range for multidimensional spaces. You can define your own spaces as explained in Advanced Topic: Other Kinds of Iteration Spaces.

An instance of ApplyFoo needs member fields that remember all the local variables that were defined outside the original loop but used inside it. Usually, the constructor for the body object will initialize these fields, though parallel_for does not care how the body object is created. Template function parallel_for requires that the body object have a copy constructor, which is invoked to create a separate copy (or copies) for each worker thread. It also invokes the destructor to destroy these copies. In most cases, the implicitly generated copy constructor and destructor work correctly. If they do not, you almost always need to define both so that they stay consistent with each other, as is usual in C++.

Because the body object might be copied, its operator() should not modify the body. Otherwise the modification might or might not become visible to the thread that invoked parallel_for, depending upon whether operator() is acting on the original or a copy. As a reminder of this nuance, parallel_for requires that the body object’s operator() be declared const.
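The effect of copying can be seen without the library. The following sketch is plain C++ and is not how parallel_for is implemented; run_on_copy is a hypothetical stand-in for any algorithm that copies its body before using it, and CountingBody is a deliberately flawed body that tries to keep a count in itself.

```cpp
#include <cassert>

// Hypothetical, deliberately flawed body: keeps a running count in a member.
struct CountingBody {
    int calls = 0;
    void operator()(int /*index*/) { ++calls; }  // non-const: mutates the body
};

// Hypothetical stand-in for an algorithm that works on a copy of the body.
template <typename Body>
void run_on_copy(const Body& body, int n) {
    assert(n >= 0);
    Body copy(body);              // the copy constructor runs here
    for (int i = 0; i < n; ++i)
        copy(i);                  // updates copy.calls, not body.calls
}                                 // the copy and its count are destroyed here

// Returns the count visible to the caller after n invocations.
int calls_seen_by_caller(int n) {
    CountingBody body;
    run_on_copy(body, n);
    return body.calls;            // the caller's object was never modified
}
```

Here calls_seen_by_caller returns 0 no matter how large n is. Declaring operator() const makes the ++calls line a compile-time error, which is why the const requirement serves as a useful reminder.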

The example operator() loads my_a into a local variable a. Though not necessary, there are two reasons for doing this in the example:

  • Style. It makes the loop body look more like the original.

  • Performance. Sometimes putting frequently accessed values into local variables helps the compiler optimize the loop better, because local variables are often easier for the compiler to track.

Once you have the loop body written as a body object, invoke the template function parallel_for, as follows:

#include "oneapi/tbb.h"


void ParallelApplyFoo( float a[], size_t n ) {
    parallel_for(blocked_range<size_t>(0,n), ApplyFoo(a));
}

The blocked_range constructed here represents the entire iteration space from 0 to n-1, which parallel_for divides into subspaces for each processor. The general form of the constructor is blocked_range<T>(begin,end,grainsize). T specifies the value type. The arguments begin and end specify the iteration space STL-style as a half-open interval [begin,end). The argument grainsize is explained in the Controlling Chunking section. The example omits grainsize and so gets the default value of 1, because by default parallel_for applies a heuristic that works well with that default.