CPU Features Dispatching#

For each algorithm, oneDAL provides several code paths for x86-64-compatible architectural extensions.

The following extensions are currently supported:

  • Intel® Streaming SIMD Extensions 2 (Intel® SSE2)

  • Intel® Streaming SIMD Extensions 4.2 (Intel® SSE4.2)

  • Intel® Advanced Vector Extensions 2 (Intel® AVX2)

  • Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

A particular code path is chosen at runtime based on the properties of the underlying hardware.

This chapter describes how the code is organized to support this variety of extensions.

Algorithm Implementation Options#

In addition to the architectural extensions, an algorithm in oneDAL may have various implementation options. Below is a description of these options to help you better understand the oneDAL code structure and conventions.

Computational Tasks#

An algorithm might have several computational tasks, for example, classification and regression.

Computational Stages#

An algorithm might have training and inference computation stages aimed at training a model on the input dataset and computing the inference results, respectively.

Computational Methods#

An algorithm can support several methods for the same type of computations. For example, the kNN algorithm supports the brute_force and kd_tree methods for both training and inference.

Computational Modes#

oneDAL can provide several computational modes for an algorithm. See Computational Modes chapter for details.

Folders and Files#

Suppose that you are working on some algorithm Abc in oneDAL.

The part of the implementation of this algorithm that runs on CPUs should be located in the cpp/daal/src/algorithms/abc folder.

Suppose that it provides:

  • classification and regression learning tasks;

  • training and inference stages;

  • method1 and method2 for the training stage and only method1 for the inference stage;

  • only batch computational mode.

Then the cpp/daal/src/algorithms/abc folder should contain at least the following files:

cpp/daal/src/algorithms/abc/
  |-- abc_classification_predict_method1_batch_fpt_cpu.cpp
  |-- abc_classification_predict_method1_impl.i
  |-- abc_classification_predict_kernel.h
  |-- abc_classification_train_method1_batch_fpt_cpu.cpp
  |-- abc_classification_train_method2_batch_fpt_cpu.cpp
  |-- abc_classification_train_method1_impl.i
  |-- abc_classification_train_method2_impl.i
  |-- abc_classification_train_kernel.h
  |-- abc_regression_predict_method1_batch_fpt_cpu.cpp
  |-- abc_regression_predict_method1_impl.i
  |-- abc_regression_predict_kernel.h
  |-- abc_regression_train_method1_batch_fpt_cpu.cpp
  |-- abc_regression_train_method2_batch_fpt_cpu.cpp
  |-- abc_regression_train_method1_impl.i
  |-- abc_regression_train_method2_impl.i
  |-- abc_regression_train_kernel.h

An alternative folder structure that avoids storing too many files in a single folder could be:

cpp/daal/src/algorithms/abc/
  |-- classification/
  |     |-- abc_classification_predict_method1_batch_fpt_cpu.cpp
  |     |-- abc_classification_predict_method1_impl.i
  |     |-- abc_classification_predict_kernel.h
  |     |-- abc_classification_train_method1_batch_fpt_cpu.cpp
  |     |-- abc_classification_train_method2_batch_fpt_cpu.cpp
  |     |-- abc_classification_train_method1_impl.i
  |     |-- abc_classification_train_method2_impl.i
  |     |-- abc_classification_train_kernel.h
  |-- regression/
        |-- abc_regression_predict_method1_batch_fpt_cpu.cpp
        |-- abc_regression_predict_method1_impl.i
        |-- abc_regression_predict_kernel.h
        |-- abc_regression_train_method1_batch_fpt_cpu.cpp
        |-- abc_regression_train_method2_batch_fpt_cpu.cpp
        |-- abc_regression_train_method1_impl.i
        |-- abc_regression_train_method2_impl.i
        |-- abc_regression_train_kernel.h

The names of the files stay the same in this case; only the folder layout differs.

The folders of algorithms that are already implemented can contain additional files, for example, files with container.h or dispatcher.cpp suffixes. These files are used in the implementation of the Data Analytics Acceleration Library (DAAL) interface. That interface is still available to users, but it is not recommended for new code. The files related to the DAAL interface are not described here, as they are not part of the CPU features dispatching mechanism.

The following sections describe the purpose and contents of each file, using the classification training task as an example. The code structure for the other task types is similar.

*_kernel.h#

In the directory structure of the Abc algorithm, there are files with a _kernel.h suffix. These files contain the definitions of one or several template classes that define member functions that do the actual computations. Here is a variant of the Abc training algorithm kernel definition in the file abc_classification_train_kernel.h:

#ifndef __ABC_CLASSIFICATION_TRAIN_KERNEL_H__
#define __ABC_CLASSIFICATION_TRAIN_KERNEL_H__

#include "src/algorithms/kernel.h"
#include "data_management/data/numeric_table.h"    // NumericTable class
/* Other necessary includes go here */

using namespace daal::data_management;    // NumericTable class

namespace daal::algorithms::abc::training::internal
{
/* Dummy base template class */
template <typename algorithmFPType, Method method, CpuType cpu>
class AbcClassificationTrainingKernel : public Kernel
{};

/* Computational kernel for 'method1' of the Abc training algorithm */
template <typename algorithmFPType, CpuType cpu>
class AbcClassificationTrainingKernel<algorithmFPType, method1, cpu> : public Kernel
{
public:
   services::Status compute(/* Input and output arguments for the 'method1' */);
};

/* Computational kernel for 'method2' of the Abc training algorithm */
template <typename algorithmFPType, CpuType cpu>
class AbcClassificationTrainingKernel<algorithmFPType, method2, cpu> : public Kernel
{
public:
   services::Status compute(/* Input and output arguments for the 'method2' */);
};

} // namespace daal::algorithms::abc::training::internal

#endif // __ABC_CLASSIFICATION_TRAIN_KERNEL_H__

Typical template parameters are:

  • algorithmFPType Data type to use in intermediate computations for the algorithm,

    float or double.

  • method Computational methods of the algorithm. method1 or method2 in the case of Abc.

  • cpu Version of the cpu-specific implementation of the algorithm, daal::CpuType.

Implementations for different methods are usually defined using partial class template specialization.

*_impl.i#

In the directory structure of the Abc algorithm, there are files with an _impl.i suffix. These files contain the implementations of the computational functions defined in the files with a _kernel.h suffix. Here is a variant of the method1 implementation for the Abc training algorithm that does not contain any instruction-set-specific code. The implementation is located in the file abc_classification_train_method1_impl.i:

/*
//++
//  Implementation of Abc training algorithm.
//--
*/

#include "src/algorithms/service_error_handling.h"
#include "src/data_management/service_numeric_table.h"

namespace daal::algorithms::abc::training::internal
{

template <typename algorithmFPType, CpuType cpu>
services::Status AbcClassificationTrainingKernel<algorithmFPType, method1, cpu>::compute(/* ... */)
{
    services::Status status;

    /* Implementation that does not contain instruction set specific code */

    return status;
}


} // namespace daal::algorithms::abc::training::internal

Although the implementation of method1 does not contain any instruction-set-specific code, developers are expected to leverage the SIMD-related macros available in oneDAL, for example, PRAGMA_IVDEP, PRAGMA_VECTOR_ALWAYS, PRAGMA_VECTOR_ALIGNED, and other pragmas defined in service_defines.h. This guides the compiler to generate more efficient code for the target architecture.

Consider that the implementation of method2 for the same algorithm is different and contains AVX-512-specific code located in a cpuSpecificCode function. Note that all compiler-specific code should be gated on compiler-specific defines. For example, code specific to the Intel® oneAPI DPC++/C++ Compiler should be gated on the existence of the DAAL_INTEL_CPP_COMPILER define. Similarly, all CPU-specific code should be gated on the value of a CPU-specific define. For example, the AVX-512-specific code should be gated on the value __CPUID__(DAAL_CPU) == __avx512__.

Then the implementation of method2 in the file abc_classification_train_method2_impl.i will look like:

/*
//++
//  Implementation of Abc training algorithm.
//--
*/

#include "src/algorithms/service_error_handling.h"
#include "src/data_management/service_numeric_table.h"

namespace daal::algorithms::abc::training::internal
{

/* Generic template implementation of cpuSpecificCode function for all data types
   and various instruction set architectures */
template <typename algorithmFPType, CpuType cpu>
services::Status cpuSpecificCode(/* arguments */)
{
   /* Implementation */
};

#if defined(DAAL_INTEL_CPP_COMPILER) && (__CPUID__(DAAL_CPU) == __avx512__)

/* Specialization of cpuSpecificCode function for double data type and Intel(R) AVX-512 instruction set */
template <>
services::Status cpuSpecificCode<double, avx512>(/* arguments */)
{
   /* Implementation */
};

/* Specialization of cpuSpecificCode function for float data type and Intel(R) AVX-512 instruction set */
template <>
services::Status cpuSpecificCode<float, avx512>(/* arguments */)
{
   /* Implementation */
};

#endif // DAAL_INTEL_CPP_COMPILER && (__CPUID__(DAAL_CPU) == __avx512__)

template <typename algorithmFPType, CpuType cpu>
services::Status AbcClassificationTrainingKernel<algorithmFPType, method2, cpu>::compute(/* arguments */)
{
    services::Status status;

    /* Implementation that calls CPU-specific code: */
    status = cpuSpecificCode<algorithmFPType, cpu>(/* ... */);
    DAAL_CHECK_STATUS_VAR(status);

    /* Implementation continues */

    return status;
}

} // namespace daal::algorithms::abc::training::internal

*_fpt_cpu.cpp#

In the directory structure of the Abc algorithm, there are files with a _fpt_cpu.cpp suffix. These files contain the instantiations of the template classes defined in the files with a _kernel.h suffix. The instantiation of the Abc training algorithm kernel for method1 is located in the file abc_classification_train_method1_batch_fpt_cpu.cpp:

/*
//++
//  Instantiations of method1 of the Abc training algorithm.
//--
*/

#include "src/algorithms/abc/abc_classification_train_kernel.h"
#include "src/algorithms/abc/abc_classification_train_method1_impl.i"

namespace daal::algorithms::abc::training::internal
{
template class DAAL_EXPORT AbcClassificationTrainingKernel<DAAL_FPTYPE, method1, DAAL_CPU>;
} // namespace daal::algorithms::abc::training::internal

The _fpt_cpu.cpp files are not compiled directly into object files. First, multiple copies of those files are made, replacing the fpt (which stands for 'floating point type') and cpu parts of the file name, as well as the corresponding DAAL_FPTYPE and DAAL_CPU macros, with the actual data type and CPU type values. Then the resulting files are compiled with the appropriate CPU-specific compiler optimization options.

The values for fpt file name part replacement are:

  • flt for float data type, and

  • dbl for double data type.

The values for DAAL_FPTYPE macro replacement are float and double, respectively.

The values for cpu file name part replacement are:

  • nrh for Intel® SSE2 architecture, which stands for Northwood,

  • neh for Intel® SSE4.2 architecture, which stands for Nehalem,

  • hsw for Intel® AVX2 architecture, which stands for Haswell,

  • skx for Intel® AVX-512 architecture, which stands for Skylake-X.

The values for DAAL_CPU macro replacement are:

  • __sse2__ for Intel® SSE2 architecture,

  • __sse42__ for Intel® SSE4.2 architecture,

  • __avx2__ for Intel® AVX2 architecture,

  • __avx512__ for Intel® AVX-512 architecture.

Build System Configuration#

This chapter describes which parts of the build system need to be modified to add a new architectural extension or to remove an outdated one.

Makefile#

The most important definitions and functions for CPU features dispatching are located in the files 32e.mk for the x86-64 architecture, riscv64.mk for the RISC-V 64-bit architecture, and arm.mk for the ARM architecture. Those files are included into operating-system-related makefiles. For example, the 32e.mk file is included into the lnx32e.mk file:

include dev/make/function_definitions/32e.mk

And lnx32e.mk and similar files are included into the main Makefile:

include dev/make/function_definitions/$(PLAT).mk

Where $(PLAT) is the platform name, for example, lnx32e, win32e, lnxriscv64, etc.

To add a new architectural extension to the 32e.mk file, the CPUs and CPUs.files lists need to be updated. Functions like set_uarch_options_for_compiler should also be updated accordingly.

The compiler options for the new architectural extension should be added to the respective file in the compiler_definitions folder.

For example, the gnu.32e.mk file contains the compiler options for the GNU compiler for the x86-64 architecture, in the form option_name.compiler_name:

p4_OPT.gnu   = $(-Q)march=nocona
mc3_OPT.gnu  = $(-Q)march=corei7
avx2_OPT.gnu = $(-Q)march=haswell
skx_OPT.gnu  = $(-Q)march=skylake

Bazel#

For now, the Bazel build is supported only for the Linux x86-64 platform. It provides the cpu option, which allows specifying the list of target architectural extensions.

To add a new architectural extension to the Bazel configuration, the following steps should be taken:

  • Add the new extension to the list of allowed values in the _ISA_EXTENSIONS variable in the config.bzl file;

  • Update the get_cpu_flags function in the flags.bzl file to provide the compiler flags for the new extension;

  • Update the cpu_defines dictionaries in dal.bzl and daal.bzl files accordingly.