CPU Features Dispatching#
For each algorithm, oneDAL provides several code paths tuned for x86-64-compatible architectural extensions.
The following extensions are currently supported:
- Intel® Streaming SIMD Extensions 2 (Intel® SSE2)
- Intel® Streaming SIMD Extensions 4.2 (Intel® SSE4.2)
- Intel® Advanced Vector Extensions 2 (Intel® AVX2)
- Intel® Advanced Vector Extensions 512 (Intel® AVX-512)
The particular code path is chosen at runtime based on the properties of the underlying hardware.
This chapter describes how the code is organized to support this variety of extensions.
Algorithm Implementation Options#
In addition to the architectural extensions, an algorithm in oneDAL may have various implementation options. Below is a description of these options to help you better understand the oneDAL code structure and conventions.
Computational Tasks#
An algorithm might have various tasks to compute. The most common options are classification and regression.
Computational Stages#
An algorithm might have training and inference computation stages, aimed at training a model on the input dataset and computing the inference results, respectively.
Computational Methods#
An algorithm can support several methods for the same type of computations. For example, the kNN algorithm supports the brute_force and kd_tree methods for training and inference.
Computational Modes#
oneDAL can provide several computational modes for an algorithm. See the Computational Modes chapter for details.
Folders and Files#
Suppose that you are working on some algorithm Abc in oneDAL. The part of the implementation of this algorithm that runs on CPU should be located in the cpp/daal/src/algorithms/abc folder.
Suppose that it provides:
- classification and regression learning tasks;
- training and inference stages;
- method1 and method2 for the training stage, and only method1 for the inference stage;
- only the batch computational mode.
Then the cpp/daal/src/algorithms/abc folder should contain at least the following files:
cpp/daal/src/algorithms/abc/
|-- abc_classification_predict_method1_batch_fpt_cpu.cpp
|-- abc_classification_predict_method1_impl.i
|-- abc_classification_predict_kernel.h
|-- abc_classification_train_method1_batch_fpt_cpu.cpp
|-- abc_classification_train_method2_batch_fpt_cpu.cpp
|-- abc_classification_train_method1_impl.i
|-- abc_classification_train_method2_impl.i
|-- abc_classification_train_kernel.h
|-- abc_regression_predict_method1_batch_fpt_cpu.cpp
|-- abc_regression_predict_method1_impl.i
|-- abc_regression_predict_kernel.h
|-- abc_regression_train_method1_batch_fpt_cpu.cpp
|-- abc_regression_train_method2_batch_fpt_cpu.cpp
|-- abc_regression_train_method1_impl.i
|-- abc_regression_train_method2_impl.i
|-- abc_regression_train_kernel.h
An alternative folder structure, which avoids storing too many files within a single folder, could be:
cpp/daal/src/algorithms/abc/
|-- classification/
| |-- abc_classification_predict_method1_batch_fpt_cpu.cpp
| |-- abc_classification_predict_method1_impl.i
| |-- abc_classification_predict_kernel.h
| |-- abc_classification_train_method1_batch_fpt_cpu.cpp
| |-- abc_classification_train_method2_batch_fpt_cpu.cpp
| |-- abc_classification_train_method1_impl.i
| |-- abc_classification_train_method2_impl.i
| |-- abc_classification_train_kernel.h
|-- regression/
| |-- abc_regression_predict_method1_batch_fpt_cpu.cpp
| |-- abc_regression_predict_method1_impl.i
| |-- abc_regression_predict_kernel.h
| |-- abc_regression_train_method1_batch_fpt_cpu.cpp
| |-- abc_regression_train_method2_batch_fpt_cpu.cpp
| |-- abc_regression_train_method1_impl.i
| |-- abc_regression_train_method2_impl.i
| |-- abc_regression_train_kernel.h
The names of the files stay the same in this case; only the folder layout differs.
The folders of algorithms that are already implemented can contain additional files, for example, files with container.h or dispatcher.cpp suffixes. These files are used in the implementation of the Data Analytics Acceleration Library (DAAL) interface. That interface is still available to users, but it is not recommended for use in new code. The files related to the DAAL interface are not described here, as they are not part of the CPU features dispatching mechanism.
The following sections describe the purpose and contents of each file, using the classification training task as an example. The code structure is similar for the other types of tasks.
*_kernel.h#
In the directory structure of the Abc algorithm, there are files with a _kernel.h suffix. These files contain the definitions of one or several template classes whose member functions do the actual computations. Here is a variant of the Abc training algorithm kernel definition in the file abc_classification_train_kernel.h:
#ifndef __ABC_CLASSIFICATION_TRAIN_KERNEL_H__
#define __ABC_CLASSIFICATION_TRAIN_KERNEL_H__
#include "src/algorithms/kernel.h"
#include "data_management/data/numeric_table.h" // NumericTable class
/* Other necessary includes go here */
using namespace daal::data_management; // NumericTable class
namespace daal::algorithms::abc::training::internal
{
/* Dummy base template class */
template <typename algorithmFPType, Method method, CpuType cpu>
class AbcClassificationTrainingKernel : public Kernel
{};
/* Computational kernel for 'method1' of the Abc training algorithm */
template <typename algorithmFPType, CpuType cpu>
class AbcClassificationTrainingKernel<algorithmFPType, method1, cpu> : public Kernel
{
public:
services::Status compute(/* Input and output arguments for the 'method1' */);
};
/* Computational kernel for 'method2' of the Abc training algorithm */
template <typename algorithmFPType, CpuType cpu>
class AbcClassificationTrainingKernel<algorithmFPType, method2, cpu> : public Kernel
{
public:
services::Status compute(/* Input and output arguments for the 'method2' */);
};
} // namespace daal::algorithms::abc::training::internal
#endif // __ABC_CLASSIFICATION_TRAIN_KERNEL_H__
Typical template parameters are:
- algorithmFPType: the data type used in intermediate computations for the algorithm, float or double;
- method: the computational method of the algorithm, method1 or method2 in the case of Abc;
- cpu: the version of the CPU-specific implementation of the algorithm, daal::CpuType.
Implementations for different methods are usually defined using partial class template specialization.
*_impl.i#
In the directory structure of the Abc
algorithm, there are files with a _impl.i suffix.
These files contain the implementations of the computational functions defined in the files with a _kernel.h suffix.
Here is a variant of method1
implementation for Abc
training algorithm that does not contain any
instruction set specific code. The implementation is located in the file abc_classification_train_method1_impl.i:
/*
//++
// Implementation of Abc training algorithm.
//--
*/
#include "src/algorithms/service_error_handling.h"
#include "src/data_management/service_numeric_table.h"
namespace daal::algorithms::abc::training::internal
{
template <typename algorithmFPType, CpuType cpu>
services::Status AbcClassificationTrainingKernel<algorithmFPType, method1, cpu>::compute(/* ... */)
{
services::Status status;
/* Implementation that does not contain instruction set specific code */
return status;
}
} // namespace daal::algorithms::abc::training::internal
Although the implementation of method1 does not contain any instruction-set-specific code, developers are expected to leverage the SIMD-related macros available in oneDAL, for example, PRAGMA_IVDEP, PRAGMA_VECTOR_ALWAYS, PRAGMA_VECTOR_ALIGNED, and other pragmas defined in service_defines.h. This guides the compiler to generate more efficient code for the target architecture.
Consider that the implementation of method2 for the same algorithm is different and contains AVX-512-specific code located in a cpuSpecificCode function. Note that all compiler-specific code should be gated by the values of compiler-specific defines. For example, code specific to the Intel® oneAPI DPC++/C++ Compiler should be gated on the existence of the DAAL_INTEL_CPP_COMPILER define. Similarly, all CPU-specific code should be gated on the value of a CPU-specific define. For example, AVX-512-specific code should be gated on the condition __CPUID__(DAAL_CPU) == __avx512__. Then the implementation of method2 in the file abc_classification_train_method2_impl.i will look like:
/*
//++
// Implementation of Abc training algorithm.
//--
*/
#include "src/algorithms/service_error_handling.h"
#include "src/data_management/service_numeric_table.h"
namespace daal::algorithms::abc::training::internal
{
/* Generic template implementation of cpuSpecificCode function for all data types
and various instruction set architectures */
template <typename algorithmFPType, CpuType cpu>
services::Status cpuSpecificCode(/* arguments */)
{
/* Implementation */
}
#if defined(DAAL_INTEL_CPP_COMPILER) && (__CPUID__(DAAL_CPU) == __avx512__)
/* Specialization of cpuSpecificCode function for double data type and Intel(R) AVX-512 instruction set */
template <>
services::Status cpuSpecificCode<double, avx512>(/* arguments */)
{
/* Implementation */
}
/* Specialization of cpuSpecificCode function for float data type and Intel(R) AVX-512 instruction set */
template <>
services::Status cpuSpecificCode<float, avx512>(/* arguments */)
{
/* Implementation */
}
#endif // DAAL_INTEL_CPP_COMPILER && (__CPUID__(DAAL_CPU) == __avx512__)
template <typename algorithmFPType, CpuType cpu>
services::Status AbcClassificationTrainingKernel<algorithmFPType, method2, cpu>::compute(/* arguments */)
{
services::Status status;
/* Implementation that calls CPU-specific code: */
status = cpuSpecificCode<algorithmFPType, cpu>(/* ... */);
DAAL_CHECK_STATUS_VAR(status);
/* Implementation continues */
return status;
}
} // namespace daal::algorithms::abc::training::internal
*_fpt_cpu.cpp#
In the directory structure of the Abc
algorithm, there are files with a _fpt_cpu.cpp suffix.
These files contain the instantiations of the template classes defined in the files with a _kernel.h suffix.
The instantiation of the Abc
training algorithm kernel for method1
is located in the file
abc_classification_train_method1_batch_fpt_cpu.cpp:
/*
//++
// Instantiation of method1 of the Abc training algorithm.
//--
*/
#include "src/algorithms/abc/abc_classification_train_kernel.h"
#include "src/algorithms/abc/abc_classification_train_method1_impl.i"
namespace daal::algorithms::abc::training::internal
{
template class DAAL_EXPORT AbcClassificationTrainingKernel<DAAL_FPTYPE, method1, DAAL_CPU>;
} // namespace daal::algorithms::abc::training::internal
_fpt_cpu.cpp files are not compiled directly into object files. First, multiple copies of those files are made, replacing the fpt (which stands for 'floating point type') and cpu parts of the file name, as well as the corresponding DAAL_FPTYPE and DAAL_CPU macros, with the actual data type and CPU type values. Then the resulting files are compiled with the appropriate CPU-specific compiler optimization options.
The values for the fpt file name part replacement are:
- flt for the float data type,
- dbl for the double data type.
The values for DAAL_FPTYPE macro replacement are float and double, respectively.
The values for the cpu file name part replacement are:
- nrh for the Intel® SSE2 architecture (stands for Northwood),
- neh for the Intel® SSE4.2 architecture (stands for Nehalem),
- hsw for the Intel® AVX2 architecture (stands for Haswell),
- skx for the Intel® AVX-512 architecture (stands for Skylake-X).
The values for DAAL_CPU macro replacement are:
- __sse2__ for the Intel® SSE2 architecture,
- __sse42__ for the Intel® SSE4.2 architecture,
- __avx2__ for the Intel® AVX2 architecture,
- __avx512__ for the Intel® AVX-512 architecture.
Build System Configuration#
This chapter describes which parts of the build system need to be modified to add a new architectural extension or to remove an outdated one.
Makefile#
The most important definitions and functions for CPU features dispatching are located in the files 32e.mk for the x86-64 architecture, riscv64.mk for the RISC-V 64-bit architecture, and arm.mk for the ARM architecture. Those files are included into the operating-system-related makefiles. For example, the 32e.mk file is included into the lnx32e.mk file:
include dev/make/function_definitions/32e.mk
And lnx32e.mk and similar files are included into the main Makefile:
include dev/make/function_definitions/$(PLAT).mk
Here $(PLAT) is the platform name, for example, lnx32e, win32e, or lnxriscv64.
To add a new architectural extension to the 32e.mk file, the CPUs and CPUs.files lists need to be updated. Functions like set_uarch_options_for_compiler should also be updated accordingly.
The compiler options for the new architectural extension should be added to the respective file in the compiler_definitions folder. For example, the gnu.32e.mk file contains the compiler options for the GNU compiler on the x86-64 architecture, in the form option_name.compiler_name:
p4_OPT.gnu = $(-Q)march=nocona
mc3_OPT.gnu = $(-Q)march=corei7
avx2_OPT.gnu = $(-Q)march=haswell
skx_OPT.gnu = $(-Q)march=skylake
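Following the same pattern, a hypothetical new extension (call it foo here) would get its own option variable; the -march value below is purely illustrative:

```make
# Hypothetical entry for a new extension 'foo'; the actual -march value
# would depend on the compiler's name for the target microarchitecture.
foo_OPT.gnu = $(-Q)march=foo
```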
Bazel#
For now, the Bazel build is supported only for the Linux x86-64 platform. It provides a cpu option that allows specifying the list of target architectural extensions.
To add a new architectural extension to the Bazel configuration, the following steps should be done:
1. Add the new extension to the list of allowed values in the _ISA_EXTENSIONS variable in the config.bzl file;
2. Update the get_cpu_flags function in the flags.bzl file to provide the compiler flags for the new extension;
3. Update the cpu_defines dictionaries in the dal.bzl and daal.bzl files accordingly.