oneMKL Architecture

The oneMKL element of oneAPI has several general assumptions, requirements and recommendations that apply to all of its domains. This architecture section addresses them. In particular, DPC++ allows a great deal of control over the execution of kernels on the various devices. The supported execution models of oneMKL APIs are discussed in Execution Model. How data is stored and passed in and out of the APIs is addressed in Memory Model. The general structure and design of oneMKL APIs, including namespaces and common data types, are described in API Design. Exceptions and error handling are described in Exceptions and Error Handling. Finally, the remaining aspects of the oneMKL architecture, including versioning and a discussion of pre- and postconditions, can be found in Other Features. Other nonessential but useful aspects of the oneMKL architecture and design may also be found in the oneMKL Appendix.

Execution Model

This section describes the execution environment common to all oneMKL functionality. The execution environment includes how data is provided to computational routines in Use of Queues, support for several devices in Device Usage, synchronous and asynchronous execution models in Asynchronous Execution, and thread safety in Host Thread Safety.

Use of Queues

The sycl::queue defined in the oneAPI DPC++ specification is used to specify the device on which a task will be enqueued and the features enabled on that device. There are two forms of computational routines in oneMKL: class-based Member Functions and standalone Non-Member Functions. As these may interact with the sycl::queue in different ways, we provide a section for each one to describe assumptions.

Non-Member Functions

Each oneMKL non-member computational routine takes a sycl::queue reference as its first parameter:

oneapi::mkl::domain::routine(sycl::queue &q, ...);

All computation performed by the routine shall be done on the hardware device(s) associated with this queue, with possible aid from the host, unless otherwise specified. In the case of an ordered queue, all computation shall also be ordered with respect to other kernels as if enqueued on that queue.

A particular oneMKL implementation may not support the execution of a given oneMKL routine on the specified device(s). In this case, the implementation may either perform the computation on the host or throw an exception. See Exceptions and Error Handling for the possible exceptions.

Member Functions

oneMKL class-based APIs, such as those in the RNG and DFT domains, require a sycl::queue as an argument to the constructor or another setup routine. The execution requirements for computational routines from the previous section also apply to computational class methods.

Device Usage

oneMKL itself does not currently provide any interfaces for controlling device usage: for instance, controlling the number of cores used on the CPU, or the number of execution units on a GPU. However, such functionality may be available by partitioning a sycl::device instance into subdevices, when supported by the device.

When given a queue associated with such a subdevice, a oneMKL implementation shall only perform computation on that subdevice.

Asynchronous Execution

The oneMKL API is designed to allow asynchronous execution of computational routines, to facilitate concurrent usage of multiple devices in the system. Each computational routine enqueues work to be performed on the selected device, and may (but is not required to) return before execution completes.

Hence, it is the calling application’s responsibility to ensure that any inputs are valid until computation is complete, and likewise to wait for computation completion before reading any outputs. This can be done automatically when using DPC++ buffers, or manually when using Unified Shared Memory (USM) pointers, as described in the sections below.

Unless otherwise specified, asynchronous execution is allowed, but not guaranteed, by any oneMKL computational routine, and may vary between implementations and/or versions. oneMKL implementations must clearly document whether execution is guaranteed to be asynchronous for each supported routine. Regardless, calling applications shall not launch any oneMKL computational routine with a dependency on a future oneMKL API call, even if this computational routine executes asynchronously (i.e. a oneMKL implementation may assume no antidependencies are present). This guarantee allows oneMKL implementations to reserve resources for execution without risking deadlock.

Synchronization When Using Buffers

sycl::buffer objects automatically manage synchronization between kernel launches linked by a data dependency (either read-after-write, write-after-write, or write-after-read).

oneMKL routines are not required to perform any additional synchronization of sycl::buffer arguments.

Synchronization When Using USM APIs

When USM pointers are used as input to, or output from, a oneMKL routine, it becomes the calling application’s responsibility to manage possible asynchronicity.

To help the calling application, all oneMKL routines with at least one USM pointer argument also take an optional reference to a list of input events, of type std::vector<sycl::event>, and have a return value of type sycl::event representing computation completion:

sycl::event oneapi::mkl::domain::routine(..., const std::vector<sycl::event> &in_events = {});

The routine shall ensure that all input events (if the list is present and non-empty) have occurred before any USM pointers are accessed. Likewise, the routine’s output event shall not be complete until the routine has finished accessing all USM pointer arguments.

For class methods, “argument” includes any USM pointers previously provided to the object via the class constructor or other class methods.

Host Thread Safety

All oneMKL member and non-member functions shall be host thread safe. That is, they may be safely called simultaneously from concurrent host threads. However, oneMKL objects in class-based APIs may not be shared between concurrent host threads unless otherwise specified.

Memory Model

The oneMKL memory model shall follow directly from the oneAPI memory model. Mainly, oneMKL shall support two modes of encapsulating data for consumption on the device: the buffer memory abstraction model and the pointer-based memory model using Unified Shared Memory (USM). These two paradigms shall also support both synchronous and asynchronous execution models as described in Asynchronous Execution.

The Buffer Memory Model

The SYCL 1.2.1 specification defines the buffer container, templated on the provided data type, which encapsulates the data in a SYCL application across both host and devices. It provides the concept of accessors as the mechanism to access the buffer data, with different modes to read and/or write that data. These accessors allow SYCL to create and manage the data dependencies in the SYCL graph that order the kernel executions. With the buffer model, all data movement is handled by the SYCL runtime, supporting both synchronous and asynchronous execution.

oneMKL provides APIs where buffers (in particular 1D buffers, sycl::buffer<T,1>) contain the memory for all non-scalar input and output data arguments. See Synchronization When Using Buffers for details on how oneMKL routines manage any data dependencies with buffer arguments. Any higher-dimensional buffer must be converted to a 1D buffer prior to use in oneMKL APIs, e.g., via buffer::reinterpret.

Unified Shared Memory Model

While the buffer model is powerful and elegantly expresses data dependencies, it can be a burden for programmers to replace all pointers and arrays by buffers in their C++ applications. DPC++ also provides pointer-based addressing for device-accessible data, using the Unified Shared Memory (USM) model. Correspondingly, oneMKL provides USM APIs in which non-scalar input and output data arguments are passed by USM pointer.

USM devices and system configurations vary in their ability to share data between devices and between a device and the host. oneMKL implementations may only assume that user-provided USM pointers are accessible by the device associated with the user-provided queue. In particular, an implementation must not assume that USM pointers can be accessed by any other device, or by the host, without querying the DPC++ runtime. An implementation must accept any device-accessible USM pointer regardless of how it was created (sycl::malloc_device, sycl::malloc_shared, etc.).

Unlike buffers, USM pointers cannot automatically manage data dependencies between kernels. Users may use in-order queues to ensure ordered execution, or explicitly manage dependencies with sycl::event objects. To support the second use case, oneMKL USM APIs accept input events (prerequisites before computation can begin) and return an output event (indicating computation is complete). See Synchronization When Using USM APIs for details.

API Design

This section discusses the general features of oneMKL API design. In particular, it covers the use of namespaces and data types from C++, from DPC++ and new ones introduced for oneMKL APIs.

oneMKL namespaces

The oneMKL library uses C++ namespaces to organize routines by mathematical domain. All oneMKL objects and routines shall be contained within the oneapi::mkl base namespace. The individual oneMKL domains use a secondary namespace layer as follows:

| namespace | oneMKL domain or content |
| --- | --- |
| oneapi::mkl | oneMKL base namespace, contains general oneMKL data types, objects, exceptions and routines |
| oneapi::mkl::blas | Dense linear algebra routines from BLAS and BLAS-like extensions. The oneapi::mkl::blas namespace should contain two namespaces, column_major and row_major, to support both matrix layouts. See BLAS Routines |
| oneapi::mkl::lapack | Dense linear algebra routines from LAPACK and LAPACK-like extensions. See LAPACK Routines |
| oneapi::mkl::sparse | Sparse linear algebra routines from Sparse BLAS and Sparse Solvers. See Sparse Linear Algebra |
| oneapi::mkl::dft | Discrete Fourier Transforms. See Discrete Fourier Transform Functions |
| oneapi::mkl::rng | Random number generator routines. See Random Number Generators |
| oneapi::mkl::vm | Vector mathematics routines, e.g. trigonometric and exponential functions acting on elements of a vector. See Vector Math |
| oneapi::mkl::stats | Routines that compute basic statistical estimates for single and double precision multi-dimensional datasets. See Summary Statistics |

Note

Inside each oneMKL domain, there are many routines, classes, enums and objects defined which constitute the breadth and scope of that oneMKL domain. A library implementation of the oneMKL specification is permitted to implement one, several, or all of the domains in oneMKL. However, within an implementation of a specific domain, all relevant routines, classes, enums and objects (including those relevant enums and objects which live outside a particular domain in the general oneapi::mkl namespace) must be both declared and defined in the library, so that an application that uses that domain can build and link against that library implementation successfully.

It is, however, acceptable to throw the runtime exception oneapi::mkl::unimplemented inside the routines or class member functions in that domain that have not been fully implemented. For instance, a library may choose to implement the oneMKL BLAS functionality and, in particular, may choose to implement only the gemm API. In that case it must still include all the other routines in the blas namespace and throw the oneapi::mkl::unimplemented exception inside each of them.

In such a case, the library should clearly communicate to its users, in an easily understood form, which routines are implemented.
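A minimal host-only sketch of this rule follows. It does not use the real oneMKL headers: the namespace sketch::mkl, the simplified signatures (no queue argument), and the choice of scal and asum as the implemented and unimplemented routines are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <exception>
#include <vector>

// Stand-in types for illustration only; a real implementation would provide
// these in the oneapi::mkl namespace with the signatures from the specification.
namespace sketch { namespace mkl {

class exception : public std::exception {
public:
    const char* what() const noexcept override { return "sketch::mkl::exception"; }
};

class unimplemented : public exception {
public:
    const char* what() const noexcept override { return "routine not implemented"; }
};

namespace blas {

// Implemented routine (hypothetical, simplified signature): x <- alpha * x.
inline void scal(std::int64_t n, double alpha, std::vector<double>& x) {
    assert(n >= 0 && static_cast<std::size_t>(n) <= x.size());
    for (std::size_t i = 0; i < static_cast<std::size_t>(n); ++i) x[i] *= alpha;
}

// Declared and defined so that applications still link, but the body only
// reports that the routine is not available in this library.
inline double asum(std::int64_t /*n*/, const std::vector<double>& /*x*/) {
    throw unimplemented();
}

}  // namespace blas
}}  // namespace sketch::mkl

// Returns true if calling the stub reported "unimplemented".
inline bool asum_is_unimplemented() {
    try {
        sketch::mkl::blas::asum(0, {});
    } catch (const sketch::mkl::unimplemented&) {
        return true;
    }
    return false;
}
```

An application built against such a library compiles and links for every routine in the domain, and discovers missing functionality only at the call site, where it can catch the exception and fall back to another code path.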

Standard C++ datatype usage

oneMKL uses C++ STL data types for scalars where applicable:

  • Integer scalars are C++ fixed-size integer types (std::intN_t, std::uintN_t).

  • Complex numbers are represented by C++ std::complex types.

In general, scalar integer arguments to oneMKL routines are 64-bit integers (std::int64_t or std::uint64_t). Integer vectors and matrices may have varying bit widths, defined on a per-routine basis.
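As a small illustration of these conventions, the sketch below declares a hypothetical helper (scale is not a oneMKL routine) whose size and stride arguments are std::int64_t and whose data is std::complex<double>.

```cpp
#include <cassert>
#include <complex>
#include <cstdint>

// Scalar integer arguments are 64-bit on every platform.
static_assert(sizeof(std::int64_t) == 8, "std::int64_t is a 64-bit type");

// Hypothetical helper, not a oneMKL routine: scales n elements of x, taken
// with stride incx, by the complex factor alpha.
inline void scale(std::int64_t n, std::complex<double> alpha,
                  std::complex<double>* x, std::int64_t incx) {
    assert(n >= 0 && incx > 0);
    for (std::int64_t i = 0; i < n; ++i) {
        x[i * incx] *= alpha;
    }
}
```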

DPC++ datatype usage

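oneMKL uses the following DPC++ data types:

  • SYCL queue sycl::queue for scheduling kernels on a SYCL device. See Use of Queues.

  • SYCL buffer sycl::buffer for buffer-based memory access. See The Buffer Memory Model.

  • Unified Shared Memory (USM) pointers for pointer-based memory access. See Unified Shared Memory Model.

  • SYCL event sycl::event for output event synchronization in oneMKL routines with USM pointers. See Synchronization When Using USM APIs.

  • Vector of SYCL events std::vector<sycl::event> for input event synchronization in oneMKL routines with USM pointers. See Synchronization When Using USM APIs.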

Note

The class sycl::vector_class has been removed from SYCL 2020. The standard class std::vector should be used instead for vectors of SYCL events in oneMKL routines with USM pointers.

oneMKL defined datatypes

oneMKL dense and sparse linear algebra routines use scoped enum types as type-safe replacements for the traditional character arguments used in C/Fortran implementations of BLAS and LAPACK. These types all belong to the oneapi::mkl namespace.

Each enumeration value comes with two names: A single-character name (the traditional BLAS/LAPACK character) and a longer, more descriptive name. The two names are exactly equivalent and may be used interchangeably.
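One way to give an enumeration value two interchangeable names is to declare both as enumerators with the same underlying value. The sketch below shows this for a stand-in transpose type. The underlying type and the numeric values are assumptions of this sketch, and describe is a hypothetical helper.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Stand-in for an enumeration with short and long names. The numeric values
// are arbitrary choices; the text above only requires that both names of a
// pair denote the same value.
enum class transpose : char {
    nontrans = 0,
    trans = 1,
    conjtrans = 2,
    N = nontrans,
    T = trans,
    C = conjtrans
};

static_assert(transpose::N == transpose::nontrans, "short and long names agree");
static_assert(transpose::T == transpose::trans, "short and long names agree");
static_assert(transpose::C == transpose::conjtrans, "short and long names agree");

// Hypothetical helper: a switch needs only one label per value, and a caller
// may pass either spelling.
inline std::string describe(transpose t) {
    switch (t) {
        case transpose::nontrans: return "no transpose";
        case transpose::trans: return "transpose";
        case transpose::conjtrans: return "conjugate transpose";
    }
    return "unknown";
}

}  // namespace sketch
```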

transpose

The transpose type specifies whether an input matrix should be transposed and/or conjugated. It can take the following values:

| Short Name | Long Name | Description |
| --- | --- | --- |
| transpose::N | transpose::nontrans | Do not transpose or conjugate the matrix. |
| transpose::T | transpose::trans | Transpose the matrix (without complex conjugation). |
| transpose::C | transpose::conjtrans | Perform Hermitian transpose (transpose and conjugate). Is the same as transpose::trans for real matrices. |

uplo

The uplo type specifies whether the lower or upper triangle of a triangular, symmetric, or Hermitian matrix should be accessed. It can take the following values:

| Short Name | Long Name | Description |
| --- | --- | --- |
| uplo::U | uplo::upper | Access the upper triangle of the matrix. |
| uplo::L | uplo::lower | Access the lower triangle of the matrix. |

In both cases, elements that are not in the selected triangle are not accessed or updated.
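To make "not accessed" concrete, the following host-only sketch multiplies a symmetric matrix by a vector while reading only the selected triangle. The enum is a stand-in for oneapi::mkl::uplo, the function symv is a hypothetical simplified helper rather than the oneMKL routine of that name, and the row-by-row storage is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

enum class uplo { upper, lower };  // stand-in for oneapi::mkl::uplo

// Computes y = A * x for a symmetric n-by-n matrix A stored row by row in a
// (a[i * n + j] is element (i, j)). Only the selected triangle is read; an
// element outside it is taken from its mirror position inside the triangle.
inline std::vector<double> symv(uplo ul, std::size_t n,
                                const std::vector<double>& a,
                                const std::vector<double>& x) {
    assert(a.size() >= n * n && x.size() >= n);
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            const bool stored = (ul == uplo::upper) ? (j >= i) : (j <= i);
            const double aij = stored ? a[i * n + j] : a[j * n + i];
            y[i] += aij * x[j];
        }
    }
    return y;
}

}  // namespace sketch
```

Because the other triangle is never read, a caller may leave it uninitialized or reuse that storage for unrelated data.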

diag

The diag type specifies the values on the diagonal of a triangular matrix. It can take the following values:

| Short Name | Long Name | Description |
| --- | --- | --- |
| diag::N | diag::nonunit | The matrix is not unit triangular. The diagonal entries are stored with the matrix data. |
| diag::U | diag::unit | The matrix is unit triangular (the diagonal entries are all 1’s). The diagonal entries in the matrix data are not accessed. |
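The effect of the two values can be seen in a forward-substitution sketch for a lower triangular system. The enum below is a stand-in for oneapi::mkl::diag, and solve_lower is a hypothetical helper with row-by-row storage, not a oneMKL routine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

enum class diag { nonunit, unit };  // stand-in for oneapi::mkl::diag

// Solves L * x = b by forward substitution, where L is an n-by-n lower
// triangular matrix stored row by row in a (a[i * n + j] is element (i, j)).
// With diag::unit the stored diagonal is never read and is treated as 1.
inline std::vector<double> solve_lower(diag d, std::size_t n,
                                       const std::vector<double>& a,
                                       const std::vector<double>& b) {
    assert(a.size() >= n * n && b.size() >= n);
    std::vector<double> x(n);
    for (std::size_t i = 0; i < n; ++i) {
        double sum = b[i];
        for (std::size_t j = 0; j < i; ++j) sum -= a[i * n + j] * x[j];
        x[i] = (d == diag::unit) ? sum : sum / a[i * n + i];
    }
    return x;
}

}  // namespace sketch
```

For the same stored data, the two settings generally give different solutions, because diag::unit replaces whatever is stored on the diagonal with ones.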

side

The side type specifies the order of matrix multiplication when one matrix has a special form (triangular, symmetric, or Hermitian):

| Short Name | Long Name | Description |
| --- | --- | --- |
| side::L | side::left | The special form matrix is on the left in the multiplication. |
| side::R | side::right | The special form matrix is on the right in the multiplication. |

offset

The offset type specifies whether the offset to apply to an output matrix is a fixed offset, a column offset or a row offset. It can take the following values:

| Short Name | Long Name | Description |
| --- | --- | --- |
| offset::F | offset::fix | The offset to apply to the output matrix is fixed: all the entries in the C_offset matrix have the same value, given by the first element in the co array. |
| offset::C | offset::column | The offset to apply to the output matrix is a column offset: all the columns in the C_offset matrix are the same and are given by the elements in the co array. |
| offset::R | offset::row | The offset to apply to the output matrix is a row offset: all the rows in the C_offset matrix are the same and are given by the elements in the co array. |
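The three cases can be sketched by building the C_offset matrix explicitly on the host. The enum is a stand-in for oneapi::mkl::offset, make_c_offset is a hypothetical helper, and the row-by-row storage of the result is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch {

enum class offset { fix, column, row };  // stand-in for oneapi::mkl::offset

// Builds the m-by-n C_offset matrix (stored row by row) from the co array:
//   fix:    every entry equals co[0]
//   column: every column equals co, so entry (i, j) is co[i] (co has m entries)
//   row:    every row equals co, so entry (i, j) is co[j] (co has n entries)
inline std::vector<std::int32_t> make_c_offset(offset kind, std::size_t m,
                                               std::size_t n,
                                               const std::vector<std::int32_t>& co) {
    const std::size_t needed =
        (kind == offset::fix) ? 1 : (kind == offset::column) ? m : n;
    assert(co.size() >= needed);
    std::vector<std::int32_t> c(m * n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            std::int32_t v = co[0];
            if (kind == offset::column) v = co[i];
            if (kind == offset::row) v = co[j];
            c[i * n + j] = v;
        }
    }
    return c;
}

}  // namespace sketch
```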

index_base

The index_base type specifies how values in index arrays are interpreted. For instance, a sparse matrix stores nonzero values and the indices that they correspond to. The indices are traditionally provided in one of two forms: C/C++-style using zero-based indices, or Fortran-style using one-based indices. The index_base type can take the following values:

| Name | Description |
| --- | --- |
| index_base::zero | Index arrays for an input matrix are provided using zero-based (C/C++ style) index values. That is, indices start at 0. |
| index_base::one | Index arrays for an input matrix are provided using one-based (Fortran style) index values. That is, indices start at 1. |
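As an illustration, the sketch below reads the same sparse matrix from zero-based and one-based index arrays. It assumes compressed sparse row (CSR) storage, which this section does not prescribe. The enum is a stand-in for oneapi::mkl::index_base and columns_of_row is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch {

enum class index_base { zero, one };  // stand-in for oneapi::mkl::index_base

// Returns the zero-based column indices of the nonzeros in one row of a
// matrix stored in CSR form. The row argument is always zero-based here;
// only the values stored in row_ptr and col_ind depend on base.
inline std::vector<std::int64_t> columns_of_row(
        index_base base,
        const std::vector<std::int64_t>& row_ptr,
        const std::vector<std::int64_t>& col_ind,
        std::size_t row) {
    assert(row + 1 < row_ptr.size());
    const std::int64_t shift = (base == index_base::one) ? 1 : 0;
    std::vector<std::int64_t> cols;
    for (std::int64_t k = row_ptr[row] - shift; k < row_ptr[row + 1] - shift; ++k) {
        cols.push_back(col_ind[static_cast<std::size_t>(k)] - shift);
    }
    return cols;
}

}  // namespace sketch
```

In the usage below, both calls describe a 2-by-3 matrix with nonzeros at positions (0, 0), (0, 2) and (1, 1), written in zero-based coordinates.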

layout

The layout type specifies how a dense matrix A with leading dimension lda is stored as a one-dimensional array in memory. The layouts are traditionally provided in one of two forms: C/C++-style using row_major layout, or Fortran-style using column_major layout. The layout type can take the following values:

| Short Name | Long Name | Description |
| --- | --- | --- |
| layout::R | layout::row_major | For row major layout, the elements of each row of a dense matrix A are contiguous in memory, while the elements of each column are at distance lda from the element in the same column and the previous row. |
| layout::C | layout::col_major | For column major layout, the elements of each column of a dense matrix A are contiguous in memory, while the elements of each row are at distance lda from the element in the same row and the previous column. |
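The two descriptions correspond to the usual address formulas for element (i, j) with zero-based i and j. A minimal sketch, with a stand-in enum and a hypothetical helper element_index:

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

enum class layout { row_major, col_major };  // stand-in for oneapi::mkl::layout

// Position of element (i, j) of a dense matrix in its one-dimensional array.
//   row major:    consecutive rows are lda elements apart    -> i * lda + j
//   column major: consecutive columns are lda elements apart -> i + j * lda
inline std::size_t element_index(layout l, std::size_t i, std::size_t j,
                                 std::size_t lda) {
    assert(lda > 0);
    return (l == layout::row_major) ? i * lda + j : i + j * lda;
}

}  // namespace sketch
```

For the 2-by-3 matrix with rows (1, 2, 3) and (4, 5, 6), the row major array with lda = 3 is 1, 2, 3, 4, 5, 6 and the column major array with lda = 2 is 1, 4, 2, 5, 3, 6. Element (1, 2) sits at position 5 in both.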

Note

oneMKL Appendix may contain other API design decisions or recommendations that may be of use to the general developer of oneMKL, but which may not necessarily be part of the oneMKL specification.

Exceptions and Error Handling

oneMKL error handling relies on the mechanism of C++ exceptions. Should an error occur, it is propagated to the point of the function call, where it can be caught using the standard C++ error handling mechanism.

Exception classification

Exception classification in oneMKL is aligned with the C++ Standard Library classification. oneMKL introduces the class oneapi::mkl::exception, which inherits from std::exception and serves as the base class of the oneMKL exception hierarchy. All other oneMKL exception classes are derived from this base class, and all exceptions thrown by oneMKL routines inherit from it.

This specification does not require implementations to perform error-checking. However, if an implementation does provide error-checking, it shall use the following exception classes. Additional implementation-specific exception classes can be used for exceptional conditions not fitting any of these classes.

Common exceptions

| Exception class | Description |
| --- | --- |
| oneapi::mkl::exception | Reports a general unspecified problem |
| oneapi::mkl::unsupported_device | Reports a problem when the routine is not supported on a specific device |
| oneapi::mkl::host_bad_alloc | Reports a problem that occurred during memory allocation on the host |
| oneapi::mkl::device_bad_alloc | Reports a problem that occurred during memory allocation on a specific device |
| oneapi::mkl::unimplemented | Reports a problem when a specific routine has not been implemented for the specified parameters |
| oneapi::mkl::invalid_argument | Reports a problem when arguments to the routine were rejected |
| oneapi::mkl::uninitialized | Reports a problem when a handle (descriptor) has not been initialized |
| oneapi::mkl::computation_error | Reports any computation errors that have occurred inside a oneMKL routine |
| oneapi::mkl::batch_error | Reports errors that have occurred inside a batch oneMKL routine |
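Because every class in the table derives from one base, callers can order catch blocks from most to least specific. The sketch below uses stand-in classes in a sketch::mkl namespace. The constructors taking a message string, the routine function, and its checks are assumptions for illustration, not part of the specification.

```cpp
#include <cassert>
#include <cstdint>
#include <exception>
#include <string>
#include <utility>

namespace sketch { namespace mkl {

// Stand-in base class, derived from std::exception as described above.
class exception : public std::exception {
public:
    explicit exception(std::string msg) : msg_(std::move(msg)) {}
    const char* what() const noexcept override { return msg_.c_str(); }
private:
    std::string msg_;
};

class invalid_argument : public exception {
public:
    using exception::exception;
};

class host_bad_alloc : public exception {
public:
    using exception::exception;
};

// Hypothetical routine that checks its argument before doing any work.
inline void routine(std::int64_t n) {
    if (n < 0) throw invalid_argument("n must be non-negative");
    if (n > 1000) throw host_bad_alloc("workspace too large for this sketch");
}

}}  // namespace sketch::mkl

// Handlers run from most specific to least specific.
inline std::string call_and_report(std::int64_t n) {
    try {
        sketch::mkl::routine(n);
        return "ok";
    } catch (const sketch::mkl::invalid_argument& e) {
        return std::string("invalid argument: ") + e.what();
    } catch (const sketch::mkl::exception& e) {
        return std::string("oneMKL error: ") + e.what();
    } catch (const std::exception& e) {
        return std::string("other error: ") + e.what();
    }
}
```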

LAPACK specific exceptions

| Exception class | Description |
| --- | --- |
| oneapi::mkl::lapack::exception | Base class for all LAPACK exceptions, providing access to the info code familiar to users of the conventional LAPACK API. All LAPACK-related exceptions can be handled with a catch block for this class. |
| oneapi::mkl::lapack::invalid_argument | Reports errors when arguments provided to the LAPACK subroutine are inconsistent or do not match expected values. The class extends the base oneapi::mkl::invalid_argument with the ability to access the conventional status info code. |
| oneapi::mkl::lapack::computation_error | Reports computation errors that have occurred during a call to a LAPACK subroutine. The class extends the base oneapi::mkl::computation_error with the ability to access the conventional status info code familiar to LAPACK users. |
| oneapi::mkl::lapack::batch_error | Reports errors that have occurred during batch LAPACK computations. The class extends the base oneapi::mkl::batch_error with the ability to access individual exception objects for each of the issues observed in a batch, and an info code. The info code contains the number of errors that occurred in a batch. The positions of the problems in a supplied batch that experienced issues during computations can be retrieved with the ids() method, and the list of particular exceptions can be obtained with the exceptions() method of the exception object. Possible exceptions for a batch are documented for the corresponding non-batch API. |

Other Features

This section covers all other features in the design of oneMKL architecture.

Specification Version and Compliance

Each oneMKL domain must define a preprocessor macro to represent the version of the specification that the implementation is compliant with.

The macros for each domain are listed as follows:

ONEMKL_BLAS_SPEC_VERSION
ONEMKL_LAPACK_SPEC_VERSION
ONEMKL_SPBLAS_SPEC_VERSION
ONEMKL_DFT_SPEC_VERSION
ONEMKL_RNG_SPEC_VERSION
ONEMKL_STATS_SPEC_VERSION
ONEMKL_VM_SPEC_VERSION

The macro value is formed by concatenating the digits of the specification version in the format <MAJOR><MINOR>, where MINOR always uses two digits. This value can be used to check the compatibility of the implementation with a given specification version. Note that the revision is not included, because it reflects changes only to the specification document that do not affect the implementation. If the implementation is not compliant with any release of the specification, then the macro must have a numerical value of 000.

Version Example

oneAPI 1.1 rev 1 will be represented as a numerical value of 101
oneAPI 1.2 rev 1 will be represented as a numerical value of 102
oneAPI 1.2 rev 2 will be represented as a numerical value of 102

Macro Example

// For oneAPI 1.2 rev 1
#define ONEMKL_BLAS_SPEC_VERSION 102

// For oneAPI 1.2 rev 2
#define ONEMKL_DFT_SPEC_VERSION 102

// For oneAPI 1.3 rev 1
#define ONEMKL_VM_SPEC_VERSION 103
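A short sketch of how application code might use such a macro follows. The fallback #define exists only so the example compiles on its own, and the helper names (spec_version, blas_targets_1_2_or_later, blas_is_compliant) are hypothetical.

```cpp
#include <cassert>

// Stand-in definition so this sketch compiles on its own. In practice the
// macro would come from the implementation's headers.
#ifndef ONEMKL_BLAS_SPEC_VERSION
#define ONEMKL_BLAS_SPEC_VERSION 102
#endif

// Hypothetical helper: encodes <MAJOR><MINOR> with a two-digit MINOR.
inline int spec_version(int major, int minor) {
    assert(major >= 0 && minor >= 0 && minor < 100);
    return major * 100 + minor;
}

// Compile-time feature gate: true when the BLAS domain targets 1.2 or later.
#if ONEMKL_BLAS_SPEC_VERSION >= 102
constexpr bool blas_targets_1_2_or_later = true;
#else
constexpr bool blas_targets_1_2_or_later = false;
#endif

// A value of 0 (written 000 in the text above) marks an implementation that
// is not compliant with any release.
inline bool blas_is_compliant() { return ONEMKL_BLAS_SPEC_VERSION != 0; }
```

Since the revision is not encoded, code gated this way behaves identically for 1.2 rev 1 and 1.2 rev 2.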

Versioning details are defined here: uxlfoundation/oneAPI-spec

Pre/Post Condition Checking

The individual oneMKL computational routines will define any preconditions and postconditions and will define in this specification any specific checks or verifications that should be enabled for all implementations.