Design

This document provides an overview of the design of the oneAPI Construction Kit. Its primary goal is to give developers a grounding in the project structure and a clear idea of where specific components reside within the directory tree.

Project Structure

A common structure is used as much as possible throughout the oneAPI Construction Kit repository. Implementations of open standards, such as OpenCL, exist as subdirectories of the source directory, while shared components, or modules, reside in the modules directory.

Throughout the repository the following layout is adhered to when applicable:

  • include the public interface of the component, except for standard APIs whose interface is defined by a third party

  • source source code implementing the component

  • test source code for building test suites for testing the component

  • tools tools which enable standalone usage of the component

  • examples example applications detailing basic usage of the component

  • scripts utilities to aid with building and testing components

  • external contains external dependencies, usually with different license agreements

Modules

Many components of the oneAPI Construction Kit are designed to be reused by multiple open standard implementations or externally. These components are referred to as modules and can be found in the modules directory. Modules follow the same directory layout as the root directory, described above, with the external interface being found in the header files located in the include directory and the implementation located in the source directory. Not all modules have test suites but those that are shared between projects within the oneAPI Construction Kit umbrella do.

Builtins

OpenCL C specifies over 10000 builtin functions, which OpenCL C programs rely on. The oneAPI Construction Kit defines builtin functions in the builtins module. The declarations of all the OpenCL C builtin functions can be found in include/builtins/builtins.h; this includes both type and function declarations. Builtin function definitions are spread across multiple files: those implemented in plain OpenCL C can be found in source/builtins.cl, while builtins implemented using C++, to take advantage of templates, can be found in source/builtins.cpp. These files are automatically generated by running the bash script scripts/generate_header.sh.

In our build we create Pre-Compiled Header (PCH) files for builtins.h and embed them inside our library. These are then used as an implicit header for all OpenCL C kernels during compilation. Additionally, the implementations of our builtins are compiled down to LLVM bitcode and also embedded, which accounts for a substantial part of our build time. To reduce this latency, a separate toolchain specifically for compiling the builtins can be set using the CMake option CA_BUILTINS_TOOLS_DIR, which is useful for pointing a debug build at a release toolchain. Alternatively, to avoid compiling the builtins completely, CA_EXTERNAL_BUILTINS can be set to ON and CA_EXTERNAL_BUILTINS_DIR set to point to the directory containing pre-generated builtins.

All builtins that implement math operations are provided by abacus, which is shared across multiple projects; and all builtins implementing image functionality are provided by libimg.

Compilation

The builtin functions are compiled offline into LLVM bitcode (.bc files) using clang, not the platform compiler used to build the OpenCL driver's shared library. This matches the frontend used to compile OpenCL C source code in an application using the oneAPI Construction Kit’s OpenCL driver. On platforms which support linking binary blobs, these bitcode files are compiled into the OpenCL driver by the linker (via an .rc file on Windows and a small .asm file accessing the data section on Linux); otherwise a fallback mechanism transforms the binary into a header file containing a char array.

Compiling the builtins source code to LLVM bitcode is the key to being able to use multiple input languages, both OpenCL C and C++. The resulting bitcode files are linked together using llvm-link into a single bitcode file. Using this mechanism allows the builtins modules to take advantage of C++ for function overloading and function templates to increase code reuse, especially for the OpenCL C conversion and type casting builtins.

Additionally, the include/builtins/builtins.h header file is compiled into a pre-compiled header (.pch) file. This is done so that the frontend compiler in the oneAPI Construction Kit does not have to compile the entire header file, which is over 11000 lines long, each time an application invokes clCompileProgram or clBuildProgram. The pre-compiled header, along with the bitcode files containing the definitions of the OpenCL C builtin functions, is embedded into the OpenCL driver.

In order to access the embedded bitcode and the pre-compiled header from within the OpenCL driver, the builtins module provides a static library containing the binary along with an API to make the binary accessible. These are contained in the include/builtins/bakery.h and source/bakery.cpp header and source files, so named because the binary is baked into the library.

Abacus

Abacus is our high-precision math library, crafted for OpenCL's demanding precision requirements. Key features include:

  • OpenCL 1.2 floating point math functions.

  • Heavily optimized for GPU, DSP and vector architectures.

  • Satisfies the high precision requirements for OpenCL conformance.

  • High code quality and documentation for easy maintainability.

Abacus Integration

Abacus is integrated into the build process such that the .cl and .cpp files that implement the functionality are built into an LLVM bitcode file. We pass this module to ComputeMux Compiler backends via the builtins parameter of compiler::Target::init(). ComputeMux backends can then link against the bitcode file to bring in the definitions their kernels require.

libimg

Image support in OpenCL is optional, based on the CL_DEVICE_IMAGE_SUPPORT device property, but when enabled the libimg module provides functionality for the OpenCL API and implements the OpenCL C builtin functions. The image support provided by libimg is a software implementation intended for targeting CPU architectures, but it can also be used on other architectures which do not have dedicated texture hardware.

Note that libimg is actually a shared module; however, since it implements the OpenCL C image builtins, it resides within the builtins module alongside abacus.

libimg Integration

In order to build libimg it is necessary to supply a header called image_library_integration.h that defines the types and functions used by the oneAPI Construction Kit; this filename is important as it is hard-coded into the libimg source files.

Shared

Both the OpenCL API and kernel sides of libimg, described below, depend on a shared set of types and constants defined in the include/libimg/shared.h header file. The constants defined in this header are defined by the OpenCL C specification. The header also defines the image descriptor used to describe an image's dimensions, data format, and layout, and to provide access to the image's data.

Host and Validate

When image support is enabled, the image functions in the api module, such as api::CreateImage, call analogous functions in the include/libimg/host.h header file, in this example libimg::HostCreateImage. Along with the implementations of OpenCL API calls, a set of helper functions is provided to aid with integration into the OpenCL driver; these functions serve various purposes related to the data layouts and memory offsets of the image.

Functions that implement the guts of OpenCL API entry points, such as libimg::HostFillImage, do not perform any validation of input parameters as defined in the OpenCL specification; instead the validation code resides in the include/libimg/validate.h and source/validate.cpp header and source files.

Kernel

The implementations of the OpenCL C builtin image functions are declared in the header include/libimg/kernel.h, with the definitions living in the source/kernel.cpp source file; this code is only executed in OpenCL C kernels.

For each OpenCL C builtin, such as float4 read_imagef(image3d_t, sampler_t, int4), there is an analogous function, in this case Float4 __Codeplay_read_imagef_3d(Image, Sampler, Int4). Note that the Image type does not contain the dimensionality of the image; instead this has been moved into the function name to retain the information.

Because the libimg function signature is different from the signature of the builtin produced by the compiler frontend, the builtin must be replaced. Replacement of the builtin is performed in the pass compiler::ImageArgumentSubstitutionPass, documented above.

printf

The oneAPI Construction Kit’s printf implementation works by adding an extra buffer argument to kernels and then replacing the calls to printf with code that loads the arguments of the printf call into the buffer. Internally, the buffer argument is created and added to the kernel just before an nd-range command is executed; just after the nd-range command has finished, the buffer is read and its contents are unpacked and printed using the host printf.

printf buffer

The size of the buffer is provided by the device through the CL_DEVICE_PRINTF_BUFFER_SIZE property.

Data in the printf buffer is organized in the following way:

[ wg0: [<length><overflow><id><args...> ...] | ... | wgn: [<length><overflow><id><args ...> ...] ]

First the buffer is split per work group so that each work group has its own chunk of buffer to use; this is necessary since we can’t synchronize between work groups. These chunks must be at least 8 bytes; if they are not, the kernel execution will fail with CL_OUT_OF_RESOURCES. This limitation only affects programs that use printf.

Then, in each work group’s buffer chunk, the first 8 bytes are used to store the length of the data that was stored, as well as the amount of this length that actually overflowed; these two values are used to synchronize between work items inside the work group. These 8 bytes are initialized on the host for each work group: the length to 8 (accounting for these 8 bytes), and the overflow value to 0 (no overflow at the beginning).

In work-group synchronization and overflows

In this part the buffer is assumed to be the chunk of buffer allocated to a work group.

Data in the buffer for a work group is organized in the following way:

[ <length><overflow><id><arg><arg>...<id>... ]

The first field, length, stores the amount of data in bytes that printf calls attempted to write to the buffer; the second field, overflow, stores the amount of data that printf calls couldn’t write to the buffer because they overflowed.

By subtracting the value of the overflow field from the value of the length field, we get exactly the amount of meaningful data that the buffer contains.

Each printf call first calls atomic_add on the length field with the amount of space it needs in the buffer. This effectively reserves a chunk of the buffer for the printf call, starting at the value returned by atomic_add and of the size required for the printf call.

Then the printf call will check if the reserved chunk of buffer is actually within the bounds of the buffer. If it isn’t, the printf call is overflowing; in this case the call will not write anything to the buffer and will return -1.

In addition, the overflowing printf call will atomically add the amount of data it wanted to write to the overflow field. This is necessary because at this point the length of this call is accounted for in the length field, but we can’t simply subtract the size of the call back from the length field in a thread-safe way; instead we keep track of how much of the length field is data that wasn’t actually written to the buffer.

This also means that if a printf call overflows, every following printf call will overflow as well, because after a call overflowed the length field will hold a value bigger than the size of the buffer, so the chunk of buffer that new printf calls will attempt to reserve will necessarily be out of bounds. It also means that the part of the buffer that was reserved by the first printf call to overflow will be left unused.

Argument packing

This section describes how a printf call will write its data in the buffer chunk it reserved.

First the printf call will write four bytes corresponding to its id; the id is a value determined at compile time and is used by the host to recognize the printf call and deduce how to unpack the argument data.

It will then write its argument data, which may be nothing if the printf call doesn’t need to send data back to the host (typically calls with just the format string or just string arguments). The argument data, if present, contains the arguments packed one after the other in the buffer, as described by the printf descriptor matching the id of the printf call, which is created on the host at compile time.

Format string and string arguments

During compilation, we go over all the printf calls and validate their format string. If they are invalid, we simply replace their return value with -1 as is mandated by the specification.

If a printf call is valid, we give it an id and store data about it: specifically, the format string and the string arguments. Since these are known at compile time there is no need to transfer them from the device. We also store information about the arguments of the printf call, which allows us to properly interpret the data retrieved from the device.

The compiler also transforms the OpenCL C printf format string into a C99 printf format string that can directly be used on the host.

Limitations
  • The * specifier for width and precision is not supported.

  • The buffer is split per work group, so a high number of work groups will greatly limit the amount of space available to each work group, even if only one of them ever prints.

  • The host and the device are assumed to have the same endianness.

Arm denormal floating points support

  • Single precision: no support for denormals, they are flushed to zero.

  • Double precision: denormals are supported.

For floating point operations on Arm, we have access to the VFP, which can run single and double precision operations and can be configured either to support denormals or to flush them to zero. We also have the faster NEON SIMD extension, which can run vectors of single precision operations but has no denormal support (denormals are flushed to zero).

The OpenCL 1.2 specification mandates that if doubles are supported, denormal double precision floating points must be supported as well, so because we want double support we can’t disable denormal support altogether.

So we enable the neon LLVM feature to run single precision vector floating point operations with NEON, and we also enable the neonfp LLVM feature to force the use of NEON for scalar single precision floating point operations as well. With this we can then enable denormal support on the VFP for double precision floating point operations.

Note that in this setup, scalar single precision denormals are not fully flushed to zero, as some basic operations are still run on the VFP, which has denormal support enabled.

ComputeMux Runtime

The mux module defines an API layer providing an interface between hardware target specific code and general OpenCL implementation code. The mux API is set up to support multiple targets, one of which is the host CPU target that is described below. mux is a shared module and must support multiple open standards, not just OpenCL. For more detail see this section. Documentation on how the OpenCL API maps onto the ComputeMux spec in our implementation is also available here.

The include/mux/mux.h header file is generated, using a Python script, from the tools/api/mux.xml schema. The API defined in this header is the public API used in OpenCL code. Additional files are also generated from the schema into the build directory, where you will find: the include/mux/select.h header, which marshals the selection of the desired mux target; and the include/mux/config.h header, which defines an array of all targets' device creation entry points used to initialize each target.

Each entry point in the API performs parameter checking before passing the inputs on to the selected target. For example, the error checking for the muxCreateBuffer entry point can be found in the source/buffer.cpp source file. The muxSelectCreateBuffer inline function is defined in the muxSelect.h header; it selects the desired mux target based on the device->id member of the mux_device_t object.

A specification for mux is available, describing in more detail the purpose, valid usage, and expected error codes of each entry point.

ComputeMux Compiler

The compiler module defines an API layer and a set of C++ classes providing a compiler suite for ComputeMux Runtime targets.

The compiler module is structured as a set of virtual interfaces and a loader library, with a number of concrete implementations for different ComputeMux targets. The virtual interfaces reside in include/compiler/*.h, the library entry point resides in library/include/library.h and library/source/library.cpp, the dynamic loader library resides in loader/include/loader.h and loader/source/loader.cpp, and the various implementations reside in targets/*/. More information on the structure can be found here.

All compiler implementations report a static compiler::Info object describing the mux_device_info_t that it targets, and what features are available. An implementation can be selected by calling either compiler::compilers() or compiler::getCompilerForDevice().

A specification for compiler is available, describing in more detail the purpose, valid usage, and expected error codes of each entry point.

Host

The host module is an implementation of the mux and compiler APIs targeting the host system’s CPU; this includes targets such as X86, Arm, and AArch64. Documentation of the implementation detail of host can be found here. host is also shared with other oneAPI Construction Kit projects outside of OpenCL.

Following the same file structure as mux, host lays out code on a per object basis. The host::queue_s, which inherits from mux_queue_s, is declared in the include/host/queue.h header file; its definition, and the entry points acting on it, are defined in the source file source/queue.cpp.

Vecz

Vecz is an LLVM IR level SPMD (Single Program Multiple Data) vectorizer. It is contained in a module so that it can be built as a standalone library and shipped as a separate product. This does not prevent it from being integrated into other modules, such as host, which utilizes the vectorizer.

For detailed information about the design and implementation of VECZ see its documentation.

Cargo

The cargo module contains a collection of Standard Template Library (STL) like containers that take into consideration the specific needs of the oneAPI Construction Kit. The oneAPI Construction Kit embeds LLVM within the driver, and LLVM requires being built without exceptions. The constructors of containers like std::vector perform allocations from the free store and then immediately dereference them, yet with exceptions disabled there is no error reporting mechanism for allocation failures. The containers in cargo aim to never allocate in a constructor; this allows an error to be returned from member functions which may perform an allocation.

The cargo::small_vector class template, inspired by llvm::SmallVector, provides a std::vector like container with a tunable size small buffer optimization. Member functions like std::vector::insert, which are specified to return an iterator, may also perform an allocation which might fail. In order for cargo::small_vector::insert to maintain a familiar API, it returns a cargo::error_or<iterator> object; this is a class which contains either a suitable error code or the value of the iterator.

Being able to return an error code or the desired value is a step in the right direction; however, we can do better. Member functions in cargo that return error codes also have the [[nodiscard]] attribute specified on compilers which support this functionality. Failing to check the return value of functions marked with [[nodiscard]] results in an error, guarding against mistakenly not checking for an error code.

Serialization Format

Serialized binaries have the following format: a null-terminated string “codeplay” at the start, then a kernel header which contains the kernel details, followed by the ELF data for the kernels.

Pseudocode:

/* Binary Prefix */
char[strlen("codeplay") + 1];

/* Kernel Header */
uint32_t type;
uint64_t number_of_printf_calls;

for (number_of_printf_calls) {
  uint64_t format_string_length;
  char[format_string_length] format_string;
  uint64_t types_length;
  uint32_t[types_length] types;
  uint64_t number_of_strings;
  for (number_of_strings) {
    uint64_t string_length;
    char[string_length] string;
  }
}

uint64_t number_of_kernels;

for (number_of_kernels) {
  uint32_t number_of_arguments;

  for (number_of_arguments) {
    uint32_t argument_type;
    char has_meta_data; /* 1 or 0 */
    if (has_meta_data) {
      uint32_t address_qualifier;
      uint32_t access_qualifier;
      uint32_t type_qualifier;
      uint64_t type_name_length;
      char[type_name_length] type_name_string;
      uint64_t argument_name_length;
      char[argument_name_length] argument_name_string;
    }
  }

  uint64_t[3] reqd_work_group_size;

  uint64_t kernel_name_length;
  char[kernel_name_length] kernel_name;
}

/* Rest of the file is ELF data for the kernels */

Note:

De-serialization has two contact points within the oneAPI Construction Kit: clc, our offline compiler, and the compiler itself.