ComputeMux Compiler
The ComputeMux Compiler is an OpenCL C and SPIR-V compiler that consumes the source code or IL provided by an application and compiles it into an executable that can be loaded by the ComputeMux Runtime.
The module aims to provide a boundary beyond which no LLVM type definitions pass in order to keep logical concerns separate.
Structure
The compiler module is structured as a set of virtual interfaces and a loader
library, with a number of concrete implementations for different ComputeMux
targets. The virtual interfaces reside in include/compiler/*.h, the library
entry point resides in library/include/library.h and
library/source/library.cpp, the dynamic loader library resides in
loader/include/loader.h and loader/source/loader.cpp, and the various
implementations reside in targets/*/.
Dynamic vs Static loading
The compiler is designed to be used either directly as a static library, or indirectly through a loader library.
Static Library
The simplest way to use the compiler library is to link with the
compiler-static target. Through this target, the compiler is accessed through
the compiler/library.h header. This target is unavailable if CMake is
configured with CA_RUNTIME_COMPILER_ENABLED set to OFF.
Dynamic Loader
A more flexible way to use the compiler is to link instead with the
compiler-loader target. Through this target, the compiler is accessed through
the compiler/loader.h header. This header is similar to compiler/library.h,
however each method additionally requires a compiler::Library object,
created using compiler::loadLibrary.
The purpose of the loader is to provide a compiler interface that will be
available when compiling the oneAPI Construction Kit regardless of the value of
CA_RUNTIME_COMPILER_ENABLED, and regardless of whether the
compiler is loaded at runtime or linked statically.
If CA_RUNTIME_COMPILER_ENABLED is set to ON then the
compiler loader can operate in two different ways:
If CA_COMPILER_ENABLE_DYNAMIC_LOADER is set to ON, then compiler::loadLibrary will look for compiler.dll (Windows) or libcompiler.so (Linux) in the default library search paths, depending on the platform.
If the environment variable CA_COMPILER_PATH is set, then its value will be used as the library name instead. Additionally, if CA_COMPILER_PATH is set to an empty string, then compiler::loadLibrary will skip loading entirely and will operate as if no compiler is available.
In this configuration, targets which depend on compiler-loader should also add the compiler target (the compiler shared library) as a dependency using add_dependencies. See source/cl/CMakeLists.txt for an example.
If CA_COMPILER_ENABLE_DYNAMIC_LOADER is set to OFF, then compiler-loader will transitively depend on compiler-static, and compiler::loadLibrary will instead immediately return an instance of compiler::Library that references the static functions directly.
If CA_RUNTIME_COMPILER_ENABLED is set to OFF, then
compiler::loadLibrary will always return nullptr, and therefore the compiler
will be disabled.
By default, the oneAPI Construction Kit is configured with
CA_RUNTIME_COMPILER_ENABLED set to ON and
CA_COMPILER_ENABLE_DYNAMIC_LOADER set to OFF.
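The library-name lookup described above for the dynamic loader can be sketched as a small decision function. This is only an illustrative model, not the loader's actual code: resolveCompilerLibraryName and its parameters are hypothetical names, and env_value stands in for the result of reading CA_COMPILER_PATH.

```cpp
#include <optional>
#include <string>

// Hypothetical helper that models only the library-name decision.
// 'env_value' is the value of CA_COMPILER_PATH, or std::nullopt when the
// variable is not set. The result is std::nullopt when loading is skipped.
std::optional<std::string> resolveCompilerLibraryName(
    const std::optional<std::string> &env_value, bool is_windows) {
  if (env_value) {
    if (env_value->empty()) {
      // An empty CA_COMPILER_PATH means "behave as if no compiler exists".
      return std::nullopt;
    }
    // A non-empty value replaces the default library name.
    return *env_value;
  }
  // Otherwise fall back to the platform default, which is then resolved
  // through the default library search paths.
  return std::string(is_windows ? "compiler.dll" : "libcompiler.so");
}
```

A caller would pass the resulting name to the platform's dynamic loading facility, and treat an empty result the same way as a disabled compiler.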
Selecting a compiler implementation
A compiler implementation is represented by a singleton instance of a
compiler::Info object. A list of all available compilers can be obtained by
calling compiler::compilers(), whilst compiler::getCompilerForDevice can
be used to select the relevant compiler for a particular mux_device_info_t.
Info
The compiler::Info struct (include/compiler/info.h) describes a
particular compiler implementation that can be used to compile programs for a
particular mux_device_info_t. Info contains information about the
compiler capabilities and metadata, and additionally acts as an interface for
creating a compiler::Target object.
Context
The compiler::Context interface (include/compiler/context.h) serves as
an opaque wrapper over the LLVM context object. This object can also contain
other shared state used by compiler modules, and contains a mutex that is locked
when interacting with a specific instance of LLVM.
Target
The compiler::Target interface (include/compiler/target.h) represents a
particular target device to generate machine code for. This object is also
responsible for creating instances of compiler::Module (described below).
Module
The compiler::Module interface (include/compiler/module.h) is responsible
for driving the compilation process from source code all the way to machine
code. It acts as a container for LLVM IR by wrapping the LLVM Module object, and
executes the required passes.
Compile OpenCL C
The clang frontend is instantiated in the compiler::Module::compileOpenCLC
member function. This is where:
The OpenCL C language options are specified to the frontend
User-specified macro definitions and include directories are set
Mux device force-include headers (if present) are set
A diagnostic handler is provided to report compilation errors
This compilation stage also introduces the pre-compiled builtins header
providing the OpenCL C builtin function declarations to the frontend.
Compilation occurs when the clang::EmitLLVMOnlyAction is invoked, then
ownership of the resulting llvm::Module is transferred to
compiler::Module to be used in the next stage. Any errors occurring
during compilation are returned in the error log specified during the
construction of compiler::Module, where they can be queried by the
application.
Note
In OpenCL, the compiler::Module::compileOpenCLC member function directly
maps to clCompileProgram but is also invoked by clBuildProgram.
Compile SPIR-V
The compiler::Module::compileSPIRV member function implements the SPIR-V
frontend. First, the SPIR-V module is handed to spirv_ll::Context::translate
to turn it into an llvm::Module, after which some additional fixup passes are applied.
Link
During compiler::Module::link, the LLVM module is first cloned before the list
of all provided compiler::Module objects is linked into the current module. As
in compiler::Module::compileOpenCLC, a diagnostics handler is
specified. If linking was successful, the previous module is destroyed and
ownership of the linked module is moved to compiler::Module.
Note
In OpenCL, the compiler::Module::link member function directly maps to
clLinkProgram but is also invoked by clBuildProgram.
Finalize
Finalization is the final compilation stage, which executes any remaining LLVM
passes and gets the module ready to be passed to the backend implementation. This is
where the majority of the LLVM passes are run, once again on a clone of the
llvm::Module owned by the compiler::Module object. Once the
llvm::PassManager has run all of the desired passes, the LLVM module is
ready to be turned into machine code, either through
compiler::Module::createBinary, or possibly deferred until runtime through the
compiler::Kernel object.
Kernel
The compiler::Kernel interface (include/compiler/kernel.h) represents a
single function entry point in a finalized compiler::Module. Its main purpose
is to provide an opportunity for the backend to perform optimizations and code
generation as late as possible. Most of the work is driven by the
compiler::Module::createSpecializedKernel method that creates a Mux runtime
kernel potentially optimized for a set of execution options that will be passed
to it during muxCommandNDRange.
OpenCL C Passes
The compiler module provides a number of LLVM passes, which are specific to
processing the LLVM IR produced by clang after compiling OpenCL C source code.
The IR is processed into a form that the backend can consume. The passes are
described immediately below in the order they are executed by the LLVM pass
manager.
Fast Math
The OpenCL standard defines an optional -cl-fast-relaxed-math flag that can be
set when building programs, allowing optimizations on floating point arithmetic
that could violate the IEEE-754 standard. When this flag is used we run the LLVM
module level pass FastMathPass to perform these optimizations straight after
frontend parsing from clang.
First the pass looks for any llvm::FPMathOperator instructions and for those
found sets the llvm::FastMathFlags attribute to enable all of:
Unsafe algebra - Operation can be algebraically transformed.
No NaNs - Arguments and results can be treated as non-NaN.
No Infs - Arguments and results can be treated as non-Infinity.
No Signed Zeros - Sign of zero can be treated as insignificant.
Allow Reciprocal - Reciprocal can be used instead of division.
In addition to the above, compiler::FastMathPass replaces maths and geometric
builtin functions with fast variants. Any math builtin function which has a
native equivalent is replaced with the native function, which is specified as
having an implementation-defined maximum error. For example, exp2(float4) is
replaced with native_exp2(float4).
Geometric builtins distance, length, and normalize are all defined in
OpenCL as having fast variants fast_distance, fast_length, and
fast_normalize which use reduced precision maths. If any of these functions
are present we also replace them with the relaxed alternative.
These builtin replacements are done by searching the LLVM module for call instructions which invoke the mangled name of a builtin function we want to replace. If the fast version of the builtin isn’t already in the module, i.e. it wasn’t called explicitly somewhere else, then we also need to add a function declaration for the mangled name of the fast builtin. Finally a new call instruction is created invoking the fast function declaration and the old call it replaces is deleted.
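The renaming step of this replacement can be sketched on mangled names alone. The sketch below is a simplified model, not the pass itself: fastVariantMangledName is a hypothetical helper, the replacement table is an illustrative subset, and it assumes plain Itanium-style names of the form _Z<length><name><parameter encoding>. Creating the declaration and rewriting the call instruction are left out.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <optional>
#include <string>

// Hypothetical helper: given the mangled name of a builtin, return the
// mangled name of its fast variant, or std::nullopt if none applies.
std::optional<std::string> fastVariantMangledName(const std::string &mangled) {
  // Illustrative subset of base-name replacements.
  static const std::map<std::string, std::string> replacements = {
      {"exp2", "native_exp2"},
      {"distance", "fast_distance"},
      {"length", "fast_length"},
      {"normalize", "fast_normalize"},
  };
  if (mangled.rfind("_Z", 0) != 0) {
    return std::nullopt;  // Not a mangled name.
  }
  // Parse the decimal length prefix that follows "_Z".
  std::size_t pos = 2;
  std::size_t length = 0;
  while (pos < mangled.size() &&
         std::isdigit(static_cast<unsigned char>(mangled[pos]))) {
    length = length * 10 + static_cast<std::size_t>(mangled[pos] - '0');
    ++pos;
  }
  if (length == 0 || length > mangled.size() - pos) {
    return std::nullopt;  // Malformed or unsupported name.
  }
  const std::string name = mangled.substr(pos, length);
  const std::string params = mangled.substr(pos + length);
  const auto found = replacements.find(name);
  if (found == replacements.end()) {
    return std::nullopt;  // No fast variant for this builtin.
  }
  // Re-mangle with the new base name and keep the parameter encoding.
  return "_Z" + std::to_string(found->second.size()) + found->second + params;
}
```

Under these assumptions, the mangled name for exp2(float4), _Z4exp2Dv4_f, maps to _Z11native_exp2Dv4_f.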
Bit Shift Fixup
LLVM IR does not define the results of oversized shift amounts, but some high-level languages such as OpenCL C do. As a result, shift instructions need to be updated to apply a ‘modulo N’ to the shift amount prior to the shift operation itself, where N is the bit width of the value to shift.
BitShiftFixupPass implements this as an LLVM function pass iterating over all
the function's instructions looking for shifts. For each shift found, the pass
uses the first operand to work out ‘N’ for the modulo based on the bit width of
the operand type. If the shift amount from the second operand is known to be
less than N, however, then we can skip the instruction without inserting a
modulo operation since the shift is not oversized. We can also skip shift
instructions that already have the modulo applied, which can happen if the
module was created by clang. Otherwise, the pass creates a modulo by generating
a ‘logical and’ instruction with operands N-1 and the original shift amount.
This masked value is then used to replace the second operand of the shift.
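The effect of the inserted mask can be illustrated in plain C++ for a 32-bit value. This is a sketch of the resulting arithmetic only, not of the pass; shiftLeftModulo is a hypothetical name.

```cpp
#include <cstdint>

// Shift left with the shift amount taken modulo N, where N = 32 is the bit
// width of the shifted value.
std::uint32_t shiftLeftModulo(std::uint32_t value, std::uint32_t amount) {
  constexpr std::uint32_t N = 32;
  // N is a power of two, so 'amount % N' is the same as 'amount & (N - 1)'.
  const std::uint32_t masked = amount & (N - 1);
  return value << masked;
}
```

With this masking, a shift by 33 behaves like a shift by 1, and a shift by 32 leaves the value unchanged.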
Software Division
The compiler pass SoftwareDivisionPass is a function level pass designed to
prevent undefined behaviour in division operations. To do this the pass adds
runtime checks using llvm::CmpInst instructions for two specific cases, divide
by zero and INT_MIN / -1. Because the behaviour is undefined in these cases, we
are free to choose the result of the divide operation when one of them is
detected. In both cases we set the divisor operand of the divide instruction to
+1, using an llvm::SelectInst that chooses between +1 and the original operand
based on the result of our checks.
Since IEEE-754 defines these error cases for floating point types, our runtime
checks only need to be applied to integer divides. This is ensured in the pass
by checking if the instruction opcode is one of SDiv, SRem, UDiv, or URem;
floating point divide instructions have opcode FDiv or FRem instead.
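The guarded division that results from these checks can be written out in C++ for a signed 32-bit divide. This is a sketch of the behaviour described above, with guardedSignedDivide as a hypothetical name; the pass itself operates on LLVM IR.

```cpp
#include <cstdint>
#include <limits>

// Signed division with the two undefined cases redirected to a divisor of +1.
std::int32_t guardedSignedDivide(std::int32_t dividend, std::int32_t divisor) {
  // Runtime checks, corresponding to the compare instructions.
  const bool divide_by_zero = (divisor == 0);
  const bool overflow =
      (dividend == std::numeric_limits<std::int32_t>::min()) && (divisor == -1);
  // Corresponds to the select between +1 and the original divisor.
  const std::int32_t safe_divisor = (divide_by_zero || overflow) ? 1 : divisor;
  return dividend / safe_divisor;
}
```

In both guarded cases the result is simply the dividend, since the divisor becomes +1.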
Image Argument Substitution
OpenCL image calls that use opaque types are replaced with the equivalents provided by the image library.
MemToReg
A manual implementation of LLVM’s MemToReg pass, which promotes allocas
whose only uses are loads and stores to register references. This is needed
because after LLVM 5.0 llvm::MemToReg has regressed and is not removing all
the allocas it should be.
Builtin Simplification
BuiltinSimplificationPass is a module level pass for simplifying builtin
function calls. The pass performs two kinds of optimization on builtins:
Converts builtins to more efficient variants where possible (for example, a call to the math function pow(x, y), where y is a constant that is representable by an integer, will be converted to pown(x, y)).
Replaces builtins whose arguments are all constant (for example, a call to the math function cos(x), where x is a constant, will be replaced by a new constant value that is the calculation of the cosine of x).
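The condition for the pow to pown conversion, a constant exponent that is representable by an integer, can be sketched as a small check. This is an assumption-laden illustration rather than the pass's code: asIntegerExponent is a hypothetical name, and the exponent is modelled as a double and the integer type as int.

```cpp
#include <cmath>
#include <limits>
#include <optional>

// Returns the exponent as an int when 'y' is exactly representable as one,
// or std::nullopt when pow(x, y) would have to be left unchanged.
std::optional<int> asIntegerExponent(double y) {
  if (!std::isfinite(y)) {
    return std::nullopt;  // Infinities and NaN are never integers.
  }
  if (std::trunc(y) != y) {
    return std::nullopt;  // Has a fractional part.
  }
  if (y < static_cast<double>(std::numeric_limits<int>::min()) ||
      y > static_cast<double>(std::numeric_limits<int>::max())) {
    return std::nullopt;  // Integral, but does not fit in an int.
  }
  return static_cast<int>(y);
}
```

For example, an exponent of 3.0 qualifies and yields 3, while 2.5 does not.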
printf Replacement
Of the many architectures that have ComputeMux back ends, most do not have
access to an implementation of printf that can route a call to
printf within a kernel to stdout of the process running on the host
CPU.
To enable our ComputeMux back ends to call printf, we provide an optimized
software implementation. An additional kernel argument buffer is implicitly
added to any kernel that uses printf, and our implementation of printf
that is run on the ComputeMux backend will write the results of the printf
into this buffer instead. Then, when the kernel has completed its execution, the
data that was written to this buffer is streamed out on the host CPU
via stdout.
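The buffer-based scheme can be illustrated with a toy model. Everything in this sketch is hypothetical: the record layout (format index, argument count, then 32-bit integer arguments), the names PrintfBuffer, devicePrintf, and hostDrain, and the restriction to %d conversions are assumptions made for illustration and do not describe the real buffer format.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the implicit kernel argument buffer.
struct PrintfBuffer {
  std::size_t capacity_words;
  std::vector<std::int32_t> words;
};

// "Device" side: record the call in the buffer instead of printing.
// Returns false (and records nothing) if the record does not fit.
bool devicePrintf(PrintfBuffer &buffer, std::int32_t format_index,
                  const std::vector<std::int32_t> &args) {
  const std::size_t needed = 2 + args.size();
  if (buffer.words.size() + needed > buffer.capacity_words) {
    return false;
  }
  buffer.words.push_back(format_index);
  buffer.words.push_back(static_cast<std::int32_t>(args.size()));
  buffer.words.insert(buffer.words.end(), args.begin(), args.end());
  return true;
}

// "Host" side: after the kernel has finished, decode every record and expand
// each "%d" in the matching format string with the next stored argument.
std::string hostDrain(const PrintfBuffer &buffer,
                      const std::vector<std::string> &formats) {
  std::string out;
  std::size_t pos = 0;
  while (pos + 2 <= buffer.words.size()) {
    const std::string &format =
        formats.at(static_cast<std::size_t>(buffer.words[pos]));
    const std::size_t count = static_cast<std::size_t>(buffer.words[pos + 1]);
    std::size_t next_arg = pos + 2;
    const std::size_t end_args = next_arg + count;
    for (std::size_t i = 0; i < format.size(); ++i) {
      const bool is_int_conversion = format[i] == '%' &&
                                     i + 1 < format.size() &&
                                     format[i + 1] == 'd';
      if (is_int_conversion && next_arg < end_args) {
        out += std::to_string(buffer.words[next_arg++]);
        ++i;  // Skip the 'd'.
      } else {
        out += format[i];
      }
    }
    pos = end_args;
  }
  return out;
}
```

In this model the host would write the string returned by hostDrain to stdout once the kernel has completed.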
Combine fpext fptrunc
CombineFPExtFPTruncPass is a function level pass, rather than a module pass,
for removing FPExt and FPTrunc instructions that cancel each other out. This is
used after the printf replacement pass because var-args printf arguments
will be expanded to double by clang even if the device doesn’t support doubles.
So if the device doesn’t support doubles, the printf pass will fptrunc the
parameters back to float. CombineFPExtFPTruncPass will find and remove the
matching fpext (added by clang) and fptrunc (added by the printf pass) to
get rid of the doubles.
The pass is implemented by iterating over all the instructions looking for any
llvm::FPExtInst instructions. If one is found, we check its uses. If the
fpext is unused, we remove it. Otherwise, if the instruction has only one use
and that use is an llvm::FPTruncInst, then we can replace all uses of the
fptrunc with the first operand of the fpext and delete both the fptrunc and
the fpext.
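Removing a matching pair preserves values because every float is exactly representable as a double, so extending and then truncating gives back the original value. The sketch below checks this by comparing bit patterns. roundTripIsIdentity is a hypothetical name, and NaN inputs are deliberately left out of the discussion since their payload handling can vary.

```cpp
#include <cstdint>
#include <cstring>

// Extend a float to double (like fpext), truncate it back (like fptrunc),
// and report whether the bit pattern is unchanged.
bool roundTripIsIdentity(float value) {
  const double extended = static_cast<double>(value);
  const float truncated = static_cast<float>(extended);
  std::uint32_t before = 0;
  std::uint32_t after = 0;
  std::memcpy(&before, &value, sizeof before);
  std::memcpy(&after, &truncated, sizeof after);
  return before == after;
}
```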
Set convergent Attr
In clang the convergent attribute can be set on a function to indicate to
the optimizer that the function relies on cross work item semantics. For
OpenCL we need this attribute to be set on the barrier function, for example,
since it’s used to control the scheduling of threads. Recent versions of clang
will proactively set such functions in OpenCL-C kernels as convergent, but
we also set the attribute implicitly in the builtins header out of an abundance
of caution.
This pass iterates over all the functions in the module, including declarations,
which requires the pass to be a module pass instead of a function pass. If the
function inspected may be convergent, identified by the compiler’s
BuiltinInfo analysis, then we assign the llvm::Attribute::Convergent
attribute to it. When the pass encounters a convergent function, all functions
calling that function are transitively marked convergent.
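The transitive marking can be sketched as a worklist over a call graph. This is a simplified model under stated assumptions: functions are represented by name, the call graph is given as a map from each function to its callees, and the initially convergent set stands in for the result of the BuiltinInfo query. propagateConvergent is a hypothetical name.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// 'callees' maps each function to the functions it calls. 'seeds' are the
// functions already known to be convergent. Returns every function that ends
// up marked convergent.
std::set<std::string> propagateConvergent(
    const std::map<std::string, std::vector<std::string>> &callees,
    const std::set<std::string> &seeds) {
  // Invert the call graph so that callers can be looked up by callee.
  std::map<std::string, std::vector<std::string>> callers;
  for (const auto &entry : callees) {
    for (const auto &callee : entry.second) {
      callers[callee].push_back(entry.first);
    }
  }
  std::set<std::string> convergent = seeds;
  std::vector<std::string> worklist(seeds.begin(), seeds.end());
  while (!worklist.empty()) {
    const std::string current = worklist.back();
    worklist.pop_back();
    const auto found = callers.find(current);
    if (found == callers.end()) {
      continue;  // Nothing calls this function.
    }
    for (const auto &caller : found->second) {
      // Only revisit callers that were not already marked.
      if (convergent.insert(caller).second) {
        worklist.push_back(caller);
      }
    }
  }
  return convergent;
}
```

For a kernel that calls a helper which in turn calls barrier, both the helper and the kernel end up marked, while unrelated functions are left alone.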