Overview

Introduction

This document introduces users to the ComputeMux compiler pipeline. The reference compiler pipeline performs several key transformations across a number of phases, which can be difficult for new users to follow.

The pipeline is constructed and run in compiler::BaseModule::finalize (provided by default but overridable), which defers to the pure virtual compiler::BaseModule::getLateTargetPasses, which must be implemented by each target.

The pipeline described here is therefore the reference model used by all in-tree ComputeMux targets, and serves as a suggestion of how to construct a basic target pipeline. Everything is ultimately customizable by the target.
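As a rough illustration only, a target hooks into this pipeline by overriding getLateTargetPasses and returning its own pass pipeline. The sketch below assumes a particular signature (a pass machinery argument and an llvm::ModulePassManager return value), which may not match the exact interface in your version of the sources:

// Hedged sketch only: the precise compiler::BaseModule interface may differ.
class MyTargetModule : public compiler::BaseModule {
  llvm::ModulePassManager getLateTargetPasses(
      compiler::utils::PassMachinery &pass_mach) override {
    llvm::ModulePassManager pm;
    // Append the target's late passes here, e.g. the reference phases
    // described in this document: vectorization, work-item loops,
    // scheduling parameters, kernel wrapping, etc.
    return pm;
  }
};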

Objective & Execution Model

The reference compiler pipeline’s objective is to compile incoming LLVM IR modules containing one or more kernel functions to object code that is ready for execution when invoked by the host-side runtime. The assumptions placed on the input and output kernels are as follows:

  1. The original kernel is assumed to adhere to an implicit SIMT execution model; it runs once for each work-item in an NDRange.

  2. The final compiled kernel is assumed to be invoked from the host-side runtime once per work-group in the NDRange.

  • This execution model is referenced elsewhere in the documentation as a work-group per thread or core per work-group scheduling model.

The following diagram provides an overview of the main phases of the reference ComputeMux compiler pipeline in terms of the underlying and assumed kernel execution model.

The inner-most function is the original input kernel, which is wrapped by new functions in successive phases, until it is in a form ready to be executed by the ComputeMux driver.

digraph ExecutionModel { ranksep=0.75 compound = true; graph [colorscheme=ylgnbu9]; graph [fontname = "Courier New"]; node [fontname = "Courier New"]; edge [fontname = "Courier New"]; Driver [ shape = box; label = "Driver entry point"; ]; BarrierWrapper [ shape = box; label = "@foo.mux-barrier-wrapper(/* kernel params */)"; ]; OrigKernel [ shape = box; label = "@foo(/* kernel params */)"; ]; Link1 [ shape = octagon; label = "for (wg : nd_range)"; ] Link2 [ shape = octagon; label = "for (wi : wg)"; ] subgraph cluster_driver { bgcolor=1 label="ComputeMux Driver"; labeljust=r Driver; Link1; subgraph cluster_barrier_wrapper { bgcolor=3 label="WorkItemLoopsPass"; labeljust=r BarrierWrapper; Link2; subgraph cluster_foo { bgcolor=5 label="Original kernel"; OrigKernel; } } } Driver -> Link1 Link1 -> BarrierWrapper BarrierWrapper -> Link2 Link2 -> OrigKernel [lhead=cluster_foo] }

Overview of the passes involved in handling the kernel execution model

The WorkItemLoopsPass is the key pass that makes some of the implicit parallelism explicit. By introducing work-item loops around each kernel function, the new kernel entry point now runs once per work-group in the NDRange.
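In pseudo C code (a conceptual sketch only; the real pass operates on LLVM IR and the loop structure is simplified here):

// Input: the kernel body is written per work-item.
void foo(/* kernel params */);

// After the WorkItemLoopsPass, a wrapper runs the body for every
// work-item in the work-group. The ComputeMux driver then invokes this
// wrapper once per work-group in the NDRange.
void foo.mux-barrier-wrapper(/* kernel params */) {
  for (/* each work-item wi in the work-group */) {
    foo(/* kernel params */);
  }
}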

Note

Targets adhering to a work-item per thread or core per work-item scheduling model will likely wish to omit the WorkItemLoopsPass.

Host

As mentioned, the above documents the reference compiler pipeline. The in-tree Host CPU target uses a different model, where the driver splits up the NDRange into multi-dimensional slices and calls the compiler-generated code for each slice. Each work-group in the slice is then traversed using multi-dimensional work-group loops generated by the AddEntryHookPass:

digraph ExecutionModel { ranksep=0.75 compound = true; graph [colorscheme=ylgnbu9]; graph [fontname = "Courier New"]; node [fontname = "Courier New"]; edge [fontname = "Courier New"]; Driver [ shape = box; label = "Driver entry point"; ]; AddEntryHookPass [ shape = box; label = "@foo.host-entry-hook(/* ABI params */)"; ]; BarrierWrapper [ shape = box; label = "@foo.mux-barrier-wrapper(/* kernel params */)"; ]; OrigKernel [ shape = box; label = "@foo(/* kernel params */)"; ]; Link1 [ shape = octagon; label = "for (slice : nd_range)"; ] Link2 [ shape = octagon; label = "for (wg : slice)"; ] Link3 [ shape = octagon; label = "for (wi : wg)"; ] subgraph cluster_driver { bgcolor=1 label="ComputeMux Driver"; labeljust=r Driver; Link1; subgraph cluster_entry_hook { bgcolor=2 label="AddEntryHookPass"; labeljust=r AddEntryHookPass; Link2; subgraph cluster_barrier_wrapper { bgcolor=3 label="WorkItemLoopsPass"; labeljust=r BarrierWrapper; Link3; subgraph cluster_foo { bgcolor=5 label="Original kernel"; OrigKernel; } } } } Driver -> Link1 Link1 -> AddEntryHookPass AddEntryHookPass -> Link2 Link2 -> BarrierWrapper BarrierWrapper -> Link3 Link3 -> OrigKernel [lhead=cluster_foo] }

Overview of the host target’s execution model
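Expressed as pseudo C code (a conceptual sketch of the diagram above; names and parameters are illustrative):

// The driver splits the NDRange into slices and calls the
// compiler-generated entry hook for each one.
for (/* each slice in the NDRange */) {
  foo.host-entry-hook(/* ABI params for this slice */);
}

// The entry hook generated by the AddEntryHookPass loops over the
// work-groups in the slice...
void foo.host-entry-hook(/* ABI params */) {
  for (/* each work-group wg in the slice */) {
    // ...and the WorkItemLoopsPass wrapper loops over the work-items.
    foo.mux-barrier-wrapper(/* kernel params */);
  }
}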

Compiler Pipeline Overview

With the overall execution model established, we can start to dive deeper into the key phases of the compilation pipeline.

digraph { ranksep=0.75 compound = true; graph [colorscheme=ylgnbu9]; graph [fontname = "Courier New"]; node [fontname = "Courier New"]; edge [fontname = "Courier New"]; N1 [ shape = box style=filled fillcolor="#f5f2ca" label = "Input IR" ] N2 [ shape = box style=filled fillcolor="#dfeabf" label = "Adding Metadata/Attributes" ] N3 [ shape = box style=filled fillcolor="#c8e2b7" label = "Vecz" ] N4 [ shape = box style=filled fillcolor="#afdab2" label = "Linking Builtins" ] N5 [ shape = box style=filled fillcolor="#96d2b0" label = "Work-item Scheduling & Barriers" ] N6 [ shape = box style=filled fillcolor="#7dc9b1" label = "Adding Scheduling Parameters" ] N7 [ shape = box style=filled fillcolor="#63c0b5" label = "Defining mux builtins" ] N8 [ shape = box style=filled fillcolor="#4bb5b9" label = "Kernel Wrapping" ] N9 [ shape = box style=filled fillcolor="#36aabd" label = "Binary Object Creation" ] // #289ec1 N1 -> N2 -> N3 -> N4 -> N5 -> N6 -> N7 -> N8 -> N9 N2 -> N4 }

Overview of compilation pipeline

Input IR

The program begins as an LLVM module, either generated by a compiler frontend or deserialized from another intermediate form such as SPIR-V. Kernels in the module are assumed to obey a SIMT programming model, as described earlier in Objective & Execution Model.

Simple fix-up passes take place at this stage: the IR is massaged to conform to specifications or to fix known deficiencies in earlier representations.

Adding Metadata/Attributes

ComputeMux IR metadata and attributes are attached to kernels. This information is used by subsequent passes to identify certain aspects of kernels that are not otherwise attainable or representable in LLVM IR.

The TransferKernelMetadataPass and EncodeKernelMetadataPass are responsible for adding this information.

See the ComputeMux Compiler Specification for details about specific metadata and attributes.

Early transformation passes and optimizations take place at this stage. Of note is LLVM’s SLP vectorizer.

Note

This takes place in BaseModule::finalize before calling BaseModule::getLateTargetPasses so metadata and attributes can be assumed to be present as an input to that method.

Early Vectorization Passes

These passes are activated when the VectorizeSLP and VectorizeLoop bits are set in clang::CodeGenOptions (corresponding to the -cl-vec={none|loop|slp|all} command-line option). These options enable sequences of standard LLVM passes that attempt to vectorize the kernel for a single work-item. The vectorization passes used are Loop Vectorization, SLP Vectorization, and Load/Store Vectorization. Some basic simplification passes are also applied (Loop Rotation, Instruction Combining, CFG Simplification, and Dead Code Elimination). Loop Rotation was found to be necessary for Loop Vectorization to work as expected; it can also generate redundant code when the number of iterations is known at compile time to be a multiple of the vector width, which CFG Simplification is able to clean up. Load/Store Vectorization is applied if either the Loop or SLP option is selected.
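For example, when using a ComputeMux-based OpenCL driver, these options are typically supplied as program build options. The snippet below is a hedged usage sketch: it assumes -cl-vec is accepted as a build option by your driver build.

// Request both the loop and SLP early vectorization passes when
// building an OpenCL program (assumes -cl-vec is supported).
const char *options = "-cl-vec=all";
cl_int err = clBuildProgram(program, 1, &device, options, NULL, NULL);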

Whole Function Vectorization

The Vecz whole-function vectorizer is optionally run.

Note that Vecz may perform its own scalarization, depending on the options passed to it, potentially undoing the work of any of the Early Vectorization Passes, although in many cases it is able to preserve or even widen pre-existing vector operations.
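As a rough pseudo C illustration of whole-function vectorization by a factor of 4 (the real pass transforms LLVM IR; the code below is only a sketch):

// Original scalar kernel: one work-item per invocation.
void foo(global int *a) {
  size_t id = get_global_id(0);
  a[id] += 4;
}

// Vectorized kernel produced by Vecz: 4 consecutive work-items per
// invocation, with the per-work-item memory access widened to a vector.
void __vecz_v4_foo(global int *a) {
  size_t id = get_global_id(0);
  int4 v = vload4(0, &a[id]);
  v += (int4)(4);
  vstore4(v, 0, &a[id]);
}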

Linking Builtins

Abacus builtins are linked into the module by the LinkBuiltinsPass.

This is historically done after whole-function vectorization, with the vectorizer generating scaled-up vector forms of known builtins (e.g., vectorizing by 4 turns round(float2) into round(float8)).
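In pseudo C code (illustrative only; the values v and v8 are hypothetical):

// Before vectorization: a call to a known builtin on a float2.
float2 r = round(v);

// After whole-function vectorization by 4, Vecz emits a widened call,
// whose Abacus definition is then linked in by the LinkBuiltinsPass.
float8 r8 = round(v8);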

Note

It is also possible to link most builtins before vectorization, where the vectorizer will rely on inlining of any used builtins, and vectorize accordingly. Some builtins must still be linked after vectorization, however. See the pass’s documentation for more information.

Work-item Scheduling & Barriers

The work-item loops are added to each kernel by the WorkItemLoopsPass.

The kernel execution model changes at this stage to replace some of the implicit parallelism with explicit looping, as described earlier in Objective & Execution Model.

Barrier Scheduling takes place at this stage, as well as Vectorization Scheduling if the vectorizer was run.

Adding Scheduling Parameters

Scheduling parameters are added to builtin functions that need them. These extra parameters supply the data that software implementations of mux work-item builtins consume when they are lowered in a later phase.

The AddSchedulingParametersPass is responsible for this transformation.

A concrete example of this is the OpenCL get_global_id(uint) builtin, which is defined (by default) as calling the ComputeMux equivalent __mux_get_global_id. Its function body will have been materialised when linking builtins earlier.

While some hardware may have all of the necessary features for implementing this builtin, many architectures don’t. Thus the software implementation needs to source extra data from somewhere external to the function. This is the role that scheduling parameters fill.

Note

The BuiltinInfo analysis controls which scheduling parameters are added. Targets may override BuiltinInfo to change their scheduling parameters whilst making use of this pass. See the tutorial on Custom Lowering Work-Item Builtins on how this may be accomplished.

Pseudo C code:

// The kernel calls __mux_get_global_id, which at this point only takes
// the dimension index:
void foo() {
  size_t id = __mux_get_global_id(0);
}

size_t __mux_get_global_id(uint);

// The AddSchedulingParametersPass produces the following
// scheduling structures:
struct MuxWorkItemInfo { size_t local_ids[3]; ... };
struct MuxWorkGroupInfo { size_t group_ids[3]; ... };

// And this wrapper function
void foo.mux-sched-wrapper(MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  size_t id = __mux_get_global_id(0, wi, wg);
}

// And a new version of __mux_get_global_id with scheduling parameters
size_t __mux_get_global_id(uint, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg);

A combination of the ComputeMux driver and the outer work-item loops is responsible for filling in all of the scheduling parameter data. For example:

  1. The WorkItemLoopsPass sets the local ID at each loop level (see the sketch after this list).

  2. The ComputeMux driver externally sets up the work-group information such as the work-group IDs and sizes.
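Conceptually, once both passes have run, the combination looks something like the pseudo C sketch below. It assumes the default MuxWorkItemInfo layout shown above; local_size_x/y/z stand for the work-group dimensions, and the driver has already populated the MuxWorkGroupInfo.

// Work-item loops generated by the WorkItemLoopsPass update the local
// ID fields before each invocation of the wrapped kernel.
for (size_t z = 0; z < local_size_z; ++z) {
  for (size_t y = 0; y < local_size_y; ++y) {
    for (size_t x = 0; x < local_size_x; ++x) {
      wi->local_ids[0] = x;
      wi->local_ids[1] = y;
      wi->local_ids[2] = z;
      foo.mux-sched-wrapper(wi, wg);
    }
  }
}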

Defining mux Builtins

The bodies of mux builtin function declarations are provided. They may use the extra information passed through parameters added by the AddSchedulingParametersPass in a previous phase.

The DefineMuxBuiltinsPass performs this transformation.

Some builtins may rely on others to complete their function. These dependencies are handled transitively.

Note

The BuiltinInfo analysis controls how mux builtins are defined. Targets may override BuiltinInfo to change how specific builtins are defined whilst making use of this pass. See the tutorial on Custom Lowering Work-Item Builtins on how this may be accomplished.

Pseudo C code:

struct MuxWorkItemInfo { size_t local_ids[3]; ... };
struct MuxWorkGroupInfo { size_t group_ids[3]; ... };

// The wrapper function produced by the previous phase
void foo.mux-sched-wrapper(MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  size_t id = __mux_get_global_id(0, wi, wg);
}

// The DefineMuxBuiltinsPass provides the definition
// of __mux_get_global_id:
size_t __mux_get_global_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return (__mux_get_group_id(i, wi, wg) * __mux_get_local_size(i, wi, wg)) +
         __mux_get_local_id(i, wi, wg) + __mux_get_global_offset(i, wi, wg);
}

// And thus the definition of __mux_get_group_id...
size_t __mux_get_group_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return i >= 3 ? 0 : wg->group_ids[i];
}

// and __mux_get_local_id, etc
size_t __mux_get_local_id(uint i, MuxWorkItemInfo *wi, MuxWorkGroupInfo *wg) {
  return i >= 3 ? 0 : wi->local_ids[i];
}

Kernel Wrapping

The kernel’s ABI is finalized, ready for it to be called by the ComputeMux driver.

The AddKernelWrapperPass performs this transformation.

Kernel parameters are packed together into an auto-generated struct type. A pointer to this structure is passed as the first parameter to the new kernel.

Scheduling parameters that form part of the kernel ABI, such as the work-group info, are preserved and passed on to this new kernel.

Scheduling parameters that do not form part of the kernel ABI, such as the work-item info, are instead initialized by the wrapper before being passed to the wrapped kernel.

Note

The BuiltinInfo analysis controls scheduling parameters and how they interact with the kernel ABI. Targets may override BuiltinInfo to change how specific builtins are defined whilst making use of this pass. See the tutorial on Custom Lowering Work-Item Builtins on how this may be accomplished.

Pseudo C code:

struct MuxWorkItemInfo { ... };
struct MuxWorkGroupInfo { ... };

void foo(global int *a, double f, MuxWorkItemInfo *, MuxWorkGroupInfo *);

// AddKernelWrapperPass produces the following packed-argument struct:
struct MuxPackedArgs.foo { global int *a; double f; };

// AddKernelWrapperPass produces the following wrapper:
void foo.mux-kernel-wrapper(MuxPackedArgs.foo *args, MuxWorkGroupInfo *wg) {
  // Note - the default behaviour is to stack-allocate MuxWorkItemInfo,
  // leaving all fields uninitialized. The previous compiler passes always
  // 'set' values before they 'get' them. Targets can customize this
  // behaviour: see the tutorial linked above.
  MuxWorkItemInfo wi;
  global int *a = args->a;
  double f = args->f;
  return foo(a, f, &wi, wg);
}

Binary Object Creation

Any final passes are run now before the module is passed off to generate an object file. For ahead-of-time targets, this may involve calling on LLVM to generate an ELF file. For just-in-time targets, nothing further may be done at this stage; compilation is instead deferred until execution time.

When presented with a binary object, the host runtime needs to identify the kernel to call. A common approach used by ComputeMux compiler targets is to run the AddMetadataPass, which helps to encode kernel metadata into the final ELF file. This can then be decoded by the runtime when the binary is loaded, in order to locate and invoke the kernel.

Barrier Scheduling

The fact that the WorkItemLoopsPass handles both work-item loops and barriers can be confusing to newcomers. These two concepts are in fact linked. Taking the kernel code below, this section will show how the WorkItemLoopsPass lays out and schedules a kernel’s work-item loops in the face of barriers.

kernel void foo(global int *a, global int *b) {
  // pre barrier code - foo.mux-barrier-region.0()
  size_t id = get_global_id(0);
  a[id] += 4;
  // barrier
  barrier(CLK_GLOBAL_MEM_FENCE);
  // post barrier code - foo.mux-barrier-region.1()
  b[id] += 4;
}

The kernel has one global barrier, with one statement on either side of it. The WorkItemLoopsPass conceptually breaks the kernel down into barrier regions: the regions of code reached by following control flow between the barriers in the kernel. The example above has two regions: the first contains the call to get_global_id and the read/update/write of the global memory pointed to by a; the second contains the read/update/write of the global memory pointed to by b.

To correctly observe the barrier’s semantics, all work-items in the work-group need to execute the first barrier region before beginning the second. Thus the WorkItemLoopsPass produces two sets of work-item loops to schedule this kernel:

digraph ExecutionModel { ranksep=0.75 compound = true; graph [colorscheme=ylgnbu9]; graph [fontname = "Courier New"]; node [fontname = "Courier New"]; edge [fontname = "Courier New"]; BarrierWrapper [ shape = box; label = "@foo.mux-barrier-wrapper()"; ]; OrigKernel0 [ shape = box; label = "@foo.mux-barrier-region.0()\l a[id] += 4;\l"; ]; OrigKernel1 [ shape = box; label = "@foo.mux-barrier-region.1()\l b[id] += 4;\l"; ]; Fence [ shape = box; label = "fence"; ] Link1 [ shape = octagon; label = "for (wi : wg)"; ] Link2 [ shape = octagon; label = "for (wi : wg)"; ] subgraph cluster_barrier_wrapper { bgcolor=3 label="WorkItemLoopsPass"; labeljust=r BarrierWrapper Link1 subgraph cluster_foo { bgcolor=5 label=""; OrigKernel0; } Fence Link2 subgraph cluster_foo2 { bgcolor=5 label=""; OrigKernel1; } } BarrierWrapper -> Link1 Link1 -> OrigKernel0 [dir=both] OrigKernel0 -> Fence Fence -> Link2 Link2 -> OrigKernel1 [dir=both] }

How a simple barrier is laid out by WorkItemLoopsPass
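The same layout in pseudo C code (a sketch; the real pass produces LLVM IR and also emits the outer Y and Z loops, and fence() here is only a placeholder for the barrier's memory semantics):

void foo.mux-barrier-wrapper(global int *a, global int *b) {
  // First set of work-item loops: every work-item executes region 0.
  for (/* each work-item wi in the work-group */) {
    foo.mux-barrier-region.0(a, b /*, per-work-item state */);
  }
  // Memory fence corresponding to CLK_GLOBAL_MEM_FENCE.
  fence();
  // Second set of work-item loops: every work-item executes region 1.
  for (/* each work-item wi in the work-group */) {
    foo.mux-barrier-region.1(a, b /*, per-work-item state */);
  }
}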

Live Variables

Note also that id is a live variable whose lifetime crosses the barrier. The WorkItemLoopsPass creates a structure of live variables which is passed between successive barrier regions, containing any data that needs to remain live in later regions.

In this case, however, calls to certain builtins like get_global_id are treated specially and are materialized anew in each barrier region where they are used.
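A hedged pseudo C sketch of the mechanism, using a hypothetical value x (not a work-item builtin) that must be carried across the barrier; the struct, function, and parameter names are illustrative:

// Per-work-item live-variables storage created by the pass.
struct foo_live_vars { int x; };

// Region 0 computes x and stores it for this work-item...
void foo.mux-barrier-region.0(/* params */, foo_live_vars *live) {
  int x = /* ...computed before the barrier... */;
  live->x = x;
}

// ...and region 1 reloads it after the barrier.
void foo.mux-barrier-region.1(/* params */, foo_live_vars *live) {
  use(live->x);
}

In practice the wrapper allocates this storage for every work-item in the work-group, since all work-items finish region 0 before any of them begin region 1.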

Vectorization Scheduling

The WorkItemLoopsPass is responsible for laying out kernels which have been vectorized by the Vecz whole-function vectorizer.

The vectorizer creates multiple versions of the original kernel. Vectorized kernels on their own are generally unable to fulfill work-group scheduling requirements, as they operate only on a number of work-items equal to a multiple of the vectorization factor. As such, for the general case, several kernels must be combined to cover all work-items in the work-group; the WorkItemLoopsPass is responsible for this.

Note

The following diagram uses a vectorization width of 4.

For brevity, the diagram below only details the inner-most work-item loops. Most kernels will in reality have two outer levels of loops over the full Y and Z work-group dimensions.

digraph ExecutionModel { ranksep=0.75 compound = true; graph [colorscheme=ylgnbu9]; graph [fontname = "Courier New"]; node [fontname = "Courier New"]; edge [fontname = "Courier New"]; BarrierWrapper [ shape = box; label = "@foo.mux-barrier-wrapper()"; ]; OrigKernel0 [ shape = box; label = "@__vecz_v4_foo()"; ]; OrigKernel1 [ shape = box; label = "@foo()"; ]; ScalarPH [ shape = box; label = "<scalar check>"; ]; Exit [ shape = box; label = "return"; ]; Link1 [ shape = box; label = "unsigned i = 0;\lunsigned wg_size = get_local_size(0);\lunsigned peel = wg_size % 4;\l"; ] VectorPH [ shape = box; label = "for (unsigned e = wg_size - peel; i < e; i += 4)"; ] Link2 [ shape = box; label = "for (; i < wg_size; i++)"; ] subgraph cluster_barrier_wrapper { bgcolor=3 label="WorkItemLoopsPass"; labeljust=r BarrierWrapper Link1 VectorPH subgraph cluster_foo { bgcolor=5 label=""; OrigKernel0; } ScalarPH Link2 subgraph cluster_foo2 { bgcolor=5 label=""; OrigKernel1; } Exit } BarrierWrapper -> Link1 Link1 -> VectorPH [label="if (wg_size != peel)"] Link1 -> ScalarPH [label="if (wg_size == peel)"] VectorPH -> OrigKernel0 [dir=both] OrigKernel0 -> ScalarPH ScalarPH -> Link2 [label="if (peel)"] Link2 -> OrigKernel1 [dir=both] ScalarPH -> Exit [label="if (!peel)"] OrigKernel1 -> Exit }

How a vectorized kernel is laid out by WorkItemLoopsPass

In the above example, the vectorized kernel is called to execute as many work-items as possible, up to the largest multiple of the vectorization factor that is less than or equal to the work-group size.

In the case that work-items remain (i.e., if the work-group size is not a multiple of 4), the original scalar kernel is called on the remaining work-items (at most 3 here). These remaining work-items are typically called the ‘peel’ iterations by ComputeMux.
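In pseudo C code (a sketch of the inner-most dimension only, matching the diagram above):

void foo.mux-barrier-wrapper(/* kernel params */) {
  unsigned wg_size = get_local_size(0);
  unsigned peel = wg_size % 4;
  unsigned i = 0;
  // Main loop: the vectorized kernel covers 4 work-items per iteration.
  for (unsigned e = wg_size - peel; i < e; i += 4) {
    __vecz_v4_foo(/* kernel params */);
  }
  // 'Peel' loop: the original scalar kernel covers the remainder.
  for (; i < wg_size; i++) {
    foo(/* kernel params */);
  }
}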

Note that other vectorized kernel layouts are possible; see the WorkItemLoopsPass documentation for details.