Quantized MatMul Fusion Patterns
Overview
oneDNN supports both floating-point and quantized MatMul fusion patterns to optimize performance and reduce memory bandwidth requirements. This document describes the supported quantized fusion patterns for MatMul. For the floating-point patterns, refer to MatMul Fusion Patterns.
Pattern Structure
oneDNN defines quantized MatMul fusion patterns as follows. The blue nodes are required when defining a quantized MatMul fusion pattern while the brown nodes are optional.

1. Q2F Conversion Subgraph: Converts the src and weights tensors from quantized to floating-point. It can be one of the following subgraphs; the last two subgraphs apply only to weights. See the Dequantize, TypeCast and Quantize operations in the Graph API.
2. F2F Conversion Subgraph: Converts the bias tensor from one floating-point data type to another. It is constructed from a TypeCast operation.
3. MatMul Operation: Performs matrix multiplication between the src and weights tensors. The bias tensor is optional. See the MatMul operation in the Graph API for more details.
4. Epilogue Subgraph: Optional and can include the following operations:
   - BiasAdd operation.
   - Binary and Unary operations: refer to the Note in Fusion Patterns.
   - Select operation.

   Combination rules:
   - BiasAdd: If present, must be the first op in the epilogue subgraph and can only appear once.
   - 0 to 4 Binary or Unary operations are supported in the epilogue subgraph.
   - Select: If present, must follow binary/unary operations (if present) and can only appear once.
5. F2F/F2Q Conversion Subgraph: Converts the output tensor from floating-point to a floating-point or quantized data type. It can be one of the following subgraphs; the last two subgraphs are implementations for SmoothQuant [1]. See the TypeCast, Quantize and Multiply operations in the Graph API.
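To make the data flow through the pattern concrete, the following is a minimal numeric sketch in plain Python, not oneDNN code. It assumes per-tensor scales and zero points, a Q2F step of the form `(q - zp) * scale`, an F2Q step of the form `round(f / scale) + zp` saturated to the u8 range, and an epilogue made of BiasAdd followed by a single ReLU. All function and parameter names are hypothetical and chosen only for illustration.

```python
def dequantize(q, scale, zp):
    """Q2F step (assumed per-tensor): f = (q - zp) * scale."""
    return [[(v - zp) * scale for v in row] for row in q]


def quantize(f, scale, zp, lo, hi):
    """F2Q step (assumed per-tensor): q = clamp(round(f / scale) + zp, lo, hi)."""
    return [[min(hi, max(lo, round(v / scale) + zp)) for v in row] for row in f]


def matmul(a, b):
    """Plain floating-point matrix multiplication on nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def quantized_matmul(src_q, wei_q, bias, src_scale, src_zp,
                     wei_scale, dst_scale, dst_zp):
    """Reference walk through the pattern: Q2F -> MatMul -> epilogue -> F2Q."""
    src_f = dequantize(src_q, src_scale, src_zp)   # u8 src to floating-point
    wei_f = dequantize(wei_q, wei_scale, 0)        # s8 weights, symmetric (zp = 0)
    acc = matmul(src_f, wei_f)
    # Epilogue: BiasAdd comes first, then one unary operation (ReLU).
    acc = [[v + b for v, b in zip(row, bias)] for row in acc]
    acc = [[max(v, 0.0) for v in row] for row in acc]
    return quantize(acc, dst_scale, dst_zp, 0, 255)  # floating-point to u8


out = quantized_matmul(
    src_q=[[130, 126], [128, 138]],   # u8 values
    wei_q=[[2, -4], [6, 8]],          # s8 values
    bias=[0.5, 1.0],
    src_scale=0.5, src_zp=128,
    wei_scale=0.25,
    dst_scale=0.5, dst_zp=0,
)
print(out)  # [[0, 0], [16, 22]]
```

A fused implementation would produce the quantized output without writing the intermediate floating-point tensors to memory, which is where the bandwidth saving comes from. The sketch only mirrors the arithmetic, not the execution strategy.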
Data Types
oneDNN supports the following combinations of data types for src, weights, bias and dst:
src | weights | bias | dst
---|---|---|---
u8, s8 | s8, f32 | f32, bf16, f16 | u8, s8, bf16, f16, f32
The definition of the data types and support status on different CPU and GPU platforms follow the general description in the Data Types Guide.
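As a quick way to sanity-check a planned pattern against the table, the sketch below encodes each column as a set of data type names. It assumes that every column can be checked independently, which the single-row table suggests but does not state outright, and it does not model platform-specific support. The names `SUPPORTED` and `unsupported_tensors` are hypothetical helpers, not part of oneDNN.

```python
# Data types per tensor, copied from the table above.
SUPPORTED = {
    "src": {"u8", "s8"},
    "weights": {"s8", "f32"},
    "bias": {"f32", "bf16", "f16"},
    "dst": {"u8", "s8", "bf16", "f16", "f32"},
}


def unsupported_tensors(src, weights, dst, bias=None):
    """Return the names of tensors whose data type is not listed in the table.

    bias is optional in the pattern, so it is only checked when provided.
    """
    chosen = {"src": src, "weights": weights, "dst": dst}
    if bias is not None:
        chosen["bias"] = bias
    return [name for name, dt in chosen.items() if dt not in SUPPORTED[name]]


print(unsupported_tensors("u8", "s8", "f32"))             # []
print(unsupported_tensors("f32", "s8", "u8", bias="s8"))  # ['src', 'bias']
```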
Limitations
The F2F Conversion Subgraph used for the bias tensor supports only f32 to bf16 data type conversion.
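For readers unfamiliar with what an f32 to bf16 conversion does to a bias value, here is a small standard-library sketch. bf16 keeps the sign, the 8 exponent bits and the top 7 mantissa bits of an f32 value. The sketch assumes round-to-nearest-even when dropping the lower 16 bits; the rounding mode is an assumption, since this document does not specify it. The helper names are hypothetical.

```python
import struct


def f32_to_bf16_bits(x):
    """Convert a Python float to bf16 bits via f32, rounding to nearest even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    is_nan = (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0
    if is_nan:
        # Keep NaN a NaN by forcing a mantissa bit that survives truncation.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Add 0x7FFF plus the lowest kept bit so that ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_f32(b):
    """Expand bf16 bits back to a float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


bias_f32 = 0.1
bias_bf16 = f32_to_bf16_bits(bias_f32)
print(hex(bias_bf16))               # 0x3dcd
print(bf16_bits_to_f32(bias_bf16))  # 0.10009765625
```

The round trip shows the precision lost by the conversion: only about three significant decimal digits of the bias survive.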
Reference
[1] SmoothQuant, https://arxiv.org/abs/2211.10438