.. index:: pair: page; Quantized MatMul Fusion Patterns
.. _doxid-dev_guide_graph_quantized_matmul_fusion_patterns:

Quantized MatMul Fusion Patterns
================================

Overview
~~~~~~~~

oneDNN supports both floating-point and quantized MatMul fusion patterns
to optimize performance and reduce memory bandwidth requirements. This
document describes the supported quantized fusion patterns for MatMul.
For floating-point MatMul fusion patterns, refer to
:ref:`MatMul Fusion Patterns ` for more details.

Pattern Structure
~~~~~~~~~~~~~~~~~

oneDNN defines quantized MatMul fusion patterns as follows. The blue
nodes are required when defining a quantized MatMul fusion pattern,
while the brown nodes are optional.

.. image:: quantized_matmul_pattern.png
   :alt: quantized MatMul pattern

#. Q2F Conversion Subgraph: converts the ``src`` and ``weights`` tensors
   from a quantized to a floating-point data type. It can be one of the
   following subgraphs; the last two apply only to ``weights``. See the
   :ref:`Dequantize `, :ref:`TypeCast ` and :ref:`Quantize ` operations
   in the Graph API.

   .. image:: q2f_conversion_quantized_conv_matmul.png
      :alt: q2f_conversion_subgraph

#. F2F Conversion Subgraph: converts the ``bias`` tensor from one
   floating-point data type to another. It is constructed with a
   :ref:`TypeCast ` operation.

   .. image:: f2f_conversion.png
      :alt: f2f_conversion_subgraph

#. MatMul Operation: performs matrix multiplication between the ``src``
   and ``weights`` tensors. The ``bias`` tensor is optional. See the
   :ref:`MatMul ` operation in the Graph API for more details.

#. Epilogue Subgraph: optional; it can include the following
   operations:

   * :ref:`BiasAdd ` operation.
   * Binary and Unary operations: refer to the Note in
     `Fusion Patterns `__.
   * :ref:`Select ` operation.

   Combination rules:

   .. image:: epilogue_subgraph_matmul.png
      :alt: epilogue subgraph

   * BiasAdd: if present, it must be the first op in the epilogue
     subgraph and can appear only once.
   * 0 to 4 Binary or Unary operations are supported in the epilogue
     subgraph.
   * Select: if present, it must follow the Binary/Unary operations (if
     any) and can appear only once.

#. F2F/F2Q Conversion Subgraph: converts the output tensor from
   floating-point to a floating-point or quantized data type. It can be
   one of the following subgraphs; the last two are implementations of
   SmoothQuant [1]. See the :ref:`TypeCast `, :ref:`Quantize ` and
   :ref:`Multiply ` operations in the Graph API.

   .. image:: f2q_conversion_quantized_matmul.png
      :alt: f2q_conversion_subgraph

Data Types
~~~~~~~~~~

oneDNN supports the following combinations of data types for ``src``,
``weights``, ``bias`` and ``dst``:

====== ======== ============= ===================
src    weights  bias          dst
====== ======== ============= ===================
u8,s8  s8,f32   f32,bf16,f16  u8,s8,bf16,f16,f32
====== ======== ============= ===================

The definition of the data types and their support status on different
CPU and GPU platforms follow the general description in the
:ref:`Data Types Guide `.

Limitations
~~~~~~~~~~~

* The F2F Conversion Subgraph used for the ``bias`` tensor supports
  only f32 to bf16 data type conversion.

Reference
~~~~~~~~~

[1] SmoothQuant, `https://arxiv.org/abs/2211.10438 `__
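As a rough numeric sketch of the pattern structure described above, the
dequantize (Q2F) → MatMul with bias → epilogue → quantize (F2Q) flow can
be written in plain Python. This is illustrative only, not oneDNN Graph
API code: all scales, zero points, shapes, and values below are made-up
example numbers, and the epilogue is arbitrarily chosen to be a single
ReLU unary op.

```python
# Sketch of the quantized MatMul fusion pattern:
# Dequantize(src), Dequantize(weights) -> MatMul (+ bias)
# -> epilogue (ReLU) -> Quantize(dst).

def dequantize(q, scale, zero_point):
    """Q2F conversion: map integer values back to floating point."""
    return [[(v - zero_point) * scale for v in row] for row in q]

def matmul(a, b):
    """Plain floating-point matrix multiplication."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols]
            for row in a]

def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """F2Q conversion: round and clamp to the u8 range."""
    return [[min(qmax, max(qmin, round(v / scale) + zero_point))
             for v in row] for row in x]

# Quantized inputs: u8 src, s8 weights (per-tensor quantization,
# example scales and zero points).
src_q = [[10, 20], [30, 40]]          # u8
wei_q = [[1, -2], [3, 4]]             # s8
src = dequantize(src_q, scale=0.1, zero_point=0)
wei = dequantize(wei_q, scale=0.05, zero_point=0)

# MatMul with an f32 bias, then a ReLU op as the epilogue subgraph.
bias = [0.5, -0.5]
out = matmul(src, wei)
out = [[v + b for v, b in zip(row, bias)] for row in out]
out = [[max(v, 0.0) for v in row] for row in out]  # epilogue: ReLU

# F2Q conversion of the result back to u8.
dst_q = quantize(out, scale=0.01, zero_point=0)
print(dst_q)
```

In a real graph, each of these steps is a oneDNN Graph op (Dequantize,
MatMul, the epilogue ops, Quantize) that the backend fuses into a single
integer kernel; the arithmetic is kept separate here only to make the
conversion stages visible.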