Quantized Convolution Fusion Patterns

Overview

oneDNN supports both floating-point and quantized Convolution fusion patterns to optimize performance and reduce memory bandwidth requirements. This document describes the supported quantized fusion patterns for Convolution. For floating-point Convolution fusion patterns, refer to Convolution Fusion Patterns for more details.

Pattern Structure

oneDNN defines quantized Convolution fusion patterns as follows. In the pattern diagram, the blue nodes are required while the brown nodes are optional. A code sketch after the list shows how these pieces are assembled with the Graph API.

[Figure: quantized Convolution fusion pattern]
  1. Q2F Conversion Subgraph: Converts the src and weights tensors from a quantized data type to floating-point. It can be one of the subgraphs shown below; the last two apply only to the weights tensor. See the Dequantize, TypeCast and Quantize operations in the Graph API.

    [Figure: Q2F conversion subgraph]
  2. F2F Conversion Subgraph: Converts the bias tensor from one floating-point data type to another. It consists of a single TypeCast operation.

    [Figure: F2F conversion subgraph]
  3. Convolution Operation: Performs convolution between the src and weights tensors. The bias tensor is optional. See the Convolution operation in the Graph API for more details.

  4. Epilogue Subgraph: Optional and can include BiasAdd, Binary and Unary operations.

    Combination Rules:

    [Figure: epilogue subgraph]
    • BiasAdd: If present, it must be the first op in the epilogue subgraph and can appear only once.

    • 0 to 4 Binary or Unary operations are supported in the epilogue subgraph.

  5. F2F/F2Q Conversion Subgraph: Converts the output tensor from floating-point to another floating-point data type (F2F) or to a quantized data type (F2Q). It can be one of the subgraphs shown below. See the TypeCast and Quantize operations in the Graph API.

    [Figure: F2F/F2Q conversion subgraph]
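
As a concrete illustration of steps 1 through 5, the following C++ sketch builds a u8 -> f32 -> u8 quantized Convolution pattern with the oneDNN Graph API. It is a minimal sketch, not the library's reference example: all shapes, scales, zero points, and tensor/op ids are illustrative placeholders, and the ReLU epilogue stands in for any supported unary op.

```cpp
// A minimal sketch of the quantized Convolution fusion pattern, assuming
// a u8 src / s8 weights / f32 bias / u8 dst configuration.
#include <string>
#include <vector>
#include "oneapi/dnnl/dnnl_graph.hpp"

using namespace dnnl::graph;
using dt = logical_tensor::data_type;
using lt = logical_tensor::layout_type;

int main() {
    graph g(dnnl::engine::kind::cpu);

    // Step 1: Q2F conversion subgraph, one Dequantize per quantized input.
    logical_tensor src_u8  {0, dt::u8,  {1, 3, 224, 224}, lt::strided};
    logical_tensor src_f32 {1, dt::f32, {1, 3, 224, 224}, lt::strided};
    op deq_src(0, op::kind::Dequantize, {src_u8}, {src_f32}, "deq_src");
    deq_src.set_attr<std::string>(op::attr::qtype, "per_tensor");
    deq_src.set_attr<std::vector<float>>(op::attr::scales, {0.1f});
    deq_src.set_attr<std::vector<int64_t>>(op::attr::zps, {0});

    logical_tensor wei_s8  {2, dt::s8,  {16, 3, 3, 3}, lt::strided};
    logical_tensor wei_f32 {3, dt::f32, {16, 3, 3, 3}, lt::strided};
    op deq_wei(1, op::kind::Dequantize, {wei_s8}, {wei_f32}, "deq_wei");
    deq_wei.set_attr<std::string>(op::attr::qtype, "per_tensor");
    deq_wei.set_attr<std::vector<float>>(op::attr::scales, {0.05f});
    deq_wei.set_attr<std::vector<int64_t>>(op::attr::zps, {0});

    // Step 3: Convolution with an f32 bias, so no F2F conversion is needed.
    logical_tensor bias_f32 {4, dt::f32, {16}, lt::strided};
    logical_tensor conv_f32 {5, dt::f32, {1, 16, 222, 222}, lt::strided};
    op conv(2, op::kind::Convolution, {src_f32, wei_f32, bias_f32},
            {conv_f32}, "conv");
    conv.set_attr<std::vector<int64_t>>(op::attr::strides, {1, 1});
    conv.set_attr<std::vector<int64_t>>(op::attr::pads_begin, {0, 0});
    conv.set_attr<std::vector<int64_t>>(op::attr::pads_end, {0, 0});
    conv.set_attr<std::vector<int64_t>>(op::attr::dilations, {1, 1});
    conv.set_attr<std::string>(op::attr::data_format, "NCX");
    conv.set_attr<std::string>(op::attr::weights_format, "OIX");
    conv.set_attr<int64_t>(op::attr::groups, 1);

    // Step 4: epilogue subgraph, here a single unary ReLU.
    logical_tensor relu_f32 {6, dt::f32, {1, 16, 222, 222}, lt::strided};
    op relu(3, op::kind::ReLU, {conv_f32}, {relu_f32}, "relu");

    // Step 5: F2Q conversion subgraph, quantizing the result back to u8.
    logical_tensor dst_u8 {7, dt::u8, {1, 16, 222, 222}, lt::strided};
    op quant_dst(4, op::kind::Quantize, {relu_f32}, {dst_u8}, "quant_dst");
    quant_dst.set_attr<std::string>(op::attr::qtype, "per_tensor");
    quant_dst.set_attr<std::vector<float>>(op::attr::scales, {0.2f});
    quant_dst.set_attr<std::vector<int64_t>>(op::attr::zps, {0});

    g.add_op(deq_src);
    g.add_op(deq_wei);
    g.add_op(conv);
    g.add_op(relu);
    g.add_op(quant_dst);
    g.finalize();

    // If the backend matches the pattern, these ops come back fused
    // into a single partition.
    auto parts = g.get_partitions();
    return parts.empty() ? 1 : 0;
}
```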

Data Types

oneDNN supports the following combinations of data types for src, weights, bias and dst:

| src    | weights | bias           | dst                    |
|--------|---------|----------------|------------------------|
| u8, s8 | s8, f32 | f32, bf16, f16 | u8, s8, bf16, f16, f32 |

The definition of the data types and support status on different CPU and GPU platforms follow the general description in the Data Types Guide.
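
For instance, a bf16-compute variant of the pattern could declare its boundary tensors along the following lines; this is a sketch of one supported combination from the table above, with placeholder shapes and tensor ids.

```cpp
#include "oneapi/dnnl/dnnl_graph.hpp"

using dt = dnnl::graph::logical_tensor::data_type;
using lt = dnnl::graph::logical_tensor::layout_type;

// u8 src, s8 weights, bf16 bias, bf16 dst; shapes and ids are illustrative.
dnnl::graph::logical_tensor src  {0, dt::u8,   {1, 3, 224, 224},  lt::strided};
dnnl::graph::logical_tensor wei  {1, dt::s8,   {16, 3, 3, 3},     lt::strided};
dnnl::graph::logical_tensor bias {2, dt::bf16, {16},              lt::strided};
dnnl::graph::logical_tensor dst  {3, dt::bf16, {1, 16, 222, 222}, lt::strided};
```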

Implementation Limitations

  1. The F2F conversion subgraph used for the dst tensor only supports bf16 to f32 data type conversion.
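
In other words, the only dst-side floating-point conversion is a single TypeCast from a bf16 Convolution output to an f32 dst, sketched below with placeholder shapes and ids.

```cpp
#include "oneapi/dnnl/dnnl_graph.hpp"

using namespace dnnl::graph;

// The one F2F conversion allowed on dst: bf16 (Convolution output) -> f32.
logical_tensor conv_out_bf16 {10, logical_tensor::data_type::bf16,
                              {1, 16, 222, 222},
                              logical_tensor::layout_type::strided};
logical_tensor dst_f32 {11, logical_tensor::data_type::f32,
                        {1, 16, 222, 222},
                        logical_tensor::layout_type::strided};
op cast_dst(5, op::kind::TypeCast, {conv_out_bf16}, {dst_f32}, "cast_dst");
```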

Example

oneDNN provides a quantized Convolution example demonstrating how to construct a typical quantized Convolution pattern with the oneDNN Graph API on CPU.