diff --git a/doc/graph/fusion_patterns/fusion_patterns.md b/doc/graph/fusion_patterns/fusion_patterns.md new file mode 100644 index 00000000000..b39374e1571 --- /dev/null +++ b/doc/graph/fusion_patterns/fusion_patterns.md @@ -0,0 +1,107 @@
+Fusion Patterns {#dev_guide_graph_fusion_patterns} +==================================================
+## Overview
+The following fusion patterns are subgraphs that the oneDNN Graph API recognizes as candidates for fusion. The patterns are described using oneDNN Graph operation (op) names with the following convention.
+@note oneDNN Graph performs limited input validation to minimize performance overheads. The application is responsible for sanitizing inputs passed to the library. Large `u8` or `s8` inputs may lead to accumulator overflow; in that case, you can use floating-point patterns instead of quantized patterns.
+`"+"` describes a chain of two ops. The preceding op produces an output tensor, which is consumed by the following op as its first operand.
+`"[]"` describes a component of the overall pattern description. For example, it could include a subgraph or all the op choices within the bracket.
+`"|"` describes a choice among multiple operations. For example, A+[B|C] means the graph partition contains A followed by either B or C.
+`","` describes a graph composed of multiple subgraphs, each of which explicitly marks its output tensor so that it can be consumed by other subgraphs.
+`Superscript` denotes the number of times a pattern repeats. For example, A+[B|C]\f$^{3}\f$ means the graph partition contains A followed by three ops, each of which is either B or C. The superscript can also be a range, allowing a variable number of repetitions. If the range is between 0 and 1, the superscript `"?"` is used.
+`Subscript` denotes the input and output tensors that explicitly mark the producer-consumer relation within one graph partition. For example, A\f$_{>t1}\f$+B+C\f$_{<t1}\f$ refers to a pattern starting with A followed by B and C, where C takes an implicit input tensor from B and an extra tensor t1 output by A. `">"` refers to an output tensor, and `"<"` to an input tensor. Input and output tensors between neighboring ops are not explicitly marked; for example, B consumes t1 implicitly in the example above.
+Subscript `"out"` marks the output tensor of a certain op as an output of a graph partition. For example, in A\f$_{>t1}\f$+B\f$_{>out}\f$+C\f$_{<t1,>out}\f$, B's output and C's output are marked as output tensors.
+Subscript `"in"` marks the input tensor of a certain op as an input of a graph partition. For example, in A\f$_{<in1}\f$+B, A's input tensor in1 is an input of the graph partition.
+In the tables below, Unary denotes an element-wise unary operation (for example, ReLU, GELU, or Sigmoid), Binary denotes a binary operation (for example, Add, Multiply, or Divide), and Reduction denotes a reduction operation (for example, ReduceSum, ReduceMean, or ReduceMax).
+### Inference
+#### Floating-point Patterns
+| Pattern | Description |
+|:--------|:-----------------------------|
+| Convolution + BiasAdd\f$^?\f$ + BatchNormInference\f$^?\f$ + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Convolutional Neural Networks, for example ResNet, ResNeXt, SSD, etc. |
+| ConvTranspose + BiasAdd\f$^?\f$ + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Generative Adversarial Networks. |
+| Interpolate + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used for image processing. |
+| MatMul + BiasAdd\f$^?\f$ + [Unary \| Binary]\f$^{0-3}\f$ + Select\f$^?\f$\f$_{>out}\f$ | This pattern is widely used in language models and recommendation models, for example BERT, DLRM, etc. |
+| Reduction + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used for data processing, for example loss reduction. |
+| Unary + Binary\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Convolutional Neural Networks. |
+| Binary + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Generative Adversarial Networks, for example ParallelWaveGAN. |
+| [AvgPool \| MaxPool] + Binary\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Convolutional Neural Networks. |
+| BatchNormInference + ReLU\f$_{>out}\f$ | This pattern is widely used in Convolutional Neural Networks, for example DenseNet. |
+| Reciprocal + Multiply\f$_{>out}\f$ | N/A |
+| Reorder + Add\f$_{>out}\f$ | N/A |
+#### Quantized Patterns
+| Pattern | Description |
+|:--------|:-----------------------------|
+| Quantize\f$^?\f$ + Dequantize\f$_{>t1}\f$, Dequantize\f$_{>t2}\f$\f$^{0-3}\f$, Dequantize + Convolution\f$_{<t1}\f$ + BiasAdd\f$^?\f$ + [Unary \| Binary\f$_{<t2}\f$]\f$^{0-3}\f$ + Quantize\f$^?\f$\f$_{>out}\f$ | N/A |
+| Quantize\f$^?\f$ + Dequantize\f$_{>t1}\f$, Dequantize\f$_{>t2}\f$\f$^{0-3}\f$, Dequantize + ConvTranspose\f$_{<t1}\f$ + BiasAdd\f$^?\f$ + [Unary \| Binary\f$_{<t2}\f$]\f$^{0-3}\f$ + Quantize\f$^?\f$\f$_{>out}\f$ | N/A |
+| Quantize\f$^?\f$ + Dequantize\f$_{>t1}\f$, Dequantize\f$_{>t2}\f$\f$^{0-3}\f$, Dequantize + MatMul\f$_{<t1}\f$ + BiasAdd\f$^?\f$ + [Unary \| Binary\f$_{<t2}\f$]\f$^{0-3}\f$ + Quantize\f$^?\f$\f$_{>out}\f$ | N/A |
+| Dequantize + [AvgPool \| MaxPool] + Quantize\f$_{>out}\f$ | N/A |
+| Dequantize\f$_{>t1}\f$, Dequantize + [AvgPool \| MaxPool] + Add\f$_{<t1}\f$ + Quantize\f$_{>out}\f$ | N/A |
+| Dequantize + Reorder + Quantize\f$_{>out}\f$ | N/A |
+| Dequantize\f$_{>t1}\f$, Dequantize + Reorder + Add\f$_{<t1}\f$ + Quantize\f$_{>out}\f$ | N/A |
+| [SoftMax \| LayerNorm \| GroupNorm] + [Unary \| Binary]\f$^{0-3}\f$ + Quantize\f$_{>out}\f$ | This pattern is used in SmoothQuant to fuse scales and quantization into previous layers. |
+### Training
+| Pattern | Description |
+|:--------|:-----------------------------|
+| ConvolutionBackwardWeights + BiasAddBackward\f$_{>out}\f$ | N/A |
+| ReLUBackward + BatchNormTrainingBackward\f$_{>out}\f$ | N/A |
diff --git a/doc/graph/fusion_patterns/gated_mlp.md b/doc/graph/fusion_patterns/gated_mlp.md new file mode 100644 index 00000000000..1ee9e158af6 --- /dev/null +++ b/doc/graph/fusion_patterns/gated_mlp.md @@ -0,0 +1,123 @@
+Gated Multi-Layer Perceptron (Gated-MLP) {#dev_guide_graph_gated_mlp} +=====================================================================
+## Overview
+Gated Multi-Layer Perceptron (Gated-MLP) is a variant of MLP that is widely used as the Feed Forward Network (FFN) in many Transformer-based Large Language Models (LLMs).
+Typically, the FFN in the Transformer architecture [1] is defined as a two-layer MLP with a ReLU activation in between, which can be replaced with other activations.
+\f[ FFN(src,W,V) = ReLU(src \cdot W) \cdot V \f]
+A Gated Linear Unit (GLU) is adopted to replace the first linear layer to improve the quality of Transformer-based models [2]:
+\f[ GLU(src,W_1,W_2) = (src \cdot W_1) \otimes Sigmoid(src \cdot W_2) \\ FFN(src,W_1,W_2,V) = GLU(src,W_1,W_2) \cdot V \f]
+where \f$ src \cdot W_1 \f$ is usually called "FC (fully-connected) up", \f$ src \cdot W_2 \f$ is called "FC gate", and the last linear layer is called "FC down".
+The Swish activation is further adopted to replace Sigmoid in the GLU, forming swiGLU:
+\f[ Swish(x) = x \otimes Sigmoid(x) \\ swiGLU(src,W_1,W_2) = (src \cdot W_1) \otimes Swish(src \cdot W_2) \\ FFN(src,W_1,W_2,V) = swiGLU(src,W_1,W_2) \cdot V \f]
+The Gated-MLP based on swiGLU is also adopted in LLMs such as LLaMA [3] and Qwen [4].
+## Gated-MLP Patterns
+oneDNN supports Gated-MLP and its optimization through the Graph API [5] by defining the graph, getting partitions from the graph, and optimizing the kernels underneath. In general, a Gated-MLP pattern is defined as a directed acyclic graph (DAG) using oneDNN Graph API.
+### Floating-point Gated-MLP
+oneDNN defines the floating-point (f32, bf16, and f16) Gated-MLP pattern as shown in the figure below. The blue nodes are required when defining a Gated-MLP pattern, while the brown nodes are optional.
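+As an illustration of how such a pattern can be assembled in practice, the following is a condensed sketch using the oneDNN Graph C++ API. It is only a sketch: the tensor ids, shapes, and variable names are illustrative placeholders, and the complete, runnable program is the Gated-MLP example linked in the Examples section below.
+```cpp
+// Sketch: building the Gated-MLP (swiGLU) graph with the oneDNN Graph C++ API.
+// Tensor ids, shapes, and names below are illustrative placeholders.
+#include <vector>
+#include "oneapi/dnnl/dnnl.hpp"
+#include "oneapi/dnnl/dnnl_graph.hpp"
+
+using namespace dnnl::graph;
+
+int main() {
+    using dt = logical_tensor::data_type;
+    using lt = logical_tensor::layout_type;
+
+    const std::vector<int64_t> src_dims {1, 32, 4096}; // (batch, sequence, hidden)
+    const std::vector<int64_t> w_dims {4096, 11008};   // hidden -> intermediate
+    const std::vector<int64_t> up_dims {1, 32, 11008};
+    const std::vector<int64_t> v_dims {11008, 4096};   // intermediate -> hidden
+
+    // Logical tensors: unique id, data type, shape, and layout.
+    logical_tensor src(0, dt::bf16, src_dims, lt::strided);
+    logical_tensor w_up(1, dt::bf16, w_dims, lt::strided);
+    logical_tensor w_gate(2, dt::bf16, w_dims, lt::strided);
+    logical_tensor w_down(3, dt::bf16, v_dims, lt::strided);
+    logical_tensor fc_up(4, dt::bf16, up_dims, lt::strided);
+    logical_tensor fc_gate(5, dt::bf16, up_dims, lt::strided);
+    logical_tensor gate_sig(6, dt::bf16, up_dims, lt::strided);
+    logical_tensor gate_swish(7, dt::bf16, up_dims, lt::strided);
+    logical_tensor glu(8, dt::bf16, up_dims, lt::strided);
+    logical_tensor dst(9, dt::bf16, src_dims, lt::strided);
+
+    // FC up: src x W1.
+    op mm_up(0, op::kind::MatMul, "fc_up");
+    mm_up.add_inputs({src, w_up});
+    mm_up.add_outputs({fc_up});
+
+    // FC gate: src x W2, followed by Swish(x) = x * Sigmoid(x).
+    op mm_gate(1, op::kind::MatMul, "fc_gate");
+    mm_gate.add_inputs({src, w_gate});
+    mm_gate.add_outputs({fc_gate});
+
+    op sig(2, op::kind::Sigmoid, "sigmoid");
+    sig.add_inputs({fc_gate});
+    sig.add_outputs({gate_sig});
+
+    op swish_mul(3, op::kind::Multiply, "swish_mul");
+    swish_mul.add_inputs({fc_gate, gate_sig});
+    swish_mul.add_outputs({gate_swish});
+
+    // GLU: element-wise product of the FC up output and the activated gate.
+    op glu_mul(4, op::kind::Multiply, "glu_mul");
+    glu_mul.add_inputs({fc_up, gate_swish});
+    glu_mul.add_outputs({glu});
+
+    // FC down: GLU output x V.
+    op mm_down(5, op::kind::MatMul, "fc_down");
+    mm_down.add_inputs({glu, w_down});
+    mm_down.add_outputs({dst});
+
+    // Add the ops to a graph, finalize it, and query fusible partitions.
+    graph g(dnnl::engine::kind::cpu);
+    for (const op *o : {&mm_up, &mm_gate, &sig, &swish_mul, &glu_mul, &mm_down})
+        g.add_op(*o);
+    g.finalize();
+    auto partitions = g.get_partitions();
+    return partitions.empty() ? 1 : 0;
+}
+```
+The figure below shows the corresponding pattern definition.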
+![Gated-MLP pattern](images/fp-gated-mlp.png)
+1. The first MatMul on the top left calculates "FC up": \f$ src \cdot W_1 \f$. See the [MatMul](@ref dev_guide_op_matmul) operation in Graph API.
+2. The second MatMul on the top right calculates "FC gate": \f$ src \cdot W_2 \f$.
+3. The Activation node is optional. If required, it can be constructed with the activation operations in Graph API, for example, [ReLU](@ref dev_guide_op_relu), [GELU](@ref dev_guide_op_gelu), [Sigmoid](@ref dev_guide_op_sigmoid), and so on. For the Swish activation, the node can be constructed with the [Sigmoid](@ref dev_guide_op_sigmoid) and [Multiply](@ref dev_guide_op_multiply) operations as shown below. You can also refer to the [Gated-MLP example](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph/gated_mlp.cpp) for the Swish definition.
+   ![Swish Activation](images/gated-mlp-swish.png)
+4. The last MatMul on the bottom performs the "FC down" operation between the GLU output and \f$V\f$.
+## Data Types
+oneDNN supports the floating-point Gated-MLP pattern with data types f32, bf16, and f16. You can specify the data type via the input and output data type fields of logical tensors for each operation. oneDNN does not support mixing different floating-point data types in a floating-point Gated-MLP pattern.
+The definition of the data types and support status on different CPU and GPU platforms follow the general description in @ref dev_guide_data_types.
+## Implementation Limitations
+1. oneDNN primitive-based Gated-MLP is implemented as the reference implementation on both Intel Architecture Processors and Intel Graphics Products. In this case, floating-point Gated-MLP patterns are usually implemented with three f32, bf16, or f16 matmul (with binary or eltwise post-ops) primitives.
+2. The Gated-MLP patterns functionally support all input shapes meeting the shape requirements of each operation in the graph. For example, the `MatMul` operation requires shape consistency for the `k` dimension. The `Multiply` operation requires the input tensors to have the same shape, or shapes that can be properly broadcast based on the operation attribute.
+## Examples
+oneDNN provides a [Gated-MLP example](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph/gated_mlp.cpp) demonstrating how to construct a typical floating-point Gated-MLP pattern with oneDNN Graph API on CPU and GPU with different runtimes.
+For applications where the weights of FC up and FC gate are combined into a single tensor, oneDNN also provides an [example](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph/gated_mlp_wei_combined.cpp) demonstrating how to create the weight tensors for the pattern using offsets and strides into the combined weight tensor.
+## References
+1. Attention is all you need, https://arxiv.org/abs/1706.03762v7
+2. GLU Variants Improve Transformer, https://arxiv.org/abs/2002.05202
+3. LLaMA: Open and Efficient Foundation Language Models, https://arxiv.org/abs/2302.13971
+4. Qwen Technical Report, https://arxiv.org/abs/2309.16609
+5. oneDNN Graph API documentation, https://oneapi-src.github.io/oneDNN/graph_extension.html
diff --git a/doc/graph/fusion_patterns/gqa.md b/doc/graph/fusion_patterns/gqa.md new file mode 100644 index 00000000000..ea48cb6c1ea --- /dev/null +++ b/doc/graph/fusion_patterns/gqa.md @@ -0,0 +1,106 @@
+Grouped Query Attention (GQA) {#dev_guide_graph_gqa} +====================================================
+## Overview
+In a typical Scaled Dot-Product Attention (SDPA) [1], the input Query, Key, and Value tensors have the same head number. Loading the Key and Value tensors in each generation step becomes a performance bottleneck, especially as the sequence length grows.
+To reduce the memory bandwidth overhead of loading the Key and Value tensors, Multi-Query Attention (MQA) [2] reduces the head number of the Key and Value tensors to one, which means multiple Query heads map to the same single Key and Value tensor. However, MQA may lead to model quality degradation and training instability. Therefore, Grouped-Query Attention (GQA) [3], an interpolation between the typical SDPA and MQA, is proposed, with a single Key and Value head shared by each subgroup of Query heads. The head number of Key and Value equals the number of Query head groups.
+The notation used in this document:
+- N: the mini-batch size.
+- H_q: the head number of Query.
+- H_kv: the head number of Key or Value.
+- N_rep: H_q / H_kv, which indicates how many Query heads are mapped to one Key head.
+- S: the sequence length.
+- D: the size of each head.
+## GQA Pattern
+Similar to how SDPA is supported, the GQA pattern is also defined as a directed acyclic graph (DAG) using oneDNN Graph API. oneDNN extends the [SDPA pattern](@ref dev_guide_graph_sdpa) to support floating-point (f32, bf16, and f16) GQA as follows. The blue nodes are required when defining a GQA pattern, while the brown nodes are optional.
+![GQA pattern](images/gqa.png)
+Compared to a typical SDPA pattern, there are a few differences in the GQA pattern:
+1. The input Query has shape (N, H_q, S, D). It is reshaped to (N, H_kv, N_rep, S, D) by splitting the H_q dimension into H_kv and N_rep. The reshaping can be constructed using the [StaticReshape](@ref dev_guide_op_staticreshape) operation in Graph API.
+2. Similarly, the input Key and Value have shape (N, H_kv, S, D). They are reshaped to (N, H_kv, 1, S, D) to meet the input shape requirement of the [MatMul](@ref dev_guide_op_matmul) operation.
+3. The second MatMul calculates the dot products between the probabilities produced by SoftMax and the Value tensor and generates output with shape (N, H_kv, N_rep, S, D).
+4. Another StaticReshape operation is applied to the output of the second MatMul to convert the shape into (N, H_q, S, D) by combining the H_kv and N_rep dimensions.
+5. The input scale factor and mask in the pattern also need to meet the operations' shape requirements, which can similarly be achieved through StaticReshape. Apart from that, they have the same definition as described in the typical SDPA pattern.
+## Data Types
+oneDNN supports the floating-point GQA pattern with data types f32, bf16, and f16. You can specify the data type via the input and output data type fields of logical tensors for each operation, as in the sketch below. oneDNN does not support mixing different floating-point data types in a floating-point GQA pattern.
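+For instance, a minimal sketch of declaring the GQA inputs with a uniform `f16` data type, together with the Query reshape described above, could look as follows. The ids, names, and dimension values are illustrative; see the GQA example linked below for the complete program.
+```cpp
+// Sketch: f16 GQA input declaration and the StaticReshape of Query.
+// All ids, names, and dimension values are illustrative placeholders.
+#include <vector>
+#include "oneapi/dnnl/dnnl.hpp"
+#include "oneapi/dnnl/dnnl_graph.hpp"
+
+using namespace dnnl::graph;
+
+int main() {
+    using dt = logical_tensor::data_type;
+    using lt = logical_tensor::layout_type;
+
+    const int64_t N = 1, H_q = 32, H_kv = 8, S = 128, D = 128;
+    const int64_t N_rep = H_q / H_kv; // Query heads mapped to one Key/Value head
+
+    // All tensors of a floating-point GQA pattern use one data type (f16 here).
+    logical_tensor query(0, dt::f16, {N, H_q, S, D}, lt::strided);
+    logical_tensor key(1, dt::f16, {N, H_kv, S, D}, lt::strided);
+    logical_tensor value(2, dt::f16, {N, H_kv, S, D}, lt::strided);
+
+    // Query reshaped to (N, H_kv, N_rep, S, D) so that it can be multiplied
+    // against the Key and Value tensors reshaped to (N, H_kv, 1, S, D).
+    logical_tensor query_5d(3, dt::f16, {N, H_kv, N_rep, S, D}, lt::strided);
+
+    op reshape_q(0, op::kind::StaticReshape, "reshape_query");
+    reshape_q.set_attr<std::vector<int64_t>>(op::attr::shape, {N, H_kv, N_rep, S, D});
+    reshape_q.set_attr<bool>(op::attr::special_zero, false);
+    reshape_q.add_inputs({query});
+    reshape_q.add_outputs({query_5d});
+
+    graph g(dnnl::engine::kind::cpu);
+    g.add_op(reshape_q);
+    // ... add the remaining StaticReshape, MatMul, and SoftMax ops here ...
+    g.finalize();
+    return g.get_partitions().empty() ? 1 : 0;
+}
+```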
+The definition of the data types and support status on different CPU and GPU platforms follow the general description in @ref dev_guide_data_types.
+## Implementation Limitations
+1. oneDNN primitive-based GQA is implemented as the reference implementation on both Intel Architecture Processors and Intel Graphics Products. The reference implementation requires memory to store the intermediate results of the dot products between Query and Key, which takes \f$O(S^2)\f$ memory. It may lead to an out-of-memory error when computing inputs with long sequence lengths on platforms with limited memory.
+2. The GQA patterns functionally support all input shapes meeting the shape requirements of each operation in the graph.
+3. CPU
+   - Optimized implementation is available for 4D Q/K/V tensors with shape defined as (N, H_q, S, D) for Query and (N, H_kv, S, D) for Key and Value.
+   - Optimized implementation is available for OpenMP runtime and Threadpool runtime on Intel Architecture Processors.
+   - Specifically, for the OpenMP runtime, the optimized implementation requires `N * H_q > 2 * thread number` to get enough parallelism.
+4. GPU
+   - Optimized implementation is available for 4D Q/K/V tensors with shape defined as (N, H_q, S, D) for Query and (N, H_kv, S, D) for Key and Value.
+   - Optimized implementation is available for floating-point GQA with `f16` data type and `D <= 256` on Intel Graphics Products with Intel(R) Xe Matrix Extensions (Intel(R) XMX) support.
+## Example
+oneDNN provides a [GQA example](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph/gqa.cpp) demonstrating how to construct a floating-point GQA pattern with oneDNN Graph API on CPU and GPU with different runtimes.
+## References
+[1] Attention is all you need, https://arxiv.org/abs/1706.03762v7
+[2] Fast Transformer Decoding: One Write-Head is All You Need, https://arxiv.org/abs/1911.02150
+[3] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, https://arxiv.org/abs/2305.13245
diff --git a/doc/graph/fusion_patterns/images/fp-gated-mlp.png b/doc/graph/fusion_patterns/images/fp-gated-mlp.png new file mode 100644 index 00000000000..a52952ce87b Binary files /dev/null and b/doc/graph/fusion_patterns/images/fp-gated-mlp.png differ
diff --git a/doc/graph/fusion_patterns/images/gated-mlp-swish.png b/doc/graph/fusion_patterns/images/gated-mlp-swish.png new file mode 100644 index 00000000000..2050ee8d871 Binary files /dev/null and b/doc/graph/fusion_patterns/images/gated-mlp-swish.png differ
diff --git a/doc/graph/fusion_patterns/images/gqa.png b/doc/graph/fusion_patterns/images/gqa.png new file mode 100644 index 00000000000..0871903bcda Binary files /dev/null and b/doc/graph/fusion_patterns/images/gqa.png differ
diff --git a/doc/graph/images/sdpa-mask-1.png b/doc/graph/fusion_patterns/images/sdpa-mask-1.png similarity index 100% rename from doc/graph/images/sdpa-mask-1.png rename to doc/graph/fusion_patterns/images/sdpa-mask-1.png
diff --git a/doc/graph/images/sdpa-mask-2.png b/doc/graph/fusion_patterns/images/sdpa-mask-2.png similarity index 100% rename from doc/graph/images/sdpa-mask-2.png rename to doc/graph/fusion_patterns/images/sdpa-mask-2.png
diff --git a/doc/graph/images/sdpa-reorder.png b/doc/graph/fusion_patterns/images/sdpa-reorder.png similarity index 100% rename from doc/graph/images/sdpa-reorder.png rename to doc/graph/fusion_patterns/images/sdpa-reorder.png
diff --git a/doc/graph/images/sdpa.png b/doc/graph/fusion_patterns/images/sdpa.png
similarity index 100% rename from doc/graph/images/sdpa.png rename to doc/graph/fusion_patterns/images/sdpa.png
diff --git a/doc/graph/sdpa.md b/doc/graph/fusion_patterns/sdpa.md similarity index 76% rename from doc/graph/sdpa.md rename to doc/graph/fusion_patterns/sdpa.md index 1b0864a5c76..bead0f80974 100644 --- a/doc/graph/sdpa.md +++ b/doc/graph/fusion_patterns/sdpa.md
@@ -1,9 +1,9 @@ Scaled Dot-Product Attention (SDPA) {#dev_guide_graph_sdpa} ===========================================================
-## Background
+## Overview
-Scaled Dot-Product Attention (SDPA) was introduced in [1] as the core operation
+Scaled Dot-Product Attention (SDPA) is introduced in [1] as the core operation
of Transformer block which now becomes the backbone of many language models and generative models (BERT, Stable Diffusion, GPT, etc.).
@@ -30,9 +30,9 @@ SDPA graph, getting partition from the graph, and optimizing the kernels underneath. In general, an SDPA pattern is defined as a directional acyclic graph (DAG) using oneDNN Graph API.
-### Floating point SDPA
+### Floating-point SDPA
-oneDNN defines floating point (f32, bf16, or f16) SDPA as follows. The blue
+oneDNN defines floating-point (f32, bf16, or f16) SDPA as follows. The blue
nodes are required when defining an SDPA pattern while the brown parts are optional.
@@ -74,12 +74,12 @@ optional. ![SDPA-Reorder](images/sdpa-reorder.png)
-## Data types
+## Data Types
-oneDNN supports the floating point SDPA pattern with data types f32, bf16, and -f16. oneDNN users can specify the data type via the input and output logical -tensors' data type fields for each operation. oneDNN does not support mixing -different floating data types in a floating point SDPA pattern.
+oneDNN supports the floating-point SDPA pattern with data types f32, bf16, and +f16. You can specify the data type via the input and output logical tensors' +data type fields for each operation. oneDNN does not support mixing different +floating-point data types in a floating-point SDPA pattern.
oneDNN supports the quantized SDPA pattern with int8-f32 mixed precision, int8-bf16 mixed precision, and int8-f16 mixed precision data types.
@@ -91,10 +91,13 @@ platforms follow the general description in @ref dev_guide_data_types. 1. oneDNN primitive-based SDPA is implemented as the reference implementation on both Intel Architecture Processors and Intel Graphics Products. In this case,
- floating point SDPA patterns are usually implemented with f32/bf16/f16 matmul - (with post-ops) and softmax primitives, while quantized SDPA patterns are - implemented with int8 matmul (with post-ops) and f32/bf16/f16 softmax - primitives.
+ floating-point SDPA patterns are usually implemented with f32, bf16, or f16 + matmul (with post-ops) and softmax primitives, while quantized SDPA patterns + are implemented with int8 matmul (with post-ops) and f32, bf16, or f16 + softmax primitives. The reference implementation requires memory to store the + intermediate results of the dot products between Query and Key, which takes + \f$O(S^2)\f$ memory. It may lead to an out-of-memory error when computing inputs + with long sequence lengths on platforms with limited memory.
2. The SDPA patterns functionally supports all input shapes meeting the shape requirements of each operation in the graph. For example, Add, Multiply, Divide, and Select operations require the input tensors to have the same
@@ -110,7 +113,7 @@ platforms follow the general description in @ref dev_guide_data_types. 4.
GPU - Optimized implementation is available for 4D Q/K/V tensors with shape defined as (N, H, S, D).
- - Optimized implementation is available for floating point SDPA with `f16`
+ - Optimized implementation is available for floating-point SDPA with `f16`
data type and `D <= 256` on Intel Graphics Products with Intel(R) Xe Matrix Extensions (Intel(R) XMX) support.
@@ -118,11 +121,19 @@ platforms follow the general description in @ref dev_guide_data_types. oneDNN provides an [SDPA example](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph/sdpa.cpp)
-demonstrating how to construct a typical floating point SDPA pattern with oneDNN
+demonstrating how to construct a typical floating-point SDPA pattern with oneDNN
Graph API on CPU and GPU with different runtimes.
+oneDNN also provides an [MQA (Multi-Query Attention) example](https://github.com/oneapi-src/oneDNN/tree/main/examples/graph/mqa.cpp) [3] +demonstrating how to construct a floating-point MQA pattern with the same +pattern structure as in the SDPA example but a different head number for the Key and +Value tensors. In MQA, the head number of Key and Value is always one. +
## References [1] Attention is all you need, https://arxiv.org/abs/1706.03762v7 [2] oneDNN Graph API documentation, https://oneapi-src.github.io/oneDNN/graph_extension.html + +[3] Fast Transformer Decoding: One Write-Head is All You Need, https://arxiv.org/abs/1911.02150
diff --git a/doc/graph/supported_patterns.md b/doc/graph/supported_patterns.md deleted file mode 100644 index 6118a088929..00000000000 --- a/doc/graph/supported_patterns.md +++ /dev/null @@ -1,159 +0,0 @@
-Supported Fusion Patterns {#dev_guide_graph_fusion_patterns} -============================================================ - -@anchor fusion_patterns -## Fusion Patterns - -The following fusion patterns are subgraphs that the oneDNN Graph API recognizes -as candidate for fusion. The patterns are described using oneDNN Graph -operation (op) names with the following convention. - -@note oneDNN Graph performs limited input validation to minimize the performance -overheads. The application is responsible for sanitizing inputs passed to the -library. For large u8 or s8 inputs may lead to accumulator overflow, you can use -floating point patterns instead of quantized patterns. - -`"+"` describes a chain of two ops. The preceding op produces an output tensor, -which is consumed by the following op as its first operand. - -`"[]"` describes a component of the overall pattern description. For example, -it could include a subgraph or all the op choices within the bracket. - -`"|"` describes choices of multiple operations, say A+[B|C] means the graph -partition contains A followed by B or C. - -`","` describes a graph composed of multiple subgraphs, each subgraph marks its -output tensor explicitly, which is consumed by other subgraphs. - -`Superscript` denotes the numbers of repetition pattern. For example, -A+[B|C]\f$^{3}\f$ means the graph partition contains A followed by three ops, -each of them is either B or C. The superscript could be a range of number -meaning allowing a range of repetition. If the range is between 0 and 1, we use -superscript `"?"`. - -`Subscript` denotes the input and output tensors which need to explicitly mark -the producer and consumer relation within one graph partition. For example, -A\f$_{>t1}\f$+B+C\f$_{"` refers to the output -tensor, and `"<"` for input tensor.
Input and output tensor between neighbor -ops are not explicitly marked, for example, B consumes t1 implicitly in the -example above. - -Subscript `"out"` marks the output tensor of a certain op to be the output of -a graph partition. For example, in -A\f$_{>t1}\f$+B\f$_{>out}\f$+C\f$_{out}\f$, B's output and C's output -are marked as output tensors. - -Subscript `"in"` marks the input tensor of a certain op to be the input of a -graph partition. For example, in A\f$_{t1}\f$+B+C\f$_{out}\f$ | This pattern is widely used in Convolution Neural Networks, for example ResNet, ResNext, SSD, etc. | -| ConvTranspose + BiasAdd\f$^?\f$ + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Generative Adversarial Networks. | -| Interpolate + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used for image processing. | -| MatMul + BiasAdd\f$^?\f$ + [Unary \| Binary]\f$^{0-3}\f$ + Select\f$^?\f$\f$_{>out}\f$ | This pattern is widely used in language models and recommendation models, for example BERT, DLRM, etc. | -| Reduction + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used for data processing, for example loss reduction. | -| Unary + Binary\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Convolution Neural Networks. | -| Binary + [Unary \| Binary]\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Generative Adversarial Networks, for example ParallelWaveGAN. | -| [AvgPool \| MaxPool] + Binary\f$^{0-3}\f$\f$_{>out}\f$ | This pattern is widely used in Convolution Neural Networks. | -| BatchNormInference + ReLU\f$_{>out}\f$ | This pattern is widely used in Convolution Neural Networks, for example DenseNet. | -| Reciprocal + Multiply\f$_{>out}\f$ | N/A | -| Reorder + Add\f$_{>out}\f$ | N/A | -| Scaled Dot-Product Attention | Refer to @ref dev_guide_graph_sdpa for more details. | - -#### Quantized Patterns - -| Pattern | Description | -|:--------|:-----------------------------| -| Quantize\f$^?\f$ + Dequantize\f$_{>t1}\f$, Dequantize\f$_{>t2}\f$\f$^{0-3}\f$, Dequantize + Convolution\f$_{out}\f$ | N/A | -| Quantize\f$^?\f$ + Dequantize\f$_{>t1}\f$, Dequantize\f$_{>t2}\f$\f$^{0-3}\f$, Dequantize + ConvTranspose\f$_{out}\f$ |N/A | -| Quantize\f$^?\f$ + Dequantize\f$_{>t1}\f$, Dequantize\f$_{>t2}\f$\f$^{0-3}\f$, Dequantize + MatMul\f$_{out}\f$ |N/A | -| Dequantize + [AvgPool \| MaxPool] + Quantize\f$_{>out}\f$ |N/A | -| Dequantize\f$_{>t1}\f$, Dequantize + [AvgPool \| MaxPool] + Add\f$_{out}\f$ |N/A | -| Dequantize + Reorder + Quantize\f$_{>out}\f$ |N/A | -| Dequantize\f$_{>t1}\f$, Dequantize + Reorder + Add\f$_{out}\f$ |N/A | -| [SoftMax \| LayerNorm \| GroupNorm] + [Unary \| Binary\f$_{out}\f$ | This pattern is used in SmoothQuant to fuse scales and quantization into previous layers | - -### Training - -| Pattern | Description | -|:--------|:-----------------------------| -| ConvolutionBackwardWeights + BiasAddBackward\f$_{>out}\f$ | N/A | -| ReLUBackward + BatchNormTrainingBackward\f$_{>out}\f$ |N/A | - -All the above fusion patterns are supported by default. - -## Aggressive Fusion Patterns -Aggressive fusion patterns also follow the pattern description convention -defined in the [Fusion Patterns](@ref fusion_patterns) section. - -@note Aggressive fusion patterns are only supported when -[Graph Compiler](@ref dev_guide_graph_compiler) is enabled. - -The following categories will also be used to describe aggressive fusion -patterns. 
- -- ReshapeTranspose = [StaticReshape + StaticTranspose\f$^{1-2}\f$] - -- Activation = [ReLU \| Sigmoid \| GELU] - -- ActivationBackward = [ReLUBackward \| SigmoidBackward \| GELUBackward] - -### Inference - -#### Floating Point Patterns - -| Pattern | Description | -|:--------|:-----------------------------| -| MatMul + [Multiply \| Divide] + Add + Softmax + MatMul + StaticTranspose + Reorder\f$_{>out}\f$ | Multi-head Attention. This pattern is widely used in models containing encoder-decoder structures, for example BERT. | -| ReshapeTranspose\f$_{>t1}\f$, ReshapeTranspose\f$_{>t2}\f$, ReshapeTranspose + MatMul\f$_{out}\f$ | Multi-head Attention. | -| MatMul + Activation\f$_{>t1}\f$, [MatMul\f$_{t1}\f$]\f$^{0-4}\f$, MatMul\f$_{out}\f$ | Multi-layer Perceptron. This pattern is widely used in recommendation models, for example DLRM. | -| [Convolution + BiasAdd\f$^{?}\f$ + ReLU]\f$^{1-3}\f$ + Convolution + BiasAdd\f$^{?}\f$ + Add + ReLU\f$_{>out}\f$ | Identical Bottleneck. Enabled only in single thread runtime scenario. This pattern is widely used in Convolution Neural Networks, for example ResNet. | -| Convolution + BiasAdd\f$^{?}\f$\f$_{>t1}\f$, [Convolution + BiasAdd\f$^{?}\f$ + ReLU]\f$^{1-3}\f$ + Convolution + BiasAdd\f$^{?}\f$ + Add\f$_{out}\f$ | Convolutional Bottleneck. Enabled only in single thread runtime scenario. This pattern is widely used in Convolution Neural Networks, for example ResNet. | - -#### Quantized Patterns - -| Pattern | Description | -|:--------|:-----------------------------| -| Dequantize\f$_{>t1}\f$, Dequantize\f$_{>t2}\f$, Dequantize + MatMul\f$_{out}\f$ | Quantized Multi-head Attention. | -| Dequantize + ReshapeTranspose\f$_{>t1}\f$, Dequantize + ReshapeTranspose\f$_{>t2}\f$, Dequantize + MatMul\f$_{out}\f$ | Quantized Multi-head Attention. | -| Dequantize\f$_{>t1}\f$, Dequantize + MatMul\f$_{t2}\f$, [Dequantize\f$_{>t3}\f$, Dequantize\f$_{t2}\f$]\f$^{0-4}\f$, Dequantize\f$_{>t4}\f$, Dequantize\f$_{out}\f$ | Quantized Multi-layer Perceptron. | -| Dequantize\f$_{>t2}\f$, Dequantize\f$_{>t3}\f$, [Dequantize\f$_{>t1}\f$, Dequantize + Convolution\f$_{out}\f$ | Quantized Identical Bottleneck. Enabled only in single thread runtime scenario. | -| [Dequantize\f$_{>t1}\f$, Dequantize + Convolution\f$_{t2}\f$, Dequantize\f$_{>t4}\f$, [Dequantize\f$_{>t3}\f$, Dequantize + Convolution\f$_{out}\f$ | Quantized Convolutional Bottleneck. Enabled only in single thread runtime scenario. | - -### Training - -| Pattern | Description | -|:--------|:-----------------------------| -| Dequantize\f$_{>t1}\f$, Dequantize\f$_{>t2}\f$, Dequantize + MatMul\f$_{out}\f$ | Multi-head Attention Training Forward Pattern. | -| StaticReshape + StaticTranspose\f$_{>t1}\f$ + MatMul + Multiply\f$_{>t2}\f$ + Subtract\f$_{t4}\f$ + MatMul\f$_{>out1}\f$, Multiply\f$_{t3}\f$, MatMul\f$_{out2}\f$, MatMul\f$_{out3}\f$ | Multi-head Attention Training Backward Pattern. | -| MatMul\f$_{>out1}\f$ + Activation\f$_{>t1,>out2}\f$, [MatMul\f$_{out3}\f$ + Activation\f$_{>t1,>out4}\f$]\f$^{0-4}\f$, MatMul\f$_{out5}\f$ + Activation\f$_{>out6}\f$ | Multi-layer Perceptron Training Forward Pattern. 
| -| StaticTranspose\f$^{?}\f$\f$_{>t0}\f$, ActivationBackward\f$_{>t2}\f$ + MatMul\f$_{t1}\f$, ReduceSum\f$^{?}\f$\f$_{out1}\f$, StaticTranspose\f$^{?}\f$ + MatMul\f$_{out2}\f$, [StaticTranspose\f$^{?}\f$\f$_{>t3}\f$, ActivationBackward\f$_{>t4,t1}\f$, ReduceSum\f$^{?}\f$\f$_{out3}\f$, StaticTranspose\f$^{?}\f$ + MatMul\f$_{out4}\f$]\f$^{0-4}\f$, StaticTranspose\f$^{?}\f$\f$_{>t5}\f$, ActivationBackward\f$_{>t6,out5}\f$, ReduceSum\f$^{?}\f$\f$_{out6}\f$, StaticTranspose\f$^{?}\f$ + MatMul\f$_{out7}\f$ | Multi-layer Perceptron Training Backward Pattern. | -| Convolution\f$_{>out1}\f$ + BatchNormForwardTraining\f$_{>out2}\f$ + ReLU\f$_{>out3}\f$ + Convolution\f$_{>out4}\f$ + BatchNormForwardTraining\f$_{>out5}\f$ + ReLU\f$_{>out6}\f$ + Convolution\f$_{>out7}\f$ + BatchNormForwardTraining\f$_{>out8}\f$ + Add + ReLU\f$_{>out9}\f$ | Identical Bottleneck Training Forward Pattern. | -| Convolution\f$_{>out1}\f$ + BatchNormForwardTraining\f$_{>t1,>out2}\f$, Convolution\f$_{>out3}\f$ + BatchNormForwardTraining\f$_{>out4}\f$ + ReLU\f$_{>out5}\f$ + Convolution\f$_{>out6}\f$ + BatchNormForwardTraining\f$_{>out7}\f$ + ReLU\f$_{>out8}\f$ + Convolution\f$_{>out9}\f$ + BatchNormForwardTraining\f$_{>out10}\f$ + Add\f$_{out11}\f$ | Convolutional Bottleneck Training Forward Pattern. | -| ReLUBackward\f$_{>t1}\f$ + BatchNormTrainingBackward\f$_{>t2,>out1}\f$ + ConvolutionBackwardData + ReLUBackward + BatchNormTrainingBackward\f$_{>t3,>out2}\f$ + ConvolutionBackwardData + ReLUBackward + BatchNormTrainingBackward\f$_{>t4,>out3}\f$ + ConvolutionBackwardData + Add\f$_{out4}\f$, ConvolutionBackwardWeights\f$_{out5}\f$, ConvolutionBackwardWeights\f$_{out6}\f$, ConvolutionBackwardWeights\f$_{out7}\f$ | Identical Bottleneck Training Backward Pattern. | -| ReLUBackward\f$_{>t1}\f$ + BatchNormTrainingBackward\f$_{>t2,>out1}\f$ + ConvolutionBackwardData + ReLUBackward + BatchNormTrainingBackward\f$_{>t3,>out2}\f$ + ConvolutionBackwardData + ReLUBackward + BatchNormTrainingBackward\f$_{>t4,>out3}\f$ + ConvolutionBackwardData + Add\f$_{out4}\f$, BatchNormTrainingBackward\f$_{t5,>out5}\f$ + ConvolutionBackwardData\f$_{>t6}\f$, ConvolutionBackwardWeights\f$_{out6}\f$, ConvolutionBackwardWeights\f$_{out7}\f$, ConvolutionBackwardWeights\f$_{out8}\f$, ConvolutionBackwardWeights\f$_{out9}\f$ | Convolutional Bottleneck Training Backward Pattern. |