rfcs: graph api: support swish operation

TaoLv · web-flow · commit 8fd566edd89a · 2024-11-21T16:33:51.000+08:00
diff --git a/rfcs/20241008-graph-api-swish/README.md b/rfcs/20241008-graph-api-swish/README.md
@@ -0,0 +1,199 @@
+# Support Swish operation in Graph API
+
+## Background
+
+Swish is an activation operation introduced and evaluated in [[#1]][1] and
+[[#2]][2]. It is also known as SiLU (Sigmoid Linear Unit) in some papers and
+frameworks. In this document, we choose to call the operation Swish following
+the naming convention of oneDNN. Swish operation is defined as:
+
+$$Swish(x) = x * sigmoid(factor * x)$$
+
+where $factor = 1.f$ by default for most real models.
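+
+As a quick worked instance of the definition (a sketch that assumes the
+standard logistic form of $sigmoid$), the formula expands to:
+
+$$Swish(x) = \frac{x}{1 + e^{-factor * x}}$$
+
+With $factor = 1.f$ this is exactly SiLU, for example
+$Swish(1) = 1 / (1 + e^{-1}) \approx 0.731$ and $Swish(0) = 0$.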
+
+### Adoption in models
+
+Swish operation is widely adopted to improve the quality of deep learning
+networks. For example:
+
+- EfficientNet series [[#3]][3]: Swish is used as the activation in
+  Convolutional Neural Networks.
+- Large language models like LLaMA [[#4]][4], Qwen [[#5]][5], etc.: Swish is
+  used to construct SwiGLU [[#6]][6] by replacing the Sigmoid activation in
+  typical GLU (Gated Linear Unit). SwiGLU is further used to build Gated MLP in
+  the models.
+
+### Support in frameworks and libraries
+
+- PyTorch supports Swish via the SiLU operation [[#7]][7]. The operation does
+  not support specifying `factor` in the formula.
+- OpenVINO supports Swish via the Swish operation [[#8]][8]. Unlike PyTorch's
+  SiLU operation, OpenVINO's Swish also accepts a scalar input `Beta` as the
+  multiplication `factor` for Sigmoid.
+- For ONNX, a PR is in progress to add the Swish operation [[#9]][9].
+- oneDNN supports Swish as an algorithm of eltwise primitive [[#10]][10] which
+  accepts a scalar `alpha` in primitive descriptor creation as the
+  multiplication `factor` for Sigmoid.
+- cuDNN backend API supports Swish as a mode (`CUDNN_POINTWISE_SWISH_FWD`) of
+  its Pointwise operation [[#11]][11] and accepts attribute
+  `CUDNN_ATTR_POINTWISE_SWISH_BETA` as the multiplication `factor`.
+- Note that, even though PyTorch has a SiLU operation, many model scripts
+  still implement Swish as a composition of smaller operations
+  [[#12]][12].
+
+## Proposals
+
+### Option 1: Support Swish via Sigmoid and Multiply operations
+
+As indicated by the formula of Swish, the proposal is to support it via the
+combination of Sigmoid and Multiply operations which are already supported in
+oneDNN Graph API.
+
+- [Sigmoid operation](https://oneapi-src.github.io/oneDNN/dev_guide_op_sigmoid.html)
+- [Multiply operation](https://oneapi-src.github.io/oneDNN/dev_guide_op_multiply.html)
+
+With that, a Swish operation with default `factor` can be programmed as below:
+
+```cpp
+using namespace dnnl::graph;
+
+graph swish = graph(engine::kind::cpu);
+
+logical_tensor src = logical_tensor(ID_SRC, dt, shape);
+logical_tensor res = logical_tensor(ID_RES, dt, shape);
+logical_tensor dst = logical_tensor(ID_DST, dt, shape);
+
+// res = sigmoid(src)
+op sig = op(ID_SIG, op::kind::Sigmoid, "sig");
+sig.add_input(src);
+sig.add_output(res);
+
+// dst = src * res; `src` feeds both Sigmoid and Multiply
+op mul = op(ID_MUL, op::kind::Multiply, "mul");
+mul.add_inputs({src, res});
+mul.add_output(dst);
+
+swish.add_op(sig);
+swish.add_op(mul);
+swish.finalize();
+```
+
+Pros:
+
+- There is no need to define and maintain a new operation in oneDNN Graph API.
+- Composition of smaller operations makes it easy to extend when the activation
+  gains more variants or flavors in the future.
+- The approach of composition of Multiply and Sigmoid is also adopted in models
+  as mentioned above.
+
+Cons:
+
+- Compared to a dedicated Swish operation, this proposal requires more user code
+  (at least one more logical tensor and one more operation).
+- It also requires complex logic in the backend to detect `Sigmoid + Multiply`
+  and map it to the existing Swish kernels in oneDNN. The input of Sigmoid and
+  the other input of Multiply must be the same tensor.
+- Considering that SiLU is a built-in operation in PyTorch, mapping it to two
+  operations in oneDNN Graph is troublesome for some integrations.
+- Currently, oneDNN Graph Sigmoid operation does not support a multiplication
+  `factor`. We may need to extend either the proposed Swish graph or the Sigmoid
+  operation to support cases where `factor != 1.f`.
+
+### Option 2: Support Swish as a dedicated operation
+
+As mentioned above, mainstream frameworks and libraries support Swish as a
+dedicated operation. We think that it's reasonable to add a new Swish operation
+in oneDNN Graph API. The proposed operation schema is as follows:
+
+- Operation Kind: `Swish` (C++), `dnnl_graph_op_swish` (C).
+- Input/output: Single input, single output.
+- Attribute: `alpha` (optional) for the multiplication factor in the formula.
+  `alpha = 1.f` if not provided.
+- Data types: f32, bf16, f16.
+
+With the new operation defined, a Swish operation can be programmed as
+below:
+
+```cpp
+using namespace dnnl::graph;
+
+graph swish = graph(engine::kind::cpu);
+
+logical_tensor src = logical_tensor(ID_SRC, dt, shape);
+logical_tensor dst = logical_tensor(ID_DST, dt, shape);
+
+op swi = op(ID_SWI, op::kind::Swish, "swi");
+swi.set_attr<float>(op::attr::alpha, 0.5f); // optional
+swi.add_input(src);
+swi.add_output(dst);
+
+swish.add_op(swi);
+swish.finalize();
+```
+
+Pros:
+
+- It simplifies the user code, especially when Swish is used to construct a
+  complex fusion pattern.
+- The operation can be directly dispatched to the existing Swish kernels in
+  oneDNN.
+- It can be integrated easily into PyTorch to optimize the SiLU operation. It
+  also helps when converting cuDNN code into oneDNN code.
+- Attribute `alpha` covers the cases where `factor != 1.f`.
+- The granularity of operations is consistent within oneDNN and with other
+  frameworks and libraries.
+
+Cons:
+
+- It adds a new operation into oneDNN Graph API which may need additional
+  maintenance effort.
+- To some extent, supporting all of Sigmoid, Multiply, and Swish operations is
+  duplication.
+- We will need to break the API or add a new operation if the operation formula
+  changes (e.g., the `factor` is extended from a scalar to a vector or full
+  tensor) in the future. But with option 1, we just need to define a new pattern
+  without bloating the API.
+
+## Conclusions
+
+The decision is to implement option 1.
+
+The library will support the Sigmoid + Multiply fusion for Swish only with
+`factor = 1.f`, which is the most common case. In this case, the two operations
+will be fused into the swish algorithm of the eltwise primitive or post-op with
+`alpha = 1.f`.
+
+For other cases where `factor != 1.f` is specified, once they are requested, we
+can extend the library in one of the following ways:
+
+- Extend Sigmoid operation with a multiplication factor attribute, so the swish
+  can still be represented as Sigmoid + Multiply.
+- Represent the multiplication with another Multiply operation in case the
+  factor is not known at graph build stage or is not a scalar. In that case,
+  Swish will be fused and implemented as Multiply + Sigmoid + Multiply.
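+
+Both options compute the same result. As a sketch, with $t$ and $s$ as
+illustrative names for intermediate tensors (they are not part of the API),
+the Multiply + Sigmoid + Multiply form evaluates:
+
+$$t = factor * x, \quad s = sigmoid(t), \quad Swish(x) = x * s$$
+
+Extending Sigmoid with a factor attribute folds the first step into the
+Sigmoid operation, so only $s = sigmoid(factor * x)$ and the final Multiply
+remain.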
+
+## References
+
+1. Swish: a Self-Gated Activation Function, https://arxiv.org/abs/1710.05941v1
+2. Gaussian Error Linear Units (GELUs), https://arxiv.org/abs/1606.08415
+3. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, https://arxiv.org/abs/1905.11946
+4. LLaMA: Open and Efficient Foundation Language Models, https://arxiv.org/abs/2302.13971
+5. Qwen Technical Report, https://arxiv.org/abs/2309.16609
+6. GLU Variants Improve Transformer, https://arxiv.org/abs/2002.05202
+7. SiLU operation in PyTorch, https://pytorch.org/docs/stable/generated/torch.nn.SiLU.html
+8. Swish operation in OpenVINO, https://docs.openvino.ai/2024/documentation/openvino-ir-format/operation-sets/operation-specs/activation/swish-4.html
+9. PR for Swish operation in ONNX, https://github.com/onnx/onnx/pull/5964
+10. Swish in oneDNN, https://oneapi-src.github.io/oneDNN/dev_guide_eltwise.html
+11. Swish in cuDNN, https://docs.nvidia.com/deeplearning/cudnn/latest/api/cudnn-graph-library.html#cudnnpointwisemode-t
+12. Swish implementation in Huggingface repository, https://github.com/search?q=org%3Ahuggingface%20swish&type=code
+
+[1]: https://arxiv.org/abs/1710.05941v1
+[2]: https://arxiv.org/abs/1606.08415
+[3]: https://arxiv.org/abs/1905.11946
+[4]: https://arxiv.org/abs/2302.13971
+[5]: https://arxiv.org/abs/2309.16609
+[6]: https://arxiv.org/abs/2002.05202
+[7]: https://pytorch.org/docs/stable/generated/torch.nn.SiLU.html
+[8]: https://docs.openvino.ai/2024/documentation/openvino-ir-format/operation-sets/operation-specs/activation/swish-4.html
+[9]: https://github.com/onnx/onnx/pull/5964
+[10]: https://oneapi-src.github.io/oneDNN/dev_guide_eltwise.html
+[11]: https://docs.nvidia.com/deeplearning/cudnn/latest/api/cudnn-graph-library.html#cudnnpointwisemode-t
+[12]: https://github.com/search?q=org%3Ahuggingface%20swish&type=code