# Support Swish operation in Graph API

## Background

Swish is an activation operation introduced and experimented with in [[#1]][1]
and [[#2]][2]. It is also known as SiLU (Sigmoid Linear Unit) in some papers
and frameworks. In this document, we call the operation Swish following the
naming convention of oneDNN. The Swish operation is defined as:

$$Swish(x) = x * sigmoid(factor * x)$$

where $factor = 1.f$ by default for most real models.
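
For reference, the formula can be written as a minimal scalar sketch. This is
illustration only; the function name `swish_ref` is not part of any oneDNN API:

```cpp
#include <cmath>

// Reference semantics of Swish(x) = x * sigmoid(factor * x).
// factor defaults to 1.f, matching the common case described above.
inline float swish_ref(float x, float factor = 1.f) {
    const float sig = 1.f / (1.f + std::exp(-factor * x));
    return x * sig;
}
```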

### Adoption in models

Swish is widely adopted to improve the quality of deep learning networks. For
example:

- EfficientNet series [[#3]][3]: Swish is used as the activation in
Convolutional Neural Networks.
- Large language models like LLaMA [[#4]][4], Qwen [[#5]][5], etc.: Swish is
used to construct SwiGLU [[#6]][6] by replacing the Sigmoid activation in a
typical GLU (Gated Linear Unit); see the formula after this list. SwiGLU is
further used to build the Gated MLP in these models.
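
For clarity, the SwiGLU construction referenced above can be written as
follows, following [[#6]][6] (the bias terms $b$ and $c$ may be omitted in
practice):

$$SwiGLU(x, W, V, b, c) = Swish(xW + b) \otimes (xV + c)$$

where $\otimes$ denotes elementwise multiplication.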

### Support in frameworks and libraries

- PyTorch supports Swish via the SiLU operation [[#7]][7]. The operation does
not support specifying `factor` in the formula.
- OpenVINO supports Swish via the Swish operation [[#8]][8]. Unlike PyTorch's
SiLU operation, OpenVINO's Swish also accepts a scalar input `Beta` as the
multiplication `factor` for Sigmoid.
- For ONNX, a PR is in progress to add a Swish operation [[#9]][9].
- oneDNN supports Swish as an algorithm of the eltwise primitive [[#10]][10]
which accepts a scalar `alpha` at primitive descriptor creation as the
multiplication `factor` for Sigmoid (a minimal usage sketch follows this list).
- cuDNN backend API supports Swish as a mode (`CUDNN_POINTWISE_SWISH_FWD`) of
its Pointwise operation [[#11]][11] and accepts attribute
`CUDNN_ATTR_POINTWISE_SWISH_BETA` as the multiplication `factor`.
- Please note that, even though PyTorch has a SiLU operation, many model
scripts still choose to implement Swish as a composition of smaller operations
[[#12]][12].
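
As a rough illustration of the oneDNN eltwise route mentioned above, the
snippet below creates an eltwise forward primitive with the swish algorithm.
This is a sketch assuming the oneDNN v3.x primitive API; the tensor shape and
data type are arbitrary placeholders:

```cpp
#include "oneapi/dnnl/dnnl.hpp"

using namespace dnnl;

int main() {
    engine eng(engine::kind::cpu, 0);

    // Placeholder 2D f32 tensor descriptor.
    memory::desc data_md({8, 1024}, memory::data_type::f32,
            memory::format_tag::ab);

    // `alpha` carries the multiplication `factor` for Sigmoid.
    const float alpha = 1.f;
    auto pd = eltwise_forward::primitive_desc(eng,
            prop_kind::forward_inference, algorithm::eltwise_swish,
            data_md, data_md, alpha);
    auto swish_prim = eltwise_forward(pd);
    // Execution with real memory objects and a stream is omitted here.
    (void)swish_prim;
    return 0;
}
```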

## Proposals

### Option 1: Support Swish via Sigmoid and Multiply operations

As indicated by the formula of Swish, the proposal is to support it via the
combination of the Sigmoid and Multiply operations which are already supported
in oneDNN Graph API.

- [Sigmoid operation](https://oneapi-src.github.io/oneDNN/dev_guide_op_sigmoid.html)
- [Multiply operation](https://oneapi-src.github.io/oneDNN/dev_guide_op_multiply.html)

With that, a Swish operation with the default `factor` can be programmed as
below:

```cpp
using namespace dnnl::graph;

graph swish = graph(engine::kind::cpu);

logical_tensor src = logical_tensor(ID_SRC, dt, shape);
logical_tensor res = logical_tensor(ID_RES, dt, shape);
logical_tensor dst = logical_tensor(ID_DST, dt, shape);

op sig = op(ID_SIG, op::kind::Sigmoid, "sig");
sig.add_input(src);
sig.add_output(res);

op mul = op(ID_MUL, op::kind::Multiply, "mul");
mul.add_inputs({src, res});
mul.add_output(dst);

swish.add_op(sig);
swish.add_op(mul);
swish.finalize();
```
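
Continuing the snippet above, the fused partition would then be queried from
the finalized graph. This is a sketch assuming the backend recognizes the
Sigmoid + Multiply pattern and returns it as a single partition:

```cpp
// Ask the library for partitions; if the pattern is fused, the two ops
// are returned as one partition that can be compiled and executed.
auto parts = swish.get_partitions();
// parts.size() == 1 when Sigmoid + Multiply is fused into Swish.
```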

Pros:

- There is no need to define and maintain a new operation in oneDNN Graph API.
- Composing smaller operations makes it possible and scalable to extend the
support when the activation has more variants or flavors in the future.
- The approach of composing Multiply and Sigmoid is also adopted in models, as
mentioned above.

Cons:

- Compared to a dedicated Swish operation, this proposal requires more user
code (at least one more logical tensor and one more operation).
- It also requires more complex logic in the backend to detect `Sigmoid +
Multiply` and map it to the existing Swish kernels in oneDNN. In particular,
the matching has to check that one input of Multiply is the same tensor as the
input of Sigmoid.
- Considering that SiLU is a built-in operation in PyTorch, mapping it to two
operations in oneDNN Graph is troublesome for some integrations.
- Currently, the oneDNN Graph Sigmoid operation does not support a
multiplication `factor`. We may need to extend either the proposed Swish graph
or the Sigmoid operation to support cases where `factor != 1.f`.

### Option 2: Support Swish as a dedicated operation

As aforementioned, mainstream frameworks and libraries all support Swish as a
dedicated operation. We think it is reasonable to add a new Swish operation in
oneDNN Graph API. The proposed operation schema is as follows:

- Operation Kind: `Swish` (C++), `dnnl_graph_op_swish` (C).
- Input/output: Single input, single output.
- Attribute: `alpha` (optional) for the multiplication `factor` in the formula.
`alpha = 1.f` if not provided.
- Data types: f32, bf16, f16.

With the new operation defined, a Swish operation can be programmed as below:

```cpp
using namespace dnnl::graph;

graph swish = graph(engine::kind::cpu);

logical_tensor src = logical_tensor(ID_SRC, dt, shape);
logical_tensor dst = logical_tensor(ID_DST, dt, shape);

op swi = op(ID_SWI, op::kind::Swish, "swi");
swi.set_attr<float>(op::attr::alpha, 0.5f); // optional
swi.add_input(src);
swi.add_output(dst);

swish.add_op(swi);
swish.finalize();
```

Pros:

- It simplifies the user code, especially when Swish is used to construct a
complex fusion pattern.
- The operation can be directly dispatched to the existing Swish kernels in
oneDNN.
- It can be integrated easily into PyTorch to optimize the SiLU operation. It
also helps when converting cuDNN code into oneDNN code.
- The attribute `alpha` covers the cases where `factor != 1.f`.
- The granularity of operations is consistent within oneDNN and with other
frameworks and libraries.

Cons:

- It adds a new operation to oneDNN Graph API which may need additional
maintenance effort.
- To some extent, supporting all of the Sigmoid, Multiply, and Swish operations
is a form of duplication.
- We will need to break the API or add a new operation if the operation formula
changes (e.g., the `factor` is extended from a scalar to a vector or a full
tensor) in the future. But with option 1, we would just need to define a new
pattern without bloating the API.

## Conclusions

The decision is to implement option 1.

The library will support Sigmoid + Multiply fusion for Swish without
considering `factor != 1.f`, which covers the most common case. In this case,
Sigmoid + Multiply will be fused into the swish algorithm of the eltwise
primitive or post-op with `alpha = 1.f`.
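
On the primitive side, the fused pattern is expected to map to a swish eltwise
post-op such as the one sketched below. This assumes the oneDNN v3.x post-ops
API; the surrounding primitive is arbitrary and shown only for context:

```cpp
#include "oneapi/dnnl/dnnl.hpp"

int main() {
    // Attach a swish eltwise post-op with alpha = 1.f to a primitive attr.
    // This mirrors how a fused Sigmoid + Multiply pattern could be lowered.
    dnnl::post_ops po;
    po.append_eltwise(dnnl::algorithm::eltwise_swish,
            /*alpha=*/1.f, /*beta=*/0.f);

    dnnl::primitive_attr attr;
    attr.set_post_ops(po);
    // `attr` would then be passed to, e.g., a matmul or convolution
    // primitive descriptor so that Swish is applied to its result.
    return 0;
}
```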

For other cases where `factor != 1.f` is specified, once they are requested, we
can extend the library with one of the following options:

- Extend the Sigmoid operation with a multiplication factor attribute, so that
Swish can still be represented as Sigmoid + Multiply.
- Represent the multiplication with an explicit Multiply operation in case the
factor is not known at graph build stage or is not a scalar. In that case,
Swish will be fused and implemented as Multiply + Sigmoid + Multiply (see the
sketch after this list).
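
A rough sketch of the Multiply + Sigmoid + Multiply variant, reusing the
placeholder identifiers from the earlier snippets; the extra `scale` input
carrying `factor` and its shape are assumptions for illustration:

```cpp
using namespace dnnl::graph;

graph swish = graph(engine::kind::cpu);

logical_tensor src = logical_tensor(ID_SRC, dt, shape);
logical_tensor scale = logical_tensor(ID_SCALE, dt, scale_shape); // `factor`
logical_tensor scaled = logical_tensor(ID_SCALED, dt, shape);
logical_tensor res = logical_tensor(ID_RES, dt, shape);
logical_tensor dst = logical_tensor(ID_DST, dt, shape);

// First Multiply: scale the input by `factor`.
op mul0 = op(ID_MUL0, op::kind::Multiply, "mul0");
mul0.add_inputs({src, scale});
mul0.add_output(scaled);

// Sigmoid over the scaled input.
op sig = op(ID_SIG, op::kind::Sigmoid, "sig");
sig.add_input(scaled);
sig.add_output(res);

// Second Multiply: multiply the original input by the sigmoid result.
op mul1 = op(ID_MUL1, op::kind::Multiply, "mul1");
mul1.add_inputs({src, res});
mul1.add_output(dst);

swish.add_op(mul0);
swish.add_op(sig);
swish.add_op(mul1);
swish.finalize();
```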

## References

1. Swish: a Self-Gated Activation Function, https://arxiv.org/abs/1710.05941v1
2. Gaussian Error Linear Units (GELUs), https://arxiv.org/abs/1606.08415
3. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, https://arxiv.org/abs/1905.11946
4. LLaMA: Open and Efficient Foundation Language Models, https://arxiv.org/abs/2302.13971
5. Qwen Technical Report, https://arxiv.org/abs/2309.16609
6. GLU Variants Improve Transformer, https://arxiv.org/abs/2002.05202
7. SiLU operation in PyTorch, https://pytorch.org/docs/stable/generated/torch.nn.SiLU.html
8. Swish operation in OpenVINO, https://docs.openvino.ai/2024/documentation/openvino-ir-format/operation-sets/operation-specs/activation/swish-4.html
9. PR for Swish operation in ONNX, https://github.com/onnx/onnx/pull/5964
10. Swish in oneDNN, https://oneapi-src.github.io/oneDNN/dev_guide_eltwise.html
11. Swish in cuDNN, https://docs.nvidia.com/deeplearning/cudnn/latest/api/cudnn-graph-library.html#cudnnpointwisemode-t
12. Swish implementations in the Huggingface repositories, https://github.com/search?q=org%3Ahuggingface%20swish&type=code

[1]: https://arxiv.org/abs/1710.05941v1
[2]: https://arxiv.org/abs/1606.08415
[3]: https://arxiv.org/abs/1905.11946
[4]: https://arxiv.org/abs/2302.13971
[5]: https://arxiv.org/abs/2309.16609
[6]: https://arxiv.org/abs/2002.05202
[7]: https://pytorch.org/docs/stable/generated/torch.nn.SiLU.html
[8]: https://docs.openvino.ai/2024/documentation/openvino-ir-format/operation-sets/operation-specs/activation/swish-4.html
[9]: https://github.com/onnx/onnx/pull/5964
[10]: https://oneapi-src.github.io/oneDNN/dev_guide_eltwise.html
[11]: https://docs.nvidia.com/deeplearning/cudnn/latest/api/cudnn-graph-library.html#cudnnpointwisemode-t
[12]: https://github.com/search?q=org%3Ahuggingface%20swish&type=code
