Commit 0bbb39c

vpirogov, ranukund, and mgouicem authored
oneDNN v3.6 release notes (uxlfoundation#2113)
Co-authored-by: Ranu Kundu <ranu.kundu@intel.com> Co-authored-by: Mourad Gouicem <mourad.gouicem@intel.com>
1 parent 7becab8 commit 0bbb39c

1 file changed

+159
-0
lines changed

RELEASE_NOTES.md

+159
@@ -0,0 +1,159 @@
1+
oneDNN v3.6 Release Notes
2+
=========================
3+
4+
# Performance Optimizations
5+
6+
## Intel Architecture Processors
7+
8+
* Improved performance for 4th generation Intel Xeon Scalable processors
9+
(formerly Sapphire Rapids).
10+
* Improved performance for Intel Xeon 6 processors (formerly Granite Rapids).
11+
* Improved performance of group normalization primitive.
12+
* Improved `bf16` matmul performance with `int4` compressed weights on processors
13+
with Intel AMX instruction set support.
14+
* Improved performance of `fp8` matmul, pooling, and eltwise primitives on
15+
processors with Intel AMX instruction set support.
16+
* Improved `fp32` RNN primitive performance on processors with Intel AVX2
17+
instruction set support.
18+
* Improved performance of the following subgraphs with Graph API:
19+
- `convolution` and `binary` operation fusions with better layout selection
20+
in Graph API.
21+
- `fp8` `convolution` and `unary` or `binary` on processors with Intel AMX
22+
instruction set support.
23+
- Scaled Dot Product Attention (SDPA) without scale,
24+
Multi-Query Attention (MQA), and Grouped Query Attention (GQA) patterns.
25+
- `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output
26+
and zero-points.
27+
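For readers unfamiliar with the attention patterns named above, the following is a minimal reference sketch of the math only. It is not the oneDNN Graph API and does not reflect how the library fuses or optimizes these subgraphs; the function names (`attention`, `grouped_query_attention`) and the nested-list tensor layout are assumptions made for illustration. Passing `scale=None` corresponds to the "SDPA without scale" case, and MQA is the special case with a single key/value head.

```python
import math


def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, scale=None):
    """Single-head attention. q is [seq_q][d]; k and v are [seq_kv][d].

    scale=None skips the scaling step ("SDPA without scale").
    """
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        if scale is not None:
            scores = [s * scale for s in scores]
        weights = softmax(scores)
        out.append([
            sum(w * v_row[c] for w, v_row in zip(weights, v))
            for c in range(len(v[0]))
        ])
    return out


def grouped_query_attention(q_heads, k_heads, v_heads, scale=None):
    """GQA: several query heads share one key/value head.

    With len(k_heads) == 1 this is MQA; with len(k_heads) == len(q_heads)
    it is ordinary multi-head attention.
    """
    n_q, n_kv = len(q_heads), len(k_heads)
    assert n_q % n_kv == 0, "query heads must divide evenly into KV groups"
    group = n_q // n_kv
    return [
        attention(q, k_heads[h // group], v_heads[h // group], scale)
        for h, q in enumerate(q_heads)
    ]
```

The only difference between the three patterns in this sketch is the mapping `h // group` from a query head to the key/value head it reads.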
28+
## Intel Graphics Products
29+
30+
* Improved performance for the Intel Data Center GPU Max Series (formerly
31+
Ponte Vecchio).
32+
* Introduced broad production quality optimizations for Intel Arc Graphics for
33+
Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
34+
* Introduced broad production quality optimizations for future discrete GPUs
35+
based on Xe2 architecture (code name Battlemage).
36+
* Introduced support for Intel Arc Graphics for future
37+
  Intel Core Ultra processors (code name Arrow Lake-H).
38+
* Improved performance of `fp8_e5m2` primitives on
39+
Intel Data Center GPU Max Series (formerly Ponte Vecchio).
40+
* Improved matmul and inner product primitives performance for shapes relevant
41+
to large language models (LLMs) on GPUs with Intel XMX support.
42+
* Improved `int8` convolution performance with weight zero-points.
43+
* Reduced primitive creation time for softmax, layer normalization, and concat
44+
primitives via kernel reuse.
45+
* Improved performance of the following subgraphs with Graph API:
46+
- SDPA without scale, MQA, and GQA patterns. `f16` variants of these
47+
patterns significantly benefit from Intel(R) Xe Matrix Extensions
48+
(Intel(R) XMX) support.
49+
    - `fp8` `convolution` and `unary` or `binary` on the
50+
Intel Data Center GPU Max Series.
51+
- `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output and
52+
zero-points.
53+
54+
## AArch64-based Processors
55+
56+
* Improved `fp32` convolution backpropagation performance on processors with
57+
SVE support.
58+
* Improved reorder performance for blocked format on processors with
59+
SVE support.
60+
* Improved `bf16` softmax performance on processors with SVE support.
61+
* Improved batch normalization performance on processors with SVE support.
62+
* Improved matmul performance on processors with SVE support.
63+
* Improved `fp16` convolution with Arm Compute Library (ACL).
64+
* Improved matmul performance with ACL.
65+
* Switched matmul and convolution implementations with ACL to the stateless API,
66+
significantly improving primitive creation time and increasing caching
67+
efficiency and performance for these operators.
68+
69+
# Functionality
70+
71+
* Introduced [generic GPU] support. This implementation relies on portable
72+
SYCL kernels and can be used as a starting point to enable new devices in
73+
oneDNN.
74+
* Extended functionality supported on NVIDIA GPUs and AMD GPUs with SYCL-based
75+
implementations.
76+
* Enabled support for `int8` activations with grouped scales and `int8`
77+
or `int4` compressed weights in matmul primitive. This functionality
78+
is implemented on Intel GPUs.
79+
* Introduced support for stochastic rounding for `fp8` data type
80+
functionality.
81+
* **[experimental]** Extended [microkernel API]:
82+
- Introduced `int8` quantization support.
83+
- Extended transform microkernel with transposition support and support for
84+
arbitrary strides.
85+
- Introduced verbose diagnostics support.
86+
* **[experimental]** Extended [sparse API]:
87+
- Introduced support for sparse memory with coordinate (COO) storage format.
88+
- Extended matmul primitive to work with sparse memory in COO format. This
89+
functionality is implemented on CPUs and Intel GPUs.
90+
* Introduced `int8` support in eltwise primitive with `clip` algorithm. This
91+
functionality is implemented on CPUs.
92+
* Graph API:
93+
- Introduced `GroupNorm` operation and fusions in Graph API.
94+
- Introduced support for standalone `StaticReshape` and `StaticTranspose`
95+
operations.
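
As background for the COO item above, here is a minimal sketch of what coordinate storage looks like and how a sparse-by-dense matmul can walk it. This is plain Python for illustration, not the oneDNN sparse API; the helper names `dense_to_coo` and `coo_matmul` are hypothetical.

```python
def dense_to_coo(dense):
    """Convert a dense 2D list into COO form: parallel lists of
    row indices, column indices, and nonzero values."""
    rows, cols, vals = [], [], []
    for r, row in enumerate(dense):
        for c, x in enumerate(row):
            if x != 0:
                rows.append(r)
                cols.append(c)
                vals.append(x)
    return rows, cols, vals


def coo_matmul(coo, n_rows, dense_b):
    """Multiply a COO matrix (with n_rows rows) by a dense matrix.

    Each stored nonzero A[r][c] contributes A[r][c] * B[c][:] to output row r,
    so the work is proportional to the number of nonzeros.
    """
    rows, cols, vals = coo
    n_cols = len(dense_b[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for r, c, x in zip(rows, cols, vals):
        for j in range(n_cols):
            out[r][j] += x * dense_b[c][j]
    return out
```

For example, `dense_to_coo([[0, 2, 0], [1, 0, 3]])` yields the index lists `[0, 1, 1]` and `[1, 0, 2]` with values `[2, 1, 3]`.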
96+
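The stochastic rounding item above can be illustrated with a small sketch. The idea is to round a value up or down at random, with the probability of rounding up equal to its fractional distance from the lower grid point, so that the result is unbiased on average. The sketch below assumes a uniform grid of spacing `step` for simplicity; real `fp8` formats have non-uniform spacing, and this is not how oneDNN implements or exposes the feature. The function name `stochastic_round` is hypothetical.

```python
import math
import random


def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`, choosing the upper neighbor with
    probability equal to x's fractional position between the two neighbors.

    Assumption: a uniform grid stands in for the representable values of a
    low-precision type.
    """
    lower = math.floor(x / step) * step
    frac = (x - lower) / step  # in [0, 1)
    return lower + step if rng.random() < frac else lower


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(0.3, 1.0, rng) for _ in range(10000)]
    # Individual results are 0.0 or 1.0; their mean approaches 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map every `0.3` to `0.0`, losing the value entirely; stochastic rounding preserves it in expectation, which is why it is of interest when accumulating small updates in low precision.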
97+
[generic GPU]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/src/gpu/generic/sycl/README.md
98+
[microkernel API]: https://oneapi-src.github.io/oneDNN/v3.6/ukernels.html
99+
[sparse API]: https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_experimental.html#onednn-experimental-sparse
100+
101+
# Usability
102+
103+
* Added [examples][Graph API examples] for SDPA, MQA, and GQA patterns
104+
implementation with Graph API.
105+
* Added [an example][deconvolution example] for deconvolution primitive.
106+
* Added examples for [Vanilla RNN][Vanilla RNN example] and
107+
[LBR GRU][LBR GRU example] RNN cells.
108+
* Introduced support for Intel oneAPI DPC++/C++ Compiler 2025.0.
109+
* Introduced interoperability with [SYCL Graph] record/replay mode.
110+
* Removed dependency on OpenCL runtime for NVIDIA and AMD GPUs.
111+
* **[experimental]** Introduced [logging mechanism][spdlog] based on spdlog
112+
library.
113+
* Introduced support for `ONEDNN_ENABLE_WORKLOAD` build knob for Graph API.
114+
* Improved performance of `get_partitions()` function in Graph API.
115+
116+
[Graph API examples]: https://github.com/oneapi-src/oneDNN/tree/rls-v3.6/examples/graph
117+
[deconvolution example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/deconvolution.cpp
118+
[Vanilla RNN example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/vanilla_rnn.cpp
119+
[LBR GRU example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/lbr_gru.cpp
120+
[SYCL Graph]: https://codeplay.com/portal/blogs/2024/01/22/sycl-graphs
121+
[spdlog]: https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_experimental.html#onednn-experimental-logging
122+
123+
# Validation
124+
125+
* Introduced protection from out-of-memory scenarios in benchdnn Graph API
126+
driver.
127+
128+
# Deprecated Functionality
129+
130+
* Experimental [Graph Compiler] is deprecated and will be removed in future releases.
131+
132+
[Graph Compiler]: https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_graph_compiler.html
133+
134+
# Breaking Changes
135+
136+
* Experimental [microkernel API] in this release is not compatible with
137+
[the version available][microkernel API v3.5] in oneDNN v3.5.
138+
* Updated minimal supported ACL version to 24.08.1 (was 24.04).
139+
140+
[microkernel API v3.5]: https://oneapi-src.github.io/oneDNN/v3.5/ukernels.html
141+
142+
# Thanks to these Contributors
143+
144+
This release contains contributions from the [project core team] as well as
145+
Abdel @quickwritereader, Adam Jackson @nwnk, Aleksandr Voron @alvoron,
146+
Alexey Makarevich @amakarev, Annop Wongwathanarat @annop-w, Daniel Kuts
147+
@apach301, @deepeshfujitsu, Fadi Arafeh @fadara01, Fritz Heckel @fwph,
148+
Gorokhov Dmitriy @dmitry-gorokhov, Deeksha Kasture @kasturedeeksha,
149+
Kentaro Kawakami @kawakami-k, Marek Michalowski @michalowski-arm,
150+
@matthias-bonne, @Menooker, Michael Froelich @MichaelFroelich,
151+
Nicolas Miller @npmiller, Nikhil Sharma @nikhilfujitsu, @nishith-fujitsu,
152+
Permanence AI Coder @Permanence-AI-Coder, Radu Salavat @Radu2k, Renato Barros
153+
Arantes @renato-arantes, Robert Cohn @rscohn2, Robert Hardwick @robert-hardwick,
154+
Ryo Suzuki @Ryo-not-rio, Shreyas-fuj @Shreyas-fuj, Shu Chen @shu1chen,
155+
Siddhartha Menon @Sqvid, Song Jiaming @Litchilitchy, Vladimir Paramuzov
156+
@vladimir-paramuzov, Yifei Zhang @yifeizh2. We would also like to thank everyone
157+
who asked questions and reported issues.
158+
159+
[project core team]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/MAINTAINERS.md
