oneDNN v3.6 Release Notes
=========================

# Performance Optimizations

## Intel Architecture Processors

 * Improved performance for 4th generation Intel Xeon Scalable processors
   (formerly Sapphire Rapids).
 * Improved performance for Intel Xeon 6 processors (formerly Granite Rapids).
 * Improved performance of the group normalization primitive.
 * Improved `bf16` matmul performance with `int4` compressed weights on
   processors with Intel AMX instruction set support.
 * Improved performance of `fp8` matmul, pooling, and eltwise primitives on
   processors with Intel AMX instruction set support.
 * Improved `fp32` RNN primitive performance on processors with Intel AVX2
   instruction set support.
 * Improved performance of the following subgraphs with Graph API:
   - `convolution` and `binary` operation fusions with better layout selection.
   - `fp8` `convolution` fused with `unary` or `binary` operations on
     processors with Intel AMX instruction set support.
   - Scaled Dot-Product Attention (SDPA) without scale,
     Multi-Query Attention (MQA), and Grouped Query Attention (GQA) patterns.
   - `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output
     and zero-points.

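The `int4` compressed-weights optimization above relies on packing two 4-bit quantized values per byte and dequantizing them on the fly inside the matmul. A minimal sketch of the storage scheme (illustrative Python, not the oneDNN API; a single per-tensor `scale` and `zero_point` are assumed here, while oneDNN also supports grouped scales):

```python
def pack_int4(values):
    """Pack unsigned 4-bit integers (0..15), two per byte, low nibble first."""
    assert all(0 <= v <= 15 for v in values) and len(values) % 2 == 0
    return bytes(values[i] | (values[i + 1] << 4)
                 for i in range(0, len(values), 2))

def unpack_int4(packed, scale=1.0, zero_point=0):
    """Unpack and dequantize: w = (q - zero_point) * scale."""
    out = []
    for byte in packed:
        for q in (byte & 0x0F, byte >> 4):
            out.append((q - zero_point) * scale)
    return out

weights = [3, 12, 7, 0]
packed = pack_int4(weights)                       # 2 bytes instead of 4
restored = unpack_int4(packed, scale=0.5, zero_point=8)
print(len(packed), restored)                      # 2 [-2.5, 2.0, -0.5, -4.0]
```

The point of the format is that the full-precision weights never need to be materialized in memory; dequantization happens per tile as the matmul kernel consumes them.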
## Intel Graphics Products

 * Improved performance for the Intel Data Center GPU Max Series (formerly
   Ponte Vecchio).
 * Introduced broad production quality optimizations for Intel Arc Graphics for
   Intel Core Ultra processors (Series 2) (formerly Lunar Lake).
 * Introduced broad production quality optimizations for a future discrete GPU
   based on the Xe2 architecture (code name Battlemage).
 * Introduced support for Intel Arc Graphics for the future
   Intel Core Ultra processor (code name Arrow Lake-H).
 * Improved performance of `fp8_e5m2` primitives on the
   Intel Data Center GPU Max Series (formerly Ponte Vecchio).
 * Improved performance of matmul and inner product primitives for shapes
   relevant to large language models (LLMs) on GPUs with Intel XMX support.
 * Improved `int8` convolution performance with weight zero-points.
 * Reduced primitive creation time for softmax, layer normalization, and concat
   primitives via kernel reuse.
 * Improved performance of the following subgraphs with Graph API:
   - SDPA without scale, MQA, and GQA patterns. `f16` variants of these
     patterns benefit significantly from Intel(R) Xe Matrix Extensions
     (Intel(R) XMX) support.
   - `fp8` `convolution` fused with `unary` or `binary` operations on the
     Intel Data Center GPU Max Series.
   - `LayerNorm`, `GroupNorm`, and `SoftMax` with `int8` quantized output and
     zero-points.

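For reference, the MQA and GQA patterns mentioned above differ from plain SDPA only in how query heads map onto key/value heads: each group of query heads shares one KV head, with MQA being the extreme case of a single KV head. A small sketch of that mapping (illustrative Python, not the Graph API; the function name is ours):

```python
def kv_head_for(q_head, num_q_heads, num_kv_heads):
    """Map a query head index to the key/value head it attends with."""
    assert num_q_heads % num_kv_heads == 0
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

# GQA: 8 query heads sharing 2 KV heads -> groups of 4.
print([kv_head_for(q, 8, 2) for q in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
# MQA: every query head uses the single KV head 0.
print([kv_head_for(q, 8, 1) for q in range(8)])  # [0, 0, 0, 0, 0, 0, 0, 0]
```

Sharing KV heads shrinks the key/value tensors, which is why these patterns matter for LLM inference on XMX-capable GPUs.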
## AArch64-based Processors

 * Improved `fp32` convolution backpropagation performance on processors with
   SVE support.
 * Improved reorder performance for blocked formats on processors with
   SVE support.
 * Improved `bf16` softmax performance on processors with SVE support.
 * Improved batch normalization performance on processors with SVE support.
 * Improved matmul performance on processors with SVE support.
 * Improved `fp16` convolution performance with Arm Compute Library (ACL).
 * Improved matmul performance with ACL.
 * Switched matmul and convolution implementations with ACL to the stateless
   API, significantly improving primitive creation time and increasing caching
   efficiency and performance for these operators.

# Functionality

 * Introduced [generic GPU] support. This implementation relies on portable
   SYCL kernels and can be used as a starting point to enable new devices in
   oneDNN.
 * Extended functionality supported on NVIDIA GPUs and AMD GPUs with SYCL-based
   implementations.
 * Enabled support for `int8` activations with grouped scales and `int8`
   or `int4` compressed weights in the matmul primitive. This functionality
   is implemented on Intel GPUs.
 * Introduced stochastic rounding support for the `fp8` data type.
 * **[experimental]** Extended the [microkernel API]:
   - Introduced `int8` quantization support.
   - Extended the transform microkernel with support for transposition and
     arbitrary strides.
   - Introduced verbose diagnostics support.
 * **[experimental]** Extended the [sparse API]:
   - Introduced support for sparse memory with the coordinate (COO) storage
     format.
   - Extended the matmul primitive to work with sparse memory in COO format.
     This functionality is implemented on CPUs and Intel GPUs.
 * Introduced `int8` support in the eltwise primitive with the `clip`
   algorithm. This functionality is implemented on CPUs.
 * Graph API:
   - Introduced the `GroupNorm` operation and its fusions.
   - Introduced support for standalone `StaticReshape` and `StaticTranspose`
     operations.

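Stochastic rounding, mentioned above for `fp8`, rounds a value up or down with probability proportional to its distance from each neighbor, so the rounding error is zero in expectation rather than systematically biased. A toy sketch on a uniform grid (illustrative Python; real `fp8` grids are non-uniform, and oneDNN does this inside the kernels):

```python
import random

def stochastic_round(x, step=1.0, rng=random.random):
    """Round x down or up to a multiple of `step`; the probability of
    rounding up equals the fractional distance already covered."""
    lower = (x // step) * step
    frac = (x - lower) / step          # distance past `lower`, in [0, 1)
    return lower + step if rng() < frac else lower

random.seed(0)
mean = sum(stochastic_round(0.3) for _ in range(10000)) / 10000
print(mean)  # close to 0.3: no systematic bias, unlike round-to-nearest
```

Round-to-nearest would map 0.3 to 0.0 every time; stochastic rounding preserves small gradient contributions on average, which is why it is useful for low-precision training.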
[generic GPU]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/src/gpu/generic/sycl/README.md
[microkernel API]: https://oneapi-src.github.io/oneDNN/v3.6/ukernels.html
[sparse API]: https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_experimental.html#onednn-experimental-sparse
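
The COO storage format added to the sparse API stores each non-zero as a (row, column, value) triplet held in three parallel arrays. A minimal sketch of a COO-times-dense product (illustrative Python, not the oneDNN API):

```python
def coo_matmul(rows, cols, vals, shape, dense):
    """Multiply a sparse COO matrix (rows/cols/vals triplets, logical
    dimensions `shape`) by a dense matrix given as a list of rows."""
    m, _ = shape
    n = len(dense[0])
    out = [[0.0] * n for _ in range(m)]
    for r, c, v in zip(rows, cols, vals):
        for j in range(n):
            out[r][j] += v * dense[c][j]
    return out

# A 2x3 sparse matrix with two non-zeros: A[0][1] = 2, A[1][2] = 3.
rows, cols, vals = [0, 1], [1, 2], [2.0, 3.0]
dense = [[1.0], [10.0], [100.0]]                  # 3x1 dense operand
print(coo_matmul(rows, cols, vals, (2, 3), dense))  # [[20.0], [300.0]]
```

Only the non-zeros are touched, so the work scales with the number of stored triplets rather than with `m * k`.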

# Usability

 * Added [examples][Graph API examples] of SDPA, MQA, and GQA pattern
   implementation with Graph API.
 * Added [an example][deconvolution example] for the deconvolution primitive.
 * Added examples for [Vanilla RNN][Vanilla RNN example] and
   [LBR GRU][LBR GRU example] RNN cells.
 * Introduced support for Intel oneAPI DPC++/C++ Compiler 2025.0.
 * Introduced interoperability with [SYCL Graph] record/replay mode.
 * Removed the dependency on the OpenCL runtime for NVIDIA and AMD GPUs.
 * **[experimental]** Introduced a [logging mechanism][spdlog] based on the
   spdlog library.
 * Introduced support for the `ONEDNN_ENABLE_WORKLOAD` build knob in Graph API.
 * Improved performance of the `get_partitions()` function in Graph API.

[Graph API examples]: https://github.com/oneapi-src/oneDNN/tree/rls-v3.6/examples/graph
[deconvolution example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/deconvolution.cpp
[Vanilla RNN example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/vanilla_rnn.cpp
[LBR GRU example]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/examples/primitives/lbr_gru.cpp
[SYCL Graph]: https://codeplay.com/portal/blogs/2024/01/22/sycl-graphs
[spdlog]: https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_experimental.html#onednn-experimental-logging
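
The `ONEDNN_ENABLE_WORKLOAD` knob trims the build to a given workload type, and with this release the trimming extends to Graph API code paths. A typical configuration might look like the following (a sketch; `TRAINING` is the default value, `INFERENCE` yields a smaller inference-only library):

```shell
# Configure an inference-only oneDNN build from a checked-out source tree.
cmake -S oneDNN -B build -DONEDNN_ENABLE_WORKLOAD=INFERENCE
cmake --build build --parallel
```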

# Validation

 * Introduced protection from out-of-memory scenarios in the benchdnn Graph API
   driver.

# Deprecated Functionality

 * The experimental [Graph Compiler] is deprecated and will be removed in
   future releases.

[Graph Compiler]: https://oneapi-src.github.io/oneDNN/v3.6/dev_guide_graph_compiler.html

# Breaking Changes

 * The experimental [microkernel API] in this release is not compatible with
   [the version available][microkernel API v3.5] in oneDNN v3.5.
 * Updated the minimum supported ACL version to 24.08.1 (was 24.04).

[microkernel API v3.5]: https://oneapi-src.github.io/oneDNN/v3.5/ukernels.html

# Thanks to these Contributors

This release contains contributions from the [project core team] as well as
Abdel @quickwritereader, Adam Jackson @nwnk, Aleksandr Voron @alvoron,
Alexey Makarevich @amakarev, Annop Wongwathanarat @annop-w, Daniel Kuts
@apach301, @deepeshfujitsu, Fadi Arafeh @fadara01, Fritz Heckel @fwph,
Gorokhov Dmitriy @dmitry-gorokhov, Deeksha Kasture @kasturedeeksha,
Kentaro Kawakami @kawakami-k, Marek Michalowski @michalowski-arm,
@matthias-bonne, @Menooker, Michael Froelich @MichaelFroelich,
Nicolas Miller @npmiller, Nikhil Sharma @nikhilfujitsu, @nishith-fujitsu,
Permanence AI Coder @Permanence-AI-Coder, Radu Salavat @Radu2k, Renato Barros
Arantes @renato-arantes, Robert Cohn @rscohn2, Robert Hardwick @robert-hardwick,
Ryo Suzuki @Ryo-not-rio, Shreyas-fuj @Shreyas-fuj, Shu Chen @shu1chen,
Siddhartha Menon @Sqvid, Song Jiaming @Litchilitchy, Vladimir Paramuzov
@vladimir-paramuzov, Yifei Zhang @yifeizh2. We would also like to thank everyone
who asked questions and reported issues.

[project core team]: https://github.com/oneapi-src/oneDNN/blob/rls-v3.6/MAINTAINERS.md