@@ -1,9 +1,10 @@
+OpenVINO Release Notes
+=============================
+
 .. meta::
    :description: See what has changed in OpenVINO with the latest release, as well as all
                  previous releases in this year's cycle.
 
-OpenVINO Release Notes
-=============================
 
 .. toctree::
    :maxdepth: 1
@@ -14,7 +15,7 @@ OpenVINO Release Notes
 
 
 
-2024.3 - 30 July 2024
+2024.3 - 31 July 2024
 #############################
 
 :doc:`System Requirements <./release-notes-openvino/system-requirements>` | :doc:`Release policy <./release-notes-openvino/release-policy>` | :doc:`Installation Guides <./../get-started/install-openvino>`
@@ -23,21 +24,21 @@ OpenVINO Release Notes
 What's new
 +++++++++++++++++++++++++++++
 
-More Gen AI coverage and framework integrations to minimize code changes.
+* More Gen AI coverage and framework integrations to minimize code changes.
 
-* OpenVINO pre-optimized models are now available on Hugging Face, making it easier for developers
-  to get started with these models.
+  * OpenVINO pre-optimized models are now available on Hugging Face, making it easier for developers
+    to get started with these models.
 
-Broader Large Language Model (LLM) support and more model compression techniques.
+* Broader Large Language Model (LLM) support and more model compression techniques.
 
-* Significant improvement in LLM performance on Intel built-in and discrete GPUs with the addition
-  of dynamic quantization, Multi-Head Attention (MHA), and oneDNN enhancements.
+  * Significant improvement in LLM performance on Intel discrete GPUs with the addition of
+    Multi-Head Attention (MHA) and oneDNN enhancements.
 
-More portability and performance to run AI at the edge, in the cloud, or locally.
+* More portability and performance to run AI at the edge, in the cloud, or locally.
 
-* Improved CPU performance when serving LLMs with the inclusion of vLLM and continuous batching
-  in the OpenVINO Model Server (OVMS). vLLM is an easy-to-use open-source library that supports
-  efficient LLM inferencing and model serving.
+  * Improved CPU performance when serving LLMs with the inclusion of vLLM and continuous batching
+    in the OpenVINO Model Server (OVMS). vLLM is an easy-to-use open-source library that supports
+    efficient LLM inferencing and model serving.
 
 
 
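As a sketch of the Hugging Face item above: the pre-optimized models are published under the OpenVINO organization on the Hugging Face hub and are typically loaded through optimum-intel. The model id below is a hypothetical example, not a confirmed repository name.

.. code-block:: python

   # Hypothetical example: load an OpenVINO pre-optimized model from the
   # Hugging Face hub via optimum-intel. The model id is illustrative only.
   from optimum.intel import OVModelForCausalLM
   from transformers import AutoTokenizer

   model_id = "OpenVINO/Phi-3-mini-4k-instruct-int4-ov"  # assumed repo name
   model = OVModelForCausalLM.from_pretrained(model_id)
   tokenizer = AutoTokenizer.from_pretrained(model_id)

   inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
   outputs = model.generate(**inputs, max_new_tokens=50)
   print(tokenizer.decode(outputs[0], skip_special_tokens=True))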
@@ -59,7 +60,7 @@
 
 * Increasing support for models like YOLOv10 or PixArt-XL-2, thanks to enabling Squeeze and
   Concat layers.
-* Performance of precision conversion fp16/bf16 -> fp32.
+* Performance of precision conversion FP16/BF16 -> FP32.
 
 
 
@@ -97,9 +98,6 @@ GPU Device Plugin
 
 * LLMs and Stable Diffusion on discrete GPUs, due to latency decrease, through optimizations
   such as Multi-Head Attention (MHA) and oneDNN improvements.
-* First token latency of LLMs for large input cases on Core Ultra integrated GPU. It can be
-  further improved with dynamic quantization enabled with an application
-  `interface <https://docs.openvino.ai/2024/api/c_cpp_api/group__ov__dev__exec__model.html#_CPPv4N2ov4hint31dynamic_quantization_group_sizeE>`__.
 * Whisper models on discrete GPU.
 
 
@@ -191,7 +189,7 @@ Neural Network Compression Framework
   Act->MatMul and Act->Multiply->MatMul to cover the Phi family models.
 * The representation of symmetrically quantized weights has been updated to a signed data type
   with no zero point. This allows NPU to support compressed LLMs with the symmetric mode.
-* bf16 models in Post-Training Quantization are now supported; nncf.quantize().
+* BF16 models in Post-Training Quantization are now supported; nncf.quantize().
 * `Activation Sparsity <https://arxiv.org/abs/2310.17157>`__ (Contextual Sparsity) algorithm in
   the Weight Compression method is now supported (preview), speeding up LLM inference.
   The algorithm is enabled by setting the ``target_sparsity_by_scope`` option in
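A minimal sketch of the BF16 Post-Training Quantization path named above, assuming a generic OpenVINO IR model; the model path, input shape, and identity transform are placeholders, not values from the release notes.

.. code-block:: python

   import numpy as np
   import nncf
   import openvino as ov

   # Placeholder model path; a BF16 model is read the same way as any other IR.
   model = ov.Core().read_model("model.xml")

   # Placeholder calibration data; shape is illustrative.
   data_source = [np.random.rand(1, 3, 224, 224).astype(np.float32)
                  for _ in range(10)]

   def transform_fn(data_item):
       # Map a raw data item to the model's input; identity here for brevity.
       return data_item

   calibration_dataset = nncf.Dataset(data_source, transform_fn)
   quantized_model = nncf.quantize(model, calibration_dataset)
   ov.save_model(quantized_model, "quantized_model.xml")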
@@ -431,7 +429,7 @@ Previous 2024 releases
   compression of LLMs. Enabled by ``gptq=True`` in nncf.compress_weights().
 * Scale Estimation algorithm for more accurate 4-bit compressed LLMs. Enabled by
   ``scale_estimation=True`` in nncf.compress_weights().
-* Added support for models with bf16 weights in nncf.compress_weights().
+* Added support for models with BF16 weights in nncf.compress_weights().
 * nncf.quantize() method is now the recommended path for quantization initialization of
   PyTorch models in Quantization-Aware Training. See example for more details.
 * compressed_model.nncf.get_config() and nncf.torch.load_from_config() API have been added to
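For the GPTQ and Scale Estimation items above, a hedged sketch of how the two flags combine in a single nncf.compress_weights() call; the model path, calibration items, transform, and ratio are placeholder values, not part of the release notes.

.. code-block:: python

   import nncf
   import openvino as ov

   model = ov.Core().read_model("llm.xml")  # placeholder IR path

   # Both GPTQ and Scale Estimation are data-aware, so a calibration dataset
   # is required; real use needs a transform that yields model inputs.
   calibration_items = ["placeholder prompt"]  # placeholder samples
   dataset = nncf.Dataset(calibration_items, lambda item: item)

   compressed = nncf.compress_weights(
       model,
       mode=nncf.CompressWeightsMode.INT4_SYM,
       ratio=0.8,              # placeholder share of 4-bit layers
       dataset=dataset,
       gptq=True,              # per the note above
       scale_estimation=True,  # per the note above
   )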