
Commit 2c1287f

[DOCS] genAI main article tweaks mstr (#27791)
1 parent 7670715 commit 2c1287f

9 files changed: +708 −761 lines changed


docs/articles_en/learn-openvino.rst

+2 −2
@@ -14,7 +14,7 @@ Learn OpenVINO
 
    Interactive Tutorials (Python) <learn-openvino/interactive-tutorials-python>
    Sample Applications (Python & C++) <learn-openvino/openvino-samples>
-   Large Language Model Inference Guide <learn-openvino/llm_inference_guide>
+   Generative AI workflow <learn-openvino/llm_inference_guide>
 
 
 
@@ -29,5 +29,5 @@ as well as an experienced user.
 | :doc:`OpenVINO Samples <learn-openvino/openvino-samples>`
 | The OpenVINO samples (Python and C++) are simple console applications that show how to use specific OpenVINO API features. They can assist you in executing tasks such as loading a model, running inference, querying particular device capabilities, etc.
 
-| :doc:`Large Language Models in OpenVINO <learn-openvino/llm_inference_guide>`
+| :doc:`Generative AI workflow <learn-openvino/llm_inference_guide>`
 | Detailed information on how OpenVINO accelerates Generative AI use cases and what models it supports. This tutorial provides instructions for running Generative AI models using Hugging Face Optimum Intel and Native OpenVINO APIs.
docs/articles_en/learn-openvino/llm_inference_guide.rst
@@ -1,140 +1,106 @@
-Large Language Model Inference Guide
+Generative AI workflow
 ========================================
 
 .. meta::
-   :description: Explore learning materials, including interactive
-                 Python tutorials and sample console applications that explain
-                 how to use OpenVINO features.
+   :description: Learn how to use OpenVINO to run generative AI models.
 
 
 .. toctree::
    :maxdepth: 1
    :hidden:
 
-   Run LLMs with Optimum Intel <llm_inference_guide/llm-inference-hf>
-   Run LLMs on OpenVINO GenAI Flavor <llm_inference_guide/genai-guide>
-   Run LLMs on Base OpenVINO <llm_inference_guide/llm-inference-native-ov>
+   Inference with OpenVINO GenAI <llm_inference_guide/genai-guide>
+   Inference with Optimum Intel <llm_inference_guide/llm-inference-hf>
+   Generative AI with Base OpenVINO (not recommended) <llm_inference_guide/llm-inference-native-ov>
    OpenVINO Tokenizers <llm_inference_guide/ov-tokenizers>
 
-Large Language Models (LLMs) like GPT are transformative deep learning networks capable of a
-broad range of natural language tasks, from text generation to language translation. OpenVINO
-optimizes the deployment of these models, enhancing their performance and integration into
-various applications. This guide shows how to use LLMs with OpenVINO, from model loading and
-conversion to advanced use cases.
+
+Generative AI is a specific area of deep learning models used for producing new and “original”
+data, based on input in the form of image, sound, or natural language text. Due to their
+complexity and size, generative AI pipelines are more difficult to deploy and run efficiently.
+OpenVINO simplifies the process and ensures high-performance integrations, with the following
+options:
+
+.. tab-set::
+
+   .. tab-item:: OpenVINO GenAI
+
+      | - Suggested for production deployment for the supported use cases.
+      | - Smaller footprint and fewer dependencies.
+      | - More optimization and customization options.
+      | - Available in both Python and C++.
+      | - A limited set of supported use cases.
+
+      :doc:`Install the OpenVINO GenAI package <../get-started/install-openvino/install-openvino-genai>`
+      and run generative models out of the box. With custom API and tokenizers, among other
+      components, it manages the essential tasks such as the text generation loop, tokenization,
+      and scheduling, offering ease of use and high performance.
+
+   .. tab-item:: Hugging Face integration
+
+      | - Suggested for prototyping and, if the use case is not covered by OpenVINO GenAI, production.
+      | - Bigger footprint and more dependencies.
+      | - Limited customization due to the Hugging Face dependency.
+      | - Not usable for C++ applications.
+      | - A very wide range of supported models.
+
+      Using Optimum Intel is a great way to experiment with different models and scenarios,
+      thanks to a simple interface for the popular API and infrastructure offered by Hugging Face.
+      It also enables weight compression with
+      `Neural Network Compression Framework (NNCF) <https://github.com/openvinotoolkit/nncf>`__,
+      as well as conversion on the fly. It may offer lower performance for integration into
+      a final product, though.
+
+`Check out the GenAI Quick-start Guide [PDF] <https://docs.openvino.ai/2024/_static/download/GenAI_Quick_Start_Guide.pdf>`__
 
 The advantages of using OpenVINO for LLM deployment:
 
-* **OpenVINO offers optimized LLM inference**:
-  provides a full C/C++ API, leading to faster operation than Python-based runtimes; includes a
-  Python API for rapid development, with the option for further optimization in C++.
-* **Compatible with diverse hardware**:
-  supports CPUs, GPUs, and neural accelerators across ARM and x86/x64 architectures, integrated
-  Intel® Processor Graphics, discrete Intel® Arc™ A-Series Graphics, and discrete Intel® Data
-  Center GPU Flex Series; features automated optimization to maximize performance on target
-  hardware.
-* **Requires fewer dependencies**:
-  than frameworks like Hugging Face and PyTorch, resulting in a smaller binary size and reduced
-  memory footprint, making deployments easier and updates more manageable.
-* **Provides compression and precision management techniques**:
-  such as 8-bit and 4-bit weight compression, including embedding layers, and storage format
-  reduction. This includes fp16 precision for non-compressed models and int8/int4 for compressed
-  models, like GPTQ models from `Hugging Face <https://huggingface.co/models>`__.
-* **Supports a wide range of deep learning models and architectures**:
-  including text, image, and audio generative models like Llama 2, MPT, OPT, Stable Diffusion,
-  Stable Diffusion XL. This enables the development of multimodal applications, allowing for
-  write-once, deploy-anywhere capabilities.
-* **Enhances inference capabilities**:
-  fused inference primitives such as Scaled Dot Product Attention, Rotary Positional Embedding,
-  Group Query Attention, and Mixture of Experts. It also offers advanced features like in-place
-  KV-cache, dynamic quantization, KV-cache quantization and encapsulation, dynamic beam size
-  configuration, and speculative sampling.
-* **Provides stateful model optimization**:
-  models from the Hugging Face Transformers are converted into a stateful form, optimizing
-  inference performance and memory usage in long-running text generation tasks by managing past
-  KV-cache tensors more efficiently internally. This feature is automatically activated for many
-  supported models, while unsupported ones remain stateless. Learn more about the
-  :doc:`Stateful models and State API <../openvino-workflow/running-inference/stateful-models>`.
-
-OpenVINO offers three main paths for Generative AI use cases:
-
-* **Hugging Face**: use OpenVINO as a backend for Hugging Face frameworks (transformers,
-  diffusers) through the `Optimum Intel <https://huggingface.co/docs/optimum/intel/inference>`__
-  extension.
-* **OpenVINO GenAI Flavor**: use OpenVINO GenAI APIs (Python and C++).
-* **Base OpenVINO**: use OpenVINO native APIs (Python and C++) with
-  `custom pipeline code <https://github.com/openvinotoolkit/openvino.genai>`__.
-
-In both cases, the OpenVINO runtime is used for inference, and OpenVINO tools are used for
-optimization. The main differences are in footprint size, ease of use, and customizability.
-
-The Hugging Face API is easy to learn, provides a simple interface and hides the complexity of
-model initialization and text generation for a better developer experience. However, it has more
-dependencies, less customization, and cannot be ported to C/C++.
-
-The OpenVINO GenAI Flavor reduces the complexity of LLMs implementation by
-automatically managing essential tasks like the text generation loop, tokenization,
-and scheduling. The Native OpenVINO API provides a more hands-on experience,
-requiring manual setup of these functions. Both methods are designed to minimize dependencies
-and the overall application footprint and enable the use of generative models in C++ applications.
-
-It is recommended to start with Hugging Face frameworks to experiment with different models and
-scenarios. Then the model can be used with OpenVINO APIs if it needs to be optimized
-further. Optimum Intel provides interfaces that enable model optimization (weight compression)
-using `Neural Network Compression Framework (NNCF) <https://github.com/openvinotoolkit/nncf>`__,
-and export models to the OpenVINO model format for use in native API applications.
-
-Proceed to run LLMs with:
+.. dropdown:: Fewer dependencies and smaller footprint
+   :animate: fade-in-slide-down
+   :color: secondary
+
+   A smaller binary size and reduced memory footprint than frameworks such as Hugging Face and
+   PyTorch make deployments easier and updates more manageable.
+
+.. dropdown:: Compression and precision management
+   :animate: fade-in-slide-down
+   :color: secondary
+
+   Techniques such as 8-bit and 4-bit weight compression, including embedding layers, and storage
+   format reduction. This includes fp16 precision for non-compressed models and int8/int4 for
+   compressed models, like GPTQ models from `Hugging Face <https://huggingface.co/models>`__.
+
+.. dropdown:: Enhanced inference capabilities
+   :animate: fade-in-slide-down
+   :color: secondary
+
+   Advanced features like in-place KV-cache, dynamic quantization, KV-cache quantization and
+   encapsulation, dynamic beam size configuration, speculative sampling, and more are available.
+
+.. dropdown:: Stateful model optimization
+   :animate: fade-in-slide-down
+   :color: secondary
+
+   Models from Hugging Face Transformers are converted into a stateful form, optimizing
+   inference performance and memory usage in long-running text generation tasks by managing past
+   KV-cache tensors more efficiently internally. This feature is automatically activated for
+   many supported models, while unsupported ones remain stateless. Learn more about the
+   :doc:`Stateful models and State API <../openvino-workflow/running-inference/stateful-models>`.
+
+.. dropdown:: Optimized LLM inference
+   :animate: fade-in-slide-down
+   :color: secondary
+
+   Includes a Python API for rapid development and C++ for further optimization, offering
+   better performance than Python-based runtimes.
+
+
+Proceed to guides on:
 
-* :doc:`Hugging Face and Optimum Intel <./llm_inference_guide/llm-inference-hf>`
 * :doc:`OpenVINO GenAI Flavor <./llm_inference_guide/genai-guide>`
-* :doc:`Native OpenVINO API <./llm_inference_guide/llm-inference-native-ov>`
-
-The table below summarizes the differences between Hugging Face and the native OpenVINO API
-approaches.
-
-.. dropdown:: Differences between Hugging Face and the native OpenVINO API
-
-   .. list-table::
-      :widths: 20 25 55
-      :header-rows: 1
-
-      * -
-        - Hugging Face through OpenVINO
-        - OpenVINO Native API
-      * - Model support
-        - Supports transformer-based models such as LLMs
-        - Supports all model architectures from most frameworks
-      * - APIs
-        - Python (Hugging Face API)
-        - Python, C++ (OpenVINO API)
-      * - Model Format
-        - Source Framework / OpenVINO
-        - Source Framework / OpenVINO
-      * - Inference code
-        - Hugging Face based
-        - Custom inference pipelines
-      * - Additional dependencies
-        - Many Hugging Face dependencies
-        - Lightweight (e.g. numpy, etc.)
-      * - Application footprint
-        - Large
-        - Small
-      * - Pre/post-processing and glue code
-        - Provided through high-level Hugging Face APIs
-        - Must be custom implemented (see OpenVINO samples and notebooks)
-      * - Performance
-        - Good, but less efficient compared to native APIs
-        - Inherent speed advantage with C++, but requires hands-on optimization
-      * - Flexibility
-        - Constrained to Hugging Face API
-        - High flexibility with Python and C++; allows custom coding
-      * - Learning Curve and Effort
-        - Lower learning curve; quick to integrate
-        - Higher learning curve; requires more effort in integration
-      * - Ideal Use Case
-        - Ideal for quick prototyping and Python-centric projects
-        - Best suited for high-performance, resource-optimized production environments
-      * - Model Serving
-        - Paid service, based on CPU/GPU usage with Hugging Face
-        - Free code solution, run script for own server; costs may incur for cloud services
-          like AWS but generally cheaper than Hugging Face rates
+* :doc:`Hugging Face and Optimum Intel <./llm_inference_guide/llm-inference-hf>`
+
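For illustration, a minimal Python sketch of the OpenVINO GenAI flow described in the tab
above, assuming the ``openvino_genai`` package is installed and ``./TinyLlama-1.1B-Chat-ov``
is a hypothetical local directory holding an already converted model::

   import openvino_genai as ov_genai

   # The pipeline bundles the tokenizer, the text generation loop, and
   # scheduling, so none of these steps needs manual setup.
   pipe = ov_genai.LLMPipeline("./TinyLlama-1.1B-Chat-ov", "CPU")

   # Generate a completion, capping the number of new tokens.
   print(pipe.generate("What is OpenVINO?", max_new_tokens=100))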
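Likewise, a sketch of the Optimum Intel path with conversion on the fly, assuming
``optimum-intel`` with its OpenVINO extras is installed; the model ID is only an example::

   from optimum.intel import OVModelForCausalLM
   from transformers import AutoTokenizer

   model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # example Hugging Face model

   # export=True converts the original checkpoint to OpenVINO IR on the fly.
   model = OVModelForCausalLM.from_pretrained(model_id, export=True)
   tokenizer = AutoTokenizer.from_pretrained(model_id)

   # The familiar Hugging Face generate() API runs on the OpenVINO runtime.
   inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
   outputs = model.generate(**inputs, max_new_tokens=100)
   print(tokenizer.decode(outputs[0], skip_special_tokens=True))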
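The "Compression and precision management" dropdown mentions 8-bit and 4-bit weight
compression. One way to apply it is through the NNCF integration in Optimum Intel, sketched
here under the same assumptions as above::

   from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

   # Compress weights to int4 while exporting; the resulting directory can
   # then be loaded by OpenVINO GenAI or native OpenVINO applications.
   model = OVModelForCausalLM.from_pretrained(
       "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # example Hugging Face model
       export=True,
       quantization_config=OVWeightQuantizationConfig(bits=4),
   )
   model.save_pretrained("./TinyLlama-1.1B-Chat-ov")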

docs/articles_en/learn-openvino/llm_inference_guide/genai-guide-npu.rst

+1 −1
@@ -1,4 +1,4 @@
-Run LLMs with OpenVINO GenAI Flavor on NPU
+Inference with OpenVINO GenAI
 ==========================================
 
 .. meta::
