-Large Language Model Inference Guide
+Generative AI workflow
 ========================================
 
 .. meta::
-   :description: Explore learning materials, including interactive
-                 Python tutorials and sample console applications that explain
-                 how to use OpenVINO features.
+   :description: Learn how to use OpenVINO to run generative AI models.
 
 
 .. toctree::
    :maxdepth: 1
    :hidden:
 
-   Run LLMs with Optimum Intel <llm_inference_guide/llm-inference-hf>
-   Run LLMs on OpenVINO GenAI Flavor <llm_inference_guide/genai-guide>
-   Run LLMs on Base OpenVINO <llm_inference_guide/llm-inference-native-ov>
+   Inference with OpenVINO GenAI <llm_inference_guide/genai-guide>
+   Inference with Optimum Intel <llm_inference_guide/llm-inference-hf>
+   Generative AI with Base OpenVINO (not recommended) <llm_inference_guide/llm-inference-native-ov>
    OpenVINO Tokenizers <llm_inference_guide/ov-tokenizers>
 
-Large Language Models (LLMs) like GPT are transformative deep learning networks capable of a
-broad range of natural language tasks, from text generation to language translation. OpenVINO
-optimizes the deployment of these models, enhancing their performance and integration into
-various applications. This guide shows how to use LLMs with OpenVINO, from model loading and
-conversion to advanced use cases.
+
+
+Generative AI is a specific area of deep learning models used for producing new and “original”
+data from input in the form of images, sound, or natural language text. Due to their complexity
+and size, generative AI pipelines are more difficult to deploy and run efficiently. OpenVINO
+simplifies the process and ensures high-performance integrations, with the following options:
+
+.. tab-set::
+
+   .. tab-item:: OpenVINO GenAI
+
+      | - Suggested for production deployment of the supported use cases.
+      | - Smaller footprint and fewer dependencies.
+      | - More optimization and customization options.
+      | - Available in both Python and C++.
+      | - A limited set of supported use cases.
+
+      :doc:`Install the OpenVINO GenAI package <../get-started/install-openvino/install-openvino-genai>`
+      and run generative models out of the box. With its own API and tokenizers, among other
+      components, it manages essential tasks such as the text generation loop, tokenization,
+      and scheduling, offering ease of use and high performance.
+
+   .. tab-item:: Hugging Face integration
+
+      | - Suggested for prototyping and, if the use case is not covered by OpenVINO GenAI, for production.
+      | - Bigger footprint and more dependencies.
+      | - Limited customization due to the Hugging Face dependency.
+      | - Not usable for C++ applications.
+      | - A very wide range of supported models.
+
+      Using Optimum Intel is a great way to experiment with different models and scenarios,
+      thanks to a simple interface to the popular API and infrastructure offered by Hugging Face.
+      It also enables weight compression with the
+      `Neural Network Compression Framework (NNCF) <https://github.com/openvinotoolkit/nncf>`__,
+      as well as conversion on the fly. It may, however, offer lower performance when integrated
+      into a final product.
+
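For a concrete feel of the OpenVINO GenAI path, a minimal Python sketch is shown below. It assumes
a model has already been exported to OpenVINO IR in a local ``model_dir`` folder; the folder name,
device, and generation settings are illustrative placeholders:

.. code-block:: python

   import openvino_genai

   # Load an LLM that has already been converted to OpenVINO IR
   # ("model_dir" is a placeholder path)
   pipe = openvino_genai.LLMPipeline("model_dir", "CPU")

   # The pipeline handles tokenization, the text generation loop,
   # and detokenization internally
   print(pipe.generate("What is OpenVINO?", max_new_tokens=100))
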
+`Check out the GenAI Quick-start Guide [PDF] <https://docs.openvino.ai/2024/_static/download/GenAI_Quick_Start_Guide.pdf>`__
 
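For the Optimum Intel path described in the Hugging Face tab above, a minimal sketch could look as
follows; the model ID is only an example, and ``export=True`` converts the model to the OpenVINO
format on the fly:

.. code-block:: python

   from optimum.intel import OVModelForCausalLM
   from transformers import AutoTokenizer, pipeline

   model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # example model ID

   # export=True converts the original PyTorch weights to OpenVINO IR on the fly
   model = OVModelForCausalLM.from_pretrained(model_id, export=True)
   tokenizer = AutoTokenizer.from_pretrained(model_id)

   # The familiar Hugging Face pipeline API now runs on the OpenVINO backend
   generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
   print(generator("What is OpenVINO?", max_new_tokens=100)[0]["generated_text"])
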
 The advantages of using OpenVINO for LLM deployment:
 
-* **OpenVINO offers optimized LLM inference**:
-  provides a full C/C++ API, leading to faster operation than Python-based runtimes; includes a
-  Python API for rapid development, with the option for further optimization in C++.
-* **Compatible with diverse hardware**:
-  supports CPUs, GPUs, and neural accelerators across ARM and x86/x64 architectures, integrated
-  Intel® Processor Graphics, discrete Intel® Arc™ A-Series Graphics, and discrete Intel® Data
-  Center GPU Flex Series; features automated optimization to maximize performance on target
-  hardware.
-* **Requires fewer dependencies**:
-  than frameworks like Hugging Face and PyTorch, resulting in a smaller binary size and reduced
-  memory footprint, making deployments easier and updates more manageable.
-* **Provides compression and precision management techniques**:
-  such as 8-bit and 4-bit weight compression, including embedding layers, and storage format
-  reduction. This includes fp16 precision for non-compressed models and int8/int4 for compressed
-  models, like GPTQ models from `Hugging Face <https://huggingface.co/models>`__.
-* **Supports a wide range of deep learning models and architectures**:
-  including text, image, and audio generative models like Llama 2, MPT, OPT, Stable Diffusion,
-  Stable Diffusion XL. This enables the development of multimodal applications, allowing for
-  write-once, deploy-anywhere capabilities.
-* **Enhances inference capabilities**:
-  fused inference primitives such as Scaled Dot Product Attention, Rotary Positional Embedding,
-  Group Query Attention, and Mixture of Experts. It also offers advanced features like in-place
-  KV-cache, dynamic quantization, KV-cache quantization and encapsulation, dynamic beam size
-  configuration, and speculative sampling.
-* **Provides stateful model optimization**:
-  models from the Hugging Face Transformers are converted into a stateful form, optimizing
-  inference performance and memory usage in long-running text generation tasks by managing past
-  KV-cache tensors more efficiently internally. This feature is automatically activated for many
-  supported models, while unsupported ones remain stateless. Learn more about the
-  :doc:`Stateful models and State API <../openvino-workflow/running-inference/stateful-models>`.
-
-OpenVINO offers three main paths for Generative AI use cases:
-
-* **Hugging Face**: use OpenVINO as a backend for Hugging Face frameworks (transformers,
-  diffusers) through the `Optimum Intel <https://huggingface.co/docs/optimum/intel/inference>`__
-  extension.
-* **OpenVINO GenAI Flavor**: use OpenVINO GenAI APIs (Python and C++).
-* **Base OpenVINO**: use OpenVINO native APIs (Python and C++) with
-  `custom pipeline code <https://github.com/openvinotoolkit/openvino.genai>`__.
-
-In both cases, the OpenVINO runtime is used for inference, and OpenVINO tools are used for
-optimization. The main differences are in footprint size, ease of use, and customizability.
-
-The Hugging Face API is easy to learn, provides a simple interface and hides the complexity of
-model initialization and text generation for a better developer experience. However, it has more
-dependencies, less customization, and cannot be ported to C/C++.
-
-The OpenVINO GenAI Flavor reduces the complexity of LLMs implementation by
-automatically managing essential tasks like the text generation loop, tokenization,
-and scheduling. The Native OpenVINO API provides a more hands-on experience,
-requiring manual setup of these functions. Both methods are designed to minimize dependencies
-and the overall application footprint and enable the use of generative models in C++ applications.
-
-It is recommended to start with Hugging Face frameworks to experiment with different models and
-scenarios. Then the model can be used with OpenVINO APIs if it needs to be optimized
-further. Optimum Intel provides interfaces that enable model optimization (weight compression)
-using `Neural Network Compression Framework (NNCF) <https://github.com/openvinotoolkit/nncf>`__,
-and export models to the OpenVINO model format for use in native API applications.
-
-Proceed to run LLMs with:
+.. dropdown:: Fewer dependencies and smaller footprint
+   :animate: fade-in-slide-down
+   :color: secondary
+
+   Less bloated than frameworks such as Hugging Face and PyTorch: a smaller binary size and
+   reduced memory footprint make deployments easier and updates more manageable.
+
+.. dropdown:: Compression and precision management
+   :animate: fade-in-slide-down
+   :color: secondary
+
+   Techniques such as 8-bit and 4-bit weight compression, including embedding layers, and
+   storage format reduction are available. This includes fp16 precision for non-compressed
+   models and int8/int4 for compressed models, such as GPTQ models from
+   `Hugging Face <https://huggingface.co/models>`__.
+
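As a rough sketch of applying weight compression to a model that is already in the OpenVINO
format (the file names, compression mode, and 0.8 ratio are illustrative assumptions; the NNCF
documentation covers the full set of options):

.. code-block:: python

   import openvino as ov
   import nncf

   core = ov.Core()
   model = core.read_model("model.xml")  # an LLM already converted to OpenVINO IR

   # Compress most weights to 4-bit; the remaining share stays in 8-bit to preserve accuracy
   compressed_model = nncf.compress_weights(
       model,
       mode=nncf.CompressWeightsMode.INT4_SYM,
       ratio=0.8,
       group_size=128,
   )
   ov.save_model(compressed_model, "model_int4.xml")
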
+.. dropdown:: Enhanced inference capabilities
+   :animate: fade-in-slide-down
+   :color: secondary
+
+   Advanced features such as in-place KV-cache, dynamic quantization, KV-cache quantization and
+   encapsulation, dynamic beam size configuration, speculative sampling, and more are
+   available.
+
+.. dropdown:: Stateful model optimization
+   :animate: fade-in-slide-down
+   :color: secondary
+
+   Models from Hugging Face Transformers are converted into a stateful form, which optimizes
+   inference performance and memory usage in long-running text generation tasks by managing past
+   KV-cache tensors more efficiently internally. This feature is automatically activated for
+   many supported models, while unsupported ones remain stateless. Learn more in
+   :doc:`Stateful models and State API <../openvino-workflow/running-inference/stateful-models>`.
+
+.. dropdown:: Optimized LLM inference
+   :animate: fade-in-slide-down
+   :color: secondary
+
+   Includes a Python API for rapid development and C++ for further optimization, offering
+   better performance than Python-based runtimes.
+
+
+Proceed to guides on:
 
-* :doc:`Hugging Face and Optimum Intel <./llm_inference_guide/llm-inference-hf>`
 * :doc:`OpenVINO GenAI Flavor <./llm_inference_guide/genai-guide>`
-* :doc:`Native OpenVINO API <./llm_inference_guide/llm-inference-native-ov>`
-
-The table below summarizes the differences between Hugging Face and the native OpenVINO API
-approaches.
-
-.. dropdown:: Differences between Hugging Face and the native OpenVINO API
-
-   .. list-table::
-      :widths: 20 25 55
-      :header-rows: 1
-
-      * -
-        - Hugging Face through OpenVINO
-        - OpenVINO Native API
-      * - Model support
-        - Supports transformer-based models such as LLMs
-        - Supports all model architectures from most frameworks
-      * - APIs
-        - Python (Hugging Face API)
-        - Python, C++ (OpenVINO API)
-      * - Model Format
-        - Source Framework / OpenVINO
-        - Source Framework / OpenVINO
-      * - Inference code
-        - Hugging Face based
-        - Custom inference pipelines
-      * - Additional dependencies
-        - Many Hugging Face dependencies
-        - Lightweight (e.g. numpy, etc.)
-      * - Application footprint
-        - Large
-        - Small
-      * - Pre/post-processing and glue code
-        - Provided through high-level Hugging Face APIs
-        - Must be custom implemented (see OpenVINO samples and notebooks)
-      * - Performance
-        - Good, but less efficient compared to native APIs
-        - Inherent speed advantage with C++, but requires hands-on optimization
-      * - Flexibility
-        - Constrained to Hugging Face API
-        - High flexibility with Python and C++; allows custom coding
-      * - Learning Curve and Effort
-        - Lower learning curve; quick to integrate
-        - Higher learning curve; requires more effort in integration
-      * - Ideal Use Case
-        - Ideal for quick prototyping and Python-centric projects
-        - Best suited for high-performance, resource-optimized production environments
-      * - Model Serving
-        - Paid service, based on CPU/GPU usage with Hugging Face
-        - Free code solution, run script for own server; costs may incur for cloud services
-          like AWS but generally cheaper than Hugging Face rates
+* :doc:`Hugging Face and Optimum Intel <./llm_inference_guide/llm-inference-hf>`
+
+