
Commit 27ae75d

add OpenTelemetry_OPEA_Guide.rst and ChatQnA.md for telemetry support
Signed-off-by: Tsai, Louie <louie.tsai@intel.com>
1 parent 85c9248 commit 27ae75d

24 files changed: +288 −0 lines changed

OpenTelemetry_OPEA_Guide.rst
@@ -0,0 +1,180 @@
.. _OpenTelemetry_OPEA_Guide:

OpenTelemetry on OPEA Guide
#############################

Overview
********
OpenTelemetry (also referred to as OTel) is an open source observability framework made up of a collection of tools, APIs, and SDKs.
OTel enables developers to instrument, generate, collect, and export telemetry data for analysis and to understand software performance and behavior.
The telemetry data can come in the form of traces, metrics, and logs.
OPEA integrates OpenTelemetry's metrics and tracing capabilities to enhance its telemetry support, providing users with valuable insights into system performance.


How It Works
************
OPEA Comps provides telemetry functionality for metrics and tracing by integrating with tools such as Prometheus, Grafana, and Jaeger. Below is a brief introduction to the workflows of these tools:

.. image:: assets/opea_telemetry.jpg
   :width: 800
   :alt: Alternative text


The majority of OPEA's micro and mega services support OpenTelemetry metrics, which are exported in Prometheus format via the /metrics endpoint.
For further guidance, please refer to the section on `Telemetry Metrics <https://github.com/opea-project/GenAIComps/tree/main/comps/cores/telemetry#metrics>`_.
Prometheus collects these metrics from the OPEA service endpoints, while Grafana uses Prometheus as a data source to visualize them on pre-configured dashboards.

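As a quick sanity check, you can query a service's metrics endpoint directly. The sketch below is illustrative and assumes the ChatQnA megaservice's default port 8888; substitute the port of whichever OPEA service you want to inspect:

.. code-block:: bash

   # Dump the first few Prometheus-format metrics exposed by an OPEA service.
   curl -s http://${host_ip}:8888/metrics | head -n 20
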
OPEA also supports OpenTelemetry tracing, with several OPEA GenAIExamples instrumented to trace key functions such as microservice execution and LLM generation.
Additionally, Hugging Face's Text Embedding Inference (TEI) and Text Generation Inference (TGI) services are enabled for select OPEA GenAIExamples.
The Jaeger UI monitors trace events from OPEA microservices, TEI, and TGI. Once Jaeger endpoints are configured in OPEA microservices, TEI, and TGI,
trace data is automatically reported and visualized in the Jaeger UI.


Deployment
**********

In the OpenTelemetry-enabled GenAIExamples, OpenTelemetry metrics are activated by default, while OpenTelemetry tracing is initially disabled.
Similarly, the telemetry UI services, including Grafana, Prometheus, and Jaeger, are also disabled by default.
To enable OTel tracing along with Grafana, Prometheus, and Jaeger for an example, include an additional telemetry Docker Compose YAML file.
For instance, adding compose.telemetry.yaml alongside compose.yaml will activate all telemetry features for the example.

.. code-block:: bash

   source ./set_env.sh
   docker compose -f compose.yaml -f compose.telemetry.yaml up -d


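To confirm that the telemetry services started alongside the application, you can list the running containers. This is a minimal sketch; the exact container names vary from example to example:

.. code-block:: bash

   # The Grafana, Prometheus, and Jaeger containers should appear in the list.
   docker compose -f compose.yaml -f compose.telemetry.yaml ps
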
Below are the GenAIExamples that include support for Grafana, Prometheus, and Jaeger services.

.. toctree::
   :maxdepth: 1

   ChatQnA <deploy/ChatQnA>


How to Monitor
****************

OpenTelemetry metrics and tracing can be visualized through three primary monitoring web UIs: Prometheus, Grafana, and Jaeger.

1. Prometheus
+++++++++++++++

The Prometheus UI provides insight into which services have active metrics endpoints.
By default, Prometheus operates on port 9090.
You can access the Prometheus targets page at the following URL:

.. code-block:: bash

   http://${host_ip}:9090/targets

Services with accessible metrics endpoints are marked as "up" in Prometheus.
If a service is marked as "down," the Grafana dashboards will be unable to display its metrics.

.. image:: assets/prometheus.png
   :width: 800
   :alt: Alternative text

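The same target health information is available from the Prometheus HTTP API, which can be handy for quick command-line checks. The sketch below is illustrative and assumes ``jq`` is installed on the host:

.. code-block:: bash

   # Print each scrape target and its health ("up" or "down").
   curl -s "http://${host_ip}:9090/api/v1/targets" \
     | jq -r '.data.activeTargets[] | "\(.labels.job): \(.health)"'
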
2. Grafana
+++++++++++++++

The Grafana UI displays telemetry metrics through pre-defined dashboards, providing a clear visualization of the data.
For OPEA examples, Grafana is configured by default to use Prometheus as its data source, eliminating the need for manual setup.
The Grafana UI can be accessed at the following URL:

.. code-block:: bash

   http://${host_ip}:3000


.. image:: assets/grafana_init.png
   :width: 800
   :alt: Alternative text

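If the page does not load, a quick way to check whether Grafana itself is healthy is its standard health endpoint. This is an illustrative sketch, not part of the OPEA tooling:

.. code-block:: bash

   # Returns a small JSON document with the database status and Grafana version.
   curl -s http://${host_ip}:3000/api/health
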
To view the pre-defined dashboards, click on the "Dashboards" tab located on the left-hand side of the Grafana UI.
This allows you to explore the various dashboards that have been set up to visualize the telemetry metrics.

.. image:: assets/grafana_dashboard_init.png
   :width: 800
   :alt: Alternative text

Detailed explanations of each dashboard are provided within the telemetry sections of the respective GenAIExamples.
These sections offer insights into how to interpret the data and utilize the dashboards effectively for monitoring and analysis.

.. toctree::
   :maxdepth: 1

   ChatQnA <deploy/ChatQnA>


3. Jaeger
+++++++++++++++

The Jaeger UI is instrumental in understanding function tracing for each request, providing visibility into the execution flow and timing of microservices.
OPEA traces the execution time of each microservice and monitors key functions within them.
By default, Jaeger operates on port 16686.
The Jaeger UI can be accessed at the following URL:

.. code-block:: bash

   http://${host_ip}:16686

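The same port also serves Jaeger's HTTP query API, which backs the UI and can be used for quick checks from the command line. The calls below are an illustrative sketch rather than part of the OPEA tooling:

.. code-block:: bash

   # List the services that have reported traces to Jaeger.
   curl -s "http://${host_ip}:16686/api/services"

   # Fetch the five most recent traces for the "opea" service.
   curl -s "http://${host_ip}:16686/api/traces?service=opea&limit=5"
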
Traces will only appear in the Jaeger UI if the relevant functions have been executed.
Therefore, without running the example, the UI will not display any trace data.

.. image:: assets/jaeger_ui_init.png
   :width: 400
   :alt: Alternative text

Once the example is run, refresh the Jaeger UI webpage, and the OPEA service should appear under the "Services" tab,
indicating that trace data is being captured and displayed.

.. image:: assets/jaeger_ui_opea.png
   :width: 400
   :alt: Alternative text

Select "opea" as the service, then click the "Find Traces" button to view the trace data associated with the service's execution.

.. image:: assets/jaeger_ui_opea_trace.png
   :width: 400
   :alt: Alternative text


All traces are displayed in the UI.
The diagram in the upper right corner provides a visual representation of all requests along the timeline, while
the diagrams in the lower right corner illustrate all spans within each request, offering detailed insight into the execution flow and timing.

.. image:: assets/jaeger_ui_opea_chatqna_1req.png
   :width: 800
   :alt: Alternative text

Detailed explanations of each Jaeger diagram are provided within the telemetry sections of the respective GenAIExamples.
These sections offer insights into how to interpret the data and utilize the diagrams effectively for monitoring and analysis.

.. toctree::
   :maxdepth: 1

   ChatQnA <deploy/ChatQnA>


Code Instrumentation for OPEA Tracing
****************************************

Enabling OPEA OpenTelemetry tracing for a function is straightforward.
First, import opea_telemetry, and then apply the Python decorator @opea_telemetry to the function you wish to trace.
Below is an example of how to trace your_func using OPEA tracing:

.. code-block:: python

   from comps import opea_telemetry

   # The decorator records a trace span each time the decorated function runs.
   @opea_telemetry
   async def your_func():
       pass

ChatQnA.md: +107 lines changed
@@ -0,0 +1,107 @@
# OpenTelemetry on ChatQnA Application

Each microservice in ChatQnA is instrumented with `opea_telemetry`, enabling Jaeger to provide a detailed time breakdown across microservices for each request.
Additionally, ChatQnA features a pre-defined Grafana dashboard for its megaservice, alongside a vLLM Grafana dashboard.
A dashboard for monitoring CPU statistics is also available, offering comprehensive insights into system performance and resource utilization.

# Table of contents

1. [Deployment](#deployment)
2. [Telemetry Tracing with Jaeger on Gaudi](#telemetry-tracing-with-jaeger-on-gaudi)
3. [Telemetry Metrics with Grafana on Gaudi](#telemetry-metrics-with-grafana-on-gaudi)

## Deployment

### Xeon

```bash
cd GenAIExamples/ChatQnA/docker_compose/intel/cpu/xeon/
docker compose -f compose.yaml -f compose.telemetry.yaml up -d
```

### Gaudi

```bash
cd GenAIExamples/ChatQnA/docker_compose/intel/hpu/gaudi/
docker compose -f compose.yaml -f compose.telemetry.yaml up -d
```

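Once the services are up, send a question to ChatQnA so that traces and metrics are generated. This is a sketch based on the standard ChatQnA example; the port (8888) and payload may differ in your deployment:

```bash
curl http://${host_ip}:8888/v1/chatqna \
  -H "Content-Type: application/json" \
  -d '{"messages": "What is OPEA?"}'
```
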
## Telemetry Tracing with Jaeger on Gaudi

After ChatQnA processes a question, two traces should appear along the timeline.
The trace for `opea: ServiceOrchestrator.schedule` runs on the CPU and includes seven spans, one of which represents the LLM host functions in general.
For LLM functions executed on Gaudi, stream requests are displayed under `opea: llm_generate_stream`.
This trace contains two spans: one for the first token and another for all subsequent tokens.

![chatqna_1req](../assets/jaeger_ui_opea_chatqna_1req.png)

The first trace along the timeline is `opea: ServiceOrchestrator.schedule`, which operates on the CPU.
This trace provides insight into the orchestration and scheduling of services within the ChatQnA megaservice, highlighting the execution flow during the process.

![chatqna_cpu_req](../assets/jaeger_ui_opea_chatqna_req_cpu.png)

Clicking on the `opea: ServiceOrchestrator.schedule` trace expands it to reveal seven spans along the timeline.
The first span represents the main schedule function, which has minimal self-execution time, indicated in black.
The second span corresponds to the embedding microservice execution time, taking 33.72 ms as shown in the diagram.
Following the embedding is the retriever span, which took only 3.13 ms.
The last span captures the LLM functions on the CPU, with an execution time of 41.99 ms.
These spans provide a detailed breakdown of the execution flow and timing for each component within the service orchestration.

![chatqna_cpu_breakdown](../assets/jaeger_ui_opea_chatqna_cpu_breakdown.png)

The second trace following the schedule trace is `opea: llm_generate_stream`, which operates on Gaudi, as depicted in the diagram.
This trace provides insight into the execution of LLM functions on Gaudi,
highlighting the processing of stream requests and the associated spans for token generation.

![chatqna_gaudi_req](../assets/jaeger_ui_opea_chatqna_req_gaudi.png)

Clicking on the `opea: llm_generate_stream` trace expands it to reveal two spans along the timeline.
The first span represents the execution time for the first token, which took 15.12 ms in this run.
The second span captures the execution time for all subsequent tokens, taking 920 ms as shown in the diagram.
These spans provide a detailed view of the token generation process and the performance of LLM functions on Gaudi.

![chatqna_gaudi_breakdown](../assets/jaeger_ui_opea_chatqna_req_breakdown_2.png)

Overall, the traces on the CPU consist of seven spans and are represented as larger circles.
In contrast, the traces on Gaudi have two spans and are depicted as smaller circles.
The diagrams below illustrate a run with 16 user requests, resulting in a total of 32 traces.
In this scenario, the larger circles, representing CPU traces, took less time than the smaller circles,
indicating that the requests required more processing time on Gaudi than on the CPU.

![chatqna_16reqs](../assets/chatqna_16reqs.png)

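To reproduce a similar picture, you can fire a small batch of concurrent requests at the megaservice. This is an illustrative sketch only; it assumes the ChatQnA megaservice listens on port 8888 and accepts the standard `{"messages": ...}` payload:

```bash
# Send 16 concurrent questions, then wait for all of them to finish.
for i in $(seq 16); do
  curl -s http://${host_ip}:8888/v1/chatqna \
    -H "Content-Type: application/json" \
    -d '{"messages": "What is OPEA?"}' > /dev/null &
done
wait
```
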
## Telemetry Metrics with Grafana on Gaudi

The ChatQnA application offers several useful dashboards that provide valuable insights into its performance and operations.
These dashboards are designed to help monitor various aspects of the application, such as service execution times, resource utilization, and system health,
enabling users to effectively manage and optimize the application.

### ChatQnA MegaService Dashboard

This dashboard provides metrics for services within the ChatQnA megaservice.
The `chatqna-backend-server` service, which functions as the megaservice,
is highlighted with its average response time displayed across multiple runs.
Additionally, the dashboard presents CPU and memory usage statistics for the megaservice,
offering a comprehensive view of its performance and resource consumption.

![chatqna_backend_server](../assets/Grafana_chatqna_backend_server_1.png)

The dashboard can also display metrics for the `dataprep-redis-service` and the retriever service.
These metrics provide insights into the performance and resource utilization of these services,
allowing for a more comprehensive understanding of the ChatQnA application's overall operation.

![chatqna_dataprep](../assets/Grafana_chatqna_dataprep.png)

![chatqna_retriever](../assets/Grafana_chatqna_retriever.png)

### LLM Dashboard

This dashboard presents metrics for the LLM service, including key performance indicators such as request latency, time per output token,
and time to first token, among others.
These metrics offer valuable insights into the efficiency and responsiveness of the LLM service,
helping to identify areas for optimization and ensuring smooth operation.

![vllm_dashboard](../assets/Grafana_vLLM.png)

The dashboard also displays metrics for request prompt length and output length.

![vllm_dashboard_2](../assets/Grafana_vLLM_2.png)
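The panels on these dashboards are backed by ordinary PromQL queries against the metrics that vLLM exports. As an illustrative sketch (the metric name follows vLLM's Prometheus naming and may differ between vLLM versions), the 95th-percentile time to first token over the last five minutes can be queried directly from Prometheus:

```bash
curl -s "http://${host_ip}:9090/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.95, rate(vllm:time_to_first_token_seconds_bucket[5m]))'
```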

tutorial/index.rst

+1
@@ -16,6 +16,7 @@ Provide following tutorials to cover common user cases:
    DocSum/DocSum_Guide
    DocIndexRetriever/DocIndexRetriever_Guide
    VideoQnA/VideoQnA_Guide
+   OpenTelemetry/OpenTelemetry_OPEA_Guide

-----
