
Commit d4dcbd1

Enable vllm for DocSum (#1716)
Set vLLM as the default LLM serving backend, and add the related Docker Compose files, READMEs, and test scripts. Fixes issue #1436.

Signed-off-by: letonghan <letong.han@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent 87baeb8 commit d4dcbd1

12 files changed: +1397 −311 lines

DocSum/docker_compose/intel/cpu/xeon/README.md

+35 −2
@@ -2,6 +2,8 @@
 
 This document outlines the deployment process for a Document Summarization application utilizing the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline on an Intel Xeon server. The steps include Docker image creation, container deployment via Docker Compose, and service execution to integrate microservices such as `llm`. We will publish the Docker images to Docker Hub soon, which will simplify the deployment process for this service.
 
+The default pipeline deploys with vLLM as the LLM serving component. A TGI backend for the LLM microservice is also available; refer to the [start-microservice-docker-containers](#start-microservice-docker-containers) section of this page.
+
 ## 🚀 Apply Intel Xeon Server on AWS
 
 To apply an Intel Xeon server on AWS, start by creating an AWS account if you don't have one already. Then, head to the [EC2 Console](https://console.aws.amazon.com/ec2/v2/home) to begin the process. Within the EC2 service, select the Amazon EC2 M7i or M7i-flex instance type to leverage 4th Generation Intel Xeon Scalable processors. These instances are optimized for high-performance computing and demanding workloads.
@@ -116,9 +118,20 @@ To set up environment variables for deploying Document Summarization services, f
 
 ```bash
 cd GenAIExamples/DocSum/docker_compose/intel/cpu/xeon
+```
+
+If using vLLM as the LLM serving backend:
+
+```bash
 docker compose -f compose.yaml up -d
 ```
 
+If using TGI as the LLM serving backend:
+
+```bash
+docker compose -f compose_tgi.yaml up -d
+```
+
 You will have the following Docker Images:
 
 1. `opea/docsum-ui:latest`
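Note that both compose files publish the same default host ports (8008 for the LLM serving endpoint, 8888 for the backend, 5173 for the UI), so only one backend variant should run at a time. A hedged switch-over sketch, assuming the default vLLM stack is currently up:

```bash
# Sketch: tear down the default vLLM stack before starting the TGI variant,
# since both compose files bind the same default host ports.
docker compose -f compose.yaml down
docker compose -f compose_tgi.yaml up -d
```
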
@@ -128,10 +141,30 @@ You will have the following Docker Images:
 
 ### Validate Microservices
 
-1. TGI Service
+1. LLM backend Service
+
+On the first startup, this service takes more time to download, load, and warm up the model. Once that finishes, the service is ready.
+Try the command below to check whether the LLM serving backend is ready.
+
+```bash
+# vLLM service
+docker logs docsum-xeon-vllm-service 2>&1 | grep complete
+# If the service is ready, you will get a response like the one below.
+INFO: Application startup complete.
+```
+
+```bash
+# TGI service
+docker logs docsum-xeon-tgi-service | grep Connected
+# If the service is ready, you will get a response like the one below.
+2024-09-03T02:47:53.402023Z INFO text_generation_router::server: router/src/server.rs:2311: Connected
+```
+
+Then try the `cURL` command below to validate the services.
 
 ```bash
-curl http://${host_ip}:8008/generate \
+# either vLLM or TGI service
+curl http://${host_ip}:8008/v1/chat/completions \
 -X POST \
 -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
 -H 'Content-Type: application/json'
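A side note on the request body: the path above is the OpenAI-style chat-completions route, while the payload still uses TGI's native `inputs`/`parameters` schema. If the serving backend rejects that body, an OpenAI-style payload along the lines of the sketch below may work instead (an assumption based on the OpenAI-compatible API that vLLM, and TGI's Messages API, expose; the model id must match the exported `LLM_MODEL_ID`).

```bash
# Sketch only: OpenAI-style chat-completions payload for the same endpoint.
curl http://${host_ip}:8008/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "'"${LLM_MODEL_ID}"'",
        "messages": [{"role": "user", "content": "What is Deep Learning?"}],
        "max_tokens": 17
      }'
```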

DocSum/docker_compose/intel/cpu/xeon/compose.yaml

+22 −23
@@ -2,54 +2,53 @@
 # SPDX-License-Identifier: Apache-2.0
 
 services:
-  tgi-server:
-    image: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu
-    container_name: tgi-server
+  vllm-service:
+    image: ${REGISTRY:-opea}/vllm:${TAG:-latest}
+    container_name: docsum-xeon-vllm-service
     ports:
-      - ${LLM_ENDPOINT_PORT:-8008}:80
+      - "8008:80"
+    volumes:
+      - "${MODEL_CACHE:-./data}:/root/.cache/huggingface/hub"
+    shm_size: 1g
     environment:
       no_proxy: ${no_proxy}
       http_proxy: ${http_proxy}
       https_proxy: ${https_proxy}
-      TGI_LLM_ENDPOINT: ${TGI_LLM_ENDPOINT}
-      HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
-      host_ip: ${host_ip}
-      LLM_ENDPOINT_PORT: ${LLM_ENDPOINT_PORT}
+      HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
+      LLM_MODEL_ID: ${LLM_MODEL_ID}
+      VLLM_TORCH_PROFILER_DIR: "/mnt"
     healthcheck:
-      test: ["CMD-SHELL", "curl -f http://${host_ip}:${LLM_ENDPOINT_PORT}/health || exit 1"]
+      test: ["CMD-SHELL", "curl -f http://localhost:80/health || exit 1"]
       interval: 10s
       timeout: 10s
       retries: 100
-    volumes:
-      - "${MODEL_CACHE:-./data}:/data"
-    shm_size: 1g
-    command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0 --max-input-length ${MAX_INPUT_TOKENS} --max-total-tokens ${MAX_TOTAL_TOKENS}
+    command: --model $LLM_MODEL_ID --host 0.0.0.0 --port 80
 
-  llm-docsum-tgi:
+  llm-docsum-vllm:
     image: ${REGISTRY:-opea}/llm-docsum:${TAG:-latest}
-    container_name: llm-docsum-server
+    container_name: docsum-xeon-llm-server
     depends_on:
-      tgi-server:
+      vllm-service:
         condition: service_healthy
     ports:
-      - ${DOCSUM_PORT:-9000}:9000
+      - ${LLM_PORT:-9000}:9000
     ipc: host
     environment:
       no_proxy: ${no_proxy}
       http_proxy: ${http_proxy}
       https_proxy: ${https_proxy}
       LLM_ENDPOINT: ${LLM_ENDPOINT}
+      LLM_MODEL_ID: ${LLM_MODEL_ID}
       HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
       MAX_INPUT_TOKENS: ${MAX_INPUT_TOKENS}
       MAX_TOTAL_TOKENS: ${MAX_TOTAL_TOKENS}
-      LLM_MODEL_ID: ${LLM_MODEL_ID}
       DocSum_COMPONENT_NAME: ${DocSum_COMPONENT_NAME}
       LOGFLAG: ${LOGFLAG:-False}
     restart: unless-stopped
 
   whisper:
     image: ${REGISTRY:-opea}/whisper:${TAG:-latest}
-    container_name: whisper-server
+    container_name: docsum-xeon-whisper-server
     ports:
       - "7066:7066"
     ipc: host
@@ -63,10 +62,10 @@ services:
     image: ${REGISTRY:-opea}/docsum:${TAG:-latest}
     container_name: docsum-xeon-backend-server
     depends_on:
-      - tgi-server
-      - llm-docsum-tgi
+      - vllm-service
+      - llm-docsum-vllm
     ports:
-      - "8888:8888"
+      - "${BACKEND_SERVICE_PORT:-8888}:8888"
     environment:
       - no_proxy=${no_proxy}
       - https_proxy=${https_proxy}
@@ -83,7 +82,7 @@ services:
     depends_on:
      - docsum-xeon-backend-server
     ports:
-      - "5173:5173"
+      - "${FRONTEND_SERVICE_PORT:-5173}:5173"
     environment:
       - no_proxy=${no_proxy}
       - https_proxy=${https_proxy}
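The reworked healthcheck probes `http://localhost:80/health` inside the container; because the service publishes `8008:80`, the same route can be checked from the host. A minimal sketch, assuming the default port mapping above:

```bash
# Sketch: probe vLLM's health route through the published 8008:80 mapping,
# then show Compose's own view of the healthcheck status.
curl -sf http://${host_ip}:8008/health && echo "vLLM serving endpoint is healthy"
docker compose -f compose.yaml ps vllm-service
```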

DocSum/docker_compose/intel/cpu/xeon/compose_tgi.yaml (new file, +97 −0; file name per the README's `compose_tgi.yaml` reference)
@@ -0,0 +1,97 @@
+# Copyright (C) 2024 Intel Corporation
+# SPDX-License-Identifier: Apache-2.0
+
+services:
+  tgi-server:
+    image: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu
+    container_name: docsum-xeon-tgi-server
+    ports:
+      - ${LLM_ENDPOINT_PORT:-8008}:80
+    volumes:
+      - "${MODEL_CACHE:-./data}:/data"
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+      TGI_LLM_ENDPOINT: ${TGI_LLM_ENDPOINT}
+      HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
+      host_ip: ${host_ip}
+    healthcheck:
+      test: ["CMD-SHELL", "curl -f http://localhost:80/health || exit 1"]
+      interval: 10s
+      timeout: 10s
+      retries: 100
+    shm_size: 1g
+    command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0 --max-input-length ${MAX_INPUT_TOKENS} --max-total-tokens ${MAX_TOTAL_TOKENS}
+
+  llm-docsum-tgi:
+    image: ${REGISTRY:-opea}/llm-docsum:${TAG:-latest}
+    container_name: docsum-xeon-llm-server
+    depends_on:
+      tgi-server:
+        condition: service_healthy
+    ports:
+      - ${LLM_PORT:-9000}:9000
+    ipc: host
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+      LLM_ENDPOINT: ${LLM_ENDPOINT}
+      LLM_MODEL_ID: ${LLM_MODEL_ID}
+      HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
+      MAX_INPUT_TOKENS: ${MAX_INPUT_TOKENS}
+      MAX_TOTAL_TOKENS: ${MAX_TOTAL_TOKENS}
+      DocSum_COMPONENT_NAME: ${DocSum_COMPONENT_NAME}
+      LOGFLAG: ${LOGFLAG:-False}
+    restart: unless-stopped
+
+  whisper:
+    image: ${REGISTRY:-opea}/whisper:${TAG:-latest}
+    container_name: docsum-xeon-whisper-server
+    ports:
+      - "7066:7066"
+    ipc: host
+    environment:
+      no_proxy: ${no_proxy}
+      http_proxy: ${http_proxy}
+      https_proxy: ${https_proxy}
+    restart: unless-stopped
+
+  docsum-xeon-backend-server:
+    image: ${REGISTRY:-opea}/docsum:${TAG:-latest}
+    container_name: docsum-xeon-backend-server
+    depends_on:
+      - tgi-server
+      - llm-docsum-tgi
+    ports:
+      - "${BACKEND_SERVICE_PORT:-8888}:8888"
+    environment:
+      - no_proxy=${no_proxy}
+      - https_proxy=${https_proxy}
+      - http_proxy=${http_proxy}
+      - MEGA_SERVICE_HOST_IP=${MEGA_SERVICE_HOST_IP}
+      - LLM_SERVICE_HOST_IP=${LLM_SERVICE_HOST_IP}
+      - ASR_SERVICE_HOST_IP=${ASR_SERVICE_HOST_IP}
+    ipc: host
+    restart: always
+
+  docsum-gradio-ui:
+    image: ${REGISTRY:-opea}/docsum-gradio-ui:${TAG:-latest}
+    container_name: docsum-xeon-ui-server
+    depends_on:
+      - docsum-xeon-backend-server
+    ports:
+      - "${FRONTEND_SERVICE_PORT:-5173}:5173"
+    environment:
+      - no_proxy=${no_proxy}
+      - https_proxy=${https_proxy}
+      - http_proxy=${http_proxy}
+      - BACKEND_SERVICE_ENDPOINT=${BACKEND_SERVICE_ENDPOINT}
+      - DOC_BASE_URL=${BACKEND_SERVICE_ENDPOINT}
+    ipc: host
+    restart: always
+
+networks:
+  default:
+    driver: bridge
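The TGI variant is driven entirely by the environment variables referenced in the file above. A hypothetical minimal export set is sketched below; the variable names come from `compose_tgi.yaml`, while the values are illustrative assumptions (in the repository they are normally provided by the example's environment setup script).

```bash
# Illustrative assumptions only: adapt values to your environment before running
# `docker compose -f compose_tgi.yaml up -d`.
export host_ip=$(hostname -I | awk '{print $1}')
export HUGGINGFACEHUB_API_TOKEN="<your HuggingFace token>"
export LLM_MODEL_ID="<model id, e.g. meta-llama/Meta-Llama-3-8B-Instruct>"   # placeholder model
export LLM_ENDPOINT_PORT=8008
export LLM_ENDPOINT="http://${host_ip}:${LLM_ENDPOINT_PORT}"
export TGI_LLM_ENDPOINT="http://${host_ip}:${LLM_ENDPOINT_PORT}"
export MAX_INPUT_TOKENS=1024
export MAX_TOTAL_TOKENS=2048
export DocSum_COMPONENT_NAME="OpeaDocSumTgi"   # assumed component name; check the llm-docsum microservice docs
export MEGA_SERVICE_HOST_IP=${host_ip}
export LLM_SERVICE_HOST_IP=${host_ip}
export ASR_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/docsum"   # assumed backend route
```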

DocSum/docker_compose/intel/hpu/gaudi/README.md

+35 −2
@@ -2,6 +2,8 @@
 
 This document outlines the deployment process for a Document Summarization application utilizing the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline on an Intel Gaudi server. The steps include Docker image creation, container deployment via Docker Compose, and service execution to integrate microservices such as `llm`. We will publish the Docker images to Docker Hub soon, which will simplify the deployment process for this service.
 
+The default pipeline deploys with vLLM as the LLM serving component. A TGI backend for the LLM microservice is also available; refer to the [start-microservice-docker-containers](#start-microservice-docker-containers) section of this page.
+
 ## 🚀 Build Docker Images
 
 ### 1. Build MicroService Docker Image
@@ -108,9 +110,20 @@ To set up environment variables for deploying Document Summarization services, f
 
 ```bash
 cd GenAIExamples/DocSum/docker_compose/intel/hpu/gaudi
+```
+
+If using vLLM as the LLM serving backend:
+
+```bash
 docker compose -f compose.yaml up -d
 ```
 
+If using TGI as the LLM serving backend:
+
+```bash
+docker compose -f compose_tgi.yaml up -d
+```
+
 You will have the following Docker Images:
 
 1. `opea/docsum-ui:latest`
@@ -120,10 +133,30 @@ You will have the following Docker Images:
 
 ### Validate Microservices
 
-1. TGI Service
+1. LLM backend Service
+
+On the first startup, this service takes more time to download, load, and warm up the model. Once that finishes, the service is ready.
+Try the command below to check whether the LLM serving backend is ready.
+
+```bash
+# vLLM service
+docker logs docsum-xeon-vllm-service 2>&1 | grep complete
+# If the service is ready, you will get a response like the one below.
+INFO: Application startup complete.
+```
+
+```bash
+# TGI service
+docker logs docsum-xeon-tgi-service | grep Connected
+# If the service is ready, you will get a response like the one below.
+2024-09-03T02:47:53.402023Z INFO text_generation_router::server: router/src/server.rs:2311: Connected
+```
+
+Then try the `cURL` command below to validate the services.
 
 ```bash
-curl http://${host_ip}:8008/generate \
+# either vLLM or TGI service
+curl http://${host_ip}:8008/v1/chat/completions \
 -X POST \
 -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17, "do_sample": true}}' \
 -H 'Content-Type: application/json'
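For scripted checks (for example in a CI job), the readiness grep above can be wrapped in a simple wait loop. A minimal sketch, reusing the container name from the log command shown in the diff:

```bash
# Sketch: poll the serving container's log until the startup-complete line quoted above appears.
until docker logs docsum-xeon-vllm-service 2>&1 | grep -q "Application startup complete"; do
  echo "Waiting for the LLM serving backend to download and warm up the model..."
  sleep 10
done
echo "LLM serving backend is ready."
```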
