@@ -178,11 +178,18 @@ Export the name of the `Secret` to the environment:
export REGISTRY_SECRET=anna-pull-secret
```

+ You can optionally set a custom EPP image (otherwise, the default will be used):
+
+ ```console
+ export EPP_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
+ export EPP_TAG="<YOUR_TAG>"
+ ```
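+
+ If you do point at a custom image, it can save a debugging round trip to confirm the image is pullable before deploying. A minimal check, assuming you have docker installed and are logged in to the registry:
+
+ ```console
+ docker pull "${EPP_IMAGE}:${EPP_TAG}"
+ ```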
+
Set the `VLLM_MODE` environment variable based on which version of vLLM you want to deploy:

- `vllm-sim`: Lightweight simulator for simple environments
- `vllm`: Full vLLM model server for real inference
- - `vllm-p2p`: Full vLLM with LMCache P2P support for distributed KV caching
+ - `vllm-p2p`: Full vLLM with LMCache P2P support to enable KV-cache-aware routing

```console
export VLLM_MODE=vllm-sim # or vllm / vllm-p2p
@@ -197,18 +204,14 @@ export VLLM_SIM_TAG="<YOUR_TAG>"
```

For `vllm` and `vllm-p2p`:
-
+ - Set the vLLM image:
```console
export VLLM_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
export VLLM_TAG="<YOUR_TAG>"
```
-
- The same thing will need to be done for the EPP:
-
- ```console
- export EPP_IMAGE="<YOUR_REGISTRY>/<YOUR_IMAGE>"
- export EPP_TAG="<YOUR_TAG>"
- ```
+ - Set the Hugging Face token variable:
+
+ ```console
+ export HF_TOKEN="<HF_TOKEN>"
+ ```
+
+ **Warning**: For `vllm` mode the default deployment uses the `llama3-8b` model, and `vllm-p2p` uses `mistral`. Make sure you have permission to access these models in their respective repositories.
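+
+ If your deployment manifests consume the token from a Kubernetes `Secret` rather than a plain environment variable, you can create one from the exported value. This is a minimal sketch: the secret name `hf-token` and key `token` are assumptions, so match whatever your manifests actually reference:
+
+ ```console
+ kubectl -n ${NAMESPACE} create secret generic hf-token --from-literal=token=${HF_TOKEN}
+ ```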

Once all this is set up, you can deploy the environment:

@@ -224,12 +227,25 @@ kubectl -n ${NAMESPACE} port-forward service/inference-gateway 8080:80
```

And making requests with `curl`:
+ - `vllm-sim`

- ```console
- curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
- -d '{"model":"food-review","prompt":"hi","max_tokens":10,"temperature":0}' | jq
- ```
+ ```console
+ curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
+ -d '{"model":"food-review","prompt":"hi","max_tokens":10,"temperature":0}' | jq
+ ```
+
+ - `vllm`
+
+ ```console
+ curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
+ -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"hi","max_tokens":10,"temperature":0}' | jq
+ ```

+ - `vllm-p2p`
+
+ ```console
+ curl -s -w '\n' http://localhost:8080/v1/completions -H 'Content-Type: application/json' \
+ -d '{"model":"mistralai/Mistral-7B-Instruct-v0.2","prompt":"hi","max_tokens":10,"temperature":0}' | jq
+ ```
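+
+ Regardless of mode, you can sanity-check which models the gateway is serving via the OpenAI-compatible `/v1/models` endpoint, assuming the backing server exposes it (vLLM does; the simulator's coverage may differ):
+
+ ```console
+ curl -s http://localhost:8080/v1/models | jq
+ ```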
#### Development Cycle

> **WARNING**: This is a very manual process at the moment. We expect to make