Release v0.8.2 · vllm-project/vllm

This release contains important bug fix for the V1 engine's memory usage. We highly recommend you upgrading!

Highlights

Revert "Use uv python for docker rather than ppa:deadsnakess/ppa (#13569)" (#15377)
Remove openvino support in favor of external plugin (#15339)

V1 Engine

Fix V1 Engine crash while handling requests with duplicate request id (#15043)
Support FP8 KV Cache (#14570, #15191)
Add flag to disable cascade attention (#15243)
Scheduler Refactoring: Add Scheduler Interface (#15250)
Structured Output
- Add disable-any-whitespace option support for xgrammar (#15316)
- guidance backend for structured output + auto fallback mode (#14779)
Spec Decode
- Enable spec decode for top-p & top-k sampling (#15063)
- Use better defaults for N-gram (#15358)
- Update target_logits in place for rejection sampling (#15427)
AMD
- Enable Triton(ROCm) Attention backend for Nvidia GPUs (#14071)
TPU
- Support V1 Sampler for ragged attention (#14227)
- Tensor parallel MP support (#15059)
- MHA Pallas backend (#15288)

Features

Integrate fastsafetensors loader for loading model weights (#10647)
Add guidance backend for structured output (#14589)

Others

Add Kubernetes deployment guide with CPUs (#14865)
Support reset prefix cache by specified device (#15003)
Support tool calling and reasoning parser (#14511)
Support --disable-uvicorn-access-log parameters (#14754)
Support Tele-FLM Model (#15023)
Add pipeline parallel support to TransformersModel (#12832)
Enable CUDA graph support for llama 3.2 vision (#14917)

What's Changed

[FEAT]Support reset prefix cache by specified device by @maobaolong in #15003
[BugFix][V1] Update stats.py by @WrRan in #15139
[V1][TPU] Change kv cache shape. by @vanbasten23 in #15145
[FrontEnd][Perf] merge_async_iterators fast-path for single-prompt requests by @njhill in #15150
[Docs] Annouce Ollama and Singapore Meetups by @simon-mo in #15161
[V1] TPU - Tensor parallel MP support by @alexm-redhat in #15059
[BugFix] Lazily import XgrammarBackend to avoid early cuda init by @njhill in #15171
[Doc] Clarify run vllm only on one node in distributed inference by @ruisearch42 in #15148
Fix broken tests by @jovsa in #14713
[Bugfix] Fix embedding assignment for InternVL-based models by @DarkLight1337 in #15086
fix "Total generated tokens:" is 0 if using --backend tgi and --endpo… by @sywangyi in #14673
[V1][TPU] Support V1 Sampler for ragged attention by @NickLucche in #14227
[Benchmark] Allow oversample request in benchmark dataset by @JenZhao in #15170
[Core][V0] Add guidance backend for structured output by @russellb in #14589
[Doc] Update Mistral Small 3.1/Pixtral example by @ywang96 in #15184
[Misc] support --disable-uvicorn-access-log parameters by @chaunceyjiang in #14754
[Attention] Flash Attention 3 - fp8 by @mickaelseznec in #14570
[Doc] Update README.md by @DarkLight1337 in #15187
Enable CUDA graph support for llama 3.2 vision by @mritterfigma in #14917
typo: Update config.py by @WrRan in #15189
[Frontend][Bugfix] support prefill decode disaggregation on deepseek by @billishyahao in #14824
[release] Tag vllm-cpu with latest upon new version released by @khluu in #15193
Fixing Imprecise Type Annotations by @WrRan in #15192
[macOS] Ugrade pytorch to 2.6.0 by @linktohack in #15129
[Bugfix] Multi-video inference on LLaVA-Onevision by @DarkLight1337 in #15082
Add user forum to README by @hmellor in #15220
Fix env vars for running Ray distributed backend on GKE by @richardsliu in #15166
Replace misc issues with link to forum by @hmellor in #15226
[ci] feat: make the test_torchrun_example run with tp=2, external_dp=2 by @vermouth1992 in #15172
[Bugfix] fix V1 Engine crash while handling requests with duplicate request id by @JasonJ2021 in #15043
[V1] Add flag to disable cascade attention by @WoosukKwon in #15243
Enforce that TP > 1 is not supported for Mamba2 if Quantization is Enabled. by @fabianlim in #14617
[V1] Scheduler Refactoring [1/N] - Add Scheduler Interface by @WoosukKwon in #15250
[CI/Build] LoRA : make add_lora_test safer by @varun-sundar-rabindranath in #15181
Fix CUDA kernel index data type in vllm/csrc/quantization/fused_kernels/layernorm_utils.cuh +10 by @houseroad in #15159
[Misc] Clean up the BitsAndBytes arguments by @jeejeelee in #15140
[ROCM] Upgrade torch to 2.6 by @SageMoore in #15244
[Bugfix] Fix incorrect qwen2.5-vl attention mask pre-computation by @Isotr0py in #15200
Mention extra_body as a way top pass vLLM only parameters using the OpenAI client by @hmellor in #15240
[V1][TPU] Speed up top-k on TPU by using torch.topk by @hyeygit in #15242
[Bugfix] detect alibi and revert to FA2 by @tjohnson31415 in #15231
[Model] RE: Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessary Memory Copies by @cyang49 in #14857
[Docs] Trim the latest news in README by @WoosukKwon in #15261
[Misc] Better RayExecutor and multiprocessing compatibility by @comaniac in #14705
Add an example for reproducibility by @WoosukKwon in #15262
[Hardware][TPU] Add check for no additional graph compilation during runtime by @lsy323 in #14710
[V1] Enable Triton(ROCm) Attention backend for Nvidia GPUs by @Isotr0py in #14071
[Doc] Update LWS docs by @Edwinhr716 in #15163
[V1] Avoid redundant input processing in n>1 case by @njhill in #14985
[Feature] specify model in config.yaml by @wayzeng in #14855
[Bugfix] Add int8 torch dtype for KVCache by @shen-shanshan in #15260
[Misc] Add attention mask pre-computation optimization back to Qwen2.5-VL by @Isotr0py in #15273
[Bugfix] Fix incorrect resolving order for transformers fallback by @Isotr0py in #15279
[V1] Fix wrong import path of get_flash_attn_version by @lhtin in #15280
[Bugfix] Fix broken kernel test due to missing rename for v1 Triton backend by @Isotr0py in #15282
[Misc] Add cProfile helpers by @russellb in #15074
[v1] Refactor KVCacheConfig by @heheda12345 in #14079
[Bugfix][VLM] fix llava processor by @MengqingCao in #15285
Revert "[Feature] specify model in config.yaml (#14855)" by @DarkLight1337 in #15293
[TPU][V1] MHA Pallas backend by @NickLucche in #15288
[Build/CI] Fix env var typo by @russellb in #15305
[Misc] Increase RayDistributedExecutor RAY_CGRAPH_get_timeout by @ruisearch42 in #15301
[Bugfix][V0] Multi-sequence logprobs streaming edge case by @andylolu2 in #15259
[FEAT] [ROCm]: Add AITER RMS Norm (Layer Norm) Feature by @tjtanaa in #14959
[Doc] add load_format items in docs by @wwl2755 in #14804
[Bugfix] Fix torch.compile raise FileNotFoundError by @jeejeelee in #15278
[Bugfix] LoRA V0 - Fix case where max_num_seqs is between cudagraph capture sizes by @varun-sundar-rabindranath in #15308
[Model] Support Tele-FLM Model by @atone in #15023
[V1] Add disable-any-whitespace option support for xgrammar by @russellb in #15316
[BugFix][Typing] Fix Imprecise Type Annotations by @WrRan in #15208
Remove openvino support in favor of external plugin by @russellb in #15339
[doc] Add back previous news by @heheda12345 in #15331
Fix v1 supported oracle for worker-cls and worker-extension-cls by @hijkzzz in #15324
[V1][Usage] Refactor speculative decoding configuration and tests by @ShangmingCai in #14434
[ci/build] update torch nightly version for GH200 by @youkaichao in #15135
[ci/build] fix broken tests in LLM.collective_rpc by @youkaichao in #15350
[Misc] Add tuned R1 w8a8 and MoE configs for NVIDIA L20 by @DefTruth in #15322
[Bugfix] fix torch.compiled cache hash error by @DefTruth in #14953
[V1][Spec Decode] Respect prompt_lookup_max by @WoosukKwon in #15348
[V1][Spec Decode] Use better defaults for N-gram by @WoosukKwon in #15358
[Frontend] Support tool calling and reasoning parser by @WangErXiao in #14511
[Misc][Doc] Add note regarding loading generation_config by default by @ywang96 in #15281
[V1] Enable V1 Fp8 cache for FA3 in the oracle by @LucasWilkinson in #15191
[Fix] [torch.compile] Improve UUID system for custom passes by @ProExpertProg in #15249
Fix non-contiguous input passed to Marlin kernel by @Qubitium in #15319
[Misc] Upgrade BNB version by @jeejeelee in #15183
[Misc] Remove ignore_reinit_error for ray.init() by @ruisearch42 in #15373
[Bugfix][V1] Avoid importing PreTrainedModel by @HollowMan6 in #15366
[Misc] Update guided decoding logs to debug by @sfbemerk in #15310
Revert "[CI/Build] Use uv python for docker rather than ppa:deadsnakess/ppa (#13569)" by @simon-mo in #15377
[Kernel] allow non-contiguous input for marlin kernel by @jinzhen-lin in #14658
Fix zmq IPv6 URL format error by @russellb in #15341
[Bugfix] Fix chat template loading by @DarkLight1337 in #15143
[distributed] fix dp group by @youkaichao in #15355
[Core] Integrate fastsafetensors loader for loading model weights by @manish-sethi in #10647
[Core] Don't force uppercase for VLLM_LOGGING_LEVEL by @russellb in #15306
[V1][Minor] fix comments by @Chen-0210 in #15392
[MISC] Refine no available block debug msg by @yiliu30 in #15076
[V1] Aggregate chunked prompt logprobs in model runner by @njhill in #14875
[Hardware][Gaudi][Feature] Enable Dynamic MoE for Mixtral by @zhenwei-intel in #12303
[DOC] Add Kubernetes deployment guide with CPUs by @terrytangyuan in #14865
[Doc] Update docs on handling OOM by @DarkLight1337 in #15357
[V1][Perf] Simpler request output queues by @njhill in #15156
[BugFix][V1] Quick fix for min_tokens with multiple EOS by @njhill in #15407
[Hardware][TPU] Skip failed compilation test by @lsy323 in #15421
[Build] Cython compilation support fix by @gshtras in #14296
[ROCm][Kernel] MoE weights padding by @gshtras in #14454
[V1][Spec Decode] Enable spec decode for top-p & top-k sampling by @WoosukKwon in #15063
[Minor][Spec Decode] Remove compiled_softmax by @WoosukKwon in #15416
Add pipeline parallel support to TransformersModel by @hmellor in #12832
[Misc] Remove LoRA log by @jeejeelee in #15388
Revert "Fix non-contiguous input passed to Marlin kernel (#15319)" by @tlrmchlsmth in #15398
[Bugfix] Fixed the issue of not being able to input video and image simultaneously by @chaunceyjiang in #15387
[V1] guidance backend for structured output + auto fallback mode by @russellb in #14779
[V1][Spec Decode] Update target_logits in place for rejection sampling by @WoosukKwon in #15427

New Contributors

@maobaolong made their first contribution in #15003
@jovsa made their first contribution in #14713
@mickaelseznec made their first contribution in #14570
@mritterfigma made their first contribution in #14917
@billishyahao made their first contribution in #14824
@linktohack made their first contribution in #15129
@vermouth1992 made their first contribution in #15172
@JasonJ2021 made their first contribution in #15043
@hyeygit made their first contribution in #15242
@wayzeng made their first contribution in #14855
@lhtin made their first contribution in #15280
@wwl2755 made their first contribution in #14804
@atone made their first contribution in #15023
@hijkzzz made their first contribution in #15324
@sfbemerk made their first contribution in #15310
@manish-sethi made their first contribution in #10647
@yiliu30 made their first contribution in #15076

Full Changelog: v0.8.1...v0.8.2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.8.2

Highlights

V1 Engine

Features

Others

What's Changed

New Contributors

Contributors