
[V1] Feedback Thread #12568

Open
simon-mo opened this issue Jan 30, 2025 · 85 comments
@simon-mo
Collaborator

simon-mo commented Jan 30, 2025

Please leave comments here about your usage of V1. Does it work? Does it not work? Which features do you need in order to adopt it? Any bugs?

For bug reports, please file them separately and link the issues here.

For in-depth discussion, please feel free to join #sig-v1 in the vLLM Slack workspace.

@simon-mo simon-mo added the misc label Jan 30, 2025
@simon-mo simon-mo changed the title [V1] Feedback Threads [V1] Feedback Thread Jan 30, 2025
@simon-mo simon-mo removed the misc label Jan 30, 2025
@simon-mo simon-mo pinned this issue Jan 30, 2025
@wedobetter

wedobetter commented Jan 30, 2025

👍 I haven't done a proper benchmark, but V1 feels superior, i.e. higher throughput plus lower latency and TTFT.
The other thing I've noticed is that the logging has changed to Running: 1 reqs, Waiting: 0 reqs; it used to print stats such as tokens/s.

I've encountered a possible higher-memory-consumption issue, but I'm overall very pleased with the vLLM community's hard work on V1.
#12529

@m-harmonic

Does anyone know about this bug with n>1? Thanks
#12584

@robertgshaw2-redhat
Collaborator

Does anyone know about this bug with n>1? Thanks #12584

Thanks, we are aware and have some ongoing PRs for it.

#10980

@robertgshaw2-redhat
Collaborator

I've encountered a possible higher-memory-consumption issue, but I'm overall very pleased with the vLLM community's hard work on V1.

Logging is in progress. Current main has a lot more of it, and we will maintain compatibility with V0. Thanks!

@dchichkov

Quick feedback [VLLM_USE_V1=1]:

  • n > 1 would be nice

  • guided_grammar (or anything guided really) would be nice

@robertgshaw2-redhat
Collaborator

Quick feedback [VLLM_USE_V1=1]:

  • n > 1 would be nice
  • guided_grammar (or anything guided really) would be nice

Thanks, both are in progress
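For context, a minimal sketch of the two requested features as they are used against the OpenAI-compatible server today on the V0 engine; the model name, port, and schema below are illustrative, and guided_json is a vLLM-specific extra parameter:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# n > 1: request several completions for a single prompt.
completions = client.completions.create(
    model="model",
    prompt="Write a haiku about GPUs.",
    n=4,
    max_tokens=64,
)

# Guided decoding: constrain the output via vLLM's extra_body parameters
# (guided_json / guided_regex / guided_grammar).
structured = client.chat.completions.create(
    model="model",
    messages=[{"role": "user", "content": "Name a city as JSON."}],
    extra_body={
        "guided_json": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        }
    },
)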

@hibukipanim

Are logprobs outputs (and specifically prompt logprobs with echo=True) expected to work with the current V1 (0.7.0)?
Checking here before opening an issue with a reproduction.
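For reference, a minimal sketch of the kind of request being asked about, against the OpenAI-compatible completions endpoint (model name and port are illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# echo=True returns the prompt tokens in the response; combined with logprobs,
# this asks the server for prompt logprobs in addition to completion logprobs.
resp = client.completions.create(
    model="model",
    prompt="The capital of France is",
    max_tokens=1,
    echo=True,
    logprobs=1,
)
print(resp.choices[0].logprobs.token_logprobs)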

@akshay-loci

Maybe there is a better place to discuss this, but the implementation for models that use more than one extra modality is quite unintuitive. get_multimodal_embeddings() expects us to return a list or tensor whose length equals the number of multimodal items provided in the batch, and we then have to make unintuitive assumptions about how the output passed into get_input_embeddings() will look, because the batching used when calling the two functions is not the same. It would be much nicer if, for example, the input and output of get_multimodal_embeddings() were dicts keyed by modality.
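A hypothetical sketch of the interface being proposed; the class, encoder layers, and shapes are placeholders standing in for a real model, not vLLM's actual implementation:

from typing import Dict
import torch
import torch.nn as nn

class MultiModalModelSketch(nn.Module):
    def __init__(self, hidden: int = 16):
        super().__init__()
        # Placeholder "towers" standing in for real vision/audio encoders.
        self.vision_encoder = nn.Linear(8, hidden)
        self.audio_encoder = nn.Linear(4, hidden)

    def get_multimodal_embeddings(
        self, mm_inputs: Dict[str, torch.Tensor]
    ) -> Dict[str, torch.Tensor]:
        # Inputs and outputs are keyed by modality, so get_input_embeddings()
        # would not need to guess how items were batched across modalities.
        return {
            "image": self.vision_encoder(mm_inputs["image"]),
            "audio": self.audio_encoder(mm_inputs["audio"]),
        }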

@robertgshaw2-redhat
Collaborator

Are logprobs outputs (and specifically prompt logprobs with echo=True) expected to work with the current V1 (0.7.0)? Checking here before opening an issue with a reproduction.

Still in progress

@wedobetter

👍 I haven't done a proper benchmark, but V1 feels superior, i.e. higher throughput plus lower latency and TTFT. The other thing I've noticed is that the logging has changed to Running: 1 reqs, Waiting: 0 reqs; it used to print stats such as tokens/s.

I've encountered a possible higher-memory-consumption issue, but I'm overall very pleased with the vLLM community's hard work on V1. #12529

Thanks for fixing the metrics logs in 0.7.1!
Lack of pipeline parallelism in V1 is a show-stopper for production deployments. #11945

@Ouna-the-Dataweaver

Maybe I'm going insane, but with V1, Qwen 8B Instruct in fp8 just breaks: around 25% of generations are gibberish, with the same running code and everything. Do I need to file a bug report, or is this expected behaviour and I need some specific sampling-params setup for it to work in V1?

@FrederickVu

The V1 engine doesn't seem to support logits processors or min-p filtering. Issue #12678

@gmonair

gmonair commented Feb 3, 2025

Something is weird with the memory calculation in V1 with tensor parallelism. Here are two cases that I tested recently:

vllm 0.7.0 on 2x A6000:

Starting a 32b-awq model normally with --max-model-len 32768 --gpu-memory-utilization 0.98 --tensor-parallel 2 --max-num-batched-tokens 32768 --max-seq-len-to-capture 32768

Everything works as before; both GPUs get to ~44-46 GB usage.

Using VLLM_USE_V1=1 and the exact same parameters as above:

Both GPUs load up to ~24-25 GB, and usage slowly goes up as inference runs; I've seen it reach 32 GB on each GPU.

Updating to vllm 0.7.1 and running a 7b-awq model this time, I also noticed that when running the above command "normally" the logs show a maximum concurrency of 44x.

Using V1 I get:

INFO 02-02 23:26:19 kv_cache_utils.py:400] Maximum concurrency for 32768 tokens per request: 22.25x

And finally, with vllm 0.7.0 and 4x L4, loading a 32b-awq model with tp 4 works in "normal mode" but OOMs with V1.

@Xarbirus

Xarbirus commented Feb 3, 2025

I did a little experiment with DeepSeek-R1 on 8x H200 GPUs.

vLLM 0.7.0 showed the following results with benchmark_serving.py --backend openai --base-url http://0.0.0.0:8000 --dataset-name=random --model deepseek-ai/DeepSeek-R1

  • with VLLM_USE_V1=1 (with --request-rate 4)
Traffic request rate: 4.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|█████████████████████████████████████████████████████████████| 1000/1000 [07:53<00:00,  2.11it/s]
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  473.62    
Total input tokens:                      1024000   
Total generated tokens:                  119550    
Request throughput (req/s):              2.11      
Output token throughput (tok/s):         252.42    
Total Token throughput (tok/s):          2414.51   
---------------Time to First Token----------------
Mean TTFT (ms):                          100636.33 
Median TTFT (ms):                        103588.53 
P99 TTFT (ms):                           197277.97 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          177.82    
Median TPOT (ms):                        172.14    
P99 TPOT (ms):                           363.05    
---------------Inter-token Latency----------------
Mean ITL (ms):                           173.08    
Median ITL (ms):                         136.46    
P99 ITL (ms):                            575.30    
==================================================
  • without VLLM_USE_V1 (with --request-rate 4)
Traffic request rate: 4.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|█████████████████████████████████████████████████████████████| 1000/1000 [05:24<00:00,  3.08it/s]
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  324.29    
Total input tokens:                      1024000   
Total generated tokens:                  119163    
Request throughput (req/s):              3.08      
Output token throughput (tok/s):         367.46    
Total Token throughput (tok/s):          3525.12   
---------------Time to First Token----------------
Mean TTFT (ms):                          29022.37  
Median TTFT (ms):                        32492.50  
P99 TTFT (ms):                           54457.59  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          125.16    
Median TPOT (ms):                        119.91    
P99 TPOT (ms):                           411.21    
---------------Inter-token Latency----------------
Mean ITL (ms):                           120.20    
Median ITL (ms):                         76.78     
P99 ITL (ms):                            656.11    
==================================================

In general, vLLM without VLLM_USE_V1 looked more performant. I also tried V0 with --request-rate 10 and got:

Traffic request rate: 10.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|█████████████████████████████████████████████████████████████| 1000/1000 [05:16<00:00,  3.16it/s]
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  316.20    
Total input tokens:                      1024000   
Total generated tokens:                  119448    
Request throughput (req/s):              3.16      
Output token throughput (tok/s):         377.76    
Total Token throughput (tok/s):          3616.21   
---------------Time to First Token----------------
Mean TTFT (ms):                          100122.09 
Median TTFT (ms):                        98699.05  
P99 TTFT (ms):                           201732.11 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          139.61    
Median TPOT (ms):                        104.30    
P99 TPOT (ms):                           1276.91   
---------------Inter-token Latency----------------
Mean ITL (ms):                           105.90    
Median ITL (ms):                         76.35     
P99 ITL (ms):                            648.36    
==================================================

Throughput was still 2x lower than SGLang in the same benchmark. Today I updated vLLM to the new version (0.7.1) and decided to repeat the experiment, and the V0 results have become much better!

  • without VLLM_USE_V1 (with --request-rate 4)
Traffic request rate: 4.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|█████████████████████████████████████████████████████████████| 1000/1000 [04:29<00:00,  3.71it/s]
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  269.74    
Total input tokens:                      1024000   
Total generated tokens:                  119805    
Request throughput (req/s):              3.71      
Output token throughput (tok/s):         444.14    
Total Token throughput (tok/s):          4240.35   
---------------Time to First Token----------------
Mean TTFT (ms):                          368.78    
Median TTFT (ms):                        269.07    
P99 TTFT (ms):                           3826.70   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          124.95    
Median TPOT (ms):                        122.03    
P99 TPOT (ms):                           214.93    
---------------Inter-token Latency----------------
Mean ITL (ms):                           123.32    
Median ITL (ms):                         75.30     
P99 ITL (ms):                            583.77    
==================================================
  • without VLLM_USE_V1 (with --request-rate 10)
Traffic request rate: 10.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|█████████████████████████████████████████████████████████████| 1000/1000 [02:26<00:00,  6.83it/s]
============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  146.43    
Total input tokens:                      1024000   
Total generated tokens:                  119701    
Request throughput (req/s):              6.83      
Output token throughput (tok/s):         817.48    
Total Token throughput (tok/s):          7810.75   
---------------Time to First Token----------------
Mean TTFT (ms):                          14575.11  
Median TTFT (ms):                        13606.50  
P99 TTFT (ms):                           29954.96  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          297.01    
Median TPOT (ms):                        282.46    
P99 TPOT (ms):                           1393.69   
---------------Inter-token Latency----------------
Mean ITL (ms):                           262.67    
Median ITL (ms):                         132.89    
P99 ITL (ms):                            2840.40   
==================================================

But running vLLM with VLLM_USE_V1=1, I got the error TypeError: FlashAttentionImpl.__init__() got an unexpected keyword argument 'q_lora_rank', preceded by warnings like

`torch.compile` is turned on, but the model deepseek-ai/DeepSeek-R1 does not support it. Please open an issue on GitHub if you want it to be supported.

@bao231

bao231 commented Feb 4, 2025

V1 doesn't support T4; do you plan to support it?

@bao231

bao231 commented Feb 4, 2025

@simon-mo

@WoosukKwon
Collaborator

Hi @bao231, V1 does not support T4 or older-generation GPUs since the kernel libraries used in V1 (e.g., flash-attn) do not support them.

@bao231

bao231 commented Feb 4, 2025

Will V1 support other attention libraries? Do you have a plan for that? @WoosukKwon

@robertgshaw2-redhat
Collaborator

I did a little experiment with DeepSeek-R1 on 8x H200 GPUs. [...] In general, vLLM without VLLM_USE_V1 looked more performant. [...] But running vLLM with VLLM_USE_V1=1, I got the error TypeError: FlashAttentionImpl.__init__() got an unexpected keyword argument 'q_lora_rank'.

Thanks!

  • We are aware of the performance gap for DeepSeekV3 and are actively working on it. See [Perf] Mem align KV caches for CUDA devices (MLA perf improvement) #12676, which will resolve the gap. We hope to do a release with this change today.
  • DeepSeekV3 is not yet supported on V1 since it requires chunked prefill. We are actively working on chunked prefill for MLA and hope to have it complete this week!

@robertgshaw2-redhat
Collaborator

Maybe I'm going insane, but with V1, Qwen 8B Instruct in fp8 just breaks: around 25% of generations are gibberish, with the same running code and everything. Do I need to file a bug report, or is this expected behaviour and I need some specific sampling-params setup for it to work in V1?

Can you provide more detailed reproduction instructions?

cc @WoosukKwon
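For anyone hitting the same thing, a minimal sketch of the kind of reproduction that would help here, assuming the offline API; the model name, prompt, and fp8 setting are illustrative, not taken from the report:

import os

# Set before importing vllm so the engine picks up the flag.
os.environ["VLLM_USE_V1"] = "1"  # flip to "0" to compare against V0

from vllm import LLM, SamplingParams

# Deterministic sampling makes it easy to diff V0 vs. V1 outputs.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", quantization="fp8")
params = SamplingParams(temperature=0.0, seed=0, max_tokens=128)

outputs = llm.generate(["Summarize the plot of Hamlet in two sentences."], params)
print(outputs[0].outputs[0].text)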

@robertgshaw2-redhat
Collaborator

👍 I haven't done a proper benchmark, but V1 feels superior, i.e. higher throughput plus lower latency and TTFT. The other thing I've noticed is that the logging has changed to Running: 1 reqs, Waiting: 0 reqs; it used to print stats such as tokens/s.
I've encountered a possible higher-memory-consumption issue, but I'm overall very pleased with the vLLM community's hard work on V1. #12529

Thanks for fixing the metrics logs in 0.7.1! Lack of pipeline parallelism in V1 is a show-stopper for production deployments. #11945

Thanks. We are actively working on PP.

@robertgshaw2-redhat
Collaborator

Maybe there is a better place to discuss this, but the implementation for models that use more than one extra modality is quite unintuitive. get_multimodal_embeddings() expects us to return a list or tensor whose length equals the number of multimodal items provided in the batch, and we then have to make unintuitive assumptions about how the output passed into get_input_embeddings() will look, because the batching used when calling the two functions is not the same. It would be much nicer if, for example, the input and output of get_multimodal_embeddings() were dicts keyed by modality.

Check out #sig-multi-modality in our Slack! That is the best place for a discussion like this.

@robertgshaw2-redhat
Collaborator

Something is weird with the memory calculation in V1 with tensor parallelism. Here are two cases that I tested recently:

vllm 0.7.0 on 2x A6000:

Starting a 32b-awq model normally with --max-model-len 32768 --gpu-memory-utilization 0.98 --tensor-parallel 2 --max-num-batched-tokens 32768 --max-seq-len-to-capture 32768

Everything works as before; both GPUs get to ~44-46 GB usage.

Using VLLM_USE_V1=1 and the exact same parameters as above:

Both GPUs load up to ~24-25 GB, and usage slowly goes up as inference runs; I've seen it reach 32 GB on each GPU.

Updating to vllm 0.7.1 and running a 7b-awq model this time, I also noticed that when running the above command "normally" the logs show a maximum concurrency of 44x.

Using V1 I get:

INFO 02-02 23:26:19 kv_cache_utils.py:400] Maximum concurrency for 32768 tokens per request: 22.25x

And finally, with vllm 0.7.0 and 4x L4, loading a 32b-awq model with tp 4 works in "normal mode" but OOMs with V1.

It's pretty hard to follow what you are seeing. Please attach:

  • launch command
  • logs

Thanks!

@gmonair

gmonair commented Feb 4, 2025

It's pretty hard to follow what you are seeing. Please attach:

* launch command

* logs

Hi, please see vllm_output(27)-OOM.log for the OOM on 4x L4 and vllm_output(28)-WORKS.log to compare. The only difference between them is the V1 flag.

Launch command

import os
import subprocess

my_env = os.environ.copy()
my_env["VLLM_USE_V1"] = "0"  # set to "1" to reproduce the OOM run

# Log file capturing the server's stdout/stderr
log_file = open("vllm_output.log", "w")

# Launch the server as a background task
command = [
    "python",
    "-m",
    "vllm.scripts",
    "serve",
    "/kaggle/input/qwen25/transformers/r1-32b-awq/1",
    "--served-model-name", "model",
    "--tensor_parallel_size", "4",
    "--gpu_memory_utilization", "0.95",
    "--port", "9901",
    "--max-num-batched-tokens", "32768",
    "--max-seq-len-to-capture", "32768",
    "--max-model-len", "32768",
    "--enable_prefix_caching",
]

process = subprocess.Popen(command, stdout=log_file, stderr=log_file, env=my_env)

vllm_output(28)-WORKS.log
vllm_output(27)-OOM.log

@njhill njhill added the v1 label Feb 4, 2025
@DefinitlyEvil

[ngram] Speculative decoding easily produces repeated text for multilingual tasks.

Speculative decoding uses the larger model as "error correction" for the tiny model, so repetitive text would most likely be caused by your larger model; speculative decoding should (in theory) only accelerate generation rather than change the generated text.

But the larger model on its own works well. That's weird. And it often generates repetitive text when there are multiple requests at the same time.

Please also try the same seed, or zero temperature, without speculative decoding to see if the problem persists.
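A minimal sketch of that A/B check with the offline API; the model names and the ngram speculative settings below are illustrative (V0-era arguments) and may differ from your setup or vLLM version:

from vllm import LLM, SamplingParams

prompts = ["Translate to French: The weather is nice today."]
# Deterministic sampling so the two runs can be diffed directly.
params = SamplingParams(temperature=0.0, seed=0, max_tokens=128)

# Baseline: no speculative decoding.
baseline = LLM(model="Qwen/Qwen2.5-14B-Instruct")
print(baseline.generate(prompts, params)[0].outputs[0].text)

# Same model with ngram speculative decoding enabled.
spec = LLM(
    model="Qwen/Qwen2.5-14B-Instruct",
    speculative_model="[ngram]",
    num_speculative_tokens=5,
    ngram_prompt_lookup_max=4,
)
print(spec.generate(prompts, params)[0].outputs[0].text)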

@JaheimLee

[ngram] Speculative decoding easily produces repeated text for multilingual tasks.

Speculative decoding uses the larger model as "error correction" for the tiny model, so repetitive text would most likely be caused by your larger model; speculative decoding should (in theory) only accelerate generation rather than change the generated text.

But the larger model on its own works well. That's weird. And it often generates repetitive text when there are multiple requests at the same time.

Please also try the same seed, or zero temperature, without speculative decoding to see if the problem persists.

It's related to this PR.

@LiuXiaoxuanPKU
Collaborator

[ngram] Speculative decoding easily produces repeated text for multilingual tasks.

Speculative decoding uses the larger model as "error correction" for the tiny model, so repetitive text would most likely be caused by your larger model; speculative decoding should (in theory) only accelerate generation rather than change the generated text.

But the larger model on its own works well. That's weird. And it often generates repetitive text when there are multiple requests at the same time.

Please also try the same seed, or zero temperature, without speculative decoding to see if the problem persists.

It's related to this PR.

Could you try main again and see if it is fixed? Thanks!


@K-e-t-i

K-e-t-i commented Mar 14, 2025

Hi, are you already working on resolving the size mismatch issue when loading the MixtralForCausalLM GGUF model?
Details: #14423

@MichoChan

Hi, I'm hitting this bug: #14915

@Happy2Git

Happy2Git commented Mar 18, 2025

#15046 When trying one of the listed supported models with architecture StableLMForCausalLM (stabilityai/stablelm-base-alpha-7b-v2), I got the error StableLMAlphaForCausalLM has no vLLM implementation.

@jifa513

jifa513 commented Mar 19, 2025

Is it possible to disable prefix caching in V1?

@saattrupdan

saattrupdan commented Mar 19, 2025

Hi. V1 only supports the XGrammar backend for structured generation, but XGrammar does not support as many JSON schemas as Outlines. Specifically, I'm using conlist(str, max_length=5) in my Pydantic class, which doesn't work with XGrammar. Outlines supports this just fine, but it isn't permitted in V1.
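For concreteness, a minimal sketch of the schema in question (Pydantic v2 syntax; the model and field names are illustrative):

from pydantic import BaseModel, conlist

class Tags(BaseModel):
    # A list of at most five strings; this compiles to a JSON schema with
    # "maxItems": 5, which is the constraint at issue here.
    tags: conlist(str, max_length=5)

print(Tags.model_json_schema())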

@robertgshaw2-redhat
Collaborator

Is it possible to disable prefix caching in V1?

You can set --no-enable-prefix-caching. However, there is no overhead from prefix caching, so we suggest keeping it enabled.
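For reference, the flag goes on the serve command line; the model name below is illustrative:

vllm serve Qwen/Qwen2.5-1.5B-Instruct --no-enable-prefix-caching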

@robertgshaw2-redhat
Collaborator

Hi. V1 only supports the XGrammar backend for structured generation, but XGrammar does not support as many JSON schemas as Outlines. Specifically, I'm using conlist(str, max_length=5) in my Pydantic class, which doesn't work with XGrammar. Outlines supports this just fine, but it isn't permitted in V1.

We are aware and are close to finishing the other structured-generation backends in V1. Ideally EOW.

@robertgshaw2-redhat
Collaborator

I rely on min_tokens for some benchmarks I want to run. Will V1 eventually support this?

I would suggest using --ignore-eos rather than min_tokens for benchmarking. This will let you control the exact length of the generations. That said, are you having an issue with min_tokens? It should be working.
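A minimal sketch of that suggestion with the offline API (the model name is illustrative): ignore_eos makes every request generate exactly max_tokens tokens, giving fixed-length outputs for benchmarking without relying on min_tokens.

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")

# Every generation runs to exactly 256 tokens, ignoring EOS.
params = SamplingParams(ignore_eos=True, max_tokens=256)
outputs = llm.generate(["Benchmark prompt"], params)
print(len(outputs[0].outputs[0].token_ids))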

@sethkimmel3
Contributor

I have to disable the V1 engine due to this small restriction: #15252. I'll x-post in the Slack for visibility.

@DanlinJia

I have documented the results of my experiments comparing the throughput of V0 and V1 in a newly created issue. The findings suggest that when GPU memory is fully utilized, preemption occurs and V1 fails to show a significant throughput advantage over V0. Can anyone explain why this happens?

@oyerli

oyerli commented Mar 21, 2025

We've encountered a critical memory leak when using the V1 engine for image inference — system RAM usage exceeds 200 GB over time. Full bug report with reproduction steps and details can be found here: #15294.

@vrdn-23
Contributor

vrdn-23 commented Mar 23, 2025

Just wanted to leave a quick comment that I think the default value of --max-num-seqs should be kept at the V0 default rather than raised to 1024 (or at least given a much milder bump), because the change is causing a lot of confusion for folks who never set that value and are now running into OOM issues with the new default just by updating versions.

@robertgshaw2-redhat
Collaborator

We've encountered a critical memory leak when using the V1 engine for image inference — system RAM usage exceeds 200 GB over time. Full bug report with reproduction steps and details can be found here: #15294.

Thanks. We have resolved this and will do a hotfix.

@robertgshaw2-redhat
Collaborator

I have to disable the V1 engine due to this small restriction: #15252. I'll x-post in the Slack for visibility.

Thanks!

@robertgshaw2-redhat
Collaborator

Just wanted to leave a quick comment that I think the default value of --max-num-seqs should be kept at the V0 default rather than raised to 1024 (or at least given a much milder bump), because the change is causing a lot of confusion for folks who never set that value and are now running into OOM issues with the new default just by updating versions.

Can you share more about what is causing the OOM? Is this during profiling? The value of --max-num-seqs should not cause OOM at runtime. It would be helpful if you could share more of your use case so we can check it out.

@vrdn-23
Contributor

vrdn-23 commented Mar 24, 2025

@robertgshaw2-redhat I'm not sure whether this is during the profiling step, but essentially our deployments started running into these errors during engine start-up for the meta-llama/Llama-Guard-3-8B model after upgrading to v0.8.1.

INFO 03-24 04:03:54 [loader.py:429] Loading weights took 28.34 seconds
INFO 03-24 04:03:55 [gpu_model_runner.py:1176] Model loading took 14.9888 GB and 28.719376 seconds
INFO 03-24 04:04:03 [backends.py:409] Using cache directory: /root/.cache/vllm/torch_compile_cache/855dafa5cf/rank_0_0 for vLLM's torch.compile
INFO 03-24 04:04:03 [backends.py:419] Dynamo bytecode transform time: 8.79 s
INFO 03-24 04:04:07 [backends.py:132] Cache the graph of shape None for later use
INFO 03-24 04:04:37 [backends.py:144] Compiling a graph for general shape takes 33.10 s
INFO 03-24 04:04:49 [monitor.py:33] torch.compile takes 41.89 s in total
ERROR 03-24 04:04:50 [core.py:340] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-24 04:04:50 [core.py:340]   File "/app/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 332, in run_engine_core
ERROR 03-24 04:04:50 [core.py:340]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 03-24 04:04:50 [core.py:340]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-24 04:04:50 [core.py:340]   File "/app/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 287, in __init__
ERROR 03-24 04:04:50 [core.py:340]     super().__init__(vllm_config, executor_class, log_stats)
ERROR 03-24 04:04:50 [core.py:340]   File "/app/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 62, in __init__
ERROR 03-24 04:04:50 [core.py:340]     num_gpu_blocks, num_cpu_blocks = self._initialize_kv_caches(
ERROR 03-24 04:04:50 [core.py:340]                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-24 04:04:50 [core.py:340]   File "/app/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 124, in _initialize_kv_caches
ERROR 03-24 04:04:50 [core.py:340]     kv_cache_configs = get_kv_cache_configs(vllm_config, kv_cache_specs,
ERROR 03-24 04:04:50 [core.py:340]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-24 04:04:50 [core.py:340]   File "/app/.venv/lib/python3.12/site-packages/vllm/v1/core/kv_cache_utils.py", line 576, in get_kv_cache_configs
ERROR 03-24 04:04:50 [core.py:340]     check_enough_kv_cache_memory(vllm_config, kv_cache_spec,
ERROR 03-24 04:04:50 [core.py:340]   File "/app/.venv/lib/python3.12/site-packages/vllm/v1/core/kv_cache_utils.py", line 468, in check_enough_kv_cache_memory
ERROR 03-24 04:04:50 [core.py:340]     raise ValueError("No available memory for the cache blocks. "
ERROR 03-24 04:04:50 [core.py:340] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
ERROR 03-24 04:04:50 [core.py:340]
CRITICAL 03-24 04:04:50 [core_client.py:269] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.

We do not see the error if we manually set max_num_seqs to 128, which is something I realized was the problem after looking at this issue (#14992 (comment)).
Let me know if you would like any additional info.
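For anyone hitting the same start-up failure, a sketch of the workaround described above (the model name is from this report; 128 is simply the value the reporter used, not an official recommendation):

vllm serve meta-llama/Llama-Guard-3-8B --max-num-seqs 128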

@rajveerb
Contributor

Is there a temporary way to work around the max_num_seqs default in V1?

I'm running into the same issue as #14992 with v0.7.2.

@Ucag

Ucag commented Mar 26, 2025

OOM issue after upgrading to v0.8. The same configuration worked on the preceding vLLM version. I'm deploying a 72B AWQ model on 8x 4090, which works fine with a 128k context length. Things go wrong after upgrading to the latest version (0.8.2): no matter what I set max_model_len or max_num_seqs to, there is always an OOM. I had to disable V1 with VLLM_USE_V1=0.

I even set max_num_seqs to 1 and max_model_len to 10240 (10k), and the OOM still occurs.

@win9killhuaxiong

When I use 2 nodes of 8x H20 to run DeepSeek-R1 with vLLM 0.8.2, I get a terrible error. Please help!

2025-03-28 07:55:36,878 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 7.216.55.218:6379...
2025-03-28 07:55:36,888 INFO worker.py:1843 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265
ERROR 03-28 07:55:36 [core.py:343] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-28 07:55:36 [core.py:343] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 335, in run_engine_core
ERROR 03-28 07:55:36 [core.py:343] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 03-28 07:55:36 [core.py:343] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 290, in init
ERROR 03-28 07:55:36 [core.py:343] super().init(vllm_config, executor_class, log_stats)
ERROR 03-28 07:55:36 [core.py:343] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 60, in init
ERROR 03-28 07:55:36 [core.py:343] self.model_executor = executor_class(vllm_config)
ERROR 03-28 07:55:36 [core.py:343] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 271, in init
ERROR 03-28 07:55:36 [core.py:343] super().init(*args, **kwargs)
ERROR 03-28 07:55:36 [core.py:343] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 52, in init
ERROR 03-28 07:55:36 [core.py:343] self._init_executor()
ERROR 03-28 07:55:36 [core.py:343] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_distributed_executor.py", line 105, in _init_executor
ERROR 03-28 07:55:36 [core.py:343] initialize_ray_cluster(self.parallel_config)
ERROR 03-28 07:55:36 [core.py:343] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_utils.py", line 299, in initialize_ray_cluster
ERROR 03-28 07:55:36 [core.py:343] ray.init(address=ray_address)
ERROR 03-28 07:55:36 [core.py:343] File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
ERROR 03-28 07:55:36 [core.py:343] return func(*args, **kwargs)
ERROR 03-28 07:55:36 [core.py:343] File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 1854, in init
ERROR 03-28 07:55:36 [core.py:343] connect(
ERROR 03-28 07:55:36 [core.py:343] File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2325, in connect
ERROR 03-28 07:55:36 [core.py:343] faulthandler.enable(all_threads=False)
ERROR 03-28 07:55:36 [core.py:343] OSError: [Errno 12] Cannot allocate memory
ERROR 03-28 07:55:36 [core.py:343]
INFO 03-28 07:55:36 [ray_distributed_executor.py:127] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
CRITICAL 03-28 07:55:36 [core_client.py:269] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed

@hmellor
Member

hmellor commented Mar 28, 2025

@win9killhuaxiong could this be a host OOM? Can you monitor your host memory when this error is thrown?

@win9killhuaxiong

@win9killhuaxiong could this be a host OOM? Can you monitor your host memory when this error is thrown?

It's not a host OOM; when I start the service, host memory doesn't change. Maybe it's a Ray bug.

@lbeisteiner

Hi team! I've encountered another error when using ngram-speculative decoding on v1, see #16058. Thanks a lot for your help on my previous issue #13673!

@mkgs210

mkgs210 commented Apr 4, 2025

Model: RefalMachine/RuadaptQwen2.5-1.5B-instruct
vLLM version: 0.8.2
Command: VLLM_USE_V1=0 vllm serve RefalMachine/RuadaptQwen2.5-1.5B-instruct --device cuda --gpu-memory-utilization 0.98 --enforce-eager --quantization fp8 --port 8002, and also without eager mode, without quantization, and with V1
Testing was done on one message at a time.

Image

I measured the Qwen2.5-1.5B-instruct model and was extremely surprised by the results. Right now I only see the point of using V1 for structured output at full precision; in the other cases the results are worse, or about the same.
I am also extremely disappointed with the speed of the model in fp8, which dropped 4-8x in V1, and without --enforce-eager the model runs even slower than with it.

@s-banach

s-banach commented Apr 8, 2025

V1 seems consistently a lot slower with speculative decoding compared to the old engine. Using Qwen 14B + 1.5B.

@dtransposed

dtransposed commented Apr 10, 2025

Hey team!

NotImplementedError: VLLM_USE_V1=1 is not supported with --task classify

Would this be something on the immediate roadmap?

Thanks for your hard work! V1, in general, looks very promising and more hackable than the slightly over-bloated V0!

Edit: I've just learned that the PR is already there: #16188
