Is there any KV cache movement when using PP/TP? #13653
leizhenyuan announced in Q&A · 1 comment · 1 reply
-
Your understanding is correct. The API you found is used for prefill disaggregation; in that case we transfer the KV cache from a prefill worker to a decode worker.
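As a rough illustration, here is a minimal sketch of that prefill-disaggregation transfer using plain `torch.distributed` point-to-point ops. This is not vLLM's actual KV-transfer connector; the function names (`send_kv_cache`, `recv_kv_cache`) and the per-layer `(K, V)` layout are assumptions for the example.

```python
# Hypothetical sketch of prefill disaggregation, NOT vLLM's connector API.
# The prefill worker computes the prompt's KV cache and ships it to the decode
# worker, which receives it into preallocated buffers and then decodes as usual.
import torch
import torch.distributed as dist


def send_kv_cache(kv_cache: list[tuple[torch.Tensor, torch.Tensor]], decode_rank: int) -> None:
    """Run on the prefill worker once the prompt has been prefilled."""
    for layer_k, layer_v in kv_cache:  # one (K, V) tensor pair per layer
        dist.send(layer_k.contiguous(), dst=decode_rank)
        dist.send(layer_v.contiguous(), dst=decode_rank)


def recv_kv_cache(kv_cache: list[tuple[torch.Tensor, torch.Tensor]], prefill_rank: int) -> None:
    """Run on the decode worker; buffers must match the sender's shapes and dtypes."""
    for layer_k, layer_v in kv_cache:
        dist.recv(layer_k, src=prefill_rank)
        dist.recv(layer_v, src=prefill_rank)
```

Within a single PP/TP deployment (no disaggregation), no such transfer is needed: the KV cache stays on the worker that produced it.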
-
I am not familiar with distributed inference. Below is my understanding:
For pipeline parallelism, we only need to transfer intermediate tensors between nodes, so there is no KV cache communication.
For tensor parallelism, different heads are allocated to different devices and each head only needs to compute its own output, so there is also no KV cache communication.
But I also found need_recv_kvcache being checked when the worker is a driver worker, so I think my understanding above must be wrong. Can anyone answer this question? Thanks.
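To make the tensor-parallel case concrete, here is a minimal sketch in plain PyTorch (not vLLM internals; `tp_attention_step` and the tensor shapes are made up for illustration). Each rank caches keys/values only for its own head shard and computes attention locally; the only cross-rank traffic is the reduction of the attention output.

```python
# Sketch of why TP needs no KV cache movement: the KV cache is sharded by head,
# so every rank attends over its own cached keys/values. Assumes the process
# group is already initialized and each rank holds `local_heads` heads.
import torch
import torch.distributed as dist


def tp_attention_step(q_local, k_local, v_local, kv_cache_local):
    # q_local/k_local/v_local: [batch, local_heads, head_dim] for this rank's heads
    # kv_cache_local: {"k": [batch, local_heads, seq, head_dim], "v": ...}
    kv_cache_local["k"] = torch.cat([kv_cache_local["k"], k_local.unsqueeze(2)], dim=2)
    kv_cache_local["v"] = torch.cat([kv_cache_local["v"], v_local.unsqueeze(2)], dim=2)

    # Attention over the locally cached keys/values -- no cross-rank KV traffic.
    scores = torch.einsum("bhd,bhsd->bhs", q_local, kv_cache_local["k"])
    probs = torch.softmax(scores / q_local.shape[-1] ** 0.5, dim=-1)
    out_local = torch.einsum("bhs,bhsd->bhd", probs, kv_cache_local["v"])

    # The only communication is the all-reduce of the attention output
    # (in a real model this happens after the row-parallel output projection).
    dist.all_reduce(out_local, op=dist.ReduceOp.SUM)
    return out_local
```

Pipeline parallelism is analogous: each stage keeps the KV cache for its own layers and only the hidden states cross stage boundaries.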