Hi all,

I am trying to understand which version of Flash Attention llama.cpp uses. I am asking because in `/workspace/llama.cpp/tests/test-backend-ops.cpp` I see:

So it looks to me like we can have `nb != kv`. My understanding of the algorithm is that Q, K, and V are all R^(n×d) matrices. Could you point me to the version of the algorithm you are using?
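For context, here is a minimal NumPy sketch (not from the llama.cpp code, and with illustrative shape values I chose myself) of plain scaled dot-product attention where the number of query rows differs from the number of key/value rows, which is the `nb != kv` situation in question, e.g. during incremental decoding against a KV cache:

```python
import numpy as np

# Illustrative shapes (assumptions, not values from the llama.cpp tests):
# nb = number of query rows, kv = number of key/value rows, d = head dimension.
d, nb, kv = 64, 4, 128

rng = np.random.default_rng(0)
Q = rng.standard_normal((nb, d))   # queries: R^(nb x d)
K = rng.standard_normal((kv, d))   # keys:    R^(kv x d)
V = rng.standard_normal((kv, d))   # values:  R^(kv x d)

# Plain scaled dot-product attention; Flash Attention computes the same
# result blockwise without materializing the full (nb x kv) score matrix.
scores = Q @ K.T / np.sqrt(d)                        # (nb, kv)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
out = weights @ V                                    # (nb, d)

print(out.shape)  # (4, 64)
```

The output always has one row per query, so nothing in the math requires `nb == kv`; the R^(n×d) presentation in the paper is just the self-attention special case.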
Thanks,
Giuseppe