-
I don't know why, but I figured out that I can force an exact split by using / instead of , in --tensor-split to specify exactly how many layers to put on each GPU.
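For concreteness, here is a sketch of the invocation this seems to describe, assuming a recent llama.cpp build; the model path and per-GPU counts are placeholders, chosen so that 9/8/9/8/8/8 sums to the 50 offloaded layers in the log below:

    # hypothetical example: '/'-separated --tensor-split values to pin per-GPU layer counts
    ./llama-cli -m model.gguf -ngl 50 --tensor-split 9/8/9/8/8/8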
-
load_tensors: offloaded 50/62 layers to GPU
load_tensors: CPU_Mapped model buffer size = 31952.61 MiB
load_tensors: CUDA0 model buffer size = 17656.55 MiB
load_tensors: CUDA1 model buffer size = 14125.24 MiB
load_tensors: CUDA2 model buffer size = 17656.55 MiB
load_tensors: CUDA3 model buffer size = 14125.24 MiB
load_tensors: CUDA4 model buffer size = 17656.55 MiB
load_tensors: CUDA5 model buffer size = 14125.24 MiB
The uneven distribution causes an OOM when the KV cache is loaded, even though it's clear we would have enough memory if the layers were distributed evenly.
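For scale, an even split of the same total would be (3 × 17656.55 + 3 × 14125.24) / 6 ≈ 15891 MiB per GPU, roughly 1.7 GiB less than the three fullest cards currently hold, which is headroom that would then be available for the KV cache. Applying the workaround from the first comment, a hypothetical invocation that evens things out (same placeholder caveats; 9/9/8/8/8/8 sums to the 50 offloaded layers):

    ./llama-cli -m model.gguf -ngl 50 --tensor-split 9/9/8/8/8/8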