🚀 Meet prima.cpp: our distributed implementation of llama.cpp, built to scale across more everyday home devices. #12852
-
Wow! Do you have any plans for a PR?
-
Does this mean I could have a farm of network-connected GPU servers and have llama.cpp make use of them?
-
How much does network speed affect it?
-
Thanks for sharing this work. I'd be very interested to see a performance comparison between prima.cpp and llama.cpp with multiple RPC servers.
-
Yes, maybe you can split the RPC server into two: one compiled with the GPU backend and one without, so it runs on the CPU. Then you set
I feel like that makes for a strange comparison then because, as you point out, it doesn't make sense to use it as an RPC server, and you may as well use llama.cpp directly on the large machine instead.
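For anyone who wants to try the two-server setup suggested above, here is a rough sketch using llama.cpp's RPC backend. The hostnames, ports, and model path are illustrative, and `-ngl 99` simply asks to offload as many layers as possible:

```sh
# Machine A (has a GPU): build rpc-server with the CUDA backend and start it
cmake -B build -DGGML_RPC=ON -DGGML_CUDA=ON
cmake --build build --config Release
build/bin/rpc-server --host 0.0.0.0 --port 50052

# Machine B (CPU only): same binary, built without a GPU backend
cmake -B build -DGGML_RPC=ON
cmake --build build --config Release
build/bin/rpc-server --host 0.0.0.0 --port 50052

# Client: split the model across both RPC backends
build/bin/llama-cli -m ./model.gguf -ngl 99 \
  --rpc 192.168.1.10:50052,192.168.1.11:50052 \
  -p "Hello"
```

Note that exposing rpc-server on 0.0.0.0 is only sensible on a trusted local network; the RPC protocol is unauthenticated.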
-
Don't worry about the concerns; it's a fork. Just keep experimenting, and hopefully you can come up with good ideas that can make it back into the mainline of llama.cpp. I use RPC and like it; if you can contribute anything towards improving the performance, that would be great! How are you reducing the latency over Wi-Fi? It was so bad for me that I had to hook up all the RPC machines to a 1-gigabit Ethernet switch.
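If you want to check whether the link itself is the bottleneck, a quick sanity check with standard tools (iperf3 has to be installed on both machines; the address is illustrative):

```sh
# Round-trip latency from the client to one RPC host
ping -c 10 192.168.1.10

# Throughput: start a server on the RPC host...
iperf3 -s

# ...then measure from the client for 10 seconds
iperf3 -c 192.168.1.10 -t 10
```

Token-by-token generation sends small tensors back and forth on every step, so round-trip latency tends to hurt more than raw bandwidth, which would match Wi-Fi feeling so much worse than a wired switch.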
-
Want to run larger LLMs with llama.cpp but keep hitting hardware limits? Have multiple devices lying around but aren't sure how to use them for collaborative inference? If so, try prima.cpp!
Prima.cpp is a distributed implementation of llama.cpp. It lets you pool multiple everyday home devices to run larger models, even 70B ones! It inherits great features from llama.cpp, such as mmap to avoid OOM, and adds new ones, like piped-ring parallelism, prefetching, and automatic workload distribution, to make distributed inference faster.
Give it a try and unlock the full power of your devices! 🖥️ 💻 📱