🚀 Meet prima.cpp: our distributed implementation of llama.cpp, built to scale across more everyday home devices. #12852
-
Wow! Do you have any plans for a PR?
-
Does this mean I could have a farm of network-connected GPU servers and have llama.cpp make use of them?
-
How much does network speed affect it?
-
Thanks for sharing this work. I'd be very interested to see a performance comparison between prima.cpp and llama.cpp with multiple RPC servers.
-
Yes, maybe you can split the RPC server into two: one compiled with the GPU backend and one without, so it runs on the CPU. Then you set
I feel like that makes for a strange comparison then because, as you point out, it doesn't make sense to use it as an RPC server, and you may as well use llama.cpp directly on the large machine instead.
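For anyone who wants to try the two-server setup suggested above, here is a rough sketch using llama.cpp's RPC backend. The hostnames, ports, and model path are illustrative, and `-ngl 99` simply asks to offload as many layers as possible:

```sh
# Machine A (has a GPU): build rpc-server with the CUDA backend and start it
cmake -B build -DGGML_RPC=ON -DGGML_CUDA=ON
cmake --build build --config Release
build/bin/rpc-server --host 0.0.0.0 --port 50052

# Machine B (CPU only): same binary, built without a GPU backend
cmake -B build -DGGML_RPC=ON
cmake --build build --config Release
build/bin/rpc-server --host 0.0.0.0 --port 50052

# Client: split the model across both RPC backends
build/bin/llama-cli -m ./model.gguf -ngl 99 \
  --rpc 192.168.1.10:50052,192.168.1.11:50052 \
  -p "Hello"
```

Note that exposing rpc-server on 0.0.0.0 is only sensible on a trusted local network; the RPC protocol is unauthenticated.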
-
Don't worry about the concerns; it's a fork. Just keep experimenting, and hopefully you can come up with good ideas that can make it back into the mainline of llama.cpp. I use RPC and like it; if you can contribute anything towards improving the performance, that would be great! How are you reducing the latency over Wi-Fi? It was so bad for me that I had to hook up all the RPC machines to a 1-gigabit Ethernet switch.
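If you want to check whether the link itself is the bottleneck, a quick sanity check with standard tools (iperf3 has to be installed on both machines; the address is illustrative):

```sh
# Round-trip latency from the client to one RPC host
ping -c 10 192.168.1.10

# Throughput: start a server on the RPC host...
iperf3 -s

# ...then measure from the client for 10 seconds
iperf3 -c 192.168.1.10 -t 10
```

Token-by-token generation sends small tensors back and forth on every step, so round-trip latency tends to hurt more than raw bandwidth, which would match Wi-Fi feeling so much worse than a wired switch.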
-
Want to run larger LLMs with llama.cpp but keep hitting hardware limits? Have multiple devices lying around but aren't sure how to use them for collaborative inference? If so, try prima.cpp!
Prima.cpp is a distributed implementation of llama.cpp. It lets you pool multiple everyday home devices to run larger models, even 70B ones! It inherits great features from llama.cpp, such as mmap to avoid OOM, and adds new ones, like piped-ring parallelism, prefetching, and automatic workload distribution, to make distributed inference faster.
Give it a try and unlock the full power of your devices! 🖥️ 💻 📱