M-series Macs running llama.cpp in GPU-Accelerated Containers - New Benchmark Results #12985
-
That's because Flash Attention in Vulkan is currently only implemented for Nvidia GPUs with coopmat2; on any other device it falls back to the CPU, which slows things down a lot.
That is because Vulkan still accelerates large matrix multiplications that are slow on the CPU by transferring the data to the GPU and calculating there. Edit: Also, great writeup, very interesting. I'm happy that Vulkan is useful on Mac too, despite being slower than Metal.
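If you want to see the flash-attention fallback effect yourself, a minimal sketch with llama-bench is to run the same benchmark with -fa 0 and -fa 1 and compare (the model path below is just a placeholder):
# compare the Vulkan backend with flash attention disabled vs. enabled
./build/bin/llama-bench -m /models/model.gguf -ngl 99 -p 512 -n 128 -fa 0
./build/bin/llama-bench -m /models/model.gguf -ngl 99 -p 512 -n 128 -fa 1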
-
I have a base-model M1 MacBook Air that I decided to test. It only has 8 GB of RAM, so it can only run a 1B model.
Bare metal CPU
Bare metal GPU
local/fedora-krunkit CPU
local/fedora-krunkit GPU
RamaLama provides a prebuilt image based on
RamaLama CPU
RamaLama GPU
-
Yeah, it's weird that running on the CPU seems to give faster text generation. I noticed this before as well while using whisper.cpp. Looking at I think the reason
-
I was checking out the
-
I remembered there is a fork of Ollama with Vulkan support. It's an old build, so it doesn't support new models like Gemma 3. I had to patch the
-
It would be really interesting to get some performance comparisons with the same models running under Docker Model Runner.
-
I randomly found this script, which references a
-
Containers provide an important security perimeter for running less-trusted software. Until now, however, this has not really been feasible for Apple Silicon Macs and llama.cpp, because Mac GPUs couldn't be used inside containers. I revisited the current status of Podman, which offers macOS GPU remoting - here are my benchmark results and a how-to.
Technology: Podman enables Vulkan API calls inside its containers to be accelerated by forwarding them to the host's Apple M-series GPU.
TL;DR - Benefits/Caveats: GPU containers built this way are faster than pure CPU containers in Docker etc., but the overhead vs. running llama.cpp directly on the host is still significant - in my M2 Max based tests they are approx. 40% slower. The GPU container is still about 3x faster than a pure CPU container in prompt processing, and 25% faster in token generation. However, building these containers is a bit involved - hence the how-to below.
We need the following components:
You can find more technical details here:
Installing Podman,...
We need Podman, ... installed on macOS, e.g. with:
Then create the Podman machine as documented here.
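As a rough sketch of both steps (assuming Homebrew, and assuming the libkrun/krunkit machine provider is what supplies the GPU support - see the linked docs for the authoritative steps):
brew install podman
# assumption: select the libkrun (krunkit) provider so the machine gets GPU support;
# krunkit itself may need to be installed separately depending on your Podman version
export CONTAINERS_MACHINE_PROVIDER=libkrun
podman machine init
podman machine start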
You can test that your Podman installation is ready for GPU containers with the command
podman run --rm -it --device /dev/dri --name gpu-info quay.io/slopezpa/fedora-vgpu vulkaninfo | grep "GPU"
This should display the information about the GPU.
How to Build a GPU-accelerated Container Base Image
For our experiments we need a "Containerfile" which defines which OS and base software should be installed - e.g. fedora-krunkit.containerfile. We then use Podman to build a container image from it (best done in an empty directory).
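A minimal sketch of such a Containerfile and of the build step, assuming Fedora's stock mesa-vulkan-drivers package ships the Venus (virtio-gpu) Vulkan driver - the base-image tag and package set are illustrative and may need adjusting:
# fedora-krunkit.containerfile
FROM quay.io/fedora/fedora:41
# Vulkan loader, Mesa Vulkan drivers (incl. Venus/virtio-gpu) and vulkaninfo for testing
RUN dnf install -y vulkan-loader mesa-vulkan-drivers vulkan-tools && dnf clean all
Build the image from the directory containing the Containerfile:
podman build -t local/fedora-krunkit -f fedora-krunkit.containerfile .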
You can test this image with the command
podman run --rm -it --device /dev/dri --name gpu-info local/fedora-krunkit vulkaninfo | grep "GPU"
This should again display the information about the GPU.
Playing with llama.cpp and this Base Image
We can now use this image to start a container, install the toolset, git-clone llama.cpp, and build and run it. I suggest storing the large model files on the host and sharing that folder with the container.
podman run --rm -it --device /dev/dri --name llama-cpp-dev -v <your host's model folder>:/models local/fedora-krunkit /bin/bash
starts the container and puts you into its terminal. Be careful: because of --rm, exiting the container discards all changes you made. This is good for experimentation, since we can just start again from scratch if things go wrong; for permanent use we can later build a dedicated llama.cpp server or cli container with what we learned.
Installing the toolset, cloning llama.cpp, and building the server, cli, and bench:
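A sketch of how this can look inside the container (the Fedora package names for the Vulkan development bits are my assumption and may need adjusting):
# toolchain and Vulkan development packages
dnf install -y git cmake gcc-c++ ninja-build vulkan-headers vulkan-loader-devel glslc
# clone and build llama.cpp with the Vulkan backend (produces llama-server, llama-cli and llama-bench in build/bin)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)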
Playing with llama.cpp in that container - e.g. benchmarking - requires a model from the host's shared folder. You can then run a benchmark, e.g. with
./build/bin/llama-bench -p 512 -n 128 -ngl 99 -m <your model's .gguf-file>
Happy experimenting! My benchmark results are below.
llama.cpp Performance Results
All benchmarking was done on Apple Silicon Macs with llama-bench and llama-2 Q4_0, in order to be comparable to the llama.cpp performance-discussion results, but with the latest llama.cpp build: b43d89e (5143).
M2 Max Podman Container Performance
On an M2 Max Mac Studio, with 8 CPUs / 64 GB RAM allocated to the Podman machine.
Container, llama.cpp CPU-only:
Built for the CPU backend, without special parameters.
Container with llama.cpp built for Vulkan, with GPU remoting:
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Virtio-GPU Venus (Apple M2 Max) (venus) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none
Strangely, -fa with the Vulkan backend leads to much worse performance, especially in TG:
Also, the Vulkan build run with -ngl 0 is strangely faster than both the CPU-only container and the native CPU run:
For comparison, a "native" M2 Max:
With GPU (-ngl 99):
Without GPU (-ngl 0, it's similar to the container):
I wrote an earlier, now somewhat outdated, medium.com article and discussion item #8042 on this, but thought I should give it another try with the latest versions.
I hope this is useful information - comments and feedback very welcome!