M-series Macs running llama.cpp in GPU-Accelerated Containers - New Benchmark Results #12985
-
That's because Flash Attention in Vulkan is currently only implemented for Nvidia GPUs with coopmat2; on any other device it falls back to the CPU, which slows things down a lot.
That is because Vulkan still accelerates large matrix multiplications that are slow on the CPU by transferring the data to the GPU and calculating there. Edit: Also, great writeup, very interesting. I'm happy that Vulkan is useful on Mac too, despite being slower than Metal.
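If you want to see the flash-attention fallback effect yourself, a minimal sketch with llama-bench is to run the same benchmark with -fa 0 and -fa 1 and compare (the model path below is just a placeholder):
# compare the Vulkan backend with flash attention disabled vs. enabled
./build/bin/llama-bench -m /models/model.gguf -ngl 99 -p 512 -n 128 -fa 0
./build/bin/llama-bench -m /models/model.gguf -ngl 99 -p 512 -n 128 -fa 1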
-
I have a base-model M1 MacBook Air that I decided to test. It only has 8 GB of RAM, so it can only run a 1B model.
Bare metal CPU
Bare metal GPU
local/fedora-krunkit CPU
local/fedora-krunkit GPU
RamaLama provides a prebuilt image based on
RamaLama CPU
RamaLama GPU
-
Yeah, it's weird that running on the CPU seems to give faster text generation. I noticed this before as well while using whisper.cpp. Looking at I think the reason
-
I was checking out the
-
I remembered there is a fork of Ollama with Vulkan support. It's an old build, so it doesn't support new models like Gemma 3. I had to patch the
-
It would be really interesting to get some performance comparisons with the same models running under Docker Model Runner.
-
I randomly found this script, which references a
-
Containers provide an important security perimeter for running less-trusted software. Until now, however, this has not really been feasible for Apple Silicon Macs and llama.cpp, because Mac GPUs couldn't be used inside containers. I revisited the current status of Podman, which offers macOS GPU remoting - here are my benchmark results and a how-to.
Technology: Podman enables Vulkan API calls inside its containers to be accelerated by forwarding them to the host's Apple M-series GPU.
TL;DR - Benefits/Caveats: GPU containers built this way are faster than pure CPU containers in Docker etc., but the overhead vs. running llama.cpp directly on the host is still significant - in my M2 Max based tests they are approx. 40% slower. The GPU container is still about 3x faster than a pure CPU container in prompt processing, and 25% faster in token generation. However, building these containers is a bit involved - hence the how-to below.
We need the following components:
You can find more technical details here:
Installing Podman,...
We need Podman, ... installed on macOS, e.g. with:
Then create the Podman machine as documented here.
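As a rough sketch of both steps (assuming Homebrew, and assuming the libkrun/krunkit machine provider is what supplies the GPU support - see the linked docs for the authoritative steps):
brew install podman
# assumption: select the libkrun (krunkit) provider so the machine gets GPU support;
# krunkit itself may need to be installed separately depending on your Podman version
export CONTAINERS_MACHINE_PROVIDER=libkrun
podman machine init
podman machine start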
You can test that your Podman installation is ready for GPU containers with the command
podman run --rm -it --device /dev/dri --name gpu-info quay.io/slopezpa/fedora-vgpu vulkaninfo | grep "GPU"
This should display the information about the GPU.
How to Build a GPU-accelerated Container Base Image
For our experiments we need a "Containerfile" which defines which OS and base software should be installed - e.g. fedora-krunkit.containerfile. We then use Podman to build a container image from it (best done in an empty directory).
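A minimal sketch of such a Containerfile and of the build step, assuming Fedora's stock mesa-vulkan-drivers package ships the Venus (virtio-gpu) Vulkan driver - the base-image tag and package set are illustrative and may need adjusting:
# fedora-krunkit.containerfile
FROM quay.io/fedora/fedora:41
# Vulkan loader, Mesa Vulkan drivers (incl. Venus/virtio-gpu) and vulkaninfo for testing
RUN dnf install -y vulkan-loader mesa-vulkan-drivers vulkan-tools && dnf clean all
Build the image from the directory containing the Containerfile:
podman build -t local/fedora-krunkit -f fedora-krunkit.containerfile .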
You can test this image with the command
podman run --rm -it --device /dev/dri --name gpu-info local/fedora-krunkit vulkaninfo | grep "GPU"
This should again display the information about the GPU.
Playing with llama.cpp and this Base Image
We can now use this image to start a container, install the toolset, git-clone llama.cpp, and build and run it. I suggest storing the large model files on the host and sharing that folder with the container.
podman run --rm -it --device /dev/dri --name llama-cpp-dev -v <your host's model folder>:/models local/fedora-krunkit /bin/bash
starts the container and puts you into its terminal. Be careful: because of --rm, exiting the container discards all changes you made. This is good for experimentation, since we can just start again from scratch if things go wrong; for permanent use we can later build a dedicated llama.cpp server or cli container with what we learned.
Installing the toolset, cloning llama.cpp, and building the server, cli, and bench:
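A sketch of how this can look inside the container (the Fedora package names for the Vulkan development bits are my assumption and may need adjusting):
# toolchain and Vulkan development packages
dnf install -y git cmake gcc-c++ ninja-build vulkan-headers vulkan-loader-devel glslc
# clone and build llama.cpp with the Vulkan backend (produces llama-server, llama-cli and llama-bench in build/bin)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)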
Playing with llama.cpp in that container - e.g. benchmarking - requires a model from the host's shared folder. You can then run a benchmark, e.g. with
./build/bin/llama-bench -p 512 -n 128 -ngl 99 -m <your model's .gguf-file>
Happy experimenting! My benchmark results are below.
llama.cpp Performance Results
All benchmarking was done on Apple Silicon Macs with llama-bench and llama-2 Q4_0, in order to be comparable to the llama.cpp performance-discussion results, but with the latest llama.cpp build: b43d89e (5143).
M2 Max Podman Container Performance
On an M2 Max Mac Studio, with 8 CPUs / 64 GB RAM allocated to the Podman machine.
Container, llama.cpp CPU-only:
Built for the CPU backend, without special parameters.
Container with llama.cpp built for Vulkan, with GPU remoting:
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Virtio-GPU Venus (Apple M2 Max) (venus) | uma: 1 | fp16: 1 | warp size: 32 | shared memory: 32768 | int dot: 0 | matrix cores: none
Strangely, -fa with the Vulkan backend leads to much worse performance, especially in TG:
Also, the Vulkan build run with -ngl 0 is strangely faster than both the CPU-only container and the native CPU run:
For comparison, a "native" M2 Max:
With GPU (-ngl 99):
Without GPU (-ngl 0, it's similar to the container):
I wrote an earlier, now somewhat outdated, medium.com article and discussion item #8042 on this, but thought I should give it another try with the latest versions.
I hope this is useful information - comments and feedback very welcome!