Release v1.21.1
New feature: vRAM estimator!
- Calculate vRAM usage for a given model configuration
- Determine maximum context length for a given vRAM constraint
- Find the best quantisation setting for a given vRAM and context constraint
- Support for different k/v cache quantisation options (fp16, q8_0, q4_0)
To estimate VRAM usage:
gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --quant q4_k_m --context 2048 --kvcache q4_0 # For GGUF models
gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --quant 5.0 --context 2048 --kvcache q4_0 # For exl2 models
# Estimated VRAM usage: 5.35 GB
To calculate maximum context for a given memory constraint:
gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --quant q4_k_m --memory 6 --kvcache q8_0 # For GGUF models
gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --bpw 5.0 --memory 6 --kvcache q8_0 # For exl2 models
# Maximum context for 6.00 GB of memory: 5069
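The maximum-context calculation can be pictured as a binary search over context lengths. The sketch below is only an illustration under stated assumptions, not gollama's actual code: `max_context`, `toy_usage_bytes` and the toy cost model (4 GiB of weights plus 128 KiB of cache per token) are hypothetical names and numbers chosen for the example.

```python
GIB = 1024**3


def max_context(fits, upper=1_048_576):
    """Return the largest context in [0, upper] for which fits(context) is True.

    Assumes fits is monotonic: once a context is too large, every larger
    context is too large as well. Returns 0 if nothing fits.
    """
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if fits(mid):
            lo = mid       # mid fits, so the answer is mid or larger
        else:
            hi = mid - 1   # mid does not fit, so the answer is smaller
    return lo


def toy_usage_bytes(context):
    # Hypothetical cost model, NOT gollama's: fixed weights plus a
    # per-token cache cost that grows linearly with context.
    return 4 * GIB + context * 128 * 1024


budget = 6 * GIB
best = max_context(lambda ctx: toy_usage_bytes(ctx) <= budget)
print(best)
```

Binary search suits this problem because memory use grows monotonically with context length, so each probe halves the remaining range.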
To find the best BPW (bits per weight) for a given memory constraint:
gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --memory 6 --quanttype gguf
# Best BPW for 6.00 GB of memory: IQ3_S
The vRAM estimator works by:
- Fetching the model configuration from Hugging Face (if not cached locally)
- Calculating the memory requirements for model parameters, activations, and the KV cache
- Adjusting calculations based on the specified quantisation settings
- Performing binary and linear searches to optimise for context length or quantisation settings
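As a rough illustration of the memory calculation described above, here is a minimal sketch that sums the weight and KV cache sizes. It is not gollama's implementation: `ModelConfig`, `weights_bytes`, `kv_cache_bytes` and `estimate_gib` are hypothetical names, the parameter count is assumed to be supplied by the caller, and activations and runtime overhead are deliberately left out, so the result will not match the tool's output.

```python
from dataclasses import dataclass

# Bits per cached element for each k/v cache type. q8_0 and q4_0 store a
# 16-bit scale per block of 32 values, hence 8.5 and 4.5 rather than 8 and 4.
KV_CACHE_BITS = {"fp16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


@dataclass
class ModelConfig:
    """Hypothetical subset of the fields found in a Hugging Face config.json."""
    num_params: float  # total parameter count (assumed known, not in config.json)
    num_hidden_layers: int
    hidden_size: int
    num_attention_heads: int
    num_key_value_heads: int


def weights_bytes(cfg, bpw):
    """Memory for the weights at the given bits per weight."""
    return cfg.num_params * bpw / 8


def kv_cache_bytes(cfg, context, kvcache="fp16"):
    """Memory for the K and V tensors across all layers at this context length."""
    head_dim = cfg.hidden_size // cfg.num_attention_heads
    # Factor of 2 covers one K and one V tensor per layer.
    elements = 2 * cfg.num_hidden_layers * cfg.num_key_value_heads * head_dim * context
    return elements * KV_CACHE_BITS[kvcache] / 8


def estimate_gib(cfg, bpw, context, kvcache="fp16"):
    """Weights plus KV cache in GiB (activations and overhead are ignored)."""
    total = weights_bytes(cfg, bpw) + kv_cache_bytes(cfg, context, kvcache)
    return total / 1024**3


# Example values in the shape of an 8B Llama-3-style model.
cfg = ModelConfig(8.0e9, 32, 4096, 32, 8)
print(f"{estimate_gib(cfg, bpw=5.0, context=2048, kvcache='q4_0'):.2f} GiB")
```

Because the KV cache term scales linearly with context and with the cache element size, switching from fp16 to q4_0 reduces that term to 4.5/16 of its original size in this model.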
1.21.1 (2024-08-01)
What's Changed
- chore(renovate): patch Update patch (patch) by @renovate in #77
- feat: vram estimator by @sammcj in #86
Full Changelog: v1.20.4...v1.21.1