
Release v1.21.1


New feature: vRAM estimator!

  • Calculate vRAM usage for a given model configuration
  • Determine maximum context length for a given vRAM constraint
  • Find the best quantisation setting for a given vRAM and context constraint
  • Support for different k/v cache quantisation options (fp16, q8_0, q4_0)

To estimate vRAM usage:

gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --quant q4_k_m --context 2048 --kvcache q4_0 # For GGUF models
gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --bpw 5.0 --context 2048 --kvcache q4_0 # For exl2 models
# Estimated VRAM usage: 5.35 GB
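
As a rough, illustrative sanity check of that figure (not gollama's actual calculation), assuming Llama-3-8B's shape of roughly 8.03B parameters, 32 layers, 8 KV heads with head dimension 128, Q4_K_M at about 4.85 effective bits per weight, and a q4_0 KV cache at about 4.5 bits per element:

package main

import "fmt"

func main() {
	// Assumed figures: ~8.03B parameters, Q4_K_M at roughly 4.85 bits per weight,
	// Llama-3-8B attention shape (32 layers, 8 KV heads, head dim 128), and a
	// q4_0 KV cache at roughly 4.5 bits per element.
	const (
		params     = 8.03e9
		layers     = 32
		kvHeads    = 8
		headDim    = 128
		context    = 2048
		weightBPW  = 4.85
		kvCacheBPW = 4.5
	)

	weightGB := params * weightBPW / 8 / 1e9
	// K and V per layer per token: 2 * kvHeads * headDim elements.
	kvGB := float64(2*layers*kvHeads*headDim*context) * kvCacheBPW / 8 / 1e9

	fmt.Printf("weights ≈ %.2f GB, KV cache ≈ %.2f GB\n", weightGB, kvGB)
	// The gap up to the reported 5.35 GB is activations and runtime overhead.
}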

To calculate maximum context for a given memory constraint:

gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --quant q4_k_m --memory 6 --kvcache q8_0 # For GGUF models
gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --bpw 5.0 --memory 6 --kvcache q8_0 # For exl2 models
# Maximum context for 6.00 GB of memory: 5069
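
Conceptually, the maximum-context mode is a binary search over context length against the estimator. The sketch below shows that idea in Go with a deliberately simplified cost model (weights plus KV cache only, assumed Llama-3-8B shape), so it illustrates the search rather than reproducing gollama's exact figures:

package main

import "fmt"

// estimateGB stands in for the real estimator: a fixed weight cost plus a KV
// cache that grows linearly with context (assumed q8_0 at ~8.5 bits/element,
// 32 layers, 8 KV heads, head dim 128). It ignores activation overhead.
func estimateGB(context int) float64 {
	const weightGB = 4.87 // assumed Q4_K_M weights for an ~8B-parameter model
	kvGB := float64(2*32*8*128*context) * 8.5 / 8 / 1e9
	return weightGB + kvGB
}

// maxContext binary-searches for the largest context length whose estimate
// still fits the memory budget.
func maxContext(budgetGB float64, lo, hi int) int {
	for lo < hi {
		mid := (lo + hi + 1) / 2
		if estimateGB(mid) <= budgetGB {
			lo = mid
		} else {
			hi = mid - 1
		}
	}
	return lo
}

func main() {
	fmt.Println(maxContext(6.0, 0, 8192))
}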

To find the best quantisation (BPW) for a given memory constraint:

gollama --vram --model NousResearch/Hermes-2-Theta-Llama-3-8B --memory 6 --quanttype gguf
# Best BPW for 6.00 GB of memory: IQ3_S
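
The best-quantisation mode is, conceptually, a linear search over candidate quantisation types, taking the highest-quality one whose estimate still fits the budget. A sketch of that idea in Go follows; the bits-per-weight figures are approximate, the candidate list is only a subset, and the flat overhead allowance is a placeholder, so it will pick a more optimistic answer than gollama's fuller model does:

package main

import "fmt"

type quant struct {
	name string
	bpw  float64 // approximate effective bits per weight
}

// An illustrative subset of GGUF quantisation types, highest quality first.
var candidates = []quant{
	{"Q8_0", 8.5},
	{"Q6_K", 6.56},
	{"Q5_K_M", 5.69},
	{"Q4_K_M", 4.85},
	{"IQ3_S", 3.44},
	{"IQ2_XS", 2.31},
}

// bestQuant walks the list from densest to smallest and returns the first
// quant whose weights, plus a flat allowance for KV cache and activations,
// fit the budget. The flat 1 GB allowance is a placeholder; gollama models
// the context and KV-cache settings explicitly, so its answers differ.
func bestQuant(budgetGB float64) (string, bool) {
	const params = 8.03e9 // assumed ~8B-parameter model
	const overheadGB = 1.0
	for _, q := range candidates {
		weightGB := params * q.bpw / 8 / 1e9
		if weightGB+overheadGB <= budgetGB {
			return q.name, true
		}
	}
	return "", false
}

func main() {
	if name, ok := bestQuant(6.0); ok {
		fmt.Println("best fit:", name)
	}
}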

The vRAM estimator works by:

  1. Fetching the model configuration from Hugging Face (if not cached locally)
  2. Calculating the memory requirements for model parameters, activations, and KV cache
  3. Adjusting calculations based on the specified quantisation settings
  4. Performing binary and linear searches to optimise for context length or quantisation settings
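
Step 1 boils down to reading the model's config.json from the Hugging Face hub (public models only; gated repositories additionally need an access token). A minimal sketch of that lookup in Go, with the struct covering only the fields a size calculation typically needs:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// modelConfig holds the config.json fields relevant to a vRAM estimate.
type modelConfig struct {
	HiddenSize        int `json:"hidden_size"`
	NumHiddenLayers   int `json:"num_hidden_layers"`
	NumAttentionHeads int `json:"num_attention_heads"`
	NumKeyValueHeads  int `json:"num_key_value_heads"`
	IntermediateSize  int `json:"intermediate_size"`
	VocabSize         int `json:"vocab_size"`
}

func fetchConfig(modelID string) (*modelConfig, error) {
	url := fmt.Sprintf("https://huggingface.co/%s/resolve/main/config.json", modelID)
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("fetching %s: %s", url, resp.Status)
	}
	var cfg modelConfig
	if err := json.NewDecoder(resp.Body).Decode(&cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}

func main() {
	cfg, err := fetchConfig("NousResearch/Hermes-2-Theta-Llama-3-8B")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%+v\n", cfg)
}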

1.21.1 (2024-08-01)

What's Changed

Full Changelog: v1.20.4...v1.21.1