
Comparing changes

base repository: ggml-org/llama.cpp
base: cbbd1efa06f8c09f9dff58ff9d9af509cc4c152b
head repository: ggml-org/llama.cpp
compare: cb49e0f8c906e5da49e9f6d64a57742a9a241c6a
  • 5 commits
  • 14 files changed
  • 4 contributors

Commits on Feb 27, 2024

  1. llama : fix defrag bugs + add parameter (#5735)

    * llama : fix defrag bugs + enable by default
    * llama : add defrag_thold parameter
    * llama : cont
    * llama : disable log message
    * llama : fix graph size check during defrag
    ggerganov authored Feb 27, 2024

    commit 9d533a7
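
    The new knob is part of the public API; here is a minimal sketch of how a
    client might opt in, assuming the llama_context_params field is named
    defrag_thold as in the PR title (the threshold value is illustrative):

        #include "llama.h"

        void configure_context(void) {
            struct llama_context_params cparams = llama_context_default_params();
            // Defragment the KV cache once fragmentation crosses the threshold;
            // a negative value leaves defragmentation disabled.
            cparams.defrag_thold = 0.1f; // illustrative: defrag at >10% fragmentation
            // pass cparams to llama_new_context_with_model(...) as usual
        }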
  2. ggml-quants : fix avx2 iq1_s vec_dot when compiled with gcc (#5742)

    Engininja2 authored Feb 27, 2024

    commit 1f30b7a
  3. cuda : replace remaining shfl_xor with calls to warp_reduce functions

    Engininja2 authored Feb 27, 2024

    commit c24a2a6
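
    For context, the warp_reduce helpers wrap the butterfly-shuffle pattern
    that the remaining call sites were hand-rolling with __shfl_xor. A sketch
    of such a helper (the exact name and signature here are our assumption,
    not necessarily the repo's code):

        // Butterfly reduction across a 32-lane warp: each __shfl_xor_sync step
        // exchanges partial sums between lanes whose IDs differ in one bit, so
        // after log2(32) = 5 steps every lane holds the full warp sum.
        static __device__ __forceinline__ float warp_reduce_sum(float x) {
        #pragma unroll
            for (int mask = 16; mask > 0; mask >>= 1) {
                x += __shfl_xor_sync(0xffffffff, x, mask, 32);
            }
            return x;
        }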
  4. IQ4_XS: a 4.25 bpw quantization (#5747)

    * Try IQ4_NL with blocks of 64 - does not look good
    * iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32
    * iq4_xs: CUDA works - 133.2 t/s
    * iq4_xs: AVX2 dot product
    * iq4_xs: ARM_NEON dot product
    * iq4_nl: Metal implementation

      As usual, Metal / Apple Silicon don't like my quants.

    * iq3_xs: minor fix
    * iq4_xs: shrink by using IQ3_S for attn_k and attn_q
    * iq4_xs: revert using IQ3_S for attn_k and attn_v

      PPL vs size is good, but CPU performance suffers: on an M2 Max,
      TG-128 drops to 21.7 t/s from 28.8 t/s, and on a Ryzen 7950X
      to 14.5 t/s from 15.8 t/s. On CUDA we get 135 t/s when using
      IQ3_S vs 133 t/s with pure IQ4_XS.

    * Fix CI
    * iq4_xs: Added forgotten check for 256 divisibility

    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
    ikawrakow and Kawrakow authored Feb 27, 2024

    commit 0becb22
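
    For reference, the 4.25 bpw in the title is consistent with the layout the
    log describes, assuming one fp16 scale per 256-weight super-block on top of
    the 6-bit sub-block scales: 256 x 4 + 8 x 6 + 16 = 1088 bits per super-block,
    and 1088 / 256 = 4.25 bits per weight.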
  5. Attempt to fix android build (#5752)

    Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
    ikawrakow and Kawrakow authored Feb 27, 2024

    commit cb49e0f