Releases: ml-explore/mlx

v0.22.0

09 Jan 22:33
1ce0c0f

Highlights

  • Export and import MLX functions to a file (example, bigger example)
    • Functions can be exported from Python and run in C++ and vice versa

Core

  • Add slice and slice_update which take arrays for starting locations
  • Add an example for using MLX in C++ with CMake
  • Fused attention for generation now supports boolean masking (benchmark)
  • Allow array offset for mx.fast.rope
  • Add mx.finfo
  • Allow negative strides without resorting to copying for slice and as_strided
  • Add Flatten, Unflatten and ExpandDims primitives
  • Enable the compilation of lambdas in C++
  • Add a lot more primitives for shapeless compilation (full list)
  • Fix performance regression in qvm
  • Introduce separate types for Shape and Strides and switch to int64 strides from uint64
  • Reduced copies for fused-attention kernel
  • Recompile a function when the stream changes
  • Several steps to improve the Linux/x86_64 experience (#1625, #1627, #1635)
  • Several steps to improve/enable the Windows experience (#1628, #1660, #1662, #1661, #1672, #1663, #1664, ...)
  • Update to newer Metal-cpp
  • Throw when exceeding the maximum number of buffers possible
  • Add mx.kron
  • mx.distributed.send now implements the identity function instead of returning an empty array
  • Better error reporting for mx.compile on CPU and for unrecoverable errors
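
To illustrate what mx.kron presumably computes, here is a minimal pure-Python sketch of the standard Kronecker product for 2-D inputs. The helper `kron` below is hypothetical and is not MLX code; it assumes mx.kron follows the usual definition (each element of the first matrix scales a full copy of the second).

```python
def kron(a, b):
    """Kronecker product of two 2-D lists (illustrative sketch, not MLX's implementation)."""
    rows_b, cols_b = len(b), len(b[0])
    out = []
    for row_a in a:
        # Each row of `a` expands into len(b) output rows.
        for i in range(rows_b):
            out.append([x * b[i][j] for x in row_a for j in range(cols_b)])
    return out


a = [[1, 2], [3, 4]]
b = [[0, 1], [1, 0]]
result = kron(a, b)  # a 4x4 nested list
```

The output shape is the element-wise product of the input shapes, here (2*2, 2*2).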

NN

  • Add optional bias correction in Adam/AdamW
  • Enable mixed quantization with nn.quantize
  • Remove reshapes from nn.QuantizedEmbedding
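
As a rough illustration of what optional bias correction changes in Adam, the sketch below implements a single scalar Adam step in plain Python using the textbook update rule. The function `adam_step` and its `bias_correction` flag are hypothetical names for this example; the release notes do not spell out the MLX parameter name or its exact arithmetic.

```python
import math


def adam_step(param, grad, m, v, step, lr=1e-3, b1=0.9, b2=0.999,
              eps=1e-8, bias_correction=False):
    """One scalar Adam update (textbook form; a sketch, not MLX's code)."""
    m = b1 * m + (1 - b1) * grad          # first-moment estimate
    v = b2 * v + (1 - b2) * grad * grad   # second-moment estimate
    if bias_correction:
        # Undo the bias toward zero caused by initializing m and v at 0.
        m_hat = m / (1 - b1 ** step)
        v_hat = v / (1 - b2 ** step)
    else:
        m_hat, v_hat = m, v
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First step from zero state with a unit gradient, with and without correction.
corrected, _, _ = adam_step(0.0, 1.0, 0.0, 0.0, step=1, bias_correction=True)
uncorrected, _, _ = adam_step(0.0, 1.0, 0.0, 0.0, step=1, bias_correction=False)
```

The two variants differ most in the first few steps, when the moment estimates are still dominated by their zero initialization.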

Bug fixes

  • Fix qmv/qvm bug for batch size 2-5
  • Fix some leaks and races (#1629)
  • Fix transformer postnorm in mlx.nn
  • Fix some mx.fast fallbacks
  • Fix the hashing for string constants in compile
  • Fix small sort in Metal
  • Fix memory leak of non-evaled arrays with siblings
  • Fix concatenate/slice_update vjp in the edge case where the inputs have different types

v0.21.1

06 Dec 21:17
50fa705

🚀 🚀

v0.21.0

22 Nov 20:18
bb303c4

Highlights

  • Support 3 and 6 bit quantization: benchmarks
  • Much faster memory efficient attention for headdim 64, 80: benchmarks
  • Much faster sdpa inference kernel for longer sequences: benchmarks
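
To give a feel for what the bit width trades off, here is a small pure-Python sketch of affine quantization over one group of weights. It assumes a scale-plus-bias scheme per group; the helpers `quantize_group` and `dequantize_group` are hypothetical and do not reproduce MLX's kernels or its bit packing.

```python
def quantize_group(weights, bits):
    """Map floats to integer codes in [0, 2**bits - 1] using one scale and bias."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate floats from integer codes."""
    return [c * scale + bias for c in codes]


weights = [-1.0, -0.4, 0.1, 0.3, 1.0]
errors = {}
for bits in (3, 6):
    codes, scale, bias = quantize_group(weights, bits)
    restored = dequantize_group(codes, scale, bias)
    # Worst-case reconstruction error is bounded by half a quantization step.
    errors[bits] = max(abs(w - r) for w, r in zip(weights, restored))
```

With 3 bits a group has 8 representable levels and with 6 bits it has 64, so the quantization step (and the error bound) shrinks by roughly a factor of 9.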

Core

  • contiguous op (C++ only) + primitive
  • BFS width limit to reduce memory consumption during eval
  • Fast CPU quantization
  • Faster indexing math in several kernels:
    • unary, binary, ternary, copy, compiled, reduce
  • Improve dispatch threads for a few kernels:
    • conv, gemm splitk, custom kernels
  • More buffer donation with no-ops to reduce memory use
  • Use CMAKE_OSX_DEPLOYMENT_TARGET to pick Metal version
  • Dispatch Metal bf16 type at runtime when using the JIT

NN

  • nn.AvgPool3d and nn.MaxPool3d
  • Support groups in nn.Conv2d

Bug fixes

  • Fix per-example mask + docs in sdpa
  • Fix FFT synchronization bug (use dispatch method everywhere)
  • Throw for invalid *fft{2,n} cases
  • Fix OOB access in qmv
  • Fix donation in sdpa to reduce memory use
  • Allocate safetensors header on the heap to avoid stack overflow
  • Fix sibling memory leak
  • Fix view segfault for scalar inputs
  • Fix concatenate vmap

v0.20.0

05 Nov 21:23
726dbd9

Highlights

  • Even faster GEMMs
  • BFS graph optimizations
    • Over 120 tokens/s with Mistral 7B!
  • Fast batched QMV/QVM for KV quantized attention (benchmarks)

Core

  • New features
    • mx.linalg.eigh and mx.linalg.eigvalsh
    • mx.nn.init.sparse
    • 64bit type support for mx.cumprod, mx.cumsum
  • Performance
    • Faster long column reductions
    • Wired buffer support for large models
    • Better Winograd dispatch condition for convs
    • Faster scatter/gather
    • Faster mx.random.uniform and mx.random.bernoulli
    • Better threadgroup sizes for large arrays
  • Misc
    • Added Python 3.13 to CI
    • C++20 compatibility
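
As a small worked example of what an eigenvalue routine for symmetric matrices such as mx.linalg.eigvalsh returns, the sketch below uses the closed form for a 2x2 symmetric matrix. The helper `eigvalsh_2x2` is hypothetical, and the ascending output order is an assumption borrowed from the NumPy convention, not something the release notes state.

```python
import math


def eigvalsh_2x2(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2
    # Distance from the mean to each eigenvalue.
    radius = math.hypot((a - d) / 2, b)
    return [mean - radius, mean + radius]


eigenvalues = eigvalsh_2x2(2.0, 1.0, 2.0)  # matrix [[2, 1], [1, 2]]
```

The sum of the two values equals the trace (a + d) and their product equals the determinant (a*d - b*b), which is a quick sanity check for any symmetric eigenvalue solver.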

Bug fixes

  • Fix command encoder synchronization
  • Fix mx.vmap with gather and constant outputs
  • Fix fused sdpa with differing key and value strides
  • Support mx.array.__format__ with spec
  • Fix multi output array leak
  • Fix RMSNorm weight mismatch error

v0.19.3

31 Oct 23:11
eac961d

🚀

v0.19.2

31 Oct 02:54
cde5b4a

🚀🚀

v0.19.1

25 Oct 20:18
35e9c87

🚀

v0.19.0

18 Oct 19:35
58a8556

Highlights

  • Speed improvements
    • Up to 6x faster CPU indexing (benchmarks)
    • Faster Metal compiled kernels for strided inputs (benchmarks)
    • Faster generation with fused-attention kernel (benchmarks)
  • Gradient for grouped convolutions
  • Due to Python 3.8's end-of-life we no longer test with it on CI

Core

  • New features
    • Gradient for grouped convolutions
    • mx.roll
    • mx.random.permutation
    • mx.real and mx.imag
  • Performance
    • Up to 6x faster CPU indexing (benchmarks)
    • Faster CPU sort (benchmarks)
    • Faster Metal compiled kernels for strided inputs (benchmarks)
    • Faster generation with fused-attention kernel (benchmarks)
    • Bulk eval in safetensors to avoid unnecessary serialization of work
  • Misc
    • Bump to nanobind 2.2
    • Move testing to Python 3.9 due to 3.8's end-of-life
    • Make the GPU device more thread safe
    • Fix the submodule stubs for better IDE support
    • CI generated docs that will never be stale
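
For readers unfamiliar with roll semantics, here is a minimal pure-Python sketch of a 1-D circular shift. It assumes mx.roll follows the NumPy convention (positive shifts move elements toward higher indices and wrap around); the helper `roll` is hypothetical and only handles flat lists.

```python
def roll(values, shift):
    """Circularly shift a list by `shift` positions (NumPy-style convention)."""
    n = len(values)
    if n == 0:
        return []
    k = shift % n  # normalizes negative and oversized shifts
    if k == 0:
        return list(values)
    return values[-k:] + values[:-k]


shifted_right = roll([0, 1, 2, 3, 4], 2)
shifted_left = roll([0, 1, 2, 3, 4], -1)
```

Taking the shift modulo the length means a shift of `n` or `-n` returns the input unchanged.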

NN

  • Add support for grouped 1D convolutions to the nn API
  • Add some missing type annotations

Bug fixes

  • Fix and speed up row-reduce with few rows
  • Fix normalization primitive segfault with unexpected inputs
  • Fix complex power on the GPU
  • Fix freeing deep unevaluated graphs (details)
  • Fix race with array::is_available
  • Consistently handle softmax with all -inf inputs
  • Fix streams in affine quantize
  • Fix CPU compile preamble for some Linux machines
  • Stream safety in CPU compilation
  • Fix CPU compile segfault at program shutdown

v0.18.1

10 Oct 20:05
c21331d

🚀

v0.18.0

27 Sep 21:10
b1e2b53

Highlights

  • Speed improvements:
    • Up to 2x faster I/O: benchmarks.
    • Faster transposed copies, unary, and binary ops
  • Transposed convolutions
  • Improvements to mx.distributed (send/recv/average_gradients)

Core

  • New features:
    • mx.conv_transpose{1,2,3}d
    • Allow mx.take to work with integer index
    • Add std as method on mx.array
    • mx.put_along_axis
    • mx.cross_product
    • int() and float() work on scalar mx.array
    • Add optional headers to mx.fast.metal_kernel
    • mx.distributed.send and mx.distributed.recv
    • mx.linalg.pinv
  • Performance
    • Up to 2x faster I/O
    • Much faster CPU convolutions
    • Faster general n-dimensional copies, unary, and binary ops for both CPU and GPU
    • Put reduction ops in default stream with async for faster comms
    • Overhead reductions in mx.fast.metal_kernel
    • Improve donation heuristics to reduce memory use
  • Misc
    • Support Xcode 160
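
To show what a transposed convolution computes, the sketch below implements a single-channel, unbatched 1-D version in plain Python. The helper `conv_transpose1d` is hypothetical: it ignores the batch and channel dimensions that mx.conv_transpose1d works with, and it covers only stride and padding.

```python
def conv_transpose1d(x, w, stride=1, padding=0):
    """Single-channel 1-D transposed convolution (illustrative sketch)."""
    full_len = (len(x) - 1) * stride + len(w)
    full = [0.0] * full_len
    # Each input element scatters a scaled copy of the kernel into the output.
    for i, xi in enumerate(x):
        for k, wk in enumerate(w):
            full[i * stride + k] += xi * wk
    # Padding trims elements from both ends of the full result.
    return full[padding:full_len - padding]


upsampled = conv_transpose1d([1.0, 2.0, 3.0], [1.0, 1.0], stride=2)
```

In this sketch the output length is (len(x) - 1) * stride + len(w) - 2 * padding, which is why transposed convolutions are commonly used for upsampling.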

NN

  • Faster RNN layers
  • nn.ConvTranspose{1,2,3}d
  • mlx.nn.average_gradients data parallel helper for distributed training
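
The idea behind a data-parallel gradient-averaging helper can be sketched without any distributed runtime. The function `average_across_workers` below is hypothetical: it simulates several workers by holding all of their gradients in one list, whereas in a real distributed run each process would hold only its own gradients and a collective all-reduce would do the summation.

```python
def average_across_workers(per_worker_grads):
    """Average matching gradient entries over a list of per-worker dicts."""
    n = len(per_worker_grads)
    keys = per_worker_grads[0].keys()
    # Sum each named gradient over workers, then divide by the worker count.
    return {k: sum(g[k] for g in per_worker_grads) / n for k in keys}


worker_grads = [
    {"w": 1.0, "b": 2.0},  # gradients computed on worker 0's data shard
    {"w": 3.0, "b": 4.0},  # gradients computed on worker 1's data shard
]
averaged = average_across_workers(worker_grads)
```

After averaging, every worker applies the same update, so model replicas stay in sync across steps.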

Bug fixes

  • Fix boolean all reduce bug
  • Fix extension metal library finding
  • Fix ternary for large arrays
  • Make eval just wait if all arrays are scheduled
  • Fix CPU softmax by removing redundant coefficient in neon_fast_exp
  • Fix JIT reductions
  • Fix overflow in quantize/dequantize
  • Fix compile with byte sized constants
  • Fix copy in the sort primitive
  • Fix reduce edge case
  • Fix slice data size
  • Throw for certain cases of non captured inputs in compile
  • Fix copying scalars by adding fill_gpu
  • Fix bug in module attribute set, reset, set
  • Ensure io/comm streams are active before eval
  • Fix mx.clip
  • Override class function in Repr so mx.array is not confused with array.array
  • Avoid using find_library to make install truly portable
  • Remove fmt dependencies from MLX install
  • Fix for partition VJP
  • Avoid command buffer timeout for IO on large arrays