Releases: ml-explore/mlx

v0.20.0

05 Nov 21:23
726dbd9

Highlights

  • Even faster GEMMs
  • BFS graph optimizations
    • Over 120 tokens per second with Mistral 7B!
  • Fast batched QMV/QVM for KV-quantized attention (benchmarks)

Core

  • New Features
    • mx.linalg.eigh and mx.linalg.eigvalsh (example after this list)
    • mx.nn.init.sparse
    • 64-bit type support for mx.cumprod and mx.cumsum
  • Performance
    • Faster long column reductions
    • Wired buffer support for large models
    • Better Winograd dispatch condition for convs
    • Faster scatter/gather
    • Faster mx.random.uniform and mx.random.bernoulli
    • Better threadgroup sizes for large arrays
  • Misc
    • Added Python 3.13 to CI
    • C++20 compatibility
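
A minimal sketch of the new eigendecomposition routines; passing an explicit CPU stream is an assumption, since several mx.linalg routines are CPU-only:

```python
import mlx.core as mx

a = mx.array([[2.0, 1.0], [1.0, 2.0]])

# Eigenvalues only, then the full symmetric eigendecomposition.
# stream=mx.cpu is an assumption; several mx.linalg ops run on the CPU.
w = mx.linalg.eigvalsh(a, stream=mx.cpu)
w, v = mx.linalg.eigh(a, stream=mx.cpu)
print(w)
print(v)
```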

Bugfixes

  • Fix command encoder synchronization
  • Fix mx.vmap with gather and constant outputs
  • Fix fused sdpa with differing key and value strides
  • Support mx.array.__format__ with spec
  • Fix multi output array leak
  • Fix RMSNorm weight mismatch error

v0.19.3

31 Oct 23:11
eac961d

🚀

v0.19.2

31 Oct 02:54
cde5b4a

🚀🚀

v0.19.1

25 Oct 20:18
35e9c87

🚀

v0.19.0

18 Oct 19:35
58a8556

Highlights

  • Speed improvements
    • Up to 6x faster CPU indexing (benchmarks)
    • Faster compiled Metal kernels for strided inputs (benchmarks)
    • Faster generation with the fused-attention kernel (benchmarks)
  • Gradient for grouped convolutions
  • Due to Python 3.8's end-of-life, we no longer test with it in CI

Core

  • New features
    • Gradient for grouped convolutions
    • mx.roll
    • mx.random.permutation
    • mx.real and mx.imag (examples after this list)
  • Performance
    • Up to 6x faster CPU indexing (benchmarks)
    • Faster CPU sort (benchmarks)
    • Faster compiled Metal kernels for strided inputs (benchmarks)
    • Faster generation with the fused-attention kernel (benchmarks)
    • Bulk eval in safetensors to avoid unnecessary serialization of work
  • Misc
    • Bump to nanobind 2.2
    • Move testing to Python 3.9 due to 3.8's end-of-life
    • Make the GPU device more thread safe
    • Fix the submodule stubs for better IDE support
    • CI-generated docs that will never be stale
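
A quick, illustrative look at the new ops; mx.random.permutation accepting either an array or an integer size is an assumption, mirroring NumPy:

```python
import mlx.core as mx

x = mx.arange(6)
print(mx.roll(x, 2))               # circular shift by two positions
print(mx.random.permutation(x))    # shuffled copy of x
print(mx.random.permutation(6))    # assumed to also accept an int, like NumPy

z = mx.array([1 + 2j, 3 - 4j])
print(mx.real(z), mx.imag(z))      # real and imaginary parts
```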

NN

  • Add support for grouped 1D convolutions to the nn API (sketch below)
  • Add some missing type annotations
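
A minimal sketch of a grouped 1D convolution through the nn API; the groups keyword name and the channels-last input layout are assumptions:

```python
import mlx.core as mx
import mlx.nn as nn

# 8 input channels split into 4 groups; the "groups" kwarg name is assumed
conv = nn.Conv1d(in_channels=8, out_channels=8, kernel_size=3, groups=4)
x = mx.random.normal((2, 16, 8))   # (batch, length, channels), assumed layout
print(conv(x).shape)
```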

Bugfixes

  • Fix and speed up row-reduce with few rows
  • Fix normalization primitive segfault with unexpected inputs
  • Fix complex power on the GPU
  • Fix freeing of deep unevaluated graphs (details)
  • Fix race with array::is_available
  • Consistently handle softmax with all -inf inputs
  • Fix streams in affine quantize
  • Fix CPU compile preamble for some linux machines
  • Stream safety in CPU compilation
  • Fix CPU compile segfault at program shutdown

v0.18.1

10 Oct 20:05
c21331d

🚀

v0.18.0

27 Sep 21:10
b1e2b53

Highlights

  • Speed improvements:
    • Up to 2x faster I/O (benchmarks)
    • Faster transposed copies, unary, and binary ops
  • Transposed convolutions
  • Improvements to mx.distributed (send/recv/average_gradients)

Core

  • New features:

    • mx.conv_transpose{1,2,3}d
    • Allow mx.take to work with an integer index
    • Add std as method on mx.array
    • mx.put_along_axis (example after this list)
    • mx.cross_product
    • int() and float() work on scalar mx.array
    • Add optional headers to mx.fast.metal_kernel
    • mx.distributed.send and mx.distributed.recv
    • mx.linalg.pinv
  • Performance

    • Up to 2x faster I/O
    • Much faster CPU convolutions
    • Faster general n-dimensional copies, unary, and binary ops for both CPU and GPU
    • Put reduction ops in default stream with async for faster comms
    • Overhead reductions in mx.fast.metal_kernel
    • Improve donation heuristics to reduce memory use
  • Misc

    • Support Xcode 16.0
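
An illustrative sketch of a few of the new ops; argument names are assumed to follow the NumPy counterparts:

```python
import mlx.core as mx

# mx.put_along_axis: place values at the given indices along an axis
a = mx.zeros((2, 3))
idx = mx.array([[0], [2]])
a = mx.put_along_axis(a, idx, mx.ones((2, 1)), axis=1)

# std as an array method, and float() on a scalar array
x = mx.array([1.0, 2.0, 3.0])
print(float(x.std()))
```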

NN

  • Faster RNN layers
  • nn.ConvTranspose{1,2,3}d (sketch below)
  • mlx.nn.average_gradients, a data-parallel helper for distributed training
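
A sketch of the new transposed convolution layer; parameter names are assumed to follow the existing nn.Conv1d, with channels-last inputs:

```python
import mlx.core as mx
import mlx.nn as nn

layer = nn.ConvTranspose1d(in_channels=8, out_channels=16, kernel_size=3, stride=2)
x = mx.random.normal((4, 32, 8))   # (batch, length, channels), assumed layout
print(layer(x).shape)              # length is upsampled by the stride
```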

Bug Fixes

  • Fix boolean all reduce bug
  • Fix extension metal library finding
  • Fix ternary for large arrays
  • Make eval just wait if all arrays are scheduled
  • Fix CPU softmax by removing redundant coefficient in neon_fast_exp
  • Fix JIT reductions
  • Fix overflow in quantize/dequantize
  • Fix compile with byte sized constants
  • Fix copy in the sort primitive
  • Fix reduce edge case
  • Fix slice data size
  • Throw for certain cases of non-captured inputs in compile
  • Fix copying scalars by adding fill_gpu
  • Fix bug in module attribute set, reset, set
  • Ensure io/comm streams are active before eval
  • Fix mx.clip
  • Override class function in Repr so mx.array is not confused with array.array
  • Avoid using find_library to make install truly portable
  • Remove fmt dependencies from MLX install
  • Fix for partition VJP
  • Avoid command buffer timeout for IO on large arrays

v0.17.3

13 Sep 00:17
d0c5884

🚀

v0.17.1

24 Aug 17:19
8081df7

πŸ›

v0.17.0

23 Aug 18:48
684e11c

Highlights

  • mx.einsum (PR, example below)
  • Big speedups in reductions (benchmarks)
  • 2x faster model loading (PR)
  • mx.fast.metal_kernel for custom GPU kernels (docs)
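
For reference, a minimal mx.einsum example; the subscript notation follows the NumPy convention:

```python
import mlx.core as mx

a = mx.random.normal((4, 8))
b = mx.random.normal((8, 5))
c = mx.einsum("ij,jk->ik", a, b)   # matrix multiply expressed as an einsum
print(c.shape)                     # (4, 5)
```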

Core

  • Faster program exits
  • Laplace sampling
  • mx.nan_to_num (example after this list)
  • nn.tanh gelu approximation
  • Fused GPU quantization ops
  • Faster group norm
  • bf16 Winograd conv
  • vmap support for mx.scatter
  • mx.pad "edge" padding
  • More numerically stable mx.var
  • mx.linalg.cholesky_inv/mx.linalg.tri_inv
  • mx.isfinite
  • Complex mx.sign now mirrors NumPy 2.0 behaviour
  • More flexible mx.fast.rope
  • Update to nanobind 2.1
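
A short sketch of the new numerics helpers; the default replacement values of mx.nan_to_num are assumed to match NumPy:

```python
import mlx.core as mx

x = mx.array([1.0, float("nan"), float("inf")])
print(mx.isfinite(x))     # [True, False, False]
print(mx.nan_to_num(x))   # NaN -> 0.0, Inf -> large finite value (assumed NumPy-like defaults)
```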

Bug Fixes

  • Fix gguf zero initialization
  • Fix expm1f overflow handling
  • Fix bfloat16 Hadamard
  • Fix various ops for large arrays
  • Fix rope
  • Fix bf16 array creation
  • Preserve dtype in nn.Dropout
  • Fix nn.TransformerEncoder with norm_first=False
  • Fix excess copies from a contiguity bug