Releases: ml-explore/mlx
v0.20.0
Highlights
- Even faster GEMMs
- Peaking at 23.89 TFlops on M2 Ultra (benchmarks)
- BFS graph optimizations
- Over 120 tokens/s with Mistral 7B!
- Fast batched QMV/QVM for KV quantized attention (benchmarks)
Core
- New Features
- mx.linalg.eigh and mx.linalg.eigvalsh
- mx.nn.init.sparse
- 64bit type support for mx.cumprod, mx.cumsum
- Performance
- Faster long column reductions
- Wired buffer support for large models
- Better Winograd dispatch condition for convs
- Faster scatter/gather
- Faster mx.random.uniform and mx.random.bernoulli
- Better threadgroup sizes for large arrays
- Misc
- Added Python 3.13 to CI
- C++20 compatibility
Bugfixes
- Fix command encoder synchronization
- Fix mx.vmap with gather and constant outputs
- Fix fused sdpa with differing key and value strides
- Support mx.array.__format__ with spec
- Fix multi output array leak
- Fix RMSNorm weight mismatch error
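As a rough illustration of why 64-bit support in cumulative ops can matter, the sketch below emulates a cumulative product under 32-bit and 64-bit signed wraparound. It is plain Python rather than MLX code: the helpers `wrap_signed` and `cumprod_fixed_width` are hypothetical names, and the two's-complement wraparound model is an assumption about fixed-width integers in general, not a description of MLX internals.

```python
from itertools import accumulate


def wrap_signed(value: int, bits: int) -> int:
    """Reduce a Python int to a two's-complement signed integer of `bits` width."""
    half = 1 << (bits - 1)
    return (value + half) % (1 << bits) - half


def cumprod_fixed_width(values, bits):
    """Cumulative product where every partial result wraps to `bits` bits."""
    return list(accumulate(values, lambda acc, v: wrap_signed(acc * v, bits)))


factors = list(range(1, 14))  # 1, 2, ..., 13
as_int32 = cumprod_fixed_width(factors, 32)
as_int64 = cumprod_fixed_width(factors, 64)

print(as_int32[-1])  # 13! does not fit in 32 bits, so this value has wrapped
print(as_int64[-1])  # 13! = 6227020800 fits in 64 bits
```

The two results agree up to 12! and diverge at 13!, the first factorial that exceeds the signed 32-bit range.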
v0.19.3
v0.19.2
v0.19.1
v0.19.0
Highlights
- Speed improvements
- Up to 6x faster CPU indexing (benchmarks)
- Faster Metal compiled kernels for strided inputs (benchmarks)
- Faster generation with fused-attention kernel (benchmarks)
- Gradient for grouped convolutions
- Due to Python 3.8's end-of-life we no longer test with it on CI
Core
- New features
- Gradient for grouped convolutions
- mx.roll
- mx.random.permutation
- mx.real and mx.imag
- Performance
- Up to 6x faster CPU indexing (benchmarks)
- Faster CPU sort (benchmarks)
- Faster Metal compiled kernels for strided inputs (benchmarks)
- Faster generation with fused-attention kernel (benchmarks)
- Bulk eval in safetensors to avoid unnecessary serialization of work
- Misc
- Bump to nanobind 2.2
- Move testing to Python 3.9 due to 3.8's end-of-life
- Make the GPU device more thread safe
- Fix the submodule stubs for better IDE support
- CI generated docs that will never be stale
NN
- Add support for grouped 1D convolutions to the nn API
- Add some missing type annotations
Bugfixes
- Fix and speedup row-reduce with few rows
- Fix normalization primitive segfault with unexpected inputs
- Fix complex power on the GPU
- Fix freeing deep unevaluated graphs (details)
- Fix race with array::is_available
- Consistently handle softmax with all -inf inputs
- Fix streams in affine quantize
- Fix CPU compile preamble for some linux machines
- Stream safety in CPU compilation
- Fix CPU compile segfault at program shutdown
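For readers unfamiliar with the circular shift that mx.roll adds, here is a minimal plain-Python sketch of the idea on a one-dimensional list. It assumes mx.roll follows the usual NumPy-style convention (a positive shift moves elements toward higher indices and wraps around). The function `roll` below is a hypothetical stand-in written for illustration, not the MLX implementation.

```python
def roll(values, shift):
    """Circularly shift a list right by `shift` positions (left if negative)."""
    if not values:
        return []
    k = shift % len(values)  # normalise so any integer shift is valid
    return values[-k:] + values[:-k] if k else list(values)


data = [0, 1, 2, 3, 4]
print(roll(data, 2))   # [3, 4, 0, 1, 2]
print(roll(data, -1))  # [1, 2, 3, 4, 0]
```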
v0.18.1
v0.18.0
Highlights
- Speed improvements:
- Up to 2x faster I/O: benchmarks.
- Faster transposed copies, unary, and binary ops
- Transposed convolutions
- Improvements to mx.distributed (send/recv/average_gradients)
Core
- New features:
- mx.conv_transpose{1,2,3}d
- Allow mx.take to work with integer index
- Add std as method on mx.array
- mx.put_along_axis
- mx.cross_product
- int() and float() work on scalar mx.array
- Add optional headers to mx.fast.metal_kernel
- mx.distributed.send and mx.distributed.recv
- mx.linalg.pinv
- Performance
- Up to 2x faster I/O
- Much faster CPU convolutions
- Faster general n-dimensional copies, unary, and binary ops for both CPU and GPU
- Put reduction ops in default stream with async for faster comms
- Overhead reductions in mx.fast.metal_kernel
- Improve donation heuristics to reduce memory use
- Misc
- Support Xcode 16.0
NN
- Faster RNN layers
- nn.ConvTranspose{1,2,3}d
- mlx.nn.average_gradients data parallel helper for distributed training
Bug Fixes
- Fix boolean all reduce bug
- Fix extension metal library finding
- Fix ternary for large arrays
- Make eval just wait if all arrays are scheduled
- Fix CPU softmax by removing redundant coefficient in neon_fast_exp
- Fix JIT reductions
- Fix overflow in quantize/dequantize
- Fix compile with byte sized constants
- Fix copy in the sort primitive
- Fix reduce edge case
- Fix slice data size
- Throw for certain cases of non captured inputs in compile
- Fix copying scalars by adding fill_gpu
- Fix bug in module attribute set, reset, set
- Ensure io/comm streams are active before eval
- Fix mx.clip
- Override class function in Repr so mx.array is not confused with array.array
- Avoid using find_library to make install truly portable
- Remove fmt dependencies from MLX install
- Fix for partition VJP
- Avoid command buffer timeout for IO on large arrays
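The data parallel gradient averaging mentioned in this release can be pictured with a small plain-Python sketch. Everything here is an assumption made for illustration: `average_gradients_sketch` is a hypothetical function, gradients are modelled as dicts of float lists, and the workers live in one process. The real helper operates on MLX arrays and communicates across processes.

```python
def average_gradients_sketch(per_worker_grads):
    """Average matching gradient entries across workers.

    `per_worker_grads` holds one dict per worker, each mapping a parameter
    name to a list of floats. All workers must share the same names and shapes.
    """
    num_workers = len(per_worker_grads)
    averaged = {}
    for name in per_worker_grads[0]:
        # Group the i-th entry of this parameter from every worker together.
        columns = zip(*(grads[name] for grads in per_worker_grads))
        averaged[name] = [sum(col) / num_workers for col in columns]
    return averaged


worker_grads = [
    {"weight": [1.0, 2.0], "bias": [0.5]},
    {"weight": [3.0, 4.0], "bias": [1.5]},
]
print(average_gradients_sketch(worker_grads))
# {'weight': [2.0, 3.0], 'bias': [1.0]}
```

After averaging, every worker would apply the same update, which keeps model replicas in sync during data parallel training.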
v0.17.3
v0.17.1
v0.17.0
Highlights
- mx.einsum: PR
- Big speedups in reductions: benchmarks
- 2x faster model loading: PR
- mx.fast.metal_kernel for custom GPU kernels: docs
Core
- Faster program exits
- Laplace sampling
- mx.nan_to_num
- nn.tanh gelu approximation
- Fused GPU quantization ops
- Faster group norm
- bf16 winograd conv
- vmap support for mx.scatter
- mx.pad "edge" padding
- More numerically stable mx.var
- mx.linalg.cholesky_inv/mx.linalg.tri_inv
- mx.isfinite
- Complex mx.sign now mirrors NumPy 2.0 behaviour
- More flexible mx.fast.rope
- Update to nanobind 2.1
Bug Fixes
- gguf zero initialization
- expm1f overflow handling
- bfloat16 hadamard
- large arrays for various ops
- rope fix
- bf16 array creation
- preserve dtype in nn.Dropout
- nn.TransformerEncoder with norm_first=False
- excess copies from contiguity bug
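To make the "edge" padding entry above concrete, here is a minimal plain-Python sketch for a one-dimensional list. It assumes the mode behaves like NumPy's "edge" mode, where the border values are repeated outward. `pad_edge` is a hypothetical helper for illustration only and does not call MLX.

```python
def pad_edge(values, before, after):
    """Pad a non-empty list by repeating its first and last elements."""
    if not values:
        raise ValueError("edge padding needs at least one element")
    return [values[0]] * before + list(values) + [values[-1]] * after


print(pad_edge([1, 2, 3], 2, 1))  # [1, 1, 1, 2, 3, 3]
```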