Releases: ggml-org/llama.cpp
b4747
ggml-cpu: Add CPU backend support for KleidiAI library (#11390)

* ggml-cpu: Add CPU backend support for KleidiAI library
* Add environment variable GGML_KLEIDIAI_SME
* Add support for multithreaded LHS conversion
* Switch kernel selection order to dotprod and i8mm
* Updates for review comments
* More updates for review comments
* Reorganize and rename KleidiAI files
* Move ggml-cpu-traits.h to source file
* Update cmake for SME build and add alignment for SME
* Remove append GGML_USE_CPU_KLEIDIAI to the GGML_CDEF_PUBLIC list
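As a rough sketch of how this backend might be enabled (the `GGML_CPU_KLEIDIAI` CMake flag and the model path are assumptions inferred from the changelog, not verified against the current build files; `GGML_KLEIDIAI_SME` is the environment variable named above):

```console
$ cmake -B build -DGGML_CPU_KLEIDIAI=ON
$ cmake --build build --config Release
$ GGML_KLEIDIAI_SME=1 ./build/bin/llama-cli -m model.gguf -p "Hello"
```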
b4746
ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot (#11917)

* Added SVE implementation for the Q3_K kernel in ggml-cpu-quants.c
* Improved formatting of code in ggml-cpu-quants.c
* style : minor fixes
* style : less whitespace
* style : ptr spacing

Co-authored-by: vithulep <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
b4745
run : add --chat-template-file (#11961)

Relates to: https://github.com/ggml-org/llama.cpp/issues/11178

Added a --chat-template-file CLI option to llama-run. If specified, the file is read and its content is passed to common_chat_templates_from_model to override the model's chat template.

Signed-off-by: Michael Engel <[email protected]>
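A possible invocation of the new option (the template filename and model name here are hypothetical, chosen only for illustration):

```console
$ llama-run --chat-template-file ./chat-template.jinja llama3
```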
b4743
common : add llama.vim preset for Qwen2.5 Coder (#11945)

This commit adds a preset for llama.vim to use the default Qwen 2.5 Coder models. The motivation for this change is to make it easier to start a server suitable to be used with the llama.vim plugin. For example, the server can be started with a command like the following:

```console
$ llama-server --fim-qwen-1.5b-default
```

Refs: https://github.com/ggml-org/llama.cpp/issues/10932
b4742
speculative : update default params (#11954)

* speculative : update default params
* speculative : do not discard the last drafted token
b4739
tool-call: refactor common chat / tool-call api (+ tests / fixes) (#1…
b4738
server : add TEI API format for /rerank endpoint (#11942)

* server : add TEI API format for /rerank endpoint
* Apply suggestions from code review
* fix
* also gitignore examples/server/*.gz.hpp

Co-authored-by: Georgi Gerganov <[email protected]>
b4735
CUDA: use async data loading for FlashAttention (#11894)

Co-authored-by: Diego Devesa <[email protected]>
b4734
update release requirements (#11897)
b4733
server : fix divide-by-zero in metrics reporting (#11915)