Releases: ggml-org/llama.cpp
b4747
ggml-cpu: Add CPU backend support for KleidiAI library (#11390)

* ggml-cpu: Add CPU backend support for KleidiAI library
* Add environment variable GGML_KLEIDIAI_SME
* Add support for multithreaded LHS conversion
* Switch kernel selection order to dotprod and i8mm
* Updates for review comments
* More updates for review comments
* Reorganize and rename KleidiAI files
* Move ggml-cpu-traits.h to source file
* Update cmake for SME build and add alignment for SME
* Remove append GGML_USE_CPU_KLEIDIAI to the GGML_CDEF_PUBLIC list
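As a rough sketch of how this backend might be enabled (the `GGML_CPU_KLEIDIAI` CMake flag and the model path are assumptions inferred from the changelog, not verified against the current build files; `GGML_KLEIDIAI_SME` is the environment variable named above):

```console
$ cmake -B build -DGGML_CPU_KLEIDIAI=ON
$ cmake --build build --config Release
$ GGML_KLEIDIAI_SME=1 ./build/bin/llama-cli -m model.gguf -p "Hello"
```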
b4746
ggml: aarch64: implement SVE kernels for q3_K_q8_K vector dot (#11917)

* Added SVE implementation for the Q3_K kernel in ggml-cpu-quants.c
* Improved formatting of code in ggml-cpu-quants.c
* style : minor fixes
* style : less whitespace
* style : ptr spacing

Co-authored-by: vithulep <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
b4745
run : add --chat-template-file (#11961)

Relates to: https://github.com/ggml-org/llama.cpp/issues/11178

Added a --chat-template-file CLI option to llama-run. If specified, the file is read and its content is passed to common_chat_templates_from_model to override the model's chat template.

Signed-off-by: Michael Engel <[email protected]>
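A possible invocation of the new option (the template filename and model name here are hypothetical, chosen only for illustration):

```console
$ llama-run --chat-template-file ./chat-template.jinja llama3
```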
b4743
common : add llama.vim preset for Qwen2.5 Coder (#11945)

This commit adds a preset for llama.vim to use the default Qwen 2.5 Coder models. The motivation for this change is to make it easier to start a server suitable to be used with the llama.vim plugin. For example, the server can be started with a command like the following:

```console
$ llama-server --fim-qwen-1.5b-default
```

Refs: https://github.com/ggml-org/llama.cpp/issues/10932
b4742
speculative : update default params (#11954)

* speculative : update default params
* speculative : do not discard the last drafted token
b4739
tool-call: refactor common chat / tool-call api (+ tests / fixes) (#1…
b4738
server : add TEI API format for /rerank endpoint (#11942)

* server : add TEI API format for /rerank endpoint
* Apply suggestions from code review
* fix
* also gitignore examples/server/*.gz.hpp

Co-authored-by: Georgi Gerganov <[email protected]>
b4735
CUDA: use async data loading for FlashAttention (#11894)

Co-authored-by: Diego Devesa <[email protected]>
b4734
update release requirements (#11897)
b4733
server : fix divide-by-zero in metrics reporting (#11915)