Commit
Merge pull request #122 from kvcache-ai/feat-DeepSeekV3
[Feat] add support to DeepSeekV3
UnicornChan authored Feb 10, 2025
2 parents f4903d5 + 6f0fe95 commit 7527619
Showing 32 changed files with 4,460 additions and 159 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/package_wheel_release.yml
@@ -142,11 +142,11 @@ jobs:

- name: Setup Mamba
if: matrix.cuda != ''
uses: conda-incubator/setup-miniconda@v2.3.0
uses: conda-incubator/setup-miniconda@v3
with:
activate-environment: "ktransformers"
python-version: ${{ matrix.pyver }}
miniforge-variant: Mambaforge
miniforge-variant: Miniforge3
miniforge-version: latest
use-mamba: true
add-pip-as-python-dependency: true
4 changes: 2 additions & 2 deletions .github/workflows/package_wheel_test.yml
@@ -54,11 +54,11 @@ jobs:

- name: Setup Mamba
if: matrix.cuda != ''
uses: conda-incubator/setup-miniconda@v2.3.0
uses: conda-incubator/setup-miniconda@v3
with:
activate-environment: "ktransformers"
python-version: ${{ matrix.pyver }}
miniforge-variant: Mambaforge
miniforge-variant: Miniforge3
miniforge-version: latest
use-mamba: true
add-pip-as-python-dependency: true
5 changes: 4 additions & 1 deletion .gitignore
@@ -18,4 +18,7 @@ compile_commands.json
ktransformers/server/local_store/
ktransformers/server_test1.db
*.patch
img/
img/
tmp1.txt
test_65_300_1536.txt
test.txt
2 changes: 1 addition & 1 deletion Makefile
@@ -17,5 +17,5 @@ dev_install:
pip install -r requirements-local_chat.txt

echo "Installing ktransformers"
KTRANSFORMERS_FORCE_BUILD=TRUE pip install -e . --no-build-isolation
KTRANSFORMERS_FORCE_BUILD=TRUE pip install -e . -v --no-build-isolation
echo "Installation completed successfully"
63 changes: 39 additions & 24 deletions README.md
@@ -23,61 +23,76 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin

<h2 id="Updates">🔥 Updates</h2>

* **Feb 10, 2025**: Support DeepSeek-R1 and V3 on single (24GB VRAM)/multi-GPU setups and 382GB DRAM, up to 3~64x speedup. The detailed tutorial is [here](./doc/en/DeepseekR1_V3_tutorial.md).
* **Aug 28, 2024**: Support 1M context under the InternLM2.5-7B-Chat-1M model, utilizing 24GB of VRAM and 150GB of DRAM. The detailed tutorial is [here](./doc/en/long_context_tutorial.md).
* **Aug 28, 2024**: Decrease DeepseekV2's required VRAM from 21G to 11G.
* **Aug 15, 2024**: Update detailed [TUTORIAL](doc/en/injection_tutorial.md) for injection and multi-GPU.
* **Aug 14, 2024**: Support llamafile as linear backend.
* **Aug 12, 2024**: Support multiple GPU; Support new model: mixtral 8\*7B and 8\*22B; Support q2k, q3k, q5k dequant on gpu.
* **Aug 9, 2024**: Support windows native.

<h2 id="show-cases">🔥 Show Cases</h2>
<h3>1M Context Local Inference on a Desktop with Only 24GB VRAM</h3>
<p align="center">
<h2 id="show-cases">🌟 Show Cases</h2>

https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12
<div>
<h3>GPT-4/o1-level Local VSCode Copilot on a Desktop with only 24GB VRAM</h3>
</div>

* **1M Context InternLM 2.5 7B**: Operates at full bf16 precision, utilizing 24GB VRAM and 150GB DRAM, which is feasible on a local desktop setup. It achieves a 92.88% success rate on the 1M "Needle In a Haystack" test and 100% on the 128K NIAH test.
https://github.com/user-attachments/assets/ebd70bfa-b2c1-4abb-ae3b-296ed38aa285

<p align="center">
<picture>
<img alt="Single Needle Retrieval 128K" src="./doc/assets/needle_128K.png" width=100%>
</picture>
</p>

- **[NEW!!!] Local 671B DeepSeek-Coder-V3/R1:** Running its Q4_K_M version using only 14GB VRAM and 382GB DRAM.
- Prefill Speed (tokens/s):
- KTransformers: 54.21 (32 cores) → 74.362 (dual-socket, 2×32 cores) → 255.26 (optimized AMX-based MoE kernel, V0.3 only) → 286.55 (selectively using 6 experts, V0.3 only)
- Compared to 4.51 tokens/s in llama.cpp with 2×32 cores, achieving up to **63.53× speedup**.
- Decode Speed (tokens/s):
- KTransformers: 8.73 (32 cores) → 11.26 (dual-socket, 2×32 cores) → 13.69 (selectively using 6 experts, V0.3 only)
- Compared to 4.51 tokens/s in llama.cpp with 2×32 cores, achieving up to **3.03× speedup**.
- Upcoming Open Source Release:
- AMX optimizations and selective expert activation will be open-sourced in V0.3.
- Currently available only in preview binary distribution, which can be downloaded [here](https://github.com/kvcache-ai/ktransformers/releases/download/v0.1.4/ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl).

- **Local 236B DeepSeek-Coder-V2:** Running its Q4_K_M version using only 21GB VRAM and 136GB DRAM, attainable on a local desktop machine, which scores even better than GPT4-0613 in [BigCodeBench](https://huggingface.co/blog/leaderboard-bigcodebench).

<p align="center">
<picture>
<img alt="Single Needle Retrieval 1000K" src="./doc/assets/needle_1M.png" width=100%>
<img alt="DeepSeek-Coder-V2 Score" src="https://github.com/user-attachments/assets/d052924e-8631-44de-aad2-97c54b965693" width=100%>
</picture>
</p>

* **Enhanced Speed**: Reaches 16.91 tokens/s for generation with a 1M context using sparse attention, powered by llamafile kernels. This method is over 10 times faster than the full-attention approach of llama.cpp.

* **Flexible Sparse Attention Framework**: Offers a flexible block sparse attention framework for CPU offloaded decoding. Compatible with SnapKV, Quest, and InfLLm. Further information is available [here](./doc/en/long_context_introduction.md).
- **Faster Speed:** Achieving 126 tokens/s for 2K prompt prefill and 13.6 tokens/s for generation through MoE offloading and injecting advanced kernels from [Llamafile](https://github.com/Mozilla-Ocho/llamafile/tree/main) and [Marlin](https://github.com/IST-DASLab/marlin).
- **VSCode Integration:** Wrapped into an OpenAI and Ollama compatible API for seamless integration as a backend for [Tabby](https://github.com/TabbyML/tabby) and various other frontends.

<div>
<h3>GPT-4-level Local VSCode Copilot on a Desktop with only 24GB VRAM</h3>
</div>
<p align="center">

https://github.com/user-attachments/assets/0b9fa2da-66f0-48eb-b4b9-f0e1f06f8927
https://github.com/user-attachments/assets/4c6a8a38-05aa-497d-8eb1-3a5b3918429c

</p>

- **Local 236B DeepSeek-Coder-V2:** Running its Q4_K_M version using only 21GB VRAM and 136GB DRAM, attainable on a local desktop machine, which scores even better than GPT4-0613 in [BigCodeBench](https://huggingface.co/blog/leaderboard-bigcodebench).
<h3>1M Context Local Inference on a Desktop with Only 24GB VRAM</h3>
<p align="center">

https://github.com/user-attachments/assets/a865e5e4-bca3-401e-94b8-af3c080e6c12

* **1M Context InternLM 2.5 7B**: Operates at full bf16 precision, utilizing 24GB VRAM and 150GB DRAM, which is feasible on a local desktop setup. It achieves a 92.88% success rate on the 1M "Needle In a Haystack" test and 100% on the 128K NIAH test.

<p align="center">
<picture>
<img alt="DeepSeek-Coder-V2 Score" src="https://github.com/user-attachments/assets/d052924e-8631-44de-aad2-97c54b965693" width=100%>
<img alt="Single Needle Retrieval 128K" src="./doc/assets/needle_128K.png" width=100%>
</picture>
</p>

- **Faster Speed:** Achieving 126 tokens/s for 2K prompt prefill and 13.6 tokens/s for generation through MoE offloading and injecting advanced kernels from [Llamafile](https://github.com/Mozilla-Ocho/llamafile/tree/main) and [Marlin](https://github.com/IST-DASLab/marlin).
- **VSCode Integration:** Wrapped into an OpenAI and Ollama compatible API for seamless integration as a backend for [Tabby](https://github.com/TabbyML/tabby) and various other frontends.

<p align="center">
<picture>
<img alt="Single Needle Retrieval 1000K" src="./doc/assets/needle_1M.png" width=100%>
</picture>
</p>

* **Enhanced Speed**: Reaches 16.91 tokens/s for generation with a 1M context using sparse attention, powered by llamafile kernels. This method is over 10 times faster than the full-attention approach of llama.cpp.

* **Flexible Sparse Attention Framework**: Offers a flexible block sparse attention framework for CPU offloaded decoding. Compatible with SnapKV, Quest, and InfLLm. Further information is available [here](./doc/en/long_context_introduction.md).

https://github.com/user-attachments/assets/4c6a8a38-05aa-497d-8eb1-3a5b3918429c

</p>

<strong>More advanced features are coming soon, so stay tuned!</strong>

139 changes: 139 additions & 0 deletions doc/en/DeepseekR1_V3_tutorial.md
@@ -0,0 +1,139 @@
# GPT-4/o1-level Local VSCode Copilot on a Desktop with only 24GB VRAM
# SUMMARY

> **Feb 10, 2025**: Support DeepSeek-R1 and V3 on single (24GB VRAM)/multi-GPU setups and 382GB DRAM, up to 3~64x speedup.<br>
Hi, we're the KTransformers team (formerly known for our local CPU/GPU hybrid inference open source project with DeepSeek-V2).

We've heard your requests for DeepSeek-R1/V3 support—and we're excited to finally deliver!
Apologies for the wait, but we've been cooking up something truly amazing!

Today, we're proud to announce support for DeepSeek-R1/V3, as showcased in the video below:

https://github.com/user-attachments/assets/ebd70bfa-b2c1-4abb-ae3b-296ed38aa285


- **[NEW!!!] Local 671B DeepSeek-Coder-V3/R1:** Running its Q4_K_M version using only 14GB VRAM and 382GB DRAM.
- Prefill Speed (tokens/s):
- KTransformers: 54.21 (32 cores) → 74.362 (dual-socket, 2×32 cores) → 255.26 (optimized AMX-based MoE kernel, V0.3 only) → 286.55 (selectively using 6 experts, V0.3 only)
- Compared to 4.51 tokens/s in llama.cpp with 2×32 cores, achieving up to **63.53× speedup**.
- Decode Speed (tokens/s):
- KTransformers: 8.73 (32 cores) → 11.26 (dual-socket, 2×32 cores) → 13.69 (selectively using 6 experts, V0.3 only)
- Compared to 4.51 tokens/s in llama.cpp with 2×32 cores, achieving up to **3.03× speedup**.


We also preview our upcoming optimizations, including an Intel AMX-accelerated kernel and a selective expert activation method, which will significantly enhance performance. With V0.3-preview, we achieve up to 286 tokens/s for prefill, making it up to **64× faster than llama.cpp** for local inference.
The binary distribution is available now and the source code will follow as soon as possible! Check out the wheel package [here](https://github.com/kvcache-ai/ktransformers/releases/download/v0.1.4/ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl).


## Prerequisites
We ran our best performance tests (V0.2) on <br>
CPU: Intel(R) Xeon(R) Gold 6454S, 1TB DRAM (2 NUMA nodes) <br>
GPU: 4090D, 24GB VRAM <br>
## Bench Result
### V0.2
#### Settings
- Model: DeepseekV3-q4km (int4)<br>
- CPU: Intel(R) Xeon(R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
- GPU: 4090D, 24GB VRAM
- We test after sufficient warm-up
#### Memory consumption:
- Single socket: 382GB DRAM, at least 14GB VRAM
- Dual socket: 1TB DRAM, at least 14GB VRAM

#### Benchmark Results

"6 experts" case is part of V0.3's preview

| Prompt<br>(500 tokens) | Dual socket KTrans (6 experts) | Dual socket KTrans (8 experts) | Single socket KTrans (6 experts) | Single socket KTrans (8 experts) | llama.cpp (8 experts) |
| --- | --- | --- | --- | --- | --- |
| Prefill token/s | 97.32 | 82.94 | 65.14 | 54.21 | 10.31 |
| Decode token/s | 13.69 | 12.208 | 10.303 | 8.73 | 4.51 |

**The highest speedup reaches up to <u>3.03x</u> in decoding and <u>9.44x</u> in prefill.**
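
As a quick sanity check, these ratios follow directly from the table above. A minimal Python sketch (figures copied from the benchmark rows, not re-measured):

```python
# V0.2 speedups over llama.cpp, using the 500-token benchmark table above.
llama_cpp = {"prefill": 10.31, "decode": 4.51}        # tokens/s, 8 experts
ktrans_best = {"prefill": 97.32, "decode": 13.69}     # tokens/s, dual socket, 6 experts

for phase in ("prefill", "decode"):
    print(f"{phase}: {ktrans_best[phase] / llama_cpp[phase]:.3f}x")
# prefill: 9.439x
# decode: 3.035x
```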

### V0.3-Preview
#### Settings
- Model: DeepseekV3-BF16 (quantized online to int8 for CPU and int4 for GPU)
- CPU: Intel(R) Xeon(R) Gold 6454S, 32 cores per socket, 2 sockets, 2 NUMA nodes
- GPU: (1~4)x 4090D, 24GB VRAM (longer prompts require more VRAM)

#### Memory consumption:
- 644GB DRAM, at least 14GB VRAM

#### Benchmark results
| Prompt length | 1K | 2K | 4K | 8K |
|---------------|-----|-----|-----|-----|
| KTrans (8 experts) Prefill token/s | 185.96 | 255.26 | 252.58 | 195.62 |
| KTrans (6 experts) Prefill token/s | 203.70 | 286.55 | 271.08 | 207.20 |

**The prefill speed of KTrans V0.3 is up to <u>3.45x</u> faster than KTrans V0.2, and up to <u>63.53x</u> faster than llama.cpp.**
**The decoding speed is the same as KTrans V0.2 (6-expert version), so it is omitted.**
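
The same kind of quick check for the V0.3-preview prefill claims, again only a sketch over figures already quoted above (best prefill numbers: 2K prompt for V0.3, 500-token prompt for V0.2; 4.51 tokens/s is the llama.cpp figure used in the 63.53x claim):

```python
# V0.3-preview prefill compared with V0.2 and with llama.cpp.
v03_prefill = 286.55        # KTrans V0.3-preview, 6 experts, 2K prompt
v02_prefill = 82.94         # KTrans V0.2, dual socket, 8 experts, 500-token prompt
llama_cpp_baseline = 4.51   # llama.cpp figure quoted in the summary above

print(f"{v03_prefill / v02_prefill:.2f}x vs KTrans V0.2")       # 3.45x
print(f"{v03_prefill / llama_cpp_baseline:.2f}x vs llama.cpp")  # 63.54x (quoted above as 63.53x)
```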

The main acceleration comes from:
- The Intel AMX instruction set and our specially designed cache-friendly memory layout
- An expert selection strategy that activates fewer experts, based on offline profiling results on out-of-domain data


*From our research on DeepSeek-V2, DeepSeek-V3 and DeepSeek-R1:
when we slightly decrease the number of activated experts during inference,
the output quality does not change, but decoding and prefill speed up,
which is encouraging. Our showcase makes use of this finding, as sketched below.*
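
To make the idea concrete, here is a minimal PyTorch sketch of top-k expert routing with a reduced k. This is not the ktransformers or DeepSeek implementation; the 256-expert width, softmax gating, and function name are illustrative assumptions only.

```python
import torch

def route(router_logits: torch.Tensor, top_k: int = 8):
    """Pick the top_k routed experts per token and renormalize their gate weights.

    Illustrative only: DeepSeek-V3 scores many routed experts per layer and
    activates 8 of them per token by default; dropping top_k to 6 removes two
    expert GEMMs per token, which is where the extra prefill/decode speed comes from.
    """
    probs = torch.softmax(router_logits, dim=-1)
    gates, expert_ids = torch.topk(probs, k=top_k, dim=-1)
    gates = gates / gates.sum(dim=-1, keepdim=True)  # kept gates still sum to 1
    return gates, expert_ids

# 4 tokens scored against 256 routed experts; activate 6 instead of the default 8.
logits = torch.randn(4, 256)
gates, ids = route(logits, top_k=6)
print(gates.shape, ids.shape)  # torch.Size([4, 6]) torch.Size([4, 6])
```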

## How to Run
### V0.2 Showcase
#### Single socket version (32 cores)
Our local_chat test command is:
``` shell
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
numactl -N 1 -m 1 python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer 33 --cache_lens 1536
<when you see the chat prompt, press Enter to load the text from prompt_file>
```
\<your model path\> can be a local path or an online Hugging Face repo ID such as deepseek-ai/DeepSeek-V3. If you hit connection problems when downloading, try the mirror (hf-mirror.com). <br>
\<your gguf path\> can also be online, but as it is large we recommend downloading it and quantizing the model to the format you want. <br>
The `numactl -N 1 -m 1` prefix avoids data transfer between NUMA nodes.
#### Dual socket version (64 cores)
Make sure that before you install (using install.sh or `make dev_install`) you set the env var `USE_NUMA=1`, e.g. with `export USE_NUMA=1` (if ktransformers is already installed, reinstall it with this env var set). <br>
Our local_chat test command is:
``` shell
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
export USE_NUMA=1
make dev_install # or sh ./install.sh
python ./ktransformers/local_chat.py --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer 65 --cache_lens 1536
<when you see the chat prompt, press Enter to load the text from prompt_file>
```
The parameters have the same meaning. Since we use dual sockets, we set `--cpu_infer` to 65.

### V0.3 Showcase
#### Dual socket version (64 cores)
Our local_chat test command is:
``` shell
wget https://github.com/kvcache-ai/ktransformers/releases/download/v0.1.4/ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl
pip install ./ktransformers-0.3.0rc0+cu126torch26fancy-cp311-cp311-linux_x86_64.whl
python -m ktransformers.local_chat --model_path <your model path> --gguf_path <your gguf path> --prompt_file <your prompt txt file> --cpu_infer 65 --cache_lens 1536
<when you see the chat prompt, press Enter to load the text from prompt_file>
```
The parameters have the same meaning as in V0.2. Since we use dual sockets, we set `--cpu_infer` to 65.

## Some Explanations
1. We also want to make further use of the two NUMA nodes on the Xeon Gold CPU.
To avoid the cost of data transfer between nodes, we "copy" the critical matrices onto
both nodes, which consumes more memory but accelerates the prefill and decoding process.
This method uses a lot of memory and is slow when loading weights, so be patient during loading
and monitor the memory usage. We are going to optimize this memory overhead; stay tuned~ <br>
2. The command argument `--cpu_infer 65` specifies how many cores to use (it may exceed the physical count,
but more is not always better; adjust it to slightly below your actual number of cores). A sketch of the
thread-to-NUMA-node mapping appears after this list.<br>

3. Why CPU/GPU Hybrid Inference?
DeepSeek's MLA operators are highly computationally intensive. While running everything on CPU is possible, offloading the heavy computations to the GPU results in a massive performance boost.

4. Where Does the Speedup Come From?

- Expert Offload: Unlike traditional layer-based or KVCache offloading (as seen in llama.cpp), we offload the expert computation to the CPU and MLA/KVCache to GPU, aligning perfectly with DeepSeek’s architecture for optimal efficiency.
- Intel AMX Optimization – Our AMX-accelerated kernel is meticulously tuned, running several times faster than existing llama.cpp implementations. We plan to open-source this kernel after cleaning it up and are considering upstream contributions to llama.cpp.

5. Why Intel CPUs?
Intel is currently the only CPU vendor that supports AMX-like instructions, which deliver significantly better performance than AVX-only alternatives.
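
As referenced in point 2, here is a small Python sketch of the thread-to-NUMA-node mapping used by the `USE_NUMA` path added to `cpu_backend/backend.cpp` in this PR. The C++ code binds each worker thread with `numa_bind`; this sketch only reproduces the index arithmetic, and it assumes `--cpu_infer` maps one-to-one to backend worker threads.

```python
def numa_node_for_thread(thread_id: int, num_threads: int, num_nodes: int) -> int:
    # Mirrors `thread_id * numa_num_configured_nodes() / thread_num_` in backend.cpp:
    # threads are split into contiguous blocks, one block per NUMA node.
    return thread_id * num_nodes // num_threads

# 65 worker threads (--cpu_infer 65) on a 2-node machine: 33 bind to node 0, 32 to node 1.
assignment = [numa_node_for_thread(t, num_threads=65, num_nodes=2) for t in range(65)]
print(assignment.count(0), assignment.count(1))  # 33 32
```
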
6 changes: 3 additions & 3 deletions ktransformers/__init__.py
@@ -5,7 +5,7 @@
Author : kkk1nak0
Date : 2024-08-15 07:34:46
Version : 1.0.0
LastEditors : Azure-Tang
LastEditTime : 2024-08-29 22:35:51
LastEditors : unicornchan
LastEditTime : 2025-02-10 00:59:53
'''
__version__ = "0.1.4"
__version__ = "0.2.0"
2 changes: 1 addition & 1 deletion ktransformers/configs/config.yaml
@@ -54,4 +54,4 @@ long_context:
token_step:

local_chat:
prompt_file: "./ktransformers/p.txt"
prompt_file: ""
21 changes: 21 additions & 0 deletions ktransformers/ktransformers_ext/CMakeLists.txt
@@ -230,3 +230,24 @@ elseif(UNIX)
endif()
target_link_libraries(${PROJECT_NAME} PRIVATE "$ENV{CUDA_HOME}/lib64/libcudart.so")
endif()

# Define the USE_NUMA option
option(USE_NUMA "Enable NUMA support" OFF)
# Check if the USE_NUMA environment variable is set
if(DEFINED ENV{USE_NUMA})
set(USE_NUMA ON)
endif()
if (USE_NUMA)
message(STATUS "NUMA support is enabled")
else()
message(STATUS "NUMA support is disabled")
endif()

find_library(NUMA_LIBRARY NAMES numa)
if (NUMA_LIBRARY AND USE_NUMA)
message(STATUS "NUMA library found: ${NUMA_LIBRARY} - enabling NUMA support")
target_link_libraries(${PROJECT_NAME} PRIVATE ${NUMA_LIBRARY})
target_compile_definitions(${PROJECT_NAME} PRIVATE USE_NUMA)
else()
message(STATUS "NUMA library not found or user not set USE_NUMA - disabling NUMA support")
endif()
17 changes: 17 additions & 0 deletions ktransformers/ktransformers_ext/cpu_backend/backend.cpp
@@ -10,6 +10,13 @@

#include "backend.h"

#ifdef USE_NUMA
#include <numa.h>
#include <numaif.h>

thread_local int Backend::numa_node = -1;
#endif

thread_local int Backend::thread_local_id = -1;

Backend::Backend(int max_thread_num) {
@@ -74,6 +81,16 @@ void Backend::do_work_stealing_job(int task_num,
}

void Backend::process_tasks(int thread_id) {

#ifdef USE_NUMA
if(numa_node == -1){
numa_node = thread_id * numa_num_configured_nodes() / thread_num_;
struct bitmask* mask = numa_bitmask_alloc(numa_num_configured_nodes());
numa_bitmask_setbit(mask, numa_node);
numa_bind(mask);
}
#endif

if (init_func_ != nullptr) {
init_func_(thread_id);
}
3 changes: 3 additions & 0 deletions ktransformers/ktransformers_ext/cpu_backend/backend.h
@@ -38,6 +38,9 @@ class Backend {
void do_work_stealing_job(int, std::function<void(int)>,
std::function<void(int)>,
std::function<void(int)>);
#ifdef USE_NUMA
static thread_local int numa_node;
#endif
static thread_local int thread_local_id;

private: