[Bug] Deployment of Llama3.1-70b getting stuck #2724

Open

pulkitmehtaworkmetacube opened this issue Nov 7, 2024 · 7 comments
@pulkitmehtaworkmetacube

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

We are trying to deploy Llama-3.1-70B on GCP with the specs below.
GPU - 2 x NVIDIA A100 80GB
Machine Type - a2-ultragpu-2g (350 GB RAM)
SSD - 2TB

Command we used for deployment: lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct --tp 2
During deployment, we get stuck at
Fetching 42 files: 100%|████████████████████████████████████████████████████████████████████| 42/42 [00:00<00:00, 12190.21it/s]
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo

It stays stuck here for hours without any other error. We also checked GPU and CPU usage. Please suggest.

$ free -g
               total        used        free      shared  buff/cache   available
Mem:             334           0         200           0         133         330
Swap:              0           0           0

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:00:05.0 Off |                    0 |
| N/A   35C    P0             94W /  400W |   76649MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:00:06.0 Off |                    0 |
| N/A   34C    P0             69W /  400W |   80733MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

Reproduction

lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct --tp 2
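
For reference, a minimal sketch of the same load through lmdeploy's Python pipeline API (assuming lmdeploy >= 0.6 and the same model ID). If this also hangs right after the gemm_config.in warnings, the problem is in the TurboMind engine start-up rather than in the api_server layer.

# Sketch only: load the model with the TurboMind backend, tensor-parallel across 2 GPUs.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    tp=2,                       # split the 70B weights across both A100 80GB cards
    cache_max_entry_count=0.8,  # default fraction of free GPU memory used for the KV cache
)

pipe = pipeline('meta-llama/Llama-3.1-70B-Instruct', backend_config=engine_config)
print(pipe('Hello'))            # a single short prompt to confirm the engine actually serves requests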

Environment

GPU - 2 x NVIDIA A100 80GB
Machine Type - a2-ultragpu-2g (350 GB RAM)
SSD - 2TB

Error traceback

During deployment, we get stuck at
Fetching 42 files: 100%|████████████████████████████████████████████████████████████████████| 42/42 [00:00<00:00, 12190.21it/s]
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
@zhyncs
Collaborator

zhyncs commented Nov 7, 2024

Use the latest version.

@pulkitmehtaworkmetacube
Author

@zhyncs Currently using LMDEPLOY_VERSION=0.6.2
Driver Version: 550.90.07
CUDA Version: 12.4

@pulkitmehtaworkmetacube
Author

We tried the latest version. With tp 1, we get a CUDA out-of-memory error. We also observed that with tp 2, the memory on the 2nd GPU was not getting used. Please suggest.
nvidia-smi
Thu Nov 7 11:12:24 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:00:05.0 Off |                    0 |
| N/A   39C    P0             98W /  400W |   76649MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:00:06.0 Off |                    0 |
| N/A   34C    P0             72W /  400W |   80733MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3304      C   /opt/conda/envs/lmdeploy/bin/python3.8      76638MiB |
|    1   N/A  N/A      3304      C   /opt/conda/envs/lmdeploy/bin/python3.8      80722MiB |
+-----------------------------------------------------------------------------------------+
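
Since the tp 2 start-up depends on cross-GPU communication, one quick sanity check (a sketch, not something taken from this thread's logs) is to confirm that both devices are visible to the process and that peer-to-peer access between them works:

import torch  # PyTorch is already used by the lmdeploy environment

print(torch.cuda.device_count())                # expect 2 on this a2-ultragpu-2g machine
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))     # both should report NVIDIA A100-SXM4-80GB
print(torch.cuda.can_device_access_peer(0, 1))  # peer access between the two A100s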

@lvhan028
Collaborator

lvhan028 commented Nov 7, 2024

Please upgrade to v0.6.2.post1.
Also append --log-level INFO when starting the service. Let's check the log.

@lvhan028 lvhan028 self-assigned this Nov 7, 2024
@jatin-wald

jatin-wald commented Nov 7, 2024

$ lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct --tp 2 --dtype float16 --log-level INFO

Fetching 42 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 42/42 [00:00<00:00, 4813.00it/s]
2024-11-07 12:04:20,426 - lmdeploy - INFO - async_engine.py:142 - input backend=turbomind, backend_config=TurbomindEngineConfig(dtype='float16', model_format=None, tp=2, session_len=None, max_batch_size=256, cache_max_entry_count=0.8, cache_chunk_size=-1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-11-07 12:04:20,426 - lmdeploy - INFO - async_engine.py:144 - input chat_template_config=None
2024-11-07 12:04:20,439 - lmdeploy - INFO - async_engine.py:154 - updated chat_template_onfig=ChatTemplateConfig(model_name='llama3_1', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-11-07 12:04:20,439 - lmdeploy - INFO - turbomind.py:301 - model_source: hf_model
2024-11-07 12:04:21,556 - lmdeploy - INFO - turbomind.py:200 - turbomind model config:

{
"model_config": {
"model_name": "",
"chat_template": "",
"model_arch": "LlamaForCausalLM",
"head_num": 64,
"kv_head_num": 8,
"hidden_units": 8192,
"vocab_size": 128256,
"num_layer": 80,
"inter_size": 28672,
"norm_eps": 1e-05,
"attn_bias": 0,
"start_id": 128000,
"end_id": 128009,
"size_per_head": 128,
"group_size": 128,
"weight_type": "float16",
"session_len": 131072,
"tp": 2,
"model_format": "hf",
"expert_num": 0,
"expert_inter_size": 0,
"experts_per_token": 0
},
"attention_config": {
"rotary_embedding": 128,
"rope_theta": 500000.0,
"max_position_embeddings": 131072,
"original_max_position_embeddings": 8192,
"rope_scaling_type": "llama3",
"rope_scaling_factor": 8.0,
"use_dynamic_ntk": 0,
"low_freq_factor": 1.0,
"high_freq_factor": 4.0,
"use_logn_attn": 0,
"cache_block_seq_len": 64
},
"lora_config": {
"lora_policy": "",
"lora_r": 0,
"lora_scale": 0.0,
"lora_max_wo_r": 0,
"lora_rank_pattern": "",
"lora_scale_pattern": ""
},
"engine_config": {
"dtype": "float16",
"model_format": null,
"tp": 2,
"session_len": null,
"max_batch_size": 256,
"cache_max_entry_count": 0.8,
"cache_chunk_size": -1,
"cache_block_seq_len": 64,
"enable_prefix_caching": false,
"quant_policy": 0,
"rope_scaling_factor": 0.0,
"use_logn_attn": false,
"download_dir": null,
"revision": null,
"max_prefill_token_num": 8192,
"num_tokens_per_iter": 8192,
"max_prefill_iters": 16
}
}
[TM][WARNING] [LlamaTritonModel] max_context_token_num is not set, default to 131072.
[TM][INFO] Model:
head_num: 64
kv_head_num: 8
size_per_head: 128
inter_size: 28672
num_layer: 80
vocab_size: 128256
attn_bias: 0
max_batch_size: 256
max_prefill_token_num: 8192
max_context_token_num: 131072
num_tokens_per_iter: 8192
max_prefill_iters: 16
session_len: 131072
cache_max_entry_count: 0.8
cache_block_seq_len: 64
cache_chunk_size: -1
enable_prefix_caching: 0
start_id: 128000
tensor_para_size: 2
pipeline_para_size: 1
enable_custom_all_reduce: 0
model_name:
model_dir:
quant_policy: 0
group_size: 128
expert_num: 0
expert_per_token: 0
moe_method: 1

[TM][INFO] TM_FUSE_SILU_ACT=1
2024-11-07 12:04:22,680 - lmdeploy - WARNING - turbomind.py:231 - get 965 model params
[TM][INFO] [LlamaWeight::prepare] workspace size: 469762048

[TM][INFO] [LlamaWeight::prepare] workspace size: 469762048

[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] [BlockManager] block_size = 10 MB
[TM][INFO] [BlockManager] max_block_count = 534
[TM][INFO] [BlockManager] chunk_size = 534
[TM][INFO] [BlockManager] block_size = 10 MB
[TM][INFO] [BlockManager] max_block_count = 534
[TM][INFO] [BlockManager] chunk_size = 534
[TM][WARNING] No enough blocks for session_len (131072), session_len truncated to 34176.
[TM][INFO] LlamaBatch::Start()
[TM][INFO] LlamaBatch::Start()
[TM][INFO] [Gemm2] Tuning sequence: 8, 16, 32, 48, 64, 96, 128, 192, 256, 384, 512, 768, 1024, 1536, 2048, 3072, 4096, 6144, 8192
[TM][INFO] [Gemm2] 8
[TM][INFO] [Gemm2] 16
[TM][INFO] [Gemm2] 32
[TM][INFO] [Gemm2] 48
[TM][INFO] [Gemm2] 64
[TM][INFO] [Gemm2] 96
[TM][INFO] [Gemm2] 128
[TM][INFO] [Gemm2] 192
[TM][INFO] [Gemm2] 256
[TM][INFO] [Gemm2] 384
[TM][INFO] [Gemm2] 512
[TM][INFO] [Gemm2] 768
[TM][INFO] [Gemm2] 1024
[TM][INFO] [Gemm2] 1536
[TM][INFO] [Gemm2] 2048
[TM][INFO] [Gemm2] 3072
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][WARNING] pointer_mapping_ does not have information of ptr at 0x2d43d5f200.

GPU USAGE

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     10138      C   /opt/conda/envs/lmdeploy/bin/python3.8      75230MiB |
|    1   N/A  N/A     10138      C   /opt/conda/envs/lmdeploy/bin/python3.8      80716MiB |
+-----------------------------------------------------------------------------------------+
Thu Nov  7 12:20:43 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:00:05.0 Off |                    0 |
| N/A   39C    P0             99W /  400W |   75241MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:00:06.0 Off |                    0 |
| N/A   35C    P0             72W /  400W |   80727MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     10138      C   /opt/conda/envs/lmdeploy/bin/python3.8      75230MiB |
|    1   N/A  N/A     10138      C   /opt/conda/envs/lmdeploy/bin/python3.8      80716MiB |
+-----------------------------------------------------------------------------------------+
Thu Nov  7 12:20:49 2024

@jatin-wald

Any luck anyone?

@lzhangzz
Collaborator

So you get this log just by starting the server, without sending any requests? This is more likely to be caused by the bug in v0.6.2 (not v0.6.2.post1).
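
A quick way to confirm which build is actually installed in the serving environment (a sketch using only the standard library, so it works on the Python 3.8 conda env shown in nvidia-smi):

from importlib.metadata import version

# Expected to print 0.6.2.post1 after the suggested upgrade; 0.6.2 would still contain the bug mentioned above.
print(version('lmdeploy'))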
