[Bug] Deployment of Llama3.1-70b getting stuck #2724

Open

pulkitmehtaworkmetacube opened this issue Nov 7, 2024 · 7 comments
@pulkitmehtaworkmetacube

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

We are trying to deploy Llama-3.1-70B on GCP with the specs below.
GPU - 2 x NVIDIA A100 80GB
Machine Type - a2-ultragpu-2g (350 GB RAM)
SSD - 2TB

Command we used for deployment: lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct --tp 2
During deployment, we get stuck at
Fetching 42 files: 100%|████████████████████████████████████████████████████████████████████| 42/42 [00:00<00:00, 12190.21it/s]
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo

It stays stuck here for hours without any other error. We also checked GPU and CPU usage. Please suggest.

$ free -g
               total        used        free      shared  buff/cache   available
Mem:             334           0         200           0         133         330
Swap:              0           0           0

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:00:05.0 Off |                    0 |
| N/A   35C    P0             94W /  400W |   76649MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:00:06.0 Off |                    0 |
| N/A   34C    P0             69W /  400W |   80733MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

Reproduction

lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct --tp 2
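
For reference, a minimal sketch of the same load through lmdeploy's Python pipeline API (assuming lmdeploy >= 0.6 and the same model ID). If this also hangs right after the gemm_config.in warnings, the problem is in the TurboMind engine start-up rather than in the api_server layer.

# Sketch only: load the model with the TurboMind backend, tensor-parallel across 2 GPUs.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_config = TurbomindEngineConfig(
    tp=2,                       # split the 70B weights across both A100 80GB cards
    cache_max_entry_count=0.8,  # default fraction of free GPU memory used for the KV cache
)

pipe = pipeline('meta-llama/Llama-3.1-70B-Instruct', backend_config=engine_config)
print(pipe('Hello'))            # a single short prompt to confirm the engine actually serves requests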

Environment

GPU - 2 x NVIDIA A100 80GB
Machine Type - a2-ultragpu-2g (350 GB RAM)
SSD - 2TB

Error traceback

During deployment, we get stuck at
Fetching 42 files: 100%|████████████████████████████████████████████████████████████████████| 42/42 [00:00<00:00, 12190.21it/s]
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
@zhyncs
Collaborator

zhyncs commented Nov 7, 2024

Use the latest version.

@pulkitmehtaworkmetacube
Author

@zhyncs Currently using LMDEPLOY_VERSION=0.6.2
Driver Version: 550.90.07
CUDA Version: 12.4

@pulkitmehtaworkmetacube
Author

We tried the latest version. With tp 1, we get a CUDA out-of-memory error. We also observed that with tp 2, the memory on the 2nd GPU was not getting used. Please suggest.
nvidia-smi
Thu Nov 7 11:12:24 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:00:05.0 Off |                    0 |
| N/A   39C    P0             98W /  400W |   76649MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:00:06.0 Off |                    0 |
| N/A   34C    P0             72W /  400W |   80733MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3304      C   /opt/conda/envs/lmdeploy/bin/python3.8      76638MiB |
|    1   N/A  N/A      3304      C   /opt/conda/envs/lmdeploy/bin/python3.8      80722MiB |
+-----------------------------------------------------------------------------------------+
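
Since the tp 2 start-up depends on cross-GPU communication, one quick sanity check (a sketch, not something taken from this thread's logs) is to confirm that both devices are visible to the process and that peer-to-peer access between them works:

import torch  # PyTorch is already used by the lmdeploy environment

print(torch.cuda.device_count())                # expect 2 on this a2-ultragpu-2g machine
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))     # both should report NVIDIA A100-SXM4-80GB
print(torch.cuda.can_device_access_peer(0, 1))  # peer access between the two A100s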

@lvhan028
Collaborator

lvhan028 commented Nov 7, 2024

Please upgrade to v0.6.2.post1.
Also append --log-level INFO when starting the service. Let's check the log.

@lvhan028 lvhan028 self-assigned this Nov 7, 2024
@jatin-wald

jatin-wald commented Nov 7, 2024

$ lmdeploy serve api_server meta-llama/Llama-3.1-70B-Instruct --tp 2 --dtype float16 --log-level INFO

Fetching 42 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████| 42/42 [00:00<00:00, 4813.00it/s]
2024-11-07 12:04:20,426 - lmdeploy - INFO - async_engine.py:142 - input backend=turbomind, backend_config=TurbomindEngineConfig(dtype='float16', model_format=None, tp=2, session_len=None, max_batch_size=256, cache_max_entry_count=0.8, cache_chunk_size=-1, cache_block_seq_len=64, enable_prefix_caching=False, quant_policy=0, rope_scaling_factor=0.0, use_logn_attn=False, download_dir=None, revision=None, max_prefill_token_num=8192, num_tokens_per_iter=0, max_prefill_iters=1)
2024-11-07 12:04:20,426 - lmdeploy - INFO - async_engine.py:144 - input chat_template_config=None
2024-11-07 12:04:20,439 - lmdeploy - INFO - async_engine.py:154 - updated chat_template_onfig=ChatTemplateConfig(model_name='llama3_1', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, separator=None, capability=None, stop_words=None)
2024-11-07 12:04:20,439 - lmdeploy - INFO - turbomind.py:301 - model_source: hf_model
2024-11-07 12:04:21,556 - lmdeploy - INFO - turbomind.py:200 - turbomind model config:

{
"model_config": {
"model_name": "",
"chat_template": "",
"model_arch": "LlamaForCausalLM",
"head_num": 64,
"kv_head_num": 8,
"hidden_units": 8192,
"vocab_size": 128256,
"num_layer": 80,
"inter_size": 28672,
"norm_eps": 1e-05,
"attn_bias": 0,
"start_id": 128000,
"end_id": 128009,
"size_per_head": 128,
"group_size": 128,
"weight_type": "float16",
"session_len": 131072,
"tp": 2,
"model_format": "hf",
"expert_num": 0,
"expert_inter_size": 0,
"experts_per_token": 0
},
"attention_config": {
"rotary_embedding": 128,
"rope_theta": 500000.0,
"max_position_embeddings": 131072,
"original_max_position_embeddings": 8192,
"rope_scaling_type": "llama3",
"rope_scaling_factor": 8.0,
"use_dynamic_ntk": 0,
"low_freq_factor": 1.0,
"high_freq_factor": 4.0,
"use_logn_attn": 0,
"cache_block_seq_len": 64
},
"lora_config": {
"lora_policy": "",
"lora_r": 0,
"lora_scale": 0.0,
"lora_max_wo_r": 0,
"lora_rank_pattern": "",
"lora_scale_pattern": ""
},
"engine_config": {
"dtype": "float16",
"model_format": null,
"tp": 2,
"session_len": null,
"max_batch_size": 256,
"cache_max_entry_count": 0.8,
"cache_chunk_size": -1,
"cache_block_seq_len": 64,
"enable_prefix_caching": false,
"quant_policy": 0,
"rope_scaling_factor": 0.0,
"use_logn_attn": false,
"download_dir": null,
"revision": null,
"max_prefill_token_num": 8192,
"num_tokens_per_iter": 8192,
"max_prefill_iters": 16
}
}
[TM][WARNING] [LlamaTritonModel] max_context_token_num is not set, default to 131072.
[TM][INFO] Model:
head_num: 64
kv_head_num: 8
size_per_head: 128
inter_size: 28672
num_layer: 80
vocab_size: 128256
attn_bias: 0
max_batch_size: 256
max_prefill_token_num: 8192
max_context_token_num: 131072
num_tokens_per_iter: 8192
max_prefill_iters: 16
session_len: 131072
cache_max_entry_count: 0.8
cache_block_seq_len: 64
cache_chunk_size: -1
enable_prefix_caching: 0
start_id: 128000
tensor_para_size: 2
pipeline_para_size: 1
enable_custom_all_reduce: 0
model_name:
model_dir:
quant_policy: 0
group_size: 128
expert_num: 0
expert_per_token: 0
moe_method: 1

[TM][INFO] TM_FUSE_SILU_ACT=1
2024-11-07 12:04:22,680 - lmdeploy - WARNING - turbomind.py:231 - get 965 model params
[TM][INFO] [LlamaWeight::prepare] workspace size: 469762048

[TM][INFO] [LlamaWeight::prepare] workspace size: 469762048

[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] [BlockManager] block_size = 10 MB
[TM][INFO] [BlockManager] max_block_count = 534
[TM][INFO] [BlockManager] chunk_size = 534
[TM][INFO] [BlockManager] block_size = 10 MB
[TM][INFO] [BlockManager] max_block_count = 534
[TM][INFO] [BlockManager] chunk_size = 534
[TM][WARNING] No enough blocks for session_len (131072), session_len truncated to 34176.
[TM][INFO] LlamaBatch::Start()
[TM][INFO] LlamaBatch::Start()
[TM][INFO] [Gemm2] Tuning sequence: 8, 16, 32, 48, 64, 96, 128, 192, 256, 384, 512, 768, 1024, 1536, 2048, 3072, 4096, 6144, 8192
[TM][INFO] [Gemm2] 8
[TM][INFO] [Gemm2] 16
[TM][INFO] [Gemm2] 32
[TM][INFO] [Gemm2] 48
[TM][INFO] [Gemm2] 64
[TM][INFO] [Gemm2] 96
[TM][INFO] [Gemm2] 128
[TM][INFO] [Gemm2] 192
[TM][INFO] [Gemm2] 256
[TM][INFO] [Gemm2] 384
[TM][INFO] [Gemm2] 512
[TM][INFO] [Gemm2] 768
[TM][INFO] [Gemm2] 1024
[TM][INFO] [Gemm2] 1536
[TM][INFO] [Gemm2] 2048
[TM][INFO] [Gemm2] 3072
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][INFO] [InternalThreadEntry] stop requested.
[TM][WARNING] pointer_mapping_ does not have information of ptr at 0x2d43d5f200.

GPU USAGE

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     10138      C   /opt/conda/envs/lmdeploy/bin/python3.8      75230MiB |
|    1   N/A  N/A     10138      C   /opt/conda/envs/lmdeploy/bin/python3.8      80716MiB |
+-----------------------------------------------------------------------------------------+
Thu Nov  7 12:20:43 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:00:05.0 Off |                    0 |
| N/A   39C    P0             99W /  400W |   75241MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:00:06.0 Off |                    0 |
| N/A   35C    P0             72W /  400W |   80727MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A     10138      C   /opt/conda/envs/lmdeploy/bin/python3.8      75230MiB |
|    1   N/A  N/A     10138      C   /opt/conda/envs/lmdeploy/bin/python3.8      80716MiB |
+-----------------------------------------------------------------------------------------+
Thu Nov  7 12:20:49 2024

@jatin-wald

Any luck anyone?

@lzhangzz
Collaborator

So you get this log just by starting the server, without sending any requests? This is more likely to be caused by the bug in v0.6.2 (not v0.6.2.post1).
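
A quick way to confirm which build is actually installed in the serving environment (a sketch using only the standard library, so it works on the Python 3.8 conda env shown in nvidia-smi):

from importlib.metadata import version

# Expected to print 0.6.2.post1 after the suggested upgrade; 0.6.2 would still contain the bug mentioned above.
print(version('lmdeploy'))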
