Single node inference benchmark of Llama 3.1 405B with NVIDIA TensorRT-LLM (TRT-LLM) on A3 Ultra GKE Node Pool
This recipe outlines the steps to benchmark inference of a Llama 3.1 405B model using NVIDIA TensorRT-LLM on an A3 Ultra GKE Node pool with a single node.
For this recipe, the following setup is used:
- Orchestration - Google Kubernetes Engine (GKE)
- Job configuration and deployment - A Helm chart is used to configure and deploy the Kubernetes Job. This job encapsulates the inference benchmark of the Llama 3.1 405B model using TensorRT-LLM. The chart generates the job's manifest, adhering to best practices for using RDMA over Converged Ethernet (RoCE) with Google Kubernetes Engine (GKE).
Before running this recipe, ensure your environment is configured as follows:
- A GKE cluster with the following setup:
- An A3 Ultra node pool (1 node, 8 GPUs)
- Topology-aware scheduling enabled
- An Artifact Registry repository to store the Docker image.
- A Google Cloud Storage (GCS) bucket to store results. Important: This bucket must be in the same region as the GKE cluster.
- A client workstation with the following pre-installed:
- Google Cloud SDK
- Helm
- kubectl
- To access the Llama 3.1 405B model through Hugging Face, you need a Hugging Face token. If you don't already have one, follow these steps to generate a new token:
- Create a Hugging Face account, if you don't already have one.
- Click Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a Name of your choice and a Role of at least Read.
- Select Generate a token.
- Copy the generated token to your clipboard.
- Get access to the Llama 3.1 405B model checkpoints from Hugging Face by signing in to your Hugging Face account and accepting the model's license for the Llama 3.1 405B model family. An optional check of this access is sketched after this list.
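To confirm that your token can actually read the gated model repository, you can query the Hugging Face API from your client. This is an optional check, sketched under the assumption that your token is exported as HF_TOKEN and that your account has been granted access to the Llama 3.1 405B repository:
export HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>
# An HTTP 200 response indicates the token can read the gated repository;
# 401 or 403 indicates a bad token or missing model access.
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Authorization: Bearer ${HF_TOKEN}" \
  https://huggingface.co/api/models/meta-llama/Llama-3.1-405B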
To prepare the required environment, see GKE environment setup guide.
It is recommended to use Cloud Shell as your client to complete the steps.
Cloud Shell comes pre-installed with the necessary utilities, including kubectl, the Google Cloud SDK, and Helm.
In the Google Cloud console, start a Cloud Shell Instance.
From your client, complete the following steps:
- Set the environment variables to match your environment:
export PROJECT_ID=<PROJECT_ID>
export REGION=<REGION>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET>
export ARTIFACT_REGISTRY=<ARTIFACT_REGISTRY>
export TRT_LLM_IMAGE=trtllm
export TRT_LLM_VERSION=0.16.0
Replace the following values:
- <PROJECT_ID>: your Google Cloud project ID
- <REGION>: the region where you want to run Cloud Build
- <CLUSTER_REGION>: the region where your cluster is located
- <CLUSTER_NAME>: the name of your GKE cluster
- <GCS_BUCKET>: the name of your Cloud Storage bucket. Do not include the gs:// prefix
- <ARTIFACT_REGISTRY>: the full name of your Artifact Registry in the following format: LOCATION-docker.pkg.dev/PROJECT_ID/REPOSITORY
- <TRT_LLM_IMAGE>: the name of the TensorRT-LLM image
- <TRT_LLM_VERSION>: the version of the TensorRT-LLM image
- Set the default project:
gcloud config set project $PROJECT_ID
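For reference, a fully filled-in configuration might look like the following. Every value here is a hypothetical placeholder; substitute the names that actually exist in your project:
# Hypothetical example values - replace each one with your own.
export PROJECT_ID=my-gpu-project
export REGION=us-central1
export CLUSTER_REGION=us-central1
export CLUSTER_NAME=a3ultra-benchmark-cluster
export GCS_BUCKET=my-benchmark-results          # no gs:// prefix
export ARTIFACT_REGISTRY=us-central1-docker.pkg.dev/my-gpu-project/my-repo
export TRT_LLM_IMAGE=trtllm
export TRT_LLM_VERSION=0.16.0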
From your client, clone the gpu-recipes
repository and set a reference to the recipe folder.
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/inference/a3ultra/llama-3.1-405b/trtllm-inference-gke/single-node
From your client, get the credentials for your cluster.
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
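To confirm that kubectl now points at the intended cluster, you can list its nodes. This is an optional check; the accelerator label used in the second command is a general GKE node label and may need adjusting for your node pool:
# Confirm the credentials work and the A3 Ultra node is visible.
kubectl get nodes -o wide
# Optionally, show only nodes that carry a GKE accelerator label.
kubectl get nodes -l cloud.google.com/gke-accelerator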
To build the container, complete the following steps from your client:
-
Use Cloud Build to build and push the container image.
cd $REPO_ROOT/src/docker/trtllm-0.16.0

gcloud builds submit --region=${REGION} \
  --config cloudbuild.yml \
  --substitutions _ARTIFACT_REGISTRY=$ARTIFACT_REGISTRY,_TRT_LLM_IMAGE=$TRT_LLM_IMAGE,_TRT_LLM_VERSION=$TRT_LLM_VERSION \
  --timeout "2h" \
  --machine-type=e2-highcpu-32 \
  --disk-size=1000 \
  --quiet \
  --async
This command outputs the build ID.
-
You can monitor the build progress by streaming the logs for the build ID. To do this, run the following command, replacing <BUILD_ID> with your build ID:
BUILD_ID=<BUILD_ID>

gcloud beta builds log $BUILD_ID --region=$REGION
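Alternatively, you can check the build status without streaming its logs, for example:
# Lists recent builds in the region with their status (WORKING, SUCCESS, FAILURE).
gcloud builds list --region=${REGION} --limit=3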
The recipe runs an inference benchmark of the Llama 3.1 405B model on a single A3 Ultra node, converting the Hugging Face checkpoint to the TensorRT-LLM optimized format with FP8 quantization.
trtllm-bench is a command-line tool from NVIDIA that can be used to benchmark the performance of TensorRT-LLM engines. For more information about trtllm-bench, see the TensorRT-LLM documentation.
To run the benchmarking, the recipe does the following steps:
- Download the full Llama 3.1 405B model checkpoints from Hugging Face. Please see the prerequisites section to get access to the model.
- Convert the model checkpoints to TensorRT-LLM optimized format
- Build TensorRT-LLM engines for the model with FP8 quantization
- Run the throughput and latency benchmarking
The recipe uses a Helm chart to run the above steps; an illustrative sketch of the underlying workflow follows.
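For orientation, the workflow the chart drives is conceptually similar to the trtllm-bench sequence below. This is an illustrative sketch only: the exact commands, dataset paths, and flags are defined in the Helm chart and container entrypoint, and trtllm-bench options can differ between TensorRT-LLM releases.
# Illustrative sketch - the real commands are generated by the Helm chart.
MODEL=meta-llama/Llama-3.1-405B
DATASET=/ssd/synthetic_128_128.json   # synthetic 128/128 dataset; path is an assumption

# Build an FP8-quantized engine for the model (TP=8 across the node's 8 GPUs).
trtllm-bench --model ${MODEL} build \
  --tp_size 8 \
  --quantization FP8 \
  --dataset ${DATASET}

# Run the throughput benchmark against the engine built above.
trtllm-bench --model ${MODEL} throughput \
  --dataset ${DATASET} \
  --engine_dir /ssd/${MODEL}/tp_8_pp_1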
-
Create a Kubernetes Secret with your Hugging Face token to allow the job to download the model checkpoints.
export HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN>
kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=${HF_TOKEN} \
  --dry-run=client -o yaml | kubectl apply -f -
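Optionally, confirm that the Secret exists before starting the job (this does not print the token value):
# Shows the Secret's keys and data sizes without revealing the token.
kubectl describe secret hf-secret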
-
Install the Helm chart to prepare the model and run the benchmark.
NOTE: This Helm chart currently runs only a single experiment: 30,000 requests with input and output sequence lengths of 128 tokens. To run other experiments, you can uncomment the various combinations provided in the values.yaml file.
cd $RECIPE_ROOT

helm install -f values.yaml \
  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
  --set clusterName=$CLUSTER_NAME \
  --set job.image.repository=${ARTIFACT_REGISTRY}/${TRT_LLM_IMAGE} \
  --set job.image.tag=${TRT_LLM_VERSION} \
  $USER-benchmark-llama-model \
  $REPO_ROOT/src/helm-charts/a3ultra/trtllm-inference/single-node
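After the install command returns, you can confirm that the release and its Pod were created, for example:
# List Helm releases in the current namespace.
helm list
# Watch the benchmark job's Pod start (names will differ).
kubectl get pods -w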
-
To view the logs for the job, you can run
kubectl logs -f job/$USER-benchmark-llama-model
-
Verify the job has completed by running
kubectl get job/$USER-benchmark-llama-model
If the job has completed, you should see the following output:
NAME                          COMPLETIONS   DURATION   AGE
$USER-benchmark-llama-model   1/1           ##s        #m##s
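If you prefer to block until the job finishes instead of polling, kubectl wait can be used; the timeout below is an arbitrary placeholder:
# Blocks until the job reports the Complete condition or the timeout expires.
kubectl wait --for=condition=complete job/$USER-benchmark-llama-model --timeout=24h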
-
Once the job starts running, you will see logs similar to this:
Running benchmark for meta-llama/Llama-3.1-405B with ISL=128, OSL=128, TP=8
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
Parse safetensors files: 100%|████████████████████████████████████████████████████████████████████████████████████████| 191/191 [00:01<00:00, 101.48it/s]
[01/21/2025-02:18:19] [TRT-LLM] [I] Found dataset.
[01/21/2025-02:18:20] [TRT-LLM] [I]
===========================================================
= DATASET DETAILS
===========================================================
Max Input Sequence Length: 128
Max Output Sequence Length: 128
Max Sequence Length: 256
Target (Average) Input Sequence Length: 128
Target (Average) Output Sequence Length: 128
Number of Sequences: 30000
===========================================================
[01/21/2025-02:18:20] [TRT-LLM] [I] Max batch size and max num tokens are not provided, use tuning heuristics or pre-defined setting from trtllm-bench.
[01/21/2025-02:18:20] [TRT-LLM] [I] Estimated total available memory for KV cache: 717.36 GB
[01/21/2025-02:18:20] [TRT-LLM] [I] Estimated total KV cache memory: 681.49 GB
[01/21/2025-02:18:20] [TRT-LLM] [I] Estimated max number of requests in KV cache memory: 11076.91
[01/21/2025-02:18:20] [TRT-LLM] [I] Set dtype to bfloat16.
[01/21/2025-02:18:20] [TRT-LLM] [I] Set multiple_profiles to True.
[01/21/2025-02:18:20] [TRT-LLM] [I] Set use_paged_context_fmha to True.
[01/21/2025-02:18:20] [TRT-LLM] [I] Set use_fp8_context_fmha to True.
[01/21/2025-02:18:20] [TRT-LLM] [I]
===========================================================
= ENGINE BUILD INFO
===========================================================
Model Name: meta-llama/Llama-3.1-405B
Model Path: /ssd/meta-llama/Llama-3.1-405B
Workspace Directory: /ssd
Engine Directory: /ssd/meta-llama/Llama-3.1-405B/tp_8_pp_1
===========================================================
= ENGINE CONFIGURATION DETAILS
===========================================================
Max Sequence Length: 256
Max Batch Size: 4096
Max Num Tokens: 8192
Quantization: FP8
===========================================================
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
Loading Model: [1/2] Loading HF model to memory
Loading checkpoint shards: 100%|██████████| 191/191 [02:48<00:00, 1.14it/s]
Inserted 2649 quantizers
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)
Disable lm_head quantization for TRT-LLM export due to deployment limitations.
current rank: 0, tp rank: 0, pp rank: 0
Time: 1723.944s
Loading Model: [2/2] Building TRT-LLM engine
Time: 600.103s
Loading model done.
Total latency: 2324.046s
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
[01/22/2025-08:40:31] [TRT-LLM] [I] Preparing to run throughput benchmark...
[01/22/2025-08:40:33] [TRT-LLM] [I] Setting up benchmarker and infrastructure.
[01/22/2025-08:40:33] [TRT-LLM] [I] Initializing Throughput Benchmark. [rate=-1 req/s]
[01/22/2025-08:40:33] [TRT-LLM] [I] Ready to start benchmark.
[01/22/2025-08:40:33] [TRT-LLM] [I] Initializing Executor.
[01/22/2025-08:41:37] [TRT-LLM] [I] WAITING ON EXECUTOR...
[01/22/2025-08:41:37] [TRT-LLM] [I] Starting response daemon...
[01/22/2025-08:41:37] [TRT-LLM] [I] Executor started.
[01/22/2025-08:41:37] [TRT-LLM] [I] WAITING ON BACKEND TO BE READY...
[01/22/2025-08:41:37] [TRT-LLM] [I] Request serving started.
[01/22/2025-08:41:37] [TRT-LLM] [I] Starting statistics collection.
[01/22/2025-08:41:37] [TRT-LLM] [I] Collecting live stats...
[01/22/2025-08:41:37] [TRT-LLM] [I] Benchmark started.
[01/22/2025-08:41:37] [TRT-LLM] [I] Request serving stopped.
[01/22/2025-08:57:55] [TRT-LLM] [I] Collecting last stats...
[01/22/2025-08:57:55] [TRT-LLM] [I] Ending statistics collection.
[01/22/2025-08:57:55] [TRT-LLM] [I] Stop received.
[01/22/2025-08:57:55] [TRT-LLM] [I] Stopping response parsing.
[01/22/2025-08:57:55] [TRT-LLM] [I] Collecting last responses before shutdown.
[01/22/2025-08:57:55] [TRT-LLM] [I] Completed request parsing.
[01/22/2025-08:57:55] [TRT-LLM] [I] Parsing stopped.
[01/22/2025-08:57:55] [TRT-LLM] [I] Request generator successfully joined.
[01/22/2025-08:57:55] [TRT-LLM] [I] Statistics process successfully joined.
[01/22/2025-08:57:55] [TRT-LLM] [I]
===========================================================
= ENGINE DETAILS
===========================================================
Model: meta-llama/Llama-3.1-405B
Engine Directory: /ssd/meta-llama/Llama-3.1-405B/tp_8_pp_1
TensorRT-LLM Version: 0.16.0
Dtype: bfloat16
KV Cache Dtype: FP8
Quantization: FP8
Max Sequence Length: 256
===========================================================
= WORLD + RUNTIME INFORMATION
===========================================================
TP Size: 8
PP Size: 1
Max Runtime Batch Size: 4096
Max Runtime Tokens: 8192
Scheduling Policy: Guaranteed No Evict
KV Memory Percentage: 95.00%
Issue Rate (req/sec): 9.5448E+12
===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Number of requests: 30000
Average Input Length (tokens): 128.0000
Average Output Length (tokens): 128.0000
Token Throughput (tokens/sec): 3926.0575
Request Throughput (req/sec): 30.6723
Total Latency (ms): 978080.4323
===========================================================
[01/22/2025-08:57:55] [TRT-LLM] [I] Benchmark Shutdown called!
[01/22/2025-08:57:55] [TRT-LLM] [I] Shutting down ExecutorServer.
[TensorRT-LLM][INFO] Orchestrator sendReq thread exiting
[TensorRT-LLM][INFO] Orchestrator recv thread exiting
[01/22/2025-08:57:55] [TRT-LLM] [I] Executor shutdown.
-
Once the job has completed, you can see the results in the Cloud Storage bucket.
gsutil ls gs://${GCS_BUCKET}/benchmark_logs/
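To inspect the results locally, you can copy the logs from the bucket to your client, for example:
# Copy the benchmark logs from the bucket to the current directory.
gsutil cp -r gs://${GCS_BUCKET}/benchmark_logs ./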
To clean up the resources created by this recipe, complete the following steps:
-
Uninstall the helm chart.
helm uninstall $USER-benchmark-llama-model
-
Delete the Kubernetes Secret.
kubectl delete secret hf-secret
If you created your cluster using the GKE environment setup guide, it is configured with default settings that include the names of the networks and subnetworks used for:
- Communication between the host and external services.
- GPU-to-GPU communication.
For clusters with this default configuration, the Helm chart can automatically generate the required networking annotations in a Pod's metadata. Therefore, you can use the streamlined command to install the chart, as described in the Single A3 Ultra Node Benchmarking using FP8 Quantization section.
To configure the correct networking annotations for a cluster that uses non-default names for GKE Network resources, you must provide the names of the GKE Network resources in your cluster when installing the chart. Use the following example command, remembering to replace the example values with the actual names of your cluster's GKE Network resources:
cd $RECIPE_ROOT
helm install -f values.yaml \
--set workload.image=${ARTIFACT_REGISTRY}/${TRT_LLM_IMAGE}:${TRT_LLM_VERSION} \
--set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
--set network.subnetworks[0]=default \
--set network.subnetworks[1]=gvnic-1 \
--set network.subnetworks[2]=rdma-0 \
--set network.subnetworks[3]=rdma-1 \
--set network.subnetworks[4]=rdma-2 \
--set network.subnetworks[5]=rdma-3 \
--set network.subnetworks[6]=rdma-4 \
--set network.subnetworks[7]=rdma-5 \
--set network.subnetworks[8]=rdma-6 \
--set network.subnetworks[9]=rdma-7 \
$USER-benchmark-llama-model \
$REPO_ROOT/src/helm-charts/a3ultra/trtllm-inference/single-node
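If you are unsure which GKE Network resources exist in your cluster, you can list them before installing the chart. This assumes the cluster has GKE multi-networking enabled, which exposes the Network custom resource:
# Lists the GKE Network objects; use these names for the network.subnetworks[*] values.
kubectl get networks.networking.gke.io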