Description
I am trying to optimize the loading of a ~14.2 GB TensorRT-LLM engine into 16 GB of VRAM on a node with 16 GB of CPU RAM. Since the rest of my program takes around ~1 GB of CPU RAM, there is little room for anything other than streaming the CudaEngine from disk to CUDA.
Upon trying out `trt.IStreamReader`, the class does not keep its promises:
- it is slower than reading the file in Python;
- it requires ~15 GB of CPU RAM overhead (beyond the engine size), versus ~1 GB with a naive implementation.
Environment
TensorRT Version: 10.7
NVIDIA GPU: H100
```
/baseten/engine-builder/tei_trt# nvidia-smi
Wed Jan 15 23:59:04 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
```
Operating System: Ubuntu 22.04
Python Version (if applicable): 3.10.2
PyTorch Version (if applicable): 2.5.1
Baremetal or Container (if so, version):
Relevant Files
Llama-7B engine created with TensorRT-LLM 0.16.0
Steps To Reproduce
```python
import time
from pathlib import Path

import tensorrt as trt


def FileReaderVanilla(filepath):
    if not Path(filepath).exists():
        raise ValueError(f"File at {filepath} does not exist!")
    with open(filepath, "rb") as f:
        return f.read()


class FileReaderV1(trt.IStreamReader):
    """Class that supplies data to TensorRT from a stream. This may help
    reduce memory usage during deserialization, moving the engine file
    directly to CUDA memory without loading it into CPU memory first.
    https://github.com/NVIDIA/TensorRT/blob/97ff24489d0ea979c418c7a0847dfc14c8483846/tools/Polygraphy/polygraphy/backend/trt/file_reader.py#L28

    Args:
        filepath (str): The path to the serialized file.

    ```python
    # roughly equivalent to:
    if not self.serialize_path.exists():
        raise ValueError(
            f"missing engine at serialize_path={self.serialize_path}"
        )
    with open(self.serialize_path, "rb") as f:
        yield f.read()  # stream equivalent
    ```
    """

    def __init__(self, filepath):
        # Must explicitly initialize the parent for any trampoline class!
        # Will mysteriously segfault without this.
        trt.IStreamReader.__init__(self)  # type: ignore
        self.filepath = filepath
        if not Path(self.filepath).exists():
            raise ValueError(f"File at {self.filepath} does not exist!")
        self.file = open(self.filepath, "rb")

    def read(self, size: int) -> bytes:
        print(f"Reading {size} bytes")
        return self.file.read(size)

    def free(self):
        if self.file:
            self.file.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.free()


class FileReaderV2(trt.IStreamReaderV2):
    """Class that supplies data to TensorRT from a stream, without loading
    the whole file into memory, moving the engine file directly to CUDA
    memory without first allocating it all in CPU memory.

    Args:
        file_path (Path): The path to the serialized engine file.
    """

    def __init__(self, file_path):
        trt.IStreamReaderV2.__init__(self)
        self.bytes = Path(file_path).read_bytes()
        self.len = len(self.bytes)
        self.index = 0

    def read(self, size, cudaStreamPtr):
        assert self.index + size <= self.len
        data = self.bytes[self.index : self.index + size]
        self.index += size
        print(f"Reading {size} bytes, actual size: {len(data)}")
        return data

    def seek(self, offset, where):
        print(f" seek position: {offset} {where}")
        if where == trt.SeekPosition.SET:
            self.index = offset
        elif where == trt.SeekPosition.CUR:
            self.index += offset
        elif where == trt.SeekPosition.END:
            self.index = self.len - offset
        else:
            raise ValueError(f"Invalid seek position: {where}")


def init_runtime(reader):
    runtime = trt.Runtime(trt.Logger(trt.Logger.INFO))
    engine = runtime.deserialize_cuda_engine(reader)
    assert engine is not None
    return runtime, engine


def debug_max_memory_usage_filereaderv2():
    _ = init_runtime(FileReaderV2("/app/engines/rank0.engine"))
    time.sleep(1)


def debug_max_memory_usage_filereaderv1():
    _ = init_runtime(FileReaderV1("/app/engines/rank0.engine"))
    time.sleep(1)


def debug_max_memory_usage_filereader_vanilla():
    _ = init_runtime(FileReaderVanilla("/app/engines/rank0.engine"))
    time.sleep(1)


if __name__ == "__main__":
    # /usr/bin/time -v poetry run python ./tests/test_runtime_filereader.py
    debug_max_memory_usage_filereaderv2()
```
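For cross-checking the `/usr/bin/time -v` numbers below, peak RSS can also be sampled from inside the process with the standard library. A minimal sketch (on Linux, `ru_maxrss` is reported in kilobytes):

```python
import resource


def peak_rss_kb() -> int:
    # Peak resident set size of this process so far.
    # Linux reports ru_maxrss in kilobytes (macOS reports bytes).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


# e.g. call after init_runtime(...) to compare against /usr/bin/time -v:
print(f"peak RSS: {peak_rss_kb()} kB")
```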
Vanilla results
User time 8.4 s, peak memory 15,524,688 kB (~14.8 GiB, i.e. roughly one resident copy of the 14.2 GB engine).

Run via `/usr/bin/time -v poetry run python ./tests/test_runtime_filereader.py` with `debug_max_memory_usage_filereader_vanilla()` as the entry point:

```
warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
Command being timed: "poetry run python ./tests/test_runtime_filereader.py"
User time (seconds): 8.40
System time (seconds): 17.13
Percent of CPU this job got: 109%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:23.25
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 15524688
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 6318
Minor (reclaiming a frame) page faults: 3824756
Voluntary context switches: 53551
Involuntary context switches: 537
Swaps: 0
File system inputs: 0
File system outputs: 24
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
(trt-tei-runtime-py3.10) root@michaelfeil-dev-pod-h100-0:~/baseten/engine-builde
```
IStreamReaderV1 loading
User time 10.27 s (worse), peak memory 29,217,388 kB (~27.9 GiB, almost double the vanilla peak and about two resident copies of the engine).

Run via `/usr/bin/time -v poetry run python ./tests/test_runtime_filereader.py` with `debug_max_memory_usage_filereaderv1()` as the entry point:

```
[TensorRT-LLM] TensorRT-LLM version: 0.16.0
Command being timed: "poetry run python ./tests/test_runtime_filereader.py"
User time (seconds): 10.27
System time (seconds): 22.72
Percent of CPU this job got: 111%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:29.65
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 29217388
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 6284
Minor (reclaiming a frame) page faults: 7312826
Voluntary context switches: 54294
Involuntary context switches: 538
Swaps: 0
File system inputs: 0
File system outputs: 24
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
```
Analysis
The memory duplication is most likely caused by the Python-to-C++ crossing, which copies the buffer returned by `read()`. If the API requested the data in smaller chunks, the overhead would be bounded by the chunk size rather than the file size.
With the IStreamReader (V1) class, `.read(size)` is called only twice: once for the initial 32 bytes and then once for the entire rest of the file.
```
# successful read that needs 29217388 kB peak RSS
reading 32 bytes from /app/engines/rank0.engine
reading 14244750076 bytes from /app/engines/rank0.engine
```
A pdb breakpoint delivers no additional info:

```
builder/tei_trt/tests/test_runtime_filereader.py(7)init_runtime()
      6     runtime = trt.Runtime(trt.Logger(trt.Logger.INFO))
----> 7     engine = runtime.deserialize_cuda_engine(reader)
      8     assert engine is not None
> /workspace/model-performance/michaelfeil/baseten/engine-builder/tei_trt/trt_tei_runtime/trt_model.py(137)read()
    136         ipdb.set_trace()
--> 137         print(f"reading {size} bytes from {self.filepath}")
    138         return self.file.read(size)
```
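Given this two-call pattern, one partial mitigation I considered is backing the V1 reader with mmap so the big read is served as a view into the page cache rather than a heap copy. This is only a sketch under an unverified assumption: it helps only if the binding accepts buffer objects such as `memoryview` without forcing a `bytes` conversion; if it insists on `bytes`, the full copy reappears.

```python
import mmap
from pathlib import Path

import tensorrt as trt


class MmapFileReaderV1(trt.IStreamReader):
    """Sketch: serve reads as zero-copy views into a read-only mmap, so
    Python never holds a long-lived heap copy of the whole engine."""

    def __init__(self, filepath: str):
        trt.IStreamReader.__init__(self)  # required for trampoline classes
        if not Path(filepath).exists():
            raise ValueError(f"File at {filepath} does not exist!")
        self.file = open(filepath, "rb")
        # Linux-specific flag; on Windows use access=mmap.ACCESS_READ.
        self.mm = mmap.mmap(self.file.fileno(), 0, prot=mmap.PROT_READ)
        self.view = memoryview(self.mm)
        self.pos = 0

    def read(self, size: int):
        # Returning a memoryview slice avoids a Python-side copy -- IF
        # the binding accepts buffer objects (unverified assumption);
        # wrapping with .tobytes() would forfeit the benefit.
        data = self.view[self.pos : self.pos + size]
        self.pos += len(data)
        return data

    def free(self):
        self.view.release()
        self.mm.close()
        self.file.close()
```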
Analysis IStreamReaderV2
IStreamReaderV2 also requests most of the file in a single read call. On this 16 GB node, that actually does fail.
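If the runtime tolerated short reads and re-invoked `read()` until satisfied (the C++ `IStreamReaderV2` returning a byte count hints at this, but I have not verified the Python binding's contract), a chunk-capped reader could bound the Python-side buffer. A hypothetical sketch; the 64 MiB `CHUNK` and the retry-on-short-read behavior are assumptions, not confirmed TensorRT semantics:

```python
import mmap

import tensorrt as trt


class ChunkedReaderV2(trt.IStreamReaderV2):
    """Hypothetical reader that never materializes more than CHUNK bytes
    in Python at once. Only works if TensorRT retries short reads."""

    CHUNK = 64 * 1024 * 1024  # assumed chunk size, not a TRT requirement

    def __init__(self, file_path: str):
        trt.IStreamReaderV2.__init__(self)
        self.file = open(file_path, "rb")
        self.mm = mmap.mmap(self.file.fileno(), 0, prot=mmap.PROT_READ)
        self.len = len(self.mm)
        self.index = 0

    def read(self, size, cudaStreamPtr):
        # Serve at most CHUNK bytes per call; assumes the caller keeps
        # calling read() until it has accumulated `size` bytes in total.
        n = min(size, self.CHUNK, self.len - self.index)
        data = self.mm[self.index : self.index + n]
        self.index += n
        return data

    def seek(self, offset, where):
        if where == trt.SeekPosition.SET:
            self.index = offset
        elif where == trt.SeekPosition.CUR:
            self.index += offset
        elif where == trt.SeekPosition.END:
            self.index = self.len - offset  # mirrors the repro's convention
        else:
            raise ValueError(f"Invalid seek position: {where}")
```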
Commands or scripts:
Have you tried the latest release?: YES
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (`polygraphy run <model.onnx> --onnxrt`): polygraphy / tensorrt_llm