Describe the issue
Running a model for N iterations in a single ONNX Runtime session is much faster than running the same model in 2 independent sessions, with each session run for N/2 iterations.
What's going on? I'm writing a real-time inference pipeline that involves two models (M1 and M2) at the same time. One of the models, M1, is quite small and has a fixed batch size of 1. The input for M2, on the other hand, is a sequence of M1 outputs bundled together. M1's fixed batch size of 1 gets in the way here: I have to issue multiple inference requests instead of a single one, which increases the overall inference latency.
In this quest to reduce the latency, I tried tinkering with multiple instances of the same model; since the model is quite small, I hoped it might benefit from a multi-threading speed-up. I found the opposite:
Running a single instance took 2.32 ms on average
Running 2 instances took 9.29 ms on average
What's going on? And more importantly, does having two sessions with two independent models (the M1 and M2 pipeline) affect the overall performance of ONNX Runtime?
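For context, the pipeline roughly has the shape sketched below. The model files, tensor names, and the exact way the M1 outputs are bundled for M2 are placeholders for illustration; only the structure matches my real code:

import numpy as np
import onnxruntime as ort

providers = ["CUDAExecutionProvider"]
m1 = ort.InferenceSession("m1.onnx", providers=providers)  # small model, fixed batch size of 1
m2 = ort.InferenceSession("m2.onnx", providers=providers)  # consumes a sequence of M1 outputs

m1_input = m1.get_inputs()[0].name
m2_input = m2.get_inputs()[0].name

# M1 has to be invoked once per frame because of its fixed batch size of 1...
frames = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(8)]
m1_outputs = [m1.run(None, {m1_input: frame})[0] for frame in frames]

# ...and M2 then consumes the bundled sequence of M1 outputs in a single call.
sequence = np.stack(m1_outputs, axis=1)  # how the bundling happens is model-specific
m2.run(None, {m2_input: sequence})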
To reproduce
Tinker with
n_models = 2
single_model = False
import time
from pathlib import Path

import numpy as np
import onnxruntime as ort

# Load the MobileNetV2 model
model_path = Path(__file__).parent / "mobilenetv2-10.onnx"

# Configure ONNX Runtime to use GPU
providers = ["CUDAExecutionProvider"]  # Ensures the model runs on GPU

n_models = 2
single_model = False

# Initialize the ONNX Runtime session(s)
if single_model:
    sessions = [ort.InferenceSession(model_path, providers=providers)] * n_models
else:
    sessions = [
        ort.InferenceSession(model_path, providers=providers) for _ in range(n_models)
    ]

# Create a random input tensor simulating a batch of images (3x224x224)
input_shape = (1, 3, 224, 224)

# Get model input name
input_name = sessions[0].get_inputs()[0].name

# Measure performance
num_runs = 100
times = []
for _ in range(num_runs):
    for session in sessions:
        input_data = np.random.rand(*input_shape).astype(np.float32)
        start_time = time.perf_counter()
        session.run(None, {input_name: input_data})
        times.append(time.perf_counter() - start_time)

assert len(times) == n_models * num_runs
print(f"Average time per inference: {np.average(times[1:]) * 1000:.6f} ms")
print(f"STD {np.std(times[1:]) * 1000} ms")
Urgency
No response
Platform
Windows
OS Version
11
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
onnxruntime-gpu==1.19.2
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.7 + RTX3060 12GB
Model File
Replicated with a private model and MobileNetV2
Is this a quantized model?
No