Describe the issue
Running a model for N iterations in a single ONNX Runtime session is much faster than running the same model in 2 independent sessions, with each session run for N/2 iterations.
What's going on? I'm writing a real-time inference pipeline that involves two models (M1 and M2) at the same time. One of the models, M1, is quite small and has a fixed batch size of 1. The input for M2, on the other hand, is a sequence of M1 outputs bundled together. M1's fixed batch size of 1 gets in the way here: I have to issue multiple inference requests instead of a single one, which increases the overall inference latency.
In this quest to reduce the latency, I tried tinkering with multiple instances of the same model; since the model is quite small, I hoped it might benefit from a multi-threading speed-up. I found the opposite:
Running a single instance took 2.32 ms on average
Running 2 instances took 9.29 ms on average
What's going on? And more importantly, does having two sessions with two independent models (the M1 and M2 pipeline) affect the overall performance of ONNX Runtime?
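For context, the pipeline roughly has the shape sketched below. The model files, tensor names, and the exact way the M1 outputs are bundled for M2 are placeholders for illustration; only the structure matches my real code:

import numpy as np
import onnxruntime as ort

providers = ["CUDAExecutionProvider"]
m1 = ort.InferenceSession("m1.onnx", providers=providers)  # small model, fixed batch size of 1
m2 = ort.InferenceSession("m2.onnx", providers=providers)  # consumes a sequence of M1 outputs

m1_input = m1.get_inputs()[0].name
m2_input = m2.get_inputs()[0].name

# M1 has to be invoked once per frame because of its fixed batch size of 1...
frames = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(8)]
m1_outputs = [m1.run(None, {m1_input: frame})[0] for frame in frames]

# ...and M2 then consumes the bundled sequence of M1 outputs in a single call.
sequence = np.stack(m1_outputs, axis=1)  # how the bundling happens is model-specific
m2.run(None, {m2_input: sequence})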
To reproduce
Tinker with
n_models = 2
single_model = False
import time
from pathlib import Path

import numpy as np
import onnxruntime as ort

# Load the MobileNetV2 model
model_path = Path(__file__).parent / "mobilenetv2-10.onnx"

# Configure ONNX Runtime to use GPU
providers = ["CUDAExecutionProvider"]  # Ensures the model runs on GPU

n_models = 2
single_model = False

# Initialize the ONNX Runtime session(s)
if single_model:
    sessions = [ort.InferenceSession(model_path, providers=providers)] * n_models
else:
    sessions = [
        ort.InferenceSession(model_path, providers=providers) for _ in range(n_models)
    ]

# Create a random input tensor simulating a batch of images (3x224x224)
input_shape = (1, 3, 224, 224)

# Get model input name
input_name = sessions[0].get_inputs()[0].name

# Measure performance
num_runs = 100
times = []
for _ in range(num_runs):
    for session in sessions:
        input_data = np.random.rand(*input_shape).astype(np.float32)
        start_time = time.perf_counter()
        session.run(None, {input_name: input_data})
        times.append(time.perf_counter() - start_time)

assert len(times) == n_models * num_runs
print(f"Average time per inference: {np.average(times[1:]) * 1000:.6f} ms")
print(f"STD {np.std(times[1:]) * 1000} ms")
Urgency
No response
Platform
Windows
OS Version
11
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
onnxruntime-gpu==1.19.2
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.7 + RTX3060 12GB
Model File
Replicated with a private model and MobileNetV2
Is this a quantized model?
No