
[Performance] Multiple instances of the same model are slower #22778

Open
JorgeRuizITCL opened this issue Nov 8, 2024 · 0 comments
Labels
performance issues related to performance regressions

Comments


JorgeRuizITCL commented Nov 8, 2024

Describe the issue

Running a model for N iterations in a single ONNX Runtime session is much faster than splitting the same work across 2 independent sessions, each running N/2 iterations.

What's going on? I'm writing a real-time inference pipeline that runs two models (M1 and M2) at the same time. One of the models, M1, is quite small and has a fixed batch size of 1. The input for M2, on the other hand, is a sequence of M1 outputs bundled together. M1's batch size of 1 gets in the way here: I have to issue multiple inference requests instead of a single batched one, which increases the overall inference latency.
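
Roughly, the pipeline looks like this; the file names, shapes, and sequence length below are placeholders to illustrate the batch-1 constraint, not the real models:

import numpy as np
import onnxruntime as ort

# Placeholder models and shapes, just to illustrate the batch-1 constraint
m1 = ort.InferenceSession("m1.onnx", providers=["CUDAExecutionProvider"])
m2 = ort.InferenceSession("m2.onnx", providers=["CUDAExecutionProvider"])
m1_input = m1.get_inputs()[0].name
m2_input = m2.get_inputs()[0].name

seq_len = 8  # M2 consumes a sequence of M1 outputs

# M1 is fixed to batch size 1, so the sequence must be built one run at a time
frames = [np.random.rand(1, 3, 224, 224).astype(np.float32) for _ in range(seq_len)]
m1_outputs = [m1.run(None, {m1_input: f})[0] for f in frames]

# Bundle the M1 outputs into a single M2 input
m2_in = np.concatenate(m1_outputs, axis=0)
m2_result = m2.run(None, {m2_input: m2_in})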

In trying to reduce the latency, I experimented with multiple instances of the same model; since the model is quite small, I hoped it might benefit from some multi-threading speed-up. I found the opposite:

Running a single instance took on average 2.32ms
Running 2 instances took on average 9.29ms

What's going on? And, more importantly, does having two sessions with two independent models (the M1 + M2 pipeline) hurt the overall performance of ONNX Runtime?
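
For reference, this is roughly how I intended to drive the two sessions later on, from separate threads (a sketch only; the thread pool, file name, and shapes are illustrative, and it assumes that session.run releasing the GIL lets the two sessions overlap):

import concurrent.futures as cf

import numpy as np
import onnxruntime as ort

providers = ["CUDAExecutionProvider"]
sessions = [
    ort.InferenceSession("mobilenetv2-10.onnx", providers=providers) for _ in range(2)
]
input_name = sessions[0].get_inputs()[0].name

def run_once(session):
    # Each thread feeds its own session; run() releases the GIL
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)
    return session.run(None, {input_name: x})

with cf.ThreadPoolExecutor(max_workers=len(sessions)) as pool:
    results = list(pool.map(run_once, sessions))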

To reproduce

Tinker with these two flags in the script:
n_models = 2
single_model = False

import time
from pathlib import Path

import numpy as np
import onnxruntime as ort

# Path to the MobileNetV2 model, expected next to this script
model_path = Path(__file__).parent / "mobilenetv2-10.onnx"

# Request the CUDA execution provider so the model runs on the GPU
providers = ["CUDAExecutionProvider"]
n_models = 2
single_model = False
# Initialize the ONNX Runtime session(s)
if single_model:
    # One session object, referenced n_models times
    sessions = [ort.InferenceSession(model_path, providers=providers)] * n_models
else:
    # n_models fully independent sessions of the same model
    sessions = [
        ort.InferenceSession(model_path, providers=providers) for _ in range(n_models)
    ]


# Input shape for a single image (batch of 1, 3x224x224)
input_shape = (1, 3, 224, 224)


# Get model input name
input_name = sessions[0].get_inputs()[0].name

# Measure performance
num_runs = 100


times = []

for _ in range(num_runs):
    for session in sessions:
        input_data = np.random.rand(*input_shape).astype(np.float32)

        # Time each run() call individually, alternating between sessions
        start_time = time.perf_counter()
        session.run(None, {input_name: input_data})
        times.append(time.perf_counter() - start_time)

assert len(times) == n_models * num_runs

# times[1:] drops only the very first measurement as warm-up
print(f"Average time per inference: {np.average(times[1:]) * 1000:.6f} ms")
print(f"STD {np.std(times[1:]) * 1000:.6f} ms")

Urgency

No response

Platform

Windows

OS Version

11

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-gpu==1.19.2

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.7 + RTX3060 12GB

Model File

Replicated with a private model and MobileNetV2

Is this a quantized model?

No

@JorgeRuizITCL JorgeRuizITCL added the performance issues related to performance regressions label Nov 8, 2024