No speedup from float16 with directml compared to cuda #23359

Open
Samy-mri opened this issue Jan 14, 2025 · 0 comments
Labels
ep:CUDA (issues related to the CUDA execution provider)
ep:DML (issues related to the DirectML execution provider)
model:transformer (issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.)

Comments

@Samy-mri
I am using DirectML for inference of UNet models trained with PyTorch. The UNets consist mostly of Conv3D + BatchNormalization + ReLU operations; no transformers are used.

The inference results are great, and I am now looking into model optimization for faster inference.
I hoped that converting the model weights to float16 would make inference roughly twice as fast; however, it took just as long as float32, or sometimes 5-10% longer.

With the same models and the CUDA execution provider I get half the inference time, as expected. However, I like the portability of DirectML.

I export models as:

    torch.onnx.export(
        model,  # model being run
        input,  # model input (or a tuple for multiple inputs)
        output,  # where to save the model (can be a file or file-like object)
        export_params=True,  # store the trained parameter weights inside the model file
        opset_version=opset_version,  # the ONNX version to export the model to
        do_constant_folding=True,  # whether to execute constant folding for optimization
        input_names=["input"],  # the model's input names
        output_names=["output"],  # the model's output names
        dynamic_axes={
            "input": {0: "batch_size"},  # variable length axes
            "output": {0: "batch_size"},
        },
    )
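
I then run inference roughly like this (the model path, input shape, and provider order here are illustrative, not my exact script):

    import numpy as np
    import onnxruntime as ort

    # DirectML first, CPU as fallback.
    sess = ort.InferenceSession(
        "unet_fp16.onnx",
        providers=["DmlExecutionProvider", "CPUExecutionProvider"],
    )

    # Dummy half-precision 3D volume matching the exported "input" name.
    x = np.random.rand(1, 1, 64, 64, 64).astype(np.float16)
    y = sess.run(["output"], {"input": x})[0]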

I tried the following:

  • Converted from float32 to float16 using onnxconverter_common (see the sketch after this list).
  • With and without IOBinding; no effect on inference time.
  • Monitoring the GPU, I found that while running float16 the GPU utilization fluctuated between 60% and 80%, whereas with float32 models it ranged from 90% to 100%. GPU RAM usage was halved, as expected.
  • Opset versions 7 and 20 with compatible onnxruntime versions (1.17 - 1.20).
  • I suspected that some float16 operations were not implemented in DirectML, so using NVIDIA Nsight Systems I checked whether there was a sudden large Tx/Rx to the CPU, or process-linked CPU activity, during inference. I couldn't immediately see anything suspicious, but I am a beginner in CPU-GPU profiling.
  • Batch prediction: since the GPU was under-utilized with float16, I experimented with doubling the batch_size. GPU utilization does go above 90%, but the time per prediction also doubles, so there is no net inference speedup.
  • Olive optimization, exporting with float16 model output. The model size did not decrease, nor did it speed up inference. This is outside the scope of this forum, but I mention it for completeness.
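
For completeness, the float16 conversion from the first bullet is roughly the following (file names are illustrative, and keep_io_types is left at its default):

    import onnx
    from onnxconverter_common import float16

    # Load the exported float32 model and convert initializers/ops to float16.
    model_fp32 = onnx.load("unet_fp32.onnx")
    model_fp16 = float16.convert_float_to_float16(model_fp32)
    onnx.save(model_fp16, "unet_fp16.onnx")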

Expected behaviour

I would expect half the inference time, since on the same platform and GPU I can achieve that with the CUDA provider.

Are there any other options that I could try?
Platform: Windows 11
python=3.11.9
onnx=1.16
onnxruntime=1.17 / 1.20
GPU: NVIDIA RTX 2080, 8 GB VRAM

@github-actions github-actions bot added the ep:CUDA, ep:DML and model:transformer labels on Jan 14, 2025