No speedup from float16 with directml compared to cuda #23359
Labels
- ep:CUDA (issues related to the CUDA execution provider)
- ep:DML (issues related to the DirectML execution provider)
- model:transformer (issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.)
I am using DirectML for inference of UNet models trained with PyTorch. The UNets consist mostly of Conv3D + BatchNormalization + ReLU operations; no transformers are used.
The inference results are great, and I am now looking into model optimization for faster inference.
I hoped that converting the model weights to float16 would be about twice as fast; however, it took just as long as float32, or sometimes 5-10% longer.
With the same models and the CUDA execution provider I get half the inference time, as expected. However, I like the portability of DirectML.
I export the models as follows:
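A minimal sketch of such an export plus float16 conversion, assuming torch.onnx.export and onnxconverter_common; the module, file names, and input shape are placeholders, not the exact code from the report:

```python
import torch
import torch.nn as nn
import onnx
from onnxconverter_common import float16

# Tiny stand-in for the real 3D UNet: one Conv3d + BatchNorm3d + ReLU block.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1),
            nn.BatchNorm3d(8),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.block(x)

model = TinyNet().eval()
dummy = torch.randn(1, 1, 64, 64, 64)  # placeholder N, C, D, H, W volume

# Export the float32 model to ONNX.
torch.onnx.export(
    model, dummy, "unet_fp32.onnx",
    input_names=["input"], output_names=["output"],
    opset_version=17,
)

# Convert initializers and ops to float16. keep_io_types=True leaves the graph
# inputs/outputs in float32, so the calling code does not need to change.
m = onnx.load("unet_fp32.onnx")
m_fp16 = float16.convert_float_to_float16(m, keep_io_types=True)
onnx.save(m_fp16, "unet_fp16.onnx")
```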
I tried the following:
Expected behaviour
I would expect roughly half the inference time, since I get that with the CUDA provider on the same platform and GPU.
Are there any other options that I could try?
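For reference, a minimal sketch of how I compare the two providers on the same fp16 model; the file name, input shape, and run count are placeholders (the DirectML provider comes from the onnxruntime-directml package, the CUDA provider from onnxruntime-gpu):

```python
import time
import numpy as np
import onnxruntime as ort

def time_provider(provider, model_path="unet_fp16.onnx", runs=20):
    # Build a session pinned to one execution provider.
    sess = ort.InferenceSession(model_path, providers=[provider])
    # fp32 input works here because the model was converted with keep_io_types=True.
    x = np.random.rand(1, 1, 64, 64, 64).astype(np.float32)
    feed = {sess.get_inputs()[0].name: x}
    sess.run(None, feed)  # warm-up: first run includes graph/driver setup
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, feed)
    return (time.perf_counter() - start) / runs

print("DML :", time_provider("DmlExecutionProvider"))
print("CUDA:", time_provider("CUDAExecutionProvider"))
```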
Platform: Windows 11
python=3.11.9
onnx=1.16
onnxruntime=1.17 / 1.20
GPU: NVIDIA RTX 2080, 8 GB VRAM