No speedup from float16 with directml compared to cuda #23359

Open
Samy-mri opened this issue Jan 14, 2025 · 0 comments
Labels
ep:CUDA (issues related to the CUDA execution provider)
ep:DML (issues related to the DirectML execution provider)
model:transformer (issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.)

Comments

@Samy-mri
I am using DirectML for inference of UNet models trained with PyTorch. The UNets consist mostly of Conv3D + BatchNormalization + ReLU operations; no transformers are used.

The inference results are great, and I am now looking into model optimization for faster inference.
I hoped that converting the model weights to float16 would make inference roughly twice as fast; however, it took just as long as float32, or sometimes 5-10% longer.

With the same models and the CUDA execution provider I get half the inference time, as expected. However, I like the portability of DirectML.

I export models as:

    torch.onnx.export(
        model,  # model being run
        input,  # model input (or a tuple for multiple inputs)
        output,  # where to save the model (can be a file or file-like object)
        export_params=True,  # store the trained parameter weights inside the model file
        opset_version=opset_version,  # the ONNX version to export the model to
        do_constant_folding=True,  # whether to execute constant folding for optimization
        input_names=["input"],  # the model's input names
        output_names=["output"],  # the model's output names
        dynamic_axes={
            "input": {0: "batch_size"},  # variable length axes
            "output": {0: "batch_size"},
        },
    )
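
I then run inference roughly like this (the model path, input shape, and provider order here are illustrative, not my exact script):

    import numpy as np
    import onnxruntime as ort

    # DirectML first, CPU as fallback.
    sess = ort.InferenceSession(
        "unet_fp16.onnx",
        providers=["DmlExecutionProvider", "CPUExecutionProvider"],
    )

    # Dummy half-precision 3D volume matching the exported "input" name.
    x = np.random.rand(1, 1, 64, 64, 64).astype(np.float16)
    y = sess.run(["output"], {"input": x})[0]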

I tried the following:

  • Converted from float32 to float16 using onnxconverter_common (see the sketch after this list).
  • With and without IOBinding; no effect on inference time.
  • Monitoring the GPU, I found that while running float16 the GPU utilization fluctuated between 60% and 80%, whereas with float32 models it ranged from 90% to 100%. GPU RAM usage was halved, as expected.
  • Opset versions 7 and 20 with compatible onnxruntime versions (1.17 - 1.20).
  • I suspected that some float16 operations were not implemented in DirectML, so using NVIDIA Nsight Systems I checked whether there was a sudden large Tx/Rx to the CPU, or process-linked CPU activity, during inference. I couldn't immediately see anything suspicious, but I am a beginner in CPU-GPU profiling.
  • Batch prediction: since the GPU was under-utilized with float16, I experimented with doubling the batch_size. GPU utilization does go above 90%, but the time per prediction also doubles, so there is no net inference speedup.
  • Olive optimization, exporting with float16 model output. The model size did not decrease, nor did it speed up inference. This is outside the scope of this forum, but I mention it for completeness.
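
For completeness, the float16 conversion from the first bullet is roughly the following (file names are illustrative, and keep_io_types is left at its default):

    import onnx
    from onnxconverter_common import float16

    # Load the exported float32 model and convert initializers/ops to float16.
    model_fp32 = onnx.load("unet_fp32.onnx")
    model_fp16 = float16.convert_float_to_float16(model_fp32)
    onnx.save(model_fp16, "unet_fp16.onnx")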

Expected behaviour

I would expect half the inference time, since on the same platform and GPU I can achieve that with the CUDA provider.

Are there any other options that I could try?
Platform: Windows 11
python=3.11.9
onnx=1.16
onnxruntime=1.17 / 1.20
GPU: NVIDIA RTX 2080, 8 GB VRAM

@github-actions github-actions bot added the ep:CUDA, ep:DML and model:transformer labels on Jan 14, 2025