Accuracy with TensorRT 10.7 and self-attention #4328

Open

mrjackbo opened this issue Jan 18, 2025 · 0 comments

I am trying to convert an open-clip (pip install open_clip_torch==2.30.0) model to TensorRT:

import open_clip
import torch

model, _, _ = open_clip.create_model_and_transforms("ViT-SO400M-14-SigLIP-384", pretrained="webli")

image_input = torch.randn((1, 3, 384, 384), dtype=torch.float32)
torch.onnx.export(
    model,
    (image_input,),
    "ViT-SO400M-14-SigLIP-384.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamo=False,
    train=False,
    do_constant_folding=True,
    opset_version=18,
    export_params=True,
    dynamic_axes={"input": {0: "N"}, "output": {0: "N"}},
)

This produces a valid ONNX file, and running it with ONNX Runtime matches the output of the original PyTorch model.
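
For reference, the parity check looks roughly like this (a minimal sketch reusing model and image_input from the export script above; how the PyTorch output maps to "output" is an assumption, since forward() may return a tuple or dict depending on the open_clip version):

import numpy as np
import onnxruntime as ort

# Run the original PyTorch model in eval mode.
model.eval()
with torch.no_grad():
    torch_out = model(image_input)
# Pick the tensor that was captured as "output" during export.
if isinstance(torch_out, (tuple, list)):
    torch_out = torch_out[0]
torch_out = torch_out.numpy()

# Run the exported ONNX file with ONNX Runtime and compare.
sess = ort.InferenceSession(
    "ViT-SO400M-14-SigLIP-384.onnx", providers=["CPUExecutionProvider"]
)
ort_out = sess.run(["output"], {"input": image_input.numpy()})[0]
print("max abs diff:", np.abs(torch_out - ort_out).max())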

To convert the model to TensorRT, I do:

docker run --gpus=all --rm -it -v $(pwd):/model \
    --env POLYGRAPHY_AUTOINSTALL_DEPS=1 \
    nvcr.io/nvidia/tensorrt:24.12-py3 \
    polygraphy run /model/ViT-SO400M-14-SigLIP-384.onnx --onnxrt --trt

[...]

[I]         Error Metrics: output
[I]             Minimum Required Tolerance: elemwise error | [abs=0.088555] OR [rel=1639.2] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0.017181, std-dev=0.013923, var=0.00019386, median=0.014335, min=3.429e-05 at (0, 516), max=0.088555 at (0, 1013), avg-magnitude=0.017181, p90=0.035568, p95=0.043743, p99=0.065308
[I]                 ---- Histogram ----
                    Bin Range           |  Num Elems | Visualization
                    (3.43e-05, 0.00889) |        381 | ########################################
                    (0.00889 , 0.0177 ) |        323 | #################################
                    (0.0177  , 0.0266 ) |        208 | #####################
                    (0.0266  , 0.0354 ) |        122 | ############
                    (0.0354  , 0.0443 ) |         61 | ######
                    (0.0443  , 0.0531 ) |         29 | ###
                    (0.0531  , 0.062  ) |         11 | #
                    (0.062   , 0.0709 ) |          8 | 
                    (0.0709  , 0.0797 ) |          5 | 
                    (0.0797  , 0.0886 ) |          4 | 
[I]             Relative Difference | Stats: mean=5.9796, std-dev=51.921, var=2695.8, median=1.1482, min=0.0033405 at (0, 722), max=1639.2 at (0, 375), avg-magnitude=5.9796, p90=6.266, p95=17.039, p99=72.345
[I]                 ---- Histogram ----
                    Bin Range            |  Num Elems | Visualization
                    (0.00334 , 164     ) |       1147 | ########################################
                    (164     , 328     ) |          3 | 
                    (328     , 492     ) |          1 | 
                    (492     , 656     ) |          0 | 
                    (656     , 820     ) |          0 | 
                    (820     , 984     ) |          0 | 
                    (984     , 1.15e+03) |          0 | 
                    (1.15e+03, 1.31e+03) |          0 | 
                    (1.31e+03, 1.48e+03) |          0 | 
                    (1.48e+03, 1.64e+03) |          1 | 
[E]         FAILED | Output: 'output' | Difference exceeds tolerance (rel=1e-05, abs=1e-05)

Note the magnitude of the relative error (p90=6.266!). This happens on my RTX A4500 Laptop GPU (driver 560) and on my V100 (where I use tensorrt:24.06-py3 instead, since TensorRT 10.7 no longer supports Volta). The FP16/BF16 cases are even worse.

When I do the same conversion with --fp8, the error vanishes (note that neither the A4500 nor the V100 supports FP8 kernels). Comparing the trtexec verbose logs, I found that in the FP32 case TensorRT recognizes the self-attention pattern, but in the FP8 case it does not:

trtexec --onnx=ViT-SO400M-14-SigLIP-384.onnx --verbose
[...]
[01/18/2025-15:16:59] [V] [TRT] Found /visual/trunk/blocks/blocks.18/attn/MatMul to be part of self-attention pattern.                                                                                                                                                                                                        
[01/18/2025-15:16:59] [V] [TRT] Found /visual/trunk/blocks/blocks.18/attn/Softmax to be part of self-attention pattern.                                                                                                                                                                                                       
[01/18/2025-15:16:59] [V] [TRT] Found /visual/trunk/blocks/blocks.18/attn/MatMul_1 to be part of self-attention pattern.                                                                                                                                                                                                      
[01/18/2025-15:16:59] [V] [TRT] Found and reassigned Myelin backends for Self-Attention nodes  
[...]
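
For context, the fused nodes are the standard scaled-dot-product attention chain. A simplified sketch of the pattern in PyTorch (not the exact open_clip implementation, which also has QKV projections and multi-head reshapes around it):

import torch

def attention_core(q, k, v):
    # MatMul -> Softmax -> MatMul_1: the chain that TensorRT reports as a
    # self-attention pattern and hands off to the Myelin backend.
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax((q * scale) @ k.transpose(-2, -1), dim=-1)
    return attn @ v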

This observation got me thinking: when I replace the /attn/Softmax nodes with a custom TensorRT softmax plugin (see the sketch below), the TensorRT optimizer can no longer apply the self-attention fusion, and the resulting engines have acceptable accuracy (even in FP16).
My conclusion: for this model, the Myelin self-attention fusion is somehow buggy.
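
For reference, a rough sketch of the node replacement using onnx_graphsurgeon (the plugin op name CustomSoftmax is a placeholder for my own softmax plugin, which is not shown here):

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("ViT-SO400M-14-SigLIP-384.onnx"))

# Retarget every attention Softmax to a custom plugin op so that the
# self-attention pattern no longer matches during engine building.
for node in graph.nodes:
    if node.op == "Softmax" and "/attn/" in node.name:
        node.op = "CustomSoftmax"  # placeholder name of my softmax plugin

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "ViT-SO400M-14-SigLIP-384.patched.onnx")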
