Accuracy with TensorRT 10.7 and self-attention #4328

Open

mrjackbo opened this issue Jan 18, 2025 · 0 comments

I am trying to convert an open-clip (pip install open_clip_torch==2.30.0) model to TensorRT:

import open_clip
import torch

model, _, _ = open_clip.create_model_and_transforms("ViT-SO400M-14-SigLIP-384", pretrained="webli")

image_input = torch.randn((1, 3, 384, 384), dtype=torch.float32)
torch.onnx.export(
    model,
    (image_input,),
    "ViT-SO400M-14-SigLIP-384.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamo=False,
    train=False,
    do_constant_folding=True,
    opset_version=18,
    export_params=True,
    dynamic_axes={"input": {0: "N"}, "output": {0: "N"}},
)

This produces a valid ONNX file, and running it with ONNX Runtime matches the output of the original PyTorch model.
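
For reference, the parity check looks roughly like this (a minimal sketch reusing model and image_input from the export script above; how the PyTorch output maps to "output" is an assumption, since forward() may return a tuple or dict depending on the open_clip version):

import numpy as np
import onnxruntime as ort

# Run the original PyTorch model in eval mode.
model.eval()
with torch.no_grad():
    torch_out = model(image_input)
# Pick the tensor that was captured as "output" during export.
if isinstance(torch_out, (tuple, list)):
    torch_out = torch_out[0]
torch_out = torch_out.numpy()

# Run the exported ONNX file with ONNX Runtime and compare.
sess = ort.InferenceSession(
    "ViT-SO400M-14-SigLIP-384.onnx", providers=["CPUExecutionProvider"]
)
ort_out = sess.run(["output"], {"input": image_input.numpy()})[0]
print("max abs diff:", np.abs(torch_out - ort_out).max())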

To convert the model to TensorRT, I do:

docker run --gpus=all --rm -it -v $(pwd):/model \
    --env POLYGRAPHY_AUTOINSTALL_DEPS=1 \
    nvcr.io/nvidia/tensorrt:24.12-py3 \
    polygraphy run /model/ViT-SO400M-14-SigLIP-384.onnx --onnxrt --trt

[...]

[I]         Error Metrics: output
[I]             Minimum Required Tolerance: elemwise error | [abs=0.088555] OR [rel=1639.2] (requirements may be lower if both abs/rel tolerances are set)
[I]             Absolute Difference | Stats: mean=0.017181, std-dev=0.013923, var=0.00019386, median=0.014335, min=3.429e-05 at (0, 516), max=0.088555 at (0, 1013), avg-magnitude=0.017181, p90=0.035568, p95=0.043743, p99=0.065308
[I]                 ---- Histogram ----
                    Bin Range           |  Num Elems | Visualization
                    (3.43e-05, 0.00889) |        381 | ########################################
                    (0.00889 , 0.0177 ) |        323 | #################################
                    (0.0177  , 0.0266 ) |        208 | #####################
                    (0.0266  , 0.0354 ) |        122 | ############
                    (0.0354  , 0.0443 ) |         61 | ######
                    (0.0443  , 0.0531 ) |         29 | ###
                    (0.0531  , 0.062  ) |         11 | #
                    (0.062   , 0.0709 ) |          8 | 
                    (0.0709  , 0.0797 ) |          5 | 
                    (0.0797  , 0.0886 ) |          4 | 
[I]             Relative Difference | Stats: mean=5.9796, std-dev=51.921, var=2695.8, median=1.1482, min=0.0033405 at (0, 722), max=1639.2 at (0, 375), avg-magnitude=5.9796, p90=6.266, p95=17.039, p99=72.345
[I]                 ---- Histogram ----
                    Bin Range            |  Num Elems | Visualization
                    (0.00334 , 164     ) |       1147 | ########################################
                    (164     , 328     ) |          3 | 
                    (328     , 492     ) |          1 | 
                    (492     , 656     ) |          0 | 
                    (656     , 820     ) |          0 | 
                    (820     , 984     ) |          0 | 
                    (984     , 1.15e+03) |          0 | 
                    (1.15e+03, 1.31e+03) |          0 | 
                    (1.31e+03, 1.48e+03) |          0 | 
                    (1.48e+03, 1.64e+03) |          1 | 
[E]         FAILED | Output: 'output' | Difference exceeds tolerance (rel=1e-05, abs=1e-05)

Note the magnitude of the relative error (p90=6.266!). This happens on my RTX A4500 Laptop GPU (driver 560) and on my V100 (where I use tensorrt:24.06-py3 instead, since TensorRT 10.7 no longer supports Volta). The FP16/BF16 cases are even worse.

When I do the same conversion with --fp8, the error vanishes (note that neither the A4500 nor the V100 supports FP8 kernels). Comparing the trtexec verbose logs, I found that in the FP32 case TensorRT recognizes the self-attention pattern, but in the FP8 case it does not:

trtexec --onnx=ViT-SO400M-14-SigLIP-384.onnx --verbose
[...]
[01/18/2025-15:16:59] [V] [TRT] Found /visual/trunk/blocks/blocks.18/attn/MatMul to be part of self-attention pattern.                                                                                                                                                                                                        
[01/18/2025-15:16:59] [V] [TRT] Found /visual/trunk/blocks/blocks.18/attn/Softmax to be part of self-attention pattern.                                                                                                                                                                                                       
[01/18/2025-15:16:59] [V] [TRT] Found /visual/trunk/blocks/blocks.18/attn/MatMul_1 to be part of self-attention pattern.                                                                                                                                                                                                      
[01/18/2025-15:16:59] [V] [TRT] Found and reassigned Myelin backends for Self-Attention nodes  
[...]
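
For context, the fused nodes are the standard scaled-dot-product attention chain. A simplified sketch of the pattern in PyTorch (not the exact open_clip implementation, which also has QKV projections and multi-head reshapes around it):

import torch

def attention_core(q, k, v):
    # MatMul -> Softmax -> MatMul_1: the chain that TensorRT reports as a
    # self-attention pattern and hands off to the Myelin backend.
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax((q * scale) @ k.transpose(-2, -1), dim=-1)
    return attn @ v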

This observation got me thinking: when I replace the /attn/Softmax nodes with a custom TensorRT softmax plugin (see the sketch below), the TensorRT optimizer can no longer apply the self-attention fusion, and the resulting engines have acceptable accuracy (even in FP16).
My conclusion: for this model, the Myelin self-attention fusion is somehow buggy.
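
For reference, a rough sketch of the node replacement using onnx_graphsurgeon (the plugin op name CustomSoftmax is a placeholder for my own softmax plugin, which is not shown here):

import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("ViT-SO400M-14-SigLIP-384.onnx"))

# Retarget every attention Softmax to a custom plugin op so that the
# self-attention pattern no longer matches during engine building.
for node in graph.nodes:
    if node.op == "Softmax" and "/attn/" in node.name:
        node.op = "CustomSoftmax"  # placeholder name of my softmax plugin

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "ViT-SO400M-14-SigLIP-384.patched.onnx")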
