TensorRT 8.6.2 MatrixMultiply Operator Quantization #4322
Comments
Which website?
This one: https://docs.nvidia.com/deeplearning/tensorrt/operators/docs/MatrixMultiply.html
Can you upload the logs for both cases, generated with the following command?
trtexec --verbose \
  --best \
  --separateProfileRun \
  --onnx=spec \
  --dumpProfile \
  --dumpLayerInfo --profilingVerbosity=detailed \
  --exportLayerInfo=li.json 2>&1 | tee out.log
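For reference, here is a rough sketch of how the exported layer information could be inspected after running that command, to check which precision each MatrixMultiply layer actually ran in. The JSON field names ("Layers", "Name", "LayerType", "Outputs", "Format/Datatype") are assumptions about the trtexec --exportLayerInfo layout and may differ between TensorRT versions.

# Rough sketch: print each MatMul/MatrixMultiply layer and the datatypes of its
# outputs, so the FP16 vs INT8 precision can be checked quickly.
# NOTE: the JSON field names below are assumptions and may vary by TensorRT version.
import json

with open("li.json") as f:
    info = json.load(f)

for layer in info.get("Layers", []):
    if not isinstance(layer, dict):
        # Some versions export layers as plain strings.
        print(layer)
        continue
    name = layer.get("Name", "<unnamed>")
    outputs = layer.get("Outputs", [])
    dtypes = [o.get("Format/Datatype", "?") for o in outputs if isinstance(o, dict)]
    if "MatMul" in name or "MatrixMultiply" in layer.get("LayerType", ""):
        print(f"{name}: outputs {dtypes}")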
I already posted the logs earlier. The first one is the log without QDQ nodes inserted around MatrixMultiply; the second one is the log after manually inserting QDQ nodes around MatrixMultiply.
I am doing QAT on the HRNet OCR model and using TensorRT 8.6.2 to build an engine from the resulting ONNX model with QDQ nodes. After conversion, I found that the MatrixMultiply operator was not quantized to INT8, as shown in the figure below.
Then I manually inserted QDQ operators in front of the two matrices being multiplied, and after conversion the MatrixMultiply operator was quantized to INT8 (a sketch of this insertion is shown below).
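For illustration only, here is a minimal sketch of that manual insertion using onnx-graphsurgeon. The model file names and the 0.1 scale are placeholders, not the values from my QAT run; in a real QAT model the scales come from the learned amax values.

# Minimal sketch: insert QuantizeLinear/DequantizeLinear (QDQ) pairs in front
# of both inputs of every MatMul node in an ONNX graph.
# Assumptions: file names and the 0.1 per-tensor scale are placeholders.
import numpy as np
import onnx
import onnx_graphsurgeon as gs

def insert_qdq(graph, node, input_index, scale_value):
    """Rewire one MatMul input through a new Q -> DQ pair."""
    tensor = node.inputs[input_index]
    prefix = f"{node.name or node.op}_in{input_index}"
    scale = gs.Constant(f"{prefix}_scale", np.array(scale_value, dtype=np.float32))
    zero_point = gs.Constant(f"{prefix}_zp", np.array(0, dtype=np.int8))
    q_out = gs.Variable(f"{prefix}_q_out", dtype=np.int8)
    dq_out = gs.Variable(f"{prefix}_dq_out", dtype=np.float32)
    graph.nodes.append(gs.Node("QuantizeLinear", name=f"{prefix}_q",
                               inputs=[tensor, scale, zero_point], outputs=[q_out]))
    graph.nodes.append(gs.Node("DequantizeLinear", name=f"{prefix}_dq",
                               inputs=[q_out, scale, zero_point], outputs=[dq_out]))
    node.inputs[input_index] = dq_out  # MatMul now reads the dequantized tensor

graph = gs.import_onnx(onnx.load("hrnet_ocr_qat.onnx"))  # placeholder file name
for node in graph.nodes:
    if node.op == "MatMul":
        for i in range(len(node.inputs)):
            insert_qdq(graph, node, i, scale_value=0.1)  # placeholder scale
graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "hrnet_ocr_qat_qdq.onnx")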
However, there is a problem: the INT8 version of MatrixMultiply takes more time than the original FP16 version. As shown in the figure below, the first bar is the FP16 execution time and the second bar is the INT8 execution time.
Why is this the case?
Moreover, the official documentation says that MatrixMultiply does not support INT8. Why can it still be quantized to INT8 after I manually insert the QDQ nodes?