Describe the issue
Wondering if there's a performance difference between using "pip install onnxruntime" versus building onnxruntime on device (using instructions like https://onnxruntime.ai/docs/build/inferencing.html#linux). I use CPUExecutionProvider on Ubuntu 22.04, on a computer with an RK3588 (which has ARMv8.2 CPUs). I currently use "pip install onnxruntime" and things work perfectly fine, but I'm wondering whether more performance can be squeezed out.

To reproduce
Not a bug, just a general performance question.

Urgency
Not urgent, just a general performance question.

Platform
Linux

OS Version
Ubuntu 22.04

ONNX Runtime Installation
Released Package

ONNX Runtime Version or Commit ID
1.15.1

ONNX Runtime API
Python

Architecture
ARM64

Execution Provider
Default CPU

Execution Provider Library Version
No response

Model File
No response

Is this a quantized model?
No
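One way to answer this empirically is to time the same model under both installs and compare. Below is a minimal timing sketch; "model.onnx", the float32 input type, and the batch size of 1 are placeholder assumptions, so substitute your own model and input:

```python
# Minimal sketch: time CPUExecutionProvider inference with the pip wheel,
# so the same numbers can be compared against a source-built wheel.
import time

import numpy as np
import onnxruntime as ort

# "model.onnx" is a placeholder path; the input is assumed to be float32.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
inp = sess.get_inputs()[0]
# Replace any symbolic/dynamic dims with a concrete size of 1.
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
data = np.random.rand(*shape).astype(np.float32)

# Warm up so one-time initialization doesn't skew the numbers, then time.
for _ in range(10):
    sess.run(None, {inp.name: data})
n = 100
start = time.perf_counter()
for _ in range(n):
    sess.run(None, {inp.name: data})
print(f"mean latency: {(time.perf_counter() - start) / n * 1e3:.2f} ms")
```

Running this script once against the pip wheel and once against a source-built wheel (in separate virtual environments) should show directly whether a local build buys anything on the RK3588.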
Though the base architecture targeted is armv8-a, onnxruntime supports dynamic dispatch to architecture-specific kernels at runtime. For example, the Python wheel includes MMLA, fp16, and bf16 kernels. So I doubt anything is missing from the wheel that a source build could recover. By the way, do you have any particular kernel in mind?
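For reference, here is a minimal sketch (Linux/arm64 only, standard library) that checks which CPU feature flags the kernel advertises in /proc/cpuinfo; runtime dispatch can only select a kernel whose required feature the CPU actually reports. The flag-to-kernel mapping in the comments is an informal summary, not an onnxruntime API:

```python
# Sketch: list the arm64 feature flags relevant to the kernels mentioned
# above. Flag names are the standard arm64 strings from /proc/cpuinfo.
FEATURES_OF_INTEREST = {
    "fphp": "scalar fp16 arithmetic",
    "asimdhp": "vector (NEON) fp16 arithmetic",
    "asimddp": "int8 dot product (sdot/udot)",
    "i8mm": "int8 matrix multiply (MMLA)",
    "bf16": "bfloat16 arithmetic",
}

def cpu_features():
    flags = set()
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("Features"):
                flags.update(line.split(":", 1)[1].split())
    return flags

if __name__ == "__main__":
    present = cpu_features()
    for flag, desc in FEATURES_OF_INTEREST.items():
        print(f"{flag:<8} {'yes' if flag in present else 'no':<4} ({desc})")
```

If a flag is absent (for example, i8mm or bf16 on ARMv8.2 cores like the RK3588's), the corresponding fast-path kernels cannot run regardless of whether the wheel was installed via pip or built from source.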