Matmul CPU performance regression #3072

AlexandreEichenberger · 2025-02-12T19:36:40Z

When lowering the KrnlMatmul operation, we generate different code for full/partial and SIMD/scalar tiles. Until now, the temporary buffers used to hold the kernel's register values were allocated with alignedAlloc inside an scf.if. This prevented MLIR's buffer hoisting optimization from hoisting the alloc/free out of the inner loops (which iterate over all of the tiles / panels of the matmul).

This resulted in a 2x slowdown compared to earlier versions.

This PR fixes this by pre-allocating the data outside of the scf.if, thus enabling the buffer hoisting pass to move the allocations outside of the entire matmul loop nest (including the 3 nested loops iterating over the tiles).
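As a conceptual illustration of why the placement of the allocation matters, here is a minimal Python sketch. It is not onnx-mlir or MLIR code: the function names, the tile counts, and the buffer size are all hypothetical, and plain Python lists stand in for the scratch buffers. It only counts how many buffers are created when the allocation sits inside the per-tile conditional versus once outside the loop nest.

```python
# Conceptual sketch only: plain Python standing in for the generated loop nest.
# Every name and size below is hypothetical and chosen for illustration.

TILE_BUF_ELEMS = 64  # assumed size of the small per-tile scratch buffer


def full_tile_kernel(buf, i, j, k):
    """Stand-in for the full-tile code path; uses buf as scratch."""
    buf[0] = float(i + j + k)
    return buf[0]


def partial_tile_kernel(buf, i, j, k):
    """Stand-in for the partial-tile code path; uses buf as scratch."""
    buf[0] = 0.5 * (i + j + k)
    return buf[0]


def alloc_inside_branch(ni, nj, nk):
    """Allocate the scratch buffer inside each branch: one allocation per tile."""
    allocs = 0
    total = 0.0
    for i in range(ni):
        for j in range(nj):
            for k in range(nk):
                if k < nk - 1:  # treat all but the last k tile as "full"
                    buf = [0.0] * TILE_BUF_ELEMS
                    allocs += 1
                    total += full_tile_kernel(buf, i, j, k)
                else:
                    buf = [0.0] * TILE_BUF_ELEMS
                    allocs += 1
                    total += partial_tile_kernel(buf, i, j, k)
    return allocs, total


def alloc_hoisted(ni, nj, nk):
    """Allocate once before the loop nest and reuse the buffer for every tile."""
    buf = [0.0] * TILE_BUF_ELEMS
    allocs = 1
    total = 0.0
    for i in range(ni):
        for j in range(nj):
            for k in range(nk):
                kernel = full_tile_kernel if k < nk - 1 else partial_tile_kernel
                total += kernel(buf, i, j, k)
    return allocs, total


if __name__ == "__main__":
    print(alloc_inside_branch(4, 3, 5))
    print(alloc_hoisted(4, 3, 5))
```

Both variants compute the same result; only the number of allocations differs (one per tile versus one in total). Reusing the buffer is safe in this sketch because each kernel overwrites the scratch contents before reading them.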

Because the data is small, it's ok to use alloca in sequential mode (parallel disabled). This does not work when parallel is enabled, because the allocation would remain stuck inside the outermost omp parallel for loop. TODO: migrate the alloc/alloca out of the omp for loop and into the enclosing omp parallel region. Then alloca should be used in parallel mode as well.
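The idea behind the TODO can be pictured with a rough analogy, assuming each worker handles several iterations of the outermost loop. The sketch below uses Python threads rather than OpenMP, and `process_chunk`, `run_parallel`, and `SCRATCH_ELEMS` are hypothetical names. Each worker creates its scratch buffer once, before entering its share of the loop, instead of once per iteration.

```python
# Analogy only: Python threads standing in for an OpenMP parallel region.
# All names and sizes are hypothetical.
from concurrent.futures import ThreadPoolExecutor

SCRATCH_ELEMS = 64  # assumed size of the small scratch buffer


def process_chunk(rows, nj):
    """Work done by one worker: allocate scratch once, then loop over its rows."""
    scratch = [0.0] * SCRATCH_ELEMS  # one allocation per worker, not per iteration
    total = 0.0
    for i in rows:
        for j in range(nj):
            scratch[0] = float(i * nj + j)
            total += scratch[0]
    return 1, total  # (number of allocations, partial result)


def run_parallel(ni, nj, workers):
    """Split the outer loop across workers and combine their partial results."""
    chunks = [range(w, ni, workers) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda rows: process_chunk(rows, nj), chunks))
    allocs = sum(a for a, _ in results)
    total = sum(t for _, t in results)
    return allocs, total


if __name__ == "__main__":
    print(run_parallel(8, 4, 4))
```

With this structure the number of allocations scales with the number of workers rather than with the number of loop iterations.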

Migrating the alloc outside removes 80% of the overhead; migrating to alloca removes nearly all of it.

Signed-off-by: Alexandre Eichenberger <[email protected]>

AlexandreEichenberger · 2025-02-12T20:27:19Z

Time for CPU CCFD with 4.2

/metis-workspace/main-4.2/onnx-mlir/build/Debug/bin/onnx-mlir
ONNX_MLIR_INSTRUMENT_FILE=ccfd-4.2.log RunONNXModel.py -m ccfd_dynamic.onnx -c="-shapeInformation=0:7x160 -mcpu=z16  -O3 -profile-ir=Onnx" -w 2 -n 10
root@ccfeee9c080a:/metis-workspace/main-4.2/amodels/ccfd# make-report.py -ums -w 2 -r ccfd-4.2.log
Statistics start, all ops, ordered_by time, tot_time 46.5553333
  onnx.LSTM, 2, 23.2680833, 46.5361667, 100.0%
  onnx.MatMul, 1, 0.0150000, 0.0150000, 0.0%
  onnx.Sigmoid, 1, 0.0060000, 0.0060000, 0.0%
  onnx.Constant, 11, 0.0003636, 0.0040000, 0.0%
  onnx.Squeeze, 2, 0.0005833, 0.0011667, 0.0%
  onnx.Add, 1, 0.0010000, 0.0010000, 0.0%
Statistics end, all ops, ordered_by time, tot_time 46.5553333

Dev Version before the fix

ONNX_MLIR_INSTRUMENT_FILE=ccfd-dev.log RunONNXModel.py -m ccfd_dynamic.onnx -c="-shapeInformation=0:7x160 -mcpu=z16  -O3 -profile-ir=Onnx" -w 2 -n 10
/metis-workspace/main/onnx-mlir/build/Debug/bin/onnx-mlir
root@ccfeee9c080a:/metis-workspace/main/amodels/ccfd# make-report.py -ums -w 2 -r ccfd-dev.log
Statistics start, all ops, ordered_by time, tot_time 128.4006667
  onnx.LSTM, 2, 64.1216667, 128.2433333, 99.9%
  onnx.MatMul, 1, 0.0256667, 0.0256667, 0.0%
  onnx.Sigmoid, 1, 0.0066667, 0.0066667, 0.0%
  onnx.Squeeze, 2, 0.0005833, 0.0011667, 0.0%
  onnx.Add, 1, 0.0010000, 0.0010000, 0.0%
  onnx.Constant, 3, 0.0001667, 0.0005000, 0.0%
Statistics end, all ops, ordered_by time, tot_time 128.4006667

And the version with the fix

Statistics start all ops ordered_by time, tot_time,  49.9953333
  onnx.LSTM, 2, 24.8744167, 49.7488333, 99.5%
  onnx.MatMul, 1, 0.0170000, 0.0170000, 0.0%
  onnx.Sigmoid, 1, 0.0075000, 0.0075000, 0.0%
  onnx.Constant, 3, 0.0004444, 0.0013333, 0.0%
  onnx.Add, 1, 0.0010000, 0.0010000, 0.0%
  onnx.Squeeze, 2, 0.0000833, 0.0001667, 0.0%
Statistics end all ops ordered_by time, tot_time,  49.9953333

So the fixed version is within 4 ms of the 4.2 time (49.995 vs. 46.555), recovering about 95% of the original degradation.
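The figures above can be checked with a few lines of arithmetic on the three tot_time values reported in the logs. The variable names are only labels for this sketch.

```python
# tot_time values copied from the three reports above
t_42 = 46.5553333    # 4.2 build
t_dev = 128.4006667  # dev build before the fix
t_fix = 49.9953333   # dev build with the fix

slowdown = t_dev / t_42                # dev vs. 4.2, roughly 2.76x on this model
regression = t_dev - t_42              # time lost in the dev build
remaining = t_fix - t_42               # gap still left after the fix, about 3.44
recovered = 1 - remaining / regression # fraction of the regression recovered

print(f"slowdown before fix: {slowdown:.2f}x")
print(f"remaining gap: {remaining:.2f}")
print(f"recovered: {recovered:.1%}")
```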

tungld

LGTM!

If possible, could you add some lit tests for different situations, such as with and without parallelism? Thanks!

Also I wonder whether we should do the same for compiler-generated stick/unstick or not.

AlexandreEichenberger · 2025-02-13T14:26:39Z

I agree, I was surprised there were no lit tests. Let me add them in a subsequent PR so that the perf team may start evaluating the benchmarks right away.

jenkins-droid · 2025-02-13T20:27:26Z

Jenkins Linux amd64 Build #16288 [push] null... failed after 1 hr 22 min

jenkins-droid · 2025-02-13T20:53:33Z

Jenkins Linux s390x Build #16290 [push] null... failed after 1 hr 48 min

AlexandreEichenberger added 23 commits December 19, 2024 16:20

merge from remote branch

ee16dee

Signed-off-by: Alexandre Eichenberger <[email protected]>

added files

5b6b918

Signed-off-by: Alexandre Eichenberger <[email protected]>

fix tests

5e7e21f

Signed-off-by: Alexandre Eichenberger <[email protected]>

update

97d871a

update

903fcb4

Signed-off-by: Alexandre Eichenberger <[email protected]>

update

0cb084d

update

7cc9a95

update

fb39de3

update

122c804

update

a4e3d3b

update

3092abb

update

e6806ee

update

2e1310e

revert KrnlMatmul to use alloca

d3520de

Signed-off-by: Alexandre Eichenberger <[email protected]>

fix

f3b7b24

Signed-off-by: Alexandre Eichenberger <[email protected]>

move alloc out of if

5dcd2c9

Signed-off-by: Alexandre Eichenberger <[email protected]>

move to alloca

d72a8af

Signed-off-by: Alexandre Eichenberger <[email protected]>

alloca in sequtial, malloc in parallel for krnlMatmul

642bb45

Signed-off-by: Alexandre Eichenberger <[email protected]>

fixes

a1dd524

Signed-off-by: Alexandre Eichenberger <[email protected]>

format

959daa2

Signed-off-by: Alexandre Eichenberger <[email protected]>

update

3a1b1a0

remove enableSIMD which was not used

5357258

Signed-off-by: Alexandre Eichenberger <[email protected]>

format

e0455ff

Signed-off-by: Alexandre Eichenberger <[email protected]>

AlexandreEichenberger requested review from tungld and chentong319 February 12, 2025 21:11

tungld approved these changes Feb 13, 2025

Merge branch 'main' into cpu-matmul

a33905a

AlexandreEichenberger merged commit 409a12c into onnx:main Feb 13, 2025
6 checks passed
