Matmul CPU performance regression #3072

Merged
merged 24 commits into from
Feb 13, 2025

Conversation

AlexandreEichenberger
Collaborator

When lowering the KrnlMatmul operation, we generate different code for full/partial and SIMD/scalar tiles. Until now, the temporary buffer used to hold the kernel's registers was allocated using alignedAlloc inside an scf.if. This prevented MLIR's buffer hoisting optimization from hoisting the alloc/free out of the inner loops (which iterate over all of the tiles/panels of the matmul).

This resulted in a 2x slowdown compared to earlier versions.

This PR fixes this by pre-allocating the data outside of the scf.if, thus enabling the buffer hoisting pass to move the allocations outside of the entire matmul loop nest (including the 3 nested loops iterating over the tiles).

Because the data is small, it's OK to use alloca in sequential mode (parallel disabled). This does not work when parallel is enabled, because the allocation would remain stuck inside the outermost omp parallel for loop. TODO: migrate the alloc/alloca out of the omp for loop into the enclosing omp parallel region, after which alloca can be used there as well.

Migrating the alloc outside removes 80% of the overhead; migrating to alloca removes nearly all of it.
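As a rough illustration of why the position of the allocation matters, here is a toy Python model (not MLIR, and not the actual lowering code). `CountingAllocator`, `is_full_tile`, and the tile counts are hypothetical names and values chosen for the sketch. It shows the end result after the buffer hoisting pass has moved the allocation out of the tile loops, and only counts allocations; it does not model their cost.

```python
class CountingAllocator:
    """Hypothetical stand-in for alloc/free; it only counts allocations."""

    def __init__(self):
        self.allocations = 0

    def alloc(self, num_elems):
        self.allocations += 1
        return [0.0] * num_elems


def is_full_tile(i, j, k, n_i, n_j, n_k):
    # Assumption for illustration: the last tile along each dimension is partial.
    return i < n_i - 1 and j < n_j - 1 and k < n_k - 1


def tiles_alloc_in_branch(n_i, n_j, n_k, allocator, tile_elems=16):
    """Before: each branch of the conditional allocates its own temporary."""
    for i in range(n_i):
        for j in range(n_j):
            for k in range(n_k):
                if is_full_tile(i, j, k, n_i, n_j, n_k):
                    tmp = allocator.alloc(tile_elems)  # full-tile kernel
                else:
                    tmp = allocator.alloc(tile_elems)  # partial-tile kernel
                tmp[0] = 1.0  # placeholder for the kernel body


def tiles_alloc_hoisted(n_i, n_j, n_k, allocator, tile_elems=16):
    """After: one temporary, allocated before the loop nest, shared by all tiles."""
    tmp = allocator.alloc(tile_elems)
    for i in range(n_i):
        for j in range(n_j):
            for k in range(n_k):
                if is_full_tile(i, j, k, n_i, n_j, n_k):
                    tmp[0] = 1.0  # full-tile kernel reuses the buffer
                else:
                    tmp[0] = 0.5  # partial-tile kernel reuses the same buffer


before = CountingAllocator()
tiles_alloc_in_branch(4, 3, 2, before)

after = CountingAllocator()
tiles_alloc_hoisted(4, 3, 2, after)

print(before.allocations, after.allocations)  # prints "24 1"
```

With 4 x 3 x 2 tiles, the first variant allocates once per tile while the second allocates once in total, which is the effect the hoisting pass can only achieve when the allocation is not hidden inside the conditional.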

Signed-off-by: Alexandre Eichenberger <[email protected]>
@AlexandreEichenberger
Collaborator Author

Time for CPU CCFD with 4.2

/metis-workspace/main-4.2/onnx-mlir/build/Debug/bin/onnx-mlir
ONNX_MLIR_INSTRUMENT_FILE=ccfd-4.2.log RunONNXModel.py -m ccfd_dynamic.onnx -c="-shapeInformation=0:7x160 -mcpu=z16  -O3 -profile-ir=Onnx" -w 2 -n 10
root@ccfeee9c080a:/metis-workspace/main-4.2/amodels/ccfd# make-report.py -ums -w 2 -r ccfd-4.2.log
Statistics start, all ops, ordered_by time, tot_time 46.5553333
  onnx.LSTM, 2, 23.2680833, 46.5361667, 100.0%
  onnx.MatMul, 1, 0.0150000, 0.0150000, 0.0%
  onnx.Sigmoid, 1, 0.0060000, 0.0060000, 0.0%
  onnx.Constant, 11, 0.0003636, 0.0040000, 0.0%
  onnx.Squeeze, 2, 0.0005833, 0.0011667, 0.0%
  onnx.Add, 1, 0.0010000, 0.0010000, 0.0%
Statistics end, all ops, ordered_by time, tot_time 46.5553333

Dev Version before the fix

ONNX_MLIR_INSTRUMENT_FILE=ccfd-dev.log RunONNXModel.py -m ccfd_dynamic.onnx -c="-shapeInformation=0:7x160 -mcpu=z16  -O3 -profile-ir=Onnx" -w 2 -n 10
/metis-workspace/main/onnx-mlir/build/Debug/bin/onnx-mlir
root@ccfeee9c080a:/metis-workspace/main/amodels/ccfd# make-report.py -ums -w 2 -r ccfd-dev.log
Statistics start, all ops, ordered_by time, tot_time 128.4006667
  onnx.LSTM, 2, 64.1216667, 128.2433333, 99.9%
  onnx.MatMul, 1, 0.0256667, 0.0256667, 0.0%
  onnx.Sigmoid, 1, 0.0066667, 0.0066667, 0.0%
  onnx.Squeeze, 2, 0.0005833, 0.0011667, 0.0%
  onnx.Add, 1, 0.0010000, 0.0010000, 0.0%
  onnx.Constant, 3, 0.0001667, 0.0005000, 0.0%
Statistics end, all ops, ordered_by time, tot_time 128.4006667

Dev version with the fix

Statistics start all ops ordered_by time, tot_time,  49.9953333
  onnx.LSTM, 2, 24.8744167, 49.7488333, 99.5%
  onnx.MatMul, 1, 0.0170000, 0.0170000, 0.0%
  onnx.Sigmoid, 1, 0.0075000, 0.0075000, 0.0%
  onnx.Constant, 3, 0.0004444, 0.0013333, 0.0%
  onnx.Add, 1, 0.0010000, 0.0010000, 0.0%
  onnx.Squeeze, 2, 0.0000833, 0.0001667, 0.0%
Statistics end all ops ordered_by time, tot_time,  49.9953333

So the fixed version is within 4 ms of the 4.2 time, recovering about 95% of the original degradation.
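The arithmetic behind that summary can be checked directly from the three `tot_time` values reported above. This is a small sketch; the variable names are made up here, and the times are taken as printed by the report (the comment treats them as ms).

```python
# tot_time values copied from the three reports above
baseline_4_2 = 46.5553333     # 4.2 build
dev_before_fix = 128.4006667  # dev build before the fix
dev_with_fix = 49.9953333     # dev build with the fix

degradation = dev_before_fix - baseline_4_2  # time lost to the regression
remaining = dev_with_fix - baseline_4_2      # gap left after the fix, about 3.44
recovered = 1.0 - remaining / degradation    # fraction recovered, about 0.958
slowdown = dev_before_fix / baseline_4_2     # regression factor, about 2.76

print(f"remaining gap: {remaining:.2f}")
print(f"recovered: {recovered:.1%}")
print(f"slowdown before fix: {slowdown:.2f}x")
```

Nearly all of the measured time is in `onnx.LSTM`, so these totals mostly reflect the matmul kernels used inside the LSTM lowering rather than the standalone `onnx.MatMul` op.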

Collaborator

@tungld tungld left a comment

LGTM!

If possible, could you add some lit tests for different situations like parallelism, no-parallelism? Thanks!

Also I wonder whether we should do the same for compiler-generated stick/unstick or not.

@AlexandreEichenberger
Collaborator Author

I agree, I was surprised there were no lit tests. Let me add them in a subsequent PR so that the perf team may start evaluating the benchmarks right away.

@AlexandreEichenberger AlexandreEichenberger merged commit 409a12c into onnx:main Feb 13, 2025
6 checks passed
@jenkins-droid
Collaborator

Jenkins Linux amd64 Build #16288 [push] null... failed after 1 hr 22 min

@jenkins-droid
Collaborator

Jenkins Linux s390x Build #16290 [push] null... failed after 1 hr 48 min
