-
Notifications
You must be signed in to change notification settings - Fork 336
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Matmul CPU performance regression #3072
Matmul CPU performance regression #3072
Conversation
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Signed-off-by: Alexandre Eichenberger <[email protected]>
Time for CPU CCFD with 4.2
Dev Version before the fix
and version with fix
So within 4 ms of original dev time, covering 95% of original degradation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
If possible, could you add some lit tests for different situations like parallelism, no-parallelism? Thanks!
Also I wonder whether we should do the same for compiler-generated stick/unstick or not.
I agree, I was surprised there were no lit tests. Let me add them in a subsequent PR so that the perf team may start evaluating the benchmarks right away. |
Jenkins Linux amd64 Build #16288 [push] null... failed after 1 hr 22 min |
Jenkins Linux s390x Build #16290 [push] null... failed after 1 hr 48 min |
When lowering the KrnlMatmul operation, we generate different code for full/partial simd/scalar tile. Currently, the temporary used to store the kernel's register were allocated using
alignedAlloc
, inside ascf:if
. This prevented the buffer hoisting optimization from MLIR to hoist the alloc/free from inside the inner loops (iterating over all of the tiles / panel of mammals).This resulted in a 2x slow down compared to earlier versions.
This PR fixes this by pre-allocating the data outside of the
scf-if
, thus enabling the buffer hoisting pass to move the allocations outside of the entire matmul loops (including the 3 nested loop iterating over the tiles.Because the data is small, it's ok to use
alloca
in sequential mode (parallel disabled). This does not work when parallel is enabled, because the alloc would remain stuck inside of the outermostomp parallel for
loop. TODO: migrate the alloc/alloca outside of theomp for
loop into theomp parallel
region. Then alloca should be used as well.Migrating the alloc outside reduces 80% of the overheads, migrating to alloca nearly all of the removed the overheads.