Op fusion experiment #1817
Conversation
Signed-off-by: chentong319 <chentong@us.ibm.com>
Signed-off-by: chentong319 <chentong@us.ibm.com>
@chentong319 So it seems to be about 30% faster. What were the original sizes in the benchmark it came from, just to get a feel?
Also, if it is a test that you don't expect to merge in, I would convert it to draft. Tx
@jenkins-droid test it please.
@chentong319 Thanks for creating this PR. I tested lowering the following mlir with this PR.
@chentong319 Sorry.. please ignore the previous comment. The static dim case above is ok by fusing two ops, but maybe fusing three ops is needed to support the dynamic dim case?
@imaihal did you see any improvement with your practical model?
@tungld It seems the elapsed time in the Concat-Transpose-Shape part becomes about 60% faster. I will check again and show the result offline.
Signed-off-by: chentong319 <chentong@us.ibm.com>
@chentong319 thanks for proposing a fused op!
My three main comments are:
- Creating an IndexExpr-based ShapeHelper so that our DimAnalysis can utilize it (a rough sketch of the dimension computation follows this list).
- Adding a lit test for the lowering of ConcatShapeTranspose.
- ShapeOp may be decomposed into DimOp and Constant. Please make sure we fuse Concat, Shape, and Transpose before Shape vanishes. This is perhaps a topic for the next PR.
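On the first point, here is a rough sketch of the IndexExpr-based dimension computation such a ShapeHelper could encapsulate, written against the DimsExpr/IndexExpr/ArrayAttrIntVal conventions in the diff below. The function name and parameter list are purely illustrative, not the actual onnx-mlir ShapeHelper interface:

```cpp
// Illustrative only: the dimension arithmetic an IndexExpr-based ShapeHelper
// for ConcatShapeTranspose could perform. The function name and parameters are
// hypothetical; it assumes onnx-mlir's IndexExpr utilities (DimsExpr,
// IndexExpr, ArrayAttrIntVal) are in scope.
static void computeTransposedConcatDims(llvm::ArrayRef<DimsExpr> inputsDims,
    int64_t axis, mlir::ArrayAttr permAttr, DimsExpr &outputTransposeDims) {
  uint64_t commonRank = inputsDims[0].size();
  // Concat part: start from the first input's dims, sum along the concat axis.
  DimsExpr outputConcatDims(inputsDims[0]);
  for (unsigned j = 1; j < inputsDims.size(); ++j)
    outputConcatDims[axis] = outputConcatDims[axis] + inputsDims[j][axis];
  // Transpose part: permute the concatenated dims according to perm.
  outputTransposeDims.resize(commonRank);
  for (uint64_t i = 0; i < commonRank; ++i)
    outputTransposeDims[i] = outputConcatDims[ArrayAttrIntVal(permAttr, i)];
}
```

Because the dims stay symbolic as IndexExpr values rather than being materialized, DimAnalysis could then relate the fused op's output dims to its inputs.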
DimsExpr outputTransposeDims(commonRank);
auto permAttr = operandAdaptor.perm();
for (uint64_t i = 0; i < commonRank; i++) {
auto current = outputConcatDims[ArrayAttrIntVal(permAttr, i)]; |
Use a concrete type instead of auto here.
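For instance, since outputConcatDims is a DimsExpr, the concrete element type is IndexExpr, so the line could read:

```cpp
IndexExpr current = outputConcatDims[ArrayAttrIntVal(permAttr, i)];
```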
LGTM! Thanks for the changes!
This is an experimental PR to check the performance impact of Op fusion.
It has been detected that the following pattern occurs repeatedly in a model:
Since the current MLIR loop fusion cannot fuse the loops generated from these Ops, I am trying to fuse them at Op level just to see the performance impact.
This PR introduces a new ONNX Op, ConcatShapeTranspose, and copies the shape inference and krnl lowering code of the three original Ops into the krnl lowering for the new Op. I didn't add shape inference for the new Op because the code is only for an experiment, and the op fusion should occur after shape inference anyway.
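As a rough sketch (not the PR's actual code) of what the op-level match boils down to: find a Concat whose result feeds only a Shape and a Transpose, then replace the pair with the fused op. ONNXConcatOp/ONNXShapeOp/ONNXTransposeOp are the generated onnx-mlir op classes; the helper below is purely illustrative, and the actual rewrite step is omitted because the fused op's builder is defined by this PR.

```cpp
#include "mlir/IR/Operation.h"
#include "src/Dialect/ONNX/ONNXOps.hpp"

using namespace mlir;

// Illustrative sketch only: return true when concatOp's result is consumed
// only by Shape and Transpose ops, i.e. the pattern the fused op replaces.
// Building ONNXConcatShapeTransposeOp and replacing the two results would
// follow in the actual rewrite.
static bool matchesConcatShapeTranspose(ONNXConcatOp concatOp) {
  ONNXShapeOp shapeOp;
  ONNXTransposeOp transposeOp;
  for (Operation *user : concatOp.getResult().getUsers()) {
    if (auto s = dyn_cast<ONNXShapeOp>(user))
      shapeOp = s;
    else if (auto t = dyn_cast<ONNXTransposeOp>(user))
      transposeOp = t;
    else
      return false; // Another consumer still needs the Concat result itself.
  }
  return shapeOp && transposeOp;
}
```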
The baseline model is
The fused model is
I tried the experiment on my MacBook for two data sizes, (100x200, 100x300) and (1000x2000, 1000x3000), and measured the time with two methods: 1) time in the python driver; 2) time from onnx instrumentation.
The times measured in the python driver are similar for both versions. The fused version has a significantly shorter execution time measured with instrumentation. This needs further investigation.
Here are the details
(100x200, 100x300)
(1000x2000, 1000x3000)