inference: refactor the core loops to use less memory #45276
Conversation
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
@nanosoldier
@nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
Your package evaluation job has completed - possible new issues were detected. A full report can be found here.
Force-pushed from e0074a2 to 9c11c71 (compare)
@nanosoldier
Your package evaluation job has completed - no new issues were detected. A full report can be found here.
LGTM! Good work Shuhei!
Edit: Just saw PkgEval results, I guess there is more work to be done. Let me know if you want help, as I read it somewhat thoroughly to do this review (though obviously not thoroughly enough to catch any bugs 😂 )
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
Force-pushed from 406cb61 to 2336683 (compare)
@nanosoldier
Your package evaluation job has completed - no new issues were detected. A full report can be found here.
Force-pushed from 2336683 to c642a71 (compare)
@nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
Your package evaluation job has completed - possible new issues were detected. A full report can be found here.
Force-pushed from d6a4417 to 87d863c (compare)
Force-pushed from 72c03d4 to c245fb8 (compare)
@nanosoldier
Force-pushed from c67e37d to 283aee5 (compare)
@nanosoldier
Something went wrong when running your job:
Logs and partial data can be found here
@nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
Force-pushed from cde6129 to eb6d7ca (compare)
Currently inference uses `O(<number of statements>*<number of slots>)` state in the core inference loop. This is usually fine, because users don't tend to write functions that are particularly long. However, MTK does generate functions that are excessively long, and we've observed MTK models that spend 99% of their inference time just allocating and copying this state. It is possible to get away with significantly smaller state, and this PR is a first step in that direction, reducing the state to `O(<number of basic blocks>*<number of slots>)`. Further improvements are possible by making use of slot liveness information and only storing those slots that are live across a particular basic block.

The core change here is to keep a full set of `slottypes` only at basic block boundaries rather than at each statement. For statements in between, the full variable state can be recovered by linearly scanning through the basic block, taking note of slot assignments (together with the SSA type) and NewVarNodes.

The current status of this branch is that the changes appear correct (no known functional regressions) and significantly improve the MTK test cases in question (no exact benchmarks here for now, since the branch still needs a number of fixes before final numbers make sense), but somewhat regress optimizer quality (which is expected and just a missing TODO) and bootstrap time (which is not expected and something I need to dig into).

Co-Authored-By: Keno Fischer <keno@juliacomputing.com>
Force-pushed from eb6d7ca to 973ff33 (compare)
@nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
The nanosoldier results look nice. Going to merge.
A minimum version of #43999, without any apparent regressions.
The original PR description by @Keno will follow:
Currently inference uses `O(<number of statements>*<number of slots>)` state in the core inference loop. This is usually fine, because users don't tend to write functions that are particularly long. However, MTK does generate functions that are excessively long, and we've observed MTK models that spend 99% of their inference time just allocating and copying this state. It is possible to get away with significantly smaller state, and this PR is a first step in that direction, reducing the state to `O(<number of basic blocks>*<number of slots>)`. Further improvements are possible by making use of slot liveness information and only storing those slots that are live across a particular basic block.

The core change here is to keep a full set of `slottypes` only at basic block boundaries rather than at each statement. For statements in between, the full variable state can be recovered by linearly scanning through the basic block, taking note of slot assignments (together with the SSA type) and NewVarNodes.

The current status of this branch is that the changes appear correct (no known functional regressions) and significantly improve the MTK test cases in question (no exact benchmarks here for now, since the branch still needs a number of fixes before final numbers make sense), but somewhat regress optimizer quality (which is expected and just a missing TODO) and bootstrap time (which is not expected and something I need to dig into).
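The reconstruction step described above can be sketched as a toy model. This is not the actual `Core.Compiler` code: the `Assign` and `NewVar` structs and the `state_at` function are hypothetical stand-ins for slot assignments (with their inferred type) and `NewvarNode`s, and the per-slot state is simplified to a plain type per slot.

```julia
# Toy model: store the full slot state only at basic-block entry,
# and recover the state at any statement by replaying the block.

struct Assign        # stand-in for a slot assignment with its inferred type
    slot::Int
    typ::Any
end

struct NewVar        # stand-in for a NewvarNode (slot becomes undefined)
    slot::Int
end

# Recover the state just before statement `idx` (1-based within the block)
# by replaying the block's effects on a copy of the entry state.
# Cost is O(block length), traded for not storing O(statements) states.
function state_at(entry::Vector{Any}, stmts::Vector{Any}, idx::Int)
    state = copy(entry)                  # O(nslots) copy, once per query
    for i in 1:idx-1
        stmt = stmts[i]
        if stmt isa Assign
            state[stmt.slot] = stmt.typ  # slot takes the assigned type
        elseif stmt isa NewVar
            state[stmt.slot] = Union{}   # slot reverts to "undefined"
        end
    end
    return state
end

entry = Any[Int, Float64, String]
stmts = Any[Assign(1, Bool), NewVar(3), Assign(2, Int)]
state_at(entry, stmts, 3)  # Any[Bool, Float64, Union{}]
```

The memory trade-off is visible here: only `entry` is retained per block, while the per-statement states are rematerialized on demand by a linear scan.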
@nanosoldier
runbenchmarks(!"scalar", vs=":master")