Very WIP: Refactor core inference loops to use less memory #43999
Conversation
Force-pushed from 9ae2a51 to 36e3abe
Rebased against the latest master, and dropped the changes from #43899 (I believe these two PRs are orthogonal and can be merged separately). There still seem to be some problems in this PR; I will try to dig into them.
Force-pushed from 2ec47da to 60a82a6
Currently inference uses `O(<number of statements>*<number of slots>)` state in the core inference loop. This is usually fine, because users don't tend to write functions that are particularly long. However, MTK does generate functions that are excessively long, and we've observed MTK models that spend 99% of their inference time just allocating and copying this state.

It is possible to get away with significantly smaller state, and this PR is a first step in that direction, reducing the state to `O(<number of basic blocks>*<number of slots>)`. Further improvements are possible by making use of slot liveness information and storing only those slots that are live across a particular basic block.

The core change here is to keep a full set of `slottypes` only at basic block boundaries rather than at each statement. For statements in between, the full variable state can be recovered by linearly scanning through the basic block, taking note of slot assignments (together with the SSA type) and NewVarNodes.

The current status of this branch is that the changes appear correct (no known functional regressions) and significantly improve the MTK test cases in question (no exact benchmarks for now, since the branch still needs a number of fixes before final numbers make sense), but they somewhat regress optimizer quality (which is expected and just a missing TODO) and bootstrap time (which is not expected and something I need to dig into).
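To illustrate the idea, here is a toy sketch (not the actual `Core.Compiler` code) of how the per-statement variable state can be recovered from the state stored at the entry of a basic block. The types `BlockEntryState`, `SlotAssign`, `NewVarNode` and the function `recover_state` are hypothetical simplifications for illustration only.

```julia
# Minimal stand-ins for slot assignments and NewVarNode statements.
struct SlotAssign
    slot::Int   # which slot is written
    typ::Any    # inferred (SSA) type of the right-hand side
end
struct NewVarNode
    slot::Int   # slot whose binding is reset (becomes undefined again)
end

# Full slot state is only materialized at basic block boundaries.
struct BlockEntryState
    slottypes::Vector{Any}   # one entry per slot; `nothing` means undefined
end

# Recover the state just before statement `idx` by linearly scanning the
# block's statements from its entry, applying each effect in order.
function recover_state(entry::BlockEntryState, stmts::Vector{Any}, idx::Int)
    slottypes = copy(entry.slottypes)
    for i in 1:idx-1
        stmt = stmts[i]
        if stmt isa SlotAssign
            slottypes[stmt.slot] = stmt.typ    # assignment refines the slot type
        elseif stmt isa NewVarNode
            slottypes[stmt.slot] = nothing     # slot becomes undefined again
        end
    end
    return BlockEntryState(slottypes)
end

# Usage: with 2 slots, recover the state just before the 3rd statement.
entry = BlockEntryState(Any[Int, nothing])
stmts = Any[SlotAssign(2, Float64), NewVarNode(1), SlotAssign(1, String)]
recover_state(entry, stmts, 3)   # slot 1 undefined, slot 2 :: Float64
```

This trades the `O(statements * slots)` storage for an `O(block length)` rescan when a particular statement's state is needed, which is the core space/time trade-off described above.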
Force-pushed from 60a82a6 to dfeb2e9
```julia
if s.undef
    # flag the slot as potentially used while undefined
    sv.src.slotflags[sn] |= SLOT_USEDUNDEF | SLOT_STATICUNDEF
end
```
After digging into the regressions in the optimizer, I found that this change could be problematic; in particular, it causes some regressions in the SROA pass.
I think these changes are orthogonal to the main purpose of this PR, so I am trying to fix the regressions by separating the related changes into another branch.
There still seems to be another cause of the regressions, though.
Separated from #43999. xref: <#43999 (comment)>
Replaced by #45276.