-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add RowsStreamingWindowBuild to avoid OOM in Window operator #9025
Conversation
✅ Deploy Preview for meta-velox ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
196b0c8
to
28cda2f
Compare
@mbasmanova @aditi-pandit Can you help to review? Thanks for your help. |
28cda2f
to
adddc97
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JkSelf : Had a high level question about this PR.
My understanding is that :
- You are trying to simulate an end of partition by artificially adding a partitionStartRow offset for each input block.
- Each partition has an offset indicating which row position is the first one.
- So for every input block you might naturally encounter end of partition if there is such a row. But you always reach end of partition at the end of the block.
- The next input block is a sort of new partition but carries the offset in the partition.
That would mean that resetPartition function is called at end of each input block. That is changing the semantics of WindowFunction. That is misleading. We shouldn't do that.
Instead we should enhance WindowPartition structure to model a partially filled partition. The code in Window operator that iterates in callApplyLoop would have logic to handle paritial WindowPartitions.
velox/exec/Window.cpp
Outdated
const auto& functionName = windowNodeFunction.functionCall->name(); | ||
const auto& frame = windowNodeFunction.frame; | ||
|
||
bool isRankLikeFunction = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could add to WindowFunction API a new function supportsStreaming() that returns true/false depending on whether it supports this. So this code would remain independent of the function names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be preferable to make such metadata available from the registry without having to instantiate a function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aditi-pandit @mbasmanova Move this metadata in registry to avoid instancing function.
39d0150
to
194ebc3
Compare
@aditi-pandit Make sense to me. Updated based on your suggestions and make the |
56f2161
to
160e96c
Compare
@JkSelf Would you update the PR description to provide some context on the changes here. Specifically, please, describe the overall design you implemented. |
velox/exec/Window.cpp
Outdated
@@ -187,6 +195,28 @@ void Window::createWindowFunctions() { | |||
} | |||
} | |||
|
|||
// The supportRankWindowBuild is designed to support 'rank' and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This optimization can be applied more broadly. With default frame, all aggregate functions can be processed in streaming manner without holding all rows of a partition in memory.
See code around analyzeFrameValues in AggregateWindow.cpp for some context.
Also, note that row_number and rank functions ignore frames.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Thanks @JkSelf for incorporating previous suggestions. Sorry for the delay in my review. I got swamped with many things the last 2 days Agree with others about having a design doc. Putting a doc together will give an opportunity to think about this more clearly from first principles. Had a suggestion for you about this PR as well: |
@mbasmanova @aditi-pandit @rui-mo Thanks for your review and suggestions. I have written a design document about streaming processing for Rank and row_number here. Correct me if something wrong. Thanks.
Sure. I will split one PR to solve the streaming metadata into WindowFunction, another is to handle the streaming processing for rank and row_number(), and the last one is to deal with the streaming processing for aggregate window functions with default window frames.
The patch already support WindowPartition with partial input. And can pass the window test with StreamingWindowBuild and SortWindowBuild. The failed unit tests is not related with this PR.
It is feasible to put the current RankLikeWindowBuild into StreamingWindowBuild, but it will increase the complexity of the StreamingWindowBuild code. Personally, I feel that keeping them separate would be clearer. |
velox/exec/RankLikeWindowBuild.h
Outdated
/// and row_number functions. RankWindowBuild adopts a streaming method to | ||
/// construct WindowPartition, which can reduce the occurrence of Out Of Memory | ||
/// (OOM). | ||
class RankLikeWindowBuild : public WindowBuild { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RankLike name doesn't work very well. dense_rank function is like rank, but it doesn't support this specific functionality. On the other hand, sum is not like rank, but it does support this functionality. Let's come up with a name that better reflects the functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbasmanova Yes. How about RowLevelStreamingWindowBuild
and also changing the existing StreamingWindowBuild
to PartitionLevelStreamingWindowBuild
?
593d27b
to
d5e7e69
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JkSelf LGTM. Thanks for the iterations!
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
d338045
to
df8f3e3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JkSelf few more comments. thanks!
df8f3e3
to
da68e74
Compare
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Hi @JkSelf, thank you for adding RowsStreamingWindowBuild. Could you also run the window fuzzer that compares Velox results against Presto for 2+ hours? To run that, you need to first start a Presto server following steps 1--10 here (but change the version 0.284 to 0.288): #8111. Then, you can start velox_window_fuzzer_test with command-line arguments Please let me know if you have any questions with running the fuzzer test. Thanks! |
@kagamiori Thanks for your review. I will try to execute the window fuzzer test locally following the steps you provided. |
da68e74
to
0d2318e
Compare
I encountered the following error during the execution process, but this error doesn't seem to be related to my PR. I tried running it without my PR, and the same error occurred.
|
This seems to be memory leak somewhere, and I didn't see that in Meta internal run against duckDB. Not sure if it relates to Presto verification. I will try with enableWindowVerification set with duckDB. |
0d2318e
to
0ffb120
Compare
@xiaoxmeng @kagamiori
The frame start has already been established and a buffer has been allocated here. However, when creating the frame end, the offset is negative, which triggers an error here. At this point, the buffer allocated for the frame start has not been released in time, resulting in the reported memory leak. The issue can fixed by converting the negative to positive in WindowFuzzer.cpp. |
0ffb120
to
bf3925c
Compare
@xiaoxmeng merged this pull request in d33cdb2. |
Conbench analyzed the 1 benchmark run on commit There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
…kincubator#9025) Summary: Unlike `StreamingWindowBuild`, `RowLevelStreamingWindowBuild ` in this PR is capable of processing window functions as rows arrive within a single partition, without the need to wait for the entire partition to be ready. This approach can significantly reduce memory usage, especially when a single partition contains a large amount of data. It is particularly suited for optimizing `rank `and `row_number `functions, as well as aggregate window functions with a default frame. The detailed discussions is [here](facebookincubator#8975). The design doc is [here](https://docs.google.com/document/d/17ONSJHK8XP5Lixm8XBl01RMNl4ntpixiVFe693ahw6k/edit?usp=sharing). Pull Request resolved: facebookincubator#9025 Test Plan: Run through 10hrs fuzzer testing Reviewed By: kagamiori Differential Revision: D61473798 Pulled By: xiaoxmeng fbshipit-source-id: 569a752770395330c48a3521bd5421eb89f5623d
to fix the result mismatch in RowsStreamingWindowBuild (#499) * Revert "Add RowsStreamingWindowBuild to avoid OOM in Window operator (9025)" This reverts commit f34c9b1. * Add RowsStreamingWindowBuild to avoid OOM in Window operator (facebookincubator#9025) Summary: Unlike `StreamingWindowBuild`, `RowLevelStreamingWindowBuild ` in this PR is capable of processing window functions as rows arrive within a single partition, without the need to wait for the entire partition to be ready. This approach can significantly reduce memory usage, especially when a single partition contains a large amount of data. It is particularly suited for optimizing `rank `and `row_number `functions, as well as aggregate window functions with a default frame. The detailed discussions is [here](facebookincubator#8975). The design doc is [here](https://docs.google.com/document/d/17ONSJHK8XP5Lixm8XBl01RMNl4ntpixiVFe693ahw6k/edit?usp=sharing). Pull Request resolved: facebookincubator#9025 Test Plan: Run through 10hrs fuzzer testing Reviewed By: kagamiori Differential Revision: D61473798 Pulled By: xiaoxmeng fbshipit-source-id: 569a752770395330c48a3521bd5421eb89f5623d * Fix error message * Fix the result mismatch in RowsStreamingWindowBuild (facebookincubator#10979) Summary: For a Range frame, it is necessary to ensure that the peer is ready before commencing the window function computation Pull Request resolved: facebookincubator#10979 Reviewed By: kagamiori Differential Revision: D62622816 Pulled By: xiaoxmeng fbshipit-source-id: 1a9911da416c867c9e295242a05d0f33fbc2e22d --------- Co-authored-by: Jia Ke <ke.a.jia@intel.com>
Unlike
StreamingWindowBuild
,RowLevelStreamingWindowBuild
in this PR is capable of processing window functions as rows arrive within a single partition, without the need to wait for the entire partition to be ready. This approach can significantly reduce memory usage, especially when a single partition contains a large amount of data. It is particularly suited for optimizingrank
androw_number
functions, as well as aggregate window functions with a default frame.The detailed discussions is here. The design doc is here.