replace BinaryHeap for TopN #2186
Replace BinaryHeap for TopN with a variant that selects the median with QuickSort, which runs in O(n) time. Add merge_fruits fast path.
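The idea in the description can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: `TopNSketch` and its methods are hypothetical names, scores are plain `u64` values, and the buffer is assumed to hold `2 * top_n` entries before it is cut back to `top_n` with `select_nth_unstable_by`.

```rust
/// Hypothetical illustration of a top-n collector that truncates around a
/// selected element instead of maintaining a heap. Not tantivy's implementation.
pub struct TopNSketch {
    buffer: Vec<u64>,
    top_n: usize,
    /// Smallest score kept by the last truncation, if one has happened.
    threshold: Option<u64>,
}

impl TopNSketch {
    pub fn new(top_n: usize) -> Self {
        assert!(top_n > 0, "top_n must be positive");
        TopNSketch {
            buffer: Vec::with_capacity(2 * top_n),
            top_n,
            threshold: None,
        }
    }

    pub fn push(&mut self, score: u64) {
        // Early exit: after a truncation the buffer already holds `top_n`
        // scores >= threshold, so a score at or below it cannot change the result.
        if let Some(t) = self.threshold {
            if score <= t {
                return;
            }
        }
        self.buffer.push(score);
        if self.buffer.len() == 2 * self.top_n {
            self.truncate_to_top_n();
        }
    }

    fn truncate_to_top_n(&mut self) {
        let idx = self.top_n - 1;
        // Descending comparator: after selection, positions 0..=idx hold the
        // `top_n` largest scores (in arbitrary order).
        let nth = *self.buffer.select_nth_unstable_by(idx, |a, b| b.cmp(a)).1;
        self.threshold = Some(nth);
        self.buffer.truncate(self.top_n);
    }

    /// Sorts what is left (at most `2 * top_n - 1` scores) and keeps the best `top_n`.
    pub fn into_sorted_vec(mut self) -> Vec<u64> {
        self.buffer.sort_unstable_by(|a, b| b.cmp(a));
        self.buffer.truncate(self.top_n);
        self.buffer
    }
}
```

Each truncation does a selection rather than a full sort, and it only runs once per `top_n` accepted scores, which is where the amortized saving over per-element heap maintenance would come from in this sketch.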
src/collector/top_score_collector.rs
Outdated
sorted_buffer.sort_unstable();

// Return the sorted top N elements
sorted_buffer.into_iter().take(self.max_size / 2)
Couldn't this use `sorted_buffer.capacity()` and thereby avoid storing `max_size` at all?
I also wonder if truncation before calling `.into_iter()` isn't preferable to usage of `.take(..)`, as it simplifies the returned iterator, including returning `impl ExactSizeIterator<Item = ...>`.
The `Vec::with_capacity` contract is: "The vector will be able to hold at least capacity elements without reallocating."
The current implementation seems to always give exactly the requested capacity (except for ZSTs), but it could in theory give more, in which case we could return more docs than initially requested.
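A small sketch of what that contract does and does not promise. The helper names are hypothetical; the only guarantee relied on is `capacity() >= requested`.

```rust
/// Returns the capacity actually reported for a requested one.
/// Only `actual >= requested` is guaranteed by the documentation.
pub fn actual_capacity(requested: usize) -> usize {
    let v: Vec<u64> = Vec::with_capacity(requested);
    v.capacity()
}

/// Same check for a zero-sized element type, where the reported capacity
/// is not tied to the request at all.
pub fn actual_capacity_zst(requested: usize) -> usize {
    let v: Vec<()> = Vec::with_capacity(requested);
    v.capacity()
}

/// Truncating by an explicitly stored limit does not depend on how much
/// the allocator handed out.
pub fn keep_top_n(mut sorted: Vec<u64>, top_n: usize) -> Vec<u64> {
    sorted.truncate(top_n);
    sorted
}
```

Deriving the result size from a stored `top_n` (or `max_size`) rather than from `capacity()` is the robust choice if over-allocation is considered possible.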
I think if the implementation is to be robust against that, the check on line 753 should also use `max_size` instead of `capacity`, shouldn't it?
(It would lead to more than `2K` documents being tracked, and hence the first `K` could be out of order w.r.t. the median at `len() / 2 == capacity() / 2 > K`? Or does sorting the whole buffer at the end take care of that?)
(Wouldn't it actually suffice to call `truncate_median` in `into_sorted_iter` and then sort only the remaining `K` elements generally (thereby also avoiding the `take` adaptor)?)
Having the `truncate_median` logic work on a little bit more capacity is no issue.
The check in line 753 did not work, so I replaced it with some `unsafe`.
> (Wouldn't it actually suffice to call truncate_median in into_sorted_iter and then sort only the remaining K elements generally (thereby also avoiding the take adaptor)?)
Yes, but I don't think the performance is worth the lower readability.
> Yes, but I don't think the performance is worth the lower readability
To be honest, I am not sure that readability actually suffers:
pub(crate) fn into_iter_sorted(mut self) -> impl ExactSizeIterator<Item = ComparableDoc<Score, DocId>> {
    self.truncate_median();
    self.buffer.sort_unstable();
    self.buffer.into_iter()
}
Anyone proofreading would need to check that `truncate_median` truncates exactly to top n ... which it doesn't.
> which it doesn't
If capacity is over-allocated, but since you went for `spare_capacity` anyway, you could actually enforce that by checking `self.buffer.len() == self.max_size` instead of `self.buffer.capacity()`.
I agree that this entangles this function with the invariants of `TopNComputer`, but I am not sure that `TopNComputer` can be reviewed piece by piece in any case, as the contents of `self.buffer` are really only meaningful here due to these invariants.
That said, I would still suggest doing at least
pub(crate) fn into_iter_sorted(self) -> impl ExactSizeIterator<Item = ComparableDoc<Score, DocId>> {
    let mut sorted_buffer = self.buffer;
    sorted_buffer.sort_unstable();
    // Return the sorted top N elements
    sorted_buffer.truncate(self.top_n);
    sorted_buffer.into_iter()
}
to simplify the returned iterator.
I don't think that simplifies the code or the returned Iterator
It definitely avoids "injecting" the code for the `Take` adaptor into the caller, cf. https://doc.rust-lang.org/stable/src/core/iter/adapters/take.rs.html#35
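The two shapes under discussion can be compared side by side. This is a sketch with hypothetical function names and `u32` items standing in for `ComparableDoc<Score, DocId>`; the concrete return types are spelled out to show where the `Take` adaptor appears. Both types implement `ExactSizeIterator` here, so the difference is only in the concrete type handed to the caller.

```rust
use std::iter::Take;
use std::vec::IntoIter;

/// Sort descending, then limit lazily: the caller receives `Take<IntoIter<_>>`.
pub fn top_with_take(mut buf: Vec<u32>, top_n: usize) -> Take<IntoIter<u32>> {
    buf.sort_unstable_by(|a, b| b.cmp(a));
    buf.into_iter().take(top_n)
}

/// Sort descending, truncate eagerly: the caller receives a plain `IntoIter<_>`.
pub fn top_with_truncate(mut buf: Vec<u32>, top_n: usize) -> IntoIter<u32> {
    buf.sort_unstable_by(|a, b| b.cmp(a));
    buf.truncate(top_n);
    buf.into_iter()
}
```

Both return the same items in the same order; whether the simpler type is worth the change is the judgment call the thread is arguing about.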
Codecov Report
@@            Coverage Diff             @@
##             main    #2186      +/-   ##
==========================================
+ Coverage   94.41%   94.44%   +0.02%
==========================================
  Files         322      322
  Lines       63486    63583      +97
==========================================
+ Hits        59941    60048     +107
+ Misses       3545     3535      -10
If it makes that much of a difference, we'll probably want to do the same in quickwit's topk module.
I made
ff12aa0 to 6c53745
src/collector/top_score_collector.rs
Outdated
}

pub(crate) fn into_iter_sorted(self) -> impl Iterator<Item = ComparableDoc<Score, DocId>> {
    let mut sorted_buffer = self.buffer;
nitpick
Suggested change:
- let mut sorted_buffer = self.buffer;
+ if self.buffer.len() > self.top_n {
+     self.truncate_median();
+ }
+ let mut sorted_buffer = self.buffer;
Note that this change requires a `self.top_n` change in `truncate_median`.
// This is faster since it avoids the buffer resizing to be inlined from vec.push()
// (this is in the hot path)
// TODO: Replace with `push_within_capacity` when it's stabilized
let uninit = self.buffer.spare_capacity_mut();
If we go this way, we might want to remove the main branch of this code?
`if doc.feature < last_median`
We can do it by converting this bool into 1 or 0, and then calling
`self.buffer.set_len(self.buffer.len() + inc_from_bool);`
The `if doc.feature < last_median` early exit is quite important for performance, I think.
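For readers unfamiliar with the two techniques mentioned here, a minimal sketch follows. It is not the PR's code: the function names are hypothetical, the element type is a plain `u32`, and no performance claim is made for either variant. The first function writes into spare capacity so that no reallocation path is involved; the second shows the branchless idea of always writing the slot and advancing the length by the boolean converted to 0 or 1.

```rust
use std::mem::MaybeUninit;

/// Appends `value` using already-allocated spare capacity only.
/// Panics (via slice indexing) if the vector is full.
pub fn push_within_spare(buffer: &mut Vec<u32>, value: u32) {
    let spare: &mut [MaybeUninit<u32>] = buffer.spare_capacity_mut();
    spare[0] = MaybeUninit::new(value);
    let new_len = buffer.len() + 1;
    // SAFETY: the slot at index `len` was initialized just above, and the
    // indexing succeeded, so `new_len <= capacity`.
    unsafe { buffer.set_len(new_len) };
}

/// Branchless variant: the slot is always written, but the length only
/// advances when `keep` is true. Still panics if there is no spare slot,
/// even when `keep` is false.
pub fn push_if(buffer: &mut Vec<u32>, value: u32, keep: bool) {
    let spare: &mut [MaybeUninit<u32>] = buffer.spare_capacity_mut();
    spare[0] = MaybeUninit::new(value);
    let new_len = buffer.len() + keep as usize;
    // SAFETY: if `keep` is false the length is unchanged; otherwise the newly
    // covered slot was initialized above and lies within capacity.
    unsafe { buffer.set_len(new_len) };
}
```

The branchless form trades a conditional jump for an unconditional write, whereas the early exit skips the write entirely; which of the two wins is a measurement question that this sketch does not settle.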
See comments inline
src/collector/top_score_collector.rs
Outdated

#[inline(never)]
fn truncate_top_n(&mut self) -> Score {
    let truncate_pos = self.top_n.min(self.buffer.len().saturating_sub(1));
I think there is still an off-by-1 error here, no? `self.top_n.min(self.buffer.len()).saturating_sub(1)` seems more correct?
Alternatively, this could also be a branch...
The logic was correct, I think: either truncate to top n, or don't truncate if the buffer is smaller than top n.
I could remove that special case, since it only appeared when called from `into_sorted_vec`.
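The two expressions being compared can be evaluated in isolation. In this sketch the function names are hypothetical and the inputs are made up; it only shows where the results differ, not which one is right, since that depends on how `truncate_pos` is used afterwards (the snippet above does not show it).

```rust
/// The expression as written in the reviewed code.
pub fn pos_as_written(top_n: usize, len: usize) -> usize {
    top_n.min(len.saturating_sub(1))
}

/// The expression proposed in the review comment.
pub fn pos_as_suggested(top_n: usize, len: usize) -> usize {
    top_n.min(len).saturating_sub(1)
}
```

For `top_n = 10`, a buffer of length 20 gives 10 as written and 9 as suggested; a buffer of length 5 gives 4 either way, and an empty buffer gives 0 either way. The two only disagree when the buffer holds more than `top_n` elements.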
Replace BinaryHeap for TopN with a variant that selects the median with QuickSort,
which runs in O(n) time.
Variant with spare_capacity_mut
Top 10
Top 100
Top 1000
Variant with push