Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace BinaryHeap for TopN #2186

Merged
merged 6 commits into from
Sep 27, 2023
Merged

replace BinaryHeap for TopN #2186

merged 6 commits into from
Sep 27, 2023

Conversation

PSeitz
Copy link
Contributor

@PSeitz PSeitz commented Sep 24, 2023

replace BinaryHeap for TopN with variant that selects the median with QuickSort,
which runs in O(n) time.

Variant with spare_capacity_mut

Top 10

Query tantivy-0.20 tantivy-0.21 tantivy-topn
AVERAGE 2,882 μs 2,844 μs 2,469 μs

Top 100

Query tantivy-0.20 tantivy-0.21 tantivy-topn
AVERAGE 2,846 μs 2,902 μs 2,504 μs

Top 1000

Query tantivy-0.20 tantivy-0.21 tantivy-topn
AVERAGE 2,848 μs 3,169 μs 2,518 μs

Variant with push

topn_1
topn_2
topn_3

replace BinaryHeap for TopN with variant that selects the median with QuickSort,
which runs in O(n) time.

add merge_fruits fast path
sorted_buffer.sort_unstable();

// Return the sorted top N elements
sorted_buffer.into_iter().take(self.max_size / 2)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't this use sorted_buffer.capacity() and thereby avoid storing max_size at all?

I also wonder if truncation before calling .into_iter() isn't preferable to usage of .take(..) as it simplifies the returned iterator, including returning impl ExactSizeIterator<Item = ...>.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Vec::with_capacity contract is

The vector will be able to hold at least capacity elements without reallocating.

Current implementation seems to always give exactly the requested capacity (except for ZST), but it could in theory give more, in which case we could return more docs than initially requested

Copy link
Collaborator

@adamreichold adamreichold Sep 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if the implementation is to be robust against that, the check line 753 should also use max_size instead of capacity, shouldn't it?

(It would lead to more than 2K documents being tracked and hence the first K could be out of order w.r.t. the median at len() / 2 == capacity() / 2 > K? Or does sorting the whole buffer at the end take care of that?)

(Wouldn't it actually suffice to call truncate_median in into_sorted_iter and then sort only the remaining K elements generally (thereby also avoiding the take adaptor)?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having the truncate_median logic work on a little bit more capacity is no issue.

The check in line 753 did not work, so I replaced it with some unsafe

(Wouldn't it actually suffice to call truncate_median in into_sorted_iter and then sort only the remaining K elements generally (thereby also avoiding the take adaptor)?)

Yes, but I don't think the performance is worth it the lower readability

Copy link
Collaborator

@adamreichold adamreichold Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but I don't think the performance is worth it the lower readability

To be honest, I am not sure that readability actually suffers:

pub(crate) fn into_iter_sorted(mut self) -> impl ExactSizeIterator<Item = ComparableDoc<Score, DocId>> {
    self.truncate_median();
    self.buffer.sort_unstable();
    self.buffer.into_iter()
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anyone proofreading would need to check truncate_median truncates exact to top n ... which it doesn't

Copy link
Collaborator

@adamreichold adamreichold Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which it doesn't

If capacity is over-allocated, but since you went for spare_capacity anyway, you could actually enforce that by checking self.buffer.len() == self.max_size instead of self.buffer.capacity().

I agree that this entangles this function with the invariants of TopNComputer, but I am not sure that TopNCompuer can be reviewed piece by piece in any case as the contents of self.buffer is really only meaningful here due to these invariants.

That said, I would still suggest doing at least

pub(crate) fn into_iter_sorted(self) -> impl ExactSizeIterator<Item = ComparableDoc<Score, DocId>> {
        let mut sorted_buffer = self.buffer;
        sorted_buffer.sort_unstable();

        // Return the sorted top N elements
        sorted_buffer.truncate(self.top_n);
        sorted_buffer.into_iter()
}

to simplify the returned iterator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that simplifies the code or the returned Iterator

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It definitely avoids "injecting" the code for the Take adaptor into the caller, c.f. https://doc.rust-lang.org/stable/src/core/iter/adapters/take.rs.html#35

@codecov-commenter
Copy link

codecov-commenter commented Sep 24, 2023

Codecov Report

Attention: 7 lines in your changes are missing coverage. Please review.

Comparison is base (34920d3) 94.41% compared to head (6b3c86d) 94.44%.
Report is 3 commits behind head on main.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2186      +/-   ##
==========================================
+ Coverage   94.41%   94.44%   +0.02%     
==========================================
  Files         322      322              
  Lines       63486    63583      +97     
==========================================
+ Hits        59941    60048     +107     
+ Misses       3545     3535      -10     
Files Coverage Δ
src/collector/mod.rs 54.63% <ø> (ø)
src/collector/top_score_collector.rs 98.61% <99.21%> (+0.58%) ⬆️
src/collector/top_collector.rs 97.02% <50.00%> (-0.59%) ⬇️

... and 6 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@trinity-1686a
Copy link
Contributor

if it makes that much of a difference, we'll probably want to do the same in quickwit's topk module

@PSeitz PSeitz marked this pull request as ready for review September 25, 2023 10:51
@PSeitz
Copy link
Contributor Author

PSeitz commented Sep 25, 2023

if it makes that much of a difference, we'll probably want to do the same in quickwit's topk module

I made TopNComputer pub, so we can just call it from Quickwit

}

pub(crate) fn into_iter_sorted(self) -> impl Iterator<Item = ComparableDoc<Score, DocId>> {
let mut sorted_buffer = self.buffer;
Copy link
Collaborator

@fulmicoton fulmicoton Sep 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick

Suggested change
let mut sorted_buffer = self.buffer;
if self.buffer.len() > self.top_n {
self.truncate_median();
}
let mut sorted_buffer = self.buffer;

Note this change requires to self.top_n change in truncate_median.

// This is faster since it avoids the buffer resizing to be inlined from vec.push()
// (this is in the hot path)
// TODO: Replace with `push_within_capacity` when it's stabilized
let uninit = self.buffer.spare_capacity_mut();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go this way, we might want to remove the main branch of this code?
if doc.feature < last_median

We can do it by making by converting this bool into 1 or 0, and

self.buffer.set_len(self.buffer.len() + inc_from_bool);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if doc.feature < last_median early exit is quite important for performance I think

Copy link
Collaborator

@fulmicoton fulmicoton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments inline


#[inline(never)]
fn truncate_top_n(&mut self) -> Score {
let truncate_pos = self.top_n.min(self.buffer.len().saturating_sub(1));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is still an off-by-1 error here no?

self.top_n.min(self.buffer.len()).saturating_sub(1)

seems more correct?

Alternatively, this could also be a branch...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic was correct I think, either truncate top n or don't truncate if the buffer is smaller than top n

I could remove that special case, since it only appeared when called from into_sorted_vec

@PSeitz PSeitz merged commit b525f65 into main Sep 27, 2023
5 checks passed
@PSeitz PSeitz deleted the faster_top_n branch September 27, 2023 07:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants