[READY] Implement partial sorting #825

Merged Sep 3, 2017 (1 commit)

Conversation

micbou
Collaborator

@micbou micbou commented Aug 31, 2017

This PR adds an optional parameter to the CandidatesForQueryAndType and FilterAndSortCandidates functions that limits the number of candidates returned by these functions. The candidates are partially sorted according to this limit (i.e. only the n smallest candidates are sorted).

The current behavior is unchanged: identifiers are still limited to 10 by default and there is no limit for other kinds of candidates. The plan is to add a new setting (e.g. g:ycm_max_num_candidates) to limit the number of non-identifier candidates. 50 seems a reasonable choice.

The PartialSort function implements the following strategy:

  • if the number of elements to sort is less than 1024 or represents less than 1/64 of the total number of elements, use std::partial_sort;
  • otherwise, use std::nth_element then std::sort on the n smallest elements.
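
The strategy above can be sketched as follows. This is a minimal illustration, not the code from the PR: the name `PartialSortSketch` and its signature are assumptions, while the 1024 and 1/64 thresholds and the "0 means sort everything" convention are taken from this conversation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: shrinks |elements| to its |max_elements| smallest
// elements, in sorted order. If |max_elements| is 0 or not smaller than the
// vector size, the whole vector is sorted and nothing is dropped.
template < typename T >
void PartialSortSketch( std::vector< T > &elements, std::size_t max_elements ) {
  const std::size_t nb_elements = elements.size();
  if ( max_elements == 0 || max_elements >= nb_elements ) {
    std::sort( elements.begin(), elements.end() );
    return;
  }

  auto middle = elements.begin() + max_elements;
  if ( max_elements <= std::max( static_cast< std::size_t >( 1024 ),
                                 nb_elements / 64 ) ) {
    // Few elements requested: use std::partial_sort directly.
    std::partial_sort( elements.begin(), middle, elements.end() );
  } else {
    // Many elements requested: move the smallest ones to the front with
    // std::nth_element, then sort only that prefix.
    std::nth_element( elements.begin(), middle, elements.end() );
    std::sort( elements.begin(), middle );
  }
  elements.erase( middle, elements.end() );
}
```

Calling `PartialSortSketch( v, 2 )` on `{ 5, 3, 9, 1, 7 }` leaves `{ 1, 3 }` in `v`, while a limit of 0 sorts the whole vector.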

This heuristic was obtained by comparing the performance of these 3 algorithms:

  • std::sort;
  • std::partial_sort;
  • std::nth_element + std::sort.

They were tested on randomly generated strings of 20 characters with different values for the total number of strings and the number of strings to sort. Here is the graph of an experiment with 2^10 random strings:
[Figure: partial-sorting-comparison]

The observation is that, when the number of elements to sort is small, std::partial_sort outperforms other algorithms while std::nth_element + std::sort always beats std::sort alone in other cases.
This experiment can be reproduced by checking out this branch and running the script plot_bench.py (the matplotlib package must be installed to draw the graph):

./plot_bench.py --bench bench.log

The number of elements can be modified by editing this variable.
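
For readers who do not want to check out the benchmark branch, the three approaches being compared can be sketched in isolation. This is not the `PartialSorting_bench.cpp` code; the function names, the lowercase-only alphabet, and the `MillisecondsFor` timing helper are all assumptions made for illustration. Each function expects `n` to be at most the number of strings.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

using Strings = std::vector< std::string >;

// Generates |count| random lowercase strings of |length| characters.
Strings RandomStrings( std::size_t count, std::size_t length, unsigned seed ) {
  std::mt19937 generator( seed );
  std::uniform_int_distribution< int > letter( 'a', 'z' );
  Strings strings( count, std::string( length, ' ' ) );
  for ( auto &s : strings )
    for ( auto &c : s )
      c = static_cast< char >( letter( generator ) );
  return strings;
}

// Approach 1: sort everything, keep the first |n|.
Strings SmallestWithSort( Strings v, std::size_t n ) {
  std::sort( v.begin(), v.end() );
  v.resize( n );
  return v;
}

// Approach 2: std::partial_sort on the first |n| positions.
Strings SmallestWithPartialSort( Strings v, std::size_t n ) {
  std::partial_sort( v.begin(), v.begin() + n, v.end() );
  v.resize( n );
  return v;
}

// Approach 3: std::nth_element to select, then std::sort on the prefix.
Strings SmallestWithNthElement( Strings v, std::size_t n ) {
  if ( n < v.size() )
    std::nth_element( v.begin(), v.begin() + n, v.end() );
  std::sort( v.begin(), v.begin() + n );
  v.resize( n );
  return v;
}

// Crude wall-clock timing of a single call, in milliseconds.
template < typename Function >
double MillisecondsFor( Function f ) {
  const auto start = std::chrono::steady_clock::now();
  f();
  const std::chrono::duration< double, std::milli > elapsed =
    std::chrono::steady_clock::now() - start;
  return elapsed.count();
}
```

All three functions return the same strings; only their cost differs. A single timed call such as `MillisecondsFor( [&] { SmallestWithPartialSort( strings, 10 ); } )` is noisy, so a proper benchmark harness remains the better tool for real measurements.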

Finally, here are the benchmark numbers obtained with these changes on my config. Conclusions are:

  • no performance loss when all candidates are returned;
  • a ~36% speedup in sorting and filtering identifiers with the default value of max_num_identifier_candidates;
  • when limiting the number of non-identifier candidates to 50, sorting and filtering non-identifier candidates is ~5% faster if candidates are not already stored in the repository and ~24% if they are.


@bstaletic
Collaborator

Reviewed 12 of 12 files at r1.
Review status: all files reviewed at latest revision, 2 unresolved discussions.


cpp/ycm/benchmarks/IdentifierCompleter_bench.cpp, line 55 at r1 (raw file):

BENCHMARK_REGISTER_F( IdentifierCompleterFixture, CandidatesWithCommonPrefix )
    ->RangeMultiplier( 1 << 4 )
    ->Ranges( { { 1, 1 << 16 }, { 10, 10 } } )

The { 1, 1 << 16 } part is the range itself. What does the { 10, 10 } part mean?


cpp/ycm/benchmarks/PythonSupport_bench.cpp, line 56 at r1 (raw file):

    state.ResumeTiming();
    FilterAndSortCandidates( candidates, "insertion_text", "aA",
                             state.range( 1 ) );

What does state.range( 1 ) evaluate to in this case?


Comments from Reviewable

@micbou
Collaborator Author

micbou commented Aug 31, 2017

Reviewed 2 of 12 files at r1.
Review status: all files reviewed at latest revision, 2 unresolved discussions.


cpp/ycm/benchmarks/IdentifierCompleter_bench.cpp, line 55 at r1 (raw file):

Previously, bstaletic (Boris Staletic) wrote…

The { 1, 1 << 16 } part is the range itself. What does the { 10, 10 } part mean?

{ 10, 10 } is the range for the maximum number of candidates returned by CandidatesForQuery. It's identical to { 10 } (in other words, we only test the value 10) but Google benchmark doesn't accept a single value for a range.


cpp/ycm/benchmarks/PythonSupport_bench.cpp, line 56 at r1 (raw file):

Previously, bstaletic (Boris Staletic) wrote…

What does state.range( 1 ) evaluate to in this case?

0 when the ranges are { { 1, 1 << 16 }, { 0, 0 } } and 50 when they are { { 1, 1 << 16 }, { 50, 50 } }.
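
To make the two answers above concrete, here is a rough emulation of how the benchmark arguments are expanded. It is not Google Benchmark code: `ExpandRange` and `ExpandRanges` are hypothetical helpers, and the expansion rule (lower bound, powers of the multiplier strictly between the bounds, upper bound) is an assumption about the library's behavior that should be checked against its documentation. The multiplier is assumed to be at least 2.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical emulation of the values one { lo, hi } range expands to.
std::vector< std::int64_t > ExpandRange( std::int64_t lo,
                                         std::int64_t hi,
                                         std::int64_t multiplier ) {
  std::vector< std::int64_t > values{ lo };
  for ( std::int64_t power = 1; power < hi; power *= multiplier )
    if ( power > lo )
      values.push_back( power );
  if ( hi != lo )
    values.push_back( hi );
  return values;
}

// Every benchmark run receives one value from each range: the first is what
// state.range( 0 ) returns, the second is what state.range( 1 ) returns.
std::vector< std::pair< std::int64_t, std::int64_t > > ExpandRanges(
    std::pair< std::int64_t, std::int64_t > first,
    std::pair< std::int64_t, std::int64_t > second,
    std::int64_t multiplier ) {
  std::vector< std::pair< std::int64_t, std::int64_t > > runs;
  for ( auto a : ExpandRange( first.first, first.second, multiplier ) )
    for ( auto b : ExpandRange( second.first, second.second, multiplier ) )
      runs.emplace_back( a, b );
  return runs;
}
```

Under these assumptions, `ExpandRanges( { 1, 1 << 16 }, { 10, 10 }, 1 << 4 )` yields five runs whose first argument is 1, 16, 256, 4096, and 65536, each paired with a second argument of 10.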


Comments from Reviewable

@codecov-io

codecov-io commented Sep 1, 2017

Codecov Report

Merging #825 into master will decrease coverage by <.01%.
The diff coverage is 96.55%.

@@            Coverage Diff             @@
##           master     #825      +/-   ##
==========================================
- Coverage   94.82%   94.82%   -0.01%     
==========================================
  Files          79       79              
  Lines        5374     5389      +15     
  Branches      168      169       +1     
==========================================
+ Hits         5096     5110      +14     
  Misses        231      231              
- Partials       47       48       +1

@Valloric
Member

Valloric commented Sep 1, 2017

I'm sorry, but the PR description isn't at all clear to me. I've read it twice now and I still have no clue what this PR is trying to accomplish. :D It might have just been a long day at work, but still, could this be clarified? Perhaps with some examples.


Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


Comments from Reviewable

@bstaletic
Collaborator

bstaletic commented Sep 1, 2017

@Valloric Let me try to explain.

Current situation

Identifier completer - returns 10 candidates.
All other completers - return everything they can, no hard limit set.

Sorting - std::sort() always.

This pull request

Identifier completer - returns 10 sorted candidates.
Other completers - return up to 50 sorted candidates. Reasoning: User doesn't care for thousands of identifiers, only the most likely "few". Few is 50 by default, controllable with g:ycm_max_num_candidates.

Sorting - Partially sort those 50 (by default) candidates following this algorithm.

  • If there are fewer than 1024 candidates, use std::partial_sort().
  • If the number of wanted candidates (50) is less than (total number of possible candidates)/64, use std::partial_sort().
  • In all other cases, use std::nth_element() then std::sort(). This branch is the one that isn't tested according to codecov.

This results in much faster sorting of candidates.


Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


Comments from Reviewable

@micbou
Collaborator Author

micbou commented Sep 1, 2017

Sorry for being unclear. The idea is that, since the maximum number of identifiers returned to the user is (by default) 10, we don't need to sort all of them but only the 10 smallest. This is significantly faster when the number of identifiers is big. The same can be done for non-identifier candidates if we limit them too. The rationale for limiting them is that, in addition to improving performance, it's not particularly useful to return thousands of candidates. Users will never go through all of them.

@bstaletic did a good summary, except that the PR doesn't actually set the maximum number of non-identifier candidates returned to the user to 50; it only makes it possible to limit this number by adding an optional parameter to the FilterAndSortCandidates function. Setting the limit to 50 is subject to discussion and should be done in a separate PR.


Reviewed 9 of 12 files at r1.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


Comments from Reviewable

@bstaletic
Collaborator

@micbou Taking another look at the code, the optional argument determining the max number of candidates defaults to 0, with 0 having the special meaning of returning everything without any limit on the number of returned candidates.

I really think we should default to some other value.


Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


Comments from Reviewable

@micbou
Collaborator Author

micbou commented Sep 1, 2017

The default is 0 to keep the current behavior which is to return all the non-identifier candidates. If we decide to limit the number of candidates, we should set the max_candidates parameter in the Python layer like we currently do for identifiers.
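
The "0 means no limit" convention discussed here can be illustrated with a small sketch. It is not the ycmd implementation: `FilterAndSortSketch` is a hypothetical name, and the prefix filter is a stand-in chosen for brevity that is not meant to reflect how ycmd matches candidates against a query.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical function: keeps the candidates starting with |query| and
// returns them sorted. The optional |max_candidates| defaults to 0, which
// means "return every match", so existing callers see no behavior change.
std::vector< std::string > FilterAndSortSketch(
    const std::vector< std::string > &candidates,
    const std::string &query,
    std::size_t max_candidates = 0 ) {
  std::vector< std::string > matches;
  for ( const auto &candidate : candidates )
    if ( candidate.compare( 0, query.size(), query ) == 0 )
      matches.push_back( candidate );

  if ( max_candidates == 0 || max_candidates >= matches.size() ) {
    std::sort( matches.begin(), matches.end() );
    return matches;
  }

  // Only the |max_candidates| smallest matches need to be ordered.
  std::partial_sort( matches.begin(),
                     matches.begin() + max_candidates,
                     matches.end() );
  matches.resize( max_candidates );
  return matches;
}
```

With this shape, a caller that wants a limit passes it explicitly, for instance `FilterAndSortSketch( candidates, "fo", 50 )`, while callers that omit the argument keep receiving all matches.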


Reviewed 1 of 12 files at r1, 1 of 1 files at r2.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks broke.


Comments from Reviewable

@bstaletic
Collaborator

Fair point. Now why is codecov still reporting the same coverage after adding another test in r2 of this PR?


Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

@micbou micbou changed the title [READY] Implement partial sorting [WIP] Implement partial sorting Sep 1, 2017
@micbou
Collaborator Author

micbou commented Sep 1, 2017

Because the test only sorts the 2 smallest identifiers while it should sort at least 1024 if we want the other part of PartialSort to be covered. I am rewriting the test.
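
The coverage gap can be reasoned about with a small predicate. This is a sketch that assumes the branch condition is the one quoted from `cpp/ycm/Utils.h` elsewhere in this review (written here with `/ 64`); `UsesPartialSort` is a hypothetical helper, not a function from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Returns true when the std::partial_sort branch would be taken, and false
// when the std::nth_element + std::sort branch would be taken instead.
bool UsesPartialSort( std::size_t max_elements, std::size_t nb_elements ) {
  return max_elements <= std::max( static_cast< std::size_t >( 1024 ),
                                   nb_elements / 64 );
}
```

Under that assumption, a test sorting the 2 smallest identifiers always lands in the `std::partial_sort` branch whatever the total count, whereas a request for 1025 elements out of 5000 reaches the other branch.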


Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

@bstaletic
Collaborator

With the updated test this is :lgtm:.


Reviewed 1 of 1 files at r2, 1 of 1 files at r3.
Review status: all files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

@micbou micbou changed the title [WIP] Implement partial sorting [READY] Implement partial sorting Sep 1, 2017
@micbou
Collaborator Author

micbou commented Sep 1, 2017

New test improved coverage. This is ready.


Reviewed 1 of 1 files at r3.
Review status: all files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

@Valloric
Member

Valloric commented Sep 2, 2017

Thanks for the extra clarification folks! I get what this is trying to do now. Seems like a sensible change. Minor inline comments, but otherwise :lgtm: .


Review status: 12 of 13 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.


cpp/ycm/Utils.h, line 191 at r4 (raw file):

// Shrink a vector to its sorted |num_sorted_elements| smallest elements. If
// |num_sorted_elements| is 0 or more than the vector size, sort the whole

"or more" -> "or larger"


cpp/ycm/Utils.h, line 210 at r4 (raw file):

  // alone in other cases.
  if ( max_elements <= std::max( static_cast< size_t >( 1024 ),
                                 nb_elements >> 6 ) ) {

IMO the right shift is too clever. Just use / 64 and the compiler will transform it for you into the shift. It will perform the same but it will be easier to read.
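
The equivalence behind this suggestion is easy to check for unsigned values. The helper names below are made up for the example; the only claim is that shifting a `std::size_t` right by 6 bits and dividing it by 64 give the same result.

```cpp
#include <cassert>
#include <cstddef>

// Two spellings of the same computation on an unsigned integer.
constexpr std::size_t ShiftBy6( std::size_t n ) { return n >> 6; }
constexpr std::size_t DivideBy64( std::size_t n ) { return n / 64; }

// Checked at compile time for a few representative values.
static_assert( ShiftBy6( 0 ) == DivideBy64( 0 ), "mismatch at 0" );
static_assert( ShiftBy6( 63 ) == DivideBy64( 63 ), "mismatch at 63" );
static_assert( ShiftBy6( 64 ) == DivideBy64( 64 ), "mismatch at 64" );
static_assert( ShiftBy6( 65536 ) == DivideBy64( 65536 ), "mismatch at 65536" );
```

Since both forms compute the same value, the choice between them is about readability, which is the point of the review comment.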


Comments from Reviewable

Add an option to limit the number of candidates to sort and filter. Partially
sort the list of candidates according to this limit.
@micbou
Collaborator Author

micbou commented Sep 2, 2017

Reviewed 1 of 1 files at r4, 1 of 1 files at r5.
Review status: all files reviewed at latest revision, 2 unresolved discussions.


cpp/ycm/Utils.h, line 191 at r4 (raw file):

Previously, Valloric (Val Markovic) wrote…

"or more" -> "or larger"

Done.


cpp/ycm/Utils.h, line 210 at r4 (raw file):

Previously, Valloric (Val Markovic) wrote…

IMO the right shift is too clever. Just use / 64 and the compiler will transform it for you into the shift. It will perform the same but it will be easier to read.

Done.


Comments from Reviewable

@Valloric
Member

Valloric commented Sep 3, 2017

Thanks!

@zzbot r=bstaletic


Review status: all files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

@zzbot
Contributor

zzbot commented Sep 3, 2017

📌 Commit 1e00ddd has been approved by bstaletic

@zzbot
Contributor

zzbot commented Sep 3, 2017

⌛ Testing commit 1e00ddd with merge f180d07...

zzbot added a commit that referenced this pull request Sep 3, 2017
[READY] Implement partial sorting

This PR adds an optional parameter to the `CandidatesForQueryAndType` and `FilterAndSortCandidates` functions that limits the number of candidates returned by these functions. The candidates are partially sorted according to this limit (i.e. only the `n` smallest candidates are sorted).

The current behavior is unchanged: identifiers are still limited to `10` by default and there is no limit for other kinds of candidates. The plan is to add a new setting (e.g. `g:ycm_max_num_candidates`) to limit the number of non-identifier candidates. `50` seems a reasonable choice.

The `PartialSort` function implements the following strategy:
 - if the number of elements to sort is less than `1024` or represents less than `1/64` of the total number of elements, use `std::partial_sort`;
 - otherwise, use `std::nth_element` then `std::sort` on the `n` smallest elements.

This heuristic was obtained by comparing the performance of these 3 algorithms:
 - [`std::sort`](http://en.cppreference.com/w/cpp/algorithm/sort);
 - [`std::partial_sort`](http://en.cppreference.com/w/cpp/algorithm/partial_sort);
 - [`std::nth_element`](http://en.cppreference.com/w/cpp/algorithm/nth_element) + `std::sort`.

They were tested on randomly generated strings of 20 characters with different values for the total number of strings and the number of strings to sort. Here is the graph of an experiment with 2<sup>10</sup> random strings:
![partial-sorting-comparison](https://user-images.githubusercontent.com/10026824/29936611-2c05ceaa-8e83-11e7-8303-a08c39afcae0.png)

The observation is that, when the number of elements to sort is small, `std::partial_sort` outperforms other algorithms while `std::nth_element` + `std::sort` always beats `std::sort` alone in other cases.
This experiment can be reproduced by checking out [this branch](https://github.com/micbou/ycmd/tree/partial-sorting-bench) and running the script `plot_bench.py` (the `matplotlib` package must be installed to draw the graph):
```
./plot_bench.py --bench bench.log
```
The number of elements can be modified by editing [this variable](https://github.com/micbou/ycmd/blob/41d0b5e1735fa917c071421831b48111a8cb6621/cpp/ycm/benchmarks/PartialSorting_bench.cpp#L205).

Finally, [here](https://gist.github.com/micbou/424944dafd8150aa11dc6d8cfe49f624) are the benchmark numbers obtained with these changes on my config. Conclusions are:
 - no performance loss when all candidates are returned;
 - a ~36% speedup in sorting and filtering identifiers with the default value of `max_num_identifier_candidates`;
 - when limiting the number of non-identifier candidates to `50`, sorting and filtering non-identifier candidates is ~5% faster if candidates are not already stored in the repository and ~24% if they are.

@zzbot
Contributor

zzbot commented Sep 3, 2017

💔 Test failed - status-travis

@zzbot
Contributor

zzbot commented Sep 3, 2017

☀️ Test successful - status-travis
Approved by: bstaletic
Pushing f180d07 to master...

@zzbot zzbot merged commit 1e00ddd into ycm-core:master Sep 3, 2017
@zzbot zzbot mentioned this pull request Sep 3, 2017
@micbou micbou deleted the partial-sorting branch September 3, 2017 11:49
zzbot added a commit that referenced this pull request Sep 3, 2017
[READY] Add max_num_candidates option

Add the `max_num_candidates` option to limit the number of non-identifier candidates returned by ycmd. Its default value is set to 50 as this seems a good compromise between performance and the number of candidates shown to the user (10 candidates like for identifiers would be too low).

Limiting the number of candidates improves performance not only when sorting them (see PR #825) but also when sending the response to the client, as the response is smaller, and when clients populate the completion menu, since there are fewer candidates.

zzbot added a commit to ycm-core/YouCompleteMe that referenced this pull request Sep 10, 2017
[READY] Update ycmd

This new version of ycmd includes the following changes:

 - PR ycm-core/ycmd#795: add option to make relative paths in flags from extra conf absolute;
 - PR ycm-core/ycmd#802: fix compilation on Haiku;
 - PR ycm-core/ycmd#804: add libclang detection on FreeBSD;
 - PR ycm-core/ycmd#808: write python used during build before installing completers;
 - PR ycm-core/ycmd#810: support unknown languages from tags;
 - PR ycm-core/ycmd#811: update Universal Ctags languages list;
 - PR ycm-core/ycmd#814: resolve symlinks in extra conf glob patterns;
 - PR ycm-core/ycmd#815: update JediHTTP;
 - PR ycm-core/ycmd#816: update Boost to 1.65.0;
 - PR ycm-core/ycmd#819: filter and sort candidates when query is empty;
 - PR ycm-core/ycmd#820: improve LLVM root path search for prebuilt binaries;
 - PR ycm-core/ycmd#822: inline critical utility functions;
 - PR ycm-core/ycmd#824: do not sort header paths in filename completer;
 - PR ycm-core/ycmd#825: implement partial sorting;
 - PR ycm-core/ycmd#830: add max_num_candidates option;
 - PR ycm-core/ycmd#831: fix multiline comments and strings issues;
 - PR ycm-core/ycmd#832: update Clang to 5.0.0.

The `g:ycm_max_num_candidates` and `g:ycm_max_num_identifier_candidates` options are added to the documentation.

The link to ycmd extra conf is updated.

Fixes #2562.
