[READY] Implement partial sorting #825

micbou · 2017-08-31T19:49:57Z

This PR adds an optional parameter to the CandidatesForQueryAndType and FilterAndSortCandidates functions that limits the number of candidates returned by these functions. The candidates are partially sorted according to this limit (i.e. only the n smallest candidates are sorted).

The current behavior is unchanged: identifiers are still limited to 10 by default and there is no limit for other kind of candidates. The plan is to add a new setting (e.g. g:ycm_max_num_candidates) to limit the number of non-identifier candidates. 50 seems a reasonable choice.

The PartialSort function implements the following strategy:

if the number of elements to sort is less than 1024 or represents less than 1/64 the total number of elements, use std::partial_sort;
otherwise, use std::nth_element then std::sort on the nth smallest elements.

This heuristic was obtained by comparing the performance of these 3 algorithms:

They were tested on randomly generated strings of 20 characters with different values for the total number of strings and the number of strings to sort. Here is the graph of an experiment with 2¹⁰ random strings:

The observation is that, when the number of elements to sort is small, std::partial_sort outperforms other algorithms while std::nth_element + std::sort always beats std::sort alone in other cases.
This experiment can be reproduced by checking out this branch and running the script plot_bench.py (the matplotlib package must be installed to draw the graph):

./plot_bench.py --bench bench.log

The number of elements can be modified by editing this variable.

Finally, here are the benchmark numbers obtained with these changes on my config. Conclusions are:

no performance loss when all candidates are returned;
a ~36% speedup in sorting and filtering identifiers with max_min_identifier_candidates default value.
when limiting the number of non-identifier candidates to 50, sorting and filtering non-identifier candidates is ~5% faster if candidates are not already stored in the repository and ~24% if they are.

This change is

bstaletic · 2017-08-31T21:22:19Z

Reviewed 12 of 12 files at r1.
Review status: all files reviewed at latest revision, 2 unresolved discussions.

cpp/ycm/benchmarks/IdentifierCompleter_bench.cpp, line 55 at r1 (raw file):

BENCHMARK_REGISTER_F( IdentifierCompleterFixture, CandidatesWithCommonPrefix )
    ->RangeMultiplier( 1 << 4 )
    ->Ranges( { { 1, 1 << 16 }, { 10, 10 } } )

The { 1, 1 << 16 } part is the range itself. What does the { 10, 10 } pat mean?

cpp/ycm/benchmarks/PythonSupport_bench.cpp, line 56 at r1 (raw file):

    state.ResumeTiming();
    FilterAndSortCandidates( candidates, "insertion_text", "aA",
                             state.range( 1 ) );

What does state.reange( 1 ) evaluate to in this case?

Comments from Reviewable

micbou · 2017-08-31T21:50:21Z

Reviewed 2 of 12 files at r1.
Review status: all files reviewed at latest revision, 2 unresolved discussions.

cpp/ycm/benchmarks/IdentifierCompleter_bench.cpp, line 55 at r1 (raw file):

Previously, bstaletic (Boris Staletic) wrote…

The { 1, 1 << 16 } part is the range itself. What does the { 10, 10 } pat mean?

{ 10, 10 } is the range for the maximum number of candidates returned by CandidatesForQuery. It's identical to { 10 } (in other words, we only test the value 10) but Google benchmark doesn't accept a single value for a range.

cpp/ycm/benchmarks/PythonSupport_bench.cpp, line 56 at r1 (raw file):

Previously, bstaletic (Boris Staletic) wrote…

What does state.reange( 1 ) evaluate to in this case?

0 when the ranges are { { 1, 1 << 16 }, { 0, 0 } } and 50 when { 1, 1 << 16 }, { 50, 50 } }.

Comments from Reviewable

codecov-io · 2017-09-01T00:18:38Z

Codecov Report

Merging #825 into master will decrease coverage by <.01%.
The diff coverage is 96.55%.

@@            Coverage Diff             @@
##           master     #825      +/-   ##
==========================================
- Coverage   94.82%   94.82%   -0.01%     
==========================================
  Files          79       79              
  Lines        5374     5389      +15     
  Branches      168      169       +1     
==========================================
+ Hits         5096     5110      +14     
  Misses        231      231              
- Partials       47       48       +1

Valloric · 2017-09-01T02:21:04Z

I'm sorry, but the PR description isn't at all clear to me. I've read it twice now and I still have no clue what this PR is trying to accomplish. :D It might have just been a long day at work, but still, could this be clarified? Perhaps with some examples.

Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.

Comments from Reviewable

bstaletic · 2017-09-01T06:48:34Z

@Valloric Let me try to explain.

Current situation

Identifier completer - returns 10 candidates.
All other completers - return everything they can, no hard limit set.

Sorting - std::sort() always.

This pull request

Identifier completer - returns 10 sorted candidates.
Other completers - return up to 50 sorted candidates. Reasoning: User doesn't care for thousands of identifiers, only the most likely "few". Few is 50 by default, controllable with g:ycm_max_num_candidates.

Sorting - Partially sort those 50 (by default) candidates following this algorithm.

If there are less than 1024 candidates use std::partial_sort().
If the number of wanted candidates (50) is less than (total number of possible candidates)/64 use std::partial_sort()
In all other cases use std:nth_element() then std::sort(). - This branch is what isn't tested according to codecov.

This results in much faster sorting of candidates.

Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.

Comments from Reviewable

micbou · 2017-09-01T08:14:11Z

Sorry for being unclear. The idea is that, since the maximum number of identifiers returned to the user is (by default) 10, we don't need to sort all of them but only the 10 smallest. This is significantly faster when the number of identifiers is big. The same can be done for non-identifier candidates if we limit them too. The rational in limiting them is that, in addition to improving performance, it's not particularly useful to return thousands of candidates. Users will never go through all of them.

@bstaletic did a good summary except that the PR doesn't actually set the maximum number of non-identifier candidates returned to the user to 50 but give the possibility to limit this number by adding an optional parameter to the FilterAndSortCandidates function. Setting the limit to 50 is subject to discussion and should be done in a separate PR.

Reviewed 9 of 12 files at r1.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.

Comments from Reviewable

bstaletic · 2017-09-01T08:23:29Z

@micbou Taking another look at the code, the optional argument determining the max number of candidates is initialised defaulting to 0. with 0 having a special meaning of returning everything without any limitation on the number of returned candidates.

I really think we shuld default to some other value.

Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.

Comments from Reviewable

micbou · 2017-09-01T10:08:25Z

The default is 0 to keep the current behavior which is to return all the non-identifier candidates. If we decide to limit the number of candidates, we should set the max_candidates parameter in the Python layer like we currently do for identifiers.

Reviewed 1 of 12 files at r1, 1 of 1 files at r2.
Review status: all files reviewed at latest revision, all discussions resolved, some commit checks broke.

Comments from Reviewable

bstaletic · 2017-09-01T11:51:49Z

Fair point. Now why is codecov still reporting the same coverage after adding another test in r2 of this PR?

Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.

Comments from Reviewable

micbou · 2017-09-01T12:12:00Z

Because the test only sorts the 2 smallest identifiers while it should sort at least 1024 if we want the other part of PartialSort to be covered. I am rewriting the test.

Review status: all files reviewed at latest revision, all discussions resolved, some commit checks failed.

Comments from Reviewable

bstaletic · 2017-09-01T12:39:36Z

With the updated test this is .

Reviewed 1 of 1 files at r2, 1 of 1 files at r3.
Review status: all files reviewed at latest revision, all discussions resolved.

Comments from Reviewable

micbou · 2017-09-01T20:18:32Z

New test improved coverage. This is ready.

Reviewed 1 of 1 files at r3.
Review status: all files reviewed at latest revision, all discussions resolved.

Comments from Reviewable

Valloric · 2017-09-02T02:19:41Z

Thanks for the extra clarification folks! I get what this is trying to do now. Seems like a sensible change. Minor inline comments, but otherwise .

Review status: 12 of 13 files reviewed at latest revision, 2 unresolved discussions, some commit checks failed.

cpp/ycm/Utils.h, line 191 at r4 (raw file):

// Shrink a vector to its sorted |num_sorted_elements| smallest elements. If
// |num_sorted_elements| is 0 or more than the vector size, sort the whole

"or more" -> "or larger"

cpp/ycm/Utils.h, line 210 at r4 (raw file):

  // alone in other cases.
  if ( max_elements <= std::max( static_cast< size_t >( 1024 ),
                                 nb_elements >> 6 ) ) {

IMO the right shift is too clever. Just use / 64 and the compiler will transform it for you into the shift. It will perform the same but it will be easier to read.

Comments from Reviewable

Add an option to limit the number of candidates to sort and filter. Partially sort the list of candidates according to this limit.

micbou · 2017-09-02T06:38:54Z

Reviewed 1 of 1 files at r4, 1 of 1 files at r5.
Review status: all files reviewed at latest revision, 2 unresolved discussions.

cpp/ycm/Utils.h, line 191 at r4 (raw file):

Previously, Valloric (Val Markovic) wrote…

"or more" -> "or larger"

Done.

cpp/ycm/Utils.h, line 210 at r4 (raw file):

Previously, Valloric (Val Markovic) wrote…

IMO the right shift is too clever. Just use / 64 and the compiler will transform it for you into the shift. It will perform the same but it will be easier to read.

Done.

Comments from Reviewable

Valloric · 2017-09-03T02:53:13Z

Thanks!

@zzbot r=bstaletic

Review status: all files reviewed at latest revision, all discussions resolved.

Comments from Reviewable

zzbot · 2017-09-03T02:53:14Z

📌 Commit 1e00ddd has been approved by bstaletic

zzbot · 2017-09-03T02:53:17Z

⌛ Testing commit 1e00ddd with merge f180d07...

[READY] Implement partial sorting This PR adds an optional parameter to the `CandidatesForQueryAndType` and `FilterAndSortCandidates` functions that limits the number of candidates returned by these functions. The candidates are partially sorted according to this limit (i.e. only the `n` smallest candidates are sorted). The current behavior is unchanged: identifiers are still limited to `10` by default and there is no limit for other kind of candidates. The plan is to add a new setting (e.g. `g:ycm_max_num_candidates`) to limit the number of non-identifier candidates. `50` seems a reasonable choice. The `PartialSort` function implements the following strategy: - if the number of elements to sort is less than `1024` or represents less than `1/64` the total number of elements, use `std::partial_sort`; - otherwise, use `std::nth_element` then `std::sort` on the `n`th smallest elements. This heuristic was obtained by comparing the performance of these 3 algorithms: - [`std::sort`](http://en.cppreference.com/w/cpp/algorithm/sort); - [`std::partial_sort`](http://en.cppreference.com/w/cpp/algorithm/partial_sort); - [`std::nth_element`](http://en.cppreference.com/w/cpp/algorithm/nth_element) + `std::sort`. They were tested on randomly generated strings of 20 characters with different values for the total number of strings and the number of strings to sort. Here is the graph of an experiment with 2<sup>10</sup> random strings: ![partial-sorting-comparison](https://user-images.githubusercontent.com/10026824/29936611-2c05ceaa-8e83-11e7-8303-a08c39afcae0.png) The observation is that, when the number of elements to sort is small, `std::partial_sort` outperforms other algorithms while `std::nth_element` + `std::sort` always beats `std::sort` alone in other cases. This experiment can be reproduced by checking out [this branch](https://github.com/micbou/ycmd/tree/partial-sorting-bench) and running the script `plot_bench.py` (the `matplotlib` package must be installed to draw the graph): ``` ./plot_bench.py --bench bench.log ``` The number of elements can be modified by editing [this variable](https://github.com/micbou/ycmd/blob/41d0b5e1735fa917c071421831b48111a8cb6621/cpp/ycm/benchmarks/PartialSorting_bench.cpp#L205). Finally, [here](https://gist.github.com/micbou/424944dafd8150aa11dc6d8cfe49f624) are the benchmark numbers obtained with these changes on my config. Conclusions are: - no performance loss when all candidates are returned; - a ~36% speedup in sorting and filtering identifiers with `max_min_identifier_candidates` default value. - when limiting the number of non-identifier candidates to `50`, sorting and filtering non-identifier candidates is ~5% faster if candidates are not already stored in the repository and ~24% if they are.  --- This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/valloric/ycmd/825)

zzbot · 2017-09-03T04:10:49Z

💔 Test failed - status-travis

zzbot · 2017-09-03T11:44:50Z

☀️ Test successful - status-travis
Approved by: bstaletic
Pushing f180d07 to master...

[READY] Add max_num_candidates option Add the `max_num_candidates` option to limit the number of non-identifier candidates returned by ycmd. Its default value is set to 50 as this seems a good compromise between performance and the amount of candidates shown to the user (10 candidates like for identifiers would be too low). Limiting the number of candidates does not only improve performance when sorting them (see PR #825) but also when sending the response to the client as the response is smaller and when clients are populating the completion menu since there are less candidates.  --- This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/valloric/ycmd/830)

[READY] Update ycmd This new version of ycmd includes the following changes: - PR ycm-core/ycmd#795: add option to make relative paths in flags from extra conf absolute; - PR ycm-core/ycmd#802: fix compilation on Haiku; - PR ycm-core/ycmd#804: add libclang detection on FreeBSD; - PR ycm-core/ycmd#808: write python used during build before installing completers; - PR ycm-core/ycmd#810: support unknown languages from tags; - PR ycm-core/ycmd#811: update Universal Ctags languages list; - PR ycm-core/ycmd#814: resolve symlinks in extra conf glob patterns; - PR ycm-core/ycmd#815: update JediHTTP; - PR ycm-core/ycmd#816: update Boost to 1.65.0; - PR ycm-core/ycmd#819: filter and sort candidates when query is empty; - PR ycm-core/ycmd#820: improve LLVM root path search for prebuilt binaries; - PR ycm-core/ycmd#822: inline critical utility functions; - PR ycm-core/ycmd#824: do not sort header paths in filename completer; - PR ycm-core/ycmd#825: implement partial sorting; - PR ycm-core/ycmd#830: add max_num_candidates option; - PR ycm-core/ycmd#831: fix multiline comments and strings issues; - PR ycm-core/ycmd#832: update Clang to 5.0.0. The `g:ycm_max_num_candidates` and `g:ycm_max_num_identifier_candidates` options are added to the documentation. The link to ycmd extra conf is updated. Fixes #2562.  --- This change is [<img src="https://reviewable.io/review_button.svg" height="34" align="absmiddle" alt="Reviewable"/>](https://reviewable.io/reviews/valloric/youcompleteme/2768)

micbou force-pushed the partial-sorting branch from 062d3c1 to 2a071f8 Compare September 1, 2017 08:49

micbou changed the title ~~[READY] Implement partial sorting~~ [WIP] Implement partial sorting Sep 1, 2017

micbou force-pushed the partial-sorting branch from 2a071f8 to 0270c33 Compare September 1, 2017 12:34

micbou force-pushed the partial-sorting branch from 0270c33 to 9b5a88e Compare September 1, 2017 15:51

micbou changed the title ~~[WIP] Implement partial sorting~~ [READY] Implement partial sorting Sep 1, 2017

micbou force-pushed the partial-sorting branch from 9b5a88e to 51c7331 Compare September 1, 2017 21:21

Implement partial sorting

1e00ddd

Add an option to limit the number of candidates to sort and filter. Partially sort the list of candidates according to this limit.

micbou force-pushed the partial-sorting branch from 51c7331 to 1e00ddd Compare September 2, 2017 06:37

zzbot merged commit 1e00ddd into ycm-core:master Sep 3, 2017

zzbot mentioned this pull request Sep 3, 2017

[READY] Downgrade boost #827

Closed

zzbot mentioned this pull request Sep 3, 2017

[NOT READY] Hide C++ symbols in the ycm_core.so by default #826

Closed

micbou deleted the partial-sorting branch September 3, 2017 11:49

micbou mentioned this pull request Sep 3, 2017

[READY] Add max_num_candidates option #830

Merged

micbou mentioned this pull request Sep 10, 2017

[READY] Update ycmd ycm-core/YouCompleteMe#2768

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[READY] Implement partial sorting #825

[READY] Implement partial sorting #825

micbou commented Aug 31, 2017 •

edited

Loading

bstaletic commented Aug 31, 2017

micbou commented Aug 31, 2017

codecov-io commented Sep 1, 2017 •

edited

Loading

Valloric commented Sep 1, 2017

bstaletic commented Sep 1, 2017 •

edited

Loading

micbou commented Sep 1, 2017

bstaletic commented Sep 1, 2017

micbou commented Sep 1, 2017

bstaletic commented Sep 1, 2017

micbou commented Sep 1, 2017

bstaletic commented Sep 1, 2017

micbou commented Sep 1, 2017

Valloric commented Sep 2, 2017

micbou commented Sep 2, 2017

Valloric commented Sep 3, 2017

zzbot commented Sep 3, 2017

zzbot commented Sep 3, 2017

zzbot commented Sep 3, 2017

zzbot commented Sep 3, 2017

[READY] Implement partial sorting #825

[READY] Implement partial sorting #825

Conversation

micbou commented Aug 31, 2017 • edited Loading

bstaletic commented Aug 31, 2017

micbou commented Aug 31, 2017

codecov-io commented Sep 1, 2017 • edited Loading

Codecov Report

Valloric commented Sep 1, 2017

bstaletic commented Sep 1, 2017 • edited Loading

Current situation

This pull request

micbou commented Sep 1, 2017

bstaletic commented Sep 1, 2017

micbou commented Sep 1, 2017

bstaletic commented Sep 1, 2017

micbou commented Sep 1, 2017

bstaletic commented Sep 1, 2017

micbou commented Sep 1, 2017

Valloric commented Sep 2, 2017

micbou commented Sep 2, 2017

Valloric commented Sep 3, 2017

zzbot commented Sep 3, 2017

zzbot commented Sep 3, 2017

zzbot commented Sep 3, 2017

zzbot commented Sep 3, 2017

micbou commented Aug 31, 2017 •

edited

Loading

codecov-io commented Sep 1, 2017 •

edited

Loading

bstaletic commented Sep 1, 2017 •

edited

Loading