Overhaul std collections and algorithm documentation #36318

Closed
gnzlbg opened this issue Sep 7, 2016 · 28 comments
Labels
A-docs Area: documentation for any part of the project, including the compiler, standard library, and tools C-enhancement Category: An issue proposing an enhancement or a PR with one.

Comments

@gnzlbg
Contributor

gnzlbg commented Sep 7, 2016

The documentation of std::collections:

  • lacks the rigor needed to make well-informed decisions about using the collections, their methods, and most algorithms.
  • contains some of the relevant information, but it is not easy to reach from a particular method (e.g. if I want to find the complexity of a container's insert method, googling "Rust HashSet insert" leads me to the documentation of the insert function, but the information lives in a performance table on the main page of the collections API).
  • is the place for rigor, not the book: the docs are the reference.

I really think that more thorough documentation will help us both find and fix bugs in the documentation and make the std library substantially better.

The solution I propose is to make all the relevant information accessible from all the methods.

The first thing to do would be to achieve consensus on which information should be available for each method. Taking inspiration from the C++ standard, and to kick-start the discussion, I propose that each method / algorithm / function documentation comment contain the following "fields" below the short and long descriptions (a sketch of what such a comment could look like follows the list). If a field doesn't make sense for a particular method, I would still include it and explicitly leave it blank.

  • Algorithmic complexity: big-O in terms of the relevant operations (which can be several: copies/moves in terms of the number of elements, number of swaps, number of comparisons, hashing, ...), with exact operation counts when guaranteed (performs exactly N moves, performs exactly M - N swaps, ...). For methods with amortized/expected complexity it should give both the amortized case and the worst case (and maybe the best case), and under which conditions each one triggers.
    • Example: Vec::push(t): amortized O(1): exactly 1 move/copy of t if Vec::len() < Vec::capacity(); otherwise worst case O(N): exactly N + 1 moves/copies, where N == Vec::len().
    • Example: Vec::reserve(M): O(1) time if M <= Vec::capacity(), O(N) otherwise: exactly N moves/copies, where N == Vec::len().
  • Memory complexity: does it use O(1) stack space? Is it recursive, using O(log N) stack space? Does it allocate memory through an allocator? If so, how much?
    • Example: Vec::push(t): amortized O(1) stack space; in the worst case O(1) stack space plus a single memory allocation of O(k*N), where k is an implementation-defined growth factor independent of N and N == Vec::len().
    • Example: Vec::reserve(M): O(1) stack space if M <= Vec::capacity(); otherwise a single allocation of O(k*N), where k is an implementation-defined growth factor independent of N and N == Vec::len().
  • Effects: what are the effects of the operation?
    • Example: Vec::push(t): Before: len() == N, capacity() == M. After: len() == N + 1, capacity() == M (amortized case) or capacity() == k*M (worst case, where k is an implementation-defined growth factor), pop() == Some(t).
    • Example: Vec::reserve(N): Before: capacity() == M. After: capacity() == max(N, M).
  • Panics: when is an operation supposed to panic?
    • Example: Vec::push(t): panics on OOM, or if a copy of t is performed and the copy panics.
    • Example: Vec::len(): never panics.
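For concreteness, here is a hypothetical sketch of what a doc comment carrying these fields could look like. The wrapper type, numbers, and wording are illustrative assumptions, not the actual std documentation or actual guarantees:

```rust
// Hypothetical illustration: a wrapper type standing in for Vec, with a doc
// comment structured along the proposed fields.
pub struct MyVec<T> {
    inner: Vec<T>,
}

impl<T> MyVec<T> {
    /// Appends an element to the back of the collection.
    ///
    /// # Algorithmic complexity
    ///
    /// Amortized O(1); worst case O(N) moves when `len() == capacity()` and
    /// the buffer has to be reallocated.
    ///
    /// # Memory complexity
    ///
    /// O(1) stack space; at most one heap allocation of O(k * N) elements,
    /// where `k` is an implementation-defined growth factor.
    ///
    /// # Effects
    ///
    /// `len()` grows by one; `capacity()` is unchanged unless a reallocation
    /// occurred; a subsequent `pop()` returns the pushed value.
    ///
    /// # Panics
    ///
    /// Panics if the new capacity overflows `usize` or if allocation fails.
    pub fn push(&mut self, value: T) {
        self.inner.push(value);
    }
}
```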

I consider this the bare minimum level of "formalization" that at least the basic collections and algorithms in the standard library should have. As you can see, though, things become extremely verbose, so we might want to explore how to factor out the parts of the documentation that would otherwise be repeated in many comments (e.g. that for Vec the growth factor is a constant independent of the number of elements or the capacity of the vector).

So that's my strawman proposal. I guess the steps to follow are (please correct me if I'm wrong):

  • Achieve consensus on how to document the methods of the std library's algorithms and collections. Maybe through an RFC? Or is an RFC not required here?
  • Explore whether there is an easy way to reuse some documentation bits (e.g. OOM panics, never panics, the explanation of amortized/expected complexity).
  • Go (again) through all methods and document them rigorously (i.e. actually implement this).
  • Orthogonal to this: since not all collections are mentioned on the main collections page (e.g. HashSet), we probably want to go over that page and overhaul it as well.

For a feeling of the current state of things:

  • HashMap::insert doesn't mention that insertions are O(1), so if you land there (e.g. via Google) you are on your own. Relevant information should be visible right away.
  • Vec::insert: in the amortized case it requires O(1) stack memory and performs O(end - pos) moves, but in the worst case it requires O(kN) memory and O(N) moves. None of this is "obvious" from the docs; we should make it obvious (see the sketch after this list).
  • std::slice::sort's complexity is O(N log N) in the number of elements, and the docs mention that it allocates approximately "2*n" elements. Does that mean at worst 2N? Can it be more? While this is better documented than other methods, it is still extremely vague. In particular, trivial stable sort implementations require N/2 extra memory, a bad one at least N, and the good ones are in place (e.g. GrailSort requires O(1)). The Rust standard library doesn't have an in-place unstable sorting algorithm (AFAIK), but the documentation of one implemented with something quicksort-based like introsort should say that it uses O(log N) stack space in some cases (none of my examples above used anything but O(1) stack space, but basically every recursive algorithm requires something other than O(1)).
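As a small, self-contained illustration of why per-method complexity matters (this sketch is mine, not from the docs): the two loops below look equally innocent, but one is linear overall and the other quadratic.

```rust
// Illustration only: the asymptotic difference the current docs leave implicit.
// Pushing at the back is amortized O(1) per call; inserting at index 0 shifts
// every existing element, so the second loop is O(n^2) overall.
fn main() {
    let n = 10_000;

    let mut back = Vec::with_capacity(n);
    for i in 0..n {
        back.push(i); // amortized O(1) per push
    }

    let mut front = Vec::new();
    for i in 0..n {
        front.insert(0, i); // O(len) moves per insert => O(n^2) total
    }

    assert_eq!(back.len(), front.len());
}
```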

cc @GuillaumeGomez

@hanna-kruppe
Contributor

hanna-kruppe commented Sep 7, 2016

I'm not against additional rigor, but I would rather have the API documentation err on the side of too few details rather than too many:

  • API docs aren't ISO standards, they don't have to spell out every little detail, only those that actually aid programmers. In particular, some "obvious" things don't necessarily need to be spelled out (e.g., the effect of Vec::push on a subsequent Vec::pop call is fully clear from the current prose descriptions).
  • Since the review process is presumably just one person with permissions giving an r+, there is a danger of sneaking in more guarantees than "the team" wants to give (this wouldn't be the end of the world as API docs aren't standards—see previous bullet point—but these non-guarantees can be broken accidentally, which defeats the point of documenting them). RFCs could alleviate this but are pretty heavy duty for something so relatively unimportant.

Also, spec'ing the number of moves/copies makes little sense (except as one of many factors influencing run time): C++ specifies it because moves and copies call constructors and are thus observable behavior, but in Rust they're plain memcpys (except Clone). I also have doubts about the usefulness and practicality of spec'ing asymptotic stack space requirements, where the usual objection of constant factors being incredibly important applies even more than everywhere else.

@gnzlbg
Contributor Author

gnzlbg commented Sep 7, 2016

I'm not against additional rigor, but I would rather have the API documentation err on the side of too few details rather than too many

I both agree and disagree here. I think it is worth documenting as much as makes sense (that is, "many details"), but it makes sense to err on the "conservative" side, for example by specifying conservative complexity guarantees (e.g. for stable sort, "at most O(N) extra memory" doesn't prevent the implementation from switching to grail sort, which requires O(1) memory, for those cases in which it is faster).

API docs aren't ISO standards, they don't have to spell out every little detail, only those that actually aid programmers. In particular, some "obvious" things don't necessarily need to be spelled out (e.g., the effect of Vec::push on a subsequent Vec::pop call is fully clear from the current prose descriptions).

Yes, I agree that these things are obvious, and repetitive. The tricky part will be achieving consensus on what should and should not be documented, particularly in the effects section. My strawman tries to mention everything, since that is one of the extremes, but probably something in between will be better.

Having said this, I do think that having a terse bullet-point list of all the side effects of an operation is useful even if they are already documented in the prose. For example, for Vec::pop, spelling the side effects out might help users avoid the wrong assumption that the length afterwards is always the length before minus one. It's trivial, it's already said somewhere else, but IMO important things are worth saying twice (in particular, when "refreshing" an algorithm one might go directly to the complexity/effects section and read just that).

Since the review process is presumably just one person with permissions giving an r+, there is a danger of sneaking in more guarantees than "the team" wants to give (this wouldn't be the end of the world as API docs aren't standards—see previous bullet point—but these non-guarantees can be broken accidentally, which defeats the point of documenting them).

Even though API docs aren't standards, guaranteeing O(N) complexity and changing that afterwards to O(N log N) is an API breaking change, so we want to be very conservative at first with what we guarantee for each method. Tightening any of those guarantees can always be done later, but I think that should follow the RFC process: e.g., want to tighten the guarantee that stable sort requires O(1) extra memory instead of O(N)? Write a (small) RFC for it.

Also, spec'ing the number of moves/copies makes little sense (except as one of many factors influencing run time): C++ specifies it because moves and copies call constructors and are thus observable behavior, but in Rust they're plain memcpys (except Clone)

I should have said move/clone instead of move/copy, but I still think it is worth guaranteeing, or at least documenting, this to allow users to reason about performance. Some algorithms are in place and do not require any memcpys, while others need to perform a memcpy of O(N) elements. Again, I am not suggesting making these guarantees as tight as possible (or as tight as the current implementation), but guaranteeing that O(N) copies happen as opposed to O(N^2) copies tells users that (see also the sketch after this list):

  • copies might be happening (which is something that they might not have thought about)
  • the number of copies happening won't make the algorithm explode (and somebody has given thought to this).
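As a hedged sketch of the first point (the Counted wrapper and its counter are made up for illustration), one way to see whether an operation clones at all is to count clones explicitly:

```rust
// Sketch: counting clones to observe whether an operation clones at all.
use std::sync::atomic::{AtomicUsize, Ordering};

static CLONES: AtomicUsize = AtomicUsize::new(0);

#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct Counted(u32);

impl Clone for Counted {
    fn clone(&self) -> Self {
        CLONES.fetch_add(1, Ordering::Relaxed);
        Counted(self.0)
    }
}

fn main() {
    let v: Vec<Counted> = (0..1_000u32).rev().map(Counted).collect();
    let mut w = v; // moving the Vec copies only its header: no clones
    w.sort();      // sorting moves elements around but never clones them
    println!("clones observed: {}", CLONES.load(Ordering::Relaxed)); // 0
}
```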

I think a good example is the two kinds of sorting algorithms one generally encounters: unstable in-place, and stable with a buffer. The number of comparisons alone doesn't tell you the whole picture about either algorithm. For the unstable in-place one, the number of swaps and the stack usage (O(1) vs O(log N)) tell the larger story (*), while for the stable sort you also need to know how much memory it allocates, how often it is going to copy elements into the buffer, and whether it can survive OOM (in Rust it obviously cannot, but maybe in the future with Allocators it could be made to work without a buffer).

(*) I think that with respect to stack usage constant factors don't matter "that much", since for an O(1) algorithm you know that if it works for zero elements it won't overflow the stack for N elements, but the same cannot be said about an algorithm with O(log N) stack usage.

@hanna-kruppe
Contributor

hanna-kruppe commented Sep 7, 2016

Even though API docs aren't standards, guaranteeing O(N) complexity and changing that afterwards to O(N log N) is an API breaking change

That is precisely my worry. Giving any guarantee at all is a commitment that needs to be made carefully and consciously, unless the bounds given are incredibly loose. Such loose bounds, on the other hand, can be harmful by confusing people and giving a wrong impression.

move/clone

clone calls might be worth documenting (for the same reason as in C++), but I don't understand the focus on memcpys. The fact that something takes, say, O(n) time tips you off about the main implication of the copies. As far as constant factors go, a memcpy is often among the cheapest operations you can do, compared to many other things that nobody proposes to specifically call out in the documentation. So I do not see any reason to emphasize memcpy over a myriad of other metrics (scratch allocations, multiple passes in an algorithm that one might expect to be single-pass, overflow checks, the number of system calls, etc.).

You are right that some of these metrics give a better idea of the algorithm used, but this requires giving bounds that aren't trivially true for any sensible implementation. Again, giving overly loose bounds (e.g., O(log N) stack space, O(N) scratch space, O(N²) comparisons and swaps worst case) would be misleading, and narrowing it down would be a large commitment that I do not believe people would be or should be willing to make. Guaranteeing details of the algorithm used (the raison d'être for these metrics) is precisely what we should be reluctant to do.


The more I think and write about this topic, the less happy I am about giving precise asymptotic complexity bounds. Don't get me wrong, knowing that a program takes somewhere between O(1) and O(N log N) time and space is a very valuable sanity check that it won't blow up on larger data sets. We should give this information. But more details are both hard to provide and may not give all that much insight. For realistic performance estimates, one needs more information than a programming language should be willing to guarantee, including information that can't be captured in asymptotic complexity at all.

In short, it is information about the specific implementation your program is actually using, and that's the exact opposite of what API docs should do. I don't know if and how this information should be made available to those fine-tuning their code's performance, but I am increasingly convinced that API docs are the wrong place for these sorts of details.

@mark-i-m
Member

mark-i-m commented Sep 7, 2016

@rkruppe Perhaps it would be valuable to create another RFC to finalize asymptotic bounds for operations on standard library collections before putting them in the docs.

Also, I think it would be nice if docs could mention significant dropping and ownership behavior (e.g. Vec::pop gives ownership of the popped item) and events that could cause unwanted behavior (e.g. killing another thread while that thread is executing Vec::drain could result in memory leaks).

@mark-i-m
Member

mark-i-m commented Sep 7, 2016

I'm not against additional rigor, but I would rather have the API documentation err on the side of too few details rather than too many

Also, I would rather err on the side of too many details as long as they are well organized. People can skim through stuff they don't care about, but not getting the information you need is maddening...

@gnzlbg
Contributor Author

gnzlbg commented Sep 7, 2016

I think it is worth remembering that we are currently at the opposite end of the spectrum, where "almost nothing" is documented (a lot is documented, but not really on a method-by-method basis with strict guarantees that users can rely on).

Maybe I started this discussion on the wrong foot by suggesting specific things to document. It would probably be better to agree on general goals for the API documentation first (from the point of view of a user) and, once we have consensus on that, try to formulate more specific guidelines that achieve those goals (and then iterate).

In my opinion, for a user reading the API documentation of some collection method or algorithm, it should be clear whether the method:

  1. "Can be called in a loop", and estimate the complexity of the loop in terms of the method input.
  2. Can overflow their stack, and in that case, how does the stack grow (particularly important for embedded development),
  3. Can panic at all and, when possible, under which circumstances (only due to OOM, stack overflow, etc.)
  4. Has some observable side-effects like allocating memory (transferring ownership, spawning threads, performing a sys-call, etc...).
  5. Changes the collection invariants and which ones (e.g. can leak memory under which situations).
  6. Has some performance properties that depend on properties of the input or the collection invariants, or both.

So what do I mean by "the opposite end of the spectrum" above? Consider finding a key in a HashSet using HashSet::contains; quoting from the docs:

fn contains<Q: ?Sized>(&self, value: &Q) -> bool where T: Borrow<Q>, Q: Hash + Eq

Returns true if the set contains a value.

The value may be any borrowed form of the set's value type, but Hash and Eq on the borrowed form must match those for the value type.

Going through the goals stated above, the current documentation basically fails at all of them. The API docs don't specify the complexity (so I don't know whether calling it inside a loop will blow up my program), whether it is implemented recursively (i.e. whether the stack could overflow), whether the operation can panic, whether it can leak or allocate memory, ...
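A small sketch of the reasoning the current contains docs don't support (the function and sizes here are made up): whether this loop is fine or a performance trap depends entirely on the undocumented per-lookup cost.

```rust
// Illustration only: is this loop O(needles.len()) with expected O(1) lookups,
// or something worse? The `contains` API docs alone don't say.
use std::collections::HashSet;

fn count_hits(haystack: &HashSet<u64>, needles: &[u64]) -> usize {
    needles.iter().filter(|&&n| haystack.contains(&n)).count()
}

fn main() {
    let haystack: HashSet<u64> = (0..100_000).collect();
    let needles: Vec<u64> = (50_000..150_000).collect();
    println!("hits: {}", count_hits(&haystack, &needles));
}
```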

Does this mean that contains is useless? Obviously not. But it does mean that changing any of the above is not a breaking change, since the API doesn't promise anything. This is a serious stability issue: the complexity of most operations could change tomorrow and make my program blow up with the next rustc version.

Realistically, though, I seriously doubt that anybody at this point would break any of the current guarantees that HashSet provides, and the same goes for the other collections. If anything, those guarantees might get better, not worse. At least I would consider making e.g. the complexity worse a breaking change "even if the API doesn't promise anything". So IMO we might as well go ahead and document this. It would serve not only as guidance for users, but also as guidance for those improving HashSet (what can and cannot be done without breakage).

I think that how to exactly document this should probably be handled in an RFC, but at least agreeing on a set of goals for the documentation first would make it much easier to start writing such an RFC.


@rkruppe

That is precisely my worry. Giving any guarantee at all is a commitment that needs to be made carefully and consciously, unless the bounds given are incredibly loose. Such loose bounds, on the other hand, can be harmful by confusing people and giving a wrong impression

For me, changing the complexity (among other things mentioned above, like memory usage, stack usage, allocating memory, panics, ...) is a breaking change even if it is not documented. If the std collections do not guarantee any of these things, that means to me that:

  • I cannot reason about my programs, and that even if I look at the implementation, my reasoning needs to be updated with every new rustc version
  • I cannot use std collections if I need stable behavior, since currently most methods have no rigorous (as in properly specified, not as in tight) guarantees (so they can change anytime).

but I don't understand the focus on memcpys. The fact that something takes, say, O(n) time tips you off about the main implication of the copies. As far as constant factors go, a memcpy is often among the cheapest operations you can do

Cheapest compared to what? Comparisons in a stable sort? Memcpys are relatively cheap for types that are small and expensive to compare, and relatively expensive for types that are large and cheap to compare (like struct { buffer: [u8; 16384], unique_id_for_cmp: u8 }).

Again, giving overly loose bounds (e.g., O(log N) stack space, O(N) scratch space, O(N²) comparisons and swaps worst case) would be misleading,

Whether the stack space used is O(1) or O(log N) is the difference between a stack overflow being possible or not (and thus between a panic being possible or not; or at least, if an O(1) algorithm doesn't overflow the stack for N == 0, it won't overflow the stack for any N). Maybe I am weird for caring about this, but I do like to know. And if I write some code that should never panic, and use some algorithm that uses O(1) stack space for it, I would not want this behavior to change under the covers in the next rustc upgrade.

I agree that this information does leak implementation details of the algorithm, and I agree on the need to be conservative to give implementors enough leeway to perform optimizations, but one cannot have it both ways.

Guaranteeing details of the algorithm used (the raison d'être for these metrics) is precisely what we should be reluctant to do.

I think we should give enough guarantees to allow people to reason about how and when to use the methods and classes in std collections without severely constraining the implementation.

Documentation-wise, it is probably better to move some of the details to a "Non-binding implementation details" section in the API documentation comments: extra information that is relevant for those wanting to know exactly what is going on, but that is allowed to change between rustc versions.

The more I think and write about this topic, the less happy I am about giving precise asymptotic complexity bounds. Don't get me wrong, knowing that a program takes somewhere between O(1) and O(N log N) time and space is a very valuable sanity check that it won't blow up on larger data sets. We should give this information.

Deciding exactly which information to give, and how, is hard. Sadly we only have the API docs for this, since that is the reference. As mentioned above, I think it would be better to have a non-binding section in the API docs that tells users implementation details that are "good to know" but are allowed to change (e.g. the value of Vec's growth factor is good to know, and if you are benchmarking it might help you understand your results, but you should not rely on it).

@mark-i-m
Member

mark-i-m commented Sep 7, 2016

@gnzlbg What you said makes a lot of sense :)

I have a question about your 5th point:

  1. Changes the collection invariants and which ones

What do you mean by this?

Also, I would like to add to the list: "Has interesting static side effects, such as giving up ownership of some piece of data."

[Regarding space complexity] Maybe I am weird for caring about this, but I do like to know.

I think space complexity is important (especially for collection types), but I have not come across examples where stack space complexity has caused my program to overflow... In my limited experience, this tends to be more a function of time complexity... Do you have a specific example where you have had problems?

Documentation-wise, it is probably better to move some of the details to a "Non-binding implementation details" section in the API documentation comments: extra information that is relevant for those wanting to know exactly what is going on, but that is allowed to change between rustc versions.

Here, I disagree:

  • People who want to know "exactly what is going on" should read the code (which is generally well-written and commented), while general users should be able to get enough information to do what they need to.
  • If a detail is later determined to be important to the average programmer, it should be formally made part of the spec for that collection via an RFC and then included in the docs.
  • If a user has a bizarre situation and needs to know something unusual, they should ask the Rust team via an issue or on IRC.

The reason is that it breaks modularity. If we put something in the docs, even if it is marked "non-binding", somebody will write code that depends on the non-binding behavior, and that code will break when the non-binding behavior changes.

That said, I would like a section specifying what will not be standardized. For example, if for some reason it is not likely for the complexity of something to be specified, that should be well-noted.

More generally, I think it is also important to specify what should be mentioned in the module-level docs. IMHO, this should be:

  1. Basic usage (unless obvious)
  2. Design paradigms (e.g. RAII) that should guide the use of the library or help a programmer to understand the design better
  3. The Unspecified/Unstandardized section

@hanna-kruppe
Contributor

@gnzlbg I have to agree that hashing out the details is out of scope for a single issue and that at this stage the focus should be on the general direction. Unfortunately, that still leaves us with a major philosophical disagreement: what to guarantee and what not. I don't really disagree with this sentiment:

At least I would consider making e.g. the complexity worse a breaking change "even if the API doesn't promise anything".

Certainly if someone filed an issue about a reasonable program that:

  • regressed significantly in performance,
  • ran into reasonable heap/stack limits,
  • degraded in quality for other reasons (e.g., suddenly using many more syscalls),

then I am pretty sure someone would try to fix that. At the same time, I am entirely opposed to enshrining such "reasonable" guarantees to a degree that terms like "breaking change" float around. We even reserve the right to make programs stop compiling as a side effect of tweaking type inference or adding trait implementations. It is simply not realistic to freeze all the myriad details that can factor into a program compiling and running at peak performance, and occasionally some metric will get worse, accidentally or deliberately as part of a change that's for the better in other aspects.

So I am quite receptive to the idea of documenting these things as "non-binding"/"non-guarantees". This doesn't mean that you "cannot use std collections if [you] need stable behavior". Pragmatically, it just means that programs might occasionally get slower (and at other times faster). But this is nothing new and nothing that can be solved at the level of asymptotic complexity — codegen and optimization passes also have a huge effect and change much more frequently. For example, see #35408: rustc started overflowing its stack on some inputs, not because of any algorithmic changes but because of an inadequacy of MIR trans.

This is not something that can be willed away with API docs and RFCs, it is an unfortunate fact of software engineering.

@mark-i-m
Member

mark-i-m commented Sep 7, 2016

We even reserve the right to make programs stop compiling as a side effect of tweaking type inference or adding trait implementations.

But wouldn't this be labeled as a breaking change? I can't imagine something like this happening without at least an RFC? I'm just curious because this is a very strong statement...

@hanna-kruppe
Contributor

@mark-i-m See RFC 1122 for the official policy on language changes and RFC 1105 on library stuff like trait implementations. The people pulling the levers are responsible and careful enough that there aren't big changes dropping from the sky every other day (e.g., often there will be a crater run), but virtually every change anywhere in the API surface or certain parts of the compiler has the potential to break some code, even if it's usually hypothetical or pathological.

@mark-i-m
Member

mark-i-m commented Sep 7, 2016

Ah, I see what you mean now. Thanks

@brson brson added C-enhancement Category: An issue proposing an enhancement or a PR with one. A-docs Area: documentation for any part of the project, including the compiler, standard library, and tools labels Sep 7, 2016
@gnzlbg
Contributor Author

gnzlbg commented Sep 7, 2016

@mark-i-m

I have a question about your 5th point:

Changes the collection invariants and which ones

What do you mean by this?

I mean things like "this method changes the result of ::len(), ::capacity(), etc.". Maybe I should have said "mutates the collection" or something like that.

Do you have a specific example where you have had problems?

Unstable sort is the typical example: if implemented as a recursive quicksort it can overflow the stack, but if the implementation is a bit more clever and doesn't use recursion (or uses some form of tail recursion), that cannot happen. The guarantee is often O(1), and in some rare cases (divide-and-conquer algorithms) it might be O(log N), but the thing to read from it is "the implementation is allowed (or not) to use recursion, which as a consequence might overflow your stack". On a desktop this might be irrelevant (N would need to be far too large), but on a tiny embedded system the amount of stack you get can be, well, tiny.
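For a concrete, hedged illustration (a deliberately naive quicksort, not the std implementation): with a last-element pivot the recursion depth is linear on already-sorted input, which is exactly the property a stack-usage field would surface.

```rust
// Naive Lomuto-partition quicksort, used only to demonstrate recursion depth;
// real library sorts avoid this worst case.
fn quicksort(v: &mut [u32], depth: usize, max_depth: &mut usize) {
    *max_depth = (*max_depth).max(depth);
    if v.len() <= 1 {
        return;
    }
    let pivot = v.len() - 1;
    let mut store = 0;
    for i in 0..pivot {
        if v[i] <= v[pivot] {
            v.swap(i, store);
            store += 1;
        }
    }
    v.swap(store, pivot);
    let (left, right) = v.split_at_mut(store);
    quicksort(left, depth + 1, max_depth);
    quicksort(&mut right[1..], depth + 1, max_depth);
}

fn main() {
    let mut already_sorted: Vec<u32> = (0..1_000).collect();
    let mut max_depth = 0;
    quicksort(&mut already_sorted, 0, &mut max_depth);
    // On sorted input the depth is ~n, not ~log n: O(N) stack usage.
    println!("max recursion depth: {}", max_depth);
}
```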

People who want to know "exactly what is going on" should read the code (which is generally well-written and commented), while general users should be able to get enough information to do what they need to.

There are some things that are good to know, and having to read the code to find them is painful. For example, average programmers know that Vec has a growth factor and that they cannot rely on it; still, on some occasions it is nice to be able to find out what it is. For instance, while benchmarking you might see an effect that actually depends on the exact growth factor of a vector, and knowing the factor helps you better understand the benchmark results. Another example is writing an Allocator for a Vec: while you cannot rely on the growth factor, you might make your allocator faster by knowing which ballpark it is in (is it 1.2-1.4 or 1.9-2.0?). Another example is sort. It is useful to know the exact algorithm that sort uses (merge sort) without needing to go through the code. The exact algorithm might change, but if you know it is merge sort, and that merge sort is particularly badly suited to your input, that information would allow you to choose a better-suited algorithm without having to dig through the code. Lots of sorting algorithms are ill suited to lots of patterns (already sorted, almost sorted, organ pipe, ...).
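As a sketch of "good to know but non-binding" information: the growth pattern can simply be observed rather than relied upon (the output may differ between Rust versions, which is exactly the point).

```rust
// Observing, not relying on, Vec's reallocation pattern. The exact growth
// factor is an implementation detail and may change between releases.
fn main() {
    let mut v: Vec<u32> = Vec::new();
    let mut last_capacity = v.capacity();
    for i in 0..1_000 {
        v.push(i);
        if v.capacity() != last_capacity {
            println!("len {:4} -> capacity {:4}", v.len(), v.capacity());
            last_capacity = v.capacity();
        }
    }
}
```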

So I believe such a section with a "summary of relevant implementation details that might change" is still worth it in some cases. Things like the complexities do hint at the implementation details, but having them in bullet-point form saves you from guessing. For example, if the worst-case complexity of nth_element were O(N), that would effectively tell you which algorithm is used (since there is almost only one with that complexity). While you don't want to guarantee that a specific algorithm is used, you can still save your users some work by spelling it out in a non-binding section ("btw, we use algorithm X; if somebody finds a better algorithm we might move to it, but your programs won't break because we won't make the complexity worse").

If a detail is later determined to be important to the average programmer, it should be formally made part of the spec for that collection via an RFC and then included in the docs.

Some details are always important, some are often important, and some might be important in corner cases. My point being, not all details are equally important. We might not be willing to commit on all details (e.g. the exact growth factor of a vector), but this doesn't mean we shouldn't pass this information to users. Requiring everybody to "read the code" doesn't scale.

The reason is that it breaks modularity. If we put something in the docs, even if it is marked "non-binding", somebody will write code that depends on the non-binding behavior, and that code will break when the non-binding behavior changes.

This happens independently of whether the information lives in a "do not rely on this" documentation section or in the source code. That is, users willing to rely on things they shouldn't rely on are going to do so either way. I don't think we should penalize all users by not including this information.

More generally, I think it is also important to specify what should be mentioned in the module-level docs. IMHO, this should be:

  • Basic usage (unless obvious)
  • Design paradigms (e.g. RAII) that should guide the use of the library or help a programmer to understand the design better
  • The Unspecified/Unstandardized section

I think that if we mean the same thing by the unspecified/unstandardized section (what I was referring to as "non-binding"), it doesn't make much sense to me at the module level, because that information is going to be highly method-specific. Are you referring to something else?


@rkruppe

We even reserve the right to make programs stop compiling as a side effect of tweaking type inference or adding trait implementations.

Yes, for programs that are arguably already broken.

It is simply not realistic to freeze all the myriad details that can factor into a program compiling and running at peak performance, and occasionally some metric will get worse, accidentally or deliberately as part of a change that's for the better in other aspects. Pragmatically, it just means that programs might occasionally get slower (and at other times faster). But this is nothing new and nothing that can be solved at the level of asymptotic complexity — codegen and optimization passes also have a huge effect and change much more frequently.

@rkruppe Maybe we are talking past each other, but IIUC your point is that knowing the algorithmic and memory complexities of the methods, and whether they panic and have some form of side-effect or not, is worth documenting, but not worth "guaranteeing"? And that users should still rely on these even if they are not guaranteed, but if they change and their programs break, then they should complain? I'm a bit lost. I also have the feeling that for some reason you think I am proposing to guarantee that sorting X ints will run in Y microseconds on a Xeon Z, which couldn't be further from what I am proposing.

What I am proposing in a nutshell is guaranteeing the information required such that users know whether they can call sort in a loop of their application or not, whether that call can blow up their stack and segfault or not, and whether for some input they might get an OOM or not.

In particular, most of this information is already documented and guaranteed for sort, mainly because without it nobody can really reason about whether sort is actually usable for their problem. The same is true of all methods, so what I am proposing is doing this for all the other methods as well.

@mark-i-m
Member

mark-i-m commented Sep 8, 2016

@gnzlbg

I think that if we mean the same thing by the unspecified/unstandardized section (what I was referring to as "non-binding"), it doesn't make much sense to me at the module level, because that information is going to be highly method-specific. Are you referring to something else?

I think we are actually thinking of converse ideas. I mean a section in which we explicitly say "XYZ behavior/property is not going to be standardized and you should not rely on it" rather than "You should not rely on it, but XYZ is actually done using ABC". More concretely, we should say "Do not rely on any particular allocator" rather than "You should not rely on the allocator, but we happen to be using such and such".

But I do agree that something like this could also be useful on a per-function basis.

Some details are always important, some are often important, and some might be important in corner cases. My point being, not all details are equally important. We might not be willing to commit on all details (e.g. the exact growth factor of a vector), but this doesn't mean we shouldn't pass this information to users.
...
The exact algorithm might change, but if you know it is merge sort, and that merge sort is particularly badly suited to your input, that information would allow you to choose a better-suited algorithm without having to dig through the code.

I understand where you are coming from, but I disagree. Building software and assuming things that might change seems like a very wobbly foundation. If somebody is willing to change their design completely based on which sort we are using, it might be worth it to actually specify the properties of the sort we are using (or maybe they are making too many implementation specific assumptions)... As soon as the sorting algorithm changes, their code will break or become really slow. And the only fix is for them to change their design, which is exactly what we want to avoid by writing modular code.


@rkruppe

After thinking more about your earlier post, I still think it is a good idea to give some guarantees about the behavior of algorithms in the reference. A library with such fundamental data structures needs to give correspondingly fundamental guarantees, like complexity bounds, even if they are a bit loose.

I honestly don't expect the bounds to change too much either (that is, the time and space complexity bounds). Most of these data structures are pretty standard, and their implementations are pretty well known.

@gnzlbg
Contributor Author

gnzlbg commented Sep 8, 2016

Building software and assuming things that might change seems like a very wobbly foundation. If somebody is willing to change their design completely based on which sort we are using, it might be worth it to actually specify the properties of the sort we are using (or maybe they are making too many implementation specific assumptions)... As soon as the sorting algorithm changes, their code will break or become really slow. And the only fix is for them to change their design, which is exactly what we want to avoid by writing modular code.

I am just trying to see the documentation from the point of view of users that do and don't care. For example, when writing a program, if you need to sort something, most of the time you don't actually care as long as it isn't unreasonably slow. It might be irrelevant how fast sorting is, you might not know whether the sequence you are sorting even follows some pattern, ... Sometimes, users do know, it is relevant, and they do care. While std::sort is a good generic stable sort algorithm, it isn't the best at sorting every kind of input; we have libraries with multiple sorting algorithms for this reason. Such a user might ask: can I still start with std::sort and worry about choosing a better algorithm later? Without knowing the exact algorithm it uses, this might be hard. Users can still go on, open the source, and find the algorithm it uses, which paper explains it, and so on. While most users might not care, I'd rather spare those that do the work and use the documentation to reinforce that they should not be relying on this.

Some information is IMO just good to know. You go on, use std::sort, get horrible performance, and don't even know that it is because of the sort. By the time you benchmark you might have a better idea about your inputs than when you decided to just sort something. Knowing the exact algorithm might be the difference between a fast "ah! I remember from Algorithms and Data Structures that merge sort is particularly bad at this case", and hours of helpless debugging and profiling trying to figure out why std::sort fails for your case. Just because users shouldn't rely on it doesn't make the information useless.

@hanna-kruppe
Contributor

@gnzlbg

Yes, for programs that are arguably already broken.

No, for perfectly reasonable programs too, if they are judged to occur rarely enough in the wild (that's one use of crater!) and be easy enough to fix (e.g. by explicitly disambiguating the method call or adding a type annotation).

Maybe we are talking past each other, but IIUC your point is that knowing the algorithmic and memory complexities of the methods, and whether they panic and have some form of side-effect or not, is worth documenting,

That is exactly what I am saying, since, as you yourself later wrote:

Knowing the exact algorithm might be the difference between a fast "ah! I remember from Algorithms and Data Structures that merge sort is particularly bad at this case", and hours of helpless debugging and profiling trying to figure out why std::sort fails for your case. Just because users shouldn't rely on it doesn't make the information useless.

The distinction that might confuse you is that I draw a line between a hard guarantee given in the API docs, which will influence all Rust implementations ever unless heroically changed against inertia and the ugliness of the words "breaking change", versus so-called Quality of Implementation (QoI) issues. A C compiler that segfaults without output whenever there is any problem with the input program is a really shitty compiler, but it conforms to the C standard(s). Likewise, an algorithm that blows the stack on reasonable inputs really, really ought to be fixed, but this is not a question of standards and guarantees, just a QoI issue. Furthermore, like the (absence of) diagnostics from the hypothetical C compiler, it "only" has to do with a specific implementation, not with the language.

whether that call can blow up their stack and segfault or not, and whether for some input they might get an OOM or not.

Complexity bounds are neither necessary nor sufficient for knowing that. They are a crude approximation. Don't get me wrong, I have the greatest respect for complexity theory and I'm the first to nitpick the correctness of some bound. But for these questions they are heuristics at best (especially for the stack, where the upper limit is not just a constant but a quite small one). To give an example, there's a function in libcore right now (f64::parse) that takes O(1) stack space, but the constant is over 2 kilobytes. Other functions might stack-allocate even more.
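A hedged sketch of that point (the sizes are invented): an asymptotic stack bound says nothing about the constants that actually hit a limit.

```rust
// O(1) stack space is not the same as "small": one fixed buffer can dwarf
// many tiny recursive frames.
fn constant_but_big() -> u8 {
    let scratch = [0u8; 4096]; // O(1) stack, but 4 KiB in a single frame
    scratch[0]
}

fn recursive_but_small(n: u32) -> u32 {
    // O(n) frames, each only a handful of machine words
    if n == 0 {
        0
    } else {
        1 + recursive_but_small(n - 1)
    }
}

fn main() {
    println!("{} {}", constant_but_big(), recursive_but_small(10));
}
```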

Furthermore, I do not think adding hard guarantees adds much value over just documenting implementation details in a non-binding manner. For answering questions of the form "will this program work reasonably well" or "why does this algorithm not work well" on a day-to-day basis, the implementation details are almost as good. For long-term evolution, non-guarantees are not much worse:

  • If there really is only one reasonable implementation, then the non-binding hints will remain valid forever and possibly even across implementations.
  • If the implementation changes in a way that doesn't affect the bounds, then guaranteed bounds wouldn't have helped you anyway. (This assumes swift update of documentation, and not documenting superfluous details — I think we can manage that)
  • If the implementation changes in significant ways, then differences begin to appear:
    • If an RFC was accepted and guarantees were set in stone, the change can't happen. This would be sad because presumably the changes would be for the better, else they wouldn't be done.
    • If no RFC was accepted, well, there was no guarantee to begin with while implementation details could still be there and useful (see above).

So, in summary:

  • I am fully in favor of documenting existing guarantees better, e.g., copy bounds from the std::collections landing page to the individual methods
  • I am not in principle opposed to adding more guarantees, but this would have to go through the RFC process, which is a big investment of time and energy for something of IMHO relatively little practical impact
  • I think the practical impact is relatively small because many of the same questions (of the form "will this program work reasonably well") can also be answered by giving implementation details in non-binding ways.

@GuillaumeGomez
Member

cc @steveklabnik

Thanks for opening it @gnzlbg!

@gnzlbg
Contributor Author

gnzlbg commented Sep 8, 2016

No, for perfectly reasonable programs too, if they are judged to occur rarely enough in the wild (that's one use of crater!) and be easy enough to fix (e.g. by explicitly disambiguating the method call or adding a type annotation).

Worsening the complexity of an algorithm breaks programs silently at run-time without a warning. The programs weren't broken before, the language wasn't broken (no soundness issue), breakage cannot be fixed automatically, and warnings cannot be emitted. I don't see how the current guidelines for allowed breaking changes would allow worsening the complexity of an algorithm even if it wasn't documented without a major version bump.

Quality of Implementation (QoI) issue

I don't think that space and time complexity are QoI issues since without this information I cannot reason about my programs.

Complexity bounds are neither necessary nor sufficient for knowing that. They are a crude approximation. [...] To give an example, there's a function in libcore right now (f64::parse) that takes O(1) stack space, but the constant is over 2 kilobytes. Other functions might stack-allocate even more.

That's why in the first post I used in some of the examples the term exactly, as in "it allocates exactly 2*N memory". Exactly / at most, and similar terms give enough information to reason about methods like f64::parse. I agree in that complexity bounds are only a part of the story, we should offer enough information for users to reason about their programs.

I am fully in favor of documenting existing guarantees better, e.g., copy bounds from the std::collections landing page to the individual methods

Which existing guarantees? The API docs of most methods don't currently guarantee anything (not even time complexity).

I am not in principle opposed to adding more guarantees, but this would have to go through the RFC process, which is a big investment of time and energy for something of IMHO relatively little practical impact

As stated in the first post, every major change to the API of the standard library has to go through the RFC process. Guaranteeing anything that wasn't guaranteed before (like what is being proposed here) changes the API of the standard library and so has to go through the RFC process as well. This issue was just filed to "raise the issue" that currently almost nothing is documented / guaranteed.

I think the practical impact is relatively small because many of the same questions (of the form "will this program work reasonably well") can also be answered by giving implementation details in non-binding ways.

We disagree on what counts as an implementation detail. You think that the time complexity of an algorithm is an implementation detail that should not be guaranteed, and that this somehow allows higher-quality implementations. I think that without these guarantees I cannot reason about the behavior of even the most trivial programs (is calling HashSet::contains in a loop O(N), O(N log N), or O(N^2)?).

It's a bit moot to discuss subtler topics like space complexity or effects without agreeing on this first. Let's wait to see what ideas and thoughts other people have, maybe somebody has an idea about how to achieve consensus here.

@hanna-kruppe
Contributor

Worsening the complexity of an algorithm breaks programs silently at run-time without a warning.

Asymptotic complexity is a very rough heuristic. No asymptotic complexity bound can ever tell you whether a concrete program with a concrete input will exceed a concrete number of time steps or bytes of heap/stack memory. Even if every single method in std had tight bounds for time and space complexity, numerous realistic programs could still "break" (exceed their budget in time or space) easily. Furthermore, guaranteeing more than an asymptotic complexity bound is pretty much impossible.

Therefore, whenever you have a resource limit you need to stay under, complexity bounds can't tell you whether you will manage. At most, they will give you a quick sanity check whether it's likely that it will work out okay. And that's good to have, but it doesn't completely solve your problem. It just tells you whether you can proceed to the actual testing, or scrap the code because it most likely won't pan out. And even for this they can be misleading!

In short: Hard guarantees about complexity won't translate into hard guarantees about what people actually need to know. Thus I do not see the need to provide hard guarantees about complexity.

(Note: Everywhere I say complexity, I don't necessarily mean asymptotic upper bounds, it encompasses everything that hides constants.)

I don't see how the current guidelines for allowed breaking changes would allow worsening the complexity of an algorithm even if it wasn't documented without a major version bump.

On the contrary, the current guidelines don't mention the time and space used by a program at all. Behavioral changes in general are a bit brushed over, but RFC 1105 spends a paragraph on them. In particular, changes to stuff that isn't documented are ruled minor breaking changes (at least; we can always decide that a change would be too disruptive in practice). And that's assuming the time and space complexity fall under "behavioral" at all; I don't have an authoritative source, but my impression is that they generally don't.

Regardless of rules lawyering, the guiding principle is that a Rust program that works fine in 1.X should mostly work fine in 1.(X+N). There are numerous caveats to this, mostly relating to the practicality of preventing such a regression and the positive and negative effects the change will have on the programs that have actually been written so far. We are essentially debating where a change of complexity falls on this scale. Simply asserting that it can make a program stop working is not enough; every change can do that. At this point it would perhaps be more useful to let the philosophical debate rest and go search for practical examples of programs breaking because a standard library changed under their feet. I'd be especially interested in programs that got seriously hurt as a trade-off for an improvement for other programs, since a change that's just bad across the board will be fixed like any other regression.

That's why in the first post I used in some of the examples the term exactly, as in "it allocates exactly 2*N memory". Exactly / at most, and similar terms give enough information to reason about methods like f64::parse. I agree in that complexity bounds are only a part of the story, we should offer enough information for users to reason about their programs.

Most phrases like this contain implicit constant factors. No sane API designer would ever guarantee the exact amount of stack space used, which is super volatile and highly dependent on compiler optimizations and even on the hardware and the ABI! Even in most algorithms which perform a heap allocation for a specific purpose you usually won't want to give exact numbers, for example to allow over-allocation or to tune buffer sizes for efficiency (a buffer for a stable sort might be an exception). But giving exact numbers without hidden constants is exactly what would be needed to be able to reason so precisely that no "breaking change" (in your wide sense of the word) happens.

Which existing guarantees? The API docs of most methods don't currently guarantee anything (not even time complexity).

https://doc.rust-lang.org/std/collections/#performance

@gnzlbg
Contributor Author

gnzlbg commented Sep 8, 2016

Asymptotic complexity is a very rough heuristic. No asymptotic complexity bound can ever tell you whether a concrete program with a concrete input will exceed a concrete number of time steps or bytes of heap/stack memory.

You are the only one who has mentioned counting bytes or "timesteps" (do you mean CPU cycles?). I agree that it would be a bad idea, which is why it isn't being proposed. However, I fail to see how that being a bad idea makes something completely unrelated, like what is being proposed, a bad idea as well.

Hard guarantees about complexity won't translate into hard guarantees about what people actually need to know.

Indeed, this is why the initial post discusses multiple guarantees beyond algorithmic complexity. Hard guarantees about complexity are just that. If they don't tell you anything, that is fine with me. They do tell me a lot.

At this point it would perhaps be more useful to let the philosophical debate rest and go search for practical examples of programs breaking because a standard library changed under their feet.

C is the only low-level language besides Rust that doesn't guarantee the complexity of its operations. In particular, using a different standard library implementation can give you a different algorithmic complexity. In the [qsort-shootout blog post](http://calmerthanyouare.org/2013/05/31/qsort-shootout.html), switching the standard library to dietlibc broke a perfectly fine running program, because dietlibc was allowed by the standard to implement qsort using O(N) stack space, and so it did. They fixed that, but changing it back would break that running program again. In particular, this is actually a logic error in the program, since it wrongly assumes something that is not guaranteed, namely that calling qsort won't grow the stack unboundedly.

Your argument seems to be "I don't want to document these things because they don't tell me much and they prevent better future implementations". The C++ and D standard libraries do guarantee the time complexity and the effects of all of their algorithms, and in most cases their space complexity as well. The C++ standard library guarantees are more than 20 years old. Can you give an example in which guaranteeing a complexity bound for a C++ standard library algorithm prevented a better implementation of the same algorithm? I haven't heard of any, and haven't been able to find any.

No sane API designer would ever guarantee the exact amount of stack space used, which is super volatile and highly dependent on compiler optimizations and even on the hardware and the ABI!

It is trivial to find examples of algorithms whose stack usage is mostly independent of compiler optimizations, hardware, and the ABI (e.g. any non-recursive algorithm with a large enough stack-allocated buffer that dominates the stack size, such as a stable sort using a stack-allocated buffer). For some reason you seem to be arguing against documenting the exact stack usage of all methods. The only thing tangentially related that has been proposed for all methods is documenting how the stack grows, which is something completely different. Still, I think that if for a particular method it makes sense to guarantee an upper bound on stack space (or even exact stack space usage), then that's the way it is. The world isn't black or white.

The main argument you seem to have against guaranteeing any sort of space complexity is again that it would prevent better future implementations. While I have examples of programs crashing because this wasn't guaranteed, I don't have any examples yet of better implementations not being possible because of a conservative stack usage bound. In fact, since you can transform recursion into a loop that does not grow the stack, you generally improve performance by providing tighter bounds on how the stack grows. So if there is a performance argument about space complexity on the stack, it is that making sure we have the right guarantees, and that the implementation conforms to them, actually improves performance, not the other way around.

Even in most algorithms which perform a heap allocation for a specific purpose you usually won't want to give exact numbers,

You still want to at least provide whether the algorithm can allocate at all and how the amount of memory allocated grows as a function of the input (which is what is being proposed). As mentioned above, and as you say, in some cases it might make sense to provide exact numbers, or upper bounds.

https://doc.rust-lang.org/std/collections/#performance

That table covers 5 methods x 5 containers = 25 method complexities; beyond that, some of the methods have some sort of complexity documentation in the API docs, but in an inconsistent form.

Vec alone, however, has 35 methods (vs the 5 documented there), not counting trait implementations and the algorithms that, e.g., work on slices. Searching for something trivial like googling for "rust Vec::push" lands you at the method's documentation, whose two lines don't tell you anything. Just try to find the complexity of any method like new() or even trivial stuff like Vec::len(). That is what I mean when I say most methods aren't documented. I don't want to count them, but extrapolating ~35 methods x 7 containers means 245. I doubt that the complexity of even 50 of the methods is actually stated somewhere, and much less in the API docs, but everybody is welcome to make an extensive survey and prove me wrong.
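
To make the gap concrete, here is a minimal sketch of what a per-method complexity section could look like in rustdoc. The `MyVec` type, the section headings, and the growth-factor wording are placeholders of mine, not a description of the real `Vec` implementation:

```rust
/// A toy growable vector, used only to illustrate the proposed documentation format.
pub struct MyVec<T> {
    buf: Vec<T>, // backing storage; a real implementation would manage raw memory itself
}

impl<T> MyVec<T> {
    /// Appends an element to the back of the collection.
    ///
    /// # Time complexity
    ///
    /// Amortized O(1): exactly one move of `value` when `len() < capacity()`.
    /// Worst case O(n): when the buffer is full, all `n = len()` existing
    /// elements are moved into a larger allocation.
    ///
    /// # Memory complexity
    ///
    /// O(1) stack space. At most one heap allocation of O(k * n) bytes, where
    /// `k` is an implementation-defined growth factor, and only when
    /// `len() == capacity()`.
    pub fn push(&mut self, value: T) {
        self.buf.push(value);
    }
}
```

The exact headings don't matter much; what matters is that the complexity statement sits on the page a search engine actually lands on, instead of only in the collections overview.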

This issue is about consistently documenting the API of each method: deciding which information is potentially relevant for most methods, and which other information might be considered on a method-per-method basis. In particular, it mentions that one thing that needs to be improved is how one arrives at this information, and that a way to do that would be to include it in the API documentation of each method. Tangentially, the general collections documentation should probably be updated with the containers that are not documented there (but that issue is orthogonal to the API docs).

@hanna-kruppe
Contributor

The reason I am focusing on exact number is that those exact numbers are, at the end of the day, what makes or breaks a program. There was no watchdog killing that C program because qsort had the mathematical property of bad asymptotic complexity, it died because it used more than x KB of stack space on the inputs it needed to handle.

Your argument seems to be "I don't want to document these things because they don't tell me much and they prevent better future implementations".

I am, as a matter of principle, opposed to guaranteeing anything unless it's demonstrably a win. That includes the guarantee being useful enough to be worth the hassle, and it being unlikely to cause problems later on. A formal guarantee is a very heavy hammer: it requires a rather heavy process to establish and must be kept in mind in the future. Consequently, I am mostly arguing that hard guarantees of the form that can be given reliably, across implementations, are not that useful for the purposes you mention, and thus not worth the hassle.

Keep in mind that the alternative is not "never tell anyone anything about the implementation and maliciously make it worse once in a blue moon", the alternative is "keep making code perform as well as possible, give users an idea of the algorithms used, but permit ourselves to change them when useful". The program you refer to didn't only break because the C standard gave no guarantees (for that matter, the C++ standards guarantee time complexity but say nothing at all about stack space), it also broke because the qsort implementation was crappy.

Can you give an example in which guaranteeing any complexity bound for a C++ standard library algorithm prevented a better implementation of the same algorithm? I haven't heard about any, and haven't been able to find any.

http://stackoverflow.com/questions/10134836/complexity-of-stdlistsplice-and-other-list-containers

std::deque is apparently a linked list of blocks rather than a ring buffer in part because push_back and friends must be constant time rather than amortized constant. People argue that the linked-list-of-blocks is actually amortized constant time too and this is permissible (so a ring buffer would be allowed too) but implementations don't seem to do this.

Post-C++11 rules out randomized sorting algorithms with average-case O(N log N) time but a worse worst-case, though I am not quite willing to say that one of those would be "better" (even though it probably would be faster in some use cases).

In fact, since you can transform recursion into a loop that does not grow the stack, you generally improve performance by providing tighter bounds on how the stack grows.

Not giving a guarantee doesn't preclude anyone from optimizing the code. Conversely, if the current implementation doesn't conform to a proposed bound and nobody steps up to fix that, the proposal fails. And all that is assuming the argument were true, which it isn't. Except for tail-recursive algorithms, converting recursion to a loop usually requires moving the stuff that used to be on the stack to the heap. That's rarely faster, since it's basically the same work plus the work of allocating and deallocating.

The only thing tangentially related that has been proposed for all methods is documenting how the stack grows, which is something completely different

But that, as I said before, is only a crude stand-in for "will this actually overflow". Additionally, I am opposed to documenting this growth behavior for most methods. Maybe for some selected methods, but not for enough that you can really look at an entire big program and say "yup this will never have unbounded stack use anywhere". Of course, wild recursion is rather fragile and should be avoided when reasonably possible, but again, that's a QoI issue for me.

@gnzlbg
Contributor Author

gnzlbg commented Sep 8, 2016

http://stackoverflow.com/questions/10134836/complexity-of-stdlistsplice-and-other-list-containers

The choice of making size constant or linear results in two different containers; one is not better than the other. If you find a way to make splice constant time without changing the complexity of size, you can go on and implement it; the standard just requires worst case linear time, which constant time satisfies. If you want constant time splice you need a different container (one that, among other things, has different memory requirements).

std::deque is apparently a linked list of blocks rather than a ring buffer in part because push_back and friends must be constant time rather than amortized constant

The complexity of deque push_back is amortized constant time. The reason it is implemented as a linked list of blocks is that it guarantees that on insertion and removal, iterators to the elements are never invalidated. Specifying whether iterators are invalidated or not in C++ is required for correctness. The same "type" of container with or without iterator invalidation is going to be implemented completely differently, and have completely different performance and correctness guarantees. In this case, again, the tradeoff produces essentially different containers. The standard containers aim to be safe, and in general try not to invalidate iterators. People have reimplemented most of them without this guarantee, but even if their performance might be better in most cases, it is still not better in all cases (since they are implemented using completely different data structures).

Post-C++11 rules out randomized sorting algorithms with average-case O(N log N) time but a worse worst-case, though I am not quite willing to say that one of those would be "better" (even though it probably would be faster in some use cases).

Worst case complexity is what makes programs explode, services go down, and introduces security vulnerabilities. This also rules out other algorithms like quicksort, which has worst case O(N^2). And this is a good thing, since std::algorithms should prevent your program from exploding.

Except for tail-recursive algorithms, converting recursion to a loop usually requires moving the stuff that used to be on the stack to the heap.

Can you give an example? In my experience this isn't the case (e.g. iterative quicksort doesn't need any heap allocations nor recursion).

Maybe for some selected methods, but not for enough that you can really look at an entire big program and say "yup this will never have unbounded stack use anywhere".

In a wild guess, I would say that currently, for almost all methods in the standard library, the space complexity is O(1). It is trivial to document, trivial to guarantee, and in general faster (without an example of a non-tail-recursive algorithm in std for which this is not possible, I stick to what I know).

@hanna-kruppe
Contributor

hanna-kruppe commented Sep 8, 2016

The choice of making size constant or linear results in two different containers, one is not better than the other.

They are two different containers alright, but that doesn't mean one can't be better. The standard took list implementations that were useful and efficient for some use cases, and ruled them out. Conversely, if we found out that constant time splice was desirable for linked lists, a previous guarantee for constant time len would rule that out. And why wouldn't we guarantee that? It's so obvious! Unless you've seen the debate around C++11, I guess.

The reason it is implemented as a linked list of blocks is because it guarantees that on insertion and removal iterators to the elements are never invalidated.

Oh, right. Sorry.

Worst case complexity is what makes programs explode, services go down, and introduces security vulnerabilities.

I take it you're not a fan of hash maps, then?

As I said, I wouldn't argue that they are necessarily better, but a (sufficiently rare) worst case doesn't need to rule out algorithms in all circumstances.

Can you give an example? In my experience this isn't the case (e.g. iterative quicksort doesn't need any heap allocations nor recursion).

Iterative quicksort needs a stack. You can bound the size of that stack by O(log N) by always deferring the larger half to the stack and processing the smaller one right away (the analogous trick, recursing into the smaller half and tail-calling on the larger, works in recursive implementations that rely on tail call optimization), but it's still large, non-constant-sized scratch space. A sketch of that discipline follows.
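
To be concrete, here is an illustrative sketch of that stack discipline (my own code, not anything from std; the Lomuto partition and all names are arbitrary):

```rust
// Lomuto partitioning with the last element as pivot, chosen purely for
// brevity; the interesting part is the stack discipline below, not this.
fn partition<T: Ord>(v: &mut [T]) -> usize {
    let pivot = v.len() - 1;
    let mut store = 0;
    for i in 0..pivot {
        if v[i] <= v[pivot] {
            v.swap(i, store);
            store += 1;
        }
    }
    v.swap(store, pivot);
    store
}

/// Sorts `v` in place with no recursion. The explicit stack never holds more
/// than about log2(n) ranges, because the larger half of every partition is
/// deferred and the smaller half is processed immediately.
fn quicksort_iterative<T: Ord>(v: &mut [T]) {
    if v.len() < 2 {
        return;
    }
    // Half-open ranges still waiting to be sorted. Given the log2(n) bound,
    // a fixed-size array of usize::BITS entries would do just as well as a Vec.
    let mut stack: Vec<(usize, usize)> = vec![(0, v.len())];

    while let Some((mut lo, mut hi)) = stack.pop() {
        while hi - lo > 1 {
            let p = lo + partition(&mut v[lo..hi]);
            // Subranges [lo, p) and [p + 1, hi): defer the larger one.
            if p - lo < hi - (p + 1) {
                stack.push((p + 1, hi)); // larger half, pushed for later
                hi = p;                  // keep working on the smaller half
            } else {
                stack.push((lo, p));     // larger (or equal) half, pushed for later
                lo = p + 1;
            }
        }
    }
}

fn main() {
    let mut data = vec![5, 3, 8, 1, 9, 2, 7];
    quicksort_iterative(&mut data);
    assert_eq!(data, vec![1, 2, 3, 5, 7, 8, 9]);
}
```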

Other examples: Tree traversals without sibling/parent (depending on the traversal mode) links, depth-first/breadth-first search in graphs, any backtracking algorithm that needs to remember something non-trivial to restore a prior state.

In a wild guess, I would say that currently, for almost all methods in the standard library, the space complexity is O(1). It is trivial to document, trivial to guarantee (without an example of a non-tail recursive algorithm of the std::lib for which this is not possible I stick to what I know).

(I assume you mean stack space.) You'd be wrong. Try the BTrees in std::collections for starters. I don't have other examples ready off-hand but that may be because std has relatively little in the way of textbook algorithms.

@mark-i-m
Member

mark-i-m commented Sep 8, 2016

@gnzlbg I still disagree on documenting implementation details, but not as strongly as I used to... if these need to be documented, I would like them to be placed somewhere other than the standard docs (a separate reference, perhaps).

@rkruppe

Likewise, an algorithm that blows the stack on reasonable inputs really really ought to be fixed, but this is not a question of standards and guarantees, just an QoI issue.

One could argue that memory safety is a QoI issue, yet Rust goes to great lengths to guarantee it... The fact is that giving guarantees about safety and liveness is actually useful, though I agree that there is a limit to their usefulness. I think that the usefulness is worth it.

Furthermore, I do not think adding hard guarantees adds too much value over just documenting implementation details in a non-binding manner

This is possibly an ideology difference. I would like the APIs I use to give me contracts.

For long-term evolution, non-guarantees are not much worse
...
If the implementation changes in significant ways, then differences begin to appear:

  • If an RFC was accepted and guarantees were set in stone, the change can't happen. This would be sad because presumably the changes would be for the better, else they wouldn't be done.
  • If no RFC was accepted, well, there was no guarantee to begin with while implementation details could still be there and useful (see above).

Such a change would still be a breaking change even if it is not labeled as one... This is a collections library; if you don't make certain guarantees explicit, people will very naturally assume them. Fundamentally, I am still not convinced that giving non-binding details is the direction I would want for a standard library.

std has relatively little in the way of textbook algorithms.

Really? Vec, LinkedList, Deque, HashMap, HashSet, BTree, sorts... What do you mean by that statement? There are a lot of things in the standard library that people could make assumptions about if we don't tell them explicitly.


Overall, I agree that this has kind of devolved into a philosophical debate, though...

@gnzlbg
Contributor Author

gnzlbg commented Sep 9, 2016

@rkruppe

The standard took list implementations that were useful and efficient for some use cases, and ruled them out.

The complexity wasn't guaranteed, so depending on the standard library implementation users were getting completely different containers. If you did not care, you did not care, but if you did care, this made the standard container useless, since whether it was good for your use case or not would change across platforms. The consensus was that consistent complexity across implementations was necessary, and that independently of which container was chosen as the default, users could always implement the other one. This was the right call. The O(1) size implementation was chosen as the standard one because all other standard containers have O(1) size. Whether that was the right call or not is debatable, but even for those that wanted O(1) splice, this was a win, since now they would know that the only way to get it was to use something other than std::list.
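
To illustrate the tradeoff with Rust's own LinkedList (the `splice_range` method mentioned below is hypothetical, it does not exist in std): splicing a whole list can stay O(1) while len stays O(1), because both cached lengths are known; it is splicing an arbitrary sub-range that forces a choice.

```rust
use std::collections::LinkedList;

fn main() {
    let mut a: LinkedList<i32> = (1..=3).collect();
    let mut b: LinkedList<i32> = (4..=6).collect();

    // O(1): relinking touches only the boundary nodes, and the combined length
    // is just the sum of the two cached counters.
    a.append(&mut b);
    assert_eq!(a.len(), 6); // O(1): reads the cached counter
    assert!(b.is_empty());

    // A hypothetical `splice_range(&mut self, other: &mut Self, from: usize, to: usize)`
    // that moves only part of `other` could not keep both properties: it would
    // either have to walk the moved nodes to fix up both cached lengths
    // (O(k) splice), or give up the cached length entirely (O(n) len).
}
```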

As I said, I wouldn't argue that they are necessarily better, but a (sufficiently rare) worst case doesn't need to rule out algorithms in all circumstances.

As I said, the art is in choosing the right bounds. You can choose worst case complexity O(N log N), or O(N log^2 N). The second is slightly worse, but might allow better algorithms to be implemented. Choosing the right bounds is not easy, but other standard libraries have done a lot of work for us already. Rust can learn from their experience and do better.

Iterative quicksort needs a stack.

Of course, but there is a huge performance difference between using a stack and heap-allocated memory (in particular for shorter sequences). The initial argument was that heap allocations were necessary (and with them a larger part of the Rust runtime, and syscalls), and that their cost would outweigh the benefits. My point was only that they are not necessary.

(the same trick works in recursive implementations that make use of tail call optimization)

Sure, and when Rust gets guaranteed tail call optimization, O(1) space complexity bounds will be able to be upheld even by recursive implementations. Still, even without guaranteed tail call optimization, doing so is possible in Rust today by writing the loop by hand.
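
A tiny illustration of what I mean (my own example, not code from std): the tail-recursive form may or may not be turned into a loop by the optimizer, while the hand-written loop gives the O(1) stack bound by construction.

```rust
// Tail-recursive formulation: without guaranteed tail call optimization the
// compiler *may* turn this into a loop, but nothing promises O(1) stack space.
fn gcd_rec(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd_rec(b, a % b) }
}

// The same function rewritten by hand as a loop: the O(1) stack bound now
// holds by construction, independent of optimization level.
fn gcd_iter(mut a: u64, mut b: u64) -> u64 {
    while b != 0 {
        let r = a % b;
        a = b;
        b = r;
    }
    a
}

fn main() {
    assert_eq!(gcd_rec(48, 36), gcd_iter(48, 36));
    assert_eq!(gcd_iter(48, 36), 12);
}
```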

Other examples: Tree traversals without sibling/parent (depending on the traversal mode) links, depth-first/breadth-first search in graphs, any backtracking algorithm that needs to remember something non-trivial to restore a prior state.

I've worked a lot on binary trees, quadtrees and octrees over the past two years, and while it is trivial to implement tree traversals using recursion, it is not that hard to do so iteratively (with a loop, without extra stack storage as for quicksort). I've also read that another way to implement efficient traversals is using state machines, but for my use cases plain old loops were fine.

Anyhow, even if it is slightly more complicated to implement these iteratively, none of these data structures are in the standard library (the trees in the standard library are easy). So I would say that we should worry about which guarantees we provide for these when it comes time to move any of them into std::collections (which will probably be never).

Try the BTrees in std::collections for starters. I don't have other examples ready off-hand but that may be because std has relatively little in the way of textbook algorithms.

Even in BTrees, the majority of the methods don't do anything fancy (len, empty, new, iter...).


@mark-i-m

Overall, I agree that this has kind of devolved into a philosophical debate, though...

Yes. I think that it would be better to put in the work to, e.g., try to come up with a strawman documentation of a particular collection like Vec. That would give us more concrete examples. Maybe coming up with bounds that make sense for everybody turns out to be easier than @rkruppe thinks, but maybe he is right and it becomes very controversial and harder than I think it is.

@steveklabnik I would like to put in the work for this (e.g. come up with a strawman for documenting complexity and other guarantees with Vec as the use case). I would like to do it as a PR where we can easily discuss the specifics (the comments per line are just the right way to keep different discussions alive), and I would like to do so before writing an RFC (since I think having such an example for the RFC would be required anyways). Obviously the PR should not be merged until the RFC is accepted. Do you think this will be a good way to proceed?

bors added a commit that referenced this issue Dec 9, 2016
Implement a faster sort algorithm

Hi everyone, this is my first PR.

I've made some changes to the standard sort algorithm, starting out with a few tweaks here and there, but in the end this endeavour became a complete rewrite of it.

#### Summary

Changes:

* Improved performance, especially on partially sorted inputs.
* Performs less comparisons on both random and partially sorted inputs.
* Decreased the size of temporary memory: the new sort allocates 4x less.

Benchmark:

```
 name                                        out1 ns/iter          out2 ns/iter          diff ns/iter   diff %
 slice::bench::sort_large_ascending          85,323 (937 MB/s)     8,970 (8918 MB/s)          -76,353  -89.49%
 slice::bench::sort_large_big_ascending      2,135,297 (599 MB/s)  355,955 (3595 MB/s)     -1,779,342  -83.33%
 slice::bench::sort_large_big_descending     2,266,402 (564 MB/s)  416,479 (3073 MB/s)     -1,849,923  -81.62%
 slice::bench::sort_large_big_random         3,053,031 (419 MB/s)  1,921,389 (666 MB/s)    -1,131,642  -37.07%
 slice::bench::sort_large_descending         313,181 (255 MB/s)    14,725 (5432 MB/s)        -298,456  -95.30%
 slice::bench::sort_large_mostly_ascending   287,706 (278 MB/s)    243,204 (328 MB/s)         -44,502  -15.47%
 slice::bench::sort_large_mostly_descending  415,078 (192 MB/s)    271,028 (295 MB/s)        -144,050  -34.70%
 slice::bench::sort_large_random             545,872 (146 MB/s)    521,559 (153 MB/s)         -24,313   -4.45%
 slice::bench::sort_large_random_expensive   30,321,770 (2 MB/s)   23,533,735 (3 MB/s)     -6,788,035  -22.39%
 slice::bench::sort_medium_ascending         616 (1298 MB/s)       155 (5161 MB/s)               -461  -74.84%
 slice::bench::sort_medium_descending        1,952 (409 MB/s)      202 (3960 MB/s)             -1,750  -89.65%
 slice::bench::sort_medium_random            3,646 (219 MB/s)      3,421 (233 MB/s)              -225   -6.17%
 slice::bench::sort_small_ascending          39 (2051 MB/s)        34 (2352 MB/s)                  -5  -12.82%
 slice::bench::sort_small_big_ascending      96 (13333 MB/s)       96 (13333 MB/s)                  0    0.00%
 slice::bench::sort_small_big_descending     248 (5161 MB/s)       243 (5267 MB/s)                 -5   -2.02%
 slice::bench::sort_small_big_random         501 (2554 MB/s)       490 (2612 MB/s)                -11   -2.20%
 slice::bench::sort_small_descending         95 (842 MB/s)         63 (1269 MB/s)                 -32  -33.68%
 slice::bench::sort_small_random             372 (215 MB/s)        354 (225 MB/s)                 -18   -4.84%
```

#### Background

First, let me just do a quick brain dump to discuss what I learned along the way.

The official documentation says that the standard sort in Rust is a stable sort. This constraint is thus set in stone and immediately rules out many popular sorting algorithms. Essentially, the only algorithms we might even take into consideration are:

1. [Merge sort](https://en.wikipedia.org/wiki/Merge_sort)
2. [Block sort](https://en.wikipedia.org/wiki/Block_sort) (famous implementations are [WikiSort](https://github.com/BonzaiThePenguin/WikiSort) and [GrailSort](https://github.com/Mrrl/GrailSort))
3. [TimSort](https://en.wikipedia.org/wiki/Timsort)

Actually, all of those are just merge sort flavors. :) The current standard sort in Rust is a simple iterative merge sort. It has three problems. First, it's slow on partially sorted inputs (even though #29675 helped quite a bit). Second, it always makes around `log(n)` iterations copying the entire array between buffers, no matter what. Third, it allocates huge amounts of temporary memory (a buffer of size `2*n`, where `n` is the size of the input).

The problem of auxiliary memory allocation is a tough one. Ideally, it would be best for our sort to allocate `O(1)` additional memory. This is what block sort (and its variants) does. However, it's often very complicated (look at [this](https://github.com/BonzaiThePenguin/WikiSort/blob/master/WikiSort.cpp)) and even then performs rather poorly. The author of WikiSort claims good performance, but that must be taken with a grain of salt. It performs well in comparison to `std::stable_sort` in C++. It can even beat `std::sort` on partially sorted inputs, but on random inputs it's always far worse. My rule of thumb is: high performance, low memory overhead, stability - choose two.

TimSort is another option. It allocates a buffer of size `n/2`, which is not great, but acceptable. It performs extremely well on partially sorted inputs. However, it seems pretty much all implementations suck on random inputs. I benchmarked implementations in [Rust](https://github.com/notriddle/rust-timsort), [C++](https://github.com/gfx/cpp-TimSort), and [D](https://github.com/dlang/phobos/blob/fd518eb310a9494cccf28c54892542b052c49669/std/algorithm/sorting.d#L2062). The results were a bit disappointing. It seems bad performance is due to complex galloping procedures in hot loops. Galloping noticeably improves performance on partially sorted inputs, but worsens it on random ones.

#### The new algorithm

Choosing the best algorithm is not easy. Plain merge sort is bad on partially sorted inputs. TimSort is bad on random inputs and block sort is even worse. However, if we take the main ideas from TimSort (intelligent merging strategy of sorted runs) and drop galloping, then we'll have great performance on random inputs and it won't be bad on partially sorted inputs either.

That is exactly what this new algorithm does. I can't call it TimSort, since it steals just a few of its ideas. Complete TimSort would be a much more complex and elaborate implementation. If we figure out in the future how to incorporate more of its ideas into this implementation without crippling performance on random inputs, it's going to be very easy to extend. I also made several other minor improvements, like reworking insertion sort to make it faster.
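
To make "intelligent merging strategy of sorted runs" a bit more concrete, here is a heavily simplified, illustrative sketch of just the run-detection step (not the actual code in this PR); a real implementation would then merge adjacent runs according to a set of stack invariants rather than merging fixed-size chunks.

```rust
/// Splits `v` into maximal non-descending runs, reversing strictly descending
/// runs in place, and returns the run boundaries as half-open ranges.
fn collect_runs<T: Ord>(v: &mut [T]) -> Vec<(usize, usize)> {
    let mut runs = Vec::new();
    let mut start = 0;
    while start < v.len() {
        let mut end = start + 1;
        if end < v.len() && v[end] < v[start] {
            // Strictly descending run: extend it, then reverse it so it
            // becomes ascending (strictness keeps the overall sort stable).
            while end < v.len() && v[end] < v[end - 1] {
                end += 1;
            }
            v[start..end].reverse();
        } else {
            // Non-descending run: extend while elements don't decrease.
            while end < v.len() && v[end] >= v[end - 1] {
                end += 1;
            }
        }
        runs.push((start, end));
        start = end;
    }
    runs
}

fn main() {
    let mut data = vec![3, 4, 5, 9, 2, 1, 0, 7, 8];
    let runs = collect_runs(&mut data);
    // [3, 4, 5, 9] is one run; [2, 1, 0] is reversed to [0, 1, 2]; [7, 8] is the last.
    assert_eq!(runs, vec![(0, 4), (4, 7), (7, 9)]);
    assert_eq!(data, vec![3, 4, 5, 9, 0, 1, 2, 7, 8]);
}
```

Detecting runs like this is what makes partially sorted inputs cheap: an already-ascending array is a single run and no merge work happens at all.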

There are also new, more thorough benchmarks and panic safety tests.

The final code is not terribly complex and has less unsafe code than I anticipated, but there's still plenty of it that should be carefully reviewed. I did my best at documenting non-obvious code.

I'd like to notify several people of this PR, since they might be interested and have useful insights:

1. @huonw because he wrote the [original merge sort](#11064).
2. @alexcrichton because he was involved in multiple discussions of it.
3. @veddan because he wrote [introsort](https://github.com/veddan/rust-introsort) in Rust.
4. @notriddle because he wrote [TimSort](https://github.com/notriddle/rust-timsort) in Rust.
5. @bluss because he had an attempt at writing WikiSort in Rust.
6. @gnzlbg, @rkruppe, and @mark-i-m because they were involved in discussion #36318.

**P.S.** [quickersort](https://github.com/notriddle/quickersort) describes itself as being universally [faster](https://github.com/notriddle/quickersort/blob/master/perf.txt) than the standard sort, which is true. However, if this PR gets merged, things might [change](https://gist.github.com/stjepang/b9f0c3eaa0e1f1280b61b963dae19a30) a bit. ;)
@steveklabnik
Member

@gnzlbg : coming back to this thread after a long time, a bit tough to catch up.

@steveklabnik I would like to put in the work for this (e.g. come up with a strawman for documenting complexity and other guarantees with Vec as the use case). I would like to do it as a PR where we can easily discuss the specifics (the comments per line are just the right way to keep different discussions alive), and I would like to do so before writing an RFC (since I think having such an example for the RFC would be required anyways). Obviously the PR should not be merged until the RFC is accepted. Do you think this will be a good way to proceed?

So, last time I chatted with @rust-lang/libs about something like this, they basically said "we'll take it case by case". So, I think the right move here would be to make one PR per guarantee, and we get libs to sign off on it. No RFC needed.

Thoughts, libs team?

@mark-i-m
Member

I'm guessing the motivation is simplifying the effort? or is there another reason?

@gnzlbg
Contributor Author

gnzlbg commented Mar 16, 2017

So, I think the right move here would be to make one PR per guarantee, and we get libs to sign off on it. No RFC needed.

That's reasonable, I can send pull requests for the individual guarantees. The only thing to be careful about is the order of these pull requests, since lots of guarantees build on top of each other.

@steveklabnik
Member

I'm going to give this one a close, because it's not particularly actionable. If anyone wants to improve collections docs, please send in PRs for the improvements you want to see!
