Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metabug: Libcollections TODO #18009

Closed
20 of 30 tasks
Gankra opened this issue Oct 13, 2014 · 28 comments
Closed
20 of 30 tasks

Metabug: Libcollections TODO #18009

Gankra opened this issue Oct 13, 2014 · 28 comments
Labels
metabug Issues about issues themselves ("bugs about bugs")

Comments

@Gankra
Copy link
Contributor

Gankra commented Oct 13, 2014

As part of the stabilization process, we'd like our standard collection implementations to be good. Maybe even great. Wouldn't that be nice? The overall quality of each implementation is pretty varied, partially just because the language has dramatically shifted out from underneath them. Vec and HashMap are in a good place because they're The Big Dogs of data structuring, and have gotten substantial attention as a result. Other less-used collections are in worse shape, and need some serious attention.

If you're interested in working on a Rust project, or collections in particular, here's a big list of things to do experiment with on libcollections.

  • DList's internal "safe" API is totally busted. It's not safe at all. Possibly simply because RawLink is Copyable. This allows a reference to a RawLink to be converted into a value, allowing mutation of an immutable DList.
  • Investigate implementing a proper bidirectional cursor on DList that can insert/remove values. It's borderline useless without this.
  • Full code review of BitV and BitVSet. They've had a lot of bugs because the code base has seen a lot of churn. Some legacy references to the bad-old-day of BitV being an enum still exist (such as a benchmark that just tests how fast bitshifting a uint is).
  • Bitv uses architecture-sized uints for backing storage #16736 BitV's internal representation is a Vec of uints. It should almost certainly be a Vec of u32's, or some other non-machine-sized type.
  • RingBuf should not be a Vec of Options.
  • Investigate whether accepting total unsafety in TreeMap and using RawPtrs everywhere is worth it. May be necessary in a future full of scoped allocators.
  • Replace the bulky triple of Vecs used in BTreeMap's Node struct with a single manually allocated
    buffer, in the same vein as HashMap's RawTable.
  • Investigate a richer default B-selection algorithm for BTreeMap's default constructor than "6"
  • Investigate alternative node-search schemes over trivial linear search for BTreeMap. Binary search? skip-k-at-a-time linear search? Dynamic (or static!) selection of these algorithms based on B/K/V?
  • Seriously optimize BTreeMap at all. At all.
  • Investigate the possibility of replacing the heap-allocated Vec<*mut Node<T>>-based search-stacks of BTreeMap, TreeMap, and TrieMap with a stack-allocated [*mut Node<T>, ..uint::BITS] (TreeMap might need more, it's an AA-tree, dunno how tall they can get). Note that a preliminary experiment on BTree gave very poor results, though it has the least to gain from such a design from it's current design (tracks depth, can use with_capacity.
  • Investigate the possibility of reusing the SearchStack concept used in BTreeMap to more safely wrap the rawptr-based search stacks of TreeMap and TrieMap.
  • Investigate the possibility of making the iterators for TreeMap and TrieMap DoubleEnded, rather than having separate forward/backward iterators.
  • At very least, get rid of the old-style internal backwards iterator on TrieMap.
  • Investigate the possibility of cutting out a lot of the macro-madness from TreeMap and TrieMap (BTreeMap has no macros and little-to-no code duplication, just hardcore generics). Possibly using the Deref/DerefMut trick in HashMap.
  • Investigate completely replacing the unsafe rawptr search-stacks in the iterators of TreeMap and TrieMap with safe stacks of iterators over the nodes themselves, like BTreeMap does.
  • Investigate arbitrary sub-range iterators for sorted sets/maps.
  • Investigate functionality augmentations to a lot of collections iterators. For instance DList's iterator provides peek_next and remove_next methods for soundly removing elements from the list during iteration. At least Vec, RingBuf, and HashMap could all support this soundly, I believe. Vec, RingBuf, and DList could also support insert_next. Special iterator-like objects dedicated to these tasks might be better ideas, though (cursors, zippers, etc).
  • Implement "entry" API for maps #17320 Implement the new Entry API on TreeMap and TrieMap.
  • Investigate different hashing architectures and mechanisms for hash algorithm selection than the current one. For instance, our current stream-based architecture is poorly designed for small keys (e.g. u64's) which can and should be hashed in a single step.
  • Add a min heap option to PriorityQueue #15947 Investigate making ordering-based collections comparator based rather than directly Ord-based using unboxed closures. Ord types should be supported through some kind of "natural comparator" that is reversible (for supporting reverse orderings). PriorityQueue in particular would like this to simplify min-vs-max-heap.
  • Investigate optimizing PriorityQueue.extend with a modified heapify operation.
  • Consider more slice utility methods. slice.rotate_left/right? slice.shift_and_replace_left/right?
  • Consider providing split(&mut self, at: Index) -> Self on some collections.
  • Investigate tricks used in other languages to optimize for common/special access patterns.
  • Tests! Code coverage!
  • Docs! Examples!
  • Detailed discussions of what individual collections are good/bad for.
  • Performance comparison tables (probably just assymptoticss
  • There was only one other checked box and it looked lonely so I made it a friend. :D
@aturon
Copy link
Member

aturon commented Oct 13, 2014

cc me

@Gankra
Copy link
Contributor Author

Gankra commented Oct 13, 2014

I've added this mostly verbatim from the reddit thread I originally posted this to after prompting from several developers. Some things were added. This is still pretty heavily biased towards things I've poked at. Maybe TrieSet is a disaster. Maybe LruCache is a nightmare. No clue.

@aturon
Copy link
Member

aturon commented Oct 13, 2014

I'd suggest using this issue to coordinate work on some of these subitems. When a particular item needs additional discussion, you should spin off and link to an actual subissue for it.

Also, just a note that most of the TODO items here are not strictly needed for stabilization, which is about API commitments; a lot of these are important suggestions for implementation and documentation improvements.

@Gankra
Copy link
Contributor Author

Gankra commented Oct 13, 2014

Some of these actually have issues. Working on collecting those up now.

@gamazeps
Copy link
Contributor

Looking into 'BitV's internal representation is a Vec of uints. It should almost certainly be a Vec of u32's, or some other non-machine-sized type.'

Seems like you already fixed it 4 days ago @gankro ;) (it simply wasn't merged yet)

@Gankra
Copy link
Contributor Author

Gankra commented Oct 13, 2014

@gamazeps Huh? BitV still uses uints.

I did start up a branch expecting it to be a quick CTRL+H job, but got too distracted in the bigger "needs a full review" issue as I went over all the changes to make sure I wasn't messing something up. In particular, reading over the code, it's really hard to tell locally what should be a uint, and what should be a u32 (is that value a word, or an index?). I've been pretty scatter-brained/sick lately, and don't have enough time/energy to ensure I'm not screwing the whole API up. I've been sticking to writing/design stuff for the past-while since it's easier for me to squeeze in short bursts when I'm not doing school stuff. Also fever-dream prose is just hard to read, not a bug.

If you want to pick up and finish the work on my branch, feel free!

@gamazeps
Copy link
Contributor

Ok ok ^^
I thought since you started it you already finished it ^^

So I'll do that (but I'm not experimented enough to do the code review of the whole BitV for now).

@Gankra
Copy link
Contributor Author

Gankra commented Oct 13, 2014

Yeah, that can wait. @apoelstra has volunteered to help review any BitV PRs, but is also a bit swamped.

@reem
Copy link
Contributor

reem commented Oct 13, 2014

@gankro Have you thought a lot about how to fix DList? I'd be interested in iterating on it.

@Gankra
Copy link
Contributor Author

Gankra commented Oct 13, 2014

@reem I actually started up some work on this 2 months ago when I didn't understand Rust as well. It's mixed in with my Cursor work on this branch: https://github.com/Gankro/rust/compare/dlist-seek

The first commit is the refactor work, though it also features more sweeping reform in anticipation of introducing cursors: Gankra@4e259d2

At the time, I thought it was hopeless, but I realize now that it should mostly "just work" if you manually mark the RawLink's NoCopy. Skimming over it again now, I also came to the conclusion that the fact that we make frequently make invalid RawLinks transiently, so that makes dereffing them unsafe. That... might be valid?

This kind of goes hand-in-hand with some unsafe code conventions I've been mulling over in my mind. Specifically, is internal code reasonable to mark safe if assumes pre-conditions? That is, if it assumes it's only called when the structure is in a "valid" state, and not a transient "invalid" one?

@reem
Copy link
Contributor

reem commented Oct 13, 2014

Ideally, I think that unsafe code should only marked safe if it works in any case, i.e. there should never be a way to cause undefined behavior in safe code, even if you used unsafe code earlier.

I think that we should discuss Zippers and Cursors separately, and should probably make an RFC to introduce them. I'd be happy to chat on IRC or here about what they should like.

@Gankra
Copy link
Contributor Author

Gankra commented Oct 13, 2014

Yeah, then we need to waste cycles ensuring links are always valid, or we need to mark way more of DList straight-up unsafe.

@reem
Copy link
Contributor

reem commented Oct 13, 2014

I think that (ideally) no operations on DList should let you leave it in an unsafe state.

@Gankra
Copy link
Contributor Author

Gankra commented Oct 13, 2014

All of public operations are legit, it's jut trivial to make/deref a dangling pointer using the current internal API. Presumably Node or RawLink could be beefed up with more logic so that this doesn't happen. e.g. put all the unsafe link-mutating code on Node with a safe interface for DList to safely use. Node.insert_left(Node), Node.insert_right(Node), Node.cut_out(), Node.next() Node.prev(), etc.

Actually I think I mostly did some of this with the insert_before_node insert_after_node etc code I added to DList in the linked commit. (It's been a long time)

@reem
Copy link
Contributor

reem commented Oct 13, 2014

Hmmm. For internal code I think the distinction is less clear. I'm attempting a refactor now, and my goal is to move as much unsafe as possible into safe abstractions, but we'll see how it goes.

I'll take a look at your refactor.

We should spin this off into a subissue.

Danger: fn resolve<'a>(&mut self) -> Option<&'a mut T>...

One of the main things I'm doing is introducing the invariant to RawLink that the internal pointer is always valid, then keeping it behind an Option. I think this makes things a bit simpler.

@gamazeps
Copy link
Contributor

If you are going to delve deeper into it on IRC, I'm interested ;)

@Gankra
Copy link
Contributor Author

Gankra commented Oct 13, 2014

Remember, I wrote it at the start of August, and I first started learning Rust in July 😅

(I see you, old transmute hack I loved)

@gamazeps
Copy link
Contributor

The u32 for the BitV seems to be fine (running benchs to measure).

Looking into "RingBuf should not be a Vec of Options." now :)

@gereeter
Copy link
Contributor

I've got a slimmer BTree Node almost working and have started on a faster PriorityQueue.extend().

@Gankra
Copy link
Contributor Author

Gankra commented Oct 14, 2014

You're all amazing ❤️

@reem
Copy link
Contributor

reem commented Oct 15, 2014

@gankro I'd love to take a crack at discussing what certain data structures are best for. (I'm also still looking at refactoring DList).

Where do you see that going/what format do you think would be best for it?

@Gankra
Copy link
Contributor Author

Gankra commented Oct 15, 2014

@reem You can check out my BTree docs: http://doc.rust-lang.org/std/collections/struct.BTreeMap.html

and the high-levels docs I added to libcollections: http://doc.rust-lang.org/std/collections/index.html

for the sort of things I had in mind. That said, I haven't thought about it too hard, so if there's something you think should or shouldn't be done that isn't or is, feel free to pitch in ideas. I'm not a documentation expert.

gamazeps added a commit to gamazeps/rust that referenced this issue Oct 16, 2014
bors added a commit that referenced this issue Oct 17, 2014
I was going to write some doc in order to remove the #[allow(missing_doc)] but there was actually none missing.
I also removed a warning i didn't see in my last commit  #18018
Linked to #18009
@Gankra
Copy link
Contributor Author

Gankra commented Oct 29, 2014

CC #18424 collections reform has passed! Lots of work to do in there too!

@ghost ghost mentioned this issue Nov 7, 2014
bors added a commit that referenced this issue Nov 10, 2014
…=bstrie

I've implemented the new collection views API for TrieMap. I more or less followed the approach set out by @gankro in BTreeMap, by using a `SearchStack`. There's quite a bit of unsafe code, but I've wrapped it safely where I think is appropriate. I've added tests to ensure everything works, and performance seems quite good.

```
test trie::bench_map::bench_find                           ... bench:     67879 ns/iter (+/- 4192)
test trie::bench_map::bench_find_entry                     ... bench:    186814 ns/iter (+/- 18748)
test trie::bench_map::bench_insert_large                   ... bench:    716612 ns/iter (+/- 160121)
test trie::bench_map::bench_insert_large_entry             ... bench:    851219 ns/iter (+/- 20331)
test trie::bench_map::bench_remove                         ... bench:    838856 ns/iter (+/- 27998)
test trie::bench_map::bench_remove_entry                   ... bench:    981711 ns/iter (+/- 53046)
```

Using an entry is slow compared to a plain find, but is only ~15% slower for inserts and removes, which is where this API is most useful. I'm tempted to remove the standalone `remove` function in favour of an entry-based approach (to cut down on complexity).

I've added some more comments to the general part of the code-base, which will hopefully help the next person looking over this. I moved the three key structures to the top of the file so that the nesting structure is clearly visible, and renamed `Child<T>` to `TrieNode<T>` and `TrieNode<T>` to `InternalNode<T>` to improve clarity. If these changes are creeping, I'm happy to revert them.

Let me know if my use of `fail!` is ok, I was a little unsure of how specific to be. Some of the data-structures have various invariants that shouldn't be broken, so using `fail!` seemed appropriate.

## Still to do

* Modernise iterators (make them double-ended).
* Make the keys generic, or rename this data-structure (see: #14902).
* Possibly move this code out of libcollections. [Searching Github for TrieMap turns up very few real results.][triemap-search]

Related issues: #18009 and #17320

[triemap-search]: https://github.com/search?utf8=%E2%9C%93&q=TrieMap+language%3ARust&type=Code&ref=searchresults
bors added a commit that referenced this issue Nov 16, 2014
Fix for task in Metabug #18009 (Rebased version of #18170)

This changes much of about how RingBuf functions. `lo`, `nelts` are replaced by a more traditional `head` and`tail`. The `Vec<Option<T>>` is replaced by a bare pointer that is managed by the `RingBuf` itself. This also expects the ring buffer to always be size that is a power of 2.

This change also includes a number of new tests to cover the some areas that could be of concern with manual memory management.

The benchmarks have been reworked since the old ones were benchmarking of the Ring buffers growth rather then the actual test.

The unit test suite have been expanded, and exposed some bugs in `fn get()` and `fn get_mut()`

## Benchmark
**Before:**
```
test ring_buf::tests::bench_grow_1025                      ... bench:      8919 ns/iter (+/- 87)
test ring_buf::tests::bench_iter_1000                      ... bench:       924 ns/iter (+/- 28)
test ring_buf::tests::bench_mut_iter_1000                  ... bench:       918 ns/iter (+/- 6)
test ring_buf::tests::bench_new                            ... bench:        15 ns/iter (+/- 0)
test ring_buf::tests::bench_pop_100                        ... bench:       294 ns/iter (+/- 9)
test ring_buf::tests::bench_pop_front_100                  ... bench:       948 ns/iter (+/- 32)
test ring_buf::tests::bench_push_back_100                  ... bench:       291 ns/iter (+/- 16)
test ring_buf::tests::bench_push_front_100                 ... bench:       311 ns/iter (+/- 27
```
**After:**
```
test ring_buf::tests::bench_grow_1025                      ... bench:      2209 ns/iter (+/- 169)
test ring_buf::tests::bench_iter_1000                      ... bench:       534 ns/iter (+/- 27)
test ring_buf::tests::bench_mut_iter_1000                  ... bench:       515 ns/iter (+/- 28)
test ring_buf::tests::bench_new                            ... bench:        11 ns/iter (+/- 0)
test ring_buf::tests::bench_pop_100                        ... bench:       170 ns/iter (+/- 5)
test ring_buf::tests::bench_pop_front_100                  ... bench:       171 ns/iter (+/- 11)
test ring_buf::tests::bench_push_back_100                  ... bench:       172 ns/iter (+/- 13)
test ring_buf::tests::bench_push_front_100                 ... bench:       158 ns/iter (+/- 12)

```
bors added a commit that referenced this issue Dec 12, 2014
...ated buffer.

Before:

    test btree::map::bench::find_rand_100                      ... bench:        29 ns/iter (+/- 2)
    test btree::map::bench::find_rand_10_000                   ... bench:        83 ns/iter (+/- 6)
    test btree::map::bench::find_seq_100                       ... bench:        30 ns/iter (+/- 1)
    test btree::map::bench::find_seq_10_000                    ... bench:        50 ns/iter (+/- 3)
    test btree::map::bench::insert_rand_100                    ... bench:       186 ns/iter (+/- 30)
    test btree::map::bench::insert_rand_10_000                 ... bench:       377 ns/iter (+/- 8)
    test btree::map::bench::insert_seq_100                     ... bench:       299 ns/iter (+/- 10)
    test btree::map::bench::insert_seq_10_000                  ... bench:       368 ns/iter (+/- 12)
    test btree::map::bench::iter_1000                          ... bench:     20956 ns/iter (+/- 479)
    test btree::map::bench::iter_100000                        ... bench:   2060899 ns/iter (+/- 44325)
    test btree::map::bench::iter_20                            ... bench:       560 ns/iter (+/- 63)

After:

    test btree::map::bench::find_rand_100                      ... bench:        28 ns/iter (+/- 2)
    test btree::map::bench::find_rand_10_000                   ... bench:        74 ns/iter (+/- 3)
    test btree::map::bench::find_seq_100                       ... bench:        31 ns/iter (+/- 0)
    test btree::map::bench::find_seq_10_000                    ... bench:        46 ns/iter (+/- 0)
    test btree::map::bench::insert_rand_100                    ... bench:       141 ns/iter (+/- 1)
    test btree::map::bench::insert_rand_10_000                 ... bench:       273 ns/iter (+/- 12)
    test btree::map::bench::insert_seq_100                     ... bench:       255 ns/iter (+/- 17)
    test btree::map::bench::insert_seq_10_000                  ... bench:       340 ns/iter (+/- 3)
    test btree::map::bench::iter_1000                          ... bench:     21193 ns/iter (+/- 1958)
    test btree::map::bench::iter_100000                        ... bench:   2203599 ns/iter (+/- 100491)
    test btree::map::bench::iter_20                            ... bench:       614 ns/iter (+/- 110)

This code could probably be a fair bit cleaner, but it works.

Part of #18009.
@Gankra
Copy link
Contributor Author

Gankra commented Dec 17, 2014

TreeMap and TrieMap are on the way out, so I've marked all their issues as "resolved". A lot of the other issues have prototypes in collect-rs (where TreeMap and TrieMap live now).

Also @gereeter's excellent efforts have resolved a lot of the BTree issues!

@csherratt's work has resolved the RingBuf issue!

A lot of other stuff is in rust-lang/rfcs#509

@steveklabnik steveklabnik added the metabug Issues about issues themselves ("bugs about bugs") label Jan 29, 2015
@steveklabnik steveklabnik added A-libs metabug Issues about issues themselves ("bugs about bugs") and removed metabug Issues about issues themselves ("bugs about bugs") labels Oct 27, 2015
@steveklabnik
Copy link
Member

it looks like this bug may be done now? It seems like this was mostly "do work for 1.0". @gankro is that true?

@Gankra
Copy link
Contributor Author

Gankra commented Oct 27, 2015

Eh, it's good enough.

@Gankra Gankra closed this as completed Oct 27, 2015
@Gankra
Copy link
Contributor Author

Gankra commented Oct 27, 2015

(most of the outstanding issues have their own stuff somewhere in the tracker these days)

@flip111
Copy link

flip111 commented Nov 10, 2017

In the original reddit thread multi indexed collections were mentioned. Was this included? Where can i find it? I'm refering to functionality which is also available in other languages:

http://www.boost.org/doc/libs/1_62_0/libs/multi_index/doc/index.html
https://pandas.pydata.org/pandas-docs/stable/advanced.html
https://hackage.haskell.org/package/ixset-typed

It's useful to model data structures as can be done in relational databases (some documentation linked above explains it).

lnicola pushed a commit to lnicola/rust that referenced this issue Sep 25, 2024
fix: do not assume rustup is installed in xtask codegen take 2

7d9e4fcc07e5de94e37b73436147cdbbaa35dbdc broke this on rustup toolchains, the `cmd` command is trying to be too smart here
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
metabug Issues about issues themselves ("bugs about bugs")
Projects
None yet
Development

No branches or pull requests

7 participants