Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BTreeMap Raw Entry API #73

Closed
mokhaled2992 opened this issue Jul 18, 2022 · 18 comments
Closed

BTreeMap Raw Entry API #73

mokhaled2992 opened this issue Jul 18, 2022 · 18 comments
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@mokhaled2992
Copy link

mokhaled2992 commented Jul 18, 2022

Proposal

Problem statement

We need an API that will allow the users to

  1. Fetch or insert entries without doing double lookups. Currently this is only possible with a sequence of get, insert: so two lookups.
  2. Implement custom generic comparators without the need to store the comparator function in the BTreeMap itself (as in C++). Additionally those custom comparators would allow the users to store entries in an order that can rely on some external contexts. Currently this is only possible through one of these two unpleasant alternatives:
    2.1. Making your external context a global mutable and having to deal with the hassle that comes out of that because of threading/synchronization etc....just no
    2.2. Making your context part of the key. This requires the context to be some Rc<RefCell<....>> and the key to have some Weak pointer to that. Obviously a big unnecessary overhead.

Motivation, use-cases

Imagine implementing a dictionary where the strings are stored contiguously in some vector and a you want to establish some sorted lightweight index on that vector. The index would be a BTreeMap of usize keys which are an index into the vector. Obviously the sorting order would then have to be different from the natural sorting order of the usize and it will need to rely on the actual strings vector, we want alphabetical order sorting for our index. By making the raw entry API take a closure we achieve two things: enabling mixed type comparisons, and enabling comparators to consider external contexts during comparisons and only during comparisons. In fact this is even better than the C++ alternative where the comparator lambda has to be passed to the map constructor to be stored as member variable in the map and keep whatever references it needs throughout the whole lifetime of the map.

Furthermore, the raw entry API would enable the users to efficiently tackle the very common case of "get or insert", in which the caller wants to get a reference to the value if it exists or insert a new value immediately at the right position without having to do the lookup twice; once to check for existence and another to insert.

Below is a basic user code that summarizes the use case that was described above

use std::collections::BTreeMap;


// holds all strings contiguously
// maintains a lightweight sorted index
// on such strings.
struct Storage {
    dict: Vec<String>,
    index: BTreeMap<usize, ()>,
}

impl Storage {
    fn get_or_insert(& mut self, input: &str) -> usize {
        match self.index.raw_entry_mut(|key| -> Ordering {
            self.dict[key].cmp(input)
        })
        {
            RawEntryMut::Occupied(o) => o.key(),
            RawEntryMut::Vacant(v) => {
                self.dict.push(input.to_owned());
                v.insert(self.dict.len() - 1, ());
                self.dict.len() - 1
            }
        }
    }
}

Solution sketches

The proposed API looks like this

pub fn BTreeMap::raw_entry_mut<F, U>(&self, u: &U, cmp: F) -> RawEntryMut<'_, K, V>
where F: FnMut(&K, &U) -> Ordering;

pub fn BTreeMap::raw_entry_mut<F>(&self, cmp: F) -> RawEntryMut<'_, K, V>
where F: FnMut(&K) -> Ordering;

RawEntryMut can be an enum Vacant/Occupied where the Vacant variant points to where that key would be inserted, or using C++ terms:

iterator to the position before which the new element will be inserted

Nevertheless, this is an implementation detail of Vacant, the user obviously will just call insert to insert an entry in a vacant location.
Finally, I'm aware of the cursor APIs discussions. I believe future work can extend the methods of RawEntryMut to add things like insert_(before|after) and so on.

Links and related work

I gathered some links showing the demand for such a feature

  1. BTreeSet sorted by dynamically generated keys
  2. How can I implement Ord when the comparison depends on data not part of the compared items?
  3. Interface improvements for BTreeMap
  4. Comparator for BTree(Map|Set)s

Furthermore,

This use case is fully achievable in C++ (C++ Use case code snippet) and I demonstrate that below.

#include <vector>
#include <string>
#include <set>
#include <iostream>

struct Comparator {
    using is_transparent = std::true_type;

    bool operator()(const std::size_t lhs, const std::size_t rhs) const 
    {
        return dict[lhs] < dict[rhs];
    }

    bool operator()(const std::string & lhs, const std::size_t rhs) const {
        return lhs < dict[rhs];
    }

    bool operator()(const std::size_t lhs, const std::string & rhs) const {
        return dict[lhs] < rhs;
    }

    Comparator(const std::vector<std::string> & dict) : dict(dict) {}
    const std::vector<std::string> & dict;
};

struct Storage
{
    std::vector<std::string> dict;
    std::set<std::size_t, Comparator> index;

    Storage() : index(Comparator(dict)) {}

    std::size_t get_or_insert(const std::string& s) {
        const auto it = index.lower_bound(s);
        if(it == index.end() || dict[*it] != s) {
            dict.push_back(s);
            index.emplace_hint(it, dict.size() - 1);
            return dict.size() - 1;
        }

        return *it;
    }
};


int main(int argc, const char ** argv) {
    Storage s;

    std::cout << s.get_or_insert("HelloWorld") << std::endl;
    std::cout << s.get_or_insert("HelloWorld") << std::endl;
    std::cout << s.get_or_insert("C++") << std::endl;
    std::cout << s.get_or_insert("Rust") << std::endl;
    std::cout << s.get_or_insert("C++") << std::endl;
    std::cout << s.get_or_insert("C++") << std::endl;
    std::cout << s.get_or_insert("Rust") << std::endl;
    std::cout << s.get_or_insert("C++") << std::endl;
    std::cout << s.get_or_insert("C++") << std::endl;

    return 0;
}

What happens now?

This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals in its weekly meeting. You should receive feedback within a week or two.

@mokhaled2992 mokhaled2992 added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Jul 18, 2022
@CAD97
Copy link

CAD97 commented Jul 18, 2022

I am willing to mentor implementation work on this API.

Matching the HashMap raw entry API, I would potentially suggest that the comparator function could be impl FnMut(&K) -> Ordering rather than taking an input key ref, but this does make the comparison order somewhat unclear compared to cmp(&K, &Q) -> Ordering. Defining the entrypoint this way also sidesteps the question of whether RawOccupiedEntryMut should carry a copy of the key and provide an insert which ToOwneds the previously used lookup key.

@mokhaled2992
Copy link
Author

mokhaled2992 commented Jul 18, 2022

@CAD97 let me see if I got your suggestion right.

The single key comparator that you propose will be used internally to implicitly compare current items of the map and find the first element that does not return Ordering::Less. If that item is Equal then it's an Occupied entry otherwise it's Greater yielding a Vacant entry, correct?

I also support trying to have the APIs as homogeneous as possible. In fact it's one thing I like about C++ std::(unordered_)map that you can seamlessly switch between them provided that your type implements hash, == and <.

@CAD97
Copy link

CAD97 commented Jul 18, 2022

Yep, exactly. The only difference is .raw_entry(|entry| entry.cmp(key)) versus .raw_entry(key, |entry, key| entry.cmp(key)). Everything else is the same.

@mokhaled2992
Copy link
Author

Sounds good to me actually and less verbose. And you are probably right about avoiding arguing over how would Q be passed. I even would not like the bound T: Borrow<Q> because Q does not need to be an actual reference but rather also some view, think std::string and std::string_view in C++.

@thomcc
Copy link
Member

thomcc commented Jul 18, 2022

Given that the HashMap raw_entry API is not quite something we're sure we want to stabilize (IIUC), I'm not sure mirroring it for BTreeMap is the right approach.

CC @Amanieu

@mokhaled2992
Copy link
Author

mokhaled2992 commented Jul 18, 2022

It's actually a shame that this API would be removed from the HashMap. I had some code using std::HashMap and I moved it to hashbrown to be able to use that API and make use of the optimization opportunities it allows for. Please reconsider that decision if possible for HashMap. As for BTreeMap, I tried to elaborate above in detail use cases which wouldn't be possible without it. At least not without unnecessary overhead; memory and potentially cpu.

@CAD97
Copy link

CAD97 commented Jul 18, 2022

The ability/desire to support cursor operations on the entry API for BTreeMap give at least a reason to have the API on BTreeMap even if it's not available on HashMap.

But that also said, I think I have a vague idea of how raw_entry and entry can share much of the same API surface and types -- the only difference being using the functions taking a comparator/hasher or using the inherent one -- which would simplify the surface significantly.

The two-step builder process isn't necessary; we can just have the raw_entry function take the most general form. It's intrinsically for raw access; the convenient access without dependency just-in-time injection is the normal APIs, and the raw entry API doesn't really need to support it. The cases where raw_entry is useful really is the cases where you provide a custom comparator/hasher/etc, and not really in just avoiding a little bit of recalculation in the main entry API.

(But it would be nice if entry could take some "Q: IntoOwned<K>" to avoid a clone/ownership if already in the map.)

@glittershark
Copy link

@mokhaled2992 is this still something you're interested in pushing forward? If not, is this something I can help move? Getting some form of external comparators is extremely important for my use case (a SQL-esque database, where for example string keys in an index can be compared either case-insensitively or case-sensitively depending on the index, but storing case-sensitivity on every single string key is unacceptably high overhead).

@mokhaled2992
Copy link
Author

mokhaled2992 commented Sep 27, 2022

@glittershark I'm still interested in that, I think having the support for custom on the fly comparison functions enables alot of optimizations as I explained in detail in the main issue text and without the need to carry too much unnecessary information in the key like weak pointers for instance. On a parallel dimension, in my day to day work we use the corresponding Hasbrown hashmap almost always instead of the standard hashtable because it enables custom hashing/equality on the fly. It would be great to see that added one day finally to the standrad hashmap and of course to the sorted map which is what this detailed issue about.

@glittershark
Copy link

That's great to hear!

To be clear: I believe this is likely to be something I will be able to work on full-time for at least a couple of weeks if not more. Is this in a state at the moment where me throwing my time at it is likely to help get it landed, even if it's only as a nightly unstable API?

I guess that's a question for @CAD97 as well.

@mokhaled2992
Copy link
Author

mokhaled2992 commented Sep 27, 2022

I provided all what I believe is necessary but I don't have enough background about the internals to be able to support with implementation or any kind of details about implementation. The PRs of the corresponding hashmap raw entry API might be helpful to look at as a start.

@CAD97
Copy link

CAD97 commented Sep 27, 2022

I brought this up in the Zulip to ask about if we can try to land this to nightly: https://rust-lang.zulipchat.com/#narrow/stream/327149-t-libs-api.2Fapi-changes/topic/BTreeMap.20raw.20entry

I'm formally volunteering to both mentor and review an unstable implementation of a raw entry API for BTreeMap. I don't have actual review permissions, so final review will need to be a real reviewer, but I'm volunteering to cover as much as I can.

To summarize per my understanding:

The proposed API is a "raw entry" API for BTreeMap. The purpose of a raw entry API on BTreeMap is dependency injection. Instead of using the Ord implementation of K, the raw entry functions provide a closure callback for comparing. The API should also be much simpler than what is currently provided for HashMap, as there is only one dependency to inject, so only one raw entry mode to worry about.

The API surface which I wish to second is the following (though of course open to bikeshed):

// mod std::collections::btree_map

impl<K, V, A> BTreeMap<K, V, A>
where
    A: Allocator + Clone,
{
    // existing:
    pub fn get_key_value<Q>(&self, k: &Q) -> Option<(&K, &V)>
    where
        K: Borrow<Q> + Ord,
        Q: Ord + ?Sized,
    ;
    pub fn entry(&mut self, key: K) -> Entry<'_, K, V, A>
    where
        K: Ord,
    ;

    // new, feature(btree_raw_entry):
    pub fn raw_entry<F>(&self, cmp: F) -> Option<(&K, &V)>
    where
        F: FnMut(&K) -> Ordering,
    ;
    pub fn raw_entry_mut<F>(&self, cmp: F) -> RawEntryMut<'_, K, V, A>
    where
        F: FnMut(&K) -> Ordering,
    ;
}

pub enum RawEntryMut<'a, K, V, A = Global>
where
    K: 'a,
    V: 'a,
    A: Allocator + Clone,
{
    Vacant(RawVacantEntryMut<'a, K, V, A>),
    Occupied(RawOccupiedEntryMut<'a, K, V, A>),
}

impl<'a, K, V, A> RawEntryMut<'a, K, V, A>
where
    A: Allocator + Clone,
{
    pub fn or_insert_with<F>(self, default: F) -> (&'a mut K, &'a mut V)
    where
        F: FnOnce() -> (K, V),
    ;
    pub fn and_modify<F>(self, f: F) -> RawEntry<'a, K, V, A>
    where
        F: FnOnce(&mut K, &mut V),
    ;
}

pub struct RawVacantEntryMut<'a, K, V, A = Global>
where
    A: Allocator + Clone,
{ /* private fields */ }

impl<'a, K, V, A> RawVacantEntryMut<'a, K, V, A>
where
    A: Allocator + Clone,
{
    pub fn insert(self, key: K, value: V) -> (&'a mut K, &'a mut V);
}

pub struct RawOccupiedEntryMut<'a, K, V, A = Global>
where
    A: Allocator + Clone,
{ /* private fields */ }

impl<'a, K, V, A> RawOccupiedEntryMut<'a, K, V, A>
where
    A: Allocator + Clone,
{
    pub fn key(&self) -> &K;
    pub fn key_mut(&mut self) -> &mut K;
    pub fn into_key(self) -> &'a mut K;

    pub fn get(&self) -> &V;
    pub fn get_mut(&mut self) -> &mut V;
    pub fn into_mut(self) -> &'a mut V;

    pub fn get_key_value(&self) -> (&K, &V);
    pub fn get_key_value_mut(&mut self) -> (&mut K, &mut V);
    pub fn into_key_value(self) -> (&'a mut K, &'a mut V);

    pub fn insert(&mut self, value: V) -> V;
    pub fn insert_key(&mut self, key: K) -> K;

    pub fn remove(self) -> V;
    pub fn remove_entry(self) -> (K, V);
}
Some alternative thoughts

While writing this up, I had an interesting idea: it may be useful to return RawOccupiedEntry instead of (&'a mut K, &'a mut V). This would apply equally to the HashMap raw entry API. Specifically, this would change the signatures to

  • RawEntry::or_insert_with(self, F) -> RawOccupiedEntry<'a, K, V, A>
  • RawVacantEntry::insert(self, K, V) -> RawOccupiedEntry<'a, K, V, A>

Additional utility could be provided by giving RawEntry Option-returning versions of RawOccupiedEntry's methods.
There's also additional methods from hashbrown's extended version of the raw entry API which I do not reproduce here.

I've skipped the RawEntryBuilder step for BTreeMap; see the following details panel for more on why I chose to do so here. However, for completeness and parallelism to HashMap, it may be worth keeping the extra step for BTreeMap.

Observations on the raw entry API

There's a choice to be made here: HashMap's raw entry API supports modality, roughly

method Eq Hash BuildHasher
RawEntryBuilder[Mut]::from_key intrinsic intrinsic intrinsic
RawEntryBuilder[Mut]::from_key_hashed_nocheck intrinsic injected injected
RawEntryBuilder[Mut]::from_hash injected injected injected
RawEntryMut::insert/or_insert[_with] - intrinsic intrinsic
RawVacantEntryMut::insert[_hashed_nocheck] - intrinsic intrinsic
RawVacantEntryMut::insert_with_hasher - injected injected

BTreeMap doesn't have this much modality: the only intrinsic/injected split available is the comparison. However, raw_entry_mut().from_key is still useful as it takes a borrowed key unlike .entry which takes an owned key. This is a very persnickety case of avoiding unnecessary clones, but it can be (at least theoretically) beneficial in real code.

It's my belief, however, that the main benefit of the raw entry API is not the minor optimization benefits, but rather the dependency injection. BTreeMap doesn't have any state which a user might want to reuse, so if the raw entry API is only used for dependency injection, it doesn't need the modality of the RawEntryBuilderMut step. HashMap however, retains a reason for modality in that you may want to use the map's BuildHasher, or we could say that people using the raw entry API should store their BuildHasher externally, set the map's S = (), and only use the raw entry API.

It comes down to which is more desired: serving more use cases (modal raw entry) or a simpler API (always full injection) which still covers the whole problem domain, just with a bit more manual work for the use cases between normal non-raw entry use and full control .

Given Rust's std biases towards providing more helpers, I might actually bias towards the modal API if it can be shown to pull its weight. The patterns the dependency injection abilities of the raw entry API unlock imho justify the raw entry API, but the questionable potential performance benefit of the non-dependency-injection modes are much harder to justify.

@scottmcm
Copy link
Member

I think the raw API ought to be unsafe, because I want to be able to trust BTreeMap<TrustedType1, TrustedType2>.

Right now, for example, I can trust that a BTreeMap<usize, T> that safe code gave me doesn't have any duplicate keys, for example, and use that to soundly get non-overlappings &muts.

This is, in my opinion, a completely different issue than the fact that I can't trust BTreeMap<SomeUnknownType, T>, because there's no way to trait bound it away. If arbitrary safe code can make arbitrarily-garbage BTreeMaps even out of known & trusted types (like BTreeMap<usize, usize>), there's no way for me to work around that. I'd have to force people to use my type instead.

@thomcc
Copy link
Member

thomcc commented Sep 27, 2022

Do we have an API that could be used to get non-overlapping muts without UB? Or is this to avoid blocking off some future API

@mokhaled2992
Copy link
Author

If I'm not mistaken this Safety issue was discussed in the Hashmap version of the API (rust-lang/rust#54043) and the discussion was steering in the direction that Keys inserted in the wrong position in a hashmap is not considered unsafe in the Rust sense of safety and we can extend that to the BTreeMap's API.

Honestly I do not mind if it is marked as unsafe as long as the API exists and I can utilize it.

@Amanieu
Copy link
Member

Amanieu commented Nov 28, 2022

See #141 where I propose an API for BTreeMap cursors.

@Amanieu
Copy link
Member

Amanieu commented May 17, 2023

We are currently moving moving away from RawEntry on HashMap because it allows safe code to create "invalid" tables where a key is in the wrong position in the table and can't be found by lookups. The proposed API would allow similarly invalid tree to be created, which we would consider unsound: a function receiving a BTreeMap<usize, ()> should be able to assume all the elements are in properly sorted order.

The proper solution for comparator support is either:

  • Expose this support in an external crate by receiving a comparator by reference in every function call, such as https://github.com/ripytide/btree_monstrousity. This BTreeMap would not have the same guarantees as the standard BTreeMap.
  • Alternatively it is possible to do this safely if the comparator is contained within the BTreeMap. In your example, the Vec<String> would be contained within the comparator object, which would itself be contained in the BTreeMap. If you would like to pursue this approach then a new ACP should be opened.

@Amanieu Amanieu closed this as completed May 17, 2023
@ripytide
Copy link

It's also worth mentioning the existence of https://github.com/eggyal/copse which is another external crate port of BTreeMap with a similar goal of providing ordering based on a contained ordering.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api
Projects
None yet
Development

No branches or pull requests

7 participants