Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for Intersection of multiple Sets at the same time #2023

Open
oberien opened this issue Jun 8, 2017 · 7 comments
Open

Support for Intersection of multiple Sets at the same time #2023

oberien opened this issue Jun 8, 2017 · 7 comments
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.

Comments

@oberien
Copy link

oberien commented Jun 8, 2017

Currently the method intersection of HashSet and BTreeSet only support an intersection between two sets. If you want to intersect multiple sets, for example 3, you'll need to calculate the intersection between the first two, collect them into another set and calculate the intersection between that immediate result and the third set. This is very intense on computation, memory usage and lines of code.

I'd like to suggest the following:

  1. Change the current implementations of the Intersection structs to contain a list of other sets to intersect with instead of a reference to a single other set. This applies to std::collections::btree_set::Intersection as well as std::collections::hash_set::Intersection.
  2. Implement intersect_many (or intersect_multiple) on HashSet and BTreeSet which take a list of other sets to all intersect with the current one.
  3. Add intersect and intersect_many (or intersect_multiple) to the Intersection structs as well. Those can only be called if the Intersection hasn't been iterated over yet. Otherwise, they'll panic.

If (3.) is implemented, the "list of sets to intersect with" will need to be a Vec in order to be growable after creation. For performance reasons, the third suggestion could be dropped and the "list" can instead be a reference to the provided slice.

The current implementation of {Hash,BTree}Set::intersection would need to be changed to pass &[other] instead of other.
While this request should be relatively easy to implement for HashSet (self.others.iter().all(|set| set.contains(elt)), implementing it for BTreeSet could result in a lot more code.

Unresolved Questions:

  1. Naming: intersect_many vs intersect_multiple vs ???
  2. Don't implement the third suggestion (functions on the Intersection struct) in favor of using an allocation free slice-reference instead of a Vec?
@le-jzr
Copy link

le-jzr commented Jun 8, 2017

It might be cleaner to turn Intersection into an iterator filter that can be chained. Worth keeping in mind as a possible alternative.

@oberien
Copy link
Author

oberien commented Jun 10, 2017

While that does seem pretty nice from the API perspective, iterators are meant to be iterated over only once. Given an arbitrary iterator, we'd need to create a new iterator with a buffer which we collect all items into, just to be able to perform a nested loop join. While this can be very optimized with specialization for sets, it is very heavy and imperformant for most other cases. So in my opinion it does not fit the nearly no-cost iterator operations we have currently.

@burdges
Copy link

burdges commented Jun 10, 2017

Intersections speed can be optimized by using the same BuildHasher for both HashMaps and tweaking the intersection algorithm to dig into the guts of the HashMap and exploit that both use the same hasher. It might be easier to express that dependence statically with dependent types, but doing so may go beyond even @ticki's proposals. You can exploit it dynamically though and the multiple intersection case can exploit it even more :

Add Ord, Eq, etc. to RandomState. Make your Intersect function ask for HashMap<K,V,B> where B: BuildHasher+Ord+Eq. First it sorts the HashMaps according to their keys Bs and runs the fast intersection the HashMaps with equal Bs. Next it runs the slower intersection on the remaining HashMaps with unequal Bs. In principle, you might prefer trait objects &BuildHasher+Ord+Eq but that'd require changing HashMap itself.

Just fyi, this speedy intersection is completely trivial for Cuckoo tables, and O(n), but you can do a speedy one for Robin-Hood hash tables like HashMap too that should be amortized O(n). I'd assume you get O(n log n) for intersections that do not exploit equality of B.

@Centril Centril added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Dec 6, 2017
@Stannislav
Copy link

I guess currently the way to do it is with fold? But it looks rather ugly, note also the difference between union and intersection:

    let mut sets: Vec<HashSet<char>> = Vec::new();
    sets.push(['a', 'b', 'c', 'd'].iter().cloned().collect());
    sets.push(['c', 'd'].iter().cloned().collect());
    sets.push(['d', 'a'].iter().cloned().collect());

    let union = sets
        .iter()
        .fold(HashSet::new(), |acc, hs| {
            acc.union(hs).cloned().collect()
        });
    let intersection = sets
        .iter()
        .skip(1)
        .fold(sets[0].clone(), |acc, hs| {
            acc.intersection(hs).cloned().collect()
        });

    println!("Union: {:?}", union);
    println!("Intersection: {:?}", intersection);

Or is there a better way?

@Diggsey
Copy link
Contributor

Diggsey commented Dec 6, 2020

@Stannislav heh, doing Advent of Code? 😛

This is what I did:

            let mut lines = group
                .lines()
                .map(|line| line.chars().collect::<HashSet<_>>());
            let mut s = lines.next().unwrap();
            for line in lines {
                s = s.intersection(&line).copied().collect();
            }

I agree it's ugly though. I think the unstable fold_first method on iterators would make this nicer. That said, it would be nice not to have to continually collect back into a HashSet.

@Diggsey
Copy link
Contributor

Diggsey commented Dec 6, 2020

FWIW, I think what we're missing is iterator adaptors that work on already-sorted data. It should be possible to implement efficient N-way intersection on sorted iterators without introducing a HashSet at all.

@Stannislav
Copy link

Stannislav commented Dec 6, 2020

@Stannislav heh, doing Advent of Code? 😛

Ha, got me 🎄 ! Here's my solution

Anyway, regarding the actual topic, I saw there's an external reduce crate, I haven't tried it though. I'm surprised there's no reduce in the standard library.

Edit: just found the fold_first dev issue here, seems like reduce is one of the proposed names as well.

dimo414 added a commit to dimo414/advent-2022 that referenced this issue Dec 3, 2022
…g a fold is noticably slower than maintaining a mutable map of seen values.

See also rust-lang/rfcs#2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

No branches or pull requests

6 participants