Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve ContainExactly matcher speed when elements obey transitivity #1325

Conversation

bclayman-sq
Copy link
Contributor

@bclayman-sq bclayman-sq commented Oct 6, 2021

This addresses issues #1006, #1161.

The current implementation for ContainExactly runs in O(n!). The crux of the problem is that some elements don't obey transitivity. As a result, knowing that sorting actual and expected doesn't result in a match doesn't guarantee that expected and actual don't match.

This PR makes improvements that ensure any test with comparable elements in actual and expected runs in O(n log n) time.

There are two main updates:

Update the core logic in ContainExactly to:

  1. Attempt to sort actual and expected
  2. If the sort succeeds, this tells us the objects are comparable. Use the result of the sort

This means that when actual and expected contain comparable elements, we can immediately return false when the sorted arrays don't match. This speeds up passing tests where actual and expected don't match from O(n!) to O(n log n).

Speed up determining extra and missing elements

Previously, the logic for determining extra and missing items between actual and expected relied on code in PairingsMaximizer that generated all possible element pairings. So even if you avoided comparing all possible pairings to determine if the matcher should return true, you would still incur that O(n!) cost when generating a failure message.

The code now calculates extra and missing in linear time. This speeds up failing tests from O(n!) to O(n log n).

More practically, this means that common use cases for ContainExactly will enjoy a massive speedup. Previously, users have examples where comparing arrays of 30 integers "never finishes." With this PR's update, comparing arrays of 10,000 integers runs in < 0.1s on my machine.

@bclayman-sq
Copy link
Contributor Author

This is a small proof of concept to suggest a path forward for improving the performance of ContainExactly (and by extension, MatchArray) dramatically for common use cases. I'd love to hear what folks think!

@bclayman-sq bclayman-sq force-pushed the bclayman/improve-contains-exactly-speed branch 4 times, most recently from c00b44c to 57ccbff Compare October 7, 2021 14:55
@bclayman-sq
Copy link
Contributor Author

@pirj Had to update a few things to pass CI for all ruby versions. If you think it's reasonable, I can try to put out a more comprehensive PR.

What do you think of this approach?

@pirj
Copy link
Member

pirj commented Oct 8, 2021

Not to litter in contained threads, I'll just post my findings here. I have a limited experience with practical algorithm application, unfortunately.

https://en.wikipedia.org/wiki/Stable_marriage_problem

problem of finding a stable matching between two equally sized sets of elements given an ordering of preferences for each element

https://en.wikipedia.org/wiki/Lattice_of_stable_matchings

https://brilliant.org/wiki/matching/

group of candidates and a set of jobs, and each candidate is qualified for at least one of the jobs

https://brilliant.org/wiki/matching-algorithms/
https://en.wikipedia.org/wiki/Matching_(graph_theory)#Maximum-cardinality_matching

In any case, O(n!) seems to be ridiculously expensive. Building a lattice is O(N^2), and shaking it has a chance to be less computationally extensive, but I suppose will consume more memory.

Does the practical task boil down to finding the local minimum total of extra and missing elements?

@bclayman-sq bclayman-sq force-pushed the bclayman/improve-contains-exactly-speed branch 3 times, most recently from b78dd66 to 42dda47 Compare October 8, 2021 23:46
@bclayman-sq
Copy link
Contributor Author

bclayman-sq commented Oct 9, 2021

👋 @pirj,

I'm gonna reply in-line just to keep a summary of where we're at on this PR. You had two recent bits of feedback that I've addressed below:

It would be ideal, though, not to expose transitive, as, frankly, I had to look it up to remind myself what it was.
And still, users (I do not assume stupidity, just human error) might use it wrong. Apart from it being yet another to keep in mind.

I think there are a few ways to address your concerns above:

1. Better naming: I'm not wedded to transitive. We could call it sortable, comparable, etc.
2. Good documentation: Explicit and detailed documentation will hopefully help users use it correctly.
3. Informative error messages when someone does misuse transitive: For example, if a user uses transitive with elements that aren't comparable, we can raise. That can happen here:

def safe_sort(array)
  array.sort
rescue Support::AllExceptionsExceptOnesWeMustNotRescue
  raise "Invalid use of `.transitive` with unsortable array #{array}" if @transitive
  array
end

Maybe we could calculate the missing and extra items separately

This sounds like a reasonable approach. It feels that for transitive we can improve pairing maximizer to be O(n) (on two sorted arrays, since sorting that was performed previously already).

Awesome, I think we're on the same page (and thanks for being so helpful in your responses).

I've just refactored my code to calculate extra and missing items in O(n) time when transitive is used. This addresses your earlier concern about generate_failure_message incurring O(n!) work by relying on PairingsMaximizer methods.

I think it'd be a really nice improvement to ContainExactly from O(n!) to O(n log n) when sorting doesn't result in a match and the elements are transitive. I'd be open to improving naming, updating documentation, and adding error-handling if that would get this across the finish line.

When you get a chance (you've already helped tons, thank you!), would you mind letting me know if you'd be open to merging this functionality into RSpec?

@pirj
Copy link
Member

pirj commented Oct 9, 2021

Just skimmed through your comments and I truly like the idea to detect if all items are transitive by making a check if the sort succeeded. Wondering if [be_positive, be_negative].sort blows up. It should for our purposes.
Will reply later to the rest. Thanks for taking this tricky task over!

@bclayman-sq bclayman-sq force-pushed the bclayman/improve-contains-exactly-speed branch 3 times, most recently from 710aee6 to 5cc1ea7 Compare October 9, 2021 21:24
…ents obey transitivity

This is a proof of concept approach for addressing issue rspec#1161.

The current implementation for ContainExactly runs in O(n!).  In practice,
it runs in O(n log n) when the elements are comparable and sorting result
in a match.

The crux of the problem is that some elements don't obey transitivity.  As
a result, knowing that sorting actual and expected doesn't result in a match
*doesn't* guarantee that expected and actual don't match.

This proof of concept provides a way for the user to indicate that the elements
in a particular example's expected and actual obey transitivity. That looks like this:

expect(a).to contain_exactly(*b).transitive

And runs in O(n log n) time.  More practically, this means that common
use cases for contains_exactly will enjoy a massive speedup.  Previously,
users have examples where comparing arrays of 30 integers "never finishes."
Using `.transitive` here with arrays of 10,000 integers runs in < 0.1s
on my machine.
@bclayman-sq bclayman-sq force-pushed the bclayman/improve-contains-exactly-speed branch from 5cc1ea7 to 0dae38c Compare October 9, 2021 23:06
@pirj
Copy link
Member

pirj commented Oct 10, 2021

I dared to push some cosmetic changes, what do you think of such an auto-detection of transitivity?

@bclayman-sq
Copy link
Contributor Author

bclayman-sq commented Oct 10, 2021

I dared to push some cosmetic changes, what do you think of such an auto-detection of transitivity?

@pirj,

Thanks for doing that! I think this is great and a clear improvement over .transitive; users will get the benefit without having to know about or remember anything about .transitive.

If you're happy with this, would you mind approving? If so, I'll update the PR title and message since it's no longer a proof of concept and doesn't use transitive and then I'll merge 😄

@pirj
Copy link
Member

pirj commented Oct 10, 2021 via email

@bclayman-sq bclayman-sq changed the title Proof of concept for improving ContainExactly matcher speed when elements obey transitivity Improve ContainExactly matcher speed when elements obey transitivity Oct 10, 2021
@bclayman-sq
Copy link
Contributor Author

bclayman-sq commented Oct 10, 2021

I'm all good with the change. The only thing I think needs addressing are specs - there is a very similar example with 10 numbers and Timeout check. It makes sense to combine or at least put those two closer to each

@pirj Yeah, that makes a lot of sense. I just refactored the tests so that my speed tests are near the 10 number + Timeout check. Cleaned a few things up with shared examples and timeout_if_not_debugging too.

How does it look?

Copy link
Member

@pirj pirj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect! It's a huge win for this use case.
Thanks a lot for the contribution!

@genehsu
Copy link

genehsu commented Oct 25, 2021

Here's a different idea for the implementation.

What if in the pairings_maximizer method you pre-calculated pairs of comparable things between the expected and actual arrays? Then the amount of work the PairingsMaximizer class will normally be small because most values will be matched reciprocally, and only leftover values with matchers or other non-comparable items will need to be paired.

This implementation might be considered more straightforward because it doesn't introduce a new branches to the code flow, but optimizes the existing implementation. Here's a prototype: #1328

bclayman-sq added a commit to bclayman-sq/rspec-expectations that referenced this pull request Oct 29, 2021
Speed up the ContainExactly matcher by pre-emptively matching up corresponding elements in the expected and actual arrays.

This addresses rspec#1006, rspec#1161.

This PR is a collaboration between me and @genehsu based on
a couple of our earlier PRs and discussion that resulted:
1) rspec#1325
2) rspec#1328

Co-authored-by: Gene Hsu (@genehsu)
bclayman-sq added a commit to bclayman-sq/rspec-expectations that referenced this pull request Oct 29, 2021
Speed up the ContainExactly matcher by pre-emptively matching up corresponding elements in the expected and actual arrays.

This addresses rspec#1006, rspec#1161.

This PR is a collaboration between me and @genehsu based on
a couple of our earlier PRs and discussion that resulted:
1) rspec#1325
2) rspec#1328

Co-authored-by: Gene Hsu (@genehsu)
bclayman-sq added a commit to bclayman-sq/rspec-expectations that referenced this pull request Oct 29, 2021
Speed up the ContainExactly matcher by pre-emptively matching up corresponding elements in the expected and actual arrays.

This addresses rspec#1006, rspec#1161.

This PR is a collaboration between me and @genehsu based on
a couple of our earlier PRs and discussion that resulted:
1) rspec#1325
2) rspec#1328

Co-authored-by: Gene Hsu (@genehsu)
@bclayman-sq
Copy link
Contributor Author

Closing in favor of #1333

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants