Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "sortedPrefix(_:by)" to Collection #9

Merged
merged 31 commits into from
Dec 4, 2020
Merged

Conversation

rakaramos
Copy link
Contributor

@rakaramos rakaramos commented Oct 8, 2020

This PR is a joint effort with @rockbruno and it adds two partial sort methods: sortedPrefix(_:by) for base Collections and an abstracted sortedPrefix(_) for Comparable Collections.

Forum thread

Checklist

  • I've added at least one test that validates that my change is working, if appropriate
  • I've followed the code style of the rest of the project
  • I've read the Contribution Guidelines
  • I've updated the documentation if necessary

@timvermeulen
Copy link
Contributor

I haven't looked too in-depth, but I just wanted to note that it's not necessary to add all elements to the heap at once — it only ever needs to store the smallest/largest n elements as you iterate the sequence. Whether that is actually faster probably needs to be benchmarked.

Copy link
Member

@natecook1000 natecook1000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this contribution, @rakaramos and @rockbruno! I have a few notes for you below as you add tests and fill out the rest of your implementation.

Could you also think about how this could apply as a mutating method? For comparison, the C++ algorithms library includes a partial_sort function.

Sources/Algorithms/PartialSort.swift Outdated Show resolved Hide resolved
Sources/Algorithms/PartialSort.swift Outdated Show resolved Hide resolved
Sources/Algorithms/PartialSort.swift Outdated Show resolved Hide resolved
Sources/Algorithms/PartialSort.swift Outdated Show resolved Hide resolved
@rockbruno
Copy link

I pushed a commit to add in place methods and to make the protocols more generic. I still would like to remove the Index == Int dependency, but now basically it's working as a reverse in place heap sort.

This is working like the C++ counterpart in where you get the entire array back, but we would like to add a variant where you directly get the prefix you're looking for. Though I'm not sure how to name that...

@timvermeulen I was wondering if that's balanced by the fact that we only heapify the first half of array, but I don't know which is faster also.

@rakaramos rakaramos marked this pull request as ready for review October 9, 2020 13:32
@timvermeulen
Copy link
Contributor

This seems to give an incorrect result for me:

[0, 1].partiallySorted(1)

@rockbruno
Copy link

Thanks @timvermeulen ! The bounds when heapifying were incorrect. I added a bunch of more tests now.

@natecook1000
Copy link
Member

A few more notes on this:

  • The stdlib's sort has been stable since version 5.0, even though we haven't documented that guarantee (yet). Any new sorting operations should match that behavior, or probably have a different name.
  • I'm having a bit of trouble with the partiallySorted name for the non-mutating version. It honestly seems more akin to Sequence.min() than to Sequence.sorted(). Can you think about naming alternatives for that?

@rakaramos
Copy link
Contributor Author

Just to make sure we are on the same page. Are you having trouble with the sort part of the name?
At the beginning @rockbruno and me have listed names such as sort(upTo:), sortFirst(_:) among others, but decided to go with what C++ calls it.

I wonder if using any synonym like arrange or classify would be better.

@rockbruno
Copy link

The stdlib's sort has been stable since version 5.0, even though we haven't documented that guarantee (yet). Any new sorting operations should match that behavior, or probably have a different name.

Hm, even if we don't sort everything? @timvermeulen made a suggestion of using Quicksort instead which would be one way to make it stable, but then we would have to eat the potential additional worst case. WDYT?

@pyrtsa
Copy link

pyrtsa commented Oct 13, 2020

I wonder if instead of the non-mutating partiallySorted, we in fact need just the function which only returns the sorted prefix of the partially sorted input. Wouldn't that lead to a more natural name?

extension Sequence {
     public func sortedPrefix(_ count: Int, by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows -> [Element]
 }

 extension Sequence where Element: Comparable {
     public func sortedPrefix(_ count: Int) -> [Element]
 }

Example:

let numbers = [7,1,6,2,8,3,9]
let smallestThree = numbers.sortedPrefix(3, <)
// [1, 2, 3]

@khanlou
Copy link

khanlou commented Oct 16, 2020

I'm really excited to see this! This is an algorithm I spent a lot of time with about two years ago, and I wrote up a blog post about it.

I think there's definitely some tweaking we can do for performance. I've found that created a sorted array of length k and then inserting any other elements and sorting as you go is the best balance between code complexity and speed. I'm happy to share specific code if needed. You can add binary sort or a max heap for insertion, but you end up adding a lot of complexity for practically no performance gain (see the Attabench chart in the blog post, which I can also the Attabench files for if anyone wants them)).

@natecook1000
Copy link
Member

@pyrtsa sortedPrefix is a good name for this! I had been thinking about something like min(5, by: <), but that doesn't indicate that the result is sorted, and would require max(5, by: <) for parity.

I can see the justification for the mutating partial sort, but I don't think I see it for a version that returns the whole array with only part of it sorted. What's the use case there?

@rockbruno
Copy link

I like sortedPrefix! In fact, our original idea was to return just the desired prefix indeed. We kept the whole array because it's what C++ was doing, so it would feel more familiar. Do we agree in returning an ArraySlice in all methods then?

@rockbruno
Copy link

@khanlou Thanks for the post! It's great to see a comparison chart and we could modify it to become stable.

@rockbruno
Copy link

I made some benchmarks for us to analyze! I pitched this PR's heap implementation with @khanlou's, @timvermeulen 's quicksort version and the slower .sorted().prefix() as a base:

Screenshot 2020-10-20 at 18 21 03

If we fetch a small prefix (4) from an increasing array, then SmallestM will be a lot quicker with the others being similar to each other (although I might have made a mistake, because the first time I've ran this the quicksort one was faster than the heap one)

The interesting thing is what happens if you increase the size of the prefix instead of the size of the array (now a fixed 500k elements:)

Screenshot 2020-10-20 at 18 29 04

SmallestM is a lot faster in general, but if you try to prefix too many elements it will also become worse than sorting the entire array quicker than the other algorithms. Here's the same thing but with a smaller amount of elements (32k):

Screenshot 2020-10-20 at 19 08 13

The place where this cut happens gets smaller the larger the array is, but it looks like 10% is a good number in average. What came to my mind is that the best implementation would likely to be @khanlou's one with an additional logic that falls back to sorting the entire thing if you try to prefix more than 10% of the array. What are your thoughts on this?

@natecook1000
Copy link
Member

This is looking good — we could use a couple more tests, then we should be ready to merge:

  • asking for a sortedPrefix larger than the input should return the input, sorted
  • stability test for multiple equal elements

@rakaramos
Copy link
Contributor Author

Awesome! 🎉
I've added the missing tests

for element in Set(actual) {
let filtered = sorted.filter { $0.element == element }.map(\.offset)
XCTAssertEqual(filtered, filtered.sorted())
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
}
}

@@ -0,0 +1,52 @@
# Partial Sort (sortedPrefix)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make sense to rename all documents with consistent terminology:

Suggested change
# Partial Sort (sortedPrefix)
# Sorted Prefix


```swift
let numbers = [7,1,6,2,8,3,9]
let smallestThree = numbers.sortedPrefix(<)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't a correct invocation of the API provided here.


### Complexity

The algorithm used is based on [Soroush Khanlou's research on this matter](https://khanlou.com/2018/12/analyzing-complexity/). The total complexity is `O(k log k + nk)`, which will result in a runtime close to `O(n)` if k is a small amount. If k is a large amount (more than 10% of the collection), we fallback to sorting the entire array. Realistically, this means the worst case is actually `O(n log n)`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The algorithm used is based on [Soroush Khanlou's research on this matter](https://khanlou.com/2018/12/analyzing-complexity/). The total complexity is `O(k log k + nk)`, which will result in a runtime close to `O(n)` if k is a small amount. If k is a large amount (more than 10% of the collection), we fallback to sorting the entire array. Realistically, this means the worst case is actually `O(n log n)`.
The algorithm used is based on [Soroush Khanlou's research on this matter](https://khanlou.com/2018/12/analyzing-complexity/). The total complexity is `O(k log k + nk)`, which will result in a runtime close to `O(n)` if k is a small amount. If k is a large amount (more than 10% of the collection), we fall back to sorting the entire array. Realistically, this means the worst case is actually `O(n log n)`.

I'm not sure how the last statement is arrived at. Could you explain?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we reach a point where O(k log k + nk) is going to be worse than sorting the full array we fall back to stdlib's O(n log n) sort, so in practice it shouldn't get much worse than that.


Here are some benchmarks we made that demonstrates how this implementation (SmallestM) behaves when k increases (before implementing the fallback):

![Benchmark](https://i.imgur.com/F5UEQnl.png)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we embed these into this project, as opposed to arbitrary external URLs?

Copy link
Contributor

@xwu xwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits.

Sources/Algorithms/PartialSort.swift Outdated Show resolved Hide resolved
Sources/Algorithms/PartialSort.swift Outdated Show resolved Hide resolved
Sources/Algorithms/PartialSort.swift Outdated Show resolved Hide resolved

**C++:** The `<algorithm>` library defines a `partial_sort` function where the entire array is returned using a partial heap sort.

**Python:** Defines a `heapq` priority queue that can be used to manually
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: it'd be nice if this document were either consistently wrapped to 80 columns, or else not wrapped. It seems there are two styles here.

rockbruno and others added 4 commits November 1, 2020 11:42
Co-authored-by: Xiaodi Wu <13952+xwu@users.noreply.github.com>
Co-authored-by: Xiaodi Wu <13952+xwu@users.noreply.github.com>
Co-authored-by: Xiaodi Wu <13952+xwu@users.noreply.github.com>
@ensan-hcl
Copy link

Hi! I'd like to tell you one very slight but important problem.
This is the graph shown in the document, and the labels of each lines are reading "笑Priority Queue" "笑SmallestM" "笑QuickSortBased" "笑Stdlib.sort + prefix".

It looks strange for kanji-users because the letter "笑" means "laugh" or "smile" in China and Japan, and the graph has nothing with such ideas.
If you all don't care of it, it's OK, but if you think it's a problem, it should be refined in some way!

But If it was came from some intention or customary rule to use "笑" in the graph label, I'm so sorry for bothering you🙏

@rockbruno
Copy link

In Attabench that was supposed to be a colored square, I think my computer might be missing some special font or pack for that to have happened.

@rakaramos
Copy link
Contributor Author

Is there any other thing left for us to do? cc @natecook1000

@natecook1000
Copy link
Member

Thanks for your patience, @rakaramos and @rockbruno — this is ready to go! 👍

@natecook1000
Copy link
Member

J/K, found a bug — (1...100).sortedPrefix(0) traps on the removeLast() call. The zero-argument tests aren't catching it because they're on collections with fewer than ten elements, which triggers the sort, then prefix branch.

@rockbruno
Copy link

Ah, looks like we were missing tests for massive inputs. Added a bunch of them now

@karwa
Copy link
Contributor

karwa commented Dec 2, 2020

I'd like to question the name sortedPrefix. To me it sounds like it returns a sorted version of the prefix, when it in fact returns a prefix of the sorted elements. In other words:

let numbers = [7,1,6,2,8,3,9]

print(numbers.prefix(3)) // [7, 1, 6]
print(numbers.prefix(3).sorted()) // [1, 6, 7]
print(numbers.sortedPrefix(3)) // [1, 2, 3] - huh? that's not the sorted prefix!

I would suggest perhaps adding a parameter to sorted - e.g. numbers.sorted(count: 3). Names relating to "partial sorting" also avoid this ambiguity IMO.

@pyrtsa
Copy link

pyrtsa commented Dec 2, 2020

I'd like to question the name sortedPrefix. To me it sounds like it returns a sorted version of the prefix, when it in fact returns a prefix of the sorted elements. In other words:

let numbers = [7,1,6,2,8,3,9]

print(numbers.prefix(3)) // [7, 1, 6]
print(numbers.prefix(3).sorted()) // [1, 6, 7]
print(numbers.sortedPrefix(3)) // [1, 2, 3] - huh? that's not the sorted prefix!

You're right, that is a possible point of confusion in the naming. However please note that this holds:

print(numbers.sorted().prefix(3)) // [1, 2, 3]
print(numbers.sortedPrefix(3)) // [1, 2, 3]

Edited to add: I think the problem with partial sorting related names (which are good for the mutating algorithm variants!) is that this version only returns the prefix, not the full but partially sorted array.

@karwa
Copy link
Contributor

karwa commented Dec 2, 2020

You're right, that is a possible point of confusion in the naming. However please note that this holds:

print(numbers.sorted().prefix(3)) // [1, 6, 7]
print(numbers.sortedPrefix(3)) // [1, 6, 7]

No it doesn’t. The example is taken from the guide document:

let numbers = [7,1,6,2,8,3,9]
let smallestThree = numbers.sortedPrefix(3, by: <)
 // [1, 2, 3]

Oh, you mean .sorted().prefix(3) (you didn’t update the comments). Sure, but that isn’t what the method reads as - it sounds like “sorted” is an adjective describing what happens to the prefix (i.e. the prefix gets sorted)

@pyrtsa
Copy link

pyrtsa commented Dec 2, 2020

Sorry, I think I only copied the wrong results here. What I mean is the output of numbers.sorted().prefix(n) matches element-wise that of numbers.sortedPrefix(n).

Comment on lines +59 to +61
if let last = result.last, try areInIncreasingOrder(last, e) {
continue
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still a logic issue here — if e is equal to result.last, execution will pass by this continue statement. That's a problem, because the call to partitioningIndex then returns endIndex, which becomes invalid after the call to result.removeLast(). What you want to ensure is that e is strictly less than result.last before proceeding.

Test case:

Array(repeating: 1, count: 100).sortedPrefix(5)
// Fatal error: Array index is out of range

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Crap... That's another case that we did have covered in the tests, but the prefix wasn't low enough to trigger the algorithm. I added more high input cases, hopefully it will work now.

@natecook1000 natecook1000 merged commit 3864606 into apple:main Dec 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants