Add "sortedPrefix(_:by)" to Collection #9
I haven't looked too in-depth, but I just wanted to note that it's not necessary to add all elements to the heap at once — it only ever needs to store the smallest/largest |
Thanks for this contribution, @rakaramos and @rockbruno! I have a few notes for you below as you add tests and fill out the rest of your implementation.
Could you also think about how this could apply as a mutating method? For comparison, the C++ algorithms library includes a `partial_sort` function.
I pushed a commit to add in-place methods and to make the protocols more generic. I still would like to remove the
This works like the C++ counterpart in that you get the entire array back, but we would like to add a variant where you directly get the prefix you're looking for, though I'm not sure how to name that...
@timvermeulen I was wondering if that's balanced by the fact that we only heapify the first half of the array, but I don't know which is faster either.
This seems to give an incorrect result for me: [0, 1].partiallySorted(1) |
Thanks @timvermeulen! The bounds when heapifying were incorrect. I have added a bunch more tests now.
A few more notes on this:
|
Just to make sure we are on the same page. Are you having trouble with the I wonder if using any synonym like |
Hm, even if we don't sort everything? @timvermeulen made a suggestion of using Quicksort instead which would be one way to make it stable, but then we would have to eat the potential additional worst case. WDYT? |
I wonder if instead of the non-mutating
extension Sequence {
    public func sortedPrefix(_ count: Int, by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows -> [Element]
}
extension Sequence where Element: Comparable {
    public func sortedPrefix(_ count: Int) -> [Element]
}
Example:
let numbers = [7,1,6,2,8,3,9]
let smallestThree = numbers.sortedPrefix(3, by: <)
// [1, 2, 3]
I'm really excited to see this! This is an algorithm I spent a lot of time with about two years ago, and I wrote up a blog post about it. I think there's definitely some tweaking we can do for performance. I've found that creating a sorted array of length k and then inserting any other elements and sorting as you go is the best balance between code complexity and speed. I'm happy to share specific code if needed. You can add binary sort or a max heap for insertion, but you end up adding a lot of complexity for practically no performance gain (see the Attabench chart in the blog post, which I can also share the Attabench files for if anyone wants them).
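For readers following along, here is a minimal sketch of the sorted-buffer approach described in this comment. The function name `smallest(_:of:by:)` is hypothetical, and this is an illustration under those assumptions, not the code from the blog post or from this PR.

```swift
// Hypothetical helper (not the PR's API): returns the `k` smallest elements
// of `source`, in increasing order, using a sorted buffer of length `k`.
func smallest<S: Sequence>(
    _ k: Int,
    of source: S,
    by areInIncreasingOrder: (S.Element, S.Element) -> Bool
) -> [S.Element] {
    precondition(k >= 0, "k must be non-negative")
    guard k > 0 else { return [] }

    var iterator = source.makeIterator()
    var result: [S.Element] = []
    result.reserveCapacity(k)

    // Fill the buffer with the first k elements, then sort it once.
    while result.count < k, let next = iterator.next() {
        result.append(next)
    }
    result.sort(by: areInIncreasingOrder)

    // For every remaining element, keep it only if it is strictly smaller
    // than the largest element currently in the buffer.
    while let element = iterator.next() {
        guard let last = result.last, areInIncreasingOrder(element, last) else {
            continue
        }
        result.removeLast()
        // Linear scan from the back for the insertion point; the element is
        // placed after any elements that compare equal to it.
        var index = result.endIndex
        while index > result.startIndex,
              areInIncreasingOrder(element, result[index - 1]) {
            index -= 1
        }
        result.insert(element, at: index)
    }
    return result
}

let smallestThree = smallest(3, of: [7, 1, 6, 2, 8, 3, 9], by: <)
// [1, 2, 3]
```

The linear scan could be swapped for a binary search, which is the extra complexity the comment argues is rarely worth it.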
@pyrtsa I can see the justification for the mutating partial sort, but I don't think I see it for a version that returns the whole array with only part of it sorted. What's the use case there? |
I like |
@khanlou Thanks for the post! It's great to see a comparison chart and we could modify it to become stable. |
I made some benchmarks for us to analyze! I pitted this PR's heap implementation against @khanlou's, @timvermeulen's quicksort version and the slower
If we fetch a small prefix (4) from an increasing array, then SmallestM will be a lot quicker, with the others being similar to each other (although I might have made a mistake, because the first time I ran this the quicksort one was faster than the heap one).
The interesting thing is what happens if you increase the size of the prefix instead of the size of the array (now a fixed 500k elements):
SmallestM is a lot faster in general, but if you try to prefix too many elements it will become worse than sorting the entire array, and it gets there quicker than the other algorithms.
Here's the same thing but with a smaller number of elements (32k):
The point where this cut happens gets smaller the larger the array is, but it looks like 10% is a good number on average. What came to my mind is that the best implementation would likely be @khanlou's one with additional logic that falls back to sorting the entire thing if you try to prefix more than 10% of the array. What are your thoughts on this?
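The fallback idea in this comment could look roughly like the sketch below. The method name `sortedPrefixWithFallback(_:by:)` is hypothetical, and the 10% cutoff is simply the figure suggested here, not a tuned constant.

```swift
extension Collection {
    // Hypothetical name, for illustration only. Sorts everything when the
    // requested prefix exceeds 10% of the collection; otherwise keeps a
    // small sorted buffer of the smallest elements seen so far.
    func sortedPrefixWithFallback(
        _ k: Int,
        by areInIncreasingOrder: (Element, Element) -> Bool
    ) -> [Element] {
        precondition(k >= 0, "k must be non-negative")
        let limit = Swift.min(k, count)
        guard limit > 0 else { return [] }

        // Large prefix: sort the whole collection and take the front.
        if limit * 10 > count {
            return Array(sorted(by: areInIncreasingOrder).prefix(limit))
        }

        // Small prefix: sorted buffer of the `limit` smallest elements.
        var result = Array(prefix(limit))
        result.sort(by: areInIncreasingOrder)
        for element in dropFirst(limit) {
            // Skip anything that is not strictly smaller than the largest kept.
            guard areInIncreasingOrder(element, result[limit - 1]) else { continue }
            result.removeLast()
            var index = result.endIndex
            while index > 0, areInIncreasingOrder(element, result[index - 1]) {
                index -= 1
            }
            result.insert(element, at: index)
        }
        return result
    }
}

let descending = Array((1...100).reversed())
let viaBuffer = descending.sortedPrefixWithFallback(3, by: <)   // small prefix path
let viaFullSort = descending.sortedPrefixWithFallback(50, by: <) // fallback path
```

Both paths return the same elements; only the amount of work differs.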
This is looking good — we could use a couple more tests, then we should be ready to merge:
|
Awesome! 🎉 |
Update PartialSortTests.swift
for element in Set(actual) {
  let filtered = sorted.filter { $0.element == element }.map(\.offset)
  XCTAssertEqual(filtered, filtered.sorted())
}
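To show what this loop checks, here is a self-contained sketch of the same stability check. It assumes a stand-in for the method under test (the standard library's stable `sorted(by:)` followed by `prefix`); the variable names mirror the test, but the input data is made up.

```swift
// Made-up input with repeated values, so stability is observable.
let input = [3, 1, 2, 1, 3, 2, 1]

// Stand-in for the method under test (assumption): sort (offset, element)
// pairs by element only, then keep the first five.
let sorted = Array(
    input.enumerated()
        .sorted { $0.element < $1.element }
        .prefix(5)
)
let actual = sorted.map(\.element)

// For each distinct value, the original offsets must appear in increasing
// order; otherwise equal elements were reordered and the sort is not stable.
var isStable = true
for element in Set(actual) {
    let offsets = sorted.filter { $0.element == element }.map(\.offset)
    if offsets != offsets.sorted() {
        isStable = false
    }
}
```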
} | |
} |
Guides/PartialSort.md
Outdated
@@ -0,0 +1,52 @@ | |||
# Partial Sort (sortedPrefix) |
I think it would make sense to rename all documents with consistent terminology:
# Partial Sort (sortedPrefix) | |
# Sorted Prefix |
Guides/PartialSort.md
Outdated
```swift
let numbers = [7,1,6,2,8,3,9]
let smallestThree = numbers.sortedPrefix(<)
```
This isn't a correct invocation of the API provided here.
Guides/PartialSort.md
Outdated
|
||
### Complexity | ||
|
||
The algorithm used is based on [Soroush Khanlou's research on this matter](https://khanlou.com/2018/12/analyzing-complexity/). The total complexity is `O(k log k + nk)`, which will result in a runtime close to `O(n)` if k is a small amount. If k is a large amount (more than 10% of the collection), we fallback to sorting the entire array. Realistically, this means the worst case is actually `O(n log n)`. |
The algorithm used is based on [Soroush Khanlou's research on this matter](https://khanlou.com/2018/12/analyzing-complexity/). The total complexity is `O(k log k + nk)`, which will result in a runtime close to `O(n)` if k is a small amount. If k is a large amount (more than 10% of the collection), we fallback to sorting the entire array. Realistically, this means the worst case is actually `O(n log n)`. | |
The algorithm used is based on [Soroush Khanlou's research on this matter](https://khanlou.com/2018/12/analyzing-complexity/). The total complexity is `O(k log k + nk)`, which will result in a runtime close to `O(n)` if k is a small amount. If k is a large amount (more than 10% of the collection), we fall back to sorting the entire array. Realistically, this means the worst case is actually `O(n log n)`. |
I'm not sure how the last statement is arrived at. Could you explain?
If we reach a point where `O(k log k + nk)` is going to be worse than sorting the full array we fall back to stdlib's `O(n log n)` sort, so in practice it shouldn't get much worse than that.
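For reference, a rough comparison of the two bounds, assuming base-2 logarithms and ignoring constant factors (both are simplifications for illustration, not measurements from this PR). Since $k \le n$, the $nk$ term dominates $k \log k$, so the comparison reduces to:

$$
nk \le n \log_2 n \iff k \le \log_2 n, \qquad \log_2 500{,}000 \approx 18.9
$$

On worst-case bounds alone, the insertion approach would only win for very small k, while the benchmarks in this thread put the practical crossover near 10% of the collection. Just below that threshold the bound $nk$ is still about $n^2/10$, so the `O(n log n)` figure reads as a description of observed behavior rather than a strict asymptotic guarantee.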
Guides/PartialSort.md
Outdated
|
||
Here are some benchmarks we made that demonstrates how this implementation (SmallestM) behaves when k increases (before implementing the fallback): | ||
|
||
![Benchmark](https://i.imgur.com/F5UEQnl.png) |
Can we embed these into this project, as opposed to arbitrary external URLs?
Some nits.
Guides/PartialSort.md
Outdated
|
||
**C++:** The `<algorithm>` library defines a `partial_sort` function where the entire array is returned using a partial heap sort. | ||
|
||
**Python:** Defines a `heapq` priority queue that can be used to manually |
Nit: it'd be nice if this document were either consistently wrapped to 80 columns, or else not wrapped. It seems there are two styles here.
Co-authored-by: Xiaodi Wu <13952+xwu@users.noreply.github.com>
Hi! I'd like to point out one very slight but important problem. It looks strange for kanji users because the character "笑" means "laugh" or "smile" in China and Japan, and the graph has nothing to do with such ideas. But if it came from some intention or customary rule to use "笑" in the graph label, I'm so sorry for bothering you🙏
In Attabench that was supposed to be a colored square. I think my computer might be missing some special font or pack for that to have happened.
Is there any other thing left for us to do? cc @natecook1000 |
Thanks for your patience, @rakaramos and @rockbruno — this is ready to go! 👍 |
J/K, found a bug — |
Ah, looks like we were missing tests for massive inputs. Added a bunch of them now |
I'd like to question the name
I would suggest perhaps adding a parameter to |
You're right, that is a possible point of confusion in the naming. However please note that this holds:
Edited to add: I think the problem with partial sorting related names (which are good for the mutating algorithm variants!) is that this version only returns the prefix, not the full but partially sorted array. |
No it doesn’t. The example is taken from the guide document:
Oh, you mean |
Sorry, I think I only copied the wrong results here. What I mean is the output of |
if let last = result.last, try areInIncreasingOrder(last, e) {
  continue
}
There's still a logic issue here: if `e` is equal to `result.last`, execution will pass by this `continue` statement. That's a problem, because the call to `partitioningIndex` then returns `endIndex`, which becomes invalid after the call to `result.removeLast()`. What you want to ensure is that `e` is strictly less than `result.last` before proceeding.
Test case:
Array(repeating: 1, count: 100).sortedPrefix(5)
// Fatal error: Array index is out of range
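A minimal sketch of the difference between the two checks, using hypothetical helper names (`buggyShouldSkip`, `shouldSkip`) that are not part of the PR:

```swift
// Check as written in the snippet under review: skips `e` only when `last`
// is strictly less than `e`, so an element equal to `last` is not skipped.
func buggyShouldSkip<T>(
    _ e: T, last: T, by areInIncreasingOrder: (T, T) -> Bool
) -> Bool {
    return areInIncreasingOrder(last, e)
}

// Check the review asks for: proceed only when `e` is strictly less than
// `last`, so equal elements are skipped too.
func shouldSkip<T>(
    _ e: T, last: T, by areInIncreasingOrder: (T, T) -> Bool
) -> Bool {
    return !areInIncreasingOrder(e, last)
}

let equalCaseBuggy = buggyShouldSkip(1, last: 1, by: <) // false: falls through
let equalCaseFixed = shouldSkip(1, last: 1, by: <)      // true: skipped
```

With an input such as `Array(repeating: 1, count: 100)`, every element after the initial buffer hits the equal case, which is why the test case above triggers the crash.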
Crap... That's another case that we did have covered in the tests, but the prefix wasn't low enough to trigger the algorithm. I added more high input cases, hopefully it will work now.
This PR is a joint effort with @rockbruno and it adds two partial sort methods: `sortedPrefix(_:by)` for base `Collections` and an abstracted `sortedPrefix(_)` for `Comparable` `Collections`.
Forum thread
Checklist