Add "sortedPrefix(_:by)" to Collection #9

rakaramos · 2020-10-08T19:19:11Z

This PR is a joint effort with @rockbruno. It adds two partial sort methods: sortedPrefix(_:by:) for any Collection, and a convenience sortedPrefix(_:) for collections whose elements are Comparable.

Forum thread

Checklist

I've added at least one test that validates that my change is working, if appropriate
I've followed the code style of the rest of the project
I've read the Contribution Guidelines
I've updated the documentation if necessary

timvermeulen · 2020-10-08T19:39:12Z

I haven't looked too in-depth, but I just wanted to note that it's not necessary to add all elements to the heap at once — it only ever needs to store the smallest/largest n elements as you iterate the sequence. Whether that is actually faster probably needs to be benchmarked.

natecook1000

Thanks for this contribution, @rakaramos and @rockbruno! I have a few notes for you below as you add tests and fill out the rest of your implementation.

Could you also think about how this could apply as a mutating method? For comparison, the C++ algorithms library includes a partial_sort function.

Sources/Algorithms/PartialSort.swift

rockbruno · 2020-10-09T11:14:24Z

I pushed a commit to add in-place methods and to make the protocols more generic. I would still like to remove the Index == Int dependency, but for now it basically works as a reverse in-place heap sort.

This works like the C++ counterpart in that you get the entire array back, but we would like to add a variant where you directly get the prefix you're looking for. I'm not sure how to name that, though...

@timvermeulen I was wondering if that's balanced by the fact that we only heapify the first half of the array, but I don't know which is faster either.
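
To make the in-place behaviour described above concrete (the whole array comes back, but only the first k positions are guaranteed to be sorted), here is a minimal sketch. It deliberately swaps in a simple selection pass instead of the heap-based approach the commit uses, and the name `partiallySortPrefix` is hypothetical. Like C++'s `partial_sort`, this sketch is not stable.

```swift
extension Array {
  /// Hypothetical helper (not from the PR): after the call, the first
  /// `count` positions hold the smallest elements in sorted order; the
  /// order of the remaining elements is unspecified.
  mutating func partiallySortPrefix(
    _ count: Int,
    by areInIncreasingOrder: (Element, Element) -> Bool
  ) {
    precondition(count >= 0, "count must be non-negative")
    let sortedCount = Swift.min(count, self.count)
    for target in 0..<sortedCount {
      // Find the smallest element in the not-yet-sorted tail.
      var smallest = target
      for candidate in (target + 1)..<self.count
      where areInIncreasingOrder(self[candidate], self[smallest]) {
        smallest = candidate
      }
      // Move it into place; swapping an index with itself is a no-op.
      swapAt(target, smallest)
    }
  }
}

var numbers = [7, 1, 6, 2, 8, 3, 9]
numbers.partiallySortPrefix(3, by: <)
// The first three elements are now 1, 2, 3; the rest are in no particular order.
```

The selection pass costs O(nk) comparisons, so it only illustrates the contract, not the performance characteristics being discussed.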

Guide docs

timvermeulen · 2020-10-09T13:48:14Z

This seems to give an incorrect result for me:

[0, 1].partiallySorted(1)

rockbruno · 2020-10-09T14:09:25Z

Thanks @timvermeulen! The bounds when heapifying were incorrect. I've added a bunch more tests now.

natecook1000 · 2020-10-12T20:32:42Z

A few more notes on this:

The stdlib's sort has been stable since version 5.0, even though we haven't documented that guarantee (yet). Any new sorting operations should match that behavior, or probably have a different name.
I'm having a bit of trouble with the partiallySorted name for the non-mutating version. It honestly seems more akin to Sequence.min() than to Sequence.sorted(). Can you think about naming alternatives for that?

rakaramos · 2020-10-12T23:52:18Z

Just to make sure we are on the same page: are you having trouble with the sort part of the name?
At the beginning, @rockbruno and I listed names such as sort(upTo:) and sortFirst(_:), among others, but decided to go with what C++ calls it.

I wonder if using any synonym like arrange or classify would be better.

rockbruno · 2020-10-13T07:02:35Z

> The stdlib's sort has been stable since version 5.0, even though we haven't documented that guarantee (yet). Any new sorting operations should match that behavior, or probably have a different name.

Hm, even if we don't sort everything? @timvermeulen made a suggestion of using Quicksort instead which would be one way to make it stable, but then we would have to eat the potential additional worst case. WDYT?

pyrtsa · 2020-10-13T07:43:07Z

I wonder if instead of the non-mutating partiallySorted, we in fact need just the function which only returns the sorted prefix of the partially sorted input. Wouldn't that lead to a more natural name?

extension Sequence {
    public func sortedPrefix(_ count: Int, by areInIncreasingOrder: (Element, Element) throws -> Bool) rethrows -> [Element]
}

extension Sequence where Element: Comparable {
    public func sortedPrefix(_ count: Int) -> [Element]
}

Example:

let numbers = [7,1,6,2,8,3,9]
let smallestThree = numbers.sortedPrefix(3, by: <)
// [1, 2, 3]

khanlou · 2020-10-16T00:02:21Z

I'm really excited to see this! This is an algorithm I spent a lot of time with about two years ago, and I wrote up a blog post about it.

I think there's definitely some tweaking we can do for performance. I've found that creating a sorted array of length k, then inserting each remaining element into it in sorted position as you go, is the best balance between code complexity and speed. I'm happy to share specific code if needed. You can add binary search or a max heap for insertion, but you end up adding a lot of complexity for practically no performance gain (see the Attabench chart in the blog post; I can also share the Attabench files if anyone wants them).
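
The approach described above can be sketched roughly as follows. This is a minimal illustration written for this thread, not the code from the blog post or from the PR: the name `smallest(_:by:)` is hypothetical, and the insertion position is found with a plain linear scan rather than a binary search.

```swift
extension Sequence {
  /// Hypothetical helper: returns the `count` smallest elements, sorted,
  /// keeping equal elements in their original relative order.
  func smallest(
    _ count: Int,
    by areInIncreasingOrder: (Element, Element) throws -> Bool
  ) rethrows -> [Element] {
    precondition(count >= 0, "count must be non-negative")
    guard count > 0 else { return [] }

    // Seed the result with the first `count` elements, sorted.
    var iterator = makeIterator()
    var result: [Element] = []
    result.reserveCapacity(count)
    while result.count < count, let next = iterator.next() {
      result.append(next)
    }
    try result.sort(by: areInIncreasingOrder)

    while let element = iterator.next() {
      // Skip anything that is not strictly smaller than the current largest.
      guard let last = result.last,
            try areInIncreasingOrder(element, last) else { continue }
      // Insert before the first strictly greater element, so that equal
      // elements seen earlier stay in front.
      guard let index = try result.firstIndex(where: {
        try areInIncreasingOrder(element, $0)
      }) else { continue }
      result.insert(element, at: index)
      result.removeLast()
    }
    return result
  }
}

let smallestThree = [7, 1, 6, 2, 8, 3, 9].smallest(3, by: <)
// [1, 2, 3]
```

Inserting first and removing the last element afterwards keeps every index valid, and the early return for a zero count avoids touching an empty buffer.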

natecook1000 · 2020-10-19T22:08:02Z

@pyrtsa sortedPrefix is a good name for this! I had been thinking about something like min(5, by: <), but that doesn't indicate that the result is sorted, and would require max(5, by: <) for parity.

I can see the justification for the mutating partial sort, but I don't think I see it for a version that returns the whole array with only part of it sorted. What's the use case there?

rockbruno · 2020-10-20T13:08:02Z

I like sortedPrefix! In fact, our original idea was to return just the desired prefix. We kept the whole array because that's what C++ does, so it would feel more familiar. Do we agree on returning an ArraySlice in all methods, then?

rockbruno · 2020-10-20T13:20:52Z

@khanlou Thanks for the post! It's great to see a comparison chart and we could modify it to become stable.

rockbruno · 2020-10-20T17:22:48Z

I made some benchmarks for us to analyze! I pitted this PR's heap implementation against @khanlou's, @timvermeulen's quicksort version, and the slower .sorted().prefix() as a baseline:

If we fetch a small prefix (4) from an array of increasing size, then SmallestM is a lot quicker, with the others being similar to each other (although I might have made a mistake, because the first time I ran this the quicksort one was faster than the heap one).

The interesting thing is what happens if you increase the size of the prefix instead of the size of the array (now a fixed 500k elements):

SmallestM is a lot faster in general, but if you ask for too large a prefix it becomes worse than sorting the entire array, and it reaches that point sooner than the other algorithms. Here's the same thing but with fewer elements (32k):

The point where this crossover happens gets smaller the larger the array is, but 10% looks like a good number on average. What came to my mind is that the best implementation would likely be @khanlou's, with additional logic that falls back to sorting the entire thing if you ask for a prefix of more than 10% of the array. What are your thoughts on this?
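
A sketch of the fallback decision proposed above might look like this. All names here are hypothetical, and the 10% default simply mirrors the rough figure read off the benchmarks; it is not a tuned constant.

```swift
/// Hypothetical names throughout; this only models the decision, not
/// either of the two algorithms.
enum PrefixStrategy {
  case insertIntoSortedPrefix    // the SmallestM-style approach
  case sortEverythingThenPrefix  // the .sorted().prefix() fallback
}

func chooseStrategy(
  prefixCount: Int,
  collectionCount: Int,
  thresholdPercent: Int = 10
) -> PrefixStrategy {
  // Integer form of prefixCount / collectionCount > thresholdPercent / 100,
  // written without division so an empty collection is not a special case.
  if prefixCount * 100 > collectionCount * thresholdPercent {
    return .sortEverythingThenPrefix
  }
  return .insertIntoSortedPrefix
}

let small = chooseStrategy(prefixCount: 4, collectionCount: 500_000)
let large = chooseStrategy(prefixCount: 100_000, collectionCount: 500_000)
```

Whatever form the threshold takes, tests need inputs on both sides of it so that each branch is actually exercised.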

natecook1000 · 2020-10-30T15:53:27Z

This is looking good — we could use a couple more tests, then we should be ready to merge:

asking for a sortedPrefix larger than the input should return the input, sorted
stability test for multiple equal elements
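
The two requested checks could be sketched like this. The function `standInSortedPrefix` is a stand-in written for this example (a full stable sort followed by a prefix), not the method under review; the checks themselves are the part that would carry over to the real tests.

```swift
/// Stand-in for the method under review (assumption): sort everything with
/// the stdlib's stable sort, then take the prefix.
func standInSortedPrefix<T>(
  _ input: [T],
  _ count: Int,
  by areInIncreasingOrder: (T, T) -> Bool
) -> [T] {
  Array(input.sorted(by: areInIncreasingOrder).prefix(count))
}

let input = [3, 1, 3, 2, 1, 3]

// 1. Asking for more elements than exist returns the whole input, sorted.
let oversized = standInSortedPrefix(input, 10, by: <)
let returnsWholeInputSorted = (oversized == input.sorted())

// 2. Stability: tag each element with its original offset, compare by value
//    only, then check that equal values keep ascending offsets.
let tagged = Array(input.enumerated())
let taggedPrefix = standInSortedPrefix(tagged, 4) { $0.element < $1.element }
var equalElementsKeepOrder = true
for value in Set(taggedPrefix.map(\.element)) {
  let offsets = taggedPrefix.filter { $0.element == value }.map(\.offset)
  if offsets != offsets.sorted() { equalElementsKeepOrder = false }
}
```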

rakaramos · 2020-10-31T19:11:03Z

Awesome! 🎉
I've added the missing tests

Update PartialSortTests.swift

xwu · 2020-10-31T23:29:13Z

Tests/SwiftAlgorithmsTests/PartialSortTests.swift

+    for element in Set(actual) {
+      let filtered = sorted.filter { $0.element == element }.map(\.offset)
+      XCTAssertEqual(filtered, filtered.sorted())
+      }


Suggested change

-      }
+    }

xwu · 2020-10-31T23:30:32Z

Guides/PartialSort.md

@@ -0,0 +1,52 @@
+# Partial Sort (sortedPrefix)


I think it would make sense to rename all documents with consistent terminology:

Suggested change

-# Partial Sort (sortedPrefix)
+# Sorted Prefix

xwu · 2020-10-31T23:31:30Z

Guides/PartialSort.md

+
+```swift
+let numbers = [7,1,6,2,8,3,9]
+let smallestThree = numbers.sortedPrefix(<)


This isn't a correct invocation of the API provided here.

xwu · 2020-10-31T23:33:03Z

Guides/PartialSort.md

+
+### Complexity
+
+The algorithm used is based on [Soroush Khanlou's research on this matter](https://khanlou.com/2018/12/analyzing-complexity/). The total complexity is `O(k log k + nk)`, which will result in a runtime close to `O(n)` if k is a small amount. If k is a large amount (more than 10% of the collection), we fallback to sorting the entire array. Realistically, this means the worst case is actually `O(n log n)`.


Suggested change

-The algorithm used is based on [Soroush Khanlou's research on this matter](https://khanlou.com/2018/12/analyzing-complexity/). The total complexity is `O(k log k + nk)`, which will result in a runtime close to `O(n)` if k is a small amount. If k is a large amount (more than 10% of the collection), we fallback to sorting the entire array. Realistically, this means the worst case is actually `O(n log n)`.
+The algorithm used is based on [Soroush Khanlou's research on this matter](https://khanlou.com/2018/12/analyzing-complexity/). The total complexity is `O(k log k + nk)`, which will result in a runtime close to `O(n)` if k is a small amount. If k is a large amount (more than 10% of the collection), we fall back to sorting the entire array. Realistically, this means the worst case is actually `O(n log n)`.

I'm not sure how the last statement is arrived at. Could you explain?

If we reach a point where O(k log k + nk) is going to be worse than sorting the full array we fall back to stdlib's O(n log n) sort, so in practice it shouldn't get much worse than that.

xwu · 2020-10-31T23:33:32Z

Guides/PartialSort.md

+
+Here are some benchmarks we made that demonstrates how this implementation (SmallestM) behaves when k increases (before implementing the fallback):
+
+![Benchmark](https://i.imgur.com/F5UEQnl.png)


Can we embed these into this project, as opposed to arbitrary external URLs?

xwu

Some nits.

Sources/Algorithms/PartialSort.swift

xwu · 2020-10-31T23:36:27Z

Guides/PartialSort.md

+
+**C++:** The `<algorithm>` library defines a `partial_sort` function where the entire array is returned using a partial heap sort.
+
+**Python:** Defines a `heapq` priority queue that can be used to manually 


Nit: it'd be nice if this document were either consistently wrapped to 80 columns, or else not wrapped. It seems there are two styles here.

Co-authored-by: Xiaodi Wu <13952+xwu@users.noreply.github.com>

ensan-hcl · 2020-11-01T13:19:57Z

Hi! I'd like to point out one very slight but important problem.
This is the graph shown in the document, and the labels of the lines read "笑Priority Queue", "笑SmallestM", "笑QuickSortBased", and "笑Stdlib.sort + prefix".

It looks strange to kanji users because the character "笑" means "laugh" or "smile" in China and Japan, and the graph has nothing to do with such ideas.
If none of you mind, that's OK, but if you think it's a problem, it should be fixed in some way!

But if using "笑" in the graph labels was intentional or follows some convention, I'm so sorry for bothering you🙏

rockbruno · 2020-11-01T13:50:25Z

In Attabench that was supposed to be a colored square, I think my computer might be missing some special font or pack for that to have happened.

rakaramos · 2020-11-23T13:24:16Z

Is there any other thing left for us to do? cc @natecook1000

natecook1000 · 2020-12-01T22:39:49Z

Thanks for your patience, @rakaramos and @rockbruno — this is ready to go! 👍

natecook1000 · 2020-12-01T22:48:34Z

J/K, found a bug — (1...100).sortedPrefix(0) traps on the removeLast() call. The zero-argument tests aren't catching it because they're on collections with fewer than ten elements, which triggers the sort, then prefix branch.

rockbruno · 2020-12-02T10:22:28Z

Ah, looks like we were missing tests for massive inputs. Added a bunch of them now

karwa · 2020-12-02T12:44:13Z

I'd like to question the name sortedPrefix. To me it sounds like it returns a sorted version of the prefix, when it in fact returns a prefix of the sorted elements. In other words:

let numbers = [7,1,6,2,8,3,9]

print(numbers.prefix(3)) // [7, 1, 6]
print(numbers.prefix(3).sorted()) // [1, 6, 7]
print(numbers.sortedPrefix(3)) // [1, 2, 3] - huh? that's not the sorted prefix!

I would suggest perhaps adding a parameter to sorted - e.g. numbers.sorted(count: 3). Names relating to "partial sorting" also avoid this ambiguity IMO.

pyrtsa · 2020-12-02T14:13:08Z

> I'd like to question the name sortedPrefix. To me it sounds like it returns a sorted version of the prefix, when it in fact returns a prefix of the sorted elements. In other words:
> let numbers = [7,1,6,2,8,3,9]
>
> print(numbers.prefix(3)) // [7, 1, 6]
> print(numbers.prefix(3).sorted()) // [1, 6, 7]
> print(numbers.sortedPrefix(3)) // [1, 2, 3] - huh? that's not the sorted prefix!

You're right, that is a possible point of confusion in the naming. However please note that this holds:

print(numbers.sorted().prefix(3)) // [1, 2, 3]
print(numbers.sortedPrefix(3)) // [1, 2, 3]

Edited to add: I think the problem with partial sorting related names (which are good for the mutating algorithm variants!) is that this version only returns the prefix, not the full but partially sorted array.

karwa · 2020-12-02T14:57:33Z

> You're right, that is a possible point of confusion in the naming. However please note that this holds:
> print(numbers.sorted().prefix(3)) // [1, 6, 7]
> print(numbers.sortedPrefix(3)) // [1, 6, 7]

No it doesn’t. The example is taken from the guide document:

let numbers = [7,1,6,2,8,3,9]
let smallestThree = numbers.sortedPrefix(3, by: <)
 // [1, 2, 3]

Oh, you mean .sorted().prefix(3) (you didn’t update the comments). Sure, but that isn’t what the method reads as - it sounds like “sorted” is an adjective describing what happens to the prefix (i.e. the prefix gets sorted)

pyrtsa · 2020-12-02T15:01:46Z

Sorry, I think I only copied the wrong results here. What I mean is the output of numbers.sorted().prefix(n) matches element-wise that of numbers.sortedPrefix(n).

natecook1000 · 2020-12-02T19:12:36Z

Sources/Algorithms/SortedPrefix.swift

+      if let last = result.last, try areInIncreasingOrder(last, e) {
+        continue
+      }


There's still a logic issue here — if e is equal to result.last, execution will pass by this continue statement. That's a problem, because the call to partitioningIndex then returns endIndex, which becomes invalid after the call to result.removeLast(). What you want to ensure is that e is strictly less than result.last before proceeding.

Test case:

Array(repeating: 1, count: 100).sortedPrefix(5) // Fatal error: Array index is out of range

Crap... That's another case that we did have covered in the tests, but the prefix wasn't low enough to trigger the algorithm. I added more high input cases, hopefully it will work now.
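
The difference between the guard quoted in the review and the stricter condition it asks for can be isolated in a small sketch. Both function names are hypothetical, and this only models the skip decision, not the surrounding loop or the fix that was eventually committed.

```swift
/// Guard as quoted in the review: skips `e` only when `last < e`,
/// so an element equal to `last` falls through to the insertion code.
func skipsAsQuoted<T>(
  _ e: T, last: T, by areInIncreasingOrder: (T, T) -> Bool
) -> Bool {
  areInIncreasingOrder(last, e)
}

/// Stricter guard (hypothetical): skips `e` unless `e < last`,
/// so an element equal to `last` is skipped as well.
func skipsUnlessStrictlySmaller<T>(
  _ e: T, last: T, by areInIncreasingOrder: (T, T) -> Bool
) -> Bool {
  !areInIncreasingOrder(e, last)
}

// The equal case is the only one where the two guards disagree.
let equalCaseQuoted = skipsAsQuoted(1, last: 1, by: <)               // false
let equalCaseStrict = skipsUnlessStrictlySmaller(1, last: 1, by: <)  // true
```

With an input such as `Array(repeating: 1, count: 100)`, every element after the seeded prefix hits the equal case, which is why that input exposes the problem.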

Add partial sort algorithm

5429d3b

natecook1000 reviewed Oct 8, 2020

Add in place partial sorting

4362197

rockbruno and others added 5 commits October 9, 2020 14:44

Guide docs

f299df1

Use Indexes

6cd2870

Merge pull request #1 from rakaramos/guide

63b2dd0

Guide docs

Add partial sort tests

88216e1

Indent up to 80 columns

afe7111

rakaramos marked this pull request as ready for review October 9, 2020 13:32

Fix heapify stopping before it should

4652ae7

rockbruno force-pushed the main branch from 64545b8 to 4652ae7 Compare October 9, 2020 14:10

rockbruno added 6 commits October 9, 2020 19:59

Update PartialSort.md

37d494a

Update PartialSort.md

83d5f1e

Update PartialSort.swift

bf31ba1

Cleaning up iterators logic

acb3583

Update PartialSort.swift

6227bd8

Cleaning docs

d4a2e6b

Add more tests (#4)

1d22ef9

rockbruno added 2 commits October 31, 2020 20:19

Update PartialSortTests.swift

62096e1

Merge pull request #5 from rakaramos/rockbruno-patch-1

d0c1ccd

Update PartialSortTests.swift

xwu reviewed Oct 31, 2020

rockbruno and others added 4 commits November 1, 2020 11:42

Update Sources/Algorithms/PartialSort.swift

23bf863

Co-authored-by: Xiaodi Wu <13952+xwu@users.noreply.github.com>

Update Sources/Algorithms/PartialSort.swift

379609b

Co-authored-by: Xiaodi Wu <13952+xwu@users.noreply.github.com>

Update Sources/Algorithms/PartialSort.swift

435a38c

Co-authored-by: Xiaodi Wu <13952+xwu@users.noreply.github.com>

Documentation fixes

70973a2

rockbruno force-pushed the main branch from 1def897 to 70973a2 Compare November 1, 2020 10:53

Add tests for massive inputs

70a263c

natecook1000 requested changes Dec 2, 2020

isLastElement

1d3dcaf

natecook1000 merged commit 3864606 into apple:main Dec 4, 2020

