Add lazy `split` implementation. #78

toddthomas · 2021-02-18T00:24:37Z

This makes

split(maxSplits:omittingEmptySubsequences:whereSeparator:)

and

split(separator:maxSplits:omittingEmptySubsequences:)

evaluate lazily when called on lazy sequences and collections, which should resolve #59. For sequence types, the resulting subsequences are vended as newly-allocated arrays. For collection types, the subsequences are slices and thus don't incur additional heap allocations (if I understand subsequences correctly!).

Discussion

The no-additional-allocations implementation is available only for Collection-conforming types, as the strategy for avoiding new allocations is to have the iterator return subsequences via ranged subscripting of the underlying collection.

Test Plan

I've encountered a lot of interesting edge cases for this algorithm, which has led me to pursue a combinatorial approach to testing it.

Sequences to be split are patterns of separators and non-separators, the latter of which I refer to as "elements" to avoid the negative. All elements are equivalent from the algorithm's perspective, so the patterns can be represented as strings of "S" for separators, and "E" for elements. For example, SSEE is a pattern that begins with two separators and ends with two elements.

Patterns are tested using the Validator type. It compares the results of splitting a collection value as both a lazy sequence and a lazy collection—and with all combinations of the default and a provided maxSplits, and the default (true) and an explicit false omittingEmptySubsequences—with the Standard Library's eager splits, which are assumed to be correct.

The lazy splits of patterns of length 0 through 4 are validated comprehensively. Beyond that, the strategy is to validate splits of all possible relative positions of multiple adjacent separators—which seem to be involved in most edge cases. Specifically, patterns where at least two separators are located at the beginning, in the middle, at the end, and all combinations thereof, are validated.

To accomplish that, the test validates unique permutations of patterns that contain:

two separators, allowing for multiple adjacent separators at the beginning, middle, or end;
four separators, allowing for multiple adjacent separators at any two of those positions;
and six separators, allowing for multiple adjacent separators at any or all three of those positions.

The first two cases are tested for patterns with equal numbers of elements and separators, more elements than separators, and more separators than elements. The six-separator case is only tested with EEESSSSSS for performance reasons.

As this is the swift-algorithms package, I tried to use permutations() for this. But that resulted in unacceptable runtimes, since that method treats repeated elements as separate, resulting in a great number of logically duplicate patterns to be tested. For example, the number of unique permutations of [1, 1, 1, 1, 1, 0, 0, 0, 0] is 126, whereas 9! is 362,880.

Instead, I used Mathematica's Permutations function to generate the permutations, and copied them into the tests as array literals. Mathematica is stunningly fast at generating unique permutations. I'm curious to investigate whether swift-algorithms could have a similarly fast implementation as an option for permutations(). That would make these tests much more concise.

In the current tests, it's likely there is redundancy in the patterns being tested—it's almost certainly not the minimum set needed for complete coverage! The few large array literals make Live Issues evaluation noticeably slower. So it would be desirable to eliminate any redundant patterns. But I'm not currently confident I can identify them. The previous ad-hoc tests missed several bugs, so now I'm erring on the side of caution. Runtime impact is minimal: running all tests in swift-algorithms takes less than 0.1 sec longer with this PR (on my modest, couple-year-old MacBook Pro).

Checklist

I've added at least one test that validates that my change is working, if appropriate
I've followed the code style of the rest of the project
I've read the Contribution Guidelines
I've updated the documentation if necessary

natecook1000

This is a really great start, @toddthomas! 🎉 What would you think of splitting this into two wrapper types, a lazy sequence version that vends arrays, and a lazy collection wrapper that is itself a collection? The collection's index type could simply wrap a range, which would let you re-use your Iterator.next() code in index(after:). Edited to add: if you just want to tackle one of these at a time, that's fine too!

Re your questions:

The Algorithms package is separate from the standard library, so we handle additions through this PR process rather than Swift Evolution. People will only get these lazy versions if they import the package, so this won't change the behavior of any existing code.
Sometimes I've just stored a copy of the sequence when there are a lot of configuration options like this, but I don't think there's anything wrong with what you've got here.
The README will just need a single line description of the addition. We'd also like a brief guide that covers the design of what you're adding in Guides/.
We can audit for inlinability later, but in general public methods should be annotated @inlinable, with internal methods @usableFromInline.

natecook1000 · 2021-02-18T00:37:29Z

Sources/Algorithms/LazySplitCollection.swift

+  ///   elements.
+  ///
+  /// - Complexity: O(*n*), where *n* is the length of the collection.
+  func split(


This version of the method should be in an unconstrained extension, since it doesn't depend on Equatable conformance.

natecook1000 · 2021-02-18T00:40:35Z

Sources/Algorithms/LazySplitCollection.swift

+  }
+}
+
+extension LazySplitCollection.Iterator: IteratorProtocol, Sequence {


We don't typically have custom iterators conform to Sequence, even though it's trivial.

natecook1000 · 2021-02-18T04:20:25Z

Sources/Algorithms/LazySplitCollection.swift

+      lastAdjacentSeparator = subSequenceEnd
+
+      if omittingEmptySubsequences {
+        /// TODO: should be able to replace this raw loop with something like


I think instead of finding lastAdjacentSeparator, it should be simpler to look for the start of the next subsequence here. (And you'll be able to use firstIndex(where:) instead of manual iteration.)

Yes, that is much nicer. Thank you.

timvermeulen · 2021-02-18T18:19:08Z

Sources/Algorithms/LazySplitCollection.swift

+///
+/// - Note: This type is the result of `x.split(maxSplits:omittingEmptySubsequences:whereSeparator)` and
+/// `x.split(separator:maxSplits:omittingEmptySubsequences)`, where `x` conforms to `LazyCollection`.
+public struct LazySplitCollection<Base: LazyCollectionProtocol> where Base.Element: Equatable, Base.Elements.Index == Base.Index {


We don't need a Base.Element: Equatable constraint here.

I think it also simplified things a bit if Base is simply a Collection, and we wrap self.elements instead of self from LazyCollectionProtocol.split, because then we don't need to bother with Base.Elements here.

Eliminated the unnecessary constraint. I need to ponder your other suggestion some more to make sure I fully understand it.

Yes, wrapping a collection instead of a lazy collection did make the code more concise. Thank you.

timvermeulen · 2021-02-18T18:19:57Z

Sources/Algorithms/LazySplitCollection.swift

+  }
+}
+
+extension LazyCollection where Element: Equatable {


Suggested change

extension LazyCollection where Element: Equatable {

extension LazyCollectionProtocol where Element: Equatable {

Oops! Done. Thanks.

toddthomas · 2021-02-18T20:46:04Z

Thanks @timvermeulen, @natecook1000, and anyone else who's reviewed this so far. I appreciate your time.

I'm currently rethinking the iteration algorithm. I must admit the currently-committed version was the result of considering simple cases first, and then trying to handle things like omittingEmptySubsequences = false later. It was getting pretty convoluted. I hope to improve on that now that I understand the more complicated cases better.

After I sort that out, I'll address the feedback. Again, thanks.

toddthomas · 2021-02-19T23:35:07Z

I'm very happy to split (😉) this into two wrapper types, @natecook1000. I have some questions:

I had thought about making LazySplitCollection conform to Collection, but paused at the expectation of endIndex and subscripting being O(1) operations, which I don't think can be done. Would it be okay for them to be O(n) for this class?
Could you please elaborate on your comment about reusing Iterator.next() code in index(after:)? Is there an example of this elsewhere in the Standard Library to which you could point me?

Thanks!

LucianoPAlmeida · 2021-02-21T02:35:06Z

Could you please elaborate on your comment about reusing Iterator.next() code in index(after:)? Is there an example of this elsewhere in the Standard Library to which you could point me?

I think what Nate means was when implementing a lazy collection wrapper that is itself a collection the requirement for conformance to Collection is to implement index(after:) and to implement that you probably can reuse the same logic used on this iterator based next(). For that we would need a custom collection Index for this lazy collection wrapper that would store a range in the same way we do for LazyChunked. You can look how is done here

@natecook1000 or @timvermeulen can elaborate more, but I hope this could be helpful =]

LucianoPAlmeida · 2021-02-21T02:39:22Z

Sources/Algorithms/LazySplitCollection.swift

+//
+//===----------------------------------------------------------------------===//
+
+/// A collection that lazily splits a base collection into subsequences separated by elements that satisfy the


Nit: Really, really a nit, but I think is a code style guide we try to maintain 80 columns max per line.
So as the comment in line 10 marks the limit :)

Funny/sad story: I actually did go through the file and manually reformat all these comments so they were visually wrapped to those comment lines. But I did it in Xcode, which of course renders triple-slash comments in a smaller font. 🤦‍♂️

I'm very happy to follow code style guidelines. I'd looked around a few times for a formatter so it would be painless, but hadn't had much luck. I was just searching again now and found swift-format. I don't know why it eluded me before. Going to give it a try.

Mostly done. swift-format doesn't seem to reformat doc comments for line length, at least with default config. I reformatted most comments manually, but there are still a few with lines longer than 80 in code examples that I didn't want to tackle yet.

LucianoPAlmeida · 2021-02-21T02:45:51Z

Sources/Algorithms/LazySplitCollection.swift

+  /// - Returns: A lazy sequence of subsequences, split from this collection's
+  ///   elements.
+  ///
+  /// - Complexity: O(*n*), where *n* is the length of the collection.


Nit:

Suggested change

/// - Complexity: O(*n*), where *n* is the length of the collection.

/// - Complexity: O(*n*), where *n* is the count of the collection.

"length" comes from the split method comments in Collection.swift. I have no strong preference in this case, but in general I'm inclined to keep the docs for these lazy splits as close to the existing ones as possible. Is there a doc prose style guide that we could defer to?

Ah no worries that is not that important, that was just a suggestion because to me its more reasonable to refer to count for sequences and collections in Swift since the property is count ... but if length is how it is on other docs we can just keep it :))

That is a very compelling argument. I'm less inclined to defer to the existing docs now. 🤔

Either phrasing can be argued for, but "length" is used in this context ubiquitously in the rest of the Algorithms package, so let's keep it consistent 🙂

IIRC, we use length for most of these because sequences don't have a count, per se. They're largely interchangeable otherwise!

toddthomas · 2021-02-21T07:22:57Z

You can look how is done here

Thanks for the pointer! I will study Chunked.swift.

LucianoPAlmeida · 2021-02-21T15:17:05Z

Sources/Algorithms/LazySplitCollection.swift

+  ///   elements.
+  ///
+  /// - Complexity: O(*n*), where *n* is the length of the collection.
+  func split(


Should this be @inlinable?
I think we should look to make functions @inlinable or @usableFromInline where it makes sense :)

Definitely, inlinability is on my to-do list. Currently working on making LazySplitCollection conform to Collection.

natecook1000

I had thought about making LazySplitCollection conform to Collection, but paused at the expectation of endIndex and subscripting being O(1) operations, which I don't think can be done. Would it be okay for them to be O(n) for this class?

You should be able to have the collection version wrap Range<Base.Index>, with endIndex as just base.endIndex..<base.endIndex. The part that will be O(n) is finding the first subsequence — we've been having an ongoing discussion about how lazy collections should handle this, and we're going to move to precalculating and storing startIndex for this kind of collection. So your collection type should end up with both the base collection and its own startIndex, along with the various pieces of configuration, as stored properties. Does that all make sense?

Could you please elaborate on your comment about reusing Iterator.next() code in index(after:)? Is there an example of this elsewhere in the Standard Library to which you could point me?

@LucianoPAlmeida's answer on this one is correct! If you're using a range as the index, the logic in your index(after:) method will be essentially the same as what's in your next() method. (Come to think of it, your index will probably also need to include a subsequence counter so that you can handle maxSplits correctly.)

natecook1000 · 2021-02-22T18:57:32Z

Sources/Algorithms/LazySplitCollection.swift

+  }
+}
+
+extension LazyCollectionProtocol where Elements.Index == Index {


Do you need this Elements.Index constraint? I'm not seeing what effect it would have.

The compiler says no! I think that was a leftover from when the lazy collection was being wrapped. Thank you.

natecook1000 · 2021-02-22T19:02:02Z

Sources/Algorithms/LazySplitSequence.swift

+
+    defer {
+      sequenceLength += 1
+      subsequence = []


It looks like subsequence could be local to this next() method instead of being stored in the iterator.

Yes, much cleaner and more correct. Thank you.

natecook1000 · 2021-02-22T20:23:28Z

Sources/Algorithms/LazySplitSequence.swift

+      if sequenceLength < separatorCount + 1 {
+        if !subsequence.isEmpty || !omittingEmptySubsequences {
+          sequenceLength += 1
+          return subsequence


subsequence isn't getting zeroed out here (making it local would resolve that as well) — try AnySequence([1, 0, 0, 0, 0, 2, 0, 3]).lazy.split(separator: 0) to see the issue.

As you say, fixed with local subsequence, and tests added. Thanks!

LucianoPAlmeida · 2021-02-23T01:35:46Z

Sources/Algorithms/LazySplitSequence.swift

+
+extension LazySplitSequence.Iterator: IteratorProtocol {
+  public mutating func next() -> Element? {
+    /// Separators mark the points where we want to split (cut in two) the base


This seams like a doc comment, should it be above the function instead of the body?

This is not a doc comment, but rather a high-level discussion (more general than any particular implementation) of how the split operation works. I found writing this to be very helpful to my understanding of how any split implementation should work. In retrospect, it might serve better as the introduction to the /Guides/Split.md I need to write. I can see how in its current location its intention is unclear.

Ah I see, thanks! I mention doc comments because comments starting with /// are meant for documentation :)
But I agree such detailed info about algorithm itself could be on the guides. WDYT @natecook1000 @timvermeulen?

I used /// because when I'm mixing prose and code, I like seeing each rendered differently as Xcode does with ///. Should that style only be used for comments intended for the generated documentation?

Yes, /// is for documentation comments.

LucianoPAlmeida · 2021-02-23T01:42:32Z

Sources/Algorithms/LazySplitSequence.swift

+    ///
+    ///     [1, 2, 42, 3, 4, 42, 5].split(separator: 42,
+    ///                                   omittingEmptySubsequences: false)
+    ///     // first split -> [1, 2], [3, 4, 42, 5]


Sorry, maybe I'm missing something here but what is first split and last split? For what I understood the first is a split limited by 2 and last is not limited? Perhaps we could clarify more on that?

Yeah, sorry. Confusing. These are all examples of the high-level algorithm in progress in cases where the original sequence is split twice, resulting in three subsequences. By "first split" I mean, "the state after the first split (due to the first separator)". But my ASCII-art doesn't even make it clear what parts are the returned subsequences, and what part is remaining to be split. I will make this clearer, and also probably remove the whole comment and rewrite it into /Guides/Split.md.

LucianoPAlmeida · 2021-02-23T01:44:15Z

Sources/Algorithms/LazySplitSequence.swift

+    ///     [1, 2, 42, 3, 4, 42, 5, 42].split(separator: 42,
+    ///                                       maxSplits: 2,
+    ///                                       omittingEmptySubsequences: false)
+    ///     // first split -> [1, 2], [3, 4, 42, 5, 42]


Should the last 42 appear in the result here? If I understand correctly it should be an empty subsequence?

That actually meant for the line bellow (line 79) last split =]

In this case with maxSplits == 2, the base sequence is only split twice, even though it contains three separators. Although the docs for the existing eager splits say, "Elements that are used to split the sequence are not returned as part of any subsequence", that is not how those methods actually work. In the case where maxSplits is less than the number of separators, the remaining separators do appear in the last subsequence.

I dunno, is that a bug? Is the documentation saying exactly how it should be working?

Ah no sorry, I think it was just a confusion from my part I see the point now =]

LucianoPAlmeida · 2021-02-23T01:46:31Z

Tests/SwiftAlgorithmsTests/LazySplitCollectionTests.swift

+
+import XCTest
+
+@testable import Algorithms


Do we really need testable import here? For what I could see we only test the public API so we should be fine without it

Suggested change

@testable import Algorithms

import Algorithms

You found a serious error! I hadn't actually made the new splits I'm adding public. I've learned a valuable lesson: @testable is a big hammer that can hide access control bugs. Thank you.

natecook1000

Great progress, @toddthomas! I have a question about how we’re building endIndex in the weird separator-at-end edge case — otherwise this is in good shape!

natecook1000 · 2021-02-25T16:52:17Z

Sources/Algorithms/LazySplitCollection.swift

+        return Index(
+          baseRange: i.baseRange.upperBound..<i.baseRange.upperBound,
+          sequenceLength: i.sequenceLength + 1,
+          separatorCount: i.separatorCount


Something feels off here — I think we want the final subsequence to represent the space between the last element and the end of the collection, which has the range base.endIndex..<base.endIndex. In addition, if we do it this way I think the < and == implementations aren’t going to be correct for this penultimate index.

Here’s an example with the existing split. If we don’t omit the empty subsequences and have a separator-only collection, the resulting ranges are all the empty gaps between elements—including endIndex..<endIndex:

let a = [0, 0, 0] .split(separator: 0, omittingEmptySubsequences: false) .map(\.indices) // a == [0..<0, 1..<1, 2..<2, 3..<3]

Given this, we can’t depend on there being any range of base that would be uniquely the endIndex of LazySplitCollection. Maybe the subsequence counter == maxSplits or Int.max is how endIndex is designated? Otherwise, we’ve used an enumeration with an atEnd case in various other places. What do you think?

I figured there would be questions about this, because I have them too! It is wonky right now. I meant to put a note here, but got distracted thinking about BidirectionalCollection conformance.

I thought perhaps base.endIndex..<base.endIndex was a convention for the end index of these range-based indexes that I should maintain. But that does result in a logically duplicate index in the case where the base collection ends with at least two separators.

I pondered including sequenceLength in the equality comparison, which would fix that particular problem. But even when the collection ends with only one separator, and the final empty sequence is not strictly a duplicate, its range doesn't feel right--it's the empty range right before the last element, instead of after it.

[7, 42].split(separator: 42, omittingEmptySubsequences: false) .map(\.indices) // -> [0..<1, 2..<2] [7, 42].lazy.split(separator: 42, omittingEmptySubsequences: false) .map(\.indices) // -> [0..<1, 1..<1]

So I agree that it feels right to leave base.endIndex..<base.endIndex available for that final empty subsequence. I'm leaning towards using an enumeration, because I think there are some problematic cases trying to use something like sequenceLength == maxSplits + 1 or (sequenceLength - 1) == maxSplits (to avoid a problem at maxSplits == Int.max) as the determining factor.

For example, the following would work:

let parts = [7, 42].lazy.split(separator: 42, omittingEmptySubsequences: false) // maxSplits == Int.max // endIndex == Index(baseRange: Range(2..<2), sequenceLength: Int.max, separatorCount: ?) parts.index(after: parts.startIndex) // -> Index(baseRange: Range(2..<2), sequenceLength: 2, separatorCount: 1)

But

let parts = [7, 42].lazy.split(separator: 42, maxSplits: 1, omittingEmptySubsequences: false) // endIndex == Index(baseRange: Range(2..<2), sequenceLength: Int.max, separatorCount: ?) parts.index(after: parts.startIndex) // -> Index(baseRange: Range(2..<2), sequenceLength: 2, separatorCount: 1)

doesn't, without stretching the meaning of one of the available properties.

Maybe the subsequence counter == maxSplits or Int.max is how endIndex is designated?

Sorry, I think I misunderstood you. When I wrote my response above, I thought you were saying an index with sequenceLength == maxSplits, whatever that was, would be the end index. But now I think you were saying sequenceLength == Int.max is the end, which wouldn't be problematic except in a very extreme edge case. 😄

I still prefer the idea of using an enum as it would be more explicit, but maybe I'll try this first, just to see how it feels, because it's a smaller change.

Defining endIndex as

Index( baseRange: base.endIndex..<base.endIndex, sequenceLength: Int.max, separatorCount: Int.max )

and redefining the equality and comparison operators as

public static func == (lhs: Index, rhs: Index) -> Bool { // `sequenceLength` is equivalent to its index's 1-based position in the // collection of indices. lhs.sequenceLength == rhs.sequenceLength } public static func < (lhs: Index, rhs: Index) -> Bool { lhs.sequenceLength < rhs.sequenceLength }

lets us achieve this goal

[42, 42, 42].lazy.split( separator: 42, omittingEmptySubsequences: false ).map(\.indices) // -> [0..<0, 1..<1, 2..<2, 3..<3]

[Edited to remove a bit that was just me being confused.]

The eager split returns one empty sequence with indices 0..<0, whereas the lazy split returns none because I set startIndex to endIndex for empty base collections.

😮 If I think about it, that makes sense, since splitting something with just one separator yields two empty subsequences. But it's still really odd!

According to the tests, I still have some issues to work out with the empty collection case. Then I do want to make an attempt at an enum version, to see if I prefer that.

I actually played around with an enum implementation before I submitted this PR, but abandoned it because switch statements didn't seem like a good fit in this algorithm, where in the majority of code paths you're not dealing with the .atEnd case. I really like how the enum works in Intersperse, but that type has a lovely symmetry where every method can be just one switch statement. That didn't seem to be the case here, but I want to give it another try.

I ended up pushing the commit that redefined endIndex to have sequenceLength == Int.max. It was a fairly small change--at least in terms of diffs--that fixed a bug in the previous implementation, in addition to allowing the more natural representation of a final empty subsequence, so it seemed like a win.

I'll investigate an enum-based implementation tomorrow.

I've been playing around with an enum-based implementation, and I don't think it's headed toward something I want to submit. However, I don't have much experience designing types based on Swift's super-cool, super-powerful enum, so it might be me that headed in the wrong direction to begin with.

This is the Index I wrote:

extension LazySplitCollection: LazyCollectionProtocol { /// Position of a subsequence in a split collection. public struct Index: Comparable { internal enum Representation: Equatable { case includingEmptySubsequences( /// The range corresponding to the subsequence at this position. baseRange: Range<Base.Index>, /// The number of subsequences up to and including this position in the /// collection. sequenceLength: Int, /// The number of separators encountered up to and including this /// position in the collection. separatorCount: Int ) case omittingEmptySubsequences( /// The range corresponding to the subsequence at this position. baseRange: Range<Base.Index> /// The number of separators encountered up to and inclluding this /// position in the collection. separatorCount: Int ) case end } internal let representation: Representation // [etc.]

and I like how that feels--how the associated types are exactly the properties needed for each case. But in adapting the rest of the implementation to this, I'm finding the amount of code and code duplication is growing, and I'm ending up with too many situations like

switch index { case .omittingEmptySubsequences: // some code case .includingEmptySubsequences: // some mostly similar code with a few crucial differences case .end: // there's a precondition to ensure we never get here! }

In terms of what methods the cases of this enumeration need to participate in, and the implementation for each case, the first two overlap too much, while the third barely overlaps with the first two at all. Is there a rule of thumb to be gleaned here?

But again, I don't have much experience in this area. Did I get off on the wrong foot? I can show additional, more specific code examples if my intent isn't clear.

Thank you for your patience with this PR, and I appreciate the opportunity to learn.

I think the distinction between the omitting and not omitting case isn't as sharp as I first thought. I'm going to try just an element and end case, and see how much of the redundancy and awkwardness that eliminates.

Sounds good — I think I would go that route if I were using an enum as well. I think in this case, since the .max value works well to indicate the endIndex, I'd just stick with that.

natecook1000 · 2021-02-25T16:54:35Z

Sources/Algorithms/LazySplitCollection.swift

+  internal func indexForSubsequence(
+    atOrAfter lowerBound: Base.Index,
+    sequenceLength: Int = 0,
+    separatorCount: Int = 0


These defaults seem a bit risky. Can we be explicit at the call sites instead about what these counters are for the start and end index?

Agreed, yes.

separatorCount is another bit of this index implementation I don't feel great about. It's only correct for all indices when we're not omitting empty subsequences, or when there aren't multiple adjacent separators. Otherwise, we explicitly skip over multiple adjacent separators without counting them, for a more succinct implementation. We don't need an accurate separator count when we're omitting empty subsequences, but it still bugs me that the property is there and not correct.

On the other hand, these are internal properties, and the initializer is internal, so I can view them as implementation details and feel somewhat better. Do you have any concerns about this?

I am thinking the index will become an enum though, as discussed above. Then the end case wouldn't have these properties at all.

LucianoPAlmeida · 2021-02-26T00:47:46Z

Tests/SwiftAlgorithmsTests/LazySplitCollectionTests.swift

+import Algorithms
+import XCTest
+
+final class LazySplitCollectionTests: XCTestCase {


For the collection wrapper it may be a good idea to have a test that executes validateIndexTraversals just to ensure correctness :)

Ah, thank you. I had seen that in another type's test, but at the time it didn't apply to this. Now it does--thank you for reminding me.

I learned validateIndexTraversals requires BidirectionalCollection conformance, which I haven't implemented yet, so adding this test will have to wait a bit.

We can/should break that test into versions that work for regular collections and also bidirectional ones, so no worries on this one.

ALSO: I don't think there's a way to efficiently implement bidirectionality here, because of maxSplits. You'd need to traverse from the beginning every time, defeating the point.

Ah, I see. You're saying the collection needs to be the same forwards and backwards, right?

I think the issue is that with bidirectional conformance we would have to compute the splits form the last to first (implementing index before), but because of maxSplits we can't compute that without knowing the initial splits, so it will always have to start from startIndex which I agree with @natecook1000, defeats the point of implementing the conformance since we can't conceptually traverse from last to first... at least to my understanding

Exactly! Because maxSplits can limit the number of subsequences that are produced, there's no way to calculate the last subsequence without searching from the beginning. BidirectionalCollection conformance is for collections that can efficiently do backward traversal, so there's just no need to do that here.

LucianoPAlmeida · 2021-02-26T00:50:32Z

Sources/Algorithms/LazySplitSequence.swift

+
+    defer {
+      sequenceLength += 1
+      subsequence = []


If subsequence is now local there is not need for this subsequence = [] to be here any more right?

Absolutely. Thank you.

LucianoPAlmeida · 2021-02-26T01:09:38Z

Sources/Algorithms/LazySplitCollection.swift

+    self._startIndex = Index(baseRange: base.startIndex..<base.startIndex)
+
+    if !base.isEmpty {
+      // Precompute the start index.
+      _startIndex = indexForSubsequence(atOrAfter: base.startIndex)
+    }


Nit: That way we avoid create an index that would be discarded if the condition !base.isEmpty is true

Suggested change

self._startIndex = Index(baseRange: base.startIndex..<base.startIndex)

if !base.isEmpty {

// Precompute the start index.

_startIndex = indexForSubsequence(atOrAfter: base.startIndex)

}

if !base.isEmpty {

// Precompute the start index.

self._startIndex = indexForSubsequence(atOrAfter: base.startIndex)

} else {

self._startIndex = Index(baseRange: base.startIndex..<base.startIndex)

}

Nit: That way we avoid create an index that would be discarded if the condition !base.isEmpty is true

When I try this, I get Variable 'self._startIndex' used before being initialized on line 40. I actually tried a similar thing first and ran into the same error, which is why it ended up the way it's committed now. I have a vague notion the compiler sees some difference between assigning the result of an initializer, and copying the result of a method call? Seems like a thing I should know. I'll research it.

Ah sorry my mistake, the compiler is correct in that case ... I just didn't notice that indexForSubsequence is a member function of self so calling self.indexForSubsequence(atOrAfter: base.startIndex) before self._startIndex initialization is definitely an access to self before all properties being initialized... so I think we have to stay with this

Ah, right. Thanks for that explanation. I think the specificity of the error message confused me: self is definitely used before being initialized there, but self._startIndex isn't used in that member function.

Ah this is just diagnostics that could be improved on the compiler :)
We have a bug tracking similar situations where diagnostics could be more specific https://bugs.swift.org/browse/SR-12421

This implementation is provided only for `LazyCollection`-conforming types, because the strategy for avoiding new allocations is to have the iterator return subsequences via ranged subscripting of the underlying collection.

Shuffling to match `LazyMapSequence`.

@natecook1000

The main simplification is replacing the convoluted empty- subsequence-skipping strategy of finding the last adjacent separator with a concise search for the next non-separator. Thanks, @natecook1000. The main correctness improvement is keeping track of the number of elements returned from the iterator, so a final empty subsequence can be returned if necessary in the case where the base collection ends with a separator. Added documentation describing the algorithm. Added test coverage, particularly for `omittingEmptySubsequences == false`.

These are the suggestions I currently understand. Need to ponder the others and maybe ask questions.

@timvermeulen

Thanks, @timvermeulen, that does make the code more concise! I see that this is also the pattern used in `LazyMapSequence`, for example.

Reordered them so they appear in a somewhat more logical progression. Made the names more consistent.

It's now defined as an index with `sequenceLength == Int.max`. This frees up the range `base.endIndex..<base.endIndex` to represent a final empty subsequence, if necessary. Also fixes a bug when splitting an empty collection and not omitting empty subsequences. Test added for this case.

All values for `indexForSubsequence` arguments are now explicit at call sites.

When omitting empty subsequences, we may skip over multiple adjacent separators, so that property wasn't counting separators in that case. It's really counting splits.

Only use doc comments for doc. Wrap all to 80 columns. Also removed high-level explanation of split operation, so it isn't taken to be an explanation of this particular split algorithm.

I love how the latter followed from the former. 😄

Matches the convention of this project.

Added a line to the readme which links to a new LazySplit guide.

Includes multiple adjacent separators.

toddthomas · 2021-03-02T02:35:16Z

I think I've addressed all the feedback thus far.

Note that I combined the LazySplitSequence and LazySplitCollection definitions into a single file, LazySplit.swift, and did the analogous thing with their test classes. That follows what I understand to be the convention in this repo, and allows the new Guides/LazySplit.md to follow the convention I see for guides as well.

Thank you reviewers! I've learned a lot. And I imagine I have at least one more thing yet to learn. 😄

natecook1000

Almost there, @toddthomas! The lazy collection splits look ready; I found one hiccup in the sequence implementation. After that, and a couple of other notes, we'll be all ready to go.

Thanks again for taking this on — the split method is so much more complex than it seems at first glance! 👏👏

natecook1000 · 2021-03-04T15:45:33Z

Sources/Algorithms/LazySplit.swift

+  @usableFromInline
+  internal var _endIndex: Index


Since this is more or less constant regardless of the contents of base, this can just be a computed property.

Thanks. I thought I needed indirection to set _startIndex to endIndex in that one case, but that was just me not reasoning correctly about initializer code (again). I think I might have got it now! 😂

natecook1000 · 2021-03-04T16:45:21Z

Sources/Algorithms/LazySplit.swift

+    // times. In all cases, stop at the end of the base sequence.
+    while currentElement != nil {
+      if separatorCount < maxSplits && isSeparator(currentElement!) {
+        separatorCount += 1


I don't think we want to unconditionally advance separatorCount here, since we're only actually splitting in certain cases (i.e. not when we continue in the first branch below).

Try out this test case:

let s = AnySequence([1, 1, 1, 1, 1, 2, 2]) let expected = s.split(separator: 1, maxSplits: 1) let actual = s.lazy.split(separator: 1, maxSplits: 1)

(Note: the lazy collection version works correctly here.)

Thank you. The code was keeping an accurate count of separators, which was then being used wrongly as the split count. I renamed the property, and it keeps an accurate split count now.

This was the final strike against the ad-hoc tests. I completely revised them using a much more deliberate, combinatorial approach. I updated the PR description to discuss this in detail.

natecook1000 · 2021-03-04T17:26:55Z

Sources/Algorithms/LazySplit.swift

+  }
+}
+
+extension LazySplitCollection.Index: Hashable where Base.Index: Hashable {}


I think the way we have it implemented now, the synthesized conformance would work, but generally when you implement custom Equatable conformance that only considers a subset of a type's members, the Hashable conformance should have the same customization. Can you implement hash(into:) here? (Interestingly, with that done, we also shouldn't need the conditional constraint anymore.)

The computed `startIndex` can be used where we need it in the initializer.

We were correctly recording the separator count, but that wasn't what we needed! Rename `separatorCount` to `splitCount`, and correctly count the splits. Also, completely revamp the tests: - expand coverage of the sequence pattern space; - test each pattern as both sequence and collection; - test each pattern with all combinations of arguments; - provide useful failure message.

Most of the coverage is for sequences with 1, 2, 4, and 8 separators. This adds tests for some longer sequences with an odd number of separators.

kylemacomber · 2021-03-08T17:32:58Z

Is there a reason we're not using permutations(ofCount:) (maybe combined with uniqued()?) to generate the test cases rather than listing them out—it strikes me as more readable/maintainable. Does it just take too long to run?

toddthomas · 2021-03-08T18:42:47Z

Is there a reason we're not using permutations(ofCount:) (maybe combined with uniqued()?) to generate the test cases rather than listing them out—it strikes me as more readable/maintainable. Does it just take too long to run?

I tried that first, for the reasons you mention (and the power of suggestion from working on this package? 😄), and yes, it took too long to run. I proceeded with the array literal approach and made a note to look into the performance of the algorithm later to see if it could be made to work. I had found Mathematica to be so fast at generating unique permutations that I was hopeful it could be done.

But I just double-checked using permutations().uniqued() on the two longest patterns in the test (both of length 9), and actually the performance is not nearly as bad as I remember it. I remember killing the test run after waiting about a minute. Instead, I find that those two tests take a few seconds each.

I recall now that my original tests were with a pattern of length 12 (6 elements, 6 separators)! That, I can confirm, is unworkable with permutations().uniqued(). I've had it running in the background for several minutes now and I'm going to kill it. It's nearly 500 million non-unique permutations, so it's not surprising, but Mathematica does it in less than a millisecond. I'm real curious to know what tricks they're employing!

Prior to this PR, the whole test suite took just under 4 seconds to run. Using the algorithm approach will more than double that, but if that's acceptable, I much prefer it. I didn't feel good about those big array literals, and they really hurt Live Issues performance.

And even if we don't like the extra 5 seconds or so for the length 9 tests, I will use the algorithm for all the shorter patterns where I imagine it will have no performance impact.

Thank you for encouraging me to give this a second look!

(All timings on my very modestly-configured 2018 15" MBP.)

natecook1000 · 2021-03-08T20:32:01Z

I recall now that my original tests were with a pattern of length 12 (6 elements, 6 separators)! That, I can confirm, is unworkable with permutations().uniqued(). I've had it running in the background for several minutes now and I'm going to kill it. It's nearly 500 million non-unique permutations, so it's not surprising, but Mathematica does it in less than a millisecond. I'm real curious to know what tricks they're employing!

Oof! There's a separate algorithm for producing these unique permutations, which runs in much faster time for inputs like these. It's actually in the package already as an implementation detail of permutations() — I've made a public wrapper that we can add later. Let's not worry about changing the lazy split tests now. Thanks for your hard work on getting those built!

kylemacomber · 2021-03-08T21:51:39Z

@natecook1000 that new wrapper is awesome!

natecook1000

This is ready to go! 🎉🎉🎉

@toddthomas this has been a journey, thanks for sticking with it! Do you mind squashing these commits, and then I'll merge? Alternatively, you can just give me a commit message you'd like for the final squashed commit. Thanks!

toddthomas · 2021-03-08T23:11:46Z

I've made a public wrapper that we can add later.

Cool! I made the changes to use the existing algorithm, but I'll stash them and wait for the new hotness. uniqued() is doing a huge amount of work in these cases that is better not done at all.

toddthomas · 2021-03-08T23:19:19Z

This is ready to go! 🎉🎉🎉

@toddthomas this has been a journey, thanks for sticking with it!

🥳😂 Thanks so much for all the careful reviewing and wise suggestions. It's been a great learning experience.

Do you mind squashing these commits, and then I'll merge? Alternatively, you can just give me a commit message you'd like for the final squashed commit. Thanks!

If you don't mind squashing when you merge, I'd like to keep the history of the journey in my repo for a bit. I'll follow up with the commit message. Thanks.

toddthomas · 2021-03-08T23:41:06Z

I'll follow up with the commit message.

Add lazy implementations of `split`.

`split(maxSplits:omittingEmptySubsequences:whereSeparator:)`

and

`split(separator:maxSplits:omittingEmptySubsequences:)`

now operate lazily when called on `LazySequenceProtocol`- or
`LazyCollectionProtocol`-conforming values. In the latter case,
no additional storage is allocated on the heap.

toddthomas changed the title ~~Add lazy split implementation for LazyCollection-conforming types.~~ Add lazy split implementation for lazy collection types. Feb 18, 2021

toddthomas changed the title ~~Add lazy split implementation for lazy collection types.~~ WIP: Add lazy split implementation for lazy collection types. Feb 18, 2021

natecook1000 reviewed Feb 18, 2021

View reviewed changes

timvermeulen reviewed Feb 18, 2021

View reviewed changes

toddthomas changed the title ~~WIP: Add lazy split implementation for lazy collection types.~~ Add lazy split implementation for lazy collection types. Feb 19, 2021

LucianoPAlmeida reviewed Feb 21, 2021

View reviewed changes

toddthomas changed the title ~~Add lazy split implementation for lazy collection types.~~ Add lazy split implementation. Feb 21, 2021

LucianoPAlmeida reviewed Feb 21, 2021

View reviewed changes

natecook1000 reviewed Feb 22, 2021

View reviewed changes

LucianoPAlmeida reviewed Feb 23, 2021

View reviewed changes

natecook1000 reviewed Feb 25, 2021

View reviewed changes

LucianoPAlmeida reviewed Feb 26, 2021

View reviewed changes

toddthomas added 11 commits March 1, 2021 14:34

Add a lazy implementation of split.

8c32f20

This implementation is provided only for `LazyCollection`-conforming types, because the strategy for avoiding new allocations is to have the iterator return subsequences via ranged subscripting of the underlying collection.

Shuffle the conformance declarations and test lazy sequence conformance.

9640fff

Shuffling to match `LazyMapSequence`.

Add precondition check for maxSplits.

90a928b

Hoist a check out of a loop and note future improvement.

84bf789

Match project style for array literals.

275eaaf

Test trailing multiple adjacent separators.

900ae90

Implement some protocol conformance suggestions.

ae3ab9a

These are the suggestions I currently understand. Need to ponder the others and maybe ask questions.

Add a test for maxSplits == 0.

68cd8ad

Fix doc typo.

31d26f7

Wrap a collection instead of a lazy collection.

ab8c71d

Thanks, @timvermeulen, that does make the code more concise! I see that this is also the pattern used in `LazyMapSequence`, for example.

toddthomas added 8 commits March 1, 2021 14:34

Reorganize the lazy collection tests.

fe115ed

Reordered them so they appear in a somewhat more logical progression. Made the names more consistent.

Remove risky default argument values.

c880918

All values for `indexForSubsequence` arguments are now explicit at call sites.

Give separatorCount the more accurate name splitCount.

d30e7ae

When omitting empty subsequences, we may skip over multiple adjacent separators, so that property wasn't counting separators in that case. It's really counting splits.

Reformat comments.

a82d266

Only use doc comments for doc. Wrap all to 80 columns. Also removed high-level explanation of split operation, so it isn't taken to be an explanation of this particular split algorithm.

Comment next and simplify its termination logic.

34573d4

I love how the latter followed from the former. 😄

A few more comments.

73237e9

swift-format gives its opinions.

0f5d9e3

toddthomas force-pushed the lazy-split branch from b898f6c to 0f5d9e3 Compare March 1, 2021 23:20

toddthomas added 5 commits March 1, 2021 16:37

Use one file each for wrapper type definitions and tests.

305a687

Matches the convention of this project.

Update docs.

ad30d89

Added a line to the readme which links to a new LazySplit guide.

Use sequence examples in the sequence docs.

4df60a6

Better example of splitting a sequence.

3e50f85

Includes multiple adjacent separators.

Add inlinablity.

c524411

natecook1000 reviewed Mar 4, 2021

View reviewed changes

toddthomas added 5 commits March 5, 2021 23:15

Remove unnecessary property.

8c60c56

The computed `startIndex` can be used where we need it in the initializer.

Use documented convention for Hashable conformance.

d1846fa

Add cases with 3 separators.

7a4a0eb

Most of the coverage is for sequences with 1, 2, 4, and 8 separators. This adds tests for some longer sequences with an odd number of separators.

A few more tests to complete length 4.

2001237

natecook1000 approved these changes Mar 8, 2021

View reviewed changes

natecook1000 merged commit e7c6716 into apple:main Mar 9, 2021

	extension LazyCollection where Element: Equatable {
	extension LazyCollectionProtocol where Element: Equatable {

	/// - Complexity: O(n), where n is the length of the collection.
	/// - Complexity: O(n), where n is the count of the collection.

Add lazy split implementation. #78

Add lazy split implementation. #78

Conversation

toddthomas commented Feb 18, 2021 • edited Loading

Discussion

Test Plan

Checklist

natecook1000 left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

toddthomas commented Feb 18, 2021

toddthomas commented Feb 19, 2021

LucianoPAlmeida commented Feb 21, 2021

Choose a reason for hiding this comment

toddthomas Feb 21, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

toddthomas Feb 21, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

toddthomas commented Feb 21, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

natecook1000 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

toddthomas Feb 23, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

toddthomas Feb 23, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

natecook1000 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

toddthomas Feb 25, 2021 • edited Loading

Choose a reason for hiding this comment

toddthomas Feb 25, 2021 • edited Loading

Choose a reason for hiding this comment

toddthomas Feb 25, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

toddthomas Feb 26, 2021 • edited Loading

Choose a reason for hiding this comment

toddthomas Feb 28, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Add lazy `split` implementation. #78

Add lazy `split` implementation. #78

toddthomas commented Feb 18, 2021 •

edited

Loading

natecook1000 left a comment •

edited

Loading

toddthomas Feb 21, 2021 •

edited

Loading

toddthomas Feb 21, 2021 •

edited

Loading

toddthomas Feb 23, 2021 •

edited

Loading

toddthomas Feb 23, 2021 •

edited

Loading

toddthomas Feb 25, 2021 •

edited

Loading

toddthomas Feb 25, 2021 •

edited

Loading

toddthomas Feb 25, 2021 •

edited

Loading

toddthomas Feb 26, 2021 •

edited

Loading

toddthomas Feb 28, 2021 •

edited

Loading

LucianoPAlmeida Feb 26, 2021 •

edited

Loading