Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add lazy split implementation. #78

Merged
merged 36 commits into from
Mar 9, 2021
Merged

Conversation

toddthomas
Copy link
Contributor

@toddthomas toddthomas commented Feb 18, 2021

This makes

split(maxSplits:omittingEmptySubsequences:whereSeparator:)

and

split(separator:maxSplits:omittingEmptySubsequences:)

evaluate lazily when called on lazy sequences and collections, which should resolve #59. For sequence types, the resulting subsequences are vended as newly-allocated arrays. For collection types, the subsequences are slices and thus don't incur additional heap allocations (if I understand subsequences correctly!).

Discussion

The no-additional-allocations implementation is available only for Collection-conforming types, as the strategy for avoiding new allocations is to have the iterator return subsequences via ranged subscripting of the underlying collection.

Test Plan

I've encountered a lot of interesting edge cases for this algorithm, which has led me to pursue a combinatorial approach to testing it.

Sequences to be split are patterns of separators and non-separators, the latter of which I refer to as "elements" to avoid the negative. All elements are equivalent from the algorithm's perspective, so the patterns can be represented as strings of "S" for separators, and "E" for elements. For example, SSEE is a pattern that begins with two separators and ends with two elements.

Patterns are tested using the Validator type. It compares the results of splitting a collection value as both a lazy sequence and a lazy collection—and with all combinations of the default and a provided maxSplits, and the default (true) and an explicit false omittingEmptySubsequences—with the Standard Library's eager splits, which are assumed to be correct.

The lazy splits of patterns of length 0 through 4 are validated comprehensively. Beyond that, the strategy is to validate splits of all possible relative positions of multiple adjacent separators—which seem to be involved in most edge cases. Specifically, patterns where at least two separators are located at the beginning, in the middle, at the end, and all combinations thereof, are validated.

To accomplish that, the test validates unique permutations of patterns that contain:

  • two separators, allowing for multiple adjacent separators at the beginning, middle, or end;
  • four separators, allowing for multiple adjacent separators at any two of those positions;
  • and six separators, allowing for multiple adjacent separators at any or all three of those positions.

The first two cases are tested for patterns with equal numbers of elements and separators, more elements than separators, and more separators than elements. The six-separator case is only tested with EEESSSSSS for performance reasons.

As this is the swift-algorithms package, I tried to use permutations() for this. But that resulted in unacceptable runtimes, since that method treats repeated elements as separate, resulting in a great number of logically duplicate patterns to be tested. For example, the number of unique permutations of [1, 1, 1, 1, 1, 0, 0, 0, 0] is 126, whereas 9! is 362,880.

Instead, I used Mathematica's Permutations function to generate the permutations, and copied them into the tests as array literals. Mathematica is stunningly fast at generating unique permutations. I'm curious to investigate whether swift-algorithms could have a similarly fast implementation as an option for permutations(). That would make these tests much more concise.

In the current tests, it's likely there is redundancy in the patterns being tested—it's almost certainly not the minimum set needed for complete coverage! The few large array literals make Live Issues evaluation noticeably slower. So it would be desirable to eliminate any redundant patterns. But I'm not currently confident I can identify them. The previous ad-hoc tests missed several bugs, so now I'm erring on the side of caution. Runtime impact is minimal: running all tests in swift-algorithms takes less than 0.1 sec longer with this PR (on my modest, couple-year-old MacBook Pro).

Checklist

  • I've added at least one test that validates that my change is working, if appropriate
  • I've followed the code style of the rest of the project
  • I've read the Contribution Guidelines
  • I've updated the documentation if necessary

@toddthomas toddthomas changed the title Add lazy split implementation for LazyCollection-conforming types. Add lazy split implementation for lazy collection types. Feb 18, 2021
@toddthomas toddthomas changed the title Add lazy split implementation for lazy collection types. WIP: Add lazy split implementation for lazy collection types. Feb 18, 2021
Copy link
Member

@natecook1000 natecook1000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really great start, @toddthomas! 🎉 What would you think of splitting this into two wrapper types, a lazy sequence version that vends arrays, and a lazy collection wrapper that is itself a collection? The collection's index type could simply wrap a range, which would let you re-use your Iterator.next() code in index(after:). Edited to add: if you just want to tackle one of these at a time, that's fine too!

Re your questions:

  • The Algorithms package is separate from the standard library, so we handle additions through this PR process rather than Swift Evolution. People will only get these lazy versions if they import the package, so this won't change the behavior of any existing code.
  • Sometimes I've just stored a copy of the sequence when there are a lot of configuration options like this, but I don't think there's anything wrong with what you've got here.
  • The README will just need a single line description of the addition. We'd also like a brief guide that covers the design of what you're adding in Guides/.
  • We can audit for inlinability later, but in general public methods should be annotated @inlinable, with internal methods @usableFromInline.

/// elements.
///
/// - Complexity: O(*n*), where *n* is the length of the collection.
func split(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This version of the method should be in an unconstrained extension, since it doesn't depend on Equatable conformance.

}
}

extension LazySplitCollection.Iterator: IteratorProtocol, Sequence {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't typically have custom iterators conform to Sequence, even though it's trivial.

lastAdjacentSeparator = subSequenceEnd

if omittingEmptySubsequences {
/// TODO: should be able to replace this raw loop with something like
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of finding lastAdjacentSeparator, it should be simpler to look for the start of the next subsequence here. (And you'll be able to use firstIndex(where:) instead of manual iteration.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that is much nicer. Thank you.

///
/// - Note: This type is the result of `x.split(maxSplits:omittingEmptySubsequences:whereSeparator)` and
/// `x.split(separator:maxSplits:omittingEmptySubsequences)`, where `x` conforms to `LazyCollection`.
public struct LazySplitCollection<Base: LazyCollectionProtocol> where Base.Element: Equatable, Base.Elements.Index == Base.Index {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need a Base.Element: Equatable constraint here.

I think it also simplified things a bit if Base is simply a Collection, and we wrap self.elements instead of self from LazyCollectionProtocol.split, because then we don't need to bother with Base.Elements here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eliminated the unnecessary constraint. I need to ponder your other suggestion some more to make sure I fully understand it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, wrapping a collection instead of a lazy collection did make the code more concise. Thank you.

}
}

extension LazyCollection where Element: Equatable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
extension LazyCollection where Element: Equatable {
extension LazyCollectionProtocol where Element: Equatable {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! Done. Thanks.

@toddthomas
Copy link
Contributor Author

Thanks @timvermeulen, @natecook1000, and anyone else who's reviewed this so far. I appreciate your time.

I'm currently rethinking the iteration algorithm. I must admit the currently-committed version was the result of considering simple cases first, and then trying to handle things like omittingEmptySubsequences = false later. It was getting pretty convoluted. I hope to improve on that now that I understand the more complicated cases better.

After I sort that out, I'll address the feedback. Again, thanks.

@toddthomas toddthomas changed the title WIP: Add lazy split implementation for lazy collection types. Add lazy split implementation for lazy collection types. Feb 19, 2021
@toddthomas
Copy link
Contributor Author

I'm very happy to split (😉) this into two wrapper types, @natecook1000. I have some questions:

  1. I had thought about making LazySplitCollection conform to Collection, but paused at the expectation of endIndex and subscripting being O(1) operations, which I don't think can be done. Would it be okay for them to be O(n) for this class?
  2. Could you please elaborate on your comment about reusing Iterator.next() code in index(after:)? Is there an example of this elsewhere in the Standard Library to which you could point me?

Thanks!

@LucianoPAlmeida
Copy link
Contributor

  1. Could you please elaborate on your comment about reusing Iterator.next() code in index(after:)? Is there an example of this elsewhere in the Standard Library to which you could point me?

I think what Nate means was when implementing a lazy collection wrapper that is itself a collection the requirement for conformance to Collection is to implement index(after:) and to implement that you probably can reuse the same logic used on this iterator based next(). For that we would need a custom collection Index for this lazy collection wrapper that would store a range in the same way we do for LazyChunked. You can look how is done here

@natecook1000 or @timvermeulen can elaborate more, but I hope this could be helpful =]

//
//===----------------------------------------------------------------------===//

/// A collection that lazily splits a base collection into subsequences separated by elements that satisfy the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Really, really a nit, but I think is a code style guide we try to maintain 80 columns max per line.
So as the comment in line 10 marks the limit :)

Copy link
Contributor Author

@toddthomas toddthomas Feb 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funny/sad story: I actually did go through the file and manually reformat all these comments so they were visually wrapped to those comment lines. But I did it in Xcode, which of course renders triple-slash comments in a smaller font. 🤦‍♂️

I'm very happy to follow code style guidelines. I'd looked around a few times for a formatter so it would be painless, but hadn't had much luck. I was just searching again now and found swift-format. I don't know why it eluded me before. Going to give it a try.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly done. swift-format doesn't seem to reformat doc comments for line length, at least with default config. I reformatted most comments manually, but there are still a few with lines longer than 80 in code examples that I didn't want to tackle yet.

/// - Returns: A lazy sequence of subsequences, split from this collection's
/// elements.
///
/// - Complexity: O(*n*), where *n* is the length of the collection.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
/// - Complexity: O(*n*), where *n* is the length of the collection.
/// - Complexity: O(*n*), where *n* is the count of the collection.

Copy link
Contributor Author

@toddthomas toddthomas Feb 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"length" comes from the split method comments in Collection.swift. I have no strong preference in this case, but in general I'm inclined to keep the docs for these lazy splits as close to the existing ones as possible. Is there a doc prose style guide that we could defer to?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah no worries that is not that important, that was just a suggestion because to me its more reasonable to refer to count for sequences and collections in Swift since the property is count ... but if length is how it is on other docs we can just keep it :))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a very compelling argument. I'm less inclined to defer to the existing docs now. 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either phrasing can be argued for, but "length" is used in this context ubiquitously in the rest of the Algorithms package, so let's keep it consistent 🙂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, we use length for most of these because sequences don't have a count, per se. They're largely interchangeable otherwise!

@toddthomas toddthomas changed the title Add lazy split implementation for lazy collection types. Add lazy split implementation. Feb 21, 2021
@toddthomas
Copy link
Contributor Author

You can look how is done here

Thanks for the pointer! I will study Chunked.swift.

/// elements.
///
/// - Complexity: O(*n*), where *n* is the length of the collection.
func split(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be @inlinable?
I think we should look to make functions @inlinable or @usableFromInline where it makes sense :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, inlinability is on my to-do list. Currently working on making LazySplitCollection conform to Collection.

Copy link
Member

@natecook1000 natecook1000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I had thought about making LazySplitCollection conform to Collection, but paused at the expectation of endIndex and subscripting being O(1) operations, which I don't think can be done. Would it be okay for them to be O(n) for this class?

You should be able to have the collection version wrap Range<Base.Index>, with endIndex as just base.endIndex..<base.endIndex. The part that will be O(n) is finding the first subsequence — we've been having an ongoing discussion about how lazy collections should handle this, and we're going to move to precalculating and storing startIndex for this kind of collection. So your collection type should end up with both the base collection and its own startIndex, along with the various pieces of configuration, as stored properties. Does that all make sense?

  1. Could you please elaborate on your comment about reusing Iterator.next() code in index(after:)? Is there an example of this elsewhere in the Standard Library to which you could point me?

@LucianoPAlmeida's answer on this one is correct! If you're using a range as the index, the logic in your index(after:) method will be essentially the same as what's in your next() method. (Come to think of it, your index will probably also need to include a subsequence counter so that you can handle maxSplits correctly.)

}
}

extension LazyCollectionProtocol where Elements.Index == Index {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need this Elements.Index constraint? I'm not seeing what effect it would have.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler says no! I think that was a leftover from when the lazy collection was being wrapped. Thank you.


defer {
sequenceLength += 1
subsequence = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like subsequence could be local to this next() method instead of being stored in the iterator.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, much cleaner and more correct. Thank you.

if sequenceLength < separatorCount + 1 {
if !subsequence.isEmpty || !omittingEmptySubsequences {
sequenceLength += 1
return subsequence
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subsequence isn't getting zeroed out here (making it local would resolve that as well) — try AnySequence([1, 0, 0, 0, 0, 2, 0, 3]).lazy.split(separator: 0) to see the issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you say, fixed with local subsequence, and tests added. Thanks!


extension LazySplitSequence.Iterator: IteratorProtocol {
public mutating func next() -> Element? {
/// Separators mark the points where we want to split (cut in two) the base
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seams like a doc comment, should it be above the function instead of the body?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a doc comment, but rather a high-level discussion (more general than any particular implementation) of how the split operation works. I found writing this to be very helpful to my understanding of how any split implementation should work. In retrospect, it might serve better as the introduction to the /Guides/Split.md I need to write. I can see how in its current location its intention is unclear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, thanks! I mention doc comments because comments starting with /// are meant for documentation :)
But I agree such detailed info about algorithm itself could be on the guides. WDYT @natecook1000 @timvermeulen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used /// because when I'm mixing prose and code, I like seeing each rendered differently as Xcode does with ///. Should that style only be used for comments intended for the generated documentation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, /// is for documentation comments.

///
/// [1, 2, 42, 3, 4, 42, 5].split(separator: 42,
/// omittingEmptySubsequences: false)
/// // first split -> [1, 2], [3, 4, 42, 5]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, maybe I'm missing something here but what is first split and last split? For what I understood the first is a split limited by 2 and last is not limited? Perhaps we could clarify more on that?

Copy link
Contributor Author

@toddthomas toddthomas Feb 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sorry. Confusing. These are all examples of the high-level algorithm in progress in cases where the original sequence is split twice, resulting in three subsequences. By "first split" I mean, "the state after the first split (due to the first separator)". But my ASCII-art doesn't even make it clear what parts are the returned subsequences, and what part is remaining to be split. I will make this clearer, and also probably remove the whole comment and rewrite it into /Guides/Split.md.

/// [1, 2, 42, 3, 4, 42, 5, 42].split(separator: 42,
/// maxSplits: 2,
/// omittingEmptySubsequences: false)
/// // first split -> [1, 2], [3, 4, 42, 5, 42]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the last 42 appear in the result here? If I understand correctly it should be an empty subsequence?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That actually meant for the line bellow (line 79) last split =]

Copy link
Contributor Author

@toddthomas toddthomas Feb 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case with maxSplits == 2, the base sequence is only split twice, even though it contains three separators. Although the docs for the existing eager splits say, "Elements that are used to split the sequence are not returned as part of any subsequence", that is not how those methods actually work. In the case where maxSplits is less than the number of separators, the remaining separators do appear in the last subsequence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dunno, is that a bug? Is the documentation saying exactly how it should be working?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah no sorry, I think it was just a confusion from my part I see the point now =]


import XCTest

@testable import Algorithms
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need testable import here? For what I could see we only test the public API so we should be fine without it

Suggested change
@testable import Algorithms
import Algorithms

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You found a serious error! I hadn't actually made the new splits I'm adding public. I've learned a valuable lesson: @testable is a big hammer that can hide access control bugs. Thank you.

Copy link
Member

@natecook1000 natecook1000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great progress, @toddthomas! I have a question about how we’re building endIndex in the weird separator-at-end edge case — otherwise this is in good shape!

return Index(
baseRange: i.baseRange.upperBound..<i.baseRange.upperBound,
sequenceLength: i.sequenceLength + 1,
separatorCount: i.separatorCount
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something feels off here — I think we want the final subsequence to represent the space between the last element and the end of the collection, which has the range base.endIndex..<base.endIndex. In addition, if we do it this way I think the < and == implementations aren’t going to be correct for this penultimate index.

Here’s an example with the existing split. If we don’t omit the empty subsequences and have a separator-only collection, the resulting ranges are all the empty gaps between elements—including endIndex..<endIndex:

let a = [0, 0, 0]
    .split(separator: 0, omittingEmptySubsequences: false)
    .map(\.indices)
// a == [0..<0, 1..<1, 2..<2, 3..<3]

Given this, we can’t depend on there being any range of base that would be uniquely the endIndex of LazySplitCollection. Maybe the subsequence counter == maxSplits or Int.max is how endIndex is designated? Otherwise, we’ve used an enumeration with an atEnd case in various other places. What do you think?

Copy link
Contributor Author

@toddthomas toddthomas Feb 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured there would be questions about this, because I have them too! It is wonky right now. I meant to put a note here, but got distracted thinking about BidirectionalCollection conformance.

I thought perhaps base.endIndex..<base.endIndex was a convention for the end index of these range-based indexes that I should maintain. But that does result in a logically duplicate index in the case where the base collection ends with at least two separators.

I pondered including sequenceLength in the equality comparison, which would fix that particular problem. But even when the collection ends with only one separator, and the final empty sequence is not strictly a duplicate, its range doesn't feel right--it's the empty range right before the last element, instead of after it.

[7, 42].split(separator: 42, omittingEmptySubsequences: false)
  .map(\.indices)
// -> [0..<1, 2..<2]
[7, 42].lazy.split(separator: 42, omittingEmptySubsequences: false)
  .map(\.indices)
// -> [0..<1, 1..<1]

So I agree that it feels right to leave base.endIndex..<base.endIndex available for that final empty subsequence. I'm leaning towards using an enumeration, because I think there are some problematic cases trying to use something like sequenceLength == maxSplits + 1 or (sequenceLength - 1) == maxSplits (to avoid a problem at maxSplits == Int.max) as the determining factor.

For example, the following would work:

let parts = [7, 42].lazy.split(separator: 42, omittingEmptySubsequences: false)
// maxSplits == Int.max
// endIndex == Index(baseRange: Range(2..<2), sequenceLength: Int.max, separatorCount: ?)
parts.index(after: parts.startIndex)
// -> Index(baseRange: Range(2..<2), sequenceLength: 2, separatorCount: 1)

But

let parts = [7, 42].lazy.split(separator: 42, maxSplits: 1, omittingEmptySubsequences: false)
// endIndex == Index(baseRange: Range(2..<2), sequenceLength: Int.max, separatorCount: ?)
parts.index(after: parts.startIndex)
// -> Index(baseRange: Range(2..<2), sequenceLength: 2, separatorCount: 1)

doesn't, without stretching the meaning of one of the available properties.

Copy link
Contributor Author

@toddthomas toddthomas Feb 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the subsequence counter == maxSplits or Int.max is how endIndex is designated?

Sorry, I think I misunderstood you. When I wrote my response above, I thought you were saying an index with sequenceLength == maxSplits, whatever that was, would be the end index. But now I think you were saying sequenceLength == Int.max is the end, which wouldn't be problematic except in a very extreme edge case. 😄

I still prefer the idea of using an enum as it would be more explicit, but maybe I'll try this first, just to see how it feels, because it's a smaller change.

Copy link
Contributor Author

@toddthomas toddthomas Feb 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Defining endIndex as

Index(
  baseRange: base.endIndex..<base.endIndex,
  sequenceLength: Int.max,
  separatorCount: Int.max
)

and redefining the equality and comparison operators as

public static func == (lhs: Index, rhs: Index) -> Bool {
  // `sequenceLength` is equivalent to its index's 1-based position in the
  // collection of indices.
  lhs.sequenceLength == rhs.sequenceLength
}

public static func < (lhs: Index, rhs: Index) -> Bool {
  lhs.sequenceLength < rhs.sequenceLength
}

lets us achieve this goal

[42, 42, 42].lazy.split(
  separator: 42,
  omittingEmptySubsequences: false
).map(\.indices)
// -> [0..<0, 1..<1, 2..<2, 3..<3]

[Edited to remove a bit that was just me being confused.]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The eager split returns one empty sequence with indices 0..<0, whereas the lazy split returns none because I set startIndex to endIndex for empty base collections.

😮 If I think about it, that makes sense, since splitting something with just one separator yields two empty subsequences. But it's still really odd!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the tests, I still have some issues to work out with the empty collection case. Then I do want to make an attempt at an enum version, to see if I prefer that.

I actually played around with an enum implementation before I submitted this PR, but abandoned it because switch statements didn't seem like a good fit in this algorithm, where in the majority of code paths you're not dealing with the .atEnd case. I really like how the enum works in Intersperse, but that type has a lovely symmetry where every method can be just one switch statement. That didn't seem to be the case here, but I want to give it another try.

Copy link
Contributor Author

@toddthomas toddthomas Feb 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up pushing the commit that redefined endIndex to have sequenceLength == Int.max. It was a fairly small change--at least in terms of diffs--that fixed a bug in the previous implementation, in addition to allowing the more natural representation of a final empty subsequence, so it seemed like a win.

I'll investigate an enum-based implementation tomorrow.

Copy link
Contributor Author

@toddthomas toddthomas Feb 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been playing around with an enum-based implementation, and I don't think it's headed toward something I want to submit. However, I don't have much experience designing types based on Swift's super-cool, super-powerful enum, so it might be me that headed in the wrong direction to begin with.

This is the Index I wrote:

extension LazySplitCollection: LazyCollectionProtocol {
  /// Position of a subsequence in a split collection.
  public struct Index: Comparable {
    internal enum Representation: Equatable {
      case includingEmptySubsequences(
            /// The range corresponding to the subsequence at this position.
            baseRange: Range<Base.Index>,
            /// The number of subsequences up to and including this position in the
            /// collection.
            sequenceLength: Int,
            /// The number of separators encountered up to and including this
            /// position in the collection.
            separatorCount: Int
           )
      case omittingEmptySubsequences(
            /// The range corresponding to the subsequence at this position.
            baseRange: Range<Base.Index>
            /// The number of separators encountered up to and inclluding this
            /// position in the collection.
            separatorCount: Int
           )
      case end
    }

    internal let representation: Representation
// [etc.]

and I like how that feels--how the associated types are exactly the properties needed for each case. But in adapting the rest of the implementation to this, I'm finding the amount of code and code duplication is growing, and I'm ending up with too many situations like

switch index {
case .omittingEmptySubsequences:
  // some code
case .includingEmptySubsequences:
  // some mostly similar code with a few crucial differences
case .end:
  // there's a precondition to ensure we never get here!
}

In terms of what methods the cases of this enumeration need to participate in, and the implementation for each case, the first two overlap too much, while the third barely overlaps with the first two at all. Is there a rule of thumb to be gleaned here?

But again, I don't have much experience in this area. Did I get off on the wrong foot? I can show additional, more specific code examples if my intent isn't clear.

Thank you for your patience with this PR, and I appreciate the opportunity to learn.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the distinction between the omitting and not omitting case isn't as sharp as I first thought. I'm going to try just an element and end case, and see how much of the redundancy and awkwardness that eliminates.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good — I think I would go that route if I were using an enum as well. I think in this case, since the .max value works well to indicate the endIndex, I'd just stick with that.

internal func indexForSubsequence(
atOrAfter lowerBound: Base.Index,
sequenceLength: Int = 0,
separatorCount: Int = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These defaults seem a bit risky. Can we be explicit at the call sites instead about what these counters are for the start and end index?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, yes.

separatorCount is another bit of this index implementation I don't feel great about. It's only correct for all indices when we're not omitting empty subsequences, or when there aren't multiple adjacent separators. Otherwise, we explicitly skip over multiple adjacent separators without counting them, for a more succinct implementation. We don't need an accurate separator count when we're omitting empty subsequences, but it still bugs me that the property is there and not correct.

On the other hand, these are internal properties, and the initializer is internal, so I can view them as implementation details and feel somewhat better. Do you have any concerns about this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking the index will become an enum though, as discussed above. Then the end case wouldn't have these properties at all.

import Algorithms
import XCTest

final class LazySplitCollectionTests: XCTestCase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the collection wrapper it may be a good idea to have a test that executes validateIndexTraversals just to ensure correctness :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thank you. I had seen that in another type's test, but at the time it didn't apply to this. Now it does--thank you for reminding me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I learned validateIndexTraversals requires BidirectionalCollection conformance, which I haven't implemented yet, so adding this test will have to wait a bit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can/should break that test into versions that work for regular collections and also bidirectional ones, so no worries on this one.

ALSO: I don't think there's a way to efficiently implement bidirectionality here, because of maxSplits. You'd need to traverse from the beginning every time, defeating the point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. You're saying the collection needs to be the same forwards and backwards, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue is that with bidirectional conformance we would have to compute the splits form the last to first (implementing index before), but because of maxSplits we can't compute that without knowing the initial splits, so it will always have to start from startIndex which I agree with @natecook1000, defeats the point of implementing the conformance since we can't conceptually traverse from last to first... at least to my understanding

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly! Because maxSplits can limit the number of subsequences that are produced, there's no way to calculate the last subsequence without searching from the beginning. BidirectionalCollection conformance is for collections that can efficiently do backward traversal, so there's just no need to do that here.


defer {
sequenceLength += 1
subsequence = []
Copy link
Contributor

@LucianoPAlmeida LucianoPAlmeida Feb 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If subsequence is now local there is not need for this subsequence = [] to be here any more right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely. Thank you.

Comment on lines 38 to 67
self._startIndex = Index(baseRange: base.startIndex..<base.startIndex)

if !base.isEmpty {
// Precompute the start index.
_startIndex = indexForSubsequence(atOrAfter: base.startIndex)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: That way we avoid create an index that would be discarded if the condition !base.isEmpty is true

Suggested change
self._startIndex = Index(baseRange: base.startIndex..<base.startIndex)
if !base.isEmpty {
// Precompute the start index.
_startIndex = indexForSubsequence(atOrAfter: base.startIndex)
}
if !base.isEmpty {
// Precompute the start index.
self._startIndex = indexForSubsequence(atOrAfter: base.startIndex)
} else {
self._startIndex = Index(baseRange: base.startIndex..<base.startIndex)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: That way we avoid create an index that would be discarded if the condition !base.isEmpty is true

When I try this, I get Variable 'self._startIndex' used before being initialized on line 40. I actually tried a similar thing first and ran into the same error, which is why it ended up the way it's committed now. I have a vague notion the compiler sees some difference between assigning the result of an initializer, and copying the result of a method call? Seems like a thing I should know. I'll research it.

Copy link
Contributor

@LucianoPAlmeida LucianoPAlmeida Feb 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sorry my mistake, the compiler is correct in that case ... I just didn't notice that indexForSubsequence is a member function of self so calling self.indexForSubsequence(atOrAfter: base.startIndex) before self._startIndex initialization is definitely an access to self before all properties being initialized... so I think we have to stay with this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. Thanks for that explanation. I think the specificity of the error message confused me: self is definitely used before being initialized there, but self._startIndex isn't used in that member function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah this is just diagnostics that could be improved on the compiler :)
We have a bug tracking similar situations where diagnostics could be more specific https://bugs.swift.org/browse/SR-12421

This implementation is provided only for `LazyCollection`-conforming types,
because the strategy for avoiding new allocations is to have the iterator return
subsequences via ranged subscripting of the underlying collection.
The main simplification is replacing the convoluted empty-
subsequence-skipping strategy of finding the last adjacent
separator with a concise search for the next non-separator.
Thanks, @natecook1000.

The main correctness improvement is keeping track of the
number of elements returned from the iterator, so a final
empty subsequence can be returned if necessary in the case
where the base collection ends with a separator.

Added documentation describing the algorithm.

Added test coverage, particularly for
`omittingEmptySubsequences == false`.
These are the suggestions I currently understand. Need to ponder the
others and maybe ask questions.
Thanks, @timvermeulen, that does make the code more concise!
I see that this is also the pattern used in `LazyMapSequence`, for
example.
Reordered them so they appear in a somewhat more logical progression. Made the names more consistent.
It's now defined as an index with `sequenceLength == Int.max`. This
frees up the range `base.endIndex..<base.endIndex` to represent a
final empty subsequence, if necessary.

Also fixes a bug when splitting an empty collection and not omitting
empty subsequences. Test added for this case.
All values for `indexForSubsequence` arguments are now explicit at call sites.
When omitting empty subsequences, we may skip over multiple adjacent separators, so that
property wasn't counting separators in that case. It's really counting splits.
Only use doc comments for doc. Wrap all to 80 columns.

Also removed high-level explanation of split operation, so it isn't taken
to be an explanation of this particular split algorithm.
I love how the latter followed from the former. 😄
Matches the convention of this project.
Added a line to the readme which links to a new LazySplit guide.
Includes multiple adjacent separators.
@toddthomas
Copy link
Contributor Author

I think I've addressed all the feedback thus far.

Note that I combined the LazySplitSequence and LazySplitCollection definitions into a single file, LazySplit.swift, and did the analogous thing with their test classes. That follows what I understand to be the convention in this repo, and allows the new Guides/LazySplit.md to follow the convention I see for guides as well.

Thank you reviewers! I've learned a lot. And I imagine I have at least one more thing yet to learn. 😄

Copy link
Member

@natecook1000 natecook1000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost there, @toddthomas! The lazy collection splits look ready; I found one hiccup in the sequence implementation. After that, and a couple of other notes, we'll be all ready to go.

Thanks again for taking this on — the split method is so much more complex than it seems at first glance! 👏👏

Comment on lines 364 to 365
@usableFromInline
internal var _endIndex: Index
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is more or less constant regardless of the contents of base, this can just be a computed property.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I thought I needed indirection to set _startIndex to endIndex in that one case, but that was just me not reasoning correctly about initializer code (again). I think I might have got it now! 😂

// times. In all cases, stop at the end of the base sequence.
while currentElement != nil {
if separatorCount < maxSplits && isSeparator(currentElement!) {
separatorCount += 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want to unconditionally advance separatorCount here, since we're only actually splitting in certain cases (i.e. not when we continue in the first branch below).

Try out this test case:

let s = AnySequence([1, 1, 1, 1, 1, 2, 2])
let expected = s.split(separator: 1, maxSplits: 1)
let actual = s.lazy.split(separator: 1, maxSplits: 1)

(Note: the lazy collection version works correctly here.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. The code was keeping an accurate count of separators, which was then being used wrongly as the split count. I renamed the property, and it keeps an accurate split count now.

This was the final strike against the ad-hoc tests. I completely revised them using a much more deliberate, combinatorial approach. I updated the PR description to discuss this in detail.

}
}

extension LazySplitCollection.Index: Hashable where Base.Index: Hashable {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the way we have it implemented now, the synthesized conformance would work, but generally when you implement custom Equatable conformance that only considers a subset of a type's members, the Hashable conformance should have the same customization. Can you implement hash(into:) here? (Interestingly, with that done, we also shouldn't need the conditional constraint anymore.)

The computed `startIndex` can be used where we need it
in the initializer.
We were correctly recording the separator count, but that wasn't what
we needed! Rename `separatorCount` to `splitCount`, and correctly
count the splits.

Also, completely revamp the tests:
- expand coverage of the sequence pattern space;
- test each pattern as both sequence and collection;
- test each pattern with all combinations of arguments;
- provide useful failure message.
Most of the coverage is for sequences with 1, 2, 4, and 8
separators. This adds tests for some longer sequences
with an odd number of separators.
@kylemacomber
Copy link
Contributor

Is there a reason we're not using permutations(ofCount:) (maybe combined with uniqued()?) to generate the test cases rather than listing them out—it strikes me as more readable/maintainable. Does it just take too long to run?

@toddthomas
Copy link
Contributor Author

toddthomas commented Mar 8, 2021

Is there a reason we're not using permutations(ofCount:) (maybe combined with uniqued()?) to generate the test cases rather than listing them out—it strikes me as more readable/maintainable. Does it just take too long to run?

I tried that first, for the reasons you mention (and the power of suggestion from working on this package? 😄), and yes, it took too long to run. I proceeded with the array literal approach and made a note to look into the performance of the algorithm later to see if it could be made to work. I had found Mathematica to be so fast at generating unique permutations that I was hopeful it could be done.

But I just double-checked using permutations().uniqued() on the two longest patterns in the test (both of length 9), and actually the performance is not nearly as bad as I remember it. I remember killing the test run after waiting about a minute. Instead, I find that those two tests take a few seconds each.

I recall now that my original tests were with a pattern of length 12 (6 elements, 6 separators)! That, I can confirm, is unworkable with permutations().uniqued(). I've had it running in the background for several minutes now and I'm going to kill it. It's nearly 500 million non-unique permutations, so it's not surprising, but Mathematica does it in less than a millisecond. I'm real curious to know what tricks they're employing!

Prior to this PR, the whole test suite took just under 4 seconds to run. Using the algorithm approach will more than double that, but if that's acceptable, I much prefer it. I didn't feel good about those big array literals, and they really hurt Live Issues performance.

And even if we don't like the extra 5 seconds or so for the length 9 tests, I will use the algorithm for all the shorter patterns where I imagine it will have no performance impact.

Thank you for encouraging me to give this a second look!

(All timings on my very modestly-configured 2018 15" MBP.)

@natecook1000
Copy link
Member

I recall now that my original tests were with a pattern of length 12 (6 elements, 6 separators)! That, I can confirm, is unworkable with permutations().uniqued(). I've had it running in the background for several minutes now and I'm going to kill it. It's nearly 500 million non-unique permutations, so it's not surprising, but Mathematica does it in less than a millisecond. I'm real curious to know what tricks they're employing!

Oof! There's a separate algorithm for producing these unique permutations, which runs in much faster time for inputs like these. It's actually in the package already as an implementation detail of permutations() — I've made a public wrapper that we can add later. Let's not worry about changing the lazy split tests now. Thanks for your hard work on getting those built!

@kylemacomber
Copy link
Contributor

@natecook1000 that new wrapper is awesome!

Copy link
Member

@natecook1000 natecook1000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ready to go! 🎉🎉🎉

@toddthomas this has been a journey, thanks for sticking with it! Do you mind squashing these commits, and then I'll merge? Alternatively, you can just give me a commit message you'd like for the final squashed commit. Thanks!

@toddthomas
Copy link
Contributor Author

I've made a public wrapper that we can add later.

Cool! I made the changes to use the existing algorithm, but I'll stash them and wait for the new hotness. uniqued() is doing a huge amount of work in these cases that is better not done at all.

@toddthomas
Copy link
Contributor Author

This is ready to go! 🎉🎉🎉

@toddthomas this has been a journey, thanks for sticking with it!

🥳😂 Thanks so much for all the careful reviewing and wise suggestions. It's been a great learning experience.

Do you mind squashing these commits, and then I'll merge? Alternatively, you can just give me a commit message you'd like for the final squashed commit. Thanks!

If you don't mind squashing when you merge, I'd like to keep the history of the journey in my repo for a bit. I'll follow up with the commit message. Thanks.

@toddthomas
Copy link
Contributor Author

toddthomas commented Mar 8, 2021

I'll follow up with the commit message.

Add lazy implementations of `split`.

`split(maxSplits:omittingEmptySubsequences:whereSeparator:)`

and

`split(separator:maxSplits:omittingEmptySubsequences:)`

now operate lazily when called on `LazySequenceProtocol`- or
`LazyCollectionProtocol`-conforming values. In the latter case,
no additional storage is allocated on the heap.

@natecook1000 natecook1000 merged commit e7c6716 into apple:main Mar 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Lazy split
6 participants