-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add lazy split
implementation.
#78
Conversation
split
implementation for LazyCollection
-conforming types.split
implementation for lazy collection types.
split
implementation for lazy collection types.split
implementation for lazy collection types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really great start, @toddthomas! 🎉 What would you think of splitting this into two wrapper types, a lazy sequence version that vends arrays, and a lazy collection wrapper that is itself a collection? The collection's index type could simply wrap a range, which would let you re-use your Iterator.next()
code in index(after:)
. Edited to add: if you just want to tackle one of these at a time, that's fine too!
Re your questions:
- The Algorithms package is separate from the standard library, so we handle additions through this PR process rather than Swift Evolution. People will only get these lazy versions if they import the package, so this won't change the behavior of any existing code.
- Sometimes I've just stored a copy of the sequence when there are a lot of configuration options like this, but I don't think there's anything wrong with what you've got here.
- The README will just need a single line description of the addition. We'd also like a brief guide that covers the design of what you're adding in
Guides/
. - We can audit for inlinability later, but in general public methods should be annotated
@inlinable
, with internal methods@usableFromInline
.
/// elements. | ||
/// | ||
/// - Complexity: O(*n*), where *n* is the length of the collection. | ||
func split( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This version of the method should be in an unconstrained extension, since it doesn't depend on Equatable
conformance.
} | ||
} | ||
|
||
extension LazySplitCollection.Iterator: IteratorProtocol, Sequence { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't typically have custom iterators conform to Sequence
, even though it's trivial.
lastAdjacentSeparator = subSequenceEnd | ||
|
||
if omittingEmptySubsequences { | ||
/// TODO: should be able to replace this raw loop with something like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of finding lastAdjacentSeparator
, it should be simpler to look for the start of the next subsequence here. (And you'll be able to use firstIndex(where:)
instead of manual iteration.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is much nicer. Thank you.
/// | ||
/// - Note: This type is the result of `x.split(maxSplits:omittingEmptySubsequences:whereSeparator)` and | ||
/// `x.split(separator:maxSplits:omittingEmptySubsequences)`, where `x` conforms to `LazyCollection`. | ||
public struct LazySplitCollection<Base: LazyCollectionProtocol> where Base.Element: Equatable, Base.Elements.Index == Base.Index { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need a Base.Element: Equatable
constraint here.
I think it also simplified things a bit if Base
is simply a Collection
, and we wrap self.elements
instead of self
from LazyCollectionProtocol.split
, because then we don't need to bother with Base.Elements
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eliminated the unnecessary constraint. I need to ponder your other suggestion some more to make sure I fully understand it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, wrapping a collection instead of a lazy collection did make the code more concise. Thank you.
} | ||
} | ||
|
||
extension LazyCollection where Element: Equatable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extension LazyCollection where Element: Equatable { | |
extension LazyCollectionProtocol where Element: Equatable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops! Done. Thanks.
Thanks @timvermeulen, @natecook1000, and anyone else who's reviewed this so far. I appreciate your time. I'm currently rethinking the iteration algorithm. I must admit the currently-committed version was the result of considering simple cases first, and then trying to handle things like After I sort that out, I'll address the feedback. Again, thanks. |
split
implementation for lazy collection types.split
implementation for lazy collection types.
I'm very happy to split (😉) this into two wrapper types, @natecook1000. I have some questions:
Thanks! |
I think what Nate means was when implementing a lazy collection wrapper that is itself a collection the requirement for conformance to @natecook1000 or @timvermeulen can elaborate more, but I hope this could be helpful =] |
// | ||
//===----------------------------------------------------------------------===// | ||
|
||
/// A collection that lazily splits a base collection into subsequences separated by elements that satisfy the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Really, really a nit, but I think is a code style guide we try to maintain 80 columns max per line.
So as the comment in line 10 marks the limit :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funny/sad story: I actually did go through the file and manually reformat all these comments so they were visually wrapped to those comment lines. But I did it in Xcode, which of course renders triple-slash comments in a smaller font. 🤦♂️
I'm very happy to follow code style guidelines. I'd looked around a few times for a formatter so it would be painless, but hadn't had much luck. I was just searching again now and found swift-format. I don't know why it eluded me before. Going to give it a try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly done. swift-format
doesn't seem to reformat doc comments for line length, at least with default config. I reformatted most comments manually, but there are still a few with lines longer than 80 in code examples that I didn't want to tackle yet.
/// - Returns: A lazy sequence of subsequences, split from this collection's | ||
/// elements. | ||
/// | ||
/// - Complexity: O(*n*), where *n* is the length of the collection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
/// - Complexity: O(*n*), where *n* is the length of the collection. | |
/// - Complexity: O(*n*), where *n* is the count of the collection. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"length" comes from the split
method comments in Collection.swift. I have no strong preference in this case, but in general I'm inclined to keep the docs for these lazy splits as close to the existing ones as possible. Is there a doc prose style guide that we could defer to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah no worries that is not that important, that was just a suggestion because to me its more reasonable to refer to count for sequences and collections in Swift since the property is count
... but if length
is how it is on other docs we can just keep it :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a very compelling argument. I'm less inclined to defer to the existing docs now. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either phrasing can be argued for, but "length" is used in this context ubiquitously in the rest of the Algorithms package, so let's keep it consistent 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, we use length
for most of these because sequences don't have a count
, per se. They're largely interchangeable otherwise!
split
implementation for lazy collection types.split
implementation.
Thanks for the pointer! I will study Chunked.swift. |
/// elements. | ||
/// | ||
/// - Complexity: O(*n*), where *n* is the length of the collection. | ||
func split( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be @inlinable
?
I think we should look to make functions @inlinable
or @usableFromInline
where it makes sense :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely, inlinability is on my to-do list. Currently working on making LazySplitCollection
conform to Collection
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I had thought about making
LazySplitCollection
conform toCollection
, but paused at the expectation ofendIndex
and subscripting being O(1) operations, which I don't think can be done. Would it be okay for them to be O(n) for this class?
You should be able to have the collection version wrap Range<Base.Index>
, with endIndex
as just base.endIndex..<base.endIndex
. The part that will be O(n) is finding the first subsequence — we've been having an ongoing discussion about how lazy collections should handle this, and we're going to move to precalculating and storing startIndex
for this kind of collection. So your collection type should end up with both the base collection and its own startIndex
, along with the various pieces of configuration, as stored properties. Does that all make sense?
- Could you please elaborate on your comment about reusing Iterator.next() code in
index(after:)
? Is there an example of this elsewhere in the Standard Library to which you could point me?
@LucianoPAlmeida's answer on this one is correct! If you're using a range as the index, the logic in your index(after:)
method will be essentially the same as what's in your next()
method. (Come to think of it, your index will probably also need to include a subsequence counter so that you can handle maxSplits
correctly.)
} | ||
} | ||
|
||
extension LazyCollectionProtocol where Elements.Index == Index { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need this Elements.Index
constraint? I'm not seeing what effect it would have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The compiler says no! I think that was a leftover from when the lazy collection was being wrapped. Thank you.
|
||
defer { | ||
sequenceLength += 1 | ||
subsequence = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like subsequence
could be local to this next()
method instead of being stored in the iterator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, much cleaner and more correct. Thank you.
if sequenceLength < separatorCount + 1 { | ||
if !subsequence.isEmpty || !omittingEmptySubsequences { | ||
sequenceLength += 1 | ||
return subsequence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
subsequence
isn't getting zeroed out here (making it local would resolve that as well) — try AnySequence([1, 0, 0, 0, 0, 2, 0, 3]).lazy.split(separator: 0)
to see the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you say, fixed with local subsequence
, and tests added. Thanks!
|
||
extension LazySplitSequence.Iterator: IteratorProtocol { | ||
public mutating func next() -> Element? { | ||
/// Separators mark the points where we want to split (cut in two) the base |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seams like a doc comment, should it be above the function instead of the body?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a doc comment, but rather a high-level discussion (more general than any particular implementation) of how the split operation works. I found writing this to be very helpful to my understanding of how any split implementation should work. In retrospect, it might serve better as the introduction to the /Guides/Split.md I need to write. I can see how in its current location its intention is unclear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see, thanks! I mention doc comments because comments starting with ///
are meant for documentation :)
But I agree such detailed info about algorithm itself could be on the guides. WDYT @natecook1000 @timvermeulen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used ///
because when I'm mixing prose and code, I like seeing each rendered differently as Xcode does with ///
. Should that style only be used for comments intended for the generated documentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, ///
is for documentation comments.
/// | ||
/// [1, 2, 42, 3, 4, 42, 5].split(separator: 42, | ||
/// omittingEmptySubsequences: false) | ||
/// // first split -> [1, 2], [3, 4, 42, 5] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, maybe I'm missing something here but what is first split
and last split
? For what I understood the first is a split limited by 2
and last is not limited? Perhaps we could clarify more on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, sorry. Confusing. These are all examples of the high-level algorithm in progress in cases where the original sequence is split twice, resulting in three subsequences. By "first split" I mean, "the state after the first split (due to the first separator)". But my ASCII-art doesn't even make it clear what parts are the returned subsequences, and what part is remaining to be split. I will make this clearer, and also probably remove the whole comment and rewrite it into /Guides/Split.md.
/// [1, 2, 42, 3, 4, 42, 5, 42].split(separator: 42, | ||
/// maxSplits: 2, | ||
/// omittingEmptySubsequences: false) | ||
/// // first split -> [1, 2], [3, 4, 42, 5, 42] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the last 42
appear in the result here? If I understand correctly it should be an empty subsequence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That actually meant for the line bellow (line 79) last split =]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case with maxSplits == 2
, the base sequence is only split twice, even though it contains three separators. Although the docs for the existing eager split
s say, "Elements that are used to split the sequence are not returned as part of any subsequence", that is not how those methods actually work. In the case where maxSplits
is less than the number of separators, the remaining separators do appear in the last subsequence.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dunno, is that a bug? Is the documentation saying exactly how it should be working?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah no sorry, I think it was just a confusion from my part I see the point now =]
|
||
import XCTest | ||
|
||
@testable import Algorithms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need testable import here? For what I could see we only test the public API so we should be fine without it
@testable import Algorithms | |
import Algorithms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You found a serious error! I hadn't actually made the new split
s I'm adding public. I've learned a valuable lesson: @testable
is a big hammer that can hide access control bugs. Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great progress, @toddthomas! I have a question about how we’re building endIndex
in the weird separator-at-end edge case — otherwise this is in good shape!
return Index( | ||
baseRange: i.baseRange.upperBound..<i.baseRange.upperBound, | ||
sequenceLength: i.sequenceLength + 1, | ||
separatorCount: i.separatorCount |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something feels off here — I think we want the final subsequence to represent the space between the last element and the end of the collection, which has the range base.endIndex..<base.endIndex
. In addition, if we do it this way I think the <
and ==
implementations aren’t going to be correct for this penultimate index.
Here’s an example with the existing split. If we don’t omit the empty subsequences and have a separator-only collection, the resulting ranges are all the empty gaps between elements—including endIndex..<endIndex
:
let a = [0, 0, 0]
.split(separator: 0, omittingEmptySubsequences: false)
.map(\.indices)
// a == [0..<0, 1..<1, 2..<2, 3..<3]
Given this, we can’t depend on there being any range of base
that would be uniquely the endIndex
of LazySplitCollection
. Maybe the subsequence counter == maxSplits
or Int.max
is how endIndex
is designated? Otherwise, we’ve used an enumeration with an atEnd
case in various other places. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I figured there would be questions about this, because I have them too! It is wonky right now. I meant to put a note here, but got distracted thinking about BidirectionalCollection
conformance.
I thought perhaps base.endIndex..<base.endIndex
was a convention for the end index of these range-based indexes that I should maintain. But that does result in a logically duplicate index in the case where the base collection ends with at least two separators.
I pondered including sequenceLength
in the equality comparison, which would fix that particular problem. But even when the collection ends with only one separator, and the final empty sequence is not strictly a duplicate, its range doesn't feel right--it's the empty range right before the last element, instead of after it.
[7, 42].split(separator: 42, omittingEmptySubsequences: false)
.map(\.indices)
// -> [0..<1, 2..<2]
[7, 42].lazy.split(separator: 42, omittingEmptySubsequences: false)
.map(\.indices)
// -> [0..<1, 1..<1]
So I agree that it feels right to leave base.endIndex..<base.endIndex
available for that final empty subsequence. I'm leaning towards using an enumeration, because I think there are some problematic cases trying to use something like sequenceLength == maxSplits + 1
or (sequenceLength - 1) == maxSplits
(to avoid a problem at maxSplits == Int.max
) as the determining factor.
For example, the following would work:
let parts = [7, 42].lazy.split(separator: 42, omittingEmptySubsequences: false)
// maxSplits == Int.max
// endIndex == Index(baseRange: Range(2..<2), sequenceLength: Int.max, separatorCount: ?)
parts.index(after: parts.startIndex)
// -> Index(baseRange: Range(2..<2), sequenceLength: 2, separatorCount: 1)
But
let parts = [7, 42].lazy.split(separator: 42, maxSplits: 1, omittingEmptySubsequences: false)
// endIndex == Index(baseRange: Range(2..<2), sequenceLength: Int.max, separatorCount: ?)
parts.index(after: parts.startIndex)
// -> Index(baseRange: Range(2..<2), sequenceLength: 2, separatorCount: 1)
doesn't, without stretching the meaning of one of the available properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the subsequence counter ==
maxSplits
orInt.max
is howendIndex
is designated?
Sorry, I think I misunderstood you. When I wrote my response above, I thought you were saying an index with sequenceLength == maxSplits
, whatever that was, would be the end index. But now I think you were saying sequenceLength == Int.max
is the end, which wouldn't be problematic except in a very extreme edge case. 😄
I still prefer the idea of using an enum as it would be more explicit, but maybe I'll try this first, just to see how it feels, because it's a smaller change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defining endIndex
as
Index(
baseRange: base.endIndex..<base.endIndex,
sequenceLength: Int.max,
separatorCount: Int.max
)
and redefining the equality and comparison operators as
public static func == (lhs: Index, rhs: Index) -> Bool {
// `sequenceLength` is equivalent to its index's 1-based position in the
// collection of indices.
lhs.sequenceLength == rhs.sequenceLength
}
public static func < (lhs: Index, rhs: Index) -> Bool {
lhs.sequenceLength < rhs.sequenceLength
}
lets us achieve this goal
[42, 42, 42].lazy.split(
separator: 42,
omittingEmptySubsequences: false
).map(\.indices)
// -> [0..<0, 1..<1, 2..<2, 3..<3]
[Edited to remove a bit that was just me being confused.]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The eager split returns one empty sequence with indices 0..<0, whereas the lazy split returns none because I set startIndex to endIndex for empty base collections.
😮 If I think about it, that makes sense, since splitting something with just one separator yields two empty subsequences. But it's still really odd!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the tests, I still have some issues to work out with the empty collection case. Then I do want to make an attempt at an enum version, to see if I prefer that.
I actually played around with an enum implementation before I submitted this PR, but abandoned it because switch
statements didn't seem like a good fit in this algorithm, where in the majority of code paths you're not dealing with the .atEnd
case. I really like how the enum works in Intersperse
, but that type has a lovely symmetry where every method can be just one switch
statement. That didn't seem to be the case here, but I want to give it another try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up pushing the commit that redefined endIndex
to have sequenceLength == Int.max
. It was a fairly small change--at least in terms of diffs--that fixed a bug in the previous implementation, in addition to allowing the more natural representation of a final empty subsequence, so it seemed like a win.
I'll investigate an enum-based implementation tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been playing around with an enum
-based implementation, and I don't think it's headed toward something I want to submit. However, I don't have much experience designing types based on Swift's super-cool, super-powerful enum
, so it might be me that headed in the wrong direction to begin with.
This is the Index
I wrote:
extension LazySplitCollection: LazyCollectionProtocol {
/// Position of a subsequence in a split collection.
public struct Index: Comparable {
internal enum Representation: Equatable {
case includingEmptySubsequences(
/// The range corresponding to the subsequence at this position.
baseRange: Range<Base.Index>,
/// The number of subsequences up to and including this position in the
/// collection.
sequenceLength: Int,
/// The number of separators encountered up to and including this
/// position in the collection.
separatorCount: Int
)
case omittingEmptySubsequences(
/// The range corresponding to the subsequence at this position.
baseRange: Range<Base.Index>
/// The number of separators encountered up to and inclluding this
/// position in the collection.
separatorCount: Int
)
case end
}
internal let representation: Representation
// [etc.]
and I like how that feels--how the associated types are exactly the properties needed for each case. But in adapting the rest of the implementation to this, I'm finding the amount of code and code duplication is growing, and I'm ending up with too many situations like
switch index {
case .omittingEmptySubsequences:
// some code
case .includingEmptySubsequences:
// some mostly similar code with a few crucial differences
case .end:
// there's a precondition to ensure we never get here!
}
In terms of what methods the cases of this enumeration need to participate in, and the implementation for each case, the first two overlap too much, while the third barely overlaps with the first two at all. Is there a rule of thumb to be gleaned here?
But again, I don't have much experience in this area. Did I get off on the wrong foot? I can show additional, more specific code examples if my intent isn't clear.
Thank you for your patience with this PR, and I appreciate the opportunity to learn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the distinction between the omitting and not omitting case isn't as sharp as I first thought. I'm going to try just an element and end case, and see how much of the redundancy and awkwardness that eliminates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good — I think I would go that route if I were using an enum as well. I think in this case, since the .max
value works well to indicate the endIndex
, I'd just stick with that.
internal func indexForSubsequence( | ||
atOrAfter lowerBound: Base.Index, | ||
sequenceLength: Int = 0, | ||
separatorCount: Int = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These defaults seem a bit risky. Can we be explicit at the call sites instead about what these counters are for the start and end index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, yes.
separatorCount
is another bit of this index implementation I don't feel great about. It's only correct for all indices when we're not omitting empty subsequences, or when there aren't multiple adjacent separators. Otherwise, we explicitly skip over multiple adjacent separators without counting them, for a more succinct implementation. We don't need an accurate separator count when we're omitting empty subsequences, but it still bugs me that the property is there and not correct.
On the other hand, these are internal properties, and the initializer is internal, so I can view them as implementation details and feel somewhat better. Do you have any concerns about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am thinking the index will become an enum though, as discussed above. Then the end case wouldn't have these properties at all.
import Algorithms | ||
import XCTest | ||
|
||
final class LazySplitCollectionTests: XCTestCase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the collection wrapper it may be a good idea to have a test that executes validateIndexTraversals
just to ensure correctness :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thank you. I had seen that in another type's test, but at the time it didn't apply to this. Now it does--thank you for reminding me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I learned validateIndexTraversals
requires BidirectionalCollection
conformance, which I haven't implemented yet, so adding this test will have to wait a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can/should break that test into versions that work for regular collections and also bidirectional ones, so no worries on this one.
ALSO: I don't think there's a way to efficiently implement bidirectionality here, because of maxSplits
. You'd need to traverse from the beginning every time, defeating the point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. You're saying the collection needs to be the same forwards and backwards, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the issue is that with bidirectional conformance we would have to compute the splits
form the last to first (implementing index before), but because of maxSplits
we can't compute that without knowing the initial splits, so it will always have to start from startIndex
which I agree with @natecook1000, defeats the point of implementing the conformance since we can't conceptually traverse from last
to first
... at least to my understanding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly! Because maxSplits
can limit the number of subsequences that are produced, there's no way to calculate the last subsequence without searching from the beginning. BidirectionalCollection
conformance is for collections that can efficiently do backward traversal, so there's just no need to do that here.
|
||
defer { | ||
sequenceLength += 1 | ||
subsequence = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If subsequence
is now local there is not need for this subsequence = []
to be here any more right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely. Thank you.
self._startIndex = Index(baseRange: base.startIndex..<base.startIndex) | ||
|
||
if !base.isEmpty { | ||
// Precompute the start index. | ||
_startIndex = indexForSubsequence(atOrAfter: base.startIndex) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: That way we avoid create an index that would be discarded if the condition !base.isEmpty
is true
self._startIndex = Index(baseRange: base.startIndex..<base.startIndex) | |
if !base.isEmpty { | |
// Precompute the start index. | |
_startIndex = indexForSubsequence(atOrAfter: base.startIndex) | |
} | |
if !base.isEmpty { | |
// Precompute the start index. | |
self._startIndex = indexForSubsequence(atOrAfter: base.startIndex) | |
} else { | |
self._startIndex = Index(baseRange: base.startIndex..<base.startIndex) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: That way we avoid create an index that would be discarded if the condition
!base.isEmpty
istrue
When I try this, I get Variable 'self._startIndex' used before being initialized
on line 40. I actually tried a similar thing first and ran into the same error, which is why it ended up the way it's committed now. I have a vague notion the compiler sees some difference between assigning the result of an initializer, and copying the result of a method call? Seems like a thing I should know. I'll research it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry my mistake, the compiler is correct in that case ... I just didn't notice that indexForSubsequence
is a member function of self
so calling self.indexForSubsequence(atOrAfter: base.startIndex)
before self._startIndex
initialization is definitely an access to self
before all properties being initialized... so I think we have to stay with this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, right. Thanks for that explanation. I think the specificity of the error message confused me: self
is definitely used before being initialized there, but self._startIndex
isn't used in that member function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah this is just diagnostics that could be improved on the compiler :)
We have a bug tracking similar situations where diagnostics could be more specific https://bugs.swift.org/browse/SR-12421
This implementation is provided only for `LazyCollection`-conforming types, because the strategy for avoiding new allocations is to have the iterator return subsequences via ranged subscripting of the underlying collection.
Shuffling to match `LazyMapSequence`.
The main simplification is replacing the convoluted empty- subsequence-skipping strategy of finding the last adjacent separator with a concise search for the next non-separator. Thanks, @natecook1000. The main correctness improvement is keeping track of the number of elements returned from the iterator, so a final empty subsequence can be returned if necessary in the case where the base collection ends with a separator. Added documentation describing the algorithm. Added test coverage, particularly for `omittingEmptySubsequences == false`.
These are the suggestions I currently understand. Need to ponder the others and maybe ask questions.
Thanks, @timvermeulen, that does make the code more concise! I see that this is also the pattern used in `LazyMapSequence`, for example.
Reordered them so they appear in a somewhat more logical progression. Made the names more consistent.
It's now defined as an index with `sequenceLength == Int.max`. This frees up the range `base.endIndex..<base.endIndex` to represent a final empty subsequence, if necessary. Also fixes a bug when splitting an empty collection and not omitting empty subsequences. Test added for this case.
All values for `indexForSubsequence` arguments are now explicit at call sites.
When omitting empty subsequences, we may skip over multiple adjacent separators, so that property wasn't counting separators in that case. It's really counting splits.
Only use doc comments for doc. Wrap all to 80 columns. Also removed high-level explanation of split operation, so it isn't taken to be an explanation of this particular split algorithm.
I love how the latter followed from the former. 😄
Matches the convention of this project.
Added a line to the readme which links to a new LazySplit guide.
Includes multiple adjacent separators.
I think I've addressed all the feedback thus far. Note that I combined the Thank you reviewers! I've learned a lot. And I imagine I have at least one more thing yet to learn. 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there, @toddthomas! The lazy collection splits look ready; I found one hiccup in the sequence implementation. After that, and a couple of other notes, we'll be all ready to go.
Thanks again for taking this on — the split
method is so much more complex than it seems at first glance! 👏👏
Sources/Algorithms/LazySplit.swift
Outdated
@usableFromInline | ||
internal var _endIndex: Index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is more or less constant regardless of the contents of base
, this can just be a computed property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I thought I needed indirection to set _startIndex
to endIndex
in that one case, but that was just me not reasoning correctly about initializer code (again). I think I might have got it now! 😂
Sources/Algorithms/LazySplit.swift
Outdated
// times. In all cases, stop at the end of the base sequence. | ||
while currentElement != nil { | ||
if separatorCount < maxSplits && isSeparator(currentElement!) { | ||
separatorCount += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to unconditionally advance separatorCount
here, since we're only actually splitting in certain cases (i.e. not when we continue
in the first branch below).
Try out this test case:
let s = AnySequence([1, 1, 1, 1, 1, 2, 2])
let expected = s.split(separator: 1, maxSplits: 1)
let actual = s.lazy.split(separator: 1, maxSplits: 1)
(Note: the lazy collection version works correctly here.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. The code was keeping an accurate count of separators, which was then being used wrongly as the split count. I renamed the property, and it keeps an accurate split count now.
This was the final strike against the ad-hoc tests. I completely revised them using a much more deliberate, combinatorial approach. I updated the PR description to discuss this in detail.
Sources/Algorithms/LazySplit.swift
Outdated
} | ||
} | ||
|
||
extension LazySplitCollection.Index: Hashable where Base.Index: Hashable {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the way we have it implemented now, the synthesized conformance would work, but generally when you implement custom Equatable
conformance that only considers a subset of a type's members, the Hashable
conformance should have the same customization. Can you implement hash(into:)
here? (Interestingly, with that done, we also shouldn't need the conditional constraint anymore.)
The computed `startIndex` can be used where we need it in the initializer.
We were correctly recording the separator count, but that wasn't what we needed! Rename `separatorCount` to `splitCount`, and correctly count the splits. Also, completely revamp the tests: - expand coverage of the sequence pattern space; - test each pattern as both sequence and collection; - test each pattern with all combinations of arguments; - provide useful failure message.
Most of the coverage is for sequences with 1, 2, 4, and 8 separators. This adds tests for some longer sequences with an odd number of separators.
Is there a reason we're not using |
I tried that first, for the reasons you mention (and the power of suggestion from working on this package? 😄), and yes, it took too long to run. I proceeded with the array literal approach and made a note to look into the performance of the algorithm later to see if it could be made to work. I had found Mathematica to be so fast at generating unique permutations that I was hopeful it could be done. But I just double-checked using I recall now that my original tests were with a pattern of length 12 (6 elements, 6 separators)! That, I can confirm, is unworkable with Prior to this PR, the whole test suite took just under 4 seconds to run. Using the algorithm approach will more than double that, but if that's acceptable, I much prefer it. I didn't feel good about those big array literals, and they really hurt Live Issues performance. And even if we don't like the extra 5 seconds or so for the length 9 tests, I will use the algorithm for all the shorter patterns where I imagine it will have no performance impact. Thank you for encouraging me to give this a second look! (All timings on my very modestly-configured 2018 15" MBP.) |
Oof! There's a separate algorithm for producing these unique permutations, which runs in much faster time for inputs like these. It's actually in the package already as an implementation detail of |
@natecook1000 that new wrapper is awesome! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is ready to go! 🎉🎉🎉
@toddthomas this has been a journey, thanks for sticking with it! Do you mind squashing these commits, and then I'll merge? Alternatively, you can just give me a commit message you'd like for the final squashed commit. Thanks!
Cool! I made the changes to use the existing algorithm, but I'll stash them and wait for the new hotness. |
🥳😂 Thanks so much for all the careful reviewing and wise suggestions. It's been a great learning experience.
If you don't mind squashing when you merge, I'd like to keep the history of the journey in my repo for a bit. I'll follow up with the commit message. Thanks. |
|
This makes
and
evaluate lazily when called on lazy sequences and collections, which should resolve #59. For sequence types, the resulting subsequences are vended as newly-allocated arrays. For collection types, the subsequences are slices and thus don't incur additional heap allocations (if I understand subsequences correctly!).
Discussion
The no-additional-allocations implementation is available only for
Collection
-conforming types, as the strategy for avoiding new allocations is to have the iterator return subsequences via ranged subscripting of the underlying collection.Test Plan
I've encountered a lot of interesting edge cases for this algorithm, which has led me to pursue a combinatorial approach to testing it.
Sequences to be split are patterns of separators and non-separators, the latter of which I refer to as "elements" to avoid the negative. All elements are equivalent from the algorithm's perspective, so the patterns can be represented as strings of "S" for separators, and "E" for elements. For example, SSEE is a pattern that begins with two separators and ends with two elements.
Patterns are tested using the
Validator
type. It compares the results of splitting a collection value as both a lazy sequence and a lazy collection—and with all combinations of the default and a providedmaxSplits
, and the default (true
) and an explicitfalse
omittingEmptySubsequences
—with the Standard Library's eager splits, which are assumed to be correct.The lazy splits of patterns of length 0 through 4 are validated comprehensively. Beyond that, the strategy is to validate splits of all possible relative positions of multiple adjacent separators—which seem to be involved in most edge cases. Specifically, patterns where at least two separators are located at the beginning, in the middle, at the end, and all combinations thereof, are validated.
To accomplish that, the test validates unique permutations of patterns that contain:
The first two cases are tested for patterns with equal numbers of elements and separators, more elements than separators, and more separators than elements. The six-separator case is only tested with EEESSSSSS for performance reasons.
As this is the swift-algorithms package, I tried to use
permutations()
for this. But that resulted in unacceptable runtimes, since that method treats repeated elements as separate, resulting in a great number of logically duplicate patterns to be tested. For example, the number of unique permutations of[1, 1, 1, 1, 1, 0, 0, 0, 0]
is 126, whereas 9! is 362,880.Instead, I used Mathematica's
Permutations
function to generate the permutations, and copied them into the tests as array literals. Mathematica is stunningly fast at generating unique permutations. I'm curious to investigate whether swift-algorithms could have a similarly fast implementation as an option forpermutations()
. That would make these tests much more concise.In the current tests, it's likely there is redundancy in the patterns being tested—it's almost certainly not the minimum set needed for complete coverage! The few large array literals make Live Issues evaluation noticeably slower. So it would be desirable to eliminate any redundant patterns. But I'm not currently confident I can identify them. The previous ad-hoc tests missed several bugs, so now I'm erring on the side of caution. Runtime impact is minimal: running all tests in swift-algorithms takes less than 0.1 sec longer with this PR (on my modest, couple-year-old MacBook Pro).
Checklist