
[Chunked] Add chunks(of:) variant that divides a collection in chunks of a given size #54

Merged
merged 10 commits into apple:main from chunked-size on Jan 16, 2021

Conversation

LucianoPAlmeida
Contributor

This just adds chunks(of:), which chunks a collection into subsequences of a given size.
As suggested by @bjhomer here, this just moves a long-open evolution proposal into this package. So let me know what you think =]

cc @natecook1000 @kylemacomber

Checklist

  • I've added at least one test that validates that my change is working, if appropriate
  • I've followed the code style of the rest of the project
  • I've read the Contribution Guidelines
  • I've updated the documentation if necessary

@kylemacomber
Contributor

I think this would make a great addition!

What do you think about the name chunks(ofCount:) similar to combinations(ofCount:)? I think this would make it clearer that the argument isn't specifying an element being searched for (like firstIndex(of:)):

let x = [1, 2, 2, 3, 3, 3, 2]
print(x.chunks(of: 2))
// [[2, 2], [2]] or [[1, 2], [2, 3], [3, 3], [2]]?

But rather the chunk size:

let x = [1, 2, 2, 3, 3, 3, 2]
print(x.chunks(ofCount: 2))
// [[1, 2], [2, 3], [3, 3], [2]]

@LucianoPAlmeida
Contributor Author

Hey @kylemacomber :)

What do you think about the name chunks(ofCount:) similar to combinations(ofCount:)? I think this would make it clearer that the argument isn't specifying an element being searched for (like firstIndex(of:)):

I agree, that's a good idea and it removes any ambiguity in how the function reads... I'll make the changes :)

@LucianoPAlmeida
Contributor Author

All fixed Kyle =]

@LucianoPAlmeida force-pushed the chunked-size branch 2 times, most recently from 99d39f3 to 91f1e73, on January 7, 2021 03:07
Member

@natecook1000 left a comment


Happy to see this PR, @LucianoPAlmeida! I’ve done an initial review pass over the implementation — let me know if you have any questions.

Review comments on: Guides/Chunked.md, Sources/Algorithms/Chunked.swift, Tests/SwiftAlgorithmsTests/ChunkedTests.swift

extension ChunkedCollection: Collection {
public struct Index {
public let base: Base.Index
Member


This should also be internal.

There’s a bit of a decision to make here between using this single index and using the range for the chunk. For a non-random-access collection, a single index makes using the subscript O(n), while a range moves that calculation to startIndex. Given that startIndex is so frequently touched for testing boundary conditions, the current implementation is probably the right one, but I don’t like that iterating over a collection this way requires traversing the base collection twice (once for each subscript access and once for each call to index(after:)).

Another approach would be to calculate startIndex at initialization and store it along with the chunk size. That way both startIndex and subscripting would always be O(1). What do you think?
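
A minimal sketch of that suggestion, purely for illustration (hypothetical names, not the PR's actual implementation), assuming we cache where the first chunk ends at initialization so that startIndex and element access stay O(1):

// Hypothetical sketch: cache the first chunk's end when the view is created,
// so later startIndex and subscript accesses never re-walk the base collection.
struct ChunkedSketch<Base: Collection> {
  let base: Base
  let chunkCount: Int
  let firstChunkEnd: Base.Index  // computed once, at initialization

  init(base: Base, chunkCount: Int) {
    self.base = base
    self.chunkCount = chunkCount
    // O(chunkCount) for non-random-access bases, O(1) for random-access ones;
    // every later access to the first chunk is then O(1).
    self.firstChunkEnd = base.index(
      base.startIndex, offsetBy: chunkCount, limitedBy: base.endIndex
    ) ?? base.endIndex
  }

  // The first chunk, available in O(1) thanks to the cached index.
  var firstChunk: Base.SubSequence {
    base[base.startIndex..<firstChunkEnd]
  }
}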

Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC you're suggesting we store a Range in the index and calculate it in init, which would allow both startIndex and subscript to be O(1)? I like that idea!

Contributor Author

@LucianoPAlmeida Jan 8, 2021


Another approach would be to calculate startIndex at initialization and store it along with the chunk size. That way both startIndex and subscripting would always be O(1). What do you think?

I experimented with this approach when putting up this PR, and I like the idea... but there are two points:

  • Both the start index and the end index (its lower bound) would have to be computed.
  • If we compute the start and end indexes at initialization, a call like [1,2,3].chunks(ofCount: 2) becomes O(n) when the base is not a RandomAccessCollection.

Edit: In fact, the end index doesn't need to be computed, so this should be fine :)

Contributor Author


Just changed :)

Contributor Author


Having implemented it, I'd say I definitely like this approach!

Contributor Author


I think we have a point here... I was also thinking about this, and it wouldn't be a big problem if we just computed the startIndex every time: in most cases the base is likely to be an array, or a lazy collection over an array source that is random-access through the RandomAccessCollection conditional conformance, so in most cases startIndex wouldn't be expensive to compute. That said, I still like this range-based approach because, as Kyle mentioned, it's more robust and iteration is improved.

The actual solution would be changing the language so general notifications are made when a reference object is mutated (so in this case, the cached startIndex can be recomputed), but that’s severe and I’m not sure it’s implementable.

In C# there is the concept of ObservableCollection, where mutations (well defined through an API) can be observed from anywhere. It doesn't guarantee a notification when a custom collection is mutated through a custom API, but I just remembered the concept, which is nice and very useful in the context of the .NET framework =]

Contributor


For a reference-type collection or a collection of reference types, if there is a mutation between when a filter-like adapter is created and when it's traversed, I agree the result could be surprising and it's worth thinking carefully about our policy.

Note that for chunks(ofCount:), mutations to reference-type elements in a collection shouldn't change the behavior, and mutations to a reference-type collection itself could possibly be made to work. Though depending on our policy, we may decide we don't need to be defensive here.

Contributor


More concretely, my only concern with this approach was that certain lazy collections (most notably LazyFilterCollection) will run closures and evaluate elements when an index is advanced, and so precomputing the range of the first chunk can possibly run code that the user may not expect to be run at that stage. (Counterpoint: while lazy sequences are useful for short-circuiting, the user is very likely to at least evaluate the first element, in which case precomputing the start index doesn't have much of an impact.)

Calling .prefix(n) on a lazy collection has the same behavior, so we might have to conclude that such things are unavoidable/acceptable 🙂 In this case I think we should consider precomputing the first chunk of chunked(on:) and chunked(by:) as well.
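
For illustration, a small sketch of the effect described above, assuming the first chunk's range is precomputed when the chunked view is created (the print is only there to make the eager evaluation visible; this is not code from the PR):

import Algorithms

let numbers = [1, 2, 3, 4, 5, 6]
let evens = numbers.lazy.filter { n -> Bool in
  print("evaluating \(n)")  // shows when the predicate runs
  return n.isMultiple(of: 2)
}

// With the first chunk's range precomputed at creation, the filter predicate
// runs here, before any chunk is read, for as many base elements as it takes
// to delimit that first chunk.
let chunked = evens.chunks(ofCount: 2)
_ = chunked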

Contributor Author


Calling .prefix(n) on a lazy collection has the same behavior, so we might have to conclude that such things are unavoidable/acceptable 🙂 In this case I think we should consider precomputing the first chunk of chunked(on:) and chunked(by:) as well.

I agree, and I think at the end of the day, as @kylemacomber mentioned, it all depends on whether we decide to be really strict and defensive or trade that off for performance :)
Also, I still don't think it would be a big problem to always compute the start index if we really want to be conservative. Either way, let me know what you guys think and I'll be happy to make the changes if necessary =]

Contributor


The actual solution would be changing the language so general notifications are made when a reference object is mutated (so in this case, the cached startIndex can be recomputed), but that’s severe and I’m not sure it’s implementable.

Another solution would be limiting Base to value types. I think this was one of my inspirations for requesting the addition of an AnyValue poison protocol.

@LucianoPAlmeida
Contributor Author

Thanks for the review @natecook1000!

@LucianoPAlmeida force-pushed the chunked-size branch 11 times, most recently from 489a2de to 13d9171, on January 8, 2021 03:06
Review comments on: Sources/Algorithms/Chunked.swift
@LucianoPAlmeida force-pushed the chunked-size branch 3 times, most recently from 753e0e9 to 741d559, on January 8, 2021 11:15
@LucianoPAlmeida force-pushed the chunked-size branch 2 times, most recently from 60fb045 to 01ed6d9, on January 10, 2021 19:42
@CTMacUser
Contributor

Looking at the current code, it should be easy to set the SubSequence to be a wrapper over Base.SubSequence (instead of the default Slice<Self>). Maybe I’ll try it out later.

Review comments on: Sources/Algorithms/Chunked.swift
@LucianoPAlmeida force-pushed the chunked-size branch 2 times, most recently from 8fe1c9d to 45dcbbd, on January 14, 2021 01:09
@CTMacUser
Contributor

I tried turning ChunkedByCount<T.SubSequence> into the SubSequence. I forgot that a collection and its SubSequence must use the same Index type. That's incompatible with a custom Index whose arity is per-instantiation, so I had to pull the Index type out first.

I used a Git(Hub) client that pulled directly from this PR. I can't upload back to it, of course, so I have no way to express my changes as proper patches against your work. Worse, before I tried to upload, you pushed another one of your own. (I started an hour before you did.) Here's a copy/paste of the whole "Chunked.swift" section:

//===----------------------------------------------------------------------===//
// chunks(ofCount:)
//===----------------------------------------------------------------------===//

/// An index for an element (or past-the-end) of a `ChunkedByCount` collection.
public struct ChunkedByCountIndex<Base: Comparable> {
  // Using a separate type, instead of one embedded in ChunkedByCount, allows
  // ChunkedByCount<T> and ChunkedByCount<T.SubSequence> to share indices.

  @usableFromInline
  internal let baseRange: Range<Base>

  @usableFromInline
  internal init(_baseRange: Range<Base>) {
    self.baseRange = _baseRange
  }
}

extension ChunkedByCountIndex: Equatable {}

extension ChunkedByCountIndex: Comparable {
  public static func < (lhs: Self, rhs: Self) -> Bool {
    // Intervals are harder to do straight comparisons for.  But any cross-
    // overlap between non-identical ranges means the operands weren't from
    // compatible collections to begin with, a precondition violation.
    //
    // (Similar reasoning applies for not bothering to do a customized definition
    // for the equality operator.)
    return lhs.baseRange.lowerBound < rhs.baseRange.lowerBound
  }
}

extension ChunkedByCountIndex: Hashable where Base: Hashable {}

/// A collection that presents the elements of its base collection
/// in `SubSequence` chunks of any given count.
///
/// A `ChunkedByCount` is a lazy view on the base Collection, but it does not implicitly confer
/// laziness on algorithms applied to its result.  In other words, for ordinary collections `c`:
///
/// * `c.chunks(ofCount: 3)` does not create new storage
/// * `c.chunks(ofCount: 3).map(f)` maps eagerly and returns a new array
/// * `c.lazy.chunks(ofCount: 3).map(f)` maps lazily and returns a `LazyMapCollection`
public struct ChunkedByCount<Base: Collection> {
  
  public typealias Element = Base.SubSequence
  
  @usableFromInline
  internal let base: Base
  
  @usableFromInline
  internal let chunkCount: Int
  
  @usableFromInline
  internal var startUpperBound: Base.Index

  /// Creates a view instance that presents the elements of the given collection
  /// as `SubSequence` chunks with the given count, caching the given index for
  /// where the first chunk ends.
  ///
  /// - Complexity: O(1)
  @inlinable
  internal init(_base: Base, _chunkCount: Int, _firstChunkEnd: Base.Index) {
    self.base = _base
    self.chunkCount = _chunkCount
    self.startUpperBound = _firstChunkEnd
  }
  ///  Creates a view instance that presents the elements of `base`
  ///  in `SubSequence` chunks of the given count.
  ///
  /// - Complexity: O(n)
  @inlinable
  internal init(_base: Base, _chunkCount: Int) {
    // Compute the end of the first chunk upfront so that
    // startIndex is an O(1) lookup.
    let baseEnd = _base.index(_base.startIndex, offsetBy: _chunkCount,
                              limitedBy: _base.endIndex) ?? _base.endIndex
    self.init(_base: _base, _chunkCount: _chunkCount, _firstChunkEnd: baseEnd)
  }
}

extension ChunkedByCount: Collection {
  public typealias Index = ChunkedByCountIndex<Base.Index>
  public typealias SubSequence = ChunkedByCount<Base.SubSequence>

  /// - Complexity: O(1)
  @inlinable
  public var startIndex: Index {
    Index(_baseRange: base.startIndex..<startUpperBound)
  }
  @inlinable
  public var endIndex: Index {
    Index(_baseRange: base.endIndex..<base.endIndex)
  }
  
  /// - Complexity: O(1)
  public subscript(i: Index) -> Element {
    precondition(i < endIndex, "Index out of range")
    return base[i.baseRange]
  }
  public subscript(bounds: Range<Index>) -> SubSequence {
    let lowerRange = bounds.lowerBound.baseRange,
        upperRange = bounds.upperBound.baseRange
    // The slice spans the base elements from the start of the chunk at
    // `bounds.lowerBound` up to, but not including, the start of the chunk
    // at `bounds.upperBound`.
    let resultBounds = lowerRange.lowerBound ..< upperRange.lowerBound
    // For an empty range of chunks, the first chunk's end collapses onto the
    // slice's start; otherwise it is the end of the chunk at `bounds.lowerBound`.
    let firstChunkEnd = resultBounds.isEmpty
      ? resultBounds.lowerBound
      : lowerRange.upperBound
    return SubSequence(_base: base[resultBounds], _chunkCount: chunkCount,
                       _firstChunkEnd: firstChunkEnd)
  }
  
  @inlinable
  public func index(after i: Index) -> Index {
    precondition(i < endIndex, "Advancing past end index")
    let baseIdx = base.index(
      i.baseRange.upperBound, offsetBy: chunkCount,
      limitedBy: base.endIndex
    ) ?? base.endIndex
    return Index(_baseRange: i.baseRange.upperBound..<baseIdx)
  }
}

extension ChunkedByCount:
  BidirectionalCollection, RandomAccessCollection
where Base: RandomAccessCollection {
  @inlinable
  public func index(before i: Index) -> Index {
    precondition(i > startIndex, "Advancing past start index")
    
    var offset = chunkCount
    if i.baseRange.lowerBound == base.endIndex {
      let remainder = base.count % chunkCount
      if remainder != 0 {
        offset = remainder
      }
    }
    
    let baseIdx = base.index(
      i.baseRange.lowerBound, offsetBy: -offset,
      limitedBy: base.startIndex
    ) ?? base.startIndex
    return Index(_baseRange: baseIdx..<i.baseRange.lowerBound)
  }
}

extension ChunkedByCount {
  @inlinable
  public func distance(from start: Index, to end: Index) -> Int {
    let distance =
      base.distance(from: start.baseRange.lowerBound,
                    to: end.baseRange.lowerBound)
    let (quotient, remainder) =
      distance.quotientAndRemainder(dividingBy: chunkCount)
    return quotient + remainder.signum()
  }

  @inlinable
  public var count: Int {
    let (quotient, remainder) =
      base.count.quotientAndRemainder(dividingBy: chunkCount)
    return quotient + remainder.signum()
  }
  
  @inlinable
  public func index(
    _ i: Index, offsetBy offset: Int, limitedBy limit: Index
  ) -> Index? {
    guard offset != 0 else { return i }
    guard limit != i else { return nil }
    
    if offset > 0 {
      guard limit < i || distance(from: i, to: limit) >= offset else {
        return nil
      }
      return offsetForward(i, offsetBy: offset)
    } else {
      guard limit > i || distance(from: i, to: limit) <= offset else {
        return nil
      }
      return offsetBackward(i, offsetBy: offset)
    }
  }

  @inlinable
  public func index(_ i: Index, offsetBy distance: Int) -> Index {
    guard distance != 0 else { return i }
    
    return distance > 0
        ? offsetForward(i, offsetBy: distance)
        : offsetBackward(i, offsetBy: distance)
  }
  
  @usableFromInline
  internal func offsetForward(_ i: Index, offsetBy distance: Int) -> Index {
    assert(distance > 0)
    return makeOffsetIndex(
      from: i, baseBound: base.endIndex, baseDistance: distance * chunkCount
    )
  }
  
  @usableFromInline
  internal func offsetBackward(_ i: Index, offsetBy distance: Int) -> Index {
    assert(distance < 0)
    if i.baseRange.lowerBound == base.endIndex {
      let remainder = base.count % chunkCount
      // We have to take the remainder into account when calculating offsets.
      if remainder != 0 {
        // Add one to the distance (it's negative at this point) because we
        // need to adjust for the last chunk, which holds a variable number
        // of elements: the remainder.
        let baseDistance = ((distance + 1) * chunkCount) - remainder
        return makeOffsetIndex(
          from: i, baseBound: base.startIndex, baseDistance: baseDistance
        )
      }
    }
    return makeOffsetIndex(
      from: i, baseBound: base.startIndex, baseDistance: distance * chunkCount
    )
  }
  
  // Helper to compute index(offsetBy:) index.
  @inline(__always)
  private func makeOffsetIndex(
    from i: Index, baseBound: Base.Index, baseDistance: Int
  ) -> Index {
    let baseStartIdx = base.index(
      i.baseRange.lowerBound, offsetBy: baseDistance,
      limitedBy: baseBound
    ) ?? baseBound
    
    let baseEndIdx = base.index(
      baseStartIdx, offsetBy: chunkCount, limitedBy: base.endIndex
    ) ?? base.endIndex
    
    return Index(_baseRange: baseStartIdx..<baseEndIdx)
  }
}

extension Collection {
  /// Returns a `ChunkedByCount<Self>` view presenting the elements
  /// in chunks of the given count.
  ///
  /// - Parameter count: The desired size of each chunk. If the count
  ///   of the base `Collection` is evenly divisible by `count`, all
  ///   the chunks will have exactly `count` elements.
  ///   Otherwise, the last chunk will contain the remaining elements.
  ///
  ///     let c = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
  ///     print(c.chunks(ofCount: 5).map(Array.init))
  ///     // [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
  ///
  ///     print(c.chunks(ofCount: 3).map(Array.init))
  ///     // [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
  ///
  /// - Complexity: O(1) when `Self` conforms to `RandomAccessCollection`;
  ///   otherwise O(k), where `k` is `count`, to locate the end of the first chunk.
  @inlinable
  public func chunks(ofCount count: Int) -> ChunkedByCount<Self> {
    precondition(count > 0, "Cannot chunk with count <= 0!")
    return ChunkedByCount(_base: self, _chunkCount: count)
  }
}

// Conditional conformances.
extension ChunkedByCount: Equatable where Base: Equatable {}

// Since we have another stored property, of type `Base.Index`, on the
// collection, synthesis of the `Hashable` conformance would require
// a `Base.Index: Hashable` constraint, so we implement the hasher
// only in terms of `base`. Since the cached index is derived from it,
// this should not make a difference here.
extension ChunkedByCount: Hashable where Base: Hashable {
  public func hash(into hasher: inout Hasher) {
    hasher.combine(base)
  }
}

// Lazy conditional conformance.
extension ChunkedByCount: LazySequenceProtocol
  where Base: LazySequenceProtocol {}
extension ChunkedByCount: LazyCollectionProtocol
  where Base: LazyCollectionProtocol {}

I didn't touch the test file, so you'd have to add something that exercises the subsequence subscript method.
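
Something along these lines could work as that test, a minimal sketch assuming the package's existing XCTest setup (the test class and method names are made up for illustration):

import XCTest
import Algorithms

final class ChunkedByCountSubSequenceTests: XCTestCase {
  // Hypothetical test, not part of the PR: exercises the Range<Index> subscript.
  func testSubSequenceSubscript() {
    let chunks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].chunks(ofCount: 3)
    // Slice off the two middle chunks and check they match the expected
    // chunks of the base collection.
    let lower = chunks.index(after: chunks.startIndex)
    let upper = chunks.index(chunks.endIndex, offsetBy: -1)
    XCTAssertEqual(chunks[lower..<upper].map(Array.init), [[4, 5, 6], [7, 8, 9]])
  }
}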

@timvermeulen
Contributor

@CTMacUser I like the idea of this: Slice<ChunkedBySize<Base>> effectively stores 5 instances of Base.Index, whereas ChunkedBySize<Base.SubSequence> only stores 1 (plus however many Base.SubSequence stores, which is 2 in the case of Slice<Base>). I'm just curious whether this is the upside you had in mind, or whether it has other benefits as well.

Either way, this might be more suitable as a follow-up PR since this PR seems to be good to go, and it seems like the LazyChunked collection could benefit from this as well?

I forgot that a collection and its SubSequence must use the same Index type.

Thanks for pointing this out, I'm making a note that validateIndexTraversals should probably test this somehow 🙂

@natecook1000
Member

Either way, this might be more suitable as a follow-up PR since this PR seems to be good to go, and it seems like the LazyChunked collection could benefit from this as well?

Agreed, this is a cool change, but let's get this integrated first.

Member

@natecook1000 left a comment


Looks good!

@LucianoPAlmeida
Contributor Author

@natecook1000 Sorry, I've sent some updates; can you give it another look? :)

@natecook1000 merged commit 9836a4f into apple:main on Jan 16, 2021
@LucianoPAlmeida deleted the chunked-size branch on January 16, 2021 11:33
@CTMacUser
Contributor

Either way, this might be more suitable as a follow-up PR since this PR seems to be good to go, and it seems like the LazyChunked collection could benefit from this as well?

Since your post, this PR has been merged and its branch deleted, so I have to figure out how to resubmit the changes.

I forgot that a collection and its SubSequence must use the same Index type.

Thanks for pointing this out, I'm making a note that validateIndexTraversals should probably test this somehow 🙂

It's a protocol requirement; the compiler will block attempts for SubSequence to use a different Index type by reporting non-conformance. So you wouldn't need an extra test.
