Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Iteration protocol change #25261

Merged
merged 2 commits into from
May 18, 2018
Merged

Iteration protocol change #25261

merged 2 commits into from
May 18, 2018

Conversation

Keno
Copy link
Member

@Keno Keno commented Dec 23, 2017

I imported @vtjnash's lowering change and then went ahead and rewrote the base iterator wrappers to use the new iteration protocol instead so people can see it in action. This passes the iterators test, but I haven't run other tests yet, so I suspect some fail.

The strategy here basically what is described in #18823, with the exception of using unions rather than nullables. Additionally, I kept done around as an optional addition (the semantic guarantee
is that if done is not an error, then it determines whether or not the next call to iterate will return true). This was necessary, because the isempty function was derived from the iteration protocol previously, which is no longer possible with the new code. However, after playing with this, it is not actually clear to me that any use of done other than to implement isempty is useful with the new iteration protocol, so maybe the correct thing here is just to have iterators optionally implement isempty just as they would length now (for iterators where this is determinable).

Couple of things on my todo list:

  • Fix other tests
  • Write documentation for new iteration protocol
  • Add lookahead iterator that adds a reliable done method to any iterator by prefetching an element and keeping it in the state (essentially the generic version of the trick people were using to implement done for such iterators before)

There was also some performance concerns needed to be looked at (and @JeffBezanson and @vtjnash wanted to help out with), so hopefully this branch can help with that.

Comments and feedback greatly appreciated (again though, WIP, not supposed to pass tests, etc, so there's probably bugs, but hopefully this is useful to get a feel for what this change does).

@Keno
Copy link
Member Author

Keno commented Dec 23, 2017

Another thing to discuss is whether the absence of start in this proposal is an advantage or a disadvantage. In essence, this proposal uses the absence of a state argument to iterate as way to write the separate method based on the observation that for simple iterators, the initial state argument is often trivial:

iterate(x::Array, i=1) = i > length(x) ? nothing : (a[i], i+1)

However, it has the distinct disadvantage that it can't carry any state. I didn't appreciate this aspect before (since it's sort of implicit in our current iteration protocol), but there are certain iterators that want to do some pre-processing at the beginning of an iteration, e.g. imagine:

struct EachLineFile; path::String; end
start(f::File) = (itr = EachLine(open(f.path)); (itr, start(itr)))
iterate(f::File, state=start(f)) = iterate(state...)

which you can of course do just as above for the first argument, but I think it might validly be a problem for separation of concerns. More practical considerations might be something like #22467 (which may also be the case for the EachLineFile operator above depending how you implement it).

On the other hand, there may be iterators for which constructing a valid, type-stable state object is not possible without obtaining the element.

I think the best solution I can currently imagine is making the iteration protocol something like the following:

y = iterate(coll, start(coll))
while y !== nothing
    val, state = y
    # body
    y = iterate(coll, state)
end

this has the advantage that it's file if the return type of start is different from that of the regular state object, because they never get assigned to the same variable.

We could even do something like:

start(x::Any) = start
iterate(x::Any, ::typeof(start)) = iterate(x)

to preserve the nice one liner above for simple iterators, but I worry about whether that'll be too complicated.

@Keno
Copy link
Member Author

Keno commented Dec 23, 2017

Another possibility:

it = prepare(coll)
y = iterate(it)
while y !== nothing
    val, state = y
    # body
    y = iterate(it, state)
end

with

prepare(x::Any) = x

@nalimilan
Copy link
Member

nalimilan commented Dec 23, 2017

Leaving breadcrumbs to ec91f63 from PR #23642, where I had to introduce a MaybeValue hack to ProductIterator to avoid the performance impact of returning a value equal to nothing when there's no value to return. It would be nice to be able to get rid of it. EDIT: I've just seen you had removed it, great! Nanosoldier covers this, so it should be easy to check whether or to what extent performance suffers.

Probably related: @davidanthoff discussed the case of iterators which need to allocate a resource on the first iteration, see #22466.

@davidanthoff
Copy link
Contributor

I think the prepare idea might allow me to get rid of the whole getiterator mess in https://github.com/davidanthoff/IteratorInterfaceExtensions.jl for Query.jl and IterableTables.jl. I'm not sure, need to think about it a bit more, but right now I'm pretty positive. If that was the case, it would be a major, major benefit and simplification for that ecosystem. If we also could solve the resource management issue, boy, would that be wonderful.

I hope this PR will stay open for a while? With the holidays it might be tricky to find the time needed to think about this... (I plan to eat, eat and eat the next couple of days, and do nothing else).

@timholy
Copy link
Member

timholy commented Dec 26, 2017

I had a brief look and it seemed pretty reasonable to me. It might be interesting to try to implement the CartesianIndices iteration "natively" (using iterate) rather than relying on its start/next/done methods. Given my interest in seeing #9080 put to bed, I can take a crack at that if you don't have time.

With regards to prepare, your strategy seems quite reasonable. I'd slightly favor the name iterator(x) over prepare(x), as with the latter it's not entirely clear what it's being prepared for. For example, if x is a file handle, I could imagine several different kinds of preparation depending on what you're planning to do with that file.

Thanks so much for doing this!

@nalimilan
Copy link
Member

I agree with @timholy it would make sense to use iterator instead of prepare. That's also really close to getiterator from IteratorInterfaceExtensions, which was indeed general enough to possibly warrant inclusion in Base when discussing the registration of the package (JuliaLang/METADATA.jl#11692). The poster child for this is calling iterator on a data frame to get an iterator encoding information on the column types even though the data frame itself doesn't. In that case, we don't really "prepare" anything, we just return an iterator which will ensure type stability of the iterations.

@Keno
Copy link
Member Author

Keno commented Dec 26, 2017

I have more changes locally, and I'm working on finishing up the PR without either of the proposed extensions (since they're extensions, it's fairly easy to do either in addition on top). However, having spent some more time thinking about this, I'm not sure I'm convinced that it is useful or necessary. It seems always possible to encode whatever information into the state object. I guess the fundamental question is what can you do with the state other than pass it back to iterate. For some iterators, we generally consider the ability to copy the state and start iterating at the same point again. However, whether that works for a given iterator is somewhat ill specified at this point. I'll keep thinking about this and finish up the PR in the mean time, but it's not clear to me.

@timholy
Copy link
Member

timholy commented Dec 26, 2017

I agree that it seems like it should be able to bundle everything into start. It's been ages since I've read the issue about "iterators that don't know they're done until they try," but it seems like it should be possible to try for the first object and stash it in state---the same union-splitting that makes this PR possible should also make that possible.

@Keno
Copy link
Member Author

Keno commented Dec 27, 2017

On slack, @vtjnash and I came up with the following proposed lowering:

it = iterator(coll)
advance = iterate(it...)
while advance !== nothing
   f(head(advance))
   advance = iterate(tail(advance)...)
end

@vtjnash also proposes the following extension to handle resource cleanup:

it = iterator(x)
hascleanup = isapplicable(cleanup, it)
if hascleanup
    let isclean = Atomic(false)
        finalizer( it -> (atomic_set!(isclean, true) || cleanup(it)), it)
    end
end
while true
...
end
hascleanup && cleanup(it...) # auto-inserted for any break or return

There's a couple of things to like about this. With a default implementation of

iterator(x::Any) = (x,)

this reduces to the original proposal in this PR for simple iterators. An implementation like,

iterator(x::MyIt) = (x, 1)

could be used to give initial state if necessary (or that could still be put into the iterate function itself).
For an iterator that's self-mutating, we could just have

function iterate(x::SelfMutatingIterator)
    mutate!(x)
    eof(x) && return nothing
    (x.val, x)
end

which I think nicely expresses the notion that such an iterator doesn't really have state.

@JeffBezanson JeffBezanson added the triage This should be discussed on a triage call label Dec 28, 2017
@Keno
Copy link
Member Author

Keno commented Dec 28, 2017

Slight revision from triage:

it = iterator(coll)
advance = iterate(it...)
while advance !== nothing
   f(head(advance))
   advance = iterate(first(it), tail(advance)...)
end

@vtjnash
Copy link
Member

vtjnash commented Dec 28, 2017

That seems to have a typo – what is first(it)? (probably my fault for writing the original wrong) Here's the fixed version:

it = iterate(iterator(coll)...) # NOTE: this perhaps should simplify to just `iterate(coll)`
while it !== nothing
   do(head(it))
   it = iterate(tail(it)...)
end

@Keno
Copy link
Member Author

Keno commented Dec 29, 2017

No, it's not typo. It's the first element of the tuple returned by iterator.

@JeffBezanson
Copy link
Member

I notice a small readability issue: previously we could write value, state = next(itr, state), but that now becomes x = iterate(itr, x[2]); x[1], x[2]. Maybe we could return a named tuple, so that this can be written with x.value, x.state?

@@ -435,9 +424,9 @@ julia> collect(Iterators.rest([1,2,3,4], 2))
```
"""
rest(itr,state) = Rest(itr,state)
rest(itr) = itr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit confusing since often the functions first and rest are synonyms for head and tail, i.e. rest(x) drops the first element.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to drop this method, though it is useful (see the use in _totuple).

@StefanKarpinski
Copy link
Member

Maybe we could return a named tuple, so that this can be written with x.value, x.state?

Why not use a real struct type then? IterationPair or IterationValueState or something.

@vtjnash
Copy link
Member

vtjnash commented Dec 31, 2017

The ergonomics of not using a Tuple are pretty terrible for the Method signature. I don't want to write:

(iterate(x, state)::IterationPair) = (x[state], state + 1)

Have multiple ways of accessing the return value (x[1], first(x), x.value) also likely means test coverage will be worse.

@vtjnash
Copy link
Member

vtjnash commented Dec 31, 2017

If we want to recover / preserve the existing structure, we could provide something like:

val, state = @iterate (x, state) || return

But this is just an example of normal Union{T, Nothing} handling, so it'll benefit from whatever special syntax we eventually provide for that:

val, state = @nullable iterate(x, state) || return
val, state = iterate?(x, state) ||? return
etc., TBD, ...

i, state = next(destiter, state)
dest[i] = x
y == nothing &&
throw(ArgumentError(string("source has fewer elements than required")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi line conditionals should probably use if

Also, why are you calling string on a String literal?

x2, state = next(a, state)
done(a, state) && return hash(x2, hash(x1, h))
y1 = iterate(a)
y1 == nothing && return h
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

===

base/array.jl Outdated

for acurr in a
if f(acurr)
a[i] = acurr
i, state = next(idx, state)
y = iterate(idx, state)
y == nothing && (i += 1; break)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple statements should probably be on multiple lines, but in this case, just assert that the number of indexes and the number of elements are matched:

@assert (y::Tuple{Any, Any}) !== nothing

# promote_shape guarantees that A and B have the
# same iteration space
while ay !== nothing
@inbounds r[ri] = ($f)(ay[1], by[1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this annotation is on the right expression. We just want it on the r[ri], right?

base/dict.jl Outdated
@@ -143,7 +143,7 @@ function Dict(kv)
try
dict_with_eltype((K, V) -> Dict{K, V}, kv, eltype(kv))
catch e
if !applicable(start, kv) || !all(x->isa(x,Union{Tuple,Pair}),kv)
if (!applicable(start, kv) && !applicable(iterate, kv)) || !all(x->isa(x,Union{Tuple,Pair}),kv)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iterate is always applicable (fallback to calling start), so this will just have to be broken now perhaps (or improved to actually look at the error?)

while true
for c in completions
(i > endof(c) || c[i] != cc) && return ret
end
ret = string(ret, cc)
i >= endof(c1) && return ret
i = nexti
cc, nexti = next(c1, i)
cc, nexti = c1, i+1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i+1 isn’t a valid string index.

base/sysimg.jl Outdated
@@ -443,7 +443,7 @@ init_load_path(ccall(:jl_get_julia_bindir, Any, ()))

INCLUDE_STATE = 3 # include = include_relative

import Base64
#import Base64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated?

base/sysimg.jl Outdated
# @deprecate_binding Profile root_module(:Profile) true ", run `using Profile` instead"
# @deprecate_binding Dates root_module(:Dates) true ", run `using Dates` instead"
# @deprecate_binding Distributed root_module(:Distributed) true ", run `using Distributed` instead"
# end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated?

`(block
;; *** either this or force all for loop vars local
,.(map (lambda (r) `(local ,r))
(lhs-vars (cadr (car ranges))))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indent

@@ -2269,7 +2274,7 @@
(cdr (cadr e))
(list (cadr e))))
(first #t))
(expand-for (if first 'while 'inner-while)
(expand-for first
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop lowering code for inner-while also (a few lines above here)

strip_iter_union(X::Type{Union{Nothing, T}}) where {T} = T
strip_iter_union(::Type{Nothing}) = Nothing
strip_iter_union(::Type{Union{}}) = Union{}
strip_iter_union(T) = T
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this'll be very slow (because of compilation), and possibly unreliable (because T = Union{Nothing, T} is also a valid solution), use typesubtract instead

head[i] = y[1]
while i < n
y = iterate(c, y[2])
y == nothing && return (resize!(head, i), ())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you making this intentionally type-unstable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frankly I just wanted to get the tests to pass so I could start looking at the performance, though I don't think it matters too much since it's only used in pmap, which is already fairly complex.

@Keno Keno force-pushed the kf/iterate branch 2 times, most recently from 75eeebb to 9f5c04f Compare January 1, 2018 19:22
@@ -1979,6 +1985,10 @@ function hash(a::AbstractArray{T}, h::UInt) where T
if isa(a, AbstractVector) && applicable(-, x2, x1)
n = 1
local step, laststep, laststate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

laststate isn't used anymore.

@Keno
Copy link
Member Author

Keno commented Jan 5, 2018

After much triaging, it seems like the favored approach is to do this changed as originally proposed in this PR. As of yet unresolved is whether to add a canonical iterator hook, though given the absence of splatting in this iteration protocol, that wouldn't be allowed to specify an initial state.

@Keno
Copy link
Member Author

Keno commented May 17, 2018

Planning to merge once CI passes.
One final go for good measure:

@nanosoldier runbenchmarks(ALL, vs=":master")

@vtjnash
Copy link
Member

vtjnash commented May 17, 2018

I think ideally, we should deal with bug-fixes from merging SSAIR found by PkgDev (https://pkg.julialang.org/pulse.html) before we add a breaking change on top

@Keno
Copy link
Member Author

Keno commented May 17, 2018

You'll have to convince people to delay tagging the alpha for that reason.

@Keno
Copy link
Member Author

Keno commented May 17, 2018

Test failure is LLVM failing to vectorize because it doesn't like our loop new structure (early exit at the top). Shouldn't be too hard to fix. Will take a look in the morning. Alternatively, we may want to change the lowering after all to make the loop structure more obvious to LLVM.

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@Keno
Copy link
Member Author

Keno commented May 17, 2018

Fixing LLVM was too hard, I updated this to use the loop inverted lowering, which should fix the test case.

@nanosoldier runbenchmarks(ALL, vs=":master")

Keno and others added 2 commits May 17, 2018 21:11
This changes the iteration protocol from `start`/`next`/`done` to `iterate`.
The new lowering of a for loop is as follows:

```
for x in itr
    ...
end
```

becomes

```
next = iterate(itr)
while next !== nothing
    x, state = next::Tuple{Any, Any}
    ...
    next = iterate(itr, state)
end
```

The semantics are as apparent from the above lowering. `iterate` returns
either `nothing` or a tuple of value and state. The state is passed
to any subsequent operation. The first iteration is indicated, by not passing
the second, state argument to the `iterate` method.

Adaptors in both directions are provided to keep the legacy iteration
protocol working for now. However, performance of the legacy iteration
protocol will be severely pessimized.

As an optional add-on for mutable iterators, a new `isdone` function is
provided. This function is intended as an O(1) approximate query for
iterator completion, where such a calculation is possible without mutation
and/or is significantly faster than attempting to obtain the element itself.
The function makes use of 3-value logic. `missing` is always an acceptable
answer, in which case the caller should go ahead and attempt the iteration
to obtain a definite result. If the result is not `missing`, it must be
exact (i.e. if true, the next call to iterate must return `nothing`, if
false it must not return nothing).
The primary idea of the new iteration protocol is that for
a function like:
```
function iterate(itr)
   done(itr) ? nothing : next(itr)
end
```
we can fuse the `done` comparison into the loop condition and
recover the same loop structure we had before (while retaining
the flexibility of not requiring the done function to be separate),
i.e. for
```
y = iterate(itr)
y === nothing && break
```
we want to have after inlining and early optimization:
```
done(itr) && break
y = next(itr)
```
LLVM performs this optimization in jump threading. However, we run
into a problem. At the top of the loop we have:
```
y = iterate
top:
%cond = y === nothing
br i1 %cond, %exit, %loop
....
```
We'd want to thread over the `top` block (this makes sense, since
by the discussion above, we need to merge our condition into the
loop exit condition). However, LLVM (quite sensibly) refuses to
thread over loop headers and since `top` is both a loop header
and a loop exit, we fail to perform the appropriate transformation.

However, there's a simple fix. Instead of emitting a foor loop as
```
y = iterate(itr)
while y !== nothing
    x, state = y
    ...
    y = iterate(itr, state)
end
```
we can emit it as
```
y = iterate(itr)
if y !== nothing
    while true
       x, state = y
       ...
       y = iterate(itr, state)
       y === nothing && break
    end
end
```
This transformation is known as `loop inversion` (or a special
case of `loop rotation`. In our case the primary benefit is
that we can fuse the condition contained in the initial `iterate`
call into the bypass if, which then lets LLVM understand our loop
structure.

Co-authored-by: Jeff Bezanson <jeff@juliacomputing.com>
@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@Keno Keno merged commit aa301aa into master May 18, 2018
@martinholters martinholters deleted the kf/iterate branch May 18, 2018 04:07
@Sacha0
Copy link
Member

Sacha0 commented May 18, 2018

Boom!

@JeffBezanson
Copy link
Member

Do we, uh, have a handle on those 600x regressions?

@Keno
Copy link
Member Author

Keno commented May 18, 2018

Yes, they trigger the deprecation warning for the old iteration protocol.

maleadt added a commit to maleadt/LLVM.jl that referenced this pull request May 18, 2018
maleadt added a commit to JuliaGPU/CUDAdrv.jl that referenced this pull request May 18, 2018
martinholters added a commit to martinholters/DataStructures.jl that referenced this pull request May 18, 2018
LegacyIterationCompat{T, typeof(val), typeof(state)}(val, state)
end

function next(itr::I, state::LegacyIterationCompat{I,T,S}) where {I,T,S}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and similar for done) are real ambiguity magnets for all iterators out there that define next(it::MyIterator, state). Of course, next should never be called with arguments that hit the ambiguity, but detect_ambiguities is not amused. But I suppose there is no easy way to prevent that?

@StefanKarpinski StefanKarpinski added the needs news A NEWS entry is required for this change label Jun 1, 2018
@KristofferC KristofferC removed the needs news A NEWS entry is required for this change label Nov 13, 2018
Keno added a commit that referenced this pull request Jun 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deprecation This change introduces or involves a deprecation iteration Involves iteration or the iteration protocol
Projects
None yet
Development

Successfully merging this pull request may close these issues.