# Julep: solving tricky iteration problems #15648
Thanks a lot for writing the blog post and this issue, I learned a lot from it. I have a question about the proposal.

If `B` is the array whose iterator drives the loop, it seems some of the shape information of the other arrays can get lost. In general, if you can come up with a single coupled iterator, that is attractive, so if possible I would prefer having a single iterator driving the loop. But I am also not sure it is always a good idea to put one iterator in charge of the whole iteration order. How would you, for example, formulate the loop for matrix-matrix multiplication?
You're absolutely right; the intent was to keep all the shape information, and I overlooked the problem you pointed out. I was trying to avoid directly indexing arrays with something weird (the function `?`). Your two points also come together very nicely if one thinks about matrix-matrix (rather than matrix-vector) multiplication. First, you pretty much have to have a syntax that addresses the arrays directly, rather than some "master" iterator. Second, you're right that you probably don't want any iterator in charge. Something like this?

```julia
for (idest, iA, jdest, jB, kA, kB) in couple((dest[:,?], A[:,?]), (dest[?,:], B[?,:]), (A[?,:], B[:,?]))
    dest[idest,jdest] += A[idest,kA]*B[kB,jB]
end
```

which could hopefully be simplified to

```julia
@iterate dest[i,j] += A[i,k]*B[k,j]
```

Here, there's not actually much magic in `@iterate`; the heavy lifting would be in `couple`. My big worry is type stability. In some situations it may be difficult/impossible to make the whole thing type-stable (though all the individual pieces might be), in which case one could hide the loop behind a function barrier:

```julia
function foo!(dest, A, B)
    iters = couple(...)
    _foo!(dest, A, B, iters)
end

@noinline function _foo!(dest, A, B, iters)
    for (idest, iA, jdest, jB, kA, kB) in iters
        ...
    end
end
```

However, for the variant that defines the kernel as an inner function,

```julia
function foo!(dest, A, B)
    iters = couple(...)
    @noinline function _foo!(dest, A, B, iters)
        for (idest, iA, jdest, jB, kA, kB) in iters
            ...
        end
    end
    _foo!(dest, A, B, iters)
end
```

I haven't yet tested whether `@noinline` even applies.
Ooooooh. This is now awfully close to an abstract index notation, which might look something like

```julia
for magic in joined(dest[:i,:j], A[:i,:k], B[:k,:j])
    ...
end
```

Obviously the syntax for the loop body can be improved, maybe with a macro.

-erik
@eschnett, what you're suggesting is exactly what the `@iterate` macro above was intended to provide.
I was trying to generalize the notation for `couple`. Instead of

```julia
couple((dest[:,?], A[:,?]), (dest[?,:], B[?,:]), (A[?,:], B[:,?]))
```

one could write

```julia
coupled(dest[:i,:j,:k], A[:i,:j,:l], B[:j,:k,:m], C[:m])
```

which allows for more than two indices that can be used in arbitrary order, allowing transposition.
Let's look at it from the other side. It seems there are three things to get right: the overall iteration order, which loop variables each array depends upon, and how each array's iterator state is saved and restored across the nested loops. For example:

```julia
@iterate (i, j, k) begin # specify iteration order
    D[i,j] += A[i,k]*B[k,j]
end

# would expand to:
(sz1, sz2, sz3) = … # get loop lengths and validate array sizes
Ditr = eachindex(D, (1, 2)) # will want Val{(1,2)} or some such
Aitr = eachindex(A, (1, 2))
Bitr = eachindex(B, (2, 1))
Dstate = start(Ditr)
Astate = start(Aitr)
for _i = 1:sz1
    Bstate = start(Bitr) # B isn't dependent upon i, @iterate just restarts it every loop
    for _j = 1:sz2
        (iD, Dstate) = next(Ditr, Dstate) # D isn't dependent on k, so it hoists it
        Astate′ = Astate # A isn't dependent upon j, so we jump back to where we were
        for _k = 1:sz3
            (iB, Bstate) = next(Bitr, Bstate)
            (iA, Astate′) = next(Aitr, Astate′)
            D[iD] += A[iA]*B[iB]
        end
    end
    Astate = Astate′
end
```

I believe this is sufficiently general to handle the cases raised above. I'm sure you're aware, Tim, but I'd like to also point to TensorOperations.jl, which I think does some of this already.
@eschnett, the reason I think you need to set this up in a type-stable manner is that some array types aren't efficiently indexed by an integer index.

@mbauman, I suspect you're right about some of the challenges. What you're describing is basically how the macros in KernelTools work currently; I also wrote a nifty little macro there for hoisting loop-invariant expressions. But for generic code (that might have to handle array types the macro author never anticipated), a macro alone can't see the types, so it can't suffice by itself.
@mbauman, in your expansion, is the idea that each array's iterator is concerned only with its own traversal order?
Yes, precisely. The core idea is that each array only needs to concern itself about the order of iteration, and the macro handles the bookkeeping: restarting, hoisting, and saving/restoring iterator state.
I guess the reason I didn't go this way is that I was attracted by the hope of solving the loop-ordering problem. Since the macro can't see the types, I was looking for a way for it to set things up in a form that would be amenable to type analysis.
@timholy I don't understand the connection between the order in which the indices are traversed and the syntax that describes which indices need to be coupled. I agree that sparse (or reshaped, or symmetric where only part of the elements are stored, etc.) arrays require special care. Let's take this loop as an example:

```julia
for (i, j, k) in coupled(r[:i,:j,:k], a[:i,:j,:k], b[:j,:k,:i], c[:k,:i,:j])
    r[i,j,k] = a[i,j,k] + b[j,k,i] + c[k,i,j]
end
```

How would this look in the `couple` notation?
Something like

```julia
iblock = (r[:,?,?], a[:,?,?], b[?,?,:], c[?,:,?])
jblock = (r[?,:,?], a[?,:,?], b[:,?,?], c[?,?,:])
kblock = (r[?,?,:], a[?,?,:], b[?,:,?], c[:,?,?])
```

I.e., you match the indexes with the `:` entries (marking all other dimensions with `?`), one block per loop variable. One of the points perhaps worth making is that I was also looking for an analog of `eachindex`: something that, given the arrays, picks the most efficient iteration order on its own. All this is to simply explain my reasoning; while I'm attracted by solving the loop ordering problem, I recognize that this may not be possible/practical. (I don't know the literature here.)
@timholy Okay; that's a straightforward transformation from the `coupled` notation.
I just want to support @timholy's point that (at least partially) solving the loop-ordering problem would be very nice. Otherwise one would have to accept that generic library functions like `copy!` may be slow for some combinations of argument types.
Perhaps I'm most interested in a stepwise approach: if possible, pick an API that doesn't prevent us from tackling this someday, but punt on loop-ordering for now. Revamping how we do array iteration is a big problem no matter how we slice it, and we might make faster progress by first choosing the smallest feasible unit and getting that working.
I think it still may be possible to do something about the loop-ordering problem, but it's a bear. To do so, we'll need to use a bunch of flat iterators. Let's take a step back from syntax and type-stability, and just look at what kinds of iterators are required for a flat matrix multiplication loop:

```julia
for i = 1:ni, j = 1:nj, k = 1:nk
    D[i,j] += A[i,k]*B[k,j]
end
```

Consider a vector that's been reshaped into a matrix. How many different iterators does this type need to implement in order to efficiently participate as either `D`, `A`, or `B` in this loop?
I think this demonstrates that there are two orthogonal concerns here. As far as the individual array goes, it's really only concerned about traversal ordering. So what kind of information do we need to dynamically construct those iterators with full generality? An `eachindex` that is told the requested traversal order up front, something like `eachindex(A, Val{(2,1)})`, so that each array can return its most efficient iterator for that ordering.
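A rough sketch of what such an order-aware iterator might look like for a dense matrix (the names here are hypothetical, written against the same `start`/`next`/`done` protocol as the expansion above):

```julia
import Base: start, next, done

# Visits all (i,j) of an m-by-n matrix with dimension order[1] varying fastest;
# e.g. order == (2,1) yields a row-major traversal.
immutable OrderedIndices
    sz::Tuple{Int,Int}
    order::Tuple{Int,Int}
end

eachindex_ordered(A::AbstractMatrix, order) = OrderedIndices(size(A), order)

start(itr::OrderedIndices) = (1, 1)                    # (fast, slow) counters
done(itr::OrderedIndices, s) = s[2] > itr.sz[itr.order[2]]
function next(itr::OrderedIndices, s)
    fast, slow = s
    ij = itr.order == (1, 2) ? (fast, slow) : (slow, fast)
    s′ = fast < itr.sz[itr.order[1]] ? (fast + 1, slow) : (1, slow + 1)
    return ij, s′
end
```

Then `for (i, j) in eachindex_ordered(B, (2, 1))` walks `B` row-major, while a plain `Array` would presumably return a column-major specialization when asked for `(1, 2)`.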
Now with this API, I can finally begin to imagine how the whole scheme could be made to work. I think this is the fully general, fully orthogonal design. If we break this orthogonality, I'm afraid that we'll never stop writing different flavors of `eachindex`.
I was also wondering if we'd need an optimizer. I doubt that's a direction we want to explore, unless it would also be useful for making more rational decisions about automatic inlining (which it surely would be). Punting to Polly, by marking a block or function with a future `@polly`-style annotation, might be another escape hatch.
@mbauman, focusing on the reshaped-vector part of your comment above, a lot of the concerns can be resolved if we have an API for iterating along a particular dimension, aka #15459. This is a different form of orthogonality than you're talking about, and it would resolve the need for many different variants of `eachindex`.

A second issue is that fully-orthogonal iterators may not generalize well to efficient sparse iteration: you might want your sparse array iterator to "jump" to the next stored entry, and you might need to correspondingly advance any coupled iterators. This kind of thinking is what first suggested to me that something like `couple` needs to be in charge of the iteration. I suspect that array-iterators need to be a specialized class, supporting more operations than just `start`/`next`/`done`.
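To make the "jump" idea concrete, here is a minimal sketch (names hypothetical) of a column iterator over a `SparseMatrixCSC` that supports skipping ahead; this is exactly the kind of extra, non-protocol operation a coupled dense iterator would need to mirror:

```julia
import Base: start, next, done

# Iterates the stored entries of one column of a SparseMatrixCSC.
immutable SparseColIter{Tv,Ti}
    A::SparseMatrixCSC{Tv,Ti}
    col::Int
end

start(it::SparseColIter) = Int(it.A.colptr[it.col])   # pointer into rowval/nzval
done(it::SparseColIter, p) = p >= it.A.colptr[it.col+1]
next(it::SparseColIter, p) = ((it.A.rowval[p], it.A.nzval[p]), p + 1)

# The extra operation beyond start/next/done: jump to the first stored entry
# at or after row r, so coupled iterators can be advanced to match.
seekrow(it::SparseColIter, p, r) =
    searchsortedfirst(it.A.rowval, r, p, Int(it.A.colptr[it.col+1]) - 1, Base.Order.Forward)
```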
In my mind you would express the choice between union-style and intersection-style traversal directly in the `couple` call.
Thanks to #13412, I'm noticing that this is also possible:

```julia
@eval foo(::$(typeof(any))) = 1
@eval foo(::$(typeof(all))) = 2
```
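which would let `couple`'s first argument select the traversal semantics by dispatch; a sketch (the iterator type names are hypothetical):

```julia
# Dispatch on the functions `any`/`all` themselves to choose
# union- vs intersection-style traversal of stored entries.
couple(::typeof(any), iters...) = UnionStoredIter(iters)      # visit i if any iter has a value
couple(::typeof(all), iters...) = IntersectStoredIter(iters)  # visit i if all iters have values
```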
To help see what different APIs "feel like" for writing code, as an alternative to endlessly creating gists, I opened ArrayIterationPlayground and invite collaborators. This doesn't (yet) contain any implementations of any API, but there's one file that implements a couple of algorithms using a fictitious API. I'd recommend we keep discussion mostly here, though, since it's presumably of general interest.
The design space here is just ginormous even before you begin considering sparse arrays. I'm trying to bring a dose of reality to all the magic that's flying around. :)

My knee-jerk reaction is that trying to extract a single cartesian index out of a given iterator to share with another array will prove to be intractable to code generically. There are two operations that may or may not be possible: `eachindex` iterators may or may not be able to quickly compute the cartesian indices, and the array that index is shared with may or may not be able to use a cartesian index (especially if you consider OffsetArrays, but I'm still not convinced we should support them without requiring a special offset index type). So you've got to support independent iteration no matter what. I don't think you want to express this as one iterator indexing into another. As far as the rest goes, I'm less sure. All that said, I'd be very happy to be proven wrong.
Reality is good. I agree that the challenge here is to define the impactful contribution that's feasible and doesn't prevent more progress by future generations.

The implementation of `?` could be as simple as

```julia
immutable AnyEntry end
const ? = AnyEntry()

# Typically AA<:AbstractArray, but not sure I yet see a reason to require that
immutable DimSelection{AA,D<:Tuple{Vararg{Union{Colon,AnyEntry}}}}
    A::AA
end

@inline getindex(A, I::Union{Colon,AnyEntry}...) = DimSelection{typeof(A), typeof(I)}(A)
```

which is just basically a syntax for passing it on, as you say. I'm not sure whether it's a readability-enhancer, however, and the explicit `couple` call may end up clearer. And yes, writing all of this in a type-stable way will be a challenge.
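For concreteness, usage would look something like this (a sketch; note that `?` is not actually a legal identifier today, so real code would spell it `ANY_ENTRY` or similar):

```julia
A = rand(3, 4)
sel = getindex(A, Colon(), AnyEntry())   # i.e., A[:, ?]
# sel isa DimSelection{Array{Float64,2}, Tuple{Colon, AnyEntry}}
```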
To take out a little of the magic, I'm currently imagining something vaguely like this:

```julia
immutable CoupledIterator{I,F<:Tuple{Vararg{Function}}}
    iter::I
    itemfuns::F
end
```

This has a "master" iterator plus a tuple of functions that extract each coupled item from the master's state:

```julia
next(iter::CoupledIterator, state) = map(f -> f(state), iter.itemfuns), next(iter.iter, state)
```

or something. That keeps all the per-array logic in small, dispatchable pieces. As an example, for sparse matrix-vector multiplication

```julia
function couple(::typeof(all), iter::SparseCSCColumnIterator, v::AbstractVector)
    CoupledIterator(iter, (state -> state, state -> state.row))
end
```

This doesn't work out all the details, but it suggests that the needed specializations can be expressed with ordinary dispatch. Now, if only the rest were as easy. Does that help? Or did you already read my mind (so this is all old news to you) and then see another few miles ahead to big scary roadblocks?
It does help some, yes. I had been thinking you were prioritizing reshaped and offset arrays over sparse ones.

I think the coupling could also carry the dimension correspondence in the signature:

```julia
couple(iter, ::Union{Val{:row}, Val{:col}}...) # or dim numbers, or in the value domain, doesn't matter
```

Now you can also get the column index for the destination vector. Where I get all fuzzy is how to extend this to algorithms like gemm, where there's no one array in charge of the entire triplet of loops. Or in allowing the slave arrays to have a say in the kind of iterator they're indexed by. It's those cases where I think I can see a combinatorial explosion of methods with crazy ambiguity problems coming over the horizon.
OK, over at ArrayIterationPlayground I've written out a number of algorithms in a proposed API, and that experience seems to suggest we're converging. (It's possible I wasn't consistent, as my API plans evolved as I tried new things.) I found that this API may suffice to implement things as complicated as sparse matrix operations.

Here's a summary of where I am now:

```julia
inds(A, d) # index vector for dimension d (default 1:size(A, d))
inds(A)    # tuple-of-inds

zeros(Int, -3:3, -1:5)    # creates a 7-by-7 matrix with the given inds
fill(val, indexes...)     # likewise
a[first+1:(last+first)÷2] # copy (or view) of the first half of array, skipping the first element

# `index` and `stored` return "indexing hints," the laziest of wrappers
index(A, :, j)          # lazy-wrapper indicating that one wants indexes associated with column j of A
stored(A, :, j)         # just the stored values of A in column j
index(stored(A, :, j))  # just the row-indexes of the stored values in column j
index(A, :, ?)          # row-index iterator for an arbitrary (unknown) column of A
index(A, Dimension{2})  # similar to index(A, ?, :, ?...)

# Likely: (notice this is different from sub, because this keeps original indexes)
index(A, first+1:last)    # would iterate over 2:length(A) for a typical Array
index(A, first+1:last, j) # like above, but just over indexes for dimension 1

couple(iter1, iter2) # iterates over iter1, iter2 containers, keeping them in sync
# Do we want/need this? I suspect not (default would be `any`)
couple(any, stored(iter1), stored(iter2)) # visits i if either iter1[i] or iter2[i] has a value
couple(all, stored(iter1), stored(iter2)) # visits i if both iter1[i] and iter2[i] have values

icat(a, b) # index-preserving concatenation
# For example:
# icat(7:10, 3): 7:10 is indexed with 1:4, so this creates a vector indexed from 1:5 (numbers aren't tied to an index)
# icat(3, 7:10): creates a vector indexed from 0:4
# icat(5:7, 2:4): an error, because they have overlapping indexes 1:3
# icat(5:7, OffsetArray(2:4, 4:6)): indexed from 1:6
# icat(5:7, OffsetArray(2:4, 5:7)): an error, non-contiguous indexes
```

Other than these pieces, not much is settled. I've got a lot of teaching to do in the immediate future, so am unlikely to make much further progress for a while. But unless I hear otherwise I may slowly start implementing pieces of this. It's a lot of new exports, so obviously this deserves some thought.
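To get a feel for how these compose, a sparse accumulation might read like this in the proposed (entirely unimplemented) API:

```julia
# A .+= B, visiting only indexes stored in either array:
function accumulate!(A, B)
    for (iA, iB) in couple(any, stored(A), stored(B))
        A[iA] += B[iB]
    end
    return A
end
```

with `couple(all, ...)` instead giving the intersection-style traversal appropriate to multiplication.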
This looks interesting. I'll probably implement them as well for https://github.com/eschnett/FlexibleArrays.jl. Painting the bike shed: some of the names might deserve a second look.
Thanks for reminding me about FlexibleArrays. Anyway, on the bike shed: I agree the naming deserves more thought.
I'm not sold on the complexity of some of these wrappers. I'm fairly certain part of this is just special syntax for operations we already have. My sense for `copy!` is that it should be an iteration over the keys of `src`, writing into `dest`, which is specifically not an iteration over `dest` (except in the common array case where the keys of `dest` are the same as the keys of `src`).
The main place you need something like `couple` is when no single container can drive the loop by itself. Yes, now that I think arrays should be thought of as associative containers, I agree that's the right meaning of `copy!`.
I'm not against that, I'm only questioning whether it needs to be part of this proposal.
I should have clarified what I meant there. Regarding the rest: unless you're proposing something more sweeping, I think we're in agreement.
I also edited the top post to try to reduce confusion from the fact that the API proposal is being hammered out here, and hence the top post isn't very representative of my current thinking.
I haven't been much involved in the various stages of iterator development, but looking at #1032 sparked an idea: what if there was a callback function that each iterator gets passed to before `getindex`/`setindex!`? This would abstract away the problem of iterators needing to know about things like whether the array indices start at 1, 0 or some other number. The iterator can say "I point to the 3rd element in the 5th row" and the callback function is responsible for transforming this into a value that `getindex`/`setindex!` will accept. This could solve several of the problems discussed above.

The generic fallbacks for this callback function would define the "standard" array implementation (1-based indexing etc.), but users could provide specific methods for new array types.
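A minimal sketch of the idea (all names hypothetical; `ZeroBased` stands in for any exotic indexing scheme):

```julia
# The iterator speaks in logical positions ("the 3rd element of the 5th row");
# toindex translates a position into whatever getindex/setindex! accepts.
toindex(A::AbstractArray, pos) = pos                 # fallback: ordinary 1-based arrays

immutable ZeroBased{AA<:AbstractArray}               # hypothetical 0-based wrapper
    A::AA
end
toindex(Z::ZeroBased, pos) = map(p -> p - 1, pos)    # shift into its index space

# Generic code then always routes accesses through the translation:
# for pos in eachpos(A)      # hypothetical logical-position iterator
#     A[toindex(A, pos)...] = 0
# end
```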
That's basically what the status quo is with cartesian indices. In fact, when you index an `Array` with a `CartesianIndex`, it is effectively translated into the linear offset the array actually uses. That's not true, though, when that translation isn't cheap or even well-defined. The trouble is that you don't always want to iterate over all indices, and not always in column-major order. If you're able to describe the access pattern up-front when you request the index iterator, then it allows for cleverness of this nature, even in more complicated nested loops with multiple arrays. And if we're able to always ask arrays for index iterators, then maybe this can safely allow offset arrays to define non-standard indices.
I see, thanks for the clear explanation.
Only if you want to avoid performance regressions. If you don't care about that, they're really easy 😄.
I came across @Jutho's TensorOperations.jl. The correspondence between the APIs proposed here and the one in TensorOperations is fairly clear. As far as the DSL part of the solution to this issue goes, TensorOperations' DSL seems to be good at expressing loop-ridden algorithms without committing to any implementation scheme, or even to what the allowed indices are. Its README also describes a clever implementation strategy: a cache-friendly divide-and-conquer scheme that recursively splits the index ranges into blocks.

I think parallel implementations can exploit facilities of this divide-and-conquer strategy as well. Food for thought.
Perhaps I am way off base on this one, but it seems to me that in its most general form this is an attempt to re-invent database joins, especially in the case of sparse matrices. And it has been known for some time that for large numbers of indexes (i.e., rows) it is hard to beat hash tables on the specific (fields/columns) indexes being joined. Essentially, sparse matrix multiplication is a form of inner join, and sparse matrix addition is a form of outer join. Put another way: while it may seem you want to march both arrays in some "simultaneous" manner, in the most general case, where you give up all prior knowledge of the memory layout, you would want to read one array in the most efficient linear manner possible, and then the other array, while using a pair of hashes that are dual to one another (e.g., hash 1 maps in column-major order and hash 2 maps in row-major order).
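For concreteness, a hash-join flavored sparse dot product might look like this (a sketch, not tuned; `sparsedot` is an invented name):

```julia
using SparseArrays  # on newer Julia; sparse types lived in Base when this was written

# "Inner join" of two sparse vectors: hash the smaller operand's stored
# indexes, then stream linearly through the other one.
function sparsedot(x::SparseVector, y::SparseVector)
    a, b = nnz(x) <= nnz(y) ? (x, y) : (y, x)
    lookup = Dict(zip(a.nzind, a.nzval))
    s = zero(eltype(x)) * zero(eltype(y))
    for (i, v) in zip(b.nzind, b.nzval)
        w = get(lookup, i, nothing)
        w === nothing || (s += w * v)
    end
    return s
end
```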
Ref. the recent sparse `map`/`broadcast` work (which found dedicated iteration over `SparseVector`s and `SparseMatrixCSC`s sufficient for the purposes of sparse `map` and `broadcast`). Best!
Is this issue still relevant after the revision of the iteration protocol?
Skimming quickly, I don't think it had anything to do with the protocol, but some of it may have been enabled by other changes in the meantime. I don't think we need an issue open though, if there's nothing to change specifically.
(Oops, I've been preparing to post a possible new solution here and then noticed that the issue is closed. But please allow me to continue the discussion; I think "how to use the best/a good-enough iteration strategy for a given set of containers?" is still an issue.)

I'd like to suggest an approach to this problem "dual" to the iterator-based APIs: invert the control, and let a `foreach`-style entry point run the loop on the containers' behalf:

```julia
using NDReducibles: ndreducible
using Referenceables: referenceable

# Implementing C += A * B
foreach(
    # Create a "virtual container" (reducible) where the indices are coupled
    # based on the specification below:
    ndreducible(
        referenceable(C) => (:i, :j), # mark that we want to mutate C
        A => (:i, :k),
        B => (:k, :j),
    ),
) do (c, a, b)
    c[] += a * b
end
```

This tries to choose the best nesting order of the loops based on the properties of the given arrays. This API is based on Transducers.jl's `foreach`. One point of view is that this API is a generalization of the broadcasting API; that is to say, the explicit index-coupling specification subsumes the implicit elementwise coupling of `broadcast`. [1]

[1] The last few points are actually doable within the broadcasting framework as well.
## The problem (a brief summary)

In a PR for a blog post, I described a number of challenges for several potential new `AbstractArray` types. Very briefly, the types and the challenges they exhibit are:

- `ReshapedArrays`: the only efficient way of indexing some types is by indexing the parent array directly (i.e., with an iterator optimized for the pre-reshaping array). This means that indexing a 2D array as `B[i,j]`, for integer `i` and `j`, may not always be efficient.
- `TransposedMatrix` (for "Taking vector transposes seriously", #4774): iteration is most efficient when it produces a cache-friendly sequence of array accesses. Unfortunately, algorithms that mix column-major and row-major arrays imply that we can't naively allow `eachindex(A)` to always return the most efficient iterator.
- `OffsetArrays` (sometimes called "Fortran arrays"): these violate the assumption that array indexes run from `1:size(A,d)`, and more generally raise the question of how one should match elements of two different arrays.
## Possible APIs

In the blog post, only two concrete algorithms were considered: matrix-vector multiplication and `copy!`. At the risk of being insufficiently general, let's explore possible APIs in terms of these two functions.

EDIT: note that the thoughts on the API are evolving as a consequence of this discussion, and hence the top post isn't very representative of the current thinking. To jump ahead to an updated proposal, see #15648 (comment). There may be further updates even later. The material below is left for reference, but some of it is outdated.
The "raw" API
For the moment, let's worry about correctness and efficiency more than elegance. Correctness trumps all other considerations. Efficiency means that we have to be able to iterate over the matrix in
cache-friendly order. That means that the array
B
has to be "in charge" of the sequence of values and pick its most efficient iterator.I should emphasize that I haven't taken even the first step towards attempting to implement this API. So this is crystal-ball development, trying to imagine what might work while hoping to shake
out likely problems in advance via community feedback. That caveat understood, here's a possible implementation of matrix-vector multiplication:
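As a stand-in, here is a sketch of the shape such an implementation might take (`matvecmul!` and the variable names are invented for illustration; the `RB[*,:]` notation and `eachindex(iter, obj)` are explained next):

```julia
# Sketch only: B picks its most efficient iterator, and dest/v get
# iterators that stay aligned with the corresponding dimensions of RB.
function matvecmul!(dest, B, v)
    RB = eachindex(B)                  # B is "in charge" of the sequence
    Rdest = eachindex(RB[:,*], dest)   # indexes of dest, aligned with dim 1 of RB
    Rv = eachindex(RB[*,:], v)         # indexes of v, aligned with dim 2 of RB
    fill!(dest, 0)
    for (iB, idest, iv) in zip(RB, Rdest, Rv)
        dest[idest] += B[iB]*v[iv]
    end
    return dest
end
```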
`eachindex(iter, obj)` should perform bounds-checking, to ensure that the indexes of `iter` align with `obj` (even handling things like `OffsetArrays`). The notation `RB[*,:]` means "corresponding to the second dimension of `RB`"; interestingly, this expression parses without error, attempting to dispatch to `getindex` with argument types `Tuple{typeof(RB), Function, Colon}`. Hopefully there isn't another pending meaning for indexing with a `Function`.

Because either `idest` or `iv` is constant over the "inner" loop (depending on storage order of `B`), ideally one would like to hoist the corresponding access `dest[idest]` or `v[iv]`. I'd argue that's a job for the compiler, and not something we should have to bake into the API.
zip(X, Y)
does not allowY
to see the state ofX
, it seems quite possible that this could not be made efficient in general (consider the case described forReshapedArray
s, where theefficiently-parametrized iterator for the first dimension depends on the state of the index for the second dimension). If this proves to be the case, we have to abandon
zip
and write this more likewhere
couple
is a new exported function.One interesting thing about this API is that sparse multiplication might be implemented efficiently just by replacing the first line with
so that only "stored" elements of
B
are visited. (@tkelman may have been the first to propose aneachstored
iterator.) It should be pointed out that several recent discussions have addressed some of the complications that arise fromInf
andNaN
, and it remains to be determined whether a version that doesn't ignore these complications can be written with this (or similar) API.Likewise,
copy!
might be implementedAt least for the array types described here, this API seems plausibly capable of generating both correct and (fairly) efficient code. One might ideally want to write generic "tiled" implementations, but that seems beyond the scope of this initial effort.
Making it prettier with "sufficiently smart iteration"
This API is still more complicated than naive iteration. One might possibly be able to hide a lot of this with a magical
@iterate
macro,for matrix-vector multiplication and
for
copy!
. The idea is that the@iterate
macro would expand these expressions into their longer forms above. Note that the first argument to@iterate
specifies which object is "in charge" of thesequence of operations.
@iterate
might be too limited for many things, but may be worth exploring. The KernelTools repository may have some useful tidbits.The text was updated successfully, but these errors were encountered: