
Add a compiler pass for iterate(::CartesianIndices) #35074

Closed

Conversation

Sponsor Member

@timholy timholy commented Mar 11, 2020

This replaces iteration over CartesianIndices{N} with a set of N nested one-dimensional loops. Proximally, this fixes #9080; bafflingly, it is now faster than the manual iteration: (EDIT: in final version they are the same)

function sumcart_iter(A)
    s = zero(eltype(A))
    @expandci for I in CartesianIndices(A)     # @expandci will be explained below
        @inbounds s += A[I]
    end
    s
end

function sumcart_manual(A::AbstractMatrix)
    s = zero(eltype(A))
    ax1, ax2 = axes(A)
    for j in ax2, i in ax1
        @inbounds s += A[i, j]
    end
    s
end

A = rand(4, 4)
using Test
@test sumcart_iter(A) == sumcart_manual(A)
using BenchmarkTools
@btime sumcart_iter($A)
@btime sumcart_manual($A)

yields

  7.910 ns (0 allocations: 0 bytes)
  11.928 ns (0 allocations: 0 bytes)

This makes little sense to me since the point of the PR is to turn sumcart_iter into sumcart_manual, but it's been very consistent across runs. Mind you, I'm not exactly complaining.
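Schematically, the transformation amounts to the following hand-written equivalent for a 2-d array (a sketch only; the real pass rewrites lowered IR, and the names here are illustrative):

```julia
# Hand-written sketch of what the expansion effectively produces for a
# 2-d array; the actual pass operates on lowered IR, not surface syntax.
function sumcart_expanded(A::AbstractMatrix)
    s = zero(eltype(A))
    for j in axes(A, 2)                  # slowest-varying dimension outermost
        for i in axes(A, 1)              # fastest-varying dimension innermost
            I = CartesianIndex(i, j)     # reconstruct the loop variable for the body
            @inbounds s += A[I]
        end
    end
    s
end
```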

The longer-term goal is to pave the way for loop reordering and other transformations. @chriselrod has posted some absolutely mind-bending performance improvements with LoopVectorization, including some I really want in order to make ImageFiltering.jl competitive with OpenCV for operations like blurring. (They've manually written a bunch of hand-SIMDed, loop-unrolled functions. I'd really rather not make our code that ugly, but currently they are just smoking us performance-wise, and that makes me unhappy.) LoopVectorization has devdocs now, and if you're interested you should read them, but briefly it uses a two-stage process:

  1. a DSL that operates during macroexpansion of @avx to encode the operations of the loop into a format that gets enough info into the type system, so that
  2. a generated function can reconstruct enough about the loop from type info alone to set up a cost model for the operations, and then choose the loop ordering and vectorization semantics that give the lowest cost.

It's very sophisticated, but I've wondered whether, in the long run, it might be easier to do this as a compiler pass, where you wouldn't need the two-stage process. So a secondary motivation here was to pave the way and see what that would feel like.

Which brings me to the immediate priority: this doesn't yet work. It's awesome when I implement it as an external pass using #35015; that's where those benchmark numbers above come from. @expandci is in this gist, and it calls methods defined in this PR. It was pretty straightforward to write; the only tricky part was getting the block jumps (the various :goto expressions) pointing to the correct places. However, moving it into Core.Compiler (where one no longer needs @expandci) gives me trouble getting through bootstrap; while compiling this method I get an error which starts like this:

Internal error: encountered unexpected error in runtime:
UndefRefError()
getindex at ./array.jl:786 [inlined]
domsort_ssa! at ./compiler/ssair/slot2ssa.jl:500
construct_ssa! at ./compiler/ssair/slot2ssa.jl:869
slot2reg at ./compiler/ssair/driver.jl:127 [inlined]
run_passes at ./compiler/ssair/driver.jl:134
optimize at ./compiler/optimize.jl:175

The problematic line is here. It doesn't have trouble when I use this as an external pass, even when I use that pass on the same function that's giving trouble. Any ideas?

Sponsor Member Author

@timholy timholy left a comment

A few line comments for context

@@ -400,17 +406,29 @@ function renumber_ir_elements!(body::Vector{Any}, changemap::Vector{Int})
end

function renumber_ir_elements!(body::Vector{Any}, ssachangemap::Vector{Int}, labelchangemap::Vector{Int})
delta = abs(labelchangemap[1])
Sponsor Member Author

Some of the changes here and below are cosmetic (like the i->j renaming I made to make it easier to interpret @show while debugging), but I was a bit worried about the early-exit criterion in this function. Previously it assumed that the sum of all changes wasn't 0; that's valid if you only ever grow the number of lines, but since this might someday be used for deletions I thought it would be better to be a bit more careful.

println(src)
iterator, iteratortype, itervar, bodystart, bodystop, loopexit = extract_iterate(src, id)
@assert iteratortype <: Main.Base.CartesianIndices
# slotname = isa(iterator, Slot) ? Main.Base.string("#", Main.Base.String(src.slotnames[iterator.id])) : Main.Base.String(gensym())
Sponsor Member Author

Would be nice to be able to give things nice names but currently this causes bootstrap problems.

Member

In general Main.Base might not be available during bootstrap?

Sponsor Member Author

This only gets called when it is; see the check in expand_cartesian_loops!(src::CodeInfo).

src.codelocs = whole.codelocs
src.ssavaluetypes = whole.ssavaluetypes
return src
# newsrc = CodeInfo(whole.stmts, whole.codelocs, whole.ssavaluetypes,
Sponsor Member Author

I wasn't sure whether it would be better to do in-place modification, but currently it doesn't seem possible to construct a CodeInfo from Julia. Obviously fixable via ccalls, but I took the lazy way out so far.

@@ -83,6 +83,20 @@ end
# MethodInstance/CodeInfo #
###########################

function CodeInfo(stmts::Vector{Any}, codelocs::Vector{Int32}, ssavaluetypes::Vector{Any},
Sponsor Member Author

Not yet useful.

Member

@vchuravy vchuravy left a comment

Nice! Some raw thoughts while reading through this.

  1. My first reaction was that this should operate on the later IRCode/SSAIR, since that is nicer to work with, but this depends on iterate not having been inlined yet.
  2. This effectively elevates CartesianIndices to a compiler construct, as well as implementing the iteration protocol for it in the compiler. I feel like we should be honest about that and move CartesianIndices into the compiler, or, even better, take a good look at MLIR's affine maps, which can express both cartesian iteration and blocked iteration.

For manipulating loops (and especially nested loops), it makes sense
to transiently go back to a nested block structure to avoid lots of
insert! and deleteat! shuffling.

I am strongly in favour of that. I think there are lessons to be learned from MLIR, and we are throwing information away too early. Having a structured IR makes working with loops much easier.

My biggest concern here is that one of the biggest draws of Julia for me was always that things are implemented in Julia itself and that there are very few special constructs. A lot of stuff can just be implemented in "userspace", whereas here we introduce a compiler construct and state: no, users can't implement it :/ This is why I have been searching for solutions that are agnostic to the user-level input description and recover these patterns after the fact (admittedly a rather hard problem).

@timholy
Sponsor Member Author

timholy commented Mar 11, 2020

I'm with you much of the way. I'd rather #9080 got fixed as a by-product of some more general transformation. But that hasn't happened, and given that it's been 6 years perhaps it's time to bring out the big guns. Were it not for the emergence of LoopVectorization I would not have submitted this, as just using the @simd trick to peel off the fastest dimension (#9080 (comment) and my reply below it) works well in practice most of the time. Or even better, switch to foldl as suggested by @tkf. But in my mind LoopVectorization has changed the game, leading me to believe that we want to be able to support some pretty serious code transformation. If simple cases like this can't be handled then it's hard to imagine more complex ones could be.

Another data point: wrapping sumcart_iter in @polly also doesn't fix #9080. I don't know if @polly is just disconnected, but if not then it should be looking at the LLVM code, which of course incorporates inlining all the CartesianIndices operations. If it can't handle this transformation, that's a bit discouraging given how much work went into it. (I used the past tense because it seems much less active than it once was.) Some wise person (was it you?) once said to me that Polly might be too late to apply such transformations; I don't understand that at a really deep level, but I take the point that having more structure can help you understand design/intent/purpose of a block of code. This PR takes that viewpoint to heart and intercepts CartesianIndices at the earliest possible stage.

I wasn't familiar with MLIR, perhaps it would fare better since the affine stuff seems useful in principle.

With regards to doing this transformation on IR code, currently one transformation made by convert_to_ircode is to peel out the meta statements. That's bad in the context of #35015 because you lose positional info. I also just have much less experience working with the SSA form, but with suitable incentive that could of course change. This uses the slots to good advantage to allow one to reassign the loop variable and state on each iteration, presumably you'd need to interact with the phi nodes in SSA form. I can learn to do that if it really would make other things easier. Note that inlining happens after the conversion to ircode so we could run this between the two.

@vchuravy
Member

Another data point: wrapping sumcart_iter in @polly also doesn't fix #9080. I don't know if @polly is just disconnected, but if not then it should be looking at the LLVM code, which of course incorporates inlining all the CartesianIndices operations. If it can't handle this transformation, that's a bit discouraging given how much work went into it. (I used the past tense because it seems much less active than it once was.) Some wise person (was it you?) once said to me that Polly might be too late to apply such transformations; I don't understand that at a really deep level, but I take the point that having more structure can help you understand design/intent/purpose of a block of code. This PR takes that viewpoint to heart and intercepts CartesianIndices at the earliest possible stage.

Yes, I believe much of our infrastructure around @polly is dead code (following the mantra: if it isn't tested, it will regress). I have stared a fair amount at LLVM reasoning about CartesianIndices, and the infrastructure that trips up is SCEV (https://llvm.org/devmtg/2009-10/ScalarEvolutionAndLoopOptimization.pdf); it is unable to determine that there are N induction variables and that some are faster/slower than the others. Instead, after code simplification it sees a select and throws up its hands. By doing your transformation you emit a loop nest, which allows LoopInfo/SCEV in LLVM to reason about the N induction variables separately.
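To illustrate the point about the select (my own sketch, not Base's actual iterate implementation): the scalar increment step for a 2-d CartesianIndices contains a conditional "carry" of roughly this shape, and it is that ternary that lowers to a `select` and hides the separate induction variables from SCEV:

```julia
# Rough sketch of the increment step for a 2-d CartesianIndices
# (next_index is a hypothetical helper, not a Base function). The
# ternary "wrap" updates become `select`s in LLVM IR, which defeats
# SCEV's ability to see two independent induction variables.
function next_index(I::NTuple{2,Int}, stop::NTuple{2,Int})
    i, j = I
    wrap = i == stop[1]          # did the fast dimension hit its end?
    inew = wrap ? 1 : i + 1      # select: reset or increment
    jnew = wrap ? j + 1 : j      # select: carry into the slow dimension
    (inew, jnew)
end
```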

I wasn't familiar with MLIR, perhaps it would fare better since the affine stuff seems useful in principle.

One challenge with working with MLIR is that if you throw away information, you have to recover it; while it is much easier to write language- and domain-specific optimisations there, recovering the structure is the issue that limits Polly's effectiveness.

This uses the slots to good advantage to allow one to reassign the loop variable and state on each iteration, presumably you'd need to interact with the phi nodes in SSA form. I can learn to do that if it really would make other things easier. Note that inlining happens after the conversion to ircode so we could run this between the two.

I personally would rather see one IR format instead of two ;), but the tools to work with either need more work.

@AriMKatz

Could the hypothetical MLIR passes be written in Julia, or would they require C++?

@timholy
Sponsor Member Author

timholy commented Mar 11, 2020

If we use MLIR presumably the IR generation would have to be done in C++.

LoopVectorization seems to play in the same space as MLIR (@vchuravy sent me a link to this paper), and these two are not the only ones (the above-mentioned Polly was the previous effort in LLVM-land, and other optimizers exist for other platforms). Based on the very limited benchmarks they show in that paper, LoopVectorization might even outperform MLIR for matrix multiplication in at least some cases, though it's really hard to tell because so much depends on matrix size and the specific testing hardware. Of course there is more to stencil-computation than just that, so the bigger picture is mostly unknown.

One difference is that MLIR was written by a big Google team, whereas LoopVectorization was written in spare time (AFAIK) by Chris Elrod using Julia 😄. One can react to that in two ways: one rational response is "of course we want to leverage the huge investment made by all those genuinely talented people," and another rational response is "wow, if one amazingly talented person can do the work of a small army using Julia, perhaps we should do more of this in Julia." I think it's too early to tell which would be the more productive path, and I applaud efforts to try both. But I don't think we should be shy about blowing past MLIR if we can really do better writing this stuff in Julia.

@vchuravy
Member

Could the hypothetical MLIR passes be written in Julia, or would they require C++?

LLVM.jl has shown that you can write Julia passes for a C++ infrastructure, but I am not advocating a wholesale import of MLIR into Julia; rather, the ideas and concepts are well worth a hard look.

But I don't think we should be shy about blowing past MLIR if we can really do better writing this stuff in Julia.

MLIR is building a large infrastructure to solve a bigger problem in compiler design, and is exploring interesting domains like affine loops and the representation of parallelism as it goes along. I think we haven't seen its full potential yet. Julia is a more productive language than C++, but to me LoopVectorization.jl is an attempt at solving a small subset of the problems the MLIR infrastructure does. On the other hand, https://arxiv.org/pdf/2003.00532.pdf is a demonstration of how that infrastructure can be used by one talented engineer to write a flexible and generic loop-scheduling infrastructure for matrix multiply. Anyway, this has digressed far enough.

The questions I would like to see answered for this PR are:

  1. Are we okay with moving CartesianIndices into Core.Compiler?
  2. What would it take to maintain Expr(:for) until the inlining stage, so that loop passes are easier to write?

@tkf
Member

tkf commented Mar 11, 2020

Does this patch work (or can it be made to work) with other "nested loops" like Iterators.product-of-AbstractArrays and zip-of-AbstractArrays? If you are going to add a compiler construct, I suppose it's better to make it applicable to many use cases?

2. I feel like we should be honest about that and move CartesianIndices into the compiler, or even better take a good look at MLIR's affine maps which can be used to express cartesian iteration and blocked iteration.

A lot of stuff can just be implemented in "userspace", whereas here we introduce a compiler construct and state: No users can't implement it :/

Aren't they in direct tension? While I do hope the foldl approach will be used to solve "difficult" iteration problems, if the compiler can do more optimizations by knowing the structure of the code, I think it'd be great to have a CartesianIndices-like construct exposed by the compiler.

This is why I have been searching for solutions that are agnostic to the user-level input description and recover these patterns after the fact (admittedly a rather hard issue).

@vchuravy This is yet another digression (sorry, @timholy!), but I'm interested in what you think about a foldl-based approach like #35036, or even lowering for loops to foldl rather than iterate. I think this allows much more flexible communication between the compiler and collection implementers. Perhaps you can comment on this back in #35036, if you feel like it.

@timholy
Sponsor Member Author

timholy commented Mar 11, 2020

Are we okay with moving CartesianIndices into Core.Compiler

It's not a wholesale move into the compiler, in the sense that we need to leave the current code in place for the interpreter, but I agree with the sentiment of your question.

What would it take to maintain Expr(:for) until the inlining stage so that we can write loop-passes easier?

I don't know if that's needed; letting it lower allows us to catch while loops etc., and it's not very hard to detect the iterate calls.

Does this patch work (or can be made to work) with other "nested loops" like Iterators.product-of-AbstractArrays and zip-of-AbstractArrays? If you were to add a compiler construct, I suppose it's better to make it applicable to many use cases?

Can be made to work. Does not currently. It would not be a big change.
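For concreteness, here is the kind of pattern in question (my own example, not from the PR): iteration over an Iterators.product of two axes is semantically a pair of nested one-dimensional loops, so the same expansion strategy would apply.

```julia
# Hypothetical further target for the expansion: product iteration over
# two axes is equivalent to two nested one-dimensional loops.
function sumprod_iter(A::AbstractMatrix)
    s = zero(eltype(A))
    for (i, j) in Iterators.product(axes(A, 1), axes(A, 2))
        @inbounds s += A[i, j]
    end
    s
end
```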

what do you think about foldl-based approach

It's a fine approach. I am uncertain about lowering to foldl in general, I'd defer to others on that point. If you need reviewers I can take a look.

But as you say, this is(?) a bit orthogonal because the real underlying point here is a tentative step towards a world where we can run a full cost-based analysis on different loop orderings, unrollings, SIMDification, etc. Maybe one could write that with foldl in a manner that gets fully elided by the compiler, but that seems like an awfully tall order.

@tkf
Member

tkf commented Mar 11, 2020

But as you say, this is(?) a bit orthogonal because the real underlying point here is a tentative step towards a world where we can run a full cost-based analysis on different loop orderings, unrollings, SIMDification, etc.

I know almost nothing about the compiler internals, but that's my guess too. I think the discussion on iterate/foldl is about how to expose this compiler construct to collection implementers. My guess is that having something like this is beneficial either way.

@timholy timholy added the triage ("This should be discussed on a triage call") label Apr 10, 2020
@timholy
Sponsor Member Author

timholy commented Apr 10, 2020

This now works, and all tests pass locally. Two important changes were (1) when inferring the return type of one-dimensional iterate calls, pass the world age via _return_type rather than calling return_type, and (2) increase the strictness of pattern matching to ensure correctness. For (2), an example of a Base method that doesn't get its iterate(::CartesianIndices) expanded is _unsafe_getindex!, since its usage of iterate is not of a form handled by this PR. Paired with the strictness is some extra care to properly expand multi-iterator for loops like

julia/base/accumulate.jl, lines 391 to 402 at a6a2d26:

@noinline function _accumulaten!(op, B, A, R1, ind, R2, init::Nothing)
# Copy the initial element in each 1d vector along dimension `dim`
ii = first(ind)
@inbounds for J in R2, I in R1
B[I, ii, J] = reduce_first(op, A[I, ii, J])
end
# Accumulate
@inbounds for J in R2, i in first(ind)+1:last(ind), I in R1
B[I, i, J] = op(B[I, i-1, J], A[I, i, J])
end
B
end

where R1 and R2 are CartesianIndices.

I think this has gotten a certain amount of high-level discussion, but I'd be particularly grateful for a line-by-line code review. It might also be good to discuss this in the context of #35015, so I'm marking this for triage. (I'd be happy to join in if someone reminds me how.)

This replaces iteration over CartesianIndices{N} with a set of N nested
one-dimensional loops. Fixes #9080. It also paves the way for
loop-reordering and other transformations.
@chriselrod
Contributor

chriselrod commented Apr 6, 2021

It does SIMD now without @simd:

julia> function sumcart_manual(A::AbstractMatrix)
           s = zero(eltype(A))
           ax1, ax2 = axes(A)
           for j in ax2, i in ax1
               @fastmath @inbounds s += A[i, j]
           end
           s
       end
sumcart_manual (generic function with 1 method)

julia> function sumcart_iter(A)
           s = zero(eltype(A))
           for I in CartesianIndices(A)
               @fastmath @inbounds s += A[I]
           end
           s
       end
sumcart_iter (generic function with 1 method)

julia> A = rand(4,4);

julia> @btime sumcart_iter($A)
  13.953 ns (0 allocations: 0 bytes)
8.557935619365121

julia> @btime sumcart_manual($A)
  13.040 ns (0 allocations: 0 bytes)
8.557935619365121

julia> A = rand(400,4);

julia> @btime sumcart_iter($A)
  128.912 ns (0 allocations: 0 bytes)
794.9728539317515

julia> @btime sumcart_manual($A)
  127.191 ns (0 allocations: 0 bytes)
794.9728539317515
# julia> @cn sumcart_iter(A)
        .text
        push    r15
        push    r14
        push    r12
        push    rbx
        mov     r11, qword ptr [rdi + 24]
        vpxor   xmm0, xmm0, xmm0
        test    r11, r11
        je      L338
        mov     r9, qword ptr [rdi + 32]
        test    r9, r9
        je      L338
        movabs  r10, 9223372036854775806
        mov     rsi, qword ptr [rdi]
        lea     rax, [r11 - 1]
        lea     r8, [r9 - 1]
        lea     r14, [r10 + 1]
        cmp     rax, r10
        cmovb   r14, r11
        lea     rcx, [rsi + 192]
        shl     r11, 3
        mov     rdi, r14
        and     rdi, -32
        lea     rax, [r14 + 1]
        add     rsi, -8
        vpxor   xmm0, xmm0, xmm0
        mov     r12d, 1
        jmp     L146
        nop
L112:
        lea     rdx, [r12 + 1]
        lea     rbx, [r10 + 1]
        cmp     r8, r10
        cmovb   rbx, r9
        add     rcx, r11
        add     rsi, r11
        cmp     r12, rbx
        mov     r12, rdx
        je      L338
L146:
        cmp     r14, 32
        jae     L176
        mov     edx, 1
        jmp     L320
        nop     word ptr cs:[rax + rax]
L176:
        lea     r15, [r10 - 30]
        and     r15, r14
        mov     rdx, r15
        or      rdx, 1
        vmovq   xmm0, xmm0                      # xmm0 = xmm0[0],zero
        vxorpd  xmm1, xmm1, xmm1
        xor     ebx, ebx
        vxorpd  xmm2, xmm2, xmm2
        vxorpd  xmm3, xmm3, xmm3
L208:
        vaddpd  zmm0, zmm0, zmmword ptr [rcx + 8*rbx - 192]
        vaddpd  zmm1, zmm1, zmmword ptr [rcx + 8*rbx - 128]
        vaddpd  zmm2, zmm2, zmmword ptr [rcx + 8*rbx - 64]
        vaddpd  zmm3, zmm3, zmmword ptr [rcx + 8*rbx]
        add     rbx, 32
        cmp     rdi, rbx
        jne     L208
        vaddpd  zmm0, zmm1, zmm0
        vaddpd  zmm0, zmm2, zmm0
        vaddpd  zmm0, zmm3, zmm0
        vextractf64x4   ymm1, zmm0, 1
        vaddpd  zmm0, zmm0, zmm1
        vextractf128    xmm1, ymm0, 1
        vaddpd  xmm0, xmm0, xmm1
        vpermilpd       xmm1, xmm0, 1           # xmm1 = xmm0[1,0]
        vaddsd  xmm0, xmm0, xmm1
        cmp     r14, r15
        je      L112
        nop     word ptr cs:[rax + rax]
L320:
        vaddsd  xmm0, xmm0, qword ptr [rsi + 8*rdx]
        inc     rdx
        cmp     rax, rdx
        jne     L320
        jmp     L112
L338:
        pop     rbx
        pop     r12
        pop     r14
        pop     r15
        vzeroupper
        ret
        nop     dword ptr [rax]

Note that the SIMD version is slower for the 4x4 case.

@KristofferC
Sponsor Member

Since #37829 closes #9080 without the need for a specific compiler pass, I'll close this as not needed anymore. Feel free to reopen if this assessment is wrong.

@KristofferC KristofferC closed this Jun 9, 2021
@vtjnash vtjnash deleted the teh/fix_9080 branch June 9, 2021 14:32
@oscardssmith oscardssmith removed the triage ("This should be discussed on a triage call") label Feb 3, 2022

Successfully merging this pull request may close these issues.

Code generation and cartesian iteration
7 participants