fix overflow in CartesianIndices iteration #31011

vchuravy · 2019-02-08T16:04:06Z

Make iteration over CartesianIndices blazing fast.
Fixes #30998, by avoiding the potential overflow.

LLVM was right..., based on the given semantics it
is not possible to determine the loop-bounds of the
loop in:

function vecfail(A)
     @inbounds begin
         acc = zero(eltype(A))
         N = length(A)
         for I in CartesianIndices(A)
              acc = Base.FastMath.add_fast(acc, A[I])
         end
     end
     return acc
end

Since it is possible for the iteration to overflow:

julia> I = CartesianIndices(1:typemax(Int64));

julia> i= last(I)
CartesianIndex(9223372036854775807,)

julia> iterate(I, i)
(CartesianIndex(-9223372036854775808,), CartesianIndex(-9223372036854775808,))

base/multidimensional.jl

vchuravy · 2019-02-09T03:18:54Z

I should add that this helps vectorize 1D CartesianIndices, but doesn't help with ND. The loop we lower two has two induction variables. instead of being a inner and a outer for-loop.

I would love for:

function vecfail(A)
            @inbounds begin
                acc = zero(eltype(A))
                for I in CartesianIndices(A)
                     acc = Base.FastMath.add_fast(acc, A[I])
                end
            end
            return acc
       end

to be more or less equivalent to

using Base.Cartesian
@generated function vecfail(A::AbstractArray{T, N}) where {T, N}
             quote
             @inbounds begin
                acc = zero(eltype(A))
                @nloops $N i A begin
                     acc = Base.FastMath.add_fast(acc, @nref($N, A, i))
                end
            end
            return acc
            end
       end

I know eachindex with arrays that support linear indexing will already get me there.

vchuravy · 2019-02-09T04:11:01Z

Ok updated once again after I realized that

state == last(iter) && return nothing

is basically equivalent to:

        idxend = map(last, iter.indices)
        all(map(>=, state.I, idxend)) && return nothing

Since the latter already accepts invalid state. I also experimented with checked arithmetic, but that didn't help with vectorisation.

timholy

Nice work! I think between this suggestion and my refinement in the comment below, we may have a fix (or close to it) for #9080! That's one of the few bugs whose number I know by memory, so to me that's a big deal.

See this gist and then:

julia> includet("/tmp/iterc.jl")

julia> using BenchmarkTools

julia> @btime sumcart_manual($A)
  102.337 ms (0 allocations: 0 bytes)
5.0001993217249475e7

julia> @btime sumcart_iter($A)
  131.513 ms (0 allocations: 0 bytes)
5.0001993217249475e7

julia> @btime sumcart_newiter($A)
  105.207 ms (0 allocations: 0 bytes)
5.0001993217249475e7

timholy · 2019-02-09T09:53:01Z

base/multidimensional.jl

@@ -334,9 +338,11 @@ module IteratorsMD
        iterfirst, iterfirst
    end
    @inline function iterate(iter::CartesianIndices, state)
+        # If we increment before the condition check, we run the
+        # risk of an integer overflow. This intentionally allows for invalid state.
+        state == last(iter) && return nothing


For the ND case, what if we weave this check together with inc? Here we check all of the entries of state and iter, and if there are any differences we bail. But then we potentially check them all again in inc. It seems you could be running this check on each entry as you go, and then bail if you've exhausted the tuple.

Something like this:

@inline function iterate(iter::CartesianIndices, state) inc((), state.I, first(iter).I, last(iter).I) end # increment & carry @inline inc(out, ::Tuple{}, ::Tuple{}, ::Tuple{}) = nothing @inline function inc(out, state, start, stop) if state[1] < stop[1] nextstate = CartesianIndex(out..., state[1]+1, tail(state)...) return nextstate, nextstate end return inc((out..., start[1]), tail(state), tail(start), tail(stop)) end

Now historically Julia's inference has not liked arguments that grow, but experimenting with @code_warntype up to 8 dimensions suggests it's OK in this case. Cross-checking with @vtjnash.

Thanks! I went a slightly different approach that constructs the tuple in anycase and propagates a boolean to check if it is valid.

Does this still solve #9080?

:/ No.

julia> @benchmark sumcart_iter($A) BenchmarkTools.Trial: memory estimate: 0 bytes allocs estimate: 0 -------------- minimum time: 190.778 ms (0.00% GC) median time: 191.650 ms (0.00% GC) mean time: 191.665 ms (0.00% GC) maximum time: 192.456 ms (0.00% GC) -------------- samples: 27 evals/sample: 1 julia> @benchmark sumcart_manual($A) BenchmarkTools.Trial: memory estimate: 0 bytes allocs estimate: 0 -------------- minimum time: 102.647 ms (0.00% GC) median time: 103.522 ms (0.00% GC) mean time: 103.663 ms (0.00% GC) maximum time: 105.404 ms (0.00% GC) -------------- samples: 49 evals/sample: 1

timholy · 2019-02-09T10:10:21Z

With regards to the multidimensional case, note that @simd strips off the first dimension's iteration variable to create two loops, an outer n-1-dimensional iteration and an inner 1-dimensional iteration. So at least in the typical case you can vectorize across the first dimension. Would be nice to need that and get the equivalent of the @nloops approach.

vchuravy · 2019-02-09T15:08:52Z

Would be nice to need that and get the equivalent of the @nloops approach.

I wonder if Polly could do things like this, but right now both -loop-analysis and -scalar-evolution get confused about the 2D case. (I will punt on that though)

vchuravy · 2019-02-09T17:38:33Z

Interesting using Tim's variant I do see a speedup 200->150, but not the same speedup as if I am using _newiter from the gist 200->100.

timholy · 2019-02-09T22:11:19Z

but not the same speedup as if I am using _newiter from the gist 200->100

Just to check, are you saying that if you "wire this in" to Base.iterate and then rely on lowering to reduce the for loop to an iterate loop, it's slower than the version I produced by manual lowering? If so, what's the difference between the two @code_lowereds?

vchuravy · 2019-02-10T01:28:49Z

Just to check, are you saying that if you "wire this in" to Base.iterate and then rely on lowering to reduce the for loop to an iterate loop, it's slower than the version I produced by manual lowering? If so, what's the difference between the two @code_lowereds?

Exactly! I rewrote the newiter version to match more closely what the iter version produces:

using Base: tail

@inline function myiterate(iter::CartesianIndices)
    iterfirst, iterlast = first(iter), last(iter)
    if any(map(>, iterfirst.I, iterlast.I))
        return nothing
    end
    iterfirst, iterfirst
end
@inline function myiterate(iter::CartesianIndices, state)
    inc((), state.I, first(iter).I, last(iter).I)
end
# 0-d cartesian ranges are special-cased to iterate once and only once
myiterate(iter::CartesianIndices{0}, done=false) = done ? nothing : (CartesianIndex(), true)

# increment & carry
@inline inc(out, ::Tuple{}, ::Tuple{}, ::Tuple{}) = nothing
@inline function inc(out, state::Tuple{Int}, start::Tuple{Int}, stop::Tuple{Int})
    if state[1] < stop[1]
        nextstate = CartesianIndex(out..., state[1]+1)
        return nextstate, nextstate
    end
    return nothing
end
@inline function inc(out, state, start, stop)
    if state[1] < stop[1]
        nextstate = CartesianIndex(out..., state[1]+1, tail(state)...)
        return nextstate, nextstate
    end
    return inc((out..., start[1]), tail(state), tail(start), tail(stop))
end


function sumcart_newiter(A)
    s = 0.0
    iter = CartesianIndices(A)
    ret = myiterate(iter)
    @inbounds while !(ret === nothing)
        I, state = ret
        s += A[I]
        ret = myiterate(iter, state)
    end
    return s
end

The only difference is that this a head-loop and the iteration loop produces a check + tail-loop. For the curious I put both produced LLVM IR into godbolt and setup MCA for both https://godbolt.org/z/8owxJT

vchuravy · 2019-02-10T22:11:33Z

Ok... before a755c6a

julia> @benchmark sumcart_iter($A)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     158.716 ms (0.00% GC)
  median time:      163.086 ms (0.00% GC)
  mean time:        174.294 ms (0.00% GC)
  maximum time:     288.133 ms (0.00% GC)
  --------------
  samples:          29
  evals/sample:     1

after:

julia> @benchmark sumcart_iter($A)
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     105.531 ms (0.00% GC)
  median time:      106.831 ms (0.00% GC)
  mean time:        107.862 ms (0.00% GC)
  maximum time:     120.262 ms (0.00% GC)
  --------------
  samples:          47
  evals/sample:     1

This is a a bit odd since LLVM prefers do ... while loops, and definitely requires a nanosoldier run.

@nanosoldier runbenchmarks(ALL, vs = "@5a6b972282a91ceea422c479433e7d1413fea2b4")

jekbradbury · 2019-02-10T22:30:21Z

What does this lowering change look like in terms of the equivalent Julia syntax? I think there’s an example of for loop lowering in the manual that might need to be adjusted.

timholy

Really fascinating. I am not the best person to comment on LLVM's preferences for various lowered forms, and certainly not the changes to julia-synax.scm. It will be interesting to see the nanosoldier run.

timholy · 2019-02-10T22:31:43Z

base/multidimensional.jl

-    @inline function inc(state, start, stop)
+    # increment post check to avoid integer overflow
+    @inline inc(out, ::Tuple{}, ::Tuple{}, ::Tuple{}) = nothing
+    @inline function inc(out, state::Tuple{Int}, start::Tuple{Int}, stop::Tuple{Int})


Do we even need this method anymore? I think it's covered by the generic case.

I don't think so. I added it because this is a common pattern elsewhere (probably steming from a time when inference liked code like this more).

base/multidimensional.jl

vchuravy · 2019-02-10T22:48:18Z

What does this lowering change look like in terms of the equivalent Julia syntax? I think there’s an example of for loop lowering in the manual that might need to be adjusted.

Since we don't have a do ... while loop in Julia. Loop lowering on master currently looks like this:

next = iterate(iter)
if !(next === nothing)
   @label loop
   I, state = next
  # ...
  next = iterate(iter, state)
  if !(next === nothing)
    @goto loop
  end
end

my change changes it back to how it was on 0.6, but with the new iteration protocol

next = iterate(iter)
while !(next === nothing)
  I, state = next
  ...
  next = iterate(iter, state)
end

nanosoldier · 2019-02-11T04:53:31Z

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

vchuravy · 2019-02-11T14:19:39Z

The nanosoldier results are confusing and all over the board. How can this change lead to half the memory usage....

ararslan · 2019-02-11T20:03:01Z

base/multidimensional.jl

@@ -95,6 +95,10 @@ module IteratorsMD
    # access to index tuple
    Tuple(index::CartesianIndex) = index.I

+    # equality
+    import Base: ==
+    ==(a::CartesianIndex{N}, b::CartesianIndex{N}) where N = a.I == b.I


No need for the import, this can just be Base.:(==)(...), which is more explicit.

ararslan · 2019-02-11T20:04:16Z

The results will be noisier the more benchmarks are run at once, so if there are a few in particular you're interested in, you could do those specifically instead of ALL. That should give more reliable results.

vchuravy · 2019-03-16T16:39:47Z

sigh

Current Julia:

julia> A = zeros(3,3)
3×3 Array{Float64,2}:
 0.0  0.0  0.0
 0.0  0.0  0.0
 0.0  0.0  0.0

julia> nextind(A, CartesianIndex(3,3))
CartesianIndex(1, 4)

nextind(CartesianIndices((1:3, 1:typemax(Int))), CartesianIndex(3,typemax(Int)))
CartesianIndex(1, -9223372036854775808)

Looking at the docs for nextind the expected behaviour might be a BoundsError?

vchuravy · 2019-03-16T17:09:56Z

Okay. I decided not to experiment with loop-form and just focus on fixing the over/underflows.

Notes:

Lowering of cartesian loops with N>=2 is something I still want to further investigate, but the general realisation is that SCEV doesn't like it and can't figure out that the loops are bounded.
A potential idea might be to have an IR pass that converts loops of this form to loop nests (hopefully solving this issue in general) and avoiding the kind of strip-mining transformation @simd does.
I am unsure about the semantics for nextind usage generally seems to allow one illegal index to be returned and then calling nextind on that is supposed to give you a BoundsError.
I rewrote findnext to avoid the overflow, but it would be cleaner to directly call iterate, but it is not clear to me whether start can be expected to be a proper iterator state.

vchuravy · 2019-03-16T19:50:20Z

I am unsure about the semantics for nextind usage generally seems to allow one illegal index to be returned and then calling nextind on that is supposed to give you a BoundsError.

I went back to the form valid, i = inc(), which allows nextind/prevind to produce the nextindex. These might still underflow/overflow so I will leave the changes for findnext et.al, but I could also revert those and open a separate PR.

This also sets us up with a CGF that I believe to be amenable to be transformed into a loop nest.

vchuravy · 2019-03-19T17:01:25Z

There seems to be an inferrability issue:

  Test threw exception
  Expression: #= /tmp/julia/share/julia/test/reducedim.jl:55 =# @inferred(maximum(Areduc, dims=region)) ≈ safe_maximum(Areduc, region)
  return type Array{Float64,4} does not match inferred return type Array{_A,4} where _A
  Stacktrace:
   [1] error(::String) at ./error.jl:33
   [2] top-level scope at /tmp/julia/share/julia/test/reducedim.jl:55
   [3] top-level scope at /home/travis/build/JuliaLang/julia/usr/share/julia/stdlib/v1.2/Test/src/Test.jl:1186
   [4] include at ./boot.jl:325 [inlined]
   [5] include_relative(::Module, ::String) at ./loading.jl:1042
   [6] include at ./Base.jl:31 [inlined]
   [7] include(::String) at /tmp/julia/share/julia/test/testdefs.jl:13
   [8] top-level scope at /tmp/julia/share/julia/test/testdefs.jl:23
   [9] top-level scope at /home/travis/build/JuliaLang/julia/usr/share/julia/stdlib/v1.2/Test/src/Test.jl:1113
   [10] top-level scope at /tmp/julia/share/julia/test/testdefs.jl:22
   [11] top-level scope at util.jl:289
   [12] top-level scope at /tmp/julia/share/julia/test/testdefs.jl:20

Haven't had time to take a deeper look.

This allows LLVM to vectorize the 1D CartesianIndices case, as well as fixing an overflow bug for: ```julia CartesianIndices(((typemax(Int64)-2):typemax(Int64),)) ``` Co-authored-by: Yingbo Ma <mayingbo5@gmail.com>

vchuravy · 2019-03-28T17:15:32Z

I plan to merge this tomorrow.

mbauman · 2019-03-28T17:34:05Z

I wanna see another nanosoldier run first. There were some fairly major regressions in array iteration in the run in February.

@nanosoldier runbenchmarks("array" || "union", vs = ":master")

base/array.jl

vchuravy · 2019-03-28T18:25:26Z

Fair enough, the run in February was the experiment in changing the loop structure for the iteration protocol.

nanosoldier · 2019-03-28T21:27:54Z

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

Co-Authored-By: vchuravy <vchuravy@users.noreply.github.com>

vchuravy · 2019-03-29T14:17:49Z

@mbauman The nanosoldier results look clean. Are you okay with squash merging this as is?

vchuravy requested review from Keno and timholy February 8, 2019 16:56

Keno reviewed Feb 8, 2019

View reviewed changes

base/multidimensional.jl Outdated Show resolved Hide resolved

vchuravy force-pushed the vc/cartesian branch 2 times, most recently from 1aac477 to eebadbc Compare February 9, 2019 03:04

timholy reviewed Feb 9, 2019

View reviewed changes

vchuravy force-pushed the vc/cartesian branch from f79e937 to d7805f7 Compare February 9, 2019 14:52

timholy reviewed Feb 10, 2019

View reviewed changes

ararslan reviewed Feb 11, 2019

View reviewed changes

vchuravy force-pushed the vc/cartesian branch from 11299e8 to 7fe75d2 Compare March 4, 2019 02:17

vchuravy force-pushed the vc/cartesian branch from 7fe75d2 to 807a0c8 Compare March 16, 2019 16:56

vchuravy requested a review from timholy March 16, 2019 16:56

vchuravy added this to the 1.2 milestone Mar 18, 2019

vchuravy force-pushed the vc/cartesian branch from 2e31065 to 6acd751 Compare March 19, 2019 02:47

vchuravy mentioned this pull request Mar 20, 2019

while i <= L makes function >5x slower the while i < (L+1) #31416

Closed

vchuravy and others added 3 commits March 28, 2019 09:47

fix overflow in CartesianIndices iteration

4a3cea1

This allows LLVM to vectorize the 1D CartesianIndices case, as well as fixing an overflow bug for: ```julia CartesianIndices(((typemax(Int64)-2):typemax(Int64),)) ``` Co-authored-by: Yingbo Ma <mayingbo5@gmail.com>

avoid potential under/overflows

e31a748

change to accumulate through return for inference

185e6f4

vchuravy force-pushed the vc/cartesian branch from b31ba8d to 185e6f4 Compare March 28, 2019 13:47

mbauman reviewed Mar 28, 2019

View reviewed changes

base/array.jl Outdated Show resolved Hide resolved

base/array.jl Outdated Show resolved Hide resolved

Fix copy-pasta

953efd4

Co-Authored-By: vchuravy <vchuravy@users.noreply.github.com>

vchuravy merged commit 91151ab into master Mar 29, 2019

vchuravy deleted the vc/cartesian branch March 29, 2019 18:53

johnnychen94 mentioned this pull request May 17, 2022

Make for iter::CartesianIndices better vectorized for 1d/2d cases. #45338

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix overflow in CartesianIndices iteration #31011

fix overflow in CartesianIndices iteration #31011

vchuravy commented Feb 8, 2019

vchuravy commented Feb 9, 2019

vchuravy commented Feb 9, 2019

timholy left a comment

timholy Feb 9, 2019

vchuravy Feb 9, 2019

timholy Feb 9, 2019

vchuravy Feb 9, 2019

timholy commented Feb 9, 2019

vchuravy commented Feb 9, 2019

vchuravy commented Feb 9, 2019

timholy commented Feb 9, 2019

vchuravy commented Feb 10, 2019

vchuravy commented Feb 10, 2019

jekbradbury commented Feb 10, 2019

timholy left a comment

timholy Feb 10, 2019

vchuravy Feb 10, 2019

vchuravy commented Feb 10, 2019

nanosoldier commented Feb 11, 2019

vchuravy commented Feb 11, 2019

ararslan Feb 11, 2019

ararslan commented Feb 11, 2019

vchuravy commented Mar 16, 2019

vchuravy commented Mar 16, 2019

vchuravy commented Mar 16, 2019

vchuravy commented Mar 19, 2019

vchuravy commented Mar 28, 2019

mbauman commented Mar 28, 2019

vchuravy commented Mar 28, 2019

nanosoldier commented Mar 28, 2019

vchuravy commented Mar 29, 2019

fix overflow in CartesianIndices iteration #31011

fix overflow in CartesianIndices iteration #31011

Conversation

vchuravy commented Feb 8, 2019

vchuravy commented Feb 9, 2019

vchuravy commented Feb 9, 2019

timholy left a comment

Choose a reason for hiding this comment

timholy Feb 9, 2019

Choose a reason for hiding this comment

vchuravy Feb 9, 2019

Choose a reason for hiding this comment

timholy Feb 9, 2019

Choose a reason for hiding this comment

vchuravy Feb 9, 2019

Choose a reason for hiding this comment

timholy commented Feb 9, 2019

vchuravy commented Feb 9, 2019

vchuravy commented Feb 9, 2019

timholy commented Feb 9, 2019

vchuravy commented Feb 10, 2019

vchuravy commented Feb 10, 2019

jekbradbury commented Feb 10, 2019

timholy left a comment

Choose a reason for hiding this comment

timholy Feb 10, 2019

Choose a reason for hiding this comment

vchuravy Feb 10, 2019

Choose a reason for hiding this comment

vchuravy commented Feb 10, 2019

nanosoldier commented Feb 11, 2019

vchuravy commented Feb 11, 2019

ararslan Feb 11, 2019

Choose a reason for hiding this comment

ararslan commented Feb 11, 2019

vchuravy commented Mar 16, 2019

vchuravy commented Mar 16, 2019

vchuravy commented Mar 16, 2019

vchuravy commented Mar 19, 2019

vchuravy commented Mar 28, 2019

mbauman commented Mar 28, 2019

vchuravy commented Mar 28, 2019

nanosoldier commented Mar 28, 2019

vchuravy commented Mar 29, 2019