RFC: more efficient `∇getindex` #962

mcabbott · 2021-05-04T17:23:23Z

This is a small improvement to the gradient for getindex in the easiest case, by returning a very simple one-nonzero-element array.

It also changes how accum works: adding two of these will produce an Array, but accumulating the next one may as well mutate that. Maybe it is safe to do that in general, I am not 100% sure? Maybe CI will tell us? [It is not, and it didn't, see below]

Note that it will never produce a SparseArray. My take is that if you call getindex once then this OneElement is fine, and if you call it on every element in the array, then accumulating in an Array is optimal; in-between cases where you call it on 1% of the elements seem likely to be very rare, and sparse arrays add all sorts of complication. [This won't apply anymore]

And, are there CuArray concerns? ∇getindex is careful about making a similar zero to write into, but this change is only for scalar indexing, which is (usually) disallowed anyway.

This doesn't give a huge speedup, but it does save a lot of memory. I'm not too sure what the "10089 allocations" here actually are, maybe there is some further trick to remove them? (Might also be useful for someone to time this on a different computer too -- these are M1 mac, which has unusual memory, and Julia 1.7 native.)

f4(x) = sum([x[i]^2 for i in eachindex(x)])
@btime gradient(f4, $(collect(1:1000)))
#  1.997 ms (12829 allocations: 16.05 MiB)  originally
#  1.502 ms (11087 allocations: 8.46 MiB)   with OneElement alone
#  1.520 ms (11830 allocations: 8.31 MiB)   with accum(x::DenseArray,...) alone
#    992.917 μs (10089 allocations: 738.44 KiB)  this PR

@btime gradient(x -> sum(x * x'), $(rand(10,1000)));  # no getindex
#  79.625 μs (30 allocations: 237.06 KiB)  original
#  79.334 μs (28 allocations: 158.86 KiB)  with accum

On the example from #644 -- discussion there had other comparable proposals:

function _evalpoly(x, p)
    N = length(p)
    ex = p[end]
    for i in N-1:-1:1
        ex = muladd(x, ex, p[i])
    end
    ex
end
x, p = rand(), randn(10000);

@btime _evalpoly(x, p);
# 12.416 μs (1 allocation: 16 bytes)

@btime Zygote.gradient(_evalpoly, x, p);
# 170.356 ms (730104 allocations: 1.53 GiB)     original
# 139.957 ms (710106 allocations: 800.45 MiB)   with accum(x::DenseArray,...) alone
# 136.468 ms (740104 allocations: 801.90 MiB)   with OneElement alone
#  87.931 ms (720108 allocations: 38.35 MiB)    with both
#  59.191 ms (720108 allocations: 38.35 MiB)    with accum(..., y::OneElement) too

Xref also JuliaLang/julia#365 (getindex) and JuliaLang/julia#905 (accumulation).

oxinabox · 2021-05-04T17:58:16Z

in the longer term this will be resolved once we support ChainRules inplacable thunks.
Til then this might be a good idea.

It also changes how accum works: adding two of these will produce an Array, but accumulating the next one may as well mutate that. Maybe it is safe to do that in general, I am not 100% sure? Maybe CI will tell us?

We can define a custom rule for accum that changes the primal so that it doesn't mutate if you are doing nested AD.

oxinabox · 2021-05-04T18:15:58Z

On my computer

julia> versioninfo()
Julia Version 1.6.2-pre.0
Commit dd122918ce* (2021-04-23 21:21 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin20.3.0)
  CPU: Intel(R) Core(TM) i7-8559U CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

here are the benchmarks

@btime gradient(f4, $(collect(1:1000)))
  2.027 ms (7068 allocations: 15.77 MiB)    current master
  90.018 μs (2073 allocations: 307.52 KiB)  this PR before inplace was removed
  682.171 μs (3071 allocations: 8.04 MiB)    this PR after inplace was removed

@btime gradient(x -> sum(x * x'), $(rand(10,1000)));  # no getindex
  108.439 μs (29 allocations: 237.16 KiB)   current master
  105.164 μs (27 allocations: 158.95 KiB)   this PR before inplace was removed
  108.524 μs (29 allocations: 237.16 KiB)   this PR after inplace was removed

julia> @btime _evalpoly(x, p);
  9.589 μs (1 allocation: 16 bytes)

julia> @btime Zygote.gradient(_evalpoly, x, p);
  942.210 ms (680106 allocations: 1.52 GiB)         current master
  67.668 ms (640110 allocations: 29.20 MiB)         this PR before inplace was removed
  564.833 ms (660106 allocations: 792.75 MiB)    this PR after inplace was removed

Notable differences from yours in the top post:
All allocations are lower (especially in the collect example)
but the allocations are much lower still for this PR.

This PR makes the collect example waaaay faster than what you saw: an order of magnitude faster.

The current maseter timing fro the eval poly example is waaay worse for me.

This is great. We should have been doing this for ages.
This is going to be the single biggest quality of life improvement in quite a while.

mcabbott · 2021-05-04T18:29:23Z

Those times look great, thanks. I should try on 1.6 + rosetta too, maybe there are other bugs involved.

We can define a custom rule for accum that changes the primal so that it doesn't mutate if you are doing nested AD.

Good point, hadn't thought about nested AD at all. There appear to be no tests, so perhaps that can be looked at in another PR. Nested getindex specifically might also be easier in this case, xref JuliaLang/julia#944 which I have not read closely.

DhairyaLGandhi · 2021-05-04T18:33:26Z

Is that correct in general? Aso, could we replicate this with only instead?

src/lib/lib.jl

oxinabox · 2021-05-04T19:32:42Z

Is that correct in general?

Is what correct in general?
This looks like a correct inplementation of a n-dimentional one-hot array.
And inplace accumulation of gradients is safe and correct in general due to the face that AD is logically linear (which means each differential is only ever used once so safe to mutate since it is not going to be used anywhere ekse at the point you are calling accum on it, at least during the pullback (less safe if it is something the user is inputting though, since then it might be used more than once.)

Aso, could we replicate this with only instead?

Which bit?

willtebbutt · 2021-05-05T09:43:47Z

Should we be concerned about race conditions?

@DhairyaLGandhi and @MikeInnes know more about this than I do (I know basically nothing), but I would imagine that if you have parallel code somewhere, this modification could be problematic if Zygote winds up trying to accumulate to the same location at the same time (I'm not sure whether this can ever happen in practice)? My impression is that Zygote currently works fine with threads / parallelism generally -- would this change that?

oxinabox · 2021-05-05T11:23:18Z

Should we be concerned about race conditions?

I don't think so.
From what I can see Task related things never call accum directly (most pullbacks never call accum). They return something that gets accumulated outside of that.
I suspect also there is something about structure of AD programs that also mean we don't have to worry about it.
Parallelism in reverse must follow parallelism in forward but backwards, which should mean that primals are free from race conditions in forward must have differentials that are free from race conditions in reverse (thoguh that might not apply since we are turning a copy into a mutation).

I think we can assume it is safe and wait for evidence to the contrary (though debugging race conditions sucks)

src/lib/lib.jl

mcabbott · 2021-05-08T04:00:29Z

Times with rosetta as promised:

julia> @btime gradient(f4, $(collect(1:1000)));
  1.033 ms (7068 allocations: 15.77 MiB)      # tagged
     96.625 μs (2074 allocations: 472.20 KiB) # this PR

julia> @btime gradient(x -> sum(x * x'), $(rand(10,1000)));  # no getindex
  101.084 μs (29 allocations: 237.16 KiB)  # tagged
   96.000 μs (27 allocations: 159.09 KiB)  # this PR

julia> @btime Zygote.gradient(_evalpoly, x, p);
  275.883 ms (680106 allocations: 1.52 GiB)    # tagged
   59.369 ms (640110 allocations: 35.76 MiB)   # this PR

julia> versioninfo()
Julia Version 1.6.0
  OS: macOS (x86_64-apple-darwin19.6.0)
  CPU: Apple M1

Also some here: #905 (comment)

Edit -- without mutation, same computer:

julia> @btime gradient(f4, $(collect(1:1000)));
  433.084 μs (3072 allocations: 8.20 MiB)

julia> @btime Zygote.gradient(_evalpoly, x, p);
  141.938 ms (660106 allocations: 799.31 MiB)

mcabbott · 2021-05-09T20:42:00Z

It turns out that JuliaLang/julia#905 is the way to produce sort of bug I was a bit worried about. Something is doing Δ -> (Δ, Δ, Δ), which results in accum(x, y, zs...) = accum(accum(x, y), zs...) running with x === zs[1], which leads to:

julia> gradient(x -> sum(abs, x + x + x), ones(2))
([4.0, 4.0],)

julia> ForwardDiff.gradient(x -> sum(abs, x + x + x), ones(2))
2-element Vector{Float64}:
 3.0
 3.0

oxinabox · 2021-05-09T22:03:12Z

Yeah fair. Good catch.
I think this would be fixed by having a restriction on pullbacks that says you can't return aliased differentials for non-aliased primal inputs.
Which for practical implementations means that if you are returning more than one instance of the input to your pullback you need to make copies.

Presumably this came out of if a +.

Enforcing this rule against aliasing would add some copies but not as many as inplace accumulation would solve.

I think there might be something in literature about this and a dup operation?

mcabbott · 2021-05-09T22:34:41Z

Yes that might be enough. I don't know the literature (obviously!) but it's hard to then picture it going wrong. Would there be cases where this hurts, as well as ones where it helps overall?

This policy of no aliasing would probably be necessary for any of the in-place updates planned in ChainRules, too, right? Not just accum.

If that were to be the policy, you could surely hack in a test to check for aliasing, here, I mean pirate something during testing. ChainRules could easily do that too.

Just making the update vararg like x .= accum.(x, ys...) in fact avoids the problem here, but obviously won't in general. Doing that without mutation seems like a good idea, saves a lot of memory in JuliaLang/julia#905. Maybe less in real cases, but e.g. gradient(x -> x[1] + x[2] + x[3], ones(3)) does hit that method.

mcabbott · 2021-05-10T00:33:21Z

OK, I think I see how to ensure broadcasting takes preventative copies. For times on this page it only costs 10% off the fastest unsafe way, but in #905 (comment) it saves no memory, and is slower than master.

oxinabox · 2021-05-10T18:34:24Z

I have updated my timings in #962 (comment)
Which still leaves this PR as a ~2-3x improvement over current master.
I say we should merge this one as is and come back to workout how we make inplace accumulation safe in a follow up.

Yes that might be enough. I don't know the literature (obviously!) but it's hard to then picture it going wrong.

Yeah it is hard to picture it going wrong, and putting differential memory addresses into acorrespondence in quanity with primal feels right.

I don't know the literature that well either, not really.
The literature is a utter mess of stuff that appears super niche to fortran and stuff that is super theoretical.
So much of it is older than closures and talking about AD without using the closure representaion of pullbacks is kinda suffering.
A lot of the more modern literature isn't in literature at all, but in like github issues and twitter threads.
but I recall dup being mentioned in a talk by Tom Minka on formulating AD as message passing.
but not well, so i could be getting confused.
This should show up somewhere in some intersection of linear types, message passing, and effects.
idk though maybe people talking about that don't believe in mutation?

Would there be cases where this hurts, as well as ones where it helps overall?

Yes, cases where you don't do anything with the output that you just copied.
If you don't any up accumulating inplace against it then the copy has been done for no reason.
Ideally we would have copy-on-write arrays to fix this.

This policy of no aliasing would probably be necessary for any of the in-place updates planned in ChainRules, too, right? > Not just accum.

I think you mean not just getindex, all the in-lace updates in ChainRules are pretty much for accum,
and yes it would be needed.

If that were to be the policy, you could surely hack in a test to check for aliasing, here, I mean pirate something during testing. ChainRules could easily do that too.

yeah. We could hack that into wrap_chainrules_output.
and something like that might be what we should do, rather than the rule author having to do it?
I want to think about this more still

mcabbott · 2021-05-10T19:34:35Z

Re this PR as it stands, one comment is that it does a bit less to preserve the types of arrays, e.g.:

julia> using StaticArrays

julia> gradient(x -> x[1], SA[1,2,3])[1]
3-element Zygote.OneElement{Int64, 1, Tuple{Int64}, Tuple{SOneTo{3}}} with indices SOneTo(3):
 1
 0
 0

julia> gradient(x -> x[1] + x[2], SA[1,2,3])[1]
3-element SizedVector{3, Int64, Vector{Int64}} with indices SOneTo(3):
 1
 1
 0

compared to master:

julia> gradient(x -> x[1] + x[2], SA[1,2,3])[1]   # better than a SizedVector on similar case
3-element MVector{3, Int64} with indices SOneTo(3):
 1
 1
 0

julia> gradient(x -> sum(x .+ x'), SA[1,2,3])[1]  # but many other cases don't preserve this
3×1 Matrix{Int64}:
 6
 6
 6

julia> gradient(x -> sum(abs, x .+ x'), SA[1,2,3])[1]  # and many simple things don't work at all
ERROR: MethodError: no method matching _mapfoldl(::typeof(identity), ::typeof(+), ::Tuple{Int64, Int64}, ::StaticArrays._InitialValue, ::Size{(3, 3)}, ::SMatrix{3, 3, Int64, 9})

I wondered a bit whether OneElement should capture the type of the array, but the method of similar I think you'd want to use when turning back into a dense array doesn't seem to exist, surprisingly?

julia> similar(typeof([1,2]), axes([1,2]))
2-element Vector{Int64}:
 12
 57

julia> similar(typeof(SA[1,2]), axes(SA[1,2]))
2-element MVector{2, Int64} with indices SOneTo(2):
 4585456624
          3

julia> similar(typeof([1,2]), Float64, axes([1,2]))  # with eltype
ERROR: MethodError: no method matching similar(::Type{Vector{Int64}}, ::Type{Float64}, ::Tuple{Base.OneTo{Int64}})

julia> similar([1,2], Float64, axes([1,2]))  # with an instance it's fine
2-element Vector{Float64}:
 0.0
 2.2655161774e-314

julia> similar(typeof(SA[1,2]), Float64, Size(SA[1,2]))  # this exists, but not for Base types
2-element MVector{2, Float64} with indices SOneTo(2):
 2.162767274e-314
 0.0

oxinabox · 2021-05-25T10:08:52Z

Are you ready for this to be merged?
Given all existing tests pass, and new ones, this feels safe.

mcabbott

Yes, let's do this.

Failure on master ought to be fixed by FluxML/IRTools.jl#86

(Minor aesthetic changes in the review.)

src/lib/array.jl

RFC: more efficient `∇getindex`

mcabbott · 2021-05-25T15:20:19Z

Oh I see. I think we both hit the merge button at the same time, yours succeeded & mine gave a confusing error... which I tried to rebase to fix... but this is merged & can be closed.

DhairyaLGandhi · 2021-06-17T10:34:48Z

Most of the work for static arrays is pretty orthogonal to the concerns of accumulation. It seems odd to make future work harder.

mcabbott added 2 commits May 4, 2021 13:14

sparse getindex gradient

30419f4

a step towards in-place accumulation

8fafdf0

mcabbott mentioned this pull request May 4, 2021

Improvements to rules for norm JuliaDiff/ChainRules.jl#337

Merged

oxinabox reviewed May 4, 2021

View reviewed changes

src/lib/lib.jl Outdated Show resolved Hide resolved

mcabbott commented May 5, 2021

View reviewed changes

src/lib/lib.jl Outdated Show resolved Hide resolved

four seven seven four

431d3d4

mcabbott mentioned this pull request May 8, 2021

Incremental accumulation of gradients? #905

Open

avoid mutation

e95ba74

oxinabox mentioned this pull request May 11, 2021

Aliased differentials and Inplace Accumulation JuliaDiff/ChainRulesCore.jl#350

Open

Merge branch 'master' into getindex2

701f7d9

oxinabox approved these changes May 25, 2021

View reviewed changes

mcabbott commented May 25, 2021

View reviewed changes

src/lib/array.jl Show resolved Hide resolved

src/lib/array.jl Outdated Show resolved Hide resolved

add docstring for OneElement

1f492a7

mcabbott pushed a commit to mcabbott/Zygote.jl that referenced this pull request May 25, 2021

Merge pull request FluxML#962 from mcabbott/getindex2

7c66eff

RFC: more efficient `∇getindex`

mcabbott mentioned this pull request May 25, 2021

RFC: in-place accum #981

Closed

mcabbott closed this May 25, 2021

mcabbott mentioned this pull request Jun 9, 2021

Section sampler SciML/Surrogates.jl#279

Merged

marcoq added a commit to marcoq/Surrogates.jl that referenced this pull request Jun 9, 2021

Add workaround for <FluxML/Zygote.jl#962>

5771aa3

marcoq added a commit to marcoq/Surrogates.jl that referenced this pull request Jun 9, 2021

Add workaround suggested by @mcabbott for <FluxML/Zygote.jl#962>

dea6560

mcabbott mentioned this pull request Jun 14, 2021

Flux & Zygote's AD slower than ForwardDiff #994

Open

mcabbott mentioned this pull request Jun 21, 2021

Scalar indexing error from GPU matmul against Zygote.OneElement #1005

Closed

ToucheSir mentioned this pull request May 26, 2023

Bring over OneElement for scalar getindex JuliaDiff/ChainRules.jl#717

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: more efficient `∇getindex` #962

RFC: more efficient `∇getindex` #962

mcabbott commented May 4, 2021 •

edited

Loading

oxinabox commented May 4, 2021

oxinabox commented May 4, 2021 •

edited

Loading

mcabbott commented May 4, 2021

DhairyaLGandhi commented May 4, 2021

oxinabox commented May 4, 2021

willtebbutt commented May 5, 2021 •

edited

Loading

oxinabox commented May 5, 2021

mcabbott commented May 8, 2021 •

edited

Loading

mcabbott commented May 9, 2021

oxinabox commented May 9, 2021

mcabbott commented May 9, 2021 •

edited

Loading

mcabbott commented May 10, 2021

oxinabox commented May 10, 2021

mcabbott commented May 10, 2021 •

edited

Loading

oxinabox commented May 25, 2021

mcabbott left a comment •

edited

Loading

mcabbott commented May 25, 2021

DhairyaLGandhi commented Jun 17, 2021

RFC: more efficient ∇getindex #962

RFC: more efficient ∇getindex #962

Conversation

mcabbott commented May 4, 2021 • edited Loading

oxinabox commented May 4, 2021

oxinabox commented May 4, 2021 • edited Loading

mcabbott commented May 4, 2021

DhairyaLGandhi commented May 4, 2021

oxinabox commented May 4, 2021

willtebbutt commented May 5, 2021 • edited Loading

oxinabox commented May 5, 2021

mcabbott commented May 8, 2021 • edited Loading

mcabbott commented May 9, 2021

oxinabox commented May 9, 2021

mcabbott commented May 9, 2021 • edited Loading

mcabbott commented May 10, 2021

oxinabox commented May 10, 2021

mcabbott commented May 10, 2021 • edited Loading

oxinabox commented May 25, 2021

mcabbott left a comment • edited Loading

Choose a reason for hiding this comment

mcabbott commented May 25, 2021

DhairyaLGandhi commented Jun 17, 2021

RFC: more efficient `∇getindex` #962

RFC: more efficient `∇getindex` #962

mcabbott commented May 4, 2021 •

edited

Loading

oxinabox commented May 4, 2021 •

edited

Loading

willtebbutt commented May 5, 2021 •

edited

Loading

mcabbott commented May 8, 2021 •

edited

Loading

mcabbott commented May 9, 2021 •

edited

Loading

mcabbott commented May 10, 2021 •

edited

Loading

mcabbott left a comment •

edited

Loading