
changed broadcast! into bitarray algorithm #32048

Merged 9 commits into JuliaLang:master on Nov 4, 2023

Conversation

chethega (Contributor)

Cf. https://discourse.julialang.org/t/broadcast-vs-slow-performance-allocations/24259/6 for more discussion, and #32047 for the question of validity in view of exceptions.

@mbauman is this valid with respect to eachindex, etc?

Before:

julia> using BenchmarkTools, Random
julia> y=1; xsmall=[1]; Random.seed!(42); xlarge=rand(1:4, 100_003);
julia> @btime broadcast(==, $xsmall, $y); @btime  broadcast(==, $xlarge, $y); @show hash(broadcast(==, xlarge, y).chunks);
  860.500 ns (3 allocations: 4.31 KiB)
  152.971 μs (3 allocations: 16.59 KiB)
hash((broadcast(==, xlarge, y)).chunks) = 0xaa3b5a692968e128

After:

julia> @btime broadcast(==, $xsmall, $y); @btime  broadcast(==, $xlarge, $y); @show hash(broadcast(==, xlarge, y).chunks);
  65.466 ns (2 allocations: 128 bytes)
  42.927 μs (2 allocations: 12.41 KiB)
hash((broadcast(==, xlarge, y)).chunks) = 0xaa3b5a692968e128

Monkeypatch:

julia> @eval Base.Broadcast @inline function copyto!(dest::BitArray, bc::Broadcasted{Nothing})
           axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
           ischunkedbroadcast(dest, bc) && return chunkedcopyto!(dest, bc)
           destc = dest.chunks
           bcp = preprocess(dest, bc)
           length(bcp) <= 0 && return dest
           # Pack 64 broadcast results at a time into a UInt64 register and
           # store each full chunk of dest.chunks with a single write.
           @inbounds for i = 0:Base.num_bit_chunks(length(bcp))-2
               z = UInt64(0)
               for j = 0:63
                   z |= (bcp[i*64 + j + 1]::Bool) << (j&63)
               end
               destc[i+1] = z
           end
           # Trailing, possibly partial, chunk.
           i = Base.num_bit_chunks(length(bcp))-1
           z = UInt64(0)
           @inbounds for j = 0:(length(bcp)-1)&63
               z |= (bcp[i*64 + j + 1]::Bool) << (j&63)
           end
           @inbounds destc[i+1] = z
           return dest
       end

@KristofferC added the performance (Must go faster) label on May 16, 2019
@fredrikekre requested a review from mbauman on May 16, 2019, 11:52
@fredrikekre added the broadcast (Applying a function over a collection) label on May 16, 2019
@chethega (Contributor, Author)

OK, this does not work for Cartesian indices. Multidimensional variants I tried:

julia> @eval Base.Broadcast @inline function copyto!(dest::BitArray, bc::Broadcasted{Nothing})
           axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
           ischunkedbroadcast(dest, bc) && return chunkedcopyto!(dest, bc)
           destc = dest.chunks
           bcp = preprocess(dest, bc)
           length(destc)<=0 && return dest
           ea = eachindex(bcp)
           i1 = 1
           i2 = 0
           z = UInt64(0)
           @inbounds for idx in ea # = 0:Base.num_bit_chunks(length(bcp))-2
               z |= (bcp[idx]::Bool) << (i2&63)
               i2 += 1
               if (i2 & 63) == 0
                   destc[i1]=z
                   i2 = 0
                   i1 += 1
                   z = UInt64(0)
               end
           end
           i2 != 0 && @inbounds destc[i1] = z
           return dest
       end

and

julia> @eval Base.Broadcast @inline function copyto!(dest::BitArray, bc::Broadcasted{Nothing})
           axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
           ischunkedbroadcast(dest, bc) && return chunkedcopyto!(dest, bc)
           destc = dest.chunks
           bcp = preprocess(dest, bc)
           length(bcp)<=0 && return dest
           ea = eachindex(bcp)
           idx = first(ea)
           @inbounds for i = 0:Base.num_bit_chunks(length(bcp))-2
               z = UInt64(0)
               for j=0:63
                  z |= (bcp[idx]::Bool) << (j&63)
                  idx = nextind(ea, idx)
               end
               destc[i+1] = z
           end
           i = Base.num_bit_chunks(length(bcp))-1
           z = UInt64(0)
           @inbounds for j=0:(length(bcp)-1)&63
                z |= (bcp[idx]::Bool) << (j&63)
                idx = nextind(ea, idx)
           end
           @inbounds destc[i+1] = z
           return dest
       end

None of these is an unambiguous win on my machine, so I restricted the new algorithm to BitVectors for the time being. Any ideas on how to get the fast variant working for higher-dimensional arrays?

The best way would be to propagate IndexLinear information. The second-best way would be to cache only 64 values in a register instead of an array of Bools (I think the main performance reason for the caching is to avoid memory-carried dependencies). Both of my variants above do this, but fail to consistently outperform the current implementation.
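
To illustrate the register-caching idea in isolation, here is a minimal sketch outside the broadcast machinery; pack_bools! is a hypothetical helper name, not part of this PR, and it assumes chunks has already been sized to hold all of the bits:

    # Sketch only: pack an iterable of Bools into 64-bit chunks, keeping the
    # in-progress chunk in a single UInt64 register instead of a Bool bitcache.
    function pack_bools!(chunks::Vector{UInt64}, src)
        z = UInt64(0)   # register holding the chunk being assembled
        n = 0           # bits packed so far
        i1 = 1          # next chunk slot to write
        @inbounds for b in src
            z |= UInt64(b::Bool) << (n & 63)
            n += 1
            if (n & 63) == 0
                chunks[i1] = z
                i1 += 1
                z = UInt64(0)
            end
        end
        (n & 63) != 0 && (@inbounds chunks[i1] = z)   # trailing partial chunk
        return chunks
    end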

@mbauman (Sponsor, Member) commented May 16, 2019

This is great — thanks. Definitely an improvement over the bitcache. And I do think it should be possible to get this fast for n-dimensional broadcasts, too, but it is indeed tricky. This gets us a bit closer:

    @inbounds for i in 1:Base.num_bit_chunks(length(bcp))-1
        z = UInt64(0)
        for j in 0:63
            z |= bcp[idx] << j
            (idx, s) = iterate(ea, s)
        end
        destc[i]=z
    end
    z = UInt64(0)
    for j in 0:63
        z |= bcp[idx] << j
        r = iterate(ea, s)
        r === nothing && break
        (idx, s) = r
    end
    destc[end]=z

But the trouble is that we're still introducing a branch with (idx, s) = iterate(ea, s) since that could return nothing and we can't really turn that off. We could potentially just use nextind instead of iterate, which I think should give us the performance we want.
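
For concreteness, a rough sketch of the nextind-based inner loop (essentially what the second variant above already does), assuming bcp, destc, and ea = eachindex(bcp) as defined there; the advance is unconditional, so there is no branch on nothing in the hot path:

    idx = first(ea)
    @inbounds for i in 1:Base.num_bit_chunks(length(bcp))-1
        z = UInt64(0)
        for j in 0:63
            z |= UInt64(bcp[idx]::Bool) << j
            idx = nextind(ea, idx)   # always advances; never returns nothing
        end
        destc[i] = z
    end
    # the trailing, possibly partial, chunk is then handled separately, as above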

@chethega (Contributor, Author)

If we somehow managed to get linear indexing working for common constructions, then I think we could maybe remove the bitcache logic entirely (also from the BitArray constructors), replacing it with caching a single UInt64 (reducing complexity and line count, yay!). If linear indexing is genuinely unsupported, then the current slight performance regression for that case should not matter (would need benchmarking). And linear indexing would be a giant win for more important broadcast operations than BitArrays, cf. #32051 for a 20× speedup of arr .+ const between N and 1×N sizes.

julia> @eval Base.Broadcast @inline function copyto!(dest::BitArray, bc::Broadcasted{Nothing})
           axes(dest) == axes(bc) || throwdm(axes(dest), axes(bc))
           ischunkedbroadcast(dest, bc) && return chunkedcopyto!(dest, bc)
           destc = dest.chunks
           bcp = preprocess(dest, bc)
           length(destc)<=0 && return dest
           ea = eachindex(bcp)
           i1 = 1
           i2 = 0
           z = UInt64(0)
           @inbounds for idx in ea
               z |= (bcp[idx]::Bool) << (i2&63)
               i2 += 1
               if i2 == 64
                   destc[i1]=z
                   i2 = 0
                   i1 += 1
                   z = UInt64(0)
               end
           end
           i2 != 0 && @inbounds destc[i1] = z
           return dest
       end

@vtjnash added the merge me (PR is reviewed. Merge when all tests are passing) label on Oct 31, 2023
@vtjnash merged commit f3ae44c into JuliaLang:master on Nov 4, 2023
7 checks passed
@giordano removed the merge me (PR is reviewed. Merge when all tests are passing) label on Nov 4, 2023
N5N3 added a commit to N5N3/julia that referenced this pull request Dec 21, 2023
N5N3 added a commit to N5N3/julia that referenced this pull request Jan 4, 2024
N5N3 added a commit that referenced this pull request Jan 5, 2024
… (#52736)

Follows #32048.
This PR fully avoids the allocation, making n-dimensional logical broadcast scale better for small inputs.

---------

Co-authored-by: Matt Bauman <mbauman@gmail.com>