Previously we were using a CuArray type that could represent a view, reshape, reinterpret, etc. For the sake of simplicity, I switched to a simpler CuArray type while reusing Base.SubArray, Base.ReshapedArray, etc. That requires the use of type unions to, e.g., represent all dense or strided CuArrays:

CUDA.jl/src/array.jl, lines 146 to 164 in 75f7d30
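For reference, the pattern looks roughly like the sketch below. It is modeled on how Base defines StridedArray rather than on the literal contents of src/array.jl lines 146 to 164 (the real definitions are more complete, e.g. also handling ReinterpretArray), and it is wrapped in a throwaway module so it cannot clash with the aliases CUDA.jl itself defines:

```julia
# Sketch only: a Base.StridedArray-style union for CuArray. CUDA.jl defines the
# real aliases in src/array.jl; this module just illustrates the pattern.
module StridedCuSketch

using CUDA  # provides CuArray

# views indexed only by integers and ranges preserve a strided memory layout
const StridedSubCuArray{T,N} =
    SubArray{T,N,<:CuArray,<:Tuple{Vararg{Union{Int,AbstractRange{Int}}}}}

# reshapes of a plain CuArray also remain strided
const StridedReshapedCuArray{T,N} = Base.ReshapedArray{T,N,<:CuArray}

# "any strided CuArray": the plain array or one of the strided wrappers
const StridedCuArray{T,N} = Union{CuArray{T,N},
                                  StridedSubCuArray{T,N},
                                  StridedReshapedCuArray{T,N}}
const StridedCuVector{T} = StridedCuArray{T,1}
const StridedCuMatrix{T} = StridedCuArray{T,2}

end # module
```

The appeal is that a single StridedCuMatrix annotation covers plain arrays, range-indexed views, and reshapes, which is exactly the set of inputs a stride-based API can consume.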
These definitions are almost identical to how Base defines StridedArray. However, using them significantly regresses load time. For example, #450 adds them to a bunch of LinearAlgebra.mul! methods, which badly affects the time of using CUDA: +25%, https://speed.juliagpu.org/timeline/#/?exe=4&ben=latency/import&env=1&revs=50&base=3+96&equid=off&quarts=on&extr=on

In a similar vein, Adapt.jl defines a union that captures all array instances that can be used on the GPU (i.e. not necessarily dense or strided, but e.g. an Adjoint or PermutedDimsArray): https://github.com/JuliaGPU/Adapt.jl/blob/11d96a531cb70359e88ed2ad0d0a13a85727a204/src/wrappers.jl#L73-L92
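Abridged, that union looks something like the following sketch. The real WrappedArray at the linked commit covers many more LinearAlgebra wrapper types and distinguishes the storage type from the destination array type; the single Src parameter and the member selection here are simplifications for illustration.

```julia
# Abridged sketch of an Adapt.jl-style wrapper union; `Src` stands for the
# storage type the wrappers sit on top of (e.g. CuArray).
using LinearAlgebra

const WrappedArraySketch{T,N,Src} = Union{
    SubArray{T,N,<:Src},
    Base.ReshapedArray{T,N,<:Src},
    Base.ReinterpretArray{T,N,<:Any,<:Src},
    PermutedDimsArray{T,N,<:Any,<:Any,<:Src},
    LinearAlgebra.Adjoint{T,<:Src},
    LinearAlgebra.Transpose{T,<:Src},
    LinearAlgebra.UpperTriangular{T,<:Src},
    # ...the real union continues with the other LinearAlgebra wrappers
    LinearAlgebra.Diagonal{T,<:Src}
}

# CUDA.jl's AnyCuArray is then, roughly, "a CuArray or any wrapper around one":
#   const AnyCuArray{T,N} = Union{CuArray{T,N}, Adapt.WrappedArray{T,N,...}}
```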
Using these unions makes load time go crazy, e.g. with mul!(::CuArray, ::AnyCuArray...) (where AnyCuArray uses the Adapt.WrappedArray union) it goes from 5 to 25s.
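For what it's worth, that number is easy to reproduce locally by timing the import in a fresh process; this is just a quick sanity check, not the harness behind the speed.juliagpu.org benchmark (it assumes CUDA.jl is installed in the active project):

```julia
# Time `using CUDA` in a pristine child process so that packages already loaded
# in the current session don't skew the measurement.
run(`$(Base.julia_cmd()) --startup-file=no --project -e "@time using CUDA"`)
```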
I can understand how the large union from Adapt.jl is needlessly taxing on inference, and I guess we may need something like an AbstractWrappedArray here (JuliaLang/julia#31563). However, with StridedCuArray I had not expected these regressions, as Base uses similar patterns. Am I doing anything especially bad here? I'd like to start using StridedCuArray much more, in order to cover APIs that take stride inputs (of which there are quite a few).
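For context, the kind of mechanism JuliaLang/julia#31563 asks for could look roughly like the sketch below. Nothing like this exists in Base today; the type name, parameters, and the commented-out signature are purely illustrative of the idea of replacing the big union with a single abstract supertype.

```julia
# Purely hypothetical: an abstract supertype that wrapper types would share,
# parameterized on the wrapped storage, so that dispatch can say "anything
# ultimately backed by a CuArray" without enumerating every wrapper.
abstract type AbstractWrappedArray{T,N,Storage} <: AbstractArray{T,N} end

# A mul! method could then be written against a single subtype constraint
# instead of a union with dozens of members, e.g. (illustrative signature only):
#
#   LinearAlgebra.mul!(C::CuMatrix, A::AbstractWrappedArray{<:Any,2,<:CuArray},
#                      B::AbstractWrappedArray{<:Any,2,<:CuArray})
```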