scalar getindex when shared index changes order #28
I was asked to make this an issue, but I'm not sure there's anything to be done about it.

TensorCast uses scalar indexing when a shared index changes position, which is nightmarishly slow for big calculations on GPU.

---
I think this is similar to #25, which is JuliaGPU/CUDA.jl#228. When the broadcast has one naked CuArray, and the others are reshaped / permuted, then it hits CUDA's broadcasting. But when all the arguments are too deeply wrapped, somehow the BroadcastStyle doesn't get passed outwards, and you get the default CPU broadcasting. Two of these examples are:

```julia
julia> @pretty @reduce C[a,b,d] := sum(c) A[a,b,c] * B[a,c,d]  # Fast
begin
local raven = orient(B, (:, *, :, :))
C = dropdims(sum(@__dot__(A * raven), dims = 3), dims = 3)
end
julia> @pretty @reduce C[b,a,d] := sum(c) A[a,b,c] * B[a,c,d] # Not fast
begin
local raven = orient(PermutedDimsArray(A, (2, 1, 3)), (:, :, *, :))
local deer = orient(PermutedDimsArray(B, (1, 3, 2)), (*, :, :, :))
C = dropdims(sum(@__dot__(raven * deer), dims = 4), dims = 4)
end
```

The workaround from #25 (comment) might be useful? Other things which might work (but I haven't tried) are:

```julia
julia> @pretty @reduce C[b,a,d] := sum(c) A[a,b,c] * B[a,c,d] nolazy
begin
local raven = orient(permutedims(A, (2, 1, 3)), (:, :, *, :))
local deer = orient(permutedims(B, (1, 3, 2)), (*, :, :, :))
C = dropdims(sum(@__dot__(raven * deer), dims = 4), dims = 4)
end
julia> @pretty @matmul C[b,a,d] := sum(c) A[a,b,c] * B[a,c,d]
begin
local (sz_a, sz_b, sz_c, sz_d) = (size(A, 1), size(A, 2), size(A, 3), size(B, 3))
local raven = permutedims(B, (2, 1, 3))
local deer = reshape(A, (star(sz_a, sz_b), sz_c))
local seal = reshape(raven, (sz_c, star(sz_a, sz_d)))
local goat = reshape(deer * seal, (sz_a, sz_b, sz_a, sz_d))
local skunk = orient(permutedims(goat, (2, 1, 3, 4)), (:, :, :))
C = skunk
end
```
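(One way to check which path a broadcast will take is to ask for its style directly. A rough sketch, untested; the exact style names depend on the CUDA.jl version, and the second call is my assumption about where the wrapping loses it:)

```julia
using CUDA

A = cu(rand(2, 3))

# A bare CuArray advertises CUDA's broadcast style, so fused broadcasts
# become GPU kernels:
Base.BroadcastStyle(typeof(A))

# With enough wrappers stacked on, the style can fall back to the generic
# DefaultArrayStyle, which broadcasts with (slow) scalar indexing:
Base.BroadcastStyle(typeof(PermutedDimsArray(reshape(A, 2, 3, 1), (2, 1, 3))))
```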
---

On the CPU, …. Another possible plan I had is to replace things like ….
---

FWIW, my project worked just fine with me just data-wrangling until it wasn't in a (b,a,d) ordering anymore. This doesn't seem like a huge problem to me now that I understand how it works. It could just be enough to add something to the docs to warn about it (at least as a quick patch).

Let me take a day to recover from my project and I'll start looking at this in-depth. I seriously owe you for making this library.
---

Sorry, it turns out there was much more work to be done. Have you tested the …?

Edit: Oh, of course you wrote that library too, haha. So are there any issues with using that method that you know about?
---

I haven't tried it out; it means diving deep into the belly of this package's logic, as it still needs to call ….
---

Gotcha. I thought it would be easier than that, but if I'm being honest your code's a little above my pay grade. I ran some benchmarks on the easy cases, though. For small arrays, the trick and solution have a pretty bad worst case, but otherwise are pretty comparable to the case where the index doesn't change. And the difference in the worst case basically disappears as the arrays get large. Either seems viable.

EDIT: Corrected tests and reran them.

```julia
using TensorCast
using CUDA
CUDA.allowscalar(true)
# Output order matches the inputs: hits the fast GPU broadcast path.
function easy(A, B)
    CUDA.@sync @reduce C[a,b,d] := sum(c) A[a,b,c] * B[a,c,d]
    return C
end

# Shared index `a` changes position: falls back to scalar indexing.
function simple(A, B)
    CUDA.@sync @reduce C[b,a,d] := sum(c) A[a,b,c] * B[a,c,d]
    return C
end

# Workaround from #25: add a 0-dim CuArray so the broadcast sees a bare CUDA array.
function trick(A, B)
    trick = cu(fill(false))
    CUDA.@sync @reduce C[b,a,d] := sum(c) A[a,b,c] * B[a,c,d] + trick
    return C
end

using TensorCast: orient, star

# The `nolazy` expansion from above, written out by hand.
function solution(A, B)
    local raven = orient(permutedims(A, (2, 1, 3)), (:, :, *, :))
    local deer = orient(permutedims(B, (1, 3, 2)), (*, :, :, :))
    CUDA.@sync C = dropdims(sum(@__dot__(raven * deer), dims = 4), dims = 4)
end
```

Benchmarks for small arrays:

```julia
julia> @benchmark easy($(cu(rand(1,2,3))), $(cu(rand(1,3,4))))
BenchmarkTools.Trial:
memory estimate: 2.53 KiB
allocs estimate: 97
--------------
minimum time: 57.300 μs (0.00% GC)
median time: 75.201 μs (0.00% GC)
mean time: 80.069 μs (0.00% GC)
maximum time: 4.932 ms (0.00% GC)
--------------
samples: 10000
evals/sample: 1
julia>
julia> @benchmark simple($(cu(rand(1,2,3))), $(cu(rand(1,3,4))))
┌ Warning: Performing scalar operations on GPU arrays: This is very slow, consider disallowing these operations with `allowscalar(false)`
└ @ GPUArrays C:\Users\Lepre\.julia\packages\GPUArrays\uaFZh\src\host\indexing.jl:43
BenchmarkTools.Trial:
memory estimate: 7.94 KiB
allocs estimate: 191
--------------
minimum time: 1.756 ms (0.00% GC)
median time: 1.849 ms (0.00% GC)
mean time: 1.905 ms (0.00% GC)
maximum time: 3.293 ms (0.00% GC)
--------------
samples: 2629
evals/sample: 1
julia>
julia> @benchmark trick($(cu(rand(1,2,3))), $(cu(rand(1,3,4))))
BenchmarkTools.Trial:
memory estimate: 12.84 KiB
allocs estimate: 201
--------------
minimum time: 88.200 μs (0.00% GC)
median time: 108.900 μs (0.00% GC)
mean time: 116.488 μs (1.44% GC)
maximum time: 19.184 ms (45.69% GC)
--------------
samples: 10000
evals/sample: 1
julia>
julia> @benchmark solution($(cu(rand(1,2,3))), $(cu(rand(1,3,4))))
BenchmarkTools.Trial:
memory estimate: 4.42 KiB
allocs estimate: 166
--------------
minimum time: 66.600 μs (0.00% GC)
median time: 85.000 μs (0.00% GC)
mean time: 87.695 μs (0.00% GC)
maximum time: 7.036 ms (0.00% GC)
--------------
samples: 10000
evals/sample: 1
```

Benchmarks for big arrays:

```julia
julia> @benchmark easy($(cu(rand(50,100,150))), $(cu(rand(50,150,200))))
BenchmarkTools.Trial:
memory estimate: 2.72 KiB
allocs estimate: 106
--------------
minimum time: 105.321 ms (0.00% GC)
median time: 106.172 ms (0.00% GC)
mean time: 110.391 ms (0.07% GC)
maximum time: 172.219 ms (0.65% GC)
--------------
samples: 46
evals/sample: 1
julia> # Skipped the allow scalar one because I got impatient.
julia> @benchmark trick($(cu(rand(50,100,150))), $(cu(rand(50,150,200))))
BenchmarkTools.Trial:
memory estimate: 13.03 KiB
allocs estimate: 210
--------------
minimum time: 131.732 ms (0.00% GC)
median time: 132.681 ms (0.00% GC)
mean time: 137.899 ms (0.07% GC)
maximum time: 198.738 ms (0.65% GC)
--------------
samples: 37
evals/sample: 1
julia>
julia> @benchmark solution($(cu(rand(50,100,150))), $(cu(rand(50,150,200))))
BenchmarkTools.Trial:
memory estimate: 4.61 KiB
allocs estimate: 175
--------------
minimum time: 104.896 ms (0.00% GC)
median time: 106.823 ms (0.00% GC)
mean time: 111.285 ms (0.08% GC)
maximum time: 176.250 ms (0.72% GC)
--------------
samples: 45
evals/sample: 1
```
---

Thanks for testing. I think these might need …. But if these are accurate, then I'm a little surprised that "trick" costs as much as 50 μs. The variance of the large cases also seems huge; I guess they all allocate memory. Note that if index ….
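(The suggestion is cut off above. If it was about synchronising inside the benchmarked expression, so that the timing covers the completed GPU work rather than just the kernel launch, it would look something like this sketch:)

```julia
using BenchmarkTools, CUDA

A, B = cu(rand(1, 2, 3)), cu(rand(1, 3, 4))

# Synchronise inside the timed expression, so @benchmark measures the
# finished GPU work rather than just the asynchronous launch:
@benchmark CUDA.@sync easy($A, $B)
```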
---

Whoops. I went ahead and made those corrections, but they did not make a huge difference, other than the minimum time increasing to be closer to the median, and the maximum time increasing slightly. The ratios between techniques stayed about the same. Do the results seem more in line with your thinking now?

P.S. I attempted implementing the second solution you posted, but I'm getting errors from the final ….
---

Oh that's embarrassing, my example of …. The times don't seem crazy to me.

Another thing you might try is OMEinsum.jl; this will call ….
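(The example that followed is missing; presumably it was along these lines, using OMEinsum's exported `@ein` macro. A sketch, not the original code:)

```julia
using OMEinsum, CUDA

A = cu(rand(Float32, 50, 100, 150))
B = cu(rand(Float32, 50, 150, 200))

# The same contraction in Einstein-summation form; OMEinsum chooses its
# own strategy for the contraction:
@ein C[b, a, d] := A[a, b, c] * B[a, c, d]
```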
---

Just tried OMEinsum. Small test:

```
memory estimate: 4.84 KiB
allocs estimate: 56
--------------
minimum time: 1.720 ms (0.00% GC)
median time: 1.739 ms (0.00% GC)
mean time: 1.774 ms (0.00% GC)
maximum time: 2.989 ms (0.00% GC)
--------------
samples: 2818
evals/sample: 1
```

I don't know what is necessary for this solution:

```julia
begin
local raven = orient(permutedims(A, (2, 1, 3)), (:, :, *, :))
local deer = orient(permutedims(B, (1, 3, 2)), (*, :, :, :))
C = dropdims(sum(@__dot__(raven * deer), dims = 4), dims = 4)
end
```

But it's looking like a clear winner. Are there other things you want me to run? I'm happy to test any code you send my way.
---

It might be worth opening an issue with OMEinsum.jl; I see Leo was working to update CUDA things recently. I'm out of ideas for now. It just seems a bit crazy that the best way to do this is broadcasting: for matrix multiplication on the CPU, this is 50 times slower. And many operations involving ….
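(To make the CPU comparison concrete, here is a rough sketch of broadcast-plus-reduce matrix multiplication against plain BLAS; the sizes are illustrative, not from the thread:)

```julia
using BenchmarkTools

A, B = rand(100, 100), rand(100, 100)

# BLAS matrix multiplication:
@btime $A * $B

# The same contraction done by broadcasting and then reducing, i.e.
# C[i,k] = sum_j A[i,j] * B[j,k], which materialises an i×j×k array:
@btime dropdims(sum(reshape($A, 100, 100, 1) .* reshape($B, 1, 100, 100); dims=2); dims=2)
```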
---

All the examples from the first message run without scalar indexing now, on TensorCast v0.4.5, which uses TransmuteDims, #31. The expansions from here have become:

```julia
julia> @pretty @reduce C[a,b,d] := sum(c) A[a,b,c] * B[a,c,d]  # Fast
begin
local spider = transmute(B, Val((1, nothing, 2, 3)))
C = dropdims(sum(@__dot__(A * spider), dims = 3), dims = 3)
end
julia> @pretty @reduce C[b,a,d] := sum(c) A[a,b,c] * B[a,c,d] # Not fast
begin
local spider = transmute(A, Val((2, 1, nothing, 3)))
local tapir = transmute(B, Val((nothing, 1, 3, 2)))
C = dropdims(sum(@__dot__(spider * tapir), dims = 4), dims = 4)
end
```

So I think this can be closed.
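(As a footnote, a small illustration of what `transmute` does, assuming TransmuteDims' exported `transmute`; `nothing` in the permutation inserts a trivial axis:)

```julia
using TransmuteDims

B = rand(2, 3, 4)

# Output dimension i takes input dimension perm[i]; `nothing` inserts a
# size-1 axis. This is a single lazy wrapper, rather than a reshape of a
# PermutedDimsArray:
T = transmute(B, Val((1, nothing, 2, 3)))

size(T)                       # (2, 1, 3, 4)
T == reshape(B, 2, 1, 3, 4)   # true: same values, different wrapper
```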