
Slow simple 2D copy kernel with Metal backend #464

Open · LaurentPlagne opened this issue Feb 27, 2024 · 2 comments
@LaurentPlagne

Hi,

I'm trying to use KA for the first time and I'm wondering about the performance I get for a simple kernel that copies two 2D matrices of Float32 (I know that I could copy them as vectors):

using Metal
using KernelAbstractions
using Random
using BenchmarkTools

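# One work-item per global (i, j) index: each writes b[i, j] = a[i, j].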
@kernel function copy2D_kernel!(b, a)
    i, j = @index(Global, NTuple)
    @inbounds b[i, j] = a[i, j]
end

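# Instantiate the kernel for the array's backend and launch it over the full 2D index range.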
function copy2D!(b, a)
    backend = get_backend(a)
    groupsize = KernelAbstractions.isgpu(backend) ? 256 : 1024
    kernel! = copy2D_kernel!(backend, groupsize)
    kernel!(b, a, ndrange=size(a))
end

function go()

    res = 2^14
    # creating initial cpu arrays
    a_cpu = rand(Float32, res, res)
    b_cpu = zeros(Float32, res, res)
    @info("size of a,b (GB) :",2sizeof(a_cpu)/(1.e9))

    # creating initial gpu arrays
    a = MtlArray(a_cpu)
    b = MtlArray(b_cpu)

    backend = get_backend(a)
    gpu_elapsed = @belapsed begin
        copy2D!($b,$a)
        KernelAbstractions.synchronize($backend)
    end

    cpu_elapsed = @belapsed $a_cpu .= $b_cpu

    bandwidth_GBs(res,t,T) = sizeof(T)*res*res*2/(t*1.e9) 
    @info(cpu_elapsed,bandwidth_GBs(res,cpu_elapsed,Float32))
    @info(gpu_elapsed,bandwidth_GBs(res,gpu_elapsed,Float32))

    nothing
end

And I obtain (on an M1 Max MacBook Pro) a plain CPU copy that is about twice as fast as the KA GPU one...

┌ Info: size of a,b (GB) :
└ (2 * sizeof(a_cpu)) / 1.0e9 = 2.147483648
┌ Info: 0.022282291
└ bandwidth_GBs(res, cpu_elapsed, Float32) = 96.37625000050488
┌ Info: 0.047214875
└ bandwidth_GBs(res, gpu_elapsed, Float32) = 45.48320096156137
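
(For scale: each array is 2^14 × 2^14 × 4 bytes ≈ 1.07 GB, so one read plus one write moves ≈ 2.15 GB; 0.047 s therefore corresponds to ≈ 45 GB/s and the CPU broadcast to ≈ 96 GB/s, both well below the M1 Max's nominal ~400 GB/s unified-memory bandwidth.)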

Any hint?

Laurent

@bjarthur (Contributor) commented Jun 7, 2024

How do your benchmarks vary with groupsize and res? Are there regions in that space for which the GPU is faster?
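
A minimal sketch of such a sweep, reusing copy2D_kernel! from the original post (the sweep helper and its ranges are just illustrative, not a recommended configuration):

using Metal, KernelAbstractions, BenchmarkTools

# Hypothetical helper: benchmark the copy kernel over a grid of (res, groupsize) pairs.
function sweep(; sizes = (2^12, 2^13, 2^14), groups = (64, 128, 256, 512, 1024))
    for res in sizes, groupsize in groups
        a = MtlArray(rand(Float32, res, res))
        b = MtlArray(zeros(Float32, res, res))
        backend = get_backend(a)
        kern = copy2D_kernel!(backend, groupsize)
        t = @belapsed begin
            $kern($b, $a, ndrange=size($a))
            KernelAbstractions.synchronize($backend)
        end
        gbs = 2 * sizeof(Float32) * res^2 / (t * 1e9)   # one read + one write, in GB/s
        println("res=$res groupsize=$groupsize -> $(round(gbs, digits=1)) GB/s")
    end
end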

@LaurentPlagne (Author)

It looks rather stable for res in {2^15, 2^16} and groupsize in {126, 256, 512, 1024}.
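
One quick check that might help isolate whether the gap comes from the KA launch path or from the device itself is to benchmark Base copyto! on the same MtlArrays; a rough sketch, assuming the a, b and backend created in go() above:

# Compare the KA kernel against Base copyto! on the same arrays.
t_ka     = @belapsed (copy2D!($b, $a); KernelAbstractions.synchronize($backend))
t_native = @belapsed (copyto!($b, $a); KernelAbstractions.synchronize($backend))

bw(t) = 2 * sizeof(Float32) * length(a) / (t * 1e9)   # GB/s for one read + one write
@info "KA kernel (GB/s)" bw(t_ka)
@info "copyto! (GB/s)" bw(t_native)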
