
[cuda] Improve kernel return value performance when unified memory is available #965

Merged: 3 commits merged into taichi-dev:master on May 13, 2020

Conversation

yuanming-hu (Member)

@KLozes (Collaborator) commented May 13, 2020

test_pointer3 is definitely running faster, around a minute or so, but it is failing intermittently! It seems to fail about 25% of the time:

klozes@system76-pc:~/Documents/software/taichi-dev/tests/python$ ti test test_sparse_deactivate.py 
[Taichi] mode=development
[Taichi] preparing sandbox at /tmp/taichi-62s4a5_c
[Taichi] sandbox prepared
[Taichi] <dev mode>, supported archs: [cpu, cuda], commit 81148358, python 3.6.9

 *******************************************
 **     Taichi Programming Language       **
 *******************************************


Running python tests...

ERROR: usage: ti [options] [file_or_dir] [file_or_dir] [...]
ti: error: unrecognized arguments: -n1
  inifile: None
  rootdir: /home/klozes/Documents/software/taichi-dev

======================================================================== test session starts ========================================================================
platform linux -- Python 3.6.9, pytest-5.3.5, py-1.8.1, pluggy-0.13.1
rootdir: /home/klozes/Documents/software/taichi-dev
collected 4 items                                                                                                                                                   

test_sparse_deactivate.py ..F.                                                                                                                                [100%]

============================================================================= FAILURES ==============================================================================
___________________________________________________________________________ test_pointer3 ___________________________________________________________________________

    @ti.archs_support_sparse
    def test_pointer3():
        x = ti.var(ti.f32)
        x_temp = ti.var(ti.f32)
    
        n = 16
    
        ptr1 = ti.root.pointer(ti.ij, n)
        ptr1.dense(ti.ij, n).place(x)
        ptr2 = ti.root.pointer(ti.ij, n)
        ptr2.dense(ti.ij, n).place(x_temp)
    
        @ti.kernel
        def fill():
            for j in range(n * n):
                for i in range(n * n):
                    x[i, j] = i + j
    
        @ti.kernel
        def fill2():
            for i, j in x_temp:
                if x_temp[i, j] < 100:
                    x[i, j] = x_temp[i, j]
    
        @ti.kernel
        def copy_to_temp():
            for i, j in x:
                x_temp[i, j] = x[i, j]
    
        @ti.kernel
        def copy_from_temp():
            for i, j in x_temp:
                x[i, j] = x_temp[i, j]
    
        @ti.kernel
        def clear():
            for i, j in ptr1:
                ti.deactivate(ptr1, [i, j])
    
        @ti.kernel
        def clear_temp():
            for i, j in ptr2:
                ti.deactivate(ptr2, [i, j])
    
        fill()
        copy_to_temp()
        clear()
        fill2()
        clear_temp()
    
        for iter in range(100):
            copy_to_temp()
            clear()
            copy_from_temp()
            clear_temp()
    
            for j in range(n * n):
                for i in range(n * n):
                    if i + j < 100:
>                       assert x[i, j] == i + j
E                       assert 0.0 == (1 + 0)

test_sparse_deactivate.py:134: AssertionError
----------------------------------------------------------------------- Captured stdout call ------------------------------------------------------------------------
Running test on arch=Arch.x64
Running test on arch=Arch.cuda

@KLozes (Collaborator) commented May 13, 2020

Perhaps a cudaDeviceSynchronize() is necessary?
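
A minimal sketch of the suspected race, not the actual Taichi runtime code (the kernel and buffer names are illustrative assumptions): a kernel launch is asynchronous, so if the return value lives in a unified (managed) buffer, a host read without a synchronize can observe a stale value, which would explain an intermittent failure.

// Illustrative sketch only; names like slow_write and ret_buf are assumptions.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void slow_write(int *ret_buf) {
  // Busy-wait so the kernel is still running when an unsynchronized host
  // read would otherwise happen.
  long long start = clock64();
  while (clock64() - start < 100000000LL) {
  }
  *ret_buf = 1;
}

int main() {
  int *ret_buf = nullptr;
  cudaMallocManaged(&ret_buf, sizeof(int));
  *ret_buf = 0;

  slow_write<<<1, 1>>>(ret_buf);
  // Reading *ret_buf here would race the still-running kernel (and can even
  // fault on GPUs without concurrent managed access).
  cudaDeviceSynchronize();  // the suggested fix: wait for the kernel first
  printf("result = %d\n", *ret_buf);  // reliably prints 1

  cudaFree(ret_buf);
  return 0;
}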

@yuanming-hu (Member, Author)
Thanks! You are right. It's also interesting to see that CUDA synchronization seems to be much cheaper than CUDA memcpy...
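
For reference, a minimal sketch of the two ways to get a scalar back from a kernel, using a hypothetical result buffer (this is not the Taichi implementation itself): a device allocation plus an explicit device-to-host cudaMemcpy, versus a cudaMallocManaged buffer that the host reads directly after a cudaDeviceSynchronize().

// Sketch comparing the two return-value paths; buffer and kernel names are
// illustrative, not taken from the Taichi codebase.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void compute(float *ret_buf) {
  *ret_buf = 42.0f;  // the kernel's "return value"
}

int main() {
  // Path 1: device buffer + explicit copy back to the host.
  float *dev_ret = nullptr;
  float host_ret = 0.0f;
  cudaMalloc(&dev_ret, sizeof(float));
  compute<<<1, 1>>>(dev_ret);
  cudaMemcpy(&host_ret, dev_ret, sizeof(float), cudaMemcpyDeviceToHost);
  printf("memcpy path:  %f\n", host_ret);
  cudaFree(dev_ret);

  // Path 2 (when unified memory is available): managed buffer read directly
  // on the host after a synchronize, avoiding the memcpy.
  float *managed_ret = nullptr;
  cudaMallocManaged(&managed_ret, sizeof(float));
  compute<<<1, 1>>>(managed_ret);
  cudaDeviceSynchronize();
  printf("unified path: %f\n", *managed_ret);
  cudaFree(managed_ret);
  return 0;
}

On the default stream the cudaMemcpy in path 1 itself acts as a synchronization point, which is why the unified-memory path still needs the explicit cudaDeviceSynchronize() before the host read.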

@KLozes (Collaborator) commented May 13, 2020

No problem! LGTM. It's super fast now.

@KLozes merged commit 979ec63 into taichi-dev:master on May 13, 2020