Potential memleak in TL/CUDA #641

Closed
vspetrov opened this issue Sep 30, 2022 · 1 comment
Labels
bug Something isn't working

Comments

@vspetrov (Collaborator) commented:

Repro on 2 nodes (vulcan):

mpirun -x UCC_CONFIG_FILE= -x UCC_TLS=ucp,cuda -np 48 --display-map --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 --map-by node -x UCX_NET_DEVICES=mlx5_0:1 --bind-to core /hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220929_025813_22676_24405_vulcan01.swx.labs.mlnx/installs/Xw9w/tests/mpich_tests/mtt-tests.git/mpich/test/mpi/comm/ctxsplit

Output:

[1,45]<stdout>:[1664467148.837561] [vulcan02:9669 :0]    tl_cuda_team.c:58   UCC  ERROR cudaMalloc(&self->scratch.loc, scratch_size)() failed: 2(out of memory)
[1,45]<stdout>:[1664467148.837603] [vulcan02:9669 :0]    tl_cuda_team.c:60   TL_CUDA ERROR failed to alloc scratch buffer, 16777216

The test passes with UCC_TLS=ucp, i.e. without TL/CUDA. The test creates and frees multiple communicators in a tight loop:

    for (i = 0; i < nLoop; i++) {
        randval = rand();

        if (randval % (rank + 2) == 0) {
            MPI_Comm_split(MPI_COMM_WORLD, 1, rank, &newcomm);
            MPI_Comm_free(&newcomm);
        }
        else {
            MPI_Comm_split(MPI_COMM_WORLD, MPI_UNDEFINED, rank, &newcomm);
            if (newcomm != MPI_COMM_NULL) {
                errs++;
                printf("Created a non-null communicator with MPI_UNDEFINED\n");
            }
        }
    }

Each time an MPI communicator is created, a corresponding UCC team is created. If the team size is <= 8 and the team fits entirely within a single node, a TL/CUDA team is created as well. The corresponding tl_cuda_team_destroy is correctly called for each MPI_Comm_free, but it looks like some CUDA memory is not released.
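
A rough way to observe this from the application side (a minimal sketch, not the original ctxsplit test; the loop count and print interval are arbitrary) is to watch the free device memory reported by cudaMemGetInfo while repeatedly splitting and freeing a communicator:

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i;
        size_t free_mem, total_mem;
        MPI_Comm newcomm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each split/free cycle creates and destroys a UCC team (and, for
         * small single-node teams, a TL/CUDA team). If team destroy leaks
         * device memory, the reported free amount keeps shrinking. */
        for (i = 0; i < 1000; i++) {
            MPI_Comm_split(MPI_COMM_WORLD, 1, rank, &newcomm);
            MPI_Comm_free(&newcomm);

            if (rank == 0 && i % 100 == 0) {
                cudaMemGetInfo(&free_mem, &total_mem);
                printf("iter %d: free device mem %zu of %zu bytes\n",
                       i, free_mem, total_mem);
            }
        }

        MPI_Finalize();
        return 0;
    }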

If I change ENABLE_RCACHE from 1 to 0 in tl_cuda_cache.c, the problem goes away.
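
For context on why the cache could matter (a hypothetical illustration only; cache_entry_t, cache_map and cache_purge are made-up names, not the actual tl_cuda_cache.c code): a registration cache typically keeps mappings opened with cudaIpcOpenMemHandle alive so later collectives can reuse them, and if such entries are never purged when the owning team is destroyed, the mapped device memory accumulates until cudaMalloc fails:

    #include <cuda_runtime.h>
    #include <stdlib.h>

    /* Hypothetical sketch of the suspected pattern. */
    typedef struct cache_entry {
        cudaIpcMemHandle_t  handle;  /* handle exported by a peer rank   */
        void               *mapped;  /* local mapping of the peer memory */
        struct cache_entry *next;
    } cache_entry_t;

    /* With caching enabled, the mapping stays open after the collective
     * completes so later operations can reuse it. */
    static void *cache_map(cache_entry_t **cache, cudaIpcMemHandle_t h)
    {
        cache_entry_t *e = malloc(sizeof(*e));
        e->handle = h;
        cudaIpcOpenMemHandle(&e->mapped, h, cudaIpcMemLazyEnablePeerAccess);
        e->next = *cache;
        *cache  = e;
        return e->mapped;
    }

    /* If this purge is skipped (or entries outlive the team) on team
     * destroy, every create/free cycle leaves device memory mapped. */
    static void cache_purge(cache_entry_t **cache)
    {
        while (*cache != NULL) {
            cache_entry_t *e = *cache;
            *cache = e->next;
            cudaIpcCloseMemHandle(e->mapped);
            free(e);
        }
    }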

@Sergei-Lebedev could you please have a look?

vspetrov added the bug label on Sep 30, 2022
@vspetrov (Collaborator, Author) commented:

Internal RM: https://redmine.mellanox.com/issues/3220488
