Potential memleak in TL/CUDA #641

Closed
vspetrov opened this issue Sep 30, 2022 · 1 comment
Labels
bug Something isn't working

Comments

@vspetrov (Collaborator) commented:

Repro on 2 nodes (vulcan):

mpirun -x UCC_CONFIG_FILE= -x UCC_TLS=ucp,cuda -np 48 --display-map --mca coll_ucc_enable 1 --mca coll_ucc_priority 100 --map-by node -x UCX_NET_DEVICES=mlx5_0:1 --bind-to core /hpc/mtr_scrap/users/anatolyv/scratch/ucc/20220929_025813_22676_24405_vulcan01.swx.labs.mlnx/installs/Xw9w/tests/mpich_tests/mtt-tests.git/mpich/test/mpi/comm/ctxsplit

Output:

[1,45]<stdout>:[1664467148.837561] [vulcan02:9669 :0]    tl_cuda_team.c:58   UCC  ERROR cudaMalloc(&self->scratch.loc, scratch_size)() failed: 2(out of memory)
[1,45]<stdout>:[1664467148.837603] [vulcan02:9669 :0]    tl_cuda_team.c:60   TL_CUDA ERROR failed to alloc scratch buffer, 16777216

The test passes with UCC_TLS=ucp, i.e. without TL/CUDA. The test creates and frees multiple communicators in a tight loop:

    for (i = 0; i < nLoop; i++) {
        randval = rand();

        if (randval % (rank + 2) == 0) {
            MPI_Comm_split(MPI_COMM_WORLD, 1, rank, &newcomm);
            MPI_Comm_free(&newcomm);
        }
        else {
            MPI_Comm_split(MPI_COMM_WORLD, MPI_UNDEFINED, rank, &newcomm);
            if (newcomm != MPI_COMM_NULL) {
                errs++;
                printf("Created a non-null communicator with MPI_UNDEFINED\n");
            }
        }
    }

Each time an MPI communicator is created, a corresponding UCC team is created. If the team size is <= 8 and the team fits entirely within a single node, a TL/CUDA team is created as well. The corresponding tl_cuda_team_destroy is correctly called for each MPI_Comm_free, but it looks like some CUDA memory is not released.
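
A rough way to observe this from the application side (a minimal sketch, not the original ctxsplit test; the loop count and print interval are arbitrary) is to watch the free device memory reported by cudaMemGetInfo while repeatedly splitting and freeing a communicator:

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i;
        size_t free_mem, total_mem;
        MPI_Comm newcomm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each split/free cycle creates and destroys a UCC team (and, for
         * small single-node teams, a TL/CUDA team). If team destroy leaks
         * device memory, the reported free amount keeps shrinking. */
        for (i = 0; i < 1000; i++) {
            MPI_Comm_split(MPI_COMM_WORLD, 1, rank, &newcomm);
            MPI_Comm_free(&newcomm);

            if (rank == 0 && i % 100 == 0) {
                cudaMemGetInfo(&free_mem, &total_mem);
                printf("iter %d: free device mem %zu of %zu bytes\n",
                       i, free_mem, total_mem);
            }
        }

        MPI_Finalize();
        return 0;
    }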

If I change ENABLE_RCACHE from 1 to 0 in tl_cuda_cache.c, the problem goes away.
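
For context on why the cache could matter (a hypothetical illustration only; cache_entry_t, cache_map and cache_purge are made-up names, not the actual tl_cuda_cache.c code): a registration cache typically keeps mappings opened with cudaIpcOpenMemHandle alive so later collectives can reuse them, and if such entries are never purged when the owning team is destroyed, the mapped device memory accumulates until cudaMalloc fails:

    #include <cuda_runtime.h>
    #include <stdlib.h>

    /* Hypothetical sketch of the suspected pattern. */
    typedef struct cache_entry {
        cudaIpcMemHandle_t  handle;  /* handle exported by a peer rank   */
        void               *mapped;  /* local mapping of the peer memory */
        struct cache_entry *next;
    } cache_entry_t;

    /* With caching enabled, the mapping stays open after the collective
     * completes so later operations can reuse it. */
    static void *cache_map(cache_entry_t **cache, cudaIpcMemHandle_t h)
    {
        cache_entry_t *e = malloc(sizeof(*e));
        e->handle = h;
        cudaIpcOpenMemHandle(&e->mapped, h, cudaIpcMemLazyEnablePeerAccess);
        e->next = *cache;
        *cache  = e;
        return e->mapped;
    }

    /* If this purge is skipped (or entries outlive the team) on team
     * destroy, every create/free cycle leaves device memory mapped. */
    static void cache_purge(cache_entry_t **cache)
    {
        while (*cache != NULL) {
            cache_entry_t *e = *cache;
            *cache = e->next;
            cudaIpcCloseMemHandle(e->mapped);
            free(e);
        }
    }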

@Sergei-Lebedev could you please have a look?

vspetrov added the bug label on Sep 30, 2022
@vspetrov (Collaborator, Author) commented:

Internal RM: https://redmine.mellanox.com/issues/3220488
