
🐛[BUG]: MeshGraphNet multiGPU test failure #278

Closed
akshaysubr opened this issue Dec 13, 2023 · 4 comments
Labels
0 - Backlog (In queue waiting for assignment) · bug (Something isn't working) · distributed (Distributed and model parallel tools)

Comments

@akshaysubr
Collaborator

Version

main

On which installation method(s) does this occur?

Docker

Describe the issue

Reported by @azrael417 here; pasting the failure log below.

Minimum reproducible example

No response

Relevant log output

I can run some of the tests but the meshgraphnet one fails:

```
models/meshgraphnet/test_meshgraphnet_snmg.py FFF [100%]

=================================== FAILURES ===================================
____________________ test_distributed_meshgraphnet[dtype0] _____________________

dtype = torch.float32

@pytest.mark.multigpu
@pytest.mark.parametrize("dtype", [torch.float32, torch.float16, torch.bfloat16])
def test_distributed_meshgraphnet(dtype):
    num_gpus = torch.cuda.device_count()
    assert num_gpus >= 2, "Not enough GPUs available for test"
    world_size = num_gpus
>       torch.multiprocessing.spawn(
        run_test_distributed_meshgraphnet,
        args=(world_size, dtype),
        nprocs=world_size,
        start_method="spawn",
    )
models/meshgraphnet/test_meshgraphnet_snmg.py:193:

../../.conda/envs/modulus/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:246: in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
../../.conda/envs/modulus/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:202: in start_processes
while not context.join():

self = <torch.multiprocessing.spawn.ProcessContext object at 0x7f7f41a258a0>
timeout = None

def join(self, timeout=None):
    r"""
    Tries to join one or more processes in this spawn context.
    If one of them exited with a non-zero exit status, this function
    kills the remaining processes and raises an exception with the cause
    of the first process exiting.

    Returns ``True`` if all processes have been joined successfully,
    ``False`` if there are more processes that need to be joined.

    Args:
        timeout (float): Wait this long before giving up on waiting.
    """
    # Ensure this function can be called even when we're done.
    if len(self.sentinels) == 0:
        return True

    # Wait for any process to fail or all of them to succeed.
    ready = multiprocessing.connection.wait(
        self.sentinels.keys(),
        timeout=timeout,
    )

    error_index = None
    for sentinel in ready:
        index = self.sentinels.pop(sentinel)
        process = self.processes[index]
        process.join()
        if process.exitcode != 0:
            error_index = index
            break

    # Return if there was no error.
    if error_index is None:
        # Return whether or not all processes have been joined.
        return len(self.sentinels) == 0

    # Assume failure. Terminate processes that are still alive.
    for process in self.processes:
        if process.is_alive():
            process.terminate()
        process.join()
```
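
For context, the failing test can presumably be re-run in isolation with `pytest -m multigpu models/meshgraphnet/test_meshgraphnet_snmg.py` on a machine with at least two GPUs (inferred from the `multigpu` marker and file path in the trace; the exact invocation is an assumption). Below is a minimal sketch of the spawn pattern the test follows; the worker body, rendezvous settings, and backend choice are placeholders, not the actual `run_test_distributed_meshgraphnet`.

```python
# Minimal sketch (assumptions, not the actual Modulus test) of the
# torch.multiprocessing.spawn pattern visible in the trace above.
import os

import torch
import torch.distributed as dist


def _worker(rank: int, world_size: int, dtype: torch.dtype):
    # Hypothetical rendezvous settings; the real test configures its own.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # ... build the distributed MeshGraphNet in `dtype` and run a step here ...

    dist.destroy_process_group()


def run_distributed_test(dtype: torch.dtype = torch.float32):
    world_size = torch.cuda.device_count()
    assert world_size >= 2, "Not enough GPUs available for test"
    # Any failure in a worker is re-raised here via ProcessContext.join().
    torch.multiprocessing.spawn(
        _worker,
        args=(world_size, dtype),
        nprocs=world_size,
        start_method="spawn",
    )


if __name__ == "__main__":
    run_distributed_test()
```

With this pattern, any uncaught exception or non-zero exit in a worker is surfaced to the parent by `ProcessContext.join()`, which is why the pytest report above ends inside `torch/multiprocessing/spawn.py`; the underlying per-rank error is normally carried in the exception that `join()` raises, but it is cut off in the pasted log.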

Environment details

No response

@akshaysubr added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and distributed (Distributed and model parallel tools) labels on Dec 13, 2023
@akshaysubr
Collaborator Author

@mnabian @stadlmax Can one of you take a look at this?

@stadlmax
Collaborator

#171 forgot to update a few multi-GPU tests. I had forgotten that I ran into the same issue when working on #249 and fixed it there. Since #243 could be merged, I'll also fix this behavior there. Either of these two PRs should then also fix the issue on the main branch.

@NickGeneva added the 0 - Backlog (In queue waiting for assignment) label and removed the ? - Needs Triage (Need team to review and classify) label on Jan 18, 2024
@mnabian
Collaborator

mnabian commented Oct 18, 2024

@stadlmax could you please verify that the issue is fixed?

@mnabian
Collaborator

mnabian commented Oct 18, 2024

> @stadlmax could you please verify that the issue is fixed?

Max confirmed this has been fixed. Closing as completed.

@mnabian closed this as completed on Oct 18, 2024