
🐛[BUG]: MeshGraphNet multiGPU test failure #278

Closed
akshaysubr opened this issue Dec 13, 2023 · 4 comments
Labels
0 - Backlog (In queue waiting for assignment) · bug (Something isn't working) · distributed (Distributed and model parallel tools)

Comments

@akshaysubr
Collaborator

Version

main

On which installation method(s) does this occur?

Docker

Describe the issue

Reported by @azrael417 here; pasting the failure log below.

Minimum reproducible example

No response

Relevant log output

I can run some of the tests but the meshgraphnet one fails:

```
models/meshgraphnet/test_meshgraphnet_snmg.py FFF [100%]

=================================== FAILURES ===================================
____________________ test_distributed_meshgraphnet[dtype0] _____________________

dtype = torch.float32

@pytest.mark.multigpu
@pytest.mark.parametrize("dtype", [torch.float32, torch.float16, torch.bfloat16])
def test_distributed_meshgraphnet(dtype):
    num_gpus = torch.cuda.device_count()
    assert num_gpus >= 2, "Not enough GPUs available for test"
    world_size = num_gpus
>       torch.multiprocessing.spawn(
        run_test_distributed_meshgraphnet,
        args=(world_size, dtype),
        nprocs=world_size,
        start_method="spawn",
    )
models/meshgraphnet/test_meshgraphnet_snmg.py:193:

../../.conda/envs/modulus/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:246: in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
../../.conda/envs/modulus/lib/python3.10/site-packages/torch/multiprocessing/spawn.py:202: in start_processes
while not context.join():

self = <torch.multiprocessing.spawn.ProcessContext object at 0x7f7f41a258a0>
timeout = None

def join(self, timeout=None):
    r"""
    Tries to join one or more processes in this spawn context.
    If one of them exited with a non-zero exit status, this function
    kills the remaining processes and raises an exception with the cause
    of the first process exiting.

    Returns ``True`` if all processes have been joined successfully,
    ``False`` if there are more processes that need to be joined.

    Args:
        timeout (float): Wait this long before giving up on waiting.
    """
    # Ensure this function can be called even when we're done.
    if len(self.sentinels) == 0:
        return True

    # Wait for any process to fail or all of them to succeed.
    ready = multiprocessing.connection.wait(
        self.sentinels.keys(),
        timeout=timeout,
    )

    error_index = None
    for sentinel in ready:
        index = self.sentinels.pop(sentinel)
        process = self.processes[index]
        process.join()
        if process.exitcode != 0:
            error_index = index
            break

    # Return if there was no error.
    if error_index is None:
        # Return whether or not all processes have been joined.
        return len(self.sentinels) == 0

    # Assume failure. Terminate processes that are still alive.
    for process in self.processes:
        if process.is_alive():
            process.terminate()
        process.join()
```
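
For context, the failing test can presumably be re-run in isolation with `pytest -m multigpu models/meshgraphnet/test_meshgraphnet_snmg.py` on a machine with at least two GPUs (inferred from the `multigpu` marker and file path in the trace; the exact invocation is an assumption). Below is a minimal sketch of the spawn pattern the test follows; the worker body, rendezvous settings, and backend choice are placeholders, not the actual `run_test_distributed_meshgraphnet`.

```python
# Minimal sketch (assumptions, not the actual Modulus test) of the
# torch.multiprocessing.spawn pattern visible in the trace above.
import os

import torch
import torch.distributed as dist


def _worker(rank: int, world_size: int, dtype: torch.dtype):
    # Hypothetical rendezvous settings; the real test configures its own.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # ... build the distributed MeshGraphNet in `dtype` and run a step here ...

    dist.destroy_process_group()


def run_distributed_test(dtype: torch.dtype = torch.float32):
    world_size = torch.cuda.device_count()
    assert world_size >= 2, "Not enough GPUs available for test"
    # Any failure in a worker is re-raised here via ProcessContext.join().
    torch.multiprocessing.spawn(
        _worker,
        args=(world_size, dtype),
        nprocs=world_size,
        start_method="spawn",
    )


if __name__ == "__main__":
    run_distributed_test()
```

With this pattern, any uncaught exception or non-zero exit in a worker is surfaced to the parent by `ProcessContext.join()`, which is why the pytest report above ends inside `torch/multiprocessing/spawn.py`; the underlying per-rank error is normally carried in the exception that `join()` raises, but it is cut off in the pasted log.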

Environment details

No response

@akshaysubr added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and distributed (Distributed and model parallel tools) labels on Dec 13, 2023
@akshaysubr
Collaborator Author

@mnabian @stadlmax Can one of you take a look at this?

@stadlmax
Collaborator

#171 forgot to update a few multi-GPU tests. I had forgotten that I ran into the same issue when working on #249 and fixed it there. Since #243 could be merged, I'll also fix this behavior there. Either of these two PRs should then also fix the issue on the main branch.

@NickGeneva added the 0 - Backlog (In queue waiting for assignment) label and removed the ? - Needs Triage (Need team to review and classify) label on Jan 18, 2024
@mnabian
Collaborator

mnabian commented Oct 18, 2024

@stadlmax could you please verify that the issue is fixed?

@mnabian
Collaborator

mnabian commented Oct 18, 2024

> @stadlmax could you please verify that the issue is fixed?

Max confirmed this has been fixed. Closing as completed.

@mnabian closed this as completed on Oct 18, 2024