[BUG]: CommClosedError on heartbeat during Dask Client shutdown #2026

Open
2 tasks done
cwharris opened this issue Oct 31, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@cwharris
Contributor

Version

24.10

Which installation method(s) does this occur on?

Source

Describe the bug.

Original issue: #1990
Related Dask issue: dask/distributed#7891

Shutting down a Dask Client in Morpheus intermittently produces a CommClosedError during the worker heartbeat. The error occurs because a coroutine starts a heartbeat exchange after the scheduler has already closed.

Expected Behavior:

The Dask Client should close gracefully without any errors indicating communication failures.

Observed Behavior:

The worker logs an error caused by a StreamClosedError in tornado, which Dask relies on for communication; distributed re-raises it as a CommClosedError. The error appears intermittently during shutdown of stages that use Dask.

Root Cause:

This issue appears to be a race condition in dask.distributed where the coroutine initiating a heartbeat is not aware that the scheduler has already closed, causing it to attempt communication and fail.
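
The race described above can be modelled with a toy asyncio program. This is only a sketch: `ToyScheduler`, `worker_heartbeat`, and the `closing` event are hypothetical names made up for illustration and are not part of dask.distributed. It shows a heartbeat that is scheduled before shutdown but only runs after the connection is closed, and one way a state-aware check could treat that failure as expected.

```python
import asyncio


class ToyScheduler:
    """Hypothetical stand-in for a scheduler connection (not a Dask API)."""

    def __init__(self):
        self.closed = False

    async def heartbeat(self):
        # Yield to the event loop, as a real network read would.
        await asyncio.sleep(0)
        if self.closed:
            raise ConnectionError("Stream is closed")
        return "ok"


async def worker_heartbeat(scheduler, closing):
    """Send one heartbeat; tolerate a closed stream only during shutdown."""
    try:
        return await scheduler.heartbeat()
    except ConnectionError:
        if closing.is_set():
            # Expected: shutdown won the race against this heartbeat.
            return "ignored-during-shutdown"
        raise  # Unexpected outside shutdown, so surface it.


async def demo():
    scheduler = ToyScheduler()
    closing = asyncio.Event()
    # The heartbeat is scheduled first but has not run yet...
    task = asyncio.create_task(worker_heartbeat(scheduler, closing))
    # ...and shutdown closes the connection before it gets to run.
    closing.set()
    scheduler.closed = True
    return await task


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Without the `closing` check, the same ordering would surface as an unhandled connection error, which is the shape of the failure in the log below.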

Proposed Solutions:

  1. Improve Exception Handling in Dask: Enhance dask.distributed to handle this scenario by catching the StreamClosedError within heartbeat operations and either logging it or ignoring it based on the client state.
  2. Override asyncio Event Loop Exception Handling: Provide a custom handler for the asyncio event loop Dask uses (should be default) that detects this exception and ignores it if it occurred due to a known client closure.
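
As a rough sketch of option 2, a handler installed with `loop.set_exception_handler` could drop the known exception while a shutdown is in progress and pass everything else to the default handler. The names `make_shutdown_handler`, `is_closing`, and `suppressed` are hypothetical, and matching on the class name is an assumption made to keep the example independent of an installed `distributed`; real code would more likely import `CommClosedError` from `distributed.comm.core`. Note that the message in the log below is emitted by the `distributed.worker` logger, so it is not confirmed that this exception ever reaches the asyncio exception handler.

```python
import asyncio

# Assumption: match by class name so this sketch runs without distributed.
SUPPRESSED_NAMES = {"CommClosedError", "StreamClosedError"}


def make_shutdown_handler(is_closing, suppressed):
    """Build an asyncio exception handler.

    is_closing: zero-argument callable returning True during client shutdown.
    suppressed: list that collects the exceptions that were ignored.
    """

    def handler(loop, context):
        exc = context.get("exception")
        known = exc is not None and type(exc).__name__ in SUPPRESSED_NAMES
        if known and is_closing():
            # Known shutdown race: record it and skip the error log.
            suppressed.append(exc)
            return
        # Anything else goes through asyncio's normal reporting.
        loop.default_exception_handler(context)

    return handler
```

A caller would install it with `loop.set_exception_handler(make_shutdown_handler(...))` on the loop the client runs on, and flip the `is_closing` state just before closing the client.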

Minimum reproducible example

Run the ransomware_detection pipeline.

Relevant log output

2024-10-31 18:49:51,926 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/home/coder/.conda/envs/cyber/lib/python3.10/site-packages/distributed/comm/tcp.py", line 225, in read
    frames_nosplit_nbytes_bin = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/coder/.conda/envs/cyber/lib/python3.10/site-packages/distributed/worker.py", line 1251, in heartbeat
    response = await retry_operation(
  File "/home/coder/.conda/envs/cyber/lib/python3.10/site-packages/distributed/utils_comm.py", line 461, in retry_operation
    return await retry(
  File "/home/coder/.conda/envs/cyber/lib/python3.10/site-packages/distributed/utils_comm.py", line 440, in retry
    return await coro()
  File "/home/coder/.conda/envs/cyber/lib/python3.10/site-packages/distributed/core.py", line 1256, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/home/coder/.conda/envs/cyber/lib/python3.10/site-packages/distributed/core.py", line 1015, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/home/coder/.conda/envs/cyber/lib/python3.10/site-packages/distributed/comm/tcp.py", line 236, in read
    convert_stream_closed_error(self, e)
  File "/home/coder/.conda/envs/cyber/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:54960 remote=tcp://127.0.0.1:38031>: Stream is closed

Other/Misc.

No response

Code of Conduct

  • I agree to follow Morpheus' Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report

cwharris added the bug label on Oct 31, 2024
rapids-bot bot pushed a commit that referenced this issue Nov 1, 2024
See #2026

## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/nv-morpheus/Morpheus/blob/main/docs/source/developer_guide/contributing.md).
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

Authors:
  - Christopher Harris (https://github.com/cwharris)

Approvers:
  - David Gardner (https://github.com/dagardner-nv)

URL: #2027