
Harden pantsd memory usage restarts #11618

Closed
stuhood opened this issue Mar 1, 2021 · 5 comments

stuhood commented Mar 1, 2021

When pantsd intentionally exits due to low memory, a few different issues have been observed.

  1. connection races - It's possible for a newly connecting client to attempt to connect to the exiting instance, which causes it to wait for the connection and then fail with an "exceeded timeout of 60 seconds while waiting for pantsd to start" error.
  2. clean shutdown of processes - When pantsd has already decided to exit and a run is canceled due to Ctrl+C, pantsd shutdown will not necessarily wait long enough for cancellation to propagate through the Graph: the call to sys.exit in pantsd prevents Drop from necessarily running for the Graph, leading to orphaned processes (which would otherwise be killed by their Drop handlers).
  3. clean shutdown of threads - When pantsd exits, it does not wait for StreamingWorkunitHandler threads to complete, which means that callbacks which allow async completion can be cut off by the restart (see the sketch after this list).
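To make item 3 concrete, here is a minimal Python sketch (not pantsd's actual code) of how a thread doing asynchronous completion work is cut off if the process exits without joining it:

```python
import threading
import time

def flush_metrics():
    # Stands in for an async StreamingWorkunitHandler callback that needs a
    # moment to finish (e.g. a final flush of metrics over the network).
    time.sleep(1)
    print("metrics flushed")

t = threading.Thread(target=flush_metrics, daemon=True)
t.start()
# Daemon threads are abruptly stopped at interpreter shutdown. Without this
# join, exiting right here cuts off the in-flight flush, which is how
# metrics end up orphaned.
t.join()
```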

tdyas commented Apr 13, 2021

> When pantsd has already decided to exit and a run is canceled due to Ctrl+C, pantsd shutdown will not necessarily wait long enough for cancellation to propagate through the Graph: the call to sys.exit in pantsd prevents Drop from necessarily running for the Graph, leading to orphaned processes (which would otherwise be killed by their Drop handlers).

Does Pants put the child processes into their own process group? If not, it might be good to do so, because pantsd could then just send a SIGKILL to the entire process group.
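A minimal sketch of that suggestion in Python, using plain `subprocess` spawning rather than pantsd's actual execution machinery:

```python
import os
import signal
import subprocess

# start_new_session=True calls setsid() in the child, making it the leader
# of a fresh session and process group.
child = subprocess.Popen(["sleep", "3600"], start_new_session=True)

# On a forced shutdown, kill the whole group: any grandchildren the child
# spawned inherit its group and die with it instead of being orphaned.
os.killpg(os.getpgid(child.pid), signal.SIGKILL)
```

The same mechanism supports coarser groupings (as suggested below); the choice is just which processes share a call to setsid/setpgid.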

tdyas commented Apr 13, 2021

(or a process group per graph run or some other reasonable grouping)

tdyas commented Apr 13, 2021

> It's possible for a newly connecting client to attempt to connect to the exiting instance, which causes it to wait for the connection and then fail with an "exceeded timeout of 60 seconds while waiting for pantsd to start" error.

Maybe the exiting instance should unbind the endpoint and finish shutting down gracefully in the background, while the new instance of pantsd takes over the endpoint immediately?
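A toy illustration of that handover, with a bare TCP listener standing in for pantsd's real server (the address and helper below are purely illustrative):

```python
import socket
import threading

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", 8123))
listener.listen()

def drain_in_flight_runs():
    ...  # finish ongoing sessions without accepting new connections

# Close the listener first so a freshly started daemon can bind the same
# address immediately, then finish remaining work in the background.
listener.close()
threading.Thread(target=drain_in_flight_runs).start()
```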

stuhood commented Apr 14, 2021

It looks like this will also (unsurprisingly) cause a run to exit before the StreamingWorkunitHandler thread has shut down, which can lead to orphaned metrics. I've added this to the description. Fixing this issue blocks #11833.

stuhood self-assigned this Apr 14, 2021
stuhood added a commit to stuhood/pants that referenced this issue Apr 16, 2021
stuhood added a commit that referenced this issue Apr 16, 2021
As described in #11618, when `pantsd` intentionally exits due to low memory, a few types of work can be cut short:
1. if the run ends in Ctrl+C, processes that were cancelled may not have had time to be dropped before `pantsd` exits.
2. async StreamingWorkunitHandler threads might still be running.

This change adds orderly-shutdown mechanisms to the `Scheduler`/`Core` to join all ongoing `Sessions` (including the SWH), and improves tests to ensure that the SWH is waited for.

Additionally, the last commit adds purging of the `pantsd` metadata as soon as we decide to restart, which should reduce (but probably not eliminate) the incidence of item 1 from #11618. Work for #11831 will likely further harden this path.

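As a rough illustration of the orderly-shutdown mechanism this commit describes (the class and method below are invented for the sketch, not the real `Scheduler`/`Core` API):

```python
import threading
import time

class Scheduler:
    """Sketch: tracks one thread per ongoing Session, including the SWH's."""

    def __init__(self) -> None:
        self._sessions: list[threading.Thread] = []

    def shutdown(self, timeout: float = 10.0) -> None:
        # Give all ongoing sessions a shared, bounded window to complete
        # before the daemon actually exits, instead of exiting immediately.
        deadline = time.monotonic() + timeout
        for session in list(self._sessions):
            session.join(max(0.0, deadline - time.monotonic()))
```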

stuhood commented Apr 16, 2021

Fixed by #11929.

stuhood added a commit to stuhood/pants that referenced this issue Apr 16, 2021
stuhood closed this as completed Apr 16, 2021
stuhood added a commit that referenced this issue Apr 16, 2021
…11934)
