
Harden pantsd memory usage restarts #11618

Closed
stuhood opened this issue Mar 1, 2021 · 5 comments

stuhood commented Mar 1, 2021

When pantsd intentionally exits due to low memory, a few different issues have been observed.

  1. connection races - It's possible for a newly connecting client to attempt to connect to the exiting instance, which causes it to wait for the connection and then fail with an "exceeded timeout of 60 seconds while waiting for pantsd to start" error.
  2. clean shutdown of processes - When pantsd has already decided to exit and a run is canceled due to Ctrl+C, pantsd shutdown will not necessarily wait long enough for cancellation to propagate through the Graph: the call to sys.exit in pantsd prevents Drop from necessarily running for the Graph, leading to orphaned processes (which would otherwise be killed by their Drop handlers).
  3. clean shutdown of threads - When pantsd exits, it does not wait for StreamingWorkunitHandler threads to complete, which means that callbacks which allow async completion can be cut off by the restart (see the sketch after this list).
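To make item 3 concrete, here is a minimal Python sketch (not pantsd's actual code) of how a thread doing asynchronous completion work is cut off if the process exits without joining it:

```python
import threading
import time

def flush_metrics():
    # Stands in for an async StreamingWorkunitHandler callback that needs a
    # moment to finish (e.g. a final flush of metrics over the network).
    time.sleep(1)
    print("metrics flushed")

t = threading.Thread(target=flush_metrics, daemon=True)
t.start()
# Daemon threads are abruptly stopped at interpreter shutdown. Without this
# join, exiting right here cuts off the in-flight flush, which is how
# metrics end up orphaned.
t.join()
```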

tdyas commented Apr 13, 2021

> When pantsd has already decided to exit and a run is canceled due to Ctrl+C, pantsd shutdown will not necessarily wait long enough for cancellation to propagate through the Graph: the call to sys.exit in pantsd prevents Drop from necessarily running for the Graph, leading to orphaned processes (which would otherwise be killed by their Drop handlers).

Does Pants put the child processes into their own process group? If not, it might be good to do so, because pantsd could then just send a SIGKILL to the entire process group.
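A minimal sketch of that suggestion in Python, using plain `subprocess` spawning rather than pantsd's actual execution machinery:

```python
import os
import signal
import subprocess

# start_new_session=True calls setsid() in the child, making it the leader
# of a fresh session and process group.
child = subprocess.Popen(["sleep", "3600"], start_new_session=True)

# On a forced shutdown, kill the whole group: any grandchildren the child
# spawned inherit its group and die with it instead of being orphaned.
os.killpg(os.getpgid(child.pid), signal.SIGKILL)
```

The same mechanism supports coarser groupings (as suggested below); the choice is just which processes share a call to setsid/setpgid.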

tdyas commented Apr 13, 2021

(or a process group per graph run or some other reasonable grouping)

tdyas commented Apr 13, 2021

> It's possible for a newly connecting client to attempt to connect to the exiting instance, which causes it to wait for the connection and then fail with an "exceeded timeout of 60 seconds while waiting for pantsd to start" error.

Maybe the exiting instance should unbind the endpoint and finish shutting down gracefully in the background, while the new instance of pantsd takes over the endpoint immediately?
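A toy illustration of that handover, with a bare TCP listener standing in for pantsd's real server (the address and helper below are purely illustrative):

```python
import socket
import threading

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("127.0.0.1", 8123))
listener.listen()

def drain_in_flight_runs():
    ...  # finish ongoing sessions without accepting new connections

# Close the listener first so a freshly started daemon can bind the same
# address immediately, then finish remaining work in the background.
listener.close()
threading.Thread(target=drain_in_flight_runs).start()
```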

stuhood commented Apr 14, 2021

It looks like this will also (unsurprisingly) cause a run to exit before the StreamingWorkunitHandler thread has shut down, which can lead to orphaned metrics. I've added this to the description. Fixing this issue blocks #11833.

stuhood self-assigned this Apr 14, 2021
stuhood added a commit to stuhood/pants that referenced this issue Apr 16, 2021
stuhood added a commit that referenced this issue Apr 16, 2021
As described in #11618, when `pantsd` intentionally exits due to low memory, a few types of work can be cut short:
1. if the run ends in Ctrl+C, processes that were cancelled may not have had time to be dropped before `pantsd` exits.
2. async StreamingWorkunitHandler threads might still be running.

This change adds orderly-shutdown mechanisms to the `Scheduler`/`Core` to join all ongoing `Sessions` (including the SWH), and improves tests to ensure that the SWH is waited for.

Additionally, the last commit adds purging of the `pantsd` metadata as soon as we decide to restart, which should reduce (but probably not eliminate) the incidence of item 1 from #11618. Work for #11831 will likely further harden this path.

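As a rough illustration of the orderly-shutdown mechanism this commit describes (the class and method below are invented for the sketch, not the real `Scheduler`/`Core` API):

```python
import threading
import time

class Scheduler:
    """Sketch: tracks one thread per ongoing Session, including the SWH's."""

    def __init__(self) -> None:
        self._sessions: list[threading.Thread] = []

    def shutdown(self, timeout: float = 10.0) -> None:
        # Give all ongoing sessions a shared, bounded window to complete
        # before the daemon actually exits, instead of exiting immediately.
        deadline = time.monotonic() + timeout
        for session in list(self._sessions):
            session.join(max(0.0, deadline - time.monotonic()))
```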

stuhood commented Apr 16, 2021

Fixed by #11929.

stuhood added a commit to stuhood/pants that referenced this issue Apr 16, 2021
stuhood closed this as completed Apr 16, 2021
stuhood added a commit that referenced this issue Apr 16, 2021
…11934)
