Fix runners to do their best to gracefully stop on pod eviction #1759
Conversation
runner/graceful-stop.bash (Outdated)
```bash
# Cannot connect to server, because config files are missing. Skipping removing runner from the server.
# Does not exist. Skipping Removing .credentials
# Does not exist. Skipping Removing .runner
cd /runner
```
Do you want to use `pushd /runner`, so you can `popd` to your original pwd after the `./config remove`?
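For illustration, a minimal sketch of the suggestion (the token variable and the `config.sh remove` arguments are assumptions, not the actual script):

```bash
# Preserve the caller's working directory around the deregistration.
pushd /runner > /dev/null
./config.sh remove --token "$RUNNER_TOKEN"  # RUNNER_TOKEN is hypothetical here
popd > /dev/null
```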
Absolutely! Addressed in f5610db
runner/actions-runner.sh (Outdated)
```bash
#!/bin/bash
source logger.bash
source graceful-stop.bash
```
Do we want to move `trap graceful_stop TERM` in here to make it more readable, since it looks like `actions-runner.sh` is the new main entrypoint script? Then we can add `trap - TERM` at the end of the file to remove the trap? You can ignore me, not a big deal. 😄
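A sketch of what that could look like (assuming `graceful_stop` is defined in `graceful-stop.bash`; the middle of the script is elided):

```bash
#!/bin/bash
source logger.bash
source graceful-stop.bash

# Install the handler early so a SIGTERM arriving during startup is caught.
trap graceful_stop TERM

# ... start dockerd and/or the runner agent, then wait on it ...

# Stop intercepting TERM once graceful-stop handling is no longer needed.
trap - TERM
```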
Sounds much cleaner! Let me try. Thanks for your feedback 😄
Addressed in a666a2f
Might be nice to add shellcheck linting to the pipeline as part of this change, seeing as we're doing major surgery.
I am about to revisit #1517, #1454, #1561, and #1560 as part of our ongoing effort toward the major enhancement to the runner entrypoints being made in #1759. This change is a counterpart of #1852: #1852 enables you to easily run shellcheck locally, while this one runs shellcheck automatically on every pull request. Currently, a shellcheck error does not fail the workflow job. Once we have addressed all the shellcheck findings, we can flip the `fail_on_error` option to `true` and let jobs start failing on pull requests that introduce invalid or suspicious bash code.
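For reference, a hedged sketch of what the equivalent local check could look like (the paths are assumptions; the actual local setup is the subject of #1852):

```bash
# Report findings of every severity, from errors down to style notes.
shellcheck --severity=style runner/*.sh runner/*.bash
```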
…eriodSeconds and RUNNER_GRACEFUL_STOP_TIMEOUT when runner pod was terminated externally too early after its creation

While I was running E2E tests for #1759, I discovered a potential issue: ARC can terminate runner pods without waiting for the registration timeout of 10 minutes. You won't be affected by this in normal circumstances, as the failure scenario can be triggered only when you or another K8s controller like cluster-autoscaler deletes the runner or the runner pod immediately after it has been created. But it's probably worth fixing anyway, because it's not impossible to trigger.
LGTM
Thanks for reviewing!
Wow, I was gonna open a discussion for this because I've been suffering from it since two weeks ago, and it just merged 15 hours ago. What a timing! 👍
This fixes an issue discovered while I was testing #1759. Please see the new comment in the code for more information.
While testing #1759, I found an issue in the rootless dind entrypoint that it was not respecting the configured MTU for dind docker due to a permission issue. This fixes that.
This fixes the aforementioned issue I found while testing #1759.
* Fix permission issue when you use PV for rootless dind cache

  This fixes the aforementioned issue I found while testing #1759.

  Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>
…decar prestop hook

The runner graceful stop timeout has never been propagated to the dind sidecar due to a configuration error in E2E. This fixes it, so that we can verify that the dind sidecar prestop respects the graceful stop timeout.

Related to #1759
Please see the updated README for more details.
~~Perhaps this may already work with RunnerDeployment-managed rootless-dind runners too. I'll give it a shot and update the PR title and README if it works.~~ Worked. This works for rootless-dind pods managed by either RunnerDeployment or RunnerSet.

UPDATE: Added support for rootful dind runners too.
UPDATE: Added support for runners with dind sidecars.
Since this change, a runner that received SIGTERM while still running a job will show a log like the below:

`NOTICE` messages are from our new graceful termination handler, and the other messages are from `actions/runner`'s Runner.Listener (the main agent) or `config.sh remove`.

As you can see, this change enables you to configure `RUNNER_GRACEFUL_STOP_TIMEOUT`. Configure the timeout along with the existing `terminationGracePeriodSeconds`, so that the new graceful termination handler does two things for you: it waits for the running job to complete for up to the configured timeout, and it removes the runner from the server before the pod goes away.
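To illustrate how the pieces fit together, here is a minimal sketch of such a handler (not the actual `graceful-stop.bash`; the default timeout, the `pgrep` pattern, and the token variable are assumptions):

```bash
graceful_stop() {
  # Wait up to RUNNER_GRACEFUL_STOP_TIMEOUT seconds for the agent to finish
  # its current job. Keep this below terminationGracePeriodSeconds so we are
  # done before Kubernetes escalates to SIGKILL.
  local timeout="${RUNNER_GRACEFUL_STOP_TIMEOUT:-15}" waited=0
  while pgrep -f Runner.Listener > /dev/null && [ "$waited" -lt "$timeout" ]; do
    sleep 1
    waited=$((waited + 1))
  done
  # Deregister so the server does not keep a stale offline runner around.
  pushd /runner > /dev/null
  ./config.sh remove --token "$RUNNER_TOKEN"  # token sourcing is hypothetical
  popd > /dev/null
}
```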
I had to change the entrypoint of the runner images from `dumb-init` to `bash` along the way. If not done carefully, that might have resulted in leaving zombie processes in the container or the host (I think) in case either dockerd or `actions/runner` had some bugs. However, I've modified the entrypoint scripts to internally use `dumb-init` for running dockerd and `actions/runner`.

The process tree after this change now looks like the ones below.
For a runner with a dind sidecar:
For a rootful dind runner:
For a rootless dind runner:
As long as our entrypoint (the new `actions-runner.sh`, the modified `startup.sh`, and `rootless-startup.sh`, respectively) sends SIGTERM to dumb-init, it should reap any remaining zombie processes (if any). The updated entrypoint scripts do exactly that, so you should encounter no zombie-process issues even though our init (pid 1) is no longer `dumb-init`.
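A simplified sketch of that arrangement (an illustration under assumptions, not the exact entrypoint): bash stays pid 1 and keeps the TERM trap, while the agent runs under `dumb-init` so its orphaned children are still reaped:

```bash
#!/bin/bash
source graceful-stop.bash
trap graceful_stop TERM

# Run the agent under dumb-init in the background so this shell remains
# free to receive and handle SIGTERM itself.
dumb-init ./run.sh &
runner_pid=$!

# wait returns when run.sh exits, or early when the TERM trap fires.
wait "$runner_pid"

trap - TERM
```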
Note also that a runner with a dind sidecar could still fail to stop gracefully, because the dind sidecar had no way to wait until its client (the `actions/runner` agent) stopped first. You'll see messages like the ones in the log below when this happens. ~~The fix for this should be made in another pull request.~~ I've already addressed it in this PR, with a prestop hook along the lines sketched below.
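A hypothetical sketch of such a sidecar prestop hook (the marker file and the default timeout are assumptions, not the actual implementation):

```bash
#!/bin/bash
# Hold off dockerd's shutdown until the runner container reports that it
# has stopped, or until the graceful stop timeout elapses.
timeout="${RUNNER_GRACEFUL_STOP_TIMEOUT:-15}"
for _ in $(seq "$timeout"); do
  [ -f /runner/.runner-stopped ] && break  # marker file is hypothetical
  sleep 1
done
```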
Ref #1535
Ref #1581