
Runner update end fails #33

Closed · jorge07 opened this issue Apr 15, 2020 · 24 comments

@jorge07
Contributor

jorge07 commented Apr 15, 2020

Problem

Runner update fails and the runner stops working.

Logs

Runner update in progress, do not shutdown runner.
Downloading 2.169.0 runner
Waiting for current job finish running.
Generate and execute update script.
Runner will exit shortly for update, should back online within 10 seconds.
/runner/run.sh: line 47: /runner/bin/Runner.Listener: No such file or directory
@chenrui333
Contributor

This might be related to the new actions runner release today?

@jorge07
Contributor Author

jorge07 commented Apr 16, 2020

Yes, it is. The question is whether the container should be able to self-update or not.

@summerwind
Contributor

Thank you for using this and for reporting!
I think this is the self-hosted runner's default behavior. Let me find out how to stop the self-update.

@summerwind
Contributor

When I looked into the code, I couldn't find a way to stop self-updating. I've created the following issue:
actions/runner#431

I'll also look into how to make self-updating succeed.

@mumoshu
Collaborator

mumoshu commented Apr 16, 2020

@jorge07 Btw, does the runner pod really stop forever on this error?

I thought it would just exit with 3 or another non-zero exit code, as implemented in https://github.com/actions/runner/blob/6c70d53eead402ba5d53676d6ed649a04e219c9b/src/Runner.Listener/Program.cs#L31, which should cause the pod to automatically restart the container, since we set RestartPolicy at https://github.com/summerwind/actions-runner-controller/blob/5714459c24e7cad727eeefe2d188aedcadb47c39/controllers/runner_controller.go#L275

@jorge07
Contributor Author

jorge07 commented Apr 16, 2020

It stops consuming the queue; no jobs are taken.

@aweris

aweris commented May 5, 2020

@summerwind, what do you think of running the runner as a service in the container? There are a couple of options for doing it.

I'm using https://github.com/gdraheim/docker-systemctl-replacement in my docker images

@mumoshu
Collaborator

mumoshu commented May 6, 2020

@aweris We discussed and experimented with that in #40.

I don't understand every detail and implication of the work, but I had a few questions:

  • On pod deletion, how would you forward the SIGTERM sent to PID 1 (systemd) to dockerd and the runner agent?
  • How should we surface continued failures from dockerd or the actions runner to the pod? In a regular pod, we usually see CrashLoopBackOff when one of the containers in a pod fails continuously. If systemd handled that and retried forever, how should we notify the pod that the process is failing continuously?
  • How should we expose dockerd and runner logs via kubectl logs?

WDYT?

@mumoshu
Collaborator

mumoshu commented May 6, 2020

After reading actions/runner#246, I've come to think that it isn't always a good idea to disable the self-update mechanism.

Maybe it's true that we need a kind of real init system (systemd et al.) that hopefully addresses my points above without too much hackery.

@aweris

aweris commented May 6, 2020

@aweris We discussed and experimented with that in #40.

I checked the PR; unfortunately, it's not possible to use the original systemd in Docker without hacking. That's why I used https://github.com/gdraheim/docker-systemctl-replacement as a replacement. It's a simple Python script that replicates the basic functionality of systemd.

When you configure the runner as a service, it creates a unit file, and I used that unit file as a template.

On pod deletion, how would you forward the SIGTERM sent to PID 1 (systemd) to dockerd and the runner agent?

KillMode=process
KillSignal=SIGTERM
TimeoutStopSec=5min

It's a standard unit file. You can use KillMode and KillSignal.

How should we surface continued failures from dockerd or the actions runner to the pod? In a regular pod, we usually see CrashLoopBackOff when one of the containers in a pod fails continuously. If systemd handled that and retried forever, how should we notify the pod that the process is failing continuously?

This is one of the main issues I couldn't address properly. I'm using a custom bash script. It starts the service and watches its status.

How should we expose dockerd and runner logs via kubectl logs?

I'm tracking log files under the _diag directory.

What I do is not an ideal solution, but it's been working in production for the last 5 months without any problems.

This is my hack to the problem.
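
For illustration, here is a rough sketch of the kind of wrapper script described above; the service name, paths, and log locations are assumptions, not the actual script:

#!/bin/bash
# Sketch: start the runner via the systemctl replacement, stream its _diag
# logs to stdout so `kubectl logs` shows them, and exit once the service is
# no longer active so the pod can restart the container.
# Service name and paths are assumptions for illustration.
set -euo pipefail

systemctl start actions.runner.service

# Stream runner diagnostics to the container's stdout in the background.
tail -n +1 -F /runner/_diag/Runner_*.log &

# Poll the service status; leave the loop once it stops being active.
while systemctl is-active --quiet actions.runner.service; do
  sleep 10
done

echo "runner service is no longer active; exiting" >&2
exit 1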

@aweris

aweris commented May 6, 2020

@mumoshu As I said, mimicking systemctl in the container solves the update problem, but it introduces additional complexity.

I am currently looking into other solutions, like ephemeral runners or systemctl alternatives, to manage the lifecycle of the runner.

@summerwind
Contributor

I think it might be a good idea to launch a Runner as a service. I actually tried using systemd to run a Runner in #40.

The problem is that since systemd manages the state of the Runner, it's hard to know when the Controller should recreate the Pod. Currently, the Controller has to recreate the Pod every time the Runner runs a job, so the Controller needs to know the Runner's status.

I might be able to use something like a readiness probe to figure out when the Controller should recreate the Pod, but I haven't tried that method yet.
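
For illustration only, a minimal sketch of the kind of check a readiness (or liveness) probe could exec inside the runner container; the process path is an assumption, and this is not something the controller currently does:

#!/bin/bash
# Probe sketch: succeed only while the runner listener process is running.
# The path is an assumption for illustration.
if pgrep -f "/runner/bin/Runner.Listener" > /dev/null; then
  exit 0   # runner is up: probe succeeds
else
  exit 1   # runner is down: probe fails, so the controller/kubelet can react
fi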

@aweris

aweris commented May 8, 2020

@mumoshu, @summerwind I just debugged the update process, and it seems the problem is the generated update script _update.sh. After the self-update, the runner calls this script, and it's this script that causes the container to exit, not the self-update itself.

Then I used a small Go app instead of a bash script to run the runner, and it worked perfectly, since it waits for the exit code in the background.

package main

import (
	"context"
	"log"
	"os"
	"os/exec"
)

func main() {
	ctx := context.Background()

	// Start run.sh as a child process instead of exec'ing it, so this
	// wrapper stays alive as PID 1 while the runner self-updates.
	cmd := exec.CommandContext(ctx, "./run.sh", "--once")

	// Forward the runner's output to the container's stdout/stderr.
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	// Run blocks until run.sh exits and reports a non-zero exit code as an error.
	err := cmd.Run()
	if err != nil {
		log.Fatal(err)
	}
}

I hope it helps 😄

@mumoshu
Collaborator

mumoshu commented May 9, 2020

@aweris Thanks for your support!

I'm still trying to understand what you said. Does it mean your primary runner process should first run run.sh AND then rerun it whenever it exits with 0?

In other words, we already delegate things to run.sh (but by replacing the primary process with that of run.sh via exec): https://github.com/summerwind/actions-runner-controller/blob/master/runner/entrypoint.sh#L32. Probably your point is that we shouldn't use exec and should just keep retrying run.sh until it fails (with a non-zero exit code). Am I correct?
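
For illustration, a minimal sketch of that idea in the entrypoint, assuming we loop on run.sh instead of exec'ing it (not the actual entrypoint.sh change):

# Instead of: exec ./run.sh --once
# Keep the wrapper process alive and rerun run.sh until it fails for real.
while true; do
  ./run.sh --once
  code=$?
  if [ "$code" -ne 0 ]; then
    # Genuine failure: exit non-zero so the pod restarts the container.
    exit "$code"
  fi
  # Exit code 0: the run finished cleanly (e.g. after a self-update); run again.
done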

@aweris

aweris commented May 9, 2020

I'm still trying to understand what you said.

Let me explain it this way: the main difference between entrypoint.sh and cmd.Run() is that Run() starts run.sh as a background process and waits for the process without blocking its output.

This post also used sleep to prevent the container from exiting during self-update.

Does it mean your primary runner process should first run run.sh AND then rerun it whenever it exits with 0?

I think you can try starting run.sh as a background process and waiting for termination signals or the exit code from the process.

Probably your point is that we shouldn't use exec and should just keep retrying run.sh until it fails (with a non-zero exit code). Am I correct?

Yes; technically, when the runner process finishes without any error, it returns exit code 0. However, I didn't try this approach with bash.
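
For illustration, a rough bash sketch of the background-process approach described above, with signal forwarding; this is a sketch under those assumptions, not a tested entrypoint:

#!/bin/bash
# Sketch: run run.sh as a background child, forward SIGTERM on pod deletion,
# and propagate the runner's exit code.
./run.sh --once &
runner_pid=$!

# Pass termination signals on to the runner so it can shut down cleanly.
trap 'kill -TERM "$runner_pid" 2>/dev/null' TERM INT

# `wait` returns the runner's exit code once it finishes. (A production
# version would also re-wait if the wait itself is interrupted by the
# trapped signal.)
wait "$runner_pid"
exit $?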

@mumoshu
Collaborator

mumoshu commented Jun 21, 2020

@aweris Thanks for the info! It did help me understand the problem.

Today I managed to take some time to reproduce this on my machine, and realized that this might have been resolved upstream.

The run.sh provided by upstream now looks like the following, which seems to gracefully handle the newly added exit status 4 that is returned when updating a run-once runner:

runner@mumoshu-runnertest-hzvfg-qw6zl:/$ cat ./runner/run.sh
#!/bin/bash

# Validate not sudo
user_id=`id -u`
if [ $user_id -eq 0 -a -z "$RUNNER_ALLOW_RUNASROOT" ]; then
    echo "Must not run interactively with sudo"
    exit 1
fi

# Change directory to the script root directory
# https://stackoverflow.com/questions/59895/getting-the-source-directory-of-a-bash-script-from-within
SOURCE="${BASH_SOURCE[0]}"
while [ -h "$SOURCE" ]; do # resolve $SOURCE until the file is no longer a symlink
  DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
  SOURCE="$(readlink "$SOURCE")"
  [[ $SOURCE != /* ]] && SOURCE="$DIR/$SOURCE" # if $SOURCE was a relative symlink, we need to resolve it relative to the path where the symlink file was located
done
DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"

# Do not "cd $DIR". For localRun, the current directory is expected to be the repo location on disk.

# Run
shopt -s nocasematch
if [[ "$1" == "localRun" ]]; then
    "$DIR"/bin/Runner.Listener $*
else
    "$DIR"/bin/Runner.Listener run $*

# Return code 4 means the run once runner received an update message.
# Sleep 5 seconds to wait for the update process finish and run the runner again.
    returnCode=$?
    if [[ $returnCode == 4 ]]; then
        if [ ! -x "$(command -v sleep)" ]; then
            if [ ! -x "$(command -v ping)" ]; then
                COUNT="0"
                while [[ $COUNT != 5000 ]]; do
                    echo "SLEEP" >nul
                    COUNT=$[$COUNT+1]
                done
            else
                ping -n 5 127.0.0.1 >nul
            fi
        else
            sleep 5 >nul
        fi

        "$DIR"/bin/Runner.Listener run $*
    else
        exit $returnCode
    fi
fi

Probably we can safely close this as resolved now. Could anyone confirm?

@mumoshu
Collaborator

mumoshu commented Jun 21, 2020

Well, run.sh seems to have been unchanged for months. So is it that run.sh isn't working as intended? I'm pretty confused.

https://github.com/actions/runner/blob/master/src/Misc/layoutroot/run.sh

@mumoshu
Collaborator

mumoshu commented Jun 21, 2020

Back to the original error reported by @jorge07 and its preceding log message "Runner will exit shortly for update, should back online within 10 seconds.": can it be that in some environments the sleep 5 that is hard-coded in run.sh isn't enough?

@stackdumper

I have the same issue. I'm using a custom image built on top of the base one, and every time a new version of the base image is released, runners try to auto-update and then exit with an error.

√ Connected to GitHub

2020-09-15 18:11:56Z: Runner reconnected.
2020-09-15 18:11:56Z: Listening for Jobs
Runner update in progress, do not shutdown runner.
Downloading 2.273.2 runner
Waiting for current job finish running.
Generate and execute update script.
Runner will exit shortly for update, should back online within 10 seconds.
/runner/run.sh: line 47: /runner/bin/Runner.Listener: No such file or directory

There are some possible solutions proposed above; I'll try to integrate one of them and open a PR.

@igorbrigadir

I've been experiencing similar issues - also using a custom image, and also having to update manually every time there's a new release. In my case, however, I've been making other breaking changes, so it took me a while to pin it down to something wrong with the runner.

It would be fantastic if this were fixed, as having to manually tend to self-hosted runners after updates defeats their purpose. This one, actions/runner#484 (comment), seems like the best way to me.

@ytsarev
Contributor

ytsarev commented Sep 16, 2020

@igorbrigadir thanks a ton for the entrypoint hint; implemented here: #99. Worked like a charm.

@jorge07
Contributor Author

jorge07 commented Sep 17, 2020

Tested #99 10/10, nice job @ytsarev

@jef

jef commented Sep 25, 2020

This is what was happening to me too. I found what @igorbrigadir posted, and it hasn't failed since.

@mumoshu
Collaborator

mumoshu commented Apr 25, 2021

It seems to have worked fine since then. Closing as resolved. Thanks everyone for your help!
