
Runner update end fails #33

Closed · jorge07 opened this issue Apr 15, 2020 · 24 comments

@jorge07
Contributor

jorge07 commented Apr 15, 2020

Problem

Runner update fails and the runner stops working.

Logs

Runner update in progress, do not shutdown runner.
Downloading 2.169.0 runner
Waiting for current job finish running.
Generate and execute update script.
Runner will exit shortly for update, should back online within 10 seconds.
/runner/run.sh: line 47: /runner/bin/Runner.Listener: No such file or directory
@chenrui333
Contributor

This might be related to the new actions runner release today?

@jorge07
Contributor Author

jorge07 commented Apr 16, 2020

Yes, it is. The question is whether the container should be able to self-update or not.

@summerwind
Contributor

Thank you for using this and for reporting!
I think this is the self-hosted runner's default behavior. Let me find out how to stop the self-update.

@summerwind
Contributor

When I looked into the code, I couldn't find a way to stop self-updating. I've created the following issue:
actions/runner#431

I'll also look into how to make self-updating succeed.

@mumoshu
Collaborator

mumoshu commented Apr 16, 2020

@jorge07 Btw, does the runner pod really stop forever on this error?

I thought it would just exit with 3 or another non-zero exit code, as implemented in https://github.com/actions/runner/blob/6c70d53eead402ba5d53676d6ed649a04e219c9b/src/Runner.Listener/Program.cs#L31, which should cause the pod to automatically restart the container, since we set RestartPolicy at https://github.com/summerwind/actions-runner-controller/blob/5714459c24e7cad727eeefe2d188aedcadb47c39/controllers/runner_controller.go#L275

@jorge07
Contributor Author

jorge07 commented Apr 16, 2020

It stops consuming the queue; no jobs are taken.

@aweris

aweris commented May 5, 2020

@summerwind, what do you think of running the runner as a service in the container? There are a couple of options for doing it.

I'm using https://github.com/gdraheim/docker-systemctl-replacement in my docker images

@mumoshu
Collaborator

mumoshu commented May 6, 2020

@aweris We discussed and experimented with that in #40.

I don't understand every detail and implication of the work, but I had a few questions:

  • On pod deletion, how would you forward the SIGTERM sent to PID 1 (systemd) to dockerd and the runner agent?
  • How should we surface continued failures from dockerd or the actions runner to the pod? In a regular pod, we usually see CrashLoopBackOff when one of the containers in a pod fails continuously. If systemd handled that and retried forever, how should we notify the pod that the process is failing continuously?
  • How should we expose dockerd and runner logs via kubectl logs?

WDYT?

@mumoshu
Collaborator

mumoshu commented May 6, 2020

After reading actions/runner#246, I've come to think that it isn't always a good idea to disable the self-update mechanism.

Maybe it's true that we need a kind of real init system (systemd et al.) that hopefully addresses my points above without too much hackery.

@aweris

aweris commented May 6, 2020

@aweris We discussed and experimented with that in #40.

I checked the PR; unfortunately, it's not possible to use the original systemd in Docker without hacking. That's why I used https://github.com/gdraheim/docker-systemctl-replacement as a replacement. It's a simple Python script that replicates the basic functionality of systemd.

When you configure the runner as a service, it creates a unit file, and I used that unit file as a template.

On pod deletion, how would you forward the SIGTERM sent to PID 1 (systemd) to dockerd and the runner agent?

KillMode=process
KillSignal=SIGTERM
TimeoutStopSec=5min

It's a standard unit file. You can use KillMode and KillSignal.

How should we surface continued failures from dockerd or the actions runner to the pod? In a regular pod, we usually see CrashLoopBackOff when one of the containers in a pod fails continuously. If systemd handled that and retried forever, how should we notify the pod that the process is failing continuously?

This is one of the main issues I couldn't address properly. I'm using a custom bash script. It starts the service and watches its status.

How should we expose dockerd and runner logs via kubectl logs?

I'm tracking log files under the _diag directory.

What I do is not an ideal solution, but it's been working in production for the last 5 months without any problems.

This is my hack to the problem.
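
For illustration, here is a rough sketch of the kind of wrapper script described above; the service name, paths, and log locations are assumptions, not the actual script:

#!/bin/bash
# Sketch: start the runner via the systemctl replacement, stream its _diag
# logs to stdout so `kubectl logs` shows them, and exit once the service is
# no longer active so the pod can restart the container.
# Service name and paths are assumptions for illustration.
set -euo pipefail

systemctl start actions.runner.service

# Stream runner diagnostics to the container's stdout in the background.
tail -n +1 -F /runner/_diag/Runner_*.log &

# Poll the service status; leave the loop once it stops being active.
while systemctl is-active --quiet actions.runner.service; do
  sleep 10
done

echo "runner service is no longer active; exiting" >&2
exit 1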

@aweris

aweris commented May 6, 2020

@mumoshu As I said, mimicking systemctl in the container solves the update problem, but it introduces additional complexity.

I am currently looking into other solutions, like ephemeral runners or systemctl alternatives, to manage the lifecycle of the runner.

@summerwind
Contributor

I think it might be a good idea to launch a Runner as a service. I actually tried using systemd to run a Runner in #40.

The problem is that since systemd manages the state of the Runner, it's hard to know when the Controller should recreate the Pod. Currently, the Controller has to recreate the Pod every time the Runner runs a job, so the Controller needs to know the Runner's status.

I might be able to use something like a readiness probe to figure out when the Controller should recreate the Pod, but I haven't tried that method yet.
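
For illustration only, a minimal sketch of the kind of check a readiness (or liveness) probe could exec inside the runner container; the process path is an assumption, and this is not something the controller currently does:

#!/bin/bash
# Probe sketch: succeed only while the runner listener process is running.
# The path is an assumption for illustration.
if pgrep -f "/runner/bin/Runner.Listener" > /dev/null; then
  exit 0   # runner is up: probe succeeds
else
  exit 1   # runner is down: probe fails, so the controller/kubelet can react
fi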

@aweris

aweris commented May 8, 2020

@mumoshu, @summerwind I just debugged the update process, and it seems the problem is the generated update script _update.sh. After the self-update, the runner calls this script, and it's this script that causes the container to exit, not the self-update itself.

Then I used a small Go app instead of a bash script to run the runner, and it worked perfectly, since it waits for the exit code in the background.

package main

import (
	"context"
	"log"
	"os"
	"os/exec"
)

func main() {
	ctx := context.Background()

	// Start run.sh as a child process instead of exec'ing it, so this
	// wrapper stays alive as PID 1 while the runner self-updates.
	cmd := exec.CommandContext(ctx, "./run.sh", "--once")

	// Forward the runner's output to the container's stdout/stderr.
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	// Run blocks until run.sh exits and reports a non-zero exit code as an error.
	err := cmd.Run()
	if err != nil {
		log.Fatal(err)
	}
}

I hope it helps 😄

@mumoshu
Collaborator

mumoshu commented May 9, 2020

@aweris Thanks for your support!

I'm still trying to understand what you said. Does it mean your primary runner process should first run run.sh AND then rerun it whenever it exits with 0?

In other words, we already delegate things to run.sh (but by replacing the primary process with that of run.sh via exec): https://github.com/summerwind/actions-runner-controller/blob/master/runner/entrypoint.sh#L32. Probably your point is that we shouldn't use exec and should just keep retrying run.sh until it fails (with a non-zero exit code). Am I correct?
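
For illustration, a minimal sketch of that idea in the entrypoint, assuming we loop on run.sh instead of exec'ing it (not the actual entrypoint.sh change):

# Instead of: exec ./run.sh --once
# Keep the wrapper process alive and rerun run.sh until it fails for real.
while true; do
  ./run.sh --once
  code=$?
  if [ "$code" -ne 0 ]; then
    # Genuine failure: exit non-zero so the pod restarts the container.
    exit "$code"
  fi
  # Exit code 0: the run finished cleanly (e.g. after a self-update); run again.
done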

@aweris

aweris commented May 9, 2020

I'm still trying to understand what you said.

Let me explain it this way: the main difference between entrypoint.sh and cmd.Run() is that Run() starts run.sh as a background process and waits for the process without blocking its output.

This post also used sleep to prevent the container from exiting during self-update.

Does it mean your primary runner process should first run run.sh AND then rerun it whenever it exits with 0?

I think you can try starting run.sh as a background process and waiting for termination signals or the exit code from the process.

Probably your point is that we shouldn't use exec and should just keep retrying run.sh until it fails (with a non-zero exit code). Am I correct?

Yes; technically, when the runner process finishes without any error, it returns exit code 0. However, I didn't try this approach with bash.
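
For illustration, a rough bash sketch of the background-process approach described above, with signal forwarding; this is a sketch under those assumptions, not a tested entrypoint:

#!/bin/bash
# Sketch: run run.sh as a background child, forward SIGTERM on pod deletion,
# and propagate the runner's exit code.
./run.sh --once &
runner_pid=$!

# Pass termination signals on to the runner so it can shut down cleanly.
trap 'kill -TERM "$runner_pid" 2>/dev/null' TERM INT

# `wait` returns the runner's exit code once it finishes. (A production
# version would also re-wait if the wait itself is interrupted by the
# trapped signal.)
wait "$runner_pid"
exit $?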

@mumoshu
Collaborator

mumoshu commented Jun 21, 2020

@aweris Thanks for the info! It did help me understand the problem.

Today I managed to take some time to reproduce this on my machine, and realized that this might have been resolved upstream.

The run.sh provided by upstream now looks like the following, which seems to gracefully handle the newly added exit status 4 that is returned when updating a run-once runner:

runner@mumoshu-runnertest-hzvfg-qw6zl:/$ cat ./runner/run.sh
#!/bin/bash

# Validate not sudo
user_id=`id -u`
if [ $user_id -eq 0 -a -z "$RUNNER_ALLOW_RUNASROOT" ]; then
    echo "Must not run interactively with sudo"
    exit 1
fi

# Change directory to the script root directory
# https://stackoverflow.com/questions/59895/getting-the-source-directory-of-a-bash-script-from-within
SOURCE="${BASH_SOURCE[0]}"
while [ -h "$SOURCE" ]; do # resolve $SOURCE until the file is no longer a symlink
  DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"
  SOURCE="$(readlink "$SOURCE")"
  [[ $SOURCE != /* ]] && SOURCE="$DIR/$SOURCE" # if $SOURCE was a relative symlink, we need to resolve it relative to the path where the symlink file was located
done
DIR="$( cd -P "$( dirname "$SOURCE" )" && pwd )"

# Do not "cd $DIR". For localRun, the current directory is expected to be the repo location on disk.

# Run
shopt -s nocasematch
if [[ "$1" == "localRun" ]]; then
    "$DIR"/bin/Runner.Listener $*
else
    "$DIR"/bin/Runner.Listener run $*

# Return code 4 means the run once runner received an update message.
# Sleep 5 seconds to wait for the update process finish and run the runner again.
    returnCode=$?
    if [[ $returnCode == 4 ]]; then
        if [ ! -x "$(command -v sleep)" ]; then
            if [ ! -x "$(command -v ping)" ]; then
                COUNT="0"
                while [[ $COUNT != 5000 ]]; do
                    echo "SLEEP" >nul
                    COUNT=$[$COUNT+1]
                done
            else
                ping -n 5 127.0.0.1 >nul
            fi
        else
            sleep 5 >nul
        fi

        "$DIR"/bin/Runner.Listener run $*
    else
        exit $returnCode
    fi
fi

Probably we can safely close this as resolved now. Could anyone confirm?

@mumoshu
Collaborator

mumoshu commented Jun 21, 2020

Well, run.sh seems to have been unchanged for months. So is it that run.sh isn't working as intended? I'm pretty confused.

https://github.com/actions/runner/blob/master/src/Misc/layoutroot/run.sh

@mumoshu
Collaborator

mumoshu commented Jun 21, 2020

Back to the original error reported by @jorge07 and its preceding log message "Runner will exit shortly for update, should back online within 10 seconds.": can it be that in some environments the sleep 5 that is hard-coded in run.sh isn't enough?

@stackdumper

I have the same issue. I'm using a custom image built on top of the base one, and every time a new version of the base image is released, runners try to auto-update and then exit with an error.

√ Connected to GitHub

2020-09-15 18:11:56Z: Runner reconnected.
2020-09-15 18:11:56Z: Listening for Jobs
Runner update in progress, do not shutdown runner.
Downloading 2.273.2 runner
Waiting for current job finish running.
Generate and execute update script.
Runner will exit shortly for update, should back online within 10 seconds.
/runner/run.sh: line 47: /runner/bin/Runner.Listener: No such file or directory

There are some possible solutions proposed above; I'll try to integrate one of them and open a PR.

@igorbrigadir

I've been experiencing similar issues - also using a custom image, and also having to update manually every time there's a new release. In my case, however, I've been making other breaking changes, so it took me a while to pin it down to something wrong with the runner.

It would be fantastic if this were fixed, as having to manually tend to self-hosted runners after updates defeats their purpose. This one, actions/runner#484 (comment), seems like the best way to me.

@ytsarev
Contributor

ytsarev commented Sep 16, 2020

@igorbrigadir thanks a ton for the entrypoint hint; implemented here: #99. Worked like a charm.

@jorge07
Contributor Author

jorge07 commented Sep 17, 2020

Tested #99 10/10, nice job @ytsarev

@jef

jef commented Sep 25, 2020

This is what was happening to me too. I found what @igorbrigadir posted, and it hasn't failed since.

@mumoshu
Collaborator

mumoshu commented Apr 25, 2021

It seems to have worked fine since then. Closing as resolved. Thanks everyone for your help!
