Unhandled race condition in DockerExecutor.Kill #1884

Closed
frs3 opened this issue Dec 20, 2019 · 0 comments · Fixed by #2208
frs3 commented Dec 20, 2019

Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT

What happened:
A workflow step with a sidecar fails intermittently when the sidecar terminates at the same time as the main container.

If DockerExecutor.Kill (docker.go:99) is called while the container is already terminating, docker kill can fail with "Error response from daemon: Cannot kill container ... is not running".

Potential fix in docker.go:

func (d *DockerExecutor) Kill(containerIDs []string) error {
  killArgs := append([]string{"kill", "--signal", "TERM"}, containerIDs...)
  err := common.RunCommand("docker", killArgs...)
  if err != nil {
    // The container exited before the signal arrived, so there is nothing left to kill.
    if strings.Contains(err.Error(), "is not running") {
      return nil
    }
    return errors.InternalWrapError(err)
  }
  // ... (rest of the function as before)
}

Alternatively, just ignore the error returned by docker kill.

What you expected to happen:
Workstep should not fail.

How to reproduce it (as minimally and precisely as possible):

Run the workflow below repeatedly until the race condition is triggered. If that does not trigger it, reproduce the error in a bash shell by calling docker kill on an already-terminated container, e.g.:

docker run alpine
docker ps -a
docker kill <container id from docker ps -a>

Workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: sidecar-
spec:
  entrypoint: sidecar-bug
  templates:
  - name: sidecar-bug
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["
        ls
      "]
    sidecars:
    - name: test
      image: alpine

Anything else we need to know?:

It would also be nice to make KillGracePeriod configurable, as requested in:

Configurable graceful shutdown period for container and sidecar #1012

Environment:

  • Argo version:
argo: v2.4.3
  BuildDate: 2019-12-06T03:36:38Z
  GitCommit: cfe5f377bc3552fba90afe6db7a76edd92c753cd
  GitTreeState: clean
  GitTag: v2.4.3
  GoVersion: go1.11.5
  Compiler: gc
  Platform: darwin/amd64
  • Kubernetes version :
clientVersion:
  buildDate: 2018-02-07T12:22:21Z
  compiler: gc
  gitCommit: d2835416544f298c919e2ead3be3d0864b52323b
  gitTreeState: clean
  gitVersion: v1.9.3
  goVersion: go1.9.2
  major: "1"
  minor: "9"
  platform: linux/amd64
serverVersion:
  buildDate: 2019-11-07T19:12:22Z
  compiler: gc
  gitCommit: 56d89863d1033f9668ddd6e1c1aea81cd846ef88
  gitTreeState: clean
  gitVersion: v1.13.11-gke.14
  goVersion: go1.12.11b4
  major: "1"
  minor: 13+
  platform: linux/amd64

Other debugging information (if applicable):

  • workflow result:
Name:                sidecar-fc6g2
Namespace:           default
ServiceAccount:      default
Status:              Error
Message:             failed to save outputs: Error response from daemon: Cannot kill container: a542e06b69fcf66c7b82bf9416d91a7adefc3f0046a1c54a909432a9907c6cd8: Container a542e06b69fcf66c7b82bf9416d91a7adefc3f0046a1c54a909432a9907c6cd8 is not running
Created:             Fri Dec 20 14:01:11 +0000 (17 minutes ago)
Started:             Fri Dec 20 14:01:11 +0000 (17 minutes ago)
Finished:            Fri Dec 20 14:01:14 +0000 (17 minutes ago)
Duration:            3 seconds

STEP              PODNAME        DURATION  MESSAGE
 ⚠ sidecar-fc6g2  sidecar-fc6g2  2s        failed to save outputs: Error response from daemon: Cannot kill container: a542e06b69fcf66c7b82bf9416d91a7adefc3f0046a1c54a909432a9907c6cd8: Container a542e06b69fcf66c7b82bf9416d91a7adefc3f0046a1c54a909432a9907c6cd8 is not running
  • executor logs:

time="2019-12-20T14:01:12Z" level=info msg="Creating a docker executor"
time="2019-12-20T14:01:12Z" level=info msg="Executor (version: v2.4.1, build_date: 2019-10-08T23:14:37Z) initialized (pod: default/sidecar-fc6g2) with template:\n{\"name\":\"sidecar-bug\",\"arguments\":{},\"inputs\":{},\"outputs\":{},\"metadata\":{},\"container\":{\"name\":\"\",\"image\":\"alpine:latest\",\"command\":[\"sh\",\"-c\"],\"args\":[\" ls \"],\"resources\":{}},\"sidecars\":[{\"name\":\"test\",\"image\":\"alpine\",\"resources\":{}}]}"
time="2019-12-20T14:01:12Z" level=info msg="Waiting on main container"
time="2019-12-20T14:01:13Z" level=info msg="main container started with container ID: 0cd29792d981f17b6c19c13678d3c7350c0eba0cd37cd2891ca64255fabc656f"
time="2019-12-20T14:01:13Z" level=info msg="Starting annotations monitor"
time="2019-12-20T14:01:13Z" level=info msg="docker wait 0cd29792d981f17b6c19c13678d3c7350c0eba0cd37cd2891ca64255fabc656f"
time="2019-12-20T14:01:13Z" level=info msg="Starting deadline monitor"
time="2019-12-20T14:01:13Z" level=info msg="Main container completed"
time="2019-12-20T14:01:13Z" level=info msg="No output parameters"
time="2019-12-20T14:01:13Z" level=info msg="No output artifacts"
time="2019-12-20T14:01:13Z" level=info msg="Killing sidecars"
time="2019-12-20T14:01:13Z" level=info msg="Annotations monitor stopped"
time="2019-12-20T14:01:13Z" level=info msg="Killing sidecar test (a542e06b69fcf66c7b82bf9416d91a7adefc3f0046a1c54a909432a9907c6cd8)"
time="2019-12-20T14:01:13Z" level=info msg="docker kill --signal TERM a542e06b69fcf66c7b82bf9416d91a7adefc3f0046a1c54a909432a9907c6cd8"
time="2019-12-20T14:01:13Z" level=error msg="`docker kill --signal TERM a542e06b69fcf66c7b82bf9416d91a7adefc3f0046a1c54a909432a9907c6cd8` failed: Error response from daemon: Cannot kill container: a542e06b69fcf66c7b82bf9416d91a7adefc3f0046a1c54a909432a9907c6cd8: Container a542e06b69fcf66c7b82bf9416d91a7adefc3f0046a1c54a909432a9907c6cd8 is not running\n"
time="2019-12-20T14:01:13Z" level=error msg="executor error: Error response from daemon: Cannot kill container: a542e06b69fcf66c7b82bf9416d91a7adefc3f0046a1c54a909432a9907c6cd8: Container a542e06b69fcf66c7b82bf9416d91a7adefc3f0046a1c54a909432a9907c6cd8 is not running\ngithub.com/argoproj/argo/errors.Wrap\n\t/go/src/github.com/argoproj/argo/errors/errors.go:88\ngithub.com/argoproj/argo/errors.InternalWrapError\n\t/go/src/github.com/argoproj/argo/errors/errors.go:71\ngithub.com/argoproj/argo/workflow/executor/docker.(*DockerExecutor).Kill\n\t/go/src/github.com/argoproj/argo/workflow/executor/docker/docker.go:103\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).KillSidecars\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:1107\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:56\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/src/github.com/spf13/cobra/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/src/github.com/spf13/cobra/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/src/github.com/spf13/cobra/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:201\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1333"
time="2019-12-20T14:01:13Z" level=info msg="Alloc=4747 TotalAlloc=11461 Sys=70078 NumGC=4 Goroutines=9"
time="2019-12-20T14:01:13Z" level=fatal msg="Error response from daemon: Cannot kill container: a542e06b69fcf66c7b82bf9416d91a7adefc3f0046a1c54a909432a9907c6cd8: Container a542e06b69fcf66c7b82bf9416d91a7adefc3f0046a1c54a909432a9907c6cd8 is not running\ngithub.com/argoproj/argo/errors.Wrap\n\t/go/src/github.com/argoproj/argo/errors/errors.go:88\ngithub.com/argoproj/argo/errors.InternalWrapError\n\t/go/src/github.com/argoproj/argo/errors/errors.go:71\ngithub.com/argoproj/argo/workflow/executor/docker.(*DockerExecutor).Kill\n\t/go/src/github.com/argoproj/argo/workflow/executor/docker/docker.go:103\ngithub.com/argoproj/argo/workflow/executor.(*WorkflowExecutor).KillSidecars\n\t/go/src/github.com/argoproj/argo/workflow/executor/executor.go:1107\ngithub.com/argoproj/argo/cmd/argoexec/commands.waitContainer\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:56\ngithub.com/argoproj/argo/cmd/argoexec/commands.NewWaitCommand.func1\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/commands/wait.go:16\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/src/github.com/spf13/cobra/command.go:766\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/src/github.com/spf13/cobra/command.go:852\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/src/github.com/spf13/cobra/command.go:800\nmain.main\n\t/go/src/github.com/argoproj/argo/cmd/argoexec/main.go:17\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:201\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1333"