
Cluster Autoscaler conflict with volumeClaim and/or affinity-assistant #4699

Open
Tracked by #6990
grid-dev opened this issue Mar 22, 2022 · 10 comments
Labels
help wanted — Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/bug — Categorizes issue or PR as related to a bug.
lifecycle/frozen — Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@grid-dev

grid-dev commented Mar 22, 2022

Not enough "slots" for pods when the affinity assistant allocates pods together with the Cluster Autoscaler

Expected Behavior

  1. An EKS cluster exists and has the following setup:

    • Cluster Autoscaler and
    • 6 x t3.medium nodes in a node_group.
    • The nodes are spread across 3 availability zones in one AWS region.
    • Every node has labels assigned (see "K8s node labels").
  2. Cluster nodes are packed and have between 12 and 17 pods, where 17 is the maximum for this instance type (see the max-pods note after this list).

  3. A PipelineRun is started consisting of 2 tasks which both share a workspace, i.e. a volumeClaim (see "Pipeline YAML code").

  4. affinity-assistant-... allocates the needed pods, including itself, on a single node or at least within the same region so the volumeClaim can be shared.

  5. If there is not enough space left for the needed pods, the Cluster Autoscaler provisions a new node.

  6. All tasks start and can bind to the volumeClaim, one after the other.

  7. The pipeline finishes successfully.

  8. If the Cluster Autoscaler created a new node, this node is terminated again after the run has finished successfully.
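
Note on the max-pods limit in item 2: on EKS with the default AWS VPC CNI (assuming no prefix delegation is enabled on this cluster), the per-node pod limit is derived from ENI capacity as max pods = ENIs × (IPv4 addresses per ENI − 1) + 2; for a t3.medium that is 3 × (6 − 1) + 2 = 17, which matches the limit stated above.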

Actual Behavior

  1. The affinity-assistant-3a0bc57d00-0 pod is started and the persistentVolumeClaim is bound, but the pod for the first task, go-lang-8txd7-git-pod, is stuck in Pending (see "Pod stuck event log").
  2. The pod runs into its timeout in this deadlock → failed.

Steps to Reproduce the Problem

Follow steps 1 - 4 from "Expected Behavior"; instead of steps 5 - 8, the behavior above occurs.

Additional Info

  • Pod stuck event log (go-lang-8txd7-git-pod)
Events:
  Type     Reason             Age                From                Message
  ----     ------             ----               ----                -------
  Warning  FailedScheduling   37s (x3 over 42s)  default-scheduler   0/6 nodes are available: 2 Too many pods, 4 node(s) didn't find available persistent volumes to bind.
  Normal   NotTriggerScaleUp  37s                cluster-autoscaler  pod didn't trigger scale-up: 1 node(s) didn't find available persistent volumes to bind
  Warning  FailedScheduling   24s (x2 over 32s)  default-scheduler   0/6 nodes are available: 1 node(s) didn't match pod affinity rules, 1 node(s) didn't match pod affinity/anti-affinity rules, 2 Too many pods, 3 node(s) had volume node affinity conflict.
  • Kubernetes version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-25T21:17:57Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.5-eks-bc4871b", GitCommit:"5236faf39f1b7a7dabea8df12726f25608131aa9", GitTreeState:"clean", BuildDate:"2021-10-29T23:32:16Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version: v0.33.2
  • Tekton Triggers version: v0.17.0
  • Tekton Dashboard version: v0.24.1
  • K8s node labels
beta.kubernetes.io/arch=amd64                                                  
beta.kubernetes.io/instance-type=t3.medium                                  
beta.kubernetes.io/os=linux                                                 
capacity_type=ON_DEMAND                                                     
eks.amazonaws.com/capacityType=ON_DEMAND                                    
eks.amazonaws.com/nodegroup=managed-group-ondemand20220302164423693200000001
eks.amazonaws.com/nodegroup-image=ami-04d4d5e816895f43e                     
eks.amazonaws.com/sourceLaunchTemplateId=lt-0322b2edf9c5fb9f3               
eks.amazonaws.com/sourceLaunchTemplateVersion=1                             
environment=dev                                                             
failure-domain.beta.kubernetes.io/region=eu-central-1                       
failure-domain.beta.kubernetes.io/zone=eu-central-1a                        
kubernetes.io/arch=amd64                                                    
kubernetes.io/hostname=ip-172-.....eu-central-1.compute.internal        
kubernetes.io/os=linux                                                      
node.kubernetes.io/instance-type=t3.medium                                  
org=dev                                                                     
tenant=fooBar                                                             
topology.ebs.csi.aws.com/zone=eu-central-1a                                 
topology.kubernetes.io/region=eu-central-1                                  
topology.kubernetes.io/zone=eu-central-1a                   
  • Pipeline YAML code
---
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: git
spec:
  params:
    - name: url
      description: Git repository URL to clone from
      type: string
    - name: branch
      description: Branch to use for cloning
      type: string
      default: main
  workspaces:
    - name: output
      description: The workspace containing the source code
  results:
    - name: GIT_COMMIT_HASH
      description: SHA hash for current commit from git
    - name: DATE_STRING
      description: Current date as hash
  steps:
    - name: clone
      image: alpine/git:v2.32.0@sha256:192d7b402bfd313757d5316920fea98606761e96202d850e5bdeab407b9a72ae
      workingDir: $(workspaces.output.path)
      script: |
        #!/usr/bin/env sh

        # -e  Exit on error
        # -u  Treat unset param as error
        set -eu

        # Fetch hash for latest commit (without cloning)
        # GIT_COMMIT_HASH="$(git ls-remote "$(params.url)" "$(params.branch)" | awk '{ print $1}')"

        # .. alternative
        git clone \
          --depth "1" \
          --single-branch "$(params.url)" \
          --branch "$(params.branch)" \
          tmp_repo

        cd tmp_repo

        GIT_COMMIT_HASH="$(git rev-parse HEAD)"
        DATE_STRING="$(date +"%Y-%m-%d_%H-%M-%S_%Z")"

        # Write data to result
        echo "GIT_$GIT_COMMIT_HASH" | tr -cd '[:alnum:]._-' > $(results.GIT_COMMIT_HASH.path)
        echo "DATE_$DATE_STRING" | tr -cd '[:alnum:]._-' > $(results.DATE_STRING.path)
---
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: kaniko
spec:
  params:
    - name: docker_registry
      description: Docker repository URL to write image into
      type: string
    - name: GIT_COMMIT_HASH
      description: SHA hash for current commit from git
      type: string
    - name: DATE_STRING
      description: Current date as hash
      type: string
  volumes:
    - name: aws-creds
      secret:
        secretName: aws-credentials
    - name: docker-configmap
      configMap:
        name: docker-config
  workspaces:
    - name: output
      description: The workspace containing the source code
  steps:
    - name: echo
      image: alpine/git:v2.32.0@sha256:192d7b402bfd313757d5316920fea98606761e96202d850e5bdeab407b9a72ae
      workingDir: $(workspaces.output.path)/tmp_repo
      script: |
        #!/usr/bin/env sh

        # -e  Exit on error
        # -u  Treat unset param as error
        set -eu

        echo "List of image tags:"
        echo "|$(params.GIT_COMMIT_HASH)|"
        echo "|$(params.DATE_STRING)|"
    - name: build-push
      # https://github.com/GoogleContainerTools/kaniko
      image: gcr.io/kaniko-project/executor:v1.8.0@sha256:ff98af876169a488df4d70418f2a60e68f9e304b2e68d5d3db4c59e7fdc3da3c
      workingDir: $(workspaces.output.path)/tmp_repo
      command:
        - /kaniko/executor
      args:
        - --dockerfile=./Dockerfile
        - --context=$(workspaces.output.path)/tmp_repo
        - --destination=$(params.docker_registry):latest
        - --destination=$(params.docker_registry):$(params.GIT_COMMIT_HASH)
        - --destination=$(params.docker_registry):$(params.DATE_STRING)
        - --cache=true
        - --cache-ttl=720h # 1 month
        - --cache-repo=$(params.docker_registry)
      # kaniko assumes it is running as root, which means this example fails on platforms
      # that default to run containers as random uid (like OpenShift). Adding this securityContext
      # makes it explicit that it needs to run as root.
      securityContext:
        runAsUser: 0
      env:
        - name: "DOCKER_CONFIG"
          value: "/kaniko/.docker/"
      volumeMounts:
        - name: aws-creds
          mountPath: /root/.aws
        - name: docker-configmap
          mountPath: /kaniko/.docker/
---
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: go-lang
spec:
  params:
    - name: repo_url
      description: Repository URL to clone from.
      type: string
    - name: docker_registry
      description: Docker repository URL to write image into
      type: string
      default: 0190....dkr.ecr.eu-central-1.amazonaws.com/fooBar
  workspaces:
    - name: output
      description: The workspace containing the source code
  tasks:
    - name: git
      taskRef:
        name: git
      params:
        - name: url
          value: "$(params.repo_url)"
        - name: branch
          value: main
      workspaces:
        - name: output
          workspace: output
    - name: kaniko
      taskRef:
        name: kaniko
      runAfter:
        - git
      params:
        - name: docker_registry
          value: "$(params.docker_registry)"
        - name: GIT_COMMIT_HASH
          value: "$(tasks.git.results.GIT_COMMIT_HASH)"
        - name: DATE_STRING
          value: "$(tasks.git.results.DATE_STRING)"
      workspaces:
        - name: output
          workspace: output
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tekton-user
secrets:
  - name: bitbucket-ssh-key
  - name: aws-credentials
---
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: go-lang-
spec:
  timeout: 2m
  serviceAccountName: tekton-user
  pipelineRef:
    name: go-lang
  params:
    - name: repo_url
      value: git@bitbucket.org:fooBar/docker-test.git
  workspaces:
    - name: output
      volumeClaimTemplate:
        spec:
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 1Gi
@grid-dev grid-dev added the kind/bug Categorizes issue or PR as related to a bug. label Mar 22, 2022
@dibyom dibyom added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Mar 28, 2022
@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 26, 2022
@grid-dev
Author

/remove-lifecycle stale as the issue still persists

@vdemeester
Member

/remove-lifecycle stale

@tekton-robot tekton-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 27, 2022
@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 25, 2022
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 25, 2022
@pritidesai pritidesai added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Nov 1, 2022
@alex-souslik-hs

Did you consider disabling the affinity assistant?
I'm currently experiencing this issue and would love some contributor input on this

@icereed

icereed commented Apr 4, 2023

Hello all, this is indeed a challenge. How does anybody use Cluster Autoscaler with Tekton successfully? Is everybody just statically provisioning nodes and burning money this way? I would love to see some how-to on setting up Cluster Autoscaler with Tekton (with some kind of volumeClaim)...

@jleonar

jleonar commented Apr 19, 2023

@icereed My company uses Cluster Autoscaler, with an NFS server (in the k8s cluster) to serve NFS mounts for PVCs. We also disable the affinity assistant.
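
A minimal sketch of how disabling the affinity assistant can look, via Tekton's feature-flags ConfigMap (assumptions: a default install in the tekton-pipelines namespace; in practice you would patch only this key into the existing ConfigMap rather than replacing it, and the flag name/default can differ between Pipeline releases, so check the docs for your version):

apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: tekton-pipelines
data:
  # Disable the per-workspace affinity assistant so TaskRun pods are scheduled
  # independently (the PVC's node/zone constraints still apply to each pod).
  disable-affinity-assistant: "true"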

@alanmoment

alanmoment commented May 19, 2023

I have the same issue. Autoscaling works for some jobs which have the same node selector label, but the jobs with that label do not run successfully when resources are not enough.

update:
When I disabled the affinity assistant, the node can scale up, but the Pod does not run on the new node. I guess it is still the volume problem.
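
One common cause of the "volume node affinity conflict" seen in the event log above is a StorageClass with immediate volume binding: the EBS volume is provisioned in one availability zone before the pod is scheduled, and a node the autoscaler adds in a different zone can never mount it. This is not confirmed to be the cause here, but as a sketch, a StorageClass with delayed binding (assumptions: the EBS CSI driver is installed; the name gp3-wait-for-consumer is only an example) would look like this:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-wait-for-consumer   # example name, not taken from this issue
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
# Delay PV provisioning and binding until a pod using the PVC is scheduled,
# so the volume ends up in the zone of the node that actually runs the pod.
volumeBindingMode: WaitForFirstConsumer

It could then be referenced from the PipelineRun's volumeClaimTemplate via spec.storageClassName.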

@lbernick
Member

@grid-dev I'm not sure if this addresses your use case, but we've recently introduced some new options for the affinity assistant and would appreciate your feedback! Please feel free to weigh in on #6990. Since you're using a cluster autoscaler w/ a limited number of pods per node I wonder if the "isolate-pipelineruns" option would work well for you? https://github.com/tektoncd/pipeline/blob/main/docs/affinityassistants.md#affinity-assistants
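
For reference, a sketch of how that option is set, again via the feature-flags ConfigMap (assumption: the coschedule flag as described in the linked affinityassistants.md; take the exact flag name and value for your Pipeline version from that document):

apiVersion: v1
kind: ConfigMap
metadata:
  name: feature-flags
  namespace: tekton-pipelines
data:
  # Place all pods of a PipelineRun on one node and keep that node exclusive
  # to that PipelineRun while it runs.
  coschedule: "isolate-pipelinerun"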
