Deck pod in crash loop when using Tekton #3353

Closed
1 of 2 tasks
tdcox opened this issue Mar 15, 2019 · 6 comments

@tdcox
Contributor

tdcox commented Mar 15, 2019

Summary

I am looking at a test cluster that is about 15 hours old. It was created, and a single golang-http quickstart was executed against it shortly after setup.

I am seeing two deck pods, with one in a long-term CrashLoopBackOff:

jx            deck-5fbbdc9478-kpxgc                                 1/1     Running            3          15h
jx            deck-5fbbdc9478-xt8k5                                 0/1     CrashLoopBackOff   177        15h

The failing pod reports:

    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Fri, 15 Mar 2019 09:09:32 +0000
      Finished:     Fri, 15 Mar 2019 09:09:32 +0000
    Ready:          False
    Restart Count:  177

And the container log is:

time="2019-03-15T08:59:17Z" level=info msg="Spyglass registered viewer buildlog with title Build Log."
time="2019-03-15T08:59:17Z" level=info msg="Spyglass registered viewer junit with title JUnit."
time="2019-03-15T08:59:17Z" level=info msg="Spyglass registered viewer metadata with title Metadata."
{"component":"deck","error":"invalid presubmit job promotion-build: agent must be one of jenkins, knative-build, knative-pipeline-run, kubernetes (found \"tekton\")","level":"fatal","msg":"Error starting config agent.","time":"2019-03-15T08:59:17Z"}

Steps to reproduce the behavior

Create cluster instance with:

jx create cluster gke \
--cluster-name='d23' \
--default-admin-password='xxxxxx' \
--environment-git-owner='tdcox' \
--enhanced-apis=true \
--enhanced-scopes=true \
--git-username='tdcox' \
--git-private=false \
--kaniko=true \
--labels='demo=true' \
--machine-type='n1-standard-4' \
--max-num-nodes='3' \
--min-num-nodes='2' \
--no-tiller=true \
--preemptible=true \
--project-id='jx-mar19' \
--prow=true \
--skip-login=true \
--tekton=true \
--zone='europe-west1-d'
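
Once the install completes, a rough sanity check that the Prow components came up is something along these lines (default jx namespace assumed):

# deck, hook and tide should all be Running and Ready
kubectl get pods -n jx | grep -E 'deck|hook|tide'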

Then run a single quickstart:

➜ jx create quickstart
Using Git provider GitHub at https://github.com
? Do you wish to use tdcox as the Git user name? Yes


About to create repository  on server https://github.com with user tdcox
? Which organisation do you want to use? tdcox
? Enter the new repository name:  test107


Creating repository tdcox/test107
? select the quickstart you wish to create golang-http
Generated quickstart at /Users/terry/Documents/code/jxtesting/test107
### NO charts folder /Users/terry/Documents/code/jxtesting/test107/charts/golang-http
Created project at /Users/terry/Documents/code/jxtesting/test107

The directory /Users/terry/Documents/code/jxtesting/test107 is not yet using git
? Would you like to initialise git now? Yes
? Commit message:  Initial import

Git repository created
performing pack detection in folder /Users/terry/Documents/code/jxtesting/test107
--> Draft detected Go (65.746753%)
selected pack: /Users/terry/.jx/draft/packs/github.com/jenkins-x-buildpacks/jenkins-x-kubernetes/packs/go
replacing placeholders in directory /Users/terry/Documents/code/jxtesting/test107
app name: test107, git server: github.com, org: tdcox, Docker registry org: tdcox
skipping directory "/Users/terry/Documents/code/jxtesting/test107/.git"
Pushed Git repository to https://github.com/tdcox/test107

Creating GitHub webhook for tdcox/test107 for url http://hook.jx.35.241.195.78.nip.io/hook

Watch pipeline activity via:    jx get activity -f test107 -w
Browse the pipeline log via:    jx get build logs tdcox/test107/master
Open the Jenkins console via    jx console
You can list the pipelines via: jx get pipelines
When the pipeline is complete:  jx get applications

For more help on available commands see: https://jenkins-x.io/developing/browsing/

Note that your first pipeline may take a few minutes to start while the necessary images get downloaded!

Expected behavior

Pod to restart after failure.

Actual behavior

Zombie Pod

Jx version

The output of jx version is:

NAME               VERSION
jx                 1.3.974
jenkins x platform 0.0.3535
Kubernetes cluster v1.11.7-gke.4
kubectl            v1.13.4
helm client        Client: v2.13.0+g79d0794
git                git version 2.21.0
Operating System   Mac OS X 10.13.6 build 17G4015

Jenkins type

  • Classic Jenkins
  • Serverless Jenkins
@ccojocar added the area/prow, kind/bug and priority/important-longterm labels on Mar 18, 2019
@tdcox
Contributor Author

tdcox commented Mar 18, 2019

@rawlingsj I have just observed this happen on a fresh cluster. It looks like the cluster auto-scaled down from three to two running nodes, triggering a restart of a number of Pods as they were evicted from the terminating node. After this, I ended up with one working deck pod and one in a crash loop.

The failed pod is repeating this error once per second, so it should probably have a circuit breaker too.

{"component":"deck","error":"invalid presubmit job promotion-build: agent must be one of jenkins, knative-build, knative-pipeline-run, kubernetes (found \"tekton\")","jobConfig":"","level":"error","msg":"Error loading config.","prowConfig":"/etc/config/config.yaml","time":"2019-03-18T16:13:05Z"}

@tdcox
Contributor Author

tdcox commented Mar 18, 2019

Confirmed. If you scale the Deployment for deck down to zero and back up again, it fails to recover. Oops!
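
For reference, the scaling steps were along these lines (deployment name deck, namespace jx, replica count taken from the pod listing above):

kubectl scale deployment deck -n jx --replicas=0
kubectl scale deployment deck -n jx --replicas=2
# watch whether the new replicas ever reach Ready again
kubectl get pods -n jx -w | grep deck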

@dcanadillas

I am having the same issue when installing a NextGen cluster with jx install --provider gke --ng (Tekton, Vault and no Tiller).

In any case, I can see in the "config" ConfigMap that the specified agent is tekton, but it seems that it is not supported or not recognized as a valid value:

$ kubectl get cm config -o yaml
apiVersion: v1
data:
  config.yaml: |

 [...]

    deck:
      spyglass: {}
    gerrit: {}
    owners_dir_blacklist:
      default: null
      repos: null
    plank: {}
    pod_namespace: jx
    postsubmits:
      dcanadillas-kube/environment-jx-nextgen-production:
      - agent: tekton
        branches:
        - master
        context: ""
        name: promotion
      dcanadillas-kube/environment-jx-nextgen-staging:
      - agent: tekton
        branches:
        - master
        context: ""
        name: promotion
      jenkins-x/dummy:
      - agent: tekton
        branches:
        - master
        context: ""
        name: release
    presubmits:
      dcanadillas-kube/environment-jx-nextgen-production:
      - agent: tekton
        always_run: true
        context: promotion-build
        name: promotion-build
        rerun_command: /test this
        trigger: (?m)^/test( all| this),?(\s+|$)
      dcanadillas-kube/environment-jx-nextgen-staging:
      - agent: tekton
        always_run: true
        context: promotion-build
        name: promotion-build
        rerun_command: /test this
        trigger: (?m)^/test( all| this),?(\s+|$)
      jenkins-x/dummy:
      - agent: tekton
        always_run: true
        context: serverless-jenkins
        name: serverless-jenkins
        rerun_command: /test this
        trigger: (?m)^/test( all| this),?(\s+|$)
    prowjob_namespace: jx
    push_gateway: {}
    sinker: {}
    tide:

[...]

Could it be related to Prow not supporting Tekton Pipelines? (tektoncd/pipeline#537)
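
In case it helps with triage, the exact Deck build that rejects the agent can be read from the deployment image (deployment name deck and namespace jx assumed):

kubectl get deployment deck -n jx -o jsonpath='{.spec.template.spec.containers[0].image}'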

@tsahiduek

I'm having the same issue when using jx install --prow=true --tekton=true --provider=eks

Did anyone find a way to resolve this?

@jenkins-x-bot
Contributor

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Provide feedback via https://jenkins-x.io/community.
/lifecycle stale

@jenkins-x-bot
Contributor

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Provide feedback via https://jenkins-x.io/community.
/lifecycle rotten
