
TaskRun is failing to run/complete in some cases (on different namespace, …) #3310

Closed
vdemeester opened this issue Oct 1, 2020 · 8 comments · Fixed by #3494
Labels
good first issue: Denotes an issue ready for a new contributor, according to the "help wanted" guidelines.
kind/bug: Categorizes issue or PR as related to a bug.

Comments

@vdemeester
Member

vdemeester commented Oct 1, 2020

Expected Behavior

The TaskRun runs and completes successfully, regardless of which namespace it is created in.

Actual Behavior

OpenShift Pipelines 1.0.1 installed on OpenShift 4.4.17 fails to run the below TaskRun in some specific namespaces. It works in some namespaces but fails in others, with the following error reported by tekton-pipelines-controller:

{
  "level": "error",
  "logger": "tekton.taskrun-controller",
  "caller": "taskrun/taskrun.go:163",
  "msg": "Reconcile error: error adding ready annotation to Pod \"example-taskrun-pod-bl2cg\": Pod \"example-taskrun-pod-bl2cg\" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds` or `spec.tolerations` (only additions to existing tolerations)\n core.PodSpec{\n \t... // 14 identical fields\n \tHostname: \"\",\n \tSubdomain: \"\",\n \tAffinity: &core.Affinity{\n \t\tNodeAffinity: &core.NodeAffinity{\n \t\t\tRequiredDuringSchedulingIgnoredDuringExecution: nil,\n \t\t\tPreferredDuringSchedulingIgnoredDuringExecution: []core.PreferredSchedulingTerm{\n \t\t\t\t{Weight: 100, Preference: core.NodeSelectorTerm{MatchExpressions: []core.NodeSelectorRequirement{{Key: \"…/workload\", Operator: \"In\", Values: []string{\"dynamic\"}}}}},\n- \t\t\t\t{\n- \t\t\t\t\tWeight: 100,\n- \t\t\t\t\tPreference: core.NodeSelectorTerm{\n- \t\t\t\t\t\tMatchExpressions: []core.NodeSelectorRequirement{{Key: \"…\", Operator: \"In\", Values: []string{\"dynamic\"}}},\n- \t\t\t\t\t},\n- \t\t\t\t},\n \t\t\t},\n \t\t},\n \t\tPodAffinity: nil,\n \t\tPodAntiAffinity: nil,\n \t},\n \tSchedulerName: \"default-scheduler\",\n \tTolerations: []core.Toleration{{Key: \"node.kubernetes.io/not-ready\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}, {Key: \"node.kubernetes.io/unreachable\", Operator: \"Exists\", Effect: \"NoExecute\", TolerationSeconds: &300}, {Key: \"node.kubernetes.io/memory-pressure\", Operator: \"Exists\", Effect: \"NoSchedule\"}, {Key: \"workload\", Operator: \"Equal\", Value: \"dynamic\", Effect: \"NoSchedule\"}},\n \t... // 10 identical fields\n }\n",
  "knative.dev/controller": "taskrun-controller",
  "stacktrace": "github.com/tektoncd/pipeline/pkg/reconciler/taskrun.(*Reconciler).Reconcile\n\t/opt/app-root/src/go/src/github.com/tektoncd/pipeline/pkg/reconciler/taskrun/taskrun.go:163\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\t/opt/app-root/src/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:361\nknative.dev/pkg/controller.(*Impl).Run.func2\n\t/opt/app-root/src/go/src/github.com/tektoncd/pipeline/vendor/knative.dev/pkg/controller/controller.go:310"
}


Steps to Reproduce the Problem

Additional Info

The error comes from the validation of the PodSpec, which should not change at all here, since we only add or update an annotation.
Looking into

// UpdateReady updates the Pod's annotations to signal the first step to start

we do an Update, which is a problem if the object was modified between the time we read it and the time we write the annotation (for example, the spec was mutated by another webhook, or a LimitRange was applied, …). It would be safer to use Patch instead, to make sure we only touch the annotations.
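
For illustration, a merge patch that only contains metadata.annotations could look like the sketch below. This is not the actual Tekton code: the helper name, the tekton.dev/ready annotation payload, and the older (pre-0.18, context-free) client-go Patch signature are assumptions, matching the style of the other excerpts in this thread.

// Rough sketch only (not the actual implementation): patch just the
// annotations so that a Pod spec mutated after we read it (mutating
// webhooks, LimitRange defaulting, ...) is never sent back to the API server.
package pod

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// patchReadyAnnotation is a hypothetical helper; the annotation key and
// value below are illustrative placeholders.
func patchReadyAnnotation(kubeclient kubernetes.Interface, pod *corev1.Pod) error {
	patch := []byte(`{"metadata":{"annotations":{"tekton.dev/ready":"READY"}}}`)
	if _, err := kubeclient.CoreV1().Pods(pod.Namespace).Patch(pod.Name, types.MergePatchType, patch); err != nil {
		return fmt.Errorf("error adding ready annotation to Pod %q: %w", pod.Name, err)
	}
	return nil
}

A patch like this cannot trip the "pod updates may not change fields" validation, because the spec is simply absent from the request.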

/kind bug

@vdemeester vdemeester added the kind/bug Categorizes issue or PR as related to a bug. label Oct 1, 2020
@vdemeester vdemeester added this to the Pipelines v0.17 milestone Oct 1, 2020
@vdemeester
Member Author

cc @imjasonh @pritidesai

@imjasonh
Member

imjasonh commented Oct 1, 2020

I'm not sure what difference the namespace makes, that might be a red herring.

In any case, the issue seems to be that we should update annotations with a Patch instead of an Update, to prevent errors from concurrent updates. Is that correct?

@imjasonh imjasonh added the good first issue Denotes an issue ready for a new contributor, according to the "help wanted" guidelines. label Oct 1, 2020
@ywluogg
Contributor

ywluogg commented Oct 1, 2020

I can take a look at this if no one else is working on it.

@vdemeester
Member Author

> I'm not sure what difference the namespace makes, that might be a red herring.

I think some namespaces had a LimitRange and some did not (in the cluster where it happened).

> In any case, the issue seems to be that we should update annotations with a Patch instead of an Update, to prevent errors from concurrent updates. Is that correct?

@imjasonh exactly 👍

@pritidesai
Member

Thanks @ywluogg, and possibly replace one more Update with Patch:

if updated {
	if _, err := kubeclient.CoreV1().Pods(newPod.Namespace).Update(newPod); err != nil {
		return fmt.Errorf("error adding ready annotation to Pod %q: %w", pod.Name, err)
	}
}
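
For reference, a sketch of what that call site could look like with a merge patch instead of Update, reusing the kubeclient, newPod, and pod variables from the excerpt above; the payload shape is illustrative, not the actual fix.

if updated {
	// Illustrative only: marshal just the annotations and send them as a
	// merge patch, so the rest of newPod (including its possibly stale spec)
	// never reaches the API server. Needs encoding/json and
	// k8s.io/apimachinery/pkg/types imported.
	patch, err := json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{"annotations": newPod.Annotations},
	})
	if err != nil {
		return err
	}
	if _, err := kubeclient.CoreV1().Pods(newPod.Namespace).Patch(pod.Name, types.MergePatchType, patch); err != nil {
		return fmt.Errorf("error adding ready annotation to Pod %q: %w", pod.Name, err)
	}
}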

@imjasonh
Member

imjasonh commented Oct 1, 2020

Patch takes a JSON object describing the change you want to make. For example, when a PipelineRun is cancelled, all of its TaskRuns are cancelled using Patch; the code is here:

if _, err := clientSet.TektonV1beta1().TaskRuns(pr.Namespace).Patch(taskRunName, types.JSONPatchType, b, ""); err != nil {
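
For context, the patch body b in that call is a list of RFC 6902 (JSON patch) operations marshalled to JSON. A minimal sketch of building such a body is below; the path and value are assumptions for illustration, and the clientSet, pr, and taskRunName variables are reused from the line above.

// Sketch: a JSON patch is an array of operations, each with op/path/value.
// The path and value here are illustrative, not necessarily Tekton's exact patch.
ops := []map[string]interface{}{
	{"op": "add", "path": "/spec/status", "value": "TaskRunCancelled"},
}
b, err := json.Marshal(ops) // needs encoding/json
if err != nil {
	return err
}
if _, err := clientSet.TektonV1beta1().TaskRuns(pr.Namespace).Patch(taskRunName, types.JSONPatchType, b, ""); err != nil {
	return fmt.Errorf("failed to cancel TaskRun %q: %w", taskRunName, err)
}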

@afrittoli
Member

@mattmoor if I remember correctly using PATCH would require changes to the unit test infra to handle revision numbers properly?

@mattmoor
Member

mattmoor commented Oct 5, 2020

The reactor logic around generation handling won't work properly. This is mainly relevant for "primary key" types IIRC, so patching a child resource might be alright.
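
To make the testing concern concrete, here is a rough sketch of how a unit test could observe the patch through the fake clientset from k8s.io/client-go; the test name, the Pod, and patchReadyAnnotation (from the earlier sketch) are all illustrative. Whether the fake's default patch handling is sufficient for the generation/ResourceVersion bookkeeping mentioned above is exactly the open question here.

import (
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
	k8stesting "k8s.io/client-go/testing"
)

// Hypothetical test: the fake clientset records every Patch as an action,
// so the test can assert on the patch body without relying on Update semantics.
func TestReadyAnnotationPatched(t *testing.T) {
	pod := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{Name: "example-taskrun-pod", Namespace: "default"}}
	kubeclient := fake.NewSimpleClientset(pod)

	if err := patchReadyAnnotation(kubeclient, pod); err != nil { // helper from the earlier sketch
		t.Fatal(err)
	}

	found := false
	for _, action := range kubeclient.Actions() {
		if patch, ok := action.(k8stesting.PatchAction); ok && patch.GetName() == pod.Name {
			found = true
			t.Logf("patch body: %s", patch.GetPatch())
		}
	}
	if !found {
		t.Error("expected a Patch action against the Pod")
	}
}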

@vdemeester vdemeester changed the title TaskRun is failing to run/complete in different namespaces TaskRun is failing to run/complete in some cases (on different namespace, …) Oct 8, 2020