
Sometimes the next Task in a Pipeline will start before the previous Task completes, causing PVC/PV permission issues #3510

Closed
itewk opened this issue Nov 9, 2020 · 22 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@itewk

itewk commented Nov 9, 2020

Expected Behavior

When a Pipeline has sequential Tasks that all use the same PVC/PV created via a volumeClaimTemplate definition in a TriggerTemplate, I would expect that by the time a Task starts running, all previous Tasks' Pods have fully terminated. If not, the shared PVC/PV, when mounted into the follow-on Task, will not have write permissions when stuck using RWO PVCs.

Actual Behavior

Sometimes, intermittently, Task 1's Pod will still be terminating when Task 2 starts, and then the file permissions for the shared mount will be read-only in Task 2, which can cause it to fail if it expects write permissions.

Most annoyingly, this happens 'sometimes', so you never know whether on Task 15 of a pipeline run Task 14 will still be terminating and your Pipeline run will fail due to a file permissions issue, and then you need to kick off the pipeline again.

Steps to Reproduce the Problem

  1. Create a TriggerTemplate using a volumeClaimTemplate (though I suspect this could be done with any PVC) triggering a Pipeline with multiple sequential Tasks that all use that PVC and perform write operations in the mounted directory (see the sketch after this list)
  2. Run the pipeline a few times until you see the write permissions issue
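
For reference, a stripped-down sketch of the kind of setup I mean (all names and sizes here are illustrative, not my actual config):

  apiVersion: triggers.tekton.dev/v1alpha1
  kind: TriggerTemplate
  metadata:
    name: example-trigger-template      # illustrative name
  spec:
    resourcetemplates:
    - apiVersion: tekton.dev/v1beta1
      kind: PipelineRun
      metadata:
        generateName: example-run-
      spec:
        pipelineRef:
          name: example-pipeline        # Pipeline with sequential (runAfter) Tasks sharing the workspace
        workspaces:
        - name: home
          volumeClaimTemplate:          # a fresh RWO PVC is created for each PipelineRun
            spec:
              accessModes: [ReadWriteOnce]
              resources:
                requests:
                  storage: 1Gi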

Additional Info

  • Kubernetes version:

    Output of kubectl version:

Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.0+d59ce34", GitCommit:"d59ce3486ae3ca3a0c36e5498e56f51594076596", GitTreeState:"clean", BuildDate:"2020-10-08T15:58:07Z", GoVersion:"go1.15.0", Compiler:"gc", Platform:"linux/amd64"}
  • Tekton Pipeline version:

    Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'

v0.16.3

Workaround

Still trying to come up with one beyond adding a sleep to the beginning of my Tasks to lower the chance that the previous step's Pod is still running; roughly like the sketch below.
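
A minimal sketch of that hack, assuming a step written as a script (the image name and the 30s are placeholders):

  steps:
  - name: workflow-step
    image: registry.example.com/workflow-image   # placeholder image
    script: |
      #!/bin/sh
      # crude workaround: give the previous Task's Pod extra time to finish
      # terminating before this step touches the shared workspace
      sleep 30
      # ...actual step logic follows...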

@itewk itewk added the kind/bug label Nov 9, 2020
@jlpettersson
Member

jlpettersson commented Nov 9, 2020

I'm not sure I understand the issue here.

Expected Behavior
If not, the shared PVC/PV, when mounted into the follow-on Task, will not have write permissions when stuck using RWO PVCs

Both Tasks/Pods will have write permissions. RWO means that the PV can be mounted in writable mode by only one Node at a time; multiple Pods on that Node can write to the volume. The example Parallel Tasks using PVC shows two Tasks concurrently writing to the same PVC, in RWO mode.

Actual Behavior
Task 1's Pod will still be terminating when Task 2 starts, and then the file permissions for the shared mount will be read-only in Task 2, which can cause it to fail if it expects write permissions.

I would not expect file permissions to change based on whether a Pod/process has terminated or not. In my view, it is up to the Pipeline designer to make sure that the Tasks/Pods in the Pipeline can be executed in the declared order, e.g. that the expected files are available. I think this depends on how the Tasks handle the files? If a subsequent Task depends on files from a previous Task, make sure to use runAfter.
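
A minimal sketch of what I mean, with placeholder Task names (not your actual Tasks):

  apiVersion: tekton.dev/v1beta1
  kind: Pipeline
  metadata:
    name: example-pipeline
  spec:
    workspaces:
    - name: home
    tasks:
    - name: task-1
      taskRef:
        name: write-files              # placeholder Task that writes to the workspace
      workspaces:
      - name: home
        workspace: home
    - name: task-2
      runAfter:
      - task-1                         # task-2 is only started after task-1 has completed
      taskRef:
        name: read-files               # placeholder Task that reads from the workspace
      workspaces:
      - name: home
        workspace: home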

@itewk
Author

itewk commented Nov 10, 2020

I'm not 100% sure of my cause analysis; I took a best guess that it's because a previous step's Pod still had RW access to the PVC in question. All I know for sure is that, indeterminately, a pipeline run will fail because some step hits Permission denied errors when writing to the workspace that is mounted to the home dir.

@itewk
Author

itewk commented Nov 10, 2020

For example, I just added the following to the beginning of my ClusterTask (which gets used multiple times in the same Pipeline with different input params):

      echo "DEBUG"
      ls -al $HOME
      ls -al $HOME/.ssh

and on random steps I get:

step-workflow-step
/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
DEBUG
ls: cannot open directory '/home/tssc': Permission denied

where the ClusterTask has a workspace of:

  workspaces:
  - name: home
    mountPath: /home/tssc

and that workspace is mapped to a PVC/PV created via volumeClaimTemplate

@itewk
Author

itewk commented Nov 10, 2020

I wasn't having these issues pre-upgrade to 1.2. I had to upgrade to 1.2 to get hold of the when statement.

@itewk
Author

itewk commented Nov 10, 2020

Some more info:

I updated the task to run this first:

      echo "DEBUG"
      id
      whoami
      ls -al /home
      ls -al $HOME
      ls -al $HOME/.ssh

Pipeline:

  • Task N -> Task N+1

Task N

/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
DEBUG
uid=1001(tssc) gid=0(root) groups=0(root),1001
tssc
total 4
drwxr-xr-x. 1 root root 18 Oct 23 17:40 .
drwxr-xr-x. 1 root root 37 Nov 10 14:14 ..
drwxrwsr-x. 7 root 1001 4096 Nov 10 14:14 tssc
total 36
drwxrwsr-x. 7 root 1001 4096 Nov 10 14:14 .
drwxr-xr-x. 1 root root 18 Oct 23 17:40 ..
drwxrws---. 2 tssc 1001 4096 Nov 10 14:13 .docker
drwxrws---. 3 tssc 1001 4096 Nov 10 14:14 .gnupg
drwxrws---. 2 tssc 1001 4096 Nov 10 14:13 .ssh
drwxrws---. 2 root 1001 16384 Nov 10 14:13 lost+found
drwxrwsr-x. 5 tssc 1001 4096 Nov 10 14:13 ploigos
total 20
drwxrws---. 2 tssc 1001 4096 Nov 10 14:13 .
drwxrwsr-x. 7 root 1001 4096 Nov 10 14:14 ..
-rw-rw----. 1 tssc 1001 348 Nov 10 14:14 config
-rw-rw----. 1 tssc 1001 2610 Nov 10 14:14 id_git-ssh-keys-ploigos-workflow-reference-quarkus-mvn-foobar-frui
-rw-rw----. 1 tssc 1001 869 Nov 10 14:14 known_hosts

Task N+1

step-workflow-step
2020/11/10 14:15:19 unsuccessful cred copy: ".docker" from "/tekton/creds" to "/home/tssc": unable to create destination directory: mkdir /home/tssc/.docker: permission denied
2020/11/10 14:15:19 unsuccessful cred copy: ".ssh" from "/tekton/creds" to "/home/tssc": unable to create destination directory: mkdir /home/tssc/.ssh: permission denied
/bin/sh: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
DEBUG
uid=1001(tssc) gid=0(root) groups=0(root),1001
tssc
total 4
drwxr-xr-x. 1 root root 18 Oct 23 17:40 .
drwxr-xr-x. 1 root root 37 Nov 10 14:15 ..
drwxrwsr-x. 7 root 1001 4096 Nov 10 14:14 tssc
ls: cannot open directory '/home/tssc': Permission denied

interpretation

?????

I am confused. It looks like in both Tasks, one happening right after the other, the directory permissions in question are exactly the same, but in one case I get a permission denied and in the other I don't.

@itewk
Author

itewk commented Nov 10, 2020

Okay, I have learned new things.

      echo "DEBUG"
      ls -al /home
      ls -al $HOME || (echo "PAUSE"; sleep 10000)
      ls -al $HOME/.ssh

For ClusterTasks with multiple steps, Tekton seems to be spinning up duplicate Pods, each containing both step containers (one container per step).

When I caught the task in the sleep and looked at the running Pods, there were two Pods running, both with the same two step containers in them, but then while it was in the sleep the second Pod terminated on its own. "Something" is creating duplicate Pods for the same Task.
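
For anyone trying to reproduce, this is roughly how I'm spotting the duplicates; the label keys are the standard ones Tekton adds to its Pods, and the PipelineRun name is a placeholder:

  # list every Pod created for one PipelineRun, with its owning TaskRun as a column;
  # two Pods sharing the same TaskRun value is the duplication described above
  oc get pods -l tekton.dev/pipelineRun=<pipelinerun-name> -L tekton.dev/taskRun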

@itewk
Author

itewk commented Nov 10, 2020

Actually, it's not limited to Tasks with multiple steps. My pipeline is running one long-running Task at the moment; look at the output from oc get pods -o wide:

NAME                                                              READY   STATUS      RESTARTS   AGE     IP             NODE                                  NOMINATED NODE   READINESS GATES
cleanup-ploigos-workflow-reference-quarkus-mvn-fooba-16050dcskj   0/1     Completed   0          15m     10.128.5.15    ip-host1.us-west-2.compute.internal   <none>           <none>
cleanup-ploigos-workflow-reference-quarkus-mvn-fooba-16050mnt4h   0/1     Completed   0          39s     10.128.5.56    ip-host1.us-west-2.compute.internal   <none>           <none>
cleanup-ploigos-workflow-reference-quarkus-mvn-fooba-16050q92xw   0/1     Completed   0          30m     10.128.4.253   ip-host1.us-west-2.compute.internal   <none>           <none>
cleanup-ploigos-workflow-reference-quarkus-mvn-fooba-16050r56j6   0/1     Completed   0          60m     10.128.4.193   ip-host1.us-west-2.compute.internal   <none>           <none>
cleanup-ploigos-workflow-reference-quarkus-mvn-fooba-16050v287s   0/1     Completed   0          45m     10.128.4.215   ip-host1.us-west-2.compute.internal   <none>           <none>
el-ploigos-workflow-reference-quarkus-mvn-foobar-frui-5b5brctrd   1/1     Running     0          25h     10.128.4.15    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--4wpkq   0/1     Completed   0          12m     10.128.5.26    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--9p82j   0/1     Completed   0          13m     10.128.5.20    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--bxxms   0/1     Completed   0          13m     10.128.5.21    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--kt6bn   0/1     Completed   0          12m     10.128.5.30    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--m74q7   0/1     Completed   0          12m     10.128.5.28    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--nk57p   0/1     Completed   0          14m     10.128.5.19    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--ptt2v   0/1     Completed   0          12m     10.128.5.27    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-4vtq8--twqnq   0/2     Completed   0          13m     10.128.5.22    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-qv4pt--j8ccv   0/1     Completed   0          84s     10.128.5.54    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--285fg   0/1     Completed   0          3m34s   10.128.5.36    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--59dwl   0/1     Completed   0          2m13s   10.128.5.43    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--5pn2t   0/1     Completed   0          2m4s    10.128.5.45    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--7xpbc   2/2     Running     0          2m42s   10.128.5.40    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--ckw7p   0/1     Completed   0          85s     10.128.5.53    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--csfsk   0/1     Completed   0          115s    10.128.5.49    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--fsld5   0/1     Completed   0          2m4s    10.128.5.44    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--g2njp   0/1     Completed   0          3m19s   10.128.5.38    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--hk4xq   1/1     Running     0          2m14s   10.128.5.42    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--hzmqq   1/1     Running     0          16s     10.128.5.58    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--kq6c9   0/1     Completed   0          3m29s   10.128.5.37    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--lbkwb   1/1     Running     0          116s    10.128.5.50    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--lgtlp   1/1     Running     0          16s     10.128.5.57    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--ppn8m   0/1     Completed   0          44s     10.128.5.55    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--t4w7l   1/1     Running     0          86s     10.128.5.52    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--tkpqj   0/2     Completed   0          2m41s   10.128.5.41    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4qrr--wh967   0/1     Completed   0          96s     10.128.5.51    ip-host1.us-west-2.compute.internal   <none>           <none>
ploigos-workflow-reference-quarkus-mvn-foobar-frui-r4rmp--2zc7p   0/1     Completed   0          8m5s    10.128.5.31    ip-host1.us-west-2.compute.internal   <none>           <none>
There is a huge set of duplicative Pods, all running even though their steps long since ended. It looks like all of them are in my "PAUSE" sleep state. They failed to get permissions and then something spun up a duplicate Pod?

@itewk
Author

itewk commented Nov 10, 2020

Looking at the logs for the tekton-pipeline-controller, I can see it is creating two TaskRuns for the same Task. Is there a good way for me to upload those logs somewhere?
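
For reference, this is roughly how I'm pulling them (namespace and label taken from the version check above; adjust if your install differs):

  kubectl logs -n tekton-pipelines -l app=tekton-pipelines-controller --tail=-1 > controller.log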

@itewk
Author

itewk commented Nov 10, 2020

Reading those more, it seems this is a dupe of those, but this is happening to me now on every single run (on one step or another) now that I have upgraded to 0.16. I was hoping to find a workaround in those other items but I don't see one.

@ghost

ghost commented Nov 10, 2020

looking at the logs for the tekton-pipeline-controller i can see it is creating two TaskRuns for the same Task. Is there a good way for me to upload those logs somewhere?

A github gist is my usual go-to for longer logs. https://gist.github.com/

@ghost

ghost commented Nov 10, 2020

Can you share a copy of the Pipeline, Tasks, and a Pod? If they're sensitive for some reason, maybe reduce them to a simplified example that exhibits the problem?

I have a lot of questions but no clear solutions to offer atm:

  • You mentioned "sequential Tasks in a Pipeline": what did you mean by this? They're set to runAfter one another?
  • What do the volumes and volumeMounts sections of the containers in one of the duplicate pods look like?
  • How many nodes do you have in your cluster? It looks like all these pods are being scheduled to the same node, which is interesting.
  • What is the Status of the completed pods, and are there any interesting events being emitted by them?
  • What is the state of the volume claim and underlying pv during pipeline execution?

@itewk
Author

itewk commented Nov 10, 2020

You mentioned "sequential Tasks in a Pipeline": what did you mean by this? They're set to runAfter one another?

yes

What do the volumes and volumeMounts sections of the containers in one of the duplicate pods look like?

Ummm... like you want the Pod YAML when it comes up? https://gist.github.com/itewk/8f139349c727b3f83f309a856f6c2e4a

How many nodes do you have in your cluster? It looks like all these pods are being scheduled to the same node, which is interesting.

3 worker nodes.
Isn't it expected that they all get scheduled on the same node due to the magic labeling stuff that goes on? I thought I read that somewhere.

What is the Status of the completed pods, and are there any interesting events being emitted by them?

This gist has the pod.yml from one of the failed pods: https://gist.github.com/itewk/8f139349c727b3f83f309a856f6c2e4a

What is the state of the volume claim and underlying pv during pipeline execution?

Umm... good? It's bound and everything, and each Pod is mounting the PVC (even the duplicate TaskRun Pods).
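
In case it helps, this is more or less all I'm checking; the claim name is a placeholder since the real one is generated by the volumeClaimTemplate:

  # watch the claim and its bound PV while the pipeline runs
  kubectl get pvc,pv -w
  # spot-check the generated claim itself
  kubectl describe pvc <generated-pvc-name>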

Can you share a copy of the Pipeline, Tasks, and a Pod?

@itewk
Author

itewk commented Nov 10, 2020

https://gist.github.com/itewk/edbac1dab03e25bc874195369306e845

This gist has Tekton pipeline controller logs from just after a reboot of the controller Pod, where at least one instance of duplicate TaskRuns was created. The two in there that I know are duplicates are:

  • lzj7g--jhlj4
  • lzj7g--qhz2x

@ghost

ghost commented Nov 11, 2020

OK, thanks for sharing all that. I don't immediately see a root cause from reading through the logs. It appears that there's no clear message as to why ploigos-workflow-reference-quarkus-mvn-foobar-frui-lzj7g--qhz2x is started so soon after ploigos-workflow-reference-quarkus-mvn-foobar-frui-lzj7g--jhlj4. I agree that #3126 seems like a very likely duplicate of this one (or vice-versa). It definitely seems like the PipelineRun reconciler is incorrectly starting multiple TaskRuns for a single Task.

@ghost ghost added the priority/critical-urgent label Nov 11, 2020
@ghost

ghost commented Nov 11, 2020

I've attached a priority to this of critical-urgent so that it gets discussed ASAP at the next owners planning session.

@michaelsauter

For the record, I think I'm hitting the same bug. As most tasks mount a shared workspace, many pods are stuck initialising because they can't all mount at the same time ...

To have a successful pipeline run, I first need to manually clean up all related pods, which looks like this:

→ oc delete pod -l tekton.dev/pipeline=build-and-deploy-go
pod "build-and-deploy-go-run-pjjts-fetch-repository-n2g72-pod-hmxfr" deleted
pod "build-and-deploy-go-run-pjjts-fetch-repository-n2g72-pod-s4vqc" deleted
pod "build-and-deploy-go-run-pjjts-set-bitbucket-build-status--5976r" deleted
pod "build-and-deploy-go-run-pjjts-set-bitbucket-build-status--j57q5" deleted
pod "build-and-deploy-go-run-pjjts-set-bitbucket-build-status--xdbvj" deleted

The number of simultaneous pods can also vary:

→ oc delete pod -l tekton.dev/pipeline=build-and-deploy-go
pod "build-and-deploy-go-run-l4f7c-fetch-repository-9r6x6-pod-pcw5x" deleted
pod "build-and-deploy-go-run-l4f7c-fetch-repository-9r6x6-pod-qqgft" deleted
pod "build-and-deploy-go-run-l4f7c-fetch-repository-9r6x6-pod-wc99l" deleted
pod "build-and-deploy-go-run-l4f7c-set-bitbucket-build-status--bx7tx" deleted
pod "build-and-deploy-go-run-l4f7c-set-bitbucket-build-status--fztfl" deleted

@ghost ghost removed the priority/critical-urgent label Nov 11, 2020
@ghost

ghost commented Nov 11, 2020

I'm moving the priority label to #3126 since it was created first and will hopefully be a little easier to condense into a reproducible case, at least initially.

@ghost

ghost commented Nov 11, 2020

oc delete pod -l tekton.dev/pipeline=build-and-deploy-go

It's interesting that in both of the cases described here OpenShift appears to be involved. @michaelsauter I'm not super familiar with OpenShift, but does this mean you're using CRI-O as the container runtime too?

This might not be related at all, especially since #3126 doesn't appear to be related to OpenShift, but I'm flagging it here in case it becomes relevant in the future.

@michaelsauter

Yes, this is on OpenShift. Unfortunately I'm super new to this cluster and not a cluster admin for it, so I don't know the exact details. It's a 4.5 cluster, which, according to https://docs.openshift.com/container-platform/4.5/release_notes/ocp-4-5-release-notes.html, is using CRI-O. Kubernetes is v1.18.3+2fbd7c7. I can't even figure out what the Tekton version is ... I'll follow up on this, assuming cluster admins can easily see it from the operator management section.
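
I assume something like the version check from the issue template would work once I know which namespace the operator installed the controller into (openshift-pipelines is my guess, not verified):

  kubectl get pods -n openshift-pipelines -l app=tekton-pipelines-controller \
    -o=jsonpath='{.items[0].metadata.labels.version}'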

@itewk
Author

itewk commented Nov 11, 2020

I'll close this as a dupe of #3126 now that I understand that what I am seeing is simply a symptom of that issue.

@michaelsauter

FWIW, I figured out we're on OpenShift Pipelines Technology Preview 1.1, which means we're on Tekton Pipelines 0.14.3.
