sometimes next Task in Pipeline will start before previous Task completes causing PVC/PV permission issues #3510
Comments
I'm not sure I understand the issue here.
Both Tasks/Pods will have write permissions. RWO means that the PV can be mounted in writable mode on only one Node at a time - multiple Pods on that Node can still write to the volume. The example Parallel Tasks using PVC shows two Tasks concurrently writing to the same PVC - in RWO mode.
I would not expect file permissions to change depending on whether a Pod/process has terminated or not. In my view, it is up to the Pipeline designer to make sure that the Tasks/Pods in the Pipeline can be executed in the declared order, e.g. that the expected files are available. I think this depends on how the Tasks handle the files? If a subsequent Task depends on files from a previous Task, make sure to use runAfter.
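For readers following along, a minimal Pipeline fragment showing runAfter ordering over a shared workspace might look like the sketch below; the Task and workspace names are placeholders, not taken from the reporter's setup.

```yaml
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: shared-workspace-pipeline
spec:
  workspaces:
    - name: shared-data          # one PVC-backed workspace shared by both Tasks
  tasks:
    - name: write-files
      taskRef:
        name: write-task         # hypothetical Task that writes into the workspace
      workspaces:
        - name: output
          workspace: shared-data
    - name: read-files
      runAfter:
        - write-files            # guarantees ordering, but not that the previous Pod is gone
      taskRef:
        name: read-task          # hypothetical Task that reads/writes the same workspace
      workspaces:
        - name: input
          workspace: shared-data
```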
I'm not 100% sure of my cause analysis; I took a best guess that it's because a previous step's pod still had RW access to the PVC in question. All I know for sure is that, intermittently, a pipeline run will fail due to an inconsistent step failing with:
Like, I just added the following to the beginning of my ClusterTask (which gets used multiple times in the same Pipeline with different input params),
and on random steps I will get:
where the ClusterTask has a workspace of
and that workspace is mapped to a PVC/PV created via a volumeClaimTemplate.
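(For context, a ClusterTask declaring such a workspace would look roughly like this; the name, param, image, and script below are illustrative assumptions, not the reporter's actual ClusterTask.)

```yaml
apiVersion: tekton.dev/v1beta1
kind: ClusterTask
metadata:
  name: example-clustertask      # hypothetical name
spec:
  params:
    - name: message              # stand-in for the "different input params" mentioned above
      type: string
  workspaces:
    - name: home                 # the workspace that the PipelineRun maps to the PVC
  steps:
    - name: do-work
      image: registry.access.redhat.com/ubi8/ubi-minimal
      script: |
        #!/bin/sh
        # write into the shared workspace; this fails if the mount is read-only
        echo "$(params.message)" > $(workspaces.home.path)/output.txt
```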
I wasn't having these issues pre-upgrade to 1.2. I had to upgrade to 1.2 to get ahold of the
Some more info: I updated the task to run this first:
Pipeline:
Task N
Task N+1
My interpretation? I am confused. It looks like in both Tasks, one happening right after the other, the directory permissions in question are exactly the same, but in one case I get a permission denied and in the other I don't.
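(A diagnostic step along these lines, hypothetical and not the exact script from the comment above, makes the uid/permission state visible and pauses the Pod so it can be inspected with kubectl; it would be added as the first entry under the Task's steps:)

```yaml
    - name: debug-permissions
      image: registry.access.redhat.com/ubi8/ubi-minimal
      script: |
        #!/bin/sh
        # show who the step runs as and how the shared mount looks
        id
        ls -ld $(workspaces.home.path)
        # attempt a write so a read-only mount fails here, loudly, instead of mid-build
        touch $(workspaces.home.path)/.write-test || echo "workspace is not writable"
        # PAUSE long enough to inspect the running pods with kubectl
        sleep 300
```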
Okay, I have learned new things.
For ClusterTasks with multiple steps, Tekton seems to be spinning up duplicate pods, each with both containers (one per step). When I caught the task in the sleep and looked at the running pods, there were two pods running, both with the same two step containers in them, but then while it was in the sleep the second Pod terminated on its own. "Something" is creating duplicate Pods for the same Task.
Actually, it's not limited to Tasks with multiple steps. My pipeline is running one long-running Task at the moment; look at the output from
There is a huge set of duplicate pods all running even though their steps long since ended. It looks like all of them are in my "PAUSE" sleep state. They failed to get permissions and then something spun up a duplicate pod?
Looking at the logs for the tekton-pipelines-controller I can see it is creating two TaskRuns for the same Task. Is there a good way for me to upload those logs somewhere?
Reading those more, it seems this is a dupe of those, but this is happening to me now on every single run (on one step or another) now that I have upgraded to 0.16. I was hoping to find a workaround in those other items but I don't see one.
A GitHub gist is my usual go-to for longer logs: https://gist.github.com/
Can you share a copy of the Pipeline, Tasks, and a Pod? If they're sensitive for some reason, maybe reduce them to a simplified example that exhibits the problem? I have a lot of questions but no clear solutions to offer atm:
yes
Ummm... like you want the pod.yml when it comes up? https://gist.github.com/itewk/8f139349c727b3f83f309a856f6c2e4a
3 worker nodes
This gist has the pod.yml from one of the failed pods: https://gist.github.com/itewk/8f139349c727b3f83f309a856f6c2e4a
Umm... good? It's bound and everything, and each pod is mounting the PVC (even the duplicate TaskRun pods).
https://gist.github.com/itewk/edbac1dab03e25bc874195369306e845 This gist has Tekton pipeline controller logs from just after a reboot of the controller pod, where at least one instance of duplicate TaskRuns was created. The two in there that I know are duplicates are:
OK, thanks for sharing all that. I don't immediately see a root cause from reading through the logs. It appears that there's no clear message as to why the duplicate TaskRuns were created.
I've attached a critical-urgent priority to this so that it gets discussed ASAP at the next owners planning session.
For the record, I think I'm hitting the same bug. As most tasks mount a shared workspace, many pods are stuck initialising because they can't all mount it at the same time... To have a successful pipeline run, I first need to manually clean up all related pods, which looks like this:
It can also be a different number of simultaneous pods:
I'm moving the priority label to #3126 since it was created first and will hopefully be a little easier to condense into a reproducible case, at least initially.
It's interesting that in both of the cases described here OpenShift appears to be involved. @michaelsauter I'm not super familiar with OpenShift, but does this mean you're using CRI-O as the container runtime too? This might not be related at all, especially since #3126 doesn't appear to be related to OpenShift, but flagging it here in case it becomes relevant in future.
Yes, this is on OpenShift. Unfortunately I'm super new to this cluster and not a cluster admin for it, so I don't know the exact details. It's a 4.5 cluster, which, according to https://docs.openshift.com/container-platform/4.5/release_notes/ocp-4-5-release-notes.html, is using CRI-O. Kubernetes is v1.18.3+2fbd7c7. I can't even figure out what the Tekton version is... I'll follow up on this, assuming cluster admins can easily see it from the operator management section.
I'll close this as a dupe of #3126 now that I understand that what I am seeing is simply a symptom of that issue.
FWIW, I figured out we're on OpenShift Pipelines Technology Preview 1.1, which means we're on Tekton Pipelines 0.14.3. |
Expected Behavior
When having sequential Tasks in a Pipeline that all use the same PVC/PV created via volumeClaimTemplate definitions in a TriggerTemplate, I would expect that by the time a Task starts running, all previous Tasks' Pods have fully terminated. If not, then the shared PVC/PV, when mounted into the follow-on Task, will not have write permissions if stuck using RWO PVCs.

Actual Behavior
Sometimes, intermittently, Task 1's pod will still be terminating when Task 2 starts, and then the file permissions for the shared mount will be read-only in Task 2, which can cause it to fail if it expected write permissions.
Most annoyingly, this happens 'sometimes', so you never know whether, on Task 15 of a pipeline run, Task 14 will still be terminating, your Pipeline run will fail due to a file permissions issue, and you'll need to kick off the pipeline again.
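A minimal sketch of the setup described above, assuming a TriggerTemplate that stamps out a PipelineRun whose workspace is backed by a volumeClaimTemplate (the names and storage size are placeholders, not taken from the report):

```yaml
apiVersion: triggers.tekton.dev/v1alpha1
kind: TriggerTemplate
metadata:
  name: example-trigger-template
spec:
  resourcetemplates:
    - apiVersion: tekton.dev/v1beta1
      kind: PipelineRun
      metadata:
        generateName: example-run-
      spec:
        pipelineRef:
          name: shared-workspace-pipeline    # hypothetical Pipeline with sequential Tasks
        workspaces:
          - name: shared-data
            volumeClaimTemplate:              # a fresh RWO PVC is created per PipelineRun
              spec:
                accessModes:
                  - ReadWriteOnce
                resources:
                  requests:
                    storage: 1Gi
```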
Steps to Reproduce the Problem
1. Create a PVC via a volumeClaimTemplate (though I suspect this could be done with any PVC)
2. Trigger a Pipeline with multiple sequential Tasks that all use that PVC
3. Perform write operations in the mounted directory

Additional Info
Kubernetes version:
Output of kubectl version:

Tekton Pipeline version:
Output of tkn version or kubectl get pods -n tekton-pipelines -l app=tekton-pipelines-controller -o=jsonpath='{.items[0].metadata.labels.version}'
Workaround
Still trying to come up with one beyond just adding a sleep to the beginning of my tasks to lower the chance that the previous step's Pod is still running.
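(A sketch of what such a stop-gap first step might look like; this variant retries a test write instead of using a fixed sleep. The image, workspace name, and retry counts are assumptions, not the actual workaround in use:)

```yaml
    - name: wait-for-writable-workspace
      image: registry.access.redhat.com/ubi8/ubi-minimal
      script: |
        #!/bin/sh
        # retry until the shared workspace accepts writes, then let the real steps run
        for attempt in 1 2 3 4 5 6 7 8 9 10 11 12; do
          if touch $(workspaces.home.path)/.write-test 2>/dev/null; then
            rm -f $(workspaces.home.path)/.write-test
            exit 0
          fi
          echo "workspace not writable yet (attempt $attempt), retrying..."
          sleep 5
        done
        echo "workspace never became writable" >&2
        exit 1
```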