0.4 release must not try to create multiple PVCs across controllers #387

bobcatfish · 2019-01-10T23:01:29Z

Expected Behavior

Folks who try to use Pipelines after our 0.4 release should not have to work around the issue we found in #375, which effectively makes it infeasible to use Pipelines with regional GKE clusters.

Actual Behavior

If you try to use Pipelines with a GKE regional cluster, and you try to use a Task with an output, the resulting pod for that Task will be unschedulable.

This is because two PVCs will be created, one by the PipelineRun controller for the output resource, and one by the TaskRun controller for the logs. These will be assigned round robin to different zones within the region, and then the resulting pod will be unable to be scheduled such that it can use both (it can't be in 2 zones at once).

Additional Info

We have several options for how to resolve this before we release:

If we finish Explore moving away from init containers #224 before the 0.4 release, we can remove the log PVC because users should be able to reliably (enough) get logs from the step containers themselves
If we complete Use storage bucket to link outputs and inputs of tasks #384, we could potentially get rid of the output PVC entirely
If we neither of the above complete, we should remove the PVC we added only for logs

It is arguable that (2) is not enough, and we should remove the extra log PVC regardless, since we shouldn't be leaking PVCs. The main benefit of the log PVC as it stands at the moment is that it allows us to check log output in end to end tests.

We noticed early on that logs from init containers are often cleaned up immediately by k8s, particularly if the containers are short running (e.g. just echoing "hello world"). We started down a path to correct that, which takes an approach based on Prow's entrypoint solution (https://github.com/kubernetes/test-infra/tree/master/prow/cmd/entrypoint) (even using the same image at the moment!) which wraps the user's provided command and streams logs to a volume, from which the logs can be uploaded/streamed by a sidecar. Since we are using init containers for step execution, we can't yet use sidecars, but we are addressing that in tektoncd#224 (also an entrypoint re-writing based solution). Once we have that, we can sidecar support, starting with GCS as a POC (#107) and moving into other types. In the meantime, to enable us to get logs (particularly in tests), we had the taskrun controller create a PVC on the fly to hold logs. This has two problems: * The PVCs are not cleaned up so this is an unexpected side effect for users * Combined with PVC based input + ouput linking, this causes scheduling problems for the resulting pods (tektoncd#375) Now that we want to have an official release, this would be a bad state to release in, so we will remove this magical log PVC creation logic, which was never our intended end state anyway. Since we _do_ need the entrypoint rewriting and log interception logic in the long run, this commit leaves most functionality intact, removing only the PVC creation and changing the volume being used to an `emptyDir`, which is what we will likely use for #107 (and this is how Prow handles this as well). This means the released functionality will be streaming logs to a location where nothing can read them, however I think it is better than completely removing the functionality b/c: 1. We need the functionality in the long run 2. Users should be prepared for this functionality (e.g. dealing with edge cases around the taskrun controller being able to fetch an image's entrypoint) Fixes tektoncd#387

Three caveats to this integration test: - It was created before tektoncd#320, so the resource binding is not correct - It was created before tektoncd#387, so it relies on the log PVC which will no longer exist (can work around this by mounting a PVC explicitly in the test and writing directly to it instead of echoing?) - It doesn't exercise `runAfter` functionality

This tests DAG functionality by defining a pipeline with both fan in and fan out. The idea is that each Task echoes the current time, so after the pipeline compeletes, we can look at which Task echoes which which time to make sure they run in order. The tasks are also declared in the Pipeline in the wrong order on purpose, to make sure that the order of declaration doesn't affect how they are run (the opposite of the current functionality) Three caveats to this integration test: - It was created before tektoncd#320, so the resource binding is not correct - It was created before tektoncd#387, so it relies on the log PVC which will no longer exist (can work around this by mounting a PVC explicitly in the test and writing directly to it instead of echoing?) - It doesn't exercise `runAfter` functionality

We noticed early on that logs from init containers are often cleaned up immediately by k8s, particularly if the containers are short running (e.g. just echoing "hello world"). We started down a path to correct that, which takes an approach based on Prow's entrypoint solution (https://github.com/kubernetes/test-infra/tree/master/prow/cmd/entrypoint) (even using the same image at the moment!) which wraps the user's provided command and streams logs to a volume, from which the logs can be uploaded/streamed by a sidecar. Since we are using init containers for step execution, we can't yet use sidecars, but we are addressing that in tektoncd#224 (also an entrypoint re-writing based solution). Once we have that, we can sidecar support, starting with GCS as a POC (#107) and moving into other types. In the meantime, to enable us to get logs (particularly in tests), we had the taskrun controller create a PVC on the fly to hold logs. This has two problems: * The PVCs are not cleaned up so this is an unexpected side effect for users * Combined with PVC based input + ouput linking, this causes scheduling problems for the resulting pods (tektoncd#375) Now that we want to have an official release, this would be a bad state to release in, so we will remove this magical log PVC creation logic, which was never our intended end state anyway. Since we _do_ need the entrypoint rewriting and log interception logic in the long run, this commit leaves most functionality intact, removing only the PVC creation and changing the volume being used to an `emptyDir`, which is what we will likely use for #107 (and this is how Prow handles this as well). This means the released functionality will be streaming logs to a location where nothing can read them, however I think it is better than completely removing the functionality b/c: 1. We need the functionality in the long run 2. Users should be prepared for this functionality (e.g. dealing with edge cases around the taskrun controller being able to fetch an image's entrypoint) Fixes tektoncd#387

We noticed early on that logs from init containers are often cleaned up immediately by k8s, particularly if the containers are short running (e.g. just echoing "hello world"). We started down a path to correct that, which takes an approach based on Prow's entrypoint solution (https://github.com/kubernetes/test-infra/tree/master/prow/cmd/entrypoint) (even using the same image at the moment!) which wraps the user's provided command and streams logs to a volume, from which the logs can be uploaded/streamed by a sidecar. Since we are using init containers for step execution, we can't yet use sidecars, but we are addressing that in #224 (also an entrypoint re-writing based solution). Once we have that, we can sidecar support, starting with GCS as a POC (#107) and moving into other types. In the meantime, to enable us to get logs (particularly in tests), we had the taskrun controller create a PVC on the fly to hold logs. This has two problems: * The PVCs are not cleaned up so this is an unexpected side effect for users * Combined with PVC based input + ouput linking, this causes scheduling problems for the resulting pods (#375) Now that we want to have an official release, this would be a bad state to release in, so we will remove this magical log PVC creation logic, which was never our intended end state anyway. Since we _do_ need the entrypoint rewriting and log interception logic in the long run, this commit leaves most functionality intact, removing only the PVC creation and changing the volume being used to an `emptyDir`, which is what we will likely use for #107 (and this is how Prow handles this as well). This means the released functionality will be streaming logs to a location where nothing can read them, however I think it is better than completely removing the functionality b/c: 1. We need the functionality in the long run 2. Users should be prepared for this functionality (e.g. dealing with edge cases around the taskrun controller being able to fetch an image's entrypoint) Fixes #387

Fixes tektoncd#387 This PR fixes a pointer bug where an EventListener with `n` triggers would have the last trigger execute `n` times instead of all `n` triggers executing once. This PR also updates the TestHandleEvent function to test multiple triggers. In the future, it might be worth refactoring the TestHandleEvent function into a testing table, so we can increase our test coverage.

bobcatfish added this to the 0.4 Knative Release milestone Jan 10, 2019

bobcatfish mentioned this issue Jan 22, 2019

Optimization of copying output resources to PVC(or intermediate storage location) #398

Closed

bobcatfish self-assigned this Jan 27, 2019

bobcatfish mentioned this issue Jan 27, 2019

Remove extra log PVC #443

Merged

knative-prow-robot closed this as completed in #443 Jan 31, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

0.4 release must not try to create multiple PVCs across controllers #387

0.4 release must not try to create multiple PVCs across controllers #387

bobcatfish commented Jan 10, 2019

0.4 release must not try to create multiple PVCs across controllers #387

0.4 release must not try to create multiple PVCs across controllers #387

Comments

bobcatfish commented Jan 10, 2019

Expected Behavior

Actual Behavior

Additional Info