Report error when common PVC cleanup job hangs #846

Merged (2 commits into devfile:main on Jun 21, 2022)

Conversation

AObuchow (Collaborator)

What does this PR do?

This PR reuses the logic for checking failed workspace deployments to check the status and events of the common PVC cleanup job's pods. This allows detecting pod failure events and statuses, which can be used to determine whether the cleanup pod could not be scheduled.
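
Roughly, the added check does something like the following (a simplified sketch rather than the exact code; the helper name and signature here are illustrative assumptions):

    // Illustrative sketch only -- not the exact code added in this PR.
    package storage

    import (
        "context"
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // checkCleanupJobPods lists the pods created by the common PVC cleanup job and
    // returns a human-readable failure message if any of their containers are in a
    // bad state (for example, the pod could not be scheduled or started).
    func checkCleanupJobPods(ctx context.Context, c client.Client, namespace, jobName string) (string, error) {
        pods := &corev1.PodList{}
        // Pods created by a Job carry the standard "job-name" label.
        listOpts := []client.ListOption{client.InNamespace(namespace), client.MatchingLabels{"job-name": jobName}}
        if err := c.List(ctx, pods, listOpts...); err != nil {
            return "", err
        }
        for _, pod := range pods.Items {
            for _, containerStatus := range pod.Status.ContainerStatuses {
                if waiting := containerStatus.State.Waiting; waiting != nil {
                    return fmt.Sprintf("container %s has state %s", containerStatus.Name, waiting.Reason), nil
                }
                if terminated := containerStatus.State.Terminated; terminated != nil && terminated.ExitCode != 0 {
                    return fmt.Sprintf("container %s terminated with exit code %d", containerStatus.Name, terminated.ExitCode), nil
                }
            }
        }
        return "", nil
    }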

What issues does this PR fix or reference?

Fix #551

Is it tested? How?

In the PR's current state, the common PVC cleanup job spec has been modified so that the created cleanup pods fail:

	Args: []string{
		"-c",
- 		fmt.Sprintf(cleanupCommandFmt, path.Join(pvcClaimMountPath, workspaceId)),
+		 "exit 1",
	},

Though this doesn't create a case where the cleanup job's pods can't be scheduled, it does exercise the cleanup.go code that this PR adds.

To test this PR:

  1. Start up DWO
  2. Create 2 workspaces that use the common PVC storage strategy
  3. Delete one of the workspaces so that the common PVC cleanup job will be run
  4. Ensure DWO logs an error related to the cleanup job's status, similar to the following (the reported state will vary depending on the pod's state):
"level":"error","ts":1653943385.0445013,"logger":"controllers.DevWorkspace","msg":"Failed to clean up DevWorkspace storage","Request.Namespace":"devworkspace-controller","Request.Name":"theia-next-3","devworkspace_id":"workspace0192d636bd3f4b98","error":"DevWorkspace PVC cleanup job failed: see logs for job \"cleanup-workspace0192d636bd3f4b98\" for details. Additional information: Common PVC Cleanup related container cleanup-workspace0192d636bd3f4b98 has state ContainerCreating."

PR Checklist

  • E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
    • v8-devworkspace-operator-e2e: DevWorkspace e2e test
    • v8-che-happy-path: Happy path for verification integration with Che

pkg/library/status/check.go (two outdated review threads, resolved)
@@ -146,7 +167,8 @@ func getSpecCommonPVCCleanupJob(workspace *dw.DevWorkspace, clusterAPI sync.Clus
 			Command: []string{"/bin/sh"},
 			Args: []string{
 				"-c",
-				fmt.Sprintf(cleanupCommandFmt, path.Join(pvcClaimMountPath, workspaceId)),
+				//fmt.Sprintf(cleanupCommandFmt, path.Join(pvcClaimMountPath, workspaceId)),
AObuchow (Collaborator, Author):

These two lines should be reverted before merging, as they exist for testing purposes

Collaborator:

Don't forget :)

	for _, pod := range pods.Items {
		for _, containerStatus := range pod.Status.ContainerStatuses {
			if check.CheckContainerStatusForFailure(&containerStatus) {
AObuchow (Collaborator, Author):

Currently, it's possible for the container status to be set to waiting with reason ContainerCreating, which doesn't seem like it should return an error, though it will cause my patch to return a ProvisionError. This should be fixed IMO, though I'm not sure whether the fix should be specific to the cleanup job or part of check.CheckContainerStatusForFailure (which would impact workspace deployments).
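
A possible shape for that fix (a rough sketch with assumed names, not actual code from this PR) would be to treat transient startup reasons as non-failures inside the status helper:

    package status

    import corev1 "k8s.io/api/core/v1"

    // failedWaitingState reports whether a waiting container state should count as
    // a failure. Transient startup reasons are ignored so a pod that is merely
    // ContainerCreating does not surface as a ProvisionError.
    func failedWaitingState(waiting *corev1.ContainerStateWaiting) bool {
        if waiting == nil {
            return false
        }
        switch waiting.Reason {
        case "ContainerCreating", "PodInitializing":
            return false // still starting up; not a failure
        default:
            return true // e.g. CrashLoopBackOff, ErrImagePull, CreateContainerError
        }
    }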

@amisevsk (Collaborator) left a comment:

Generally looks good, a few comments. I'll test it out soon.

pkg/library/status/check.go (outdated review thread, resolved)
pkg/provision/storage/cleanup.go (outdated review thread, resolved)
pkg/provision/workspace/deployment.go (outdated review thread, resolved)
pkg/library/status/check.go (outdated review thread, resolved)
@amisevsk (Collaborator):

/ok-to-test

@@ -28,7 +28,6 @@ import (
 	corev1 "k8s.io/api/core/v1"
 	"k8s.io/apimachinery/pkg/fields"
 	k8sclient "sigs.k8s.io/controller-runtime/pkg/client"
-	runtimeClient "sigs.k8s.io/controller-runtime/pkg/client"
AObuchow (Collaborator, Author):

Removing GetPods also fixed this duplicate import.. nice :)

	for _, pod := range pods.Items {
		for _, containerStatus := range pod.Status.ContainerStatuses {
			if status.CheckContainerStatusForFailure(&containerStatus) {
				// TODO: Maybe move this logic into CheckContainerStatusForFailure and return bool, reason ?
@AObuchow (Collaborator, Author) commented Jun 1, 2022:

Also need to address this TODO before the PR is ready for merge: status.CheckContainerStatusForFailure currently assumes that if there is a failure, the container status state will be set to waiting, when it could also be set to terminated.
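
A rough sketch of what handling the terminated case could look like (the helper name is an assumption, not code from this PR):

    package status

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
    )

    // terminatedWithError reports whether a container finished with a non-zero exit
    // code, a case a waiting-only check would miss, along with a reason string.
    func terminatedWithError(containerStatus *corev1.ContainerStatus) (bool, string) {
        if t := containerStatus.State.Terminated; t != nil && t.ExitCode != 0 {
            return true, fmt.Sprintf("terminated with exit code %d (%s)", t.ExitCode, t.Reason)
        }
        return false, ""
    }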


	for _, pod := range pods.Items {
		for _, containerStatus := range pod.Status.ContainerStatuses {
			if status.CheckContainerStatusForFailure(&containerStatus) {
AObuchow (Collaborator, Author):

I knew something was wrong... this should be negated, i.e. !status.CheckContainerStatusForFailure(&containerStatus)

Comment on lines 238 to 240
	noFailure, reason := status.CheckContainerStatusForFailure(&containerStatus)
	if !noFailure {
		return fmt.Sprintf("Common PVC Cleanup related container %s has state %s.", containerStatus.Name, reason), nil
Collaborator:

I'd rename here to avoid the double negation ("if not no failure")
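
For example, assuming the helper keeps the (noFailure, reason) signature shown above, a small wrapper (sketch only, with an assumed name) would let the call site read "if failed" instead of "if !noFailure":

    package storage

    import (
        corev1 "k8s.io/api/core/v1"

        "github.com/devfile/devworkspace-operator/pkg/library/status"
    )

    // containerFailed inverts the existing helper so call sites avoid the double
    // negation. (The merged PR may instead rename the helper itself.)
    func containerFailed(containerStatus *corev1.ContainerStatus) (bool, string) {
        noFailure, reason := status.CheckContainerStatusForFailure(containerStatus)
        return !noFailure, reason
    }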

Comment on lines 244 to 249
	for _, initContainerStatus := range pod.Status.InitContainerStatuses {
		noFailure, reason := status.CheckContainerStatusForFailure(&initContainerStatus)
		if !noFailure {
			return fmt.Sprintf("Common PVC Cleanup related init container %s has state %s.", initContainerStatus.Name, reason), nil
		}
	}
Collaborator:

The cleanup job doesn't have any init containers as far as I know, so it's not clear why this check is necessary.

AObuchow (Collaborator, Author):

Good catch, thanks :)

pkg/library/status/check.go (review thread, resolved)
@ibuziuk (Contributor) commented Jun 14, 2022:

@AObuchow could you please confirm whether this is the expected flow? Currently, it looks like the error logs are streamed forever if the error happens:

[screenshot: cleanup job]

@AObuchow (Collaborator, Author):

@AObuchow could you please confirm whether this is the expected flow? Currently, it looks like the error logs are streamed forever if the error happens:

I believe the continuous stream of errors is related to #845 which was fixed on the main branch. I just rebased this PR to get the fix. When you get a chance, please let me know if this issue persists, as it is not the expected flow (The error should ideally be logged only once).

@ibuziuk (Contributor) left a comment:

@AObuchow after the rebase I'm seeing even weirder behavior:

  • create 2 devworkspaces
  • delete one of them
  • ERROR: the common PVC is terminating even though a devworkspace is still running:
    [screenshot: terminating PVC]

@amisevsk (Collaborator):

@ibuziuk That issue is on me -- caused by #858, fixed by #870. I guess when testing I didn't test specifically two workspaces -> deleting one of them.

Signed-off-by: Andrew Obuchowicz <aobuchow@redhat.com>
@amisevsk (Collaborator) left a comment:

Tested on OpenShift -- no issues. Well done 👍.

I did see one issue that should be addressed in a separate PR: if you have a workspace that failed to delete in this way and then delete all workspaces, the common PVC is deleted and all non-errored workspaces are removed, but workspaces that failed to clean up the PVC are still stuck in an errored state. This is likely because we don't process errored workspaces, so I don't have a good fix in mind.

To reproduce:

  1. oc apply -f samples/theia-next.yaml
  2. yq '.metadata.name="theia-next-2"' samples/theia-next.yaml | kubectl apply -f -
  3. Wait for workspaces to start/get finalizers at least
  4. oc delete dw theia-next
  5. Wait for deletion to hit error
  6. oc delete dw --all

This results in the shared PVC and the theia-next workspace being deleted, but the theia-next-2 workspace being left in its errored state. At this point we can technically remove the storage finalizer from the errored workspace, as the PVC we're waiting to clean up is gone.
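
As a rough illustration of that workaround (the finalizer name, types, and helper below are assumptions for the sake of the sketch, not part of this PR):

    package storage

    import (
        "context"

        dw "github.com/devfile/api/v2/pkg/apis/workspaces/v1alpha2"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // storageFinalizer is assumed to be the finalizer DWO places on workspaces that
    // still need PVC cleanup; the actual constant lives in the operator's code.
    const storageFinalizer = "storage.controller.devfile.io"

    // dropStorageFinalizer removes the storage finalizer from an errored workspace
    // so it can be garbage-collected once the shared PVC is already gone.
    func dropStorageFinalizer(ctx context.Context, c client.Client, workspace *dw.DevWorkspace) error {
        kept := workspace.Finalizers[:0]
        for _, f := range workspace.Finalizers {
            if f != storageFinalizer {
                kept = append(kept, f)
            }
        }
        workspace.Finalizers = kept
        return c.Update(ctx, workspace)
    }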

@@ -146,7 +167,8 @@ func getSpecCommonPVCCleanupJob(workspace *dw.DevWorkspace, clusterAPI sync.Clus
 			Command: []string{"/bin/sh"},
 			Args: []string{
 				"-c",
-				fmt.Sprintf(cleanupCommandFmt, path.Join(pvcClaimMountPath, workspaceId)),
+				//fmt.Sprintf(cleanupCommandFmt, path.Join(pvcClaimMountPath, workspaceId)),
Collaborator:

Don't forget :)

@amisevsk (Collaborator):

PR needs squash + signoff on all commits (and then re-run tests)

/ok-to-test

Fix devfile#551

Signed-off-by: Andrew Obuchowicz <aobuchow@redhat.com>
@ibuziuk (Contributor) left a comment:

@AObuchow LGTM
feel free to merge 👍

@openshift-ci bot commented Jun 20, 2022:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: amisevsk, AObuchow, ibuziuk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@AObuchow AObuchow merged commit f3e317a into devfile:main Jun 21, 2022
@AObuchow AObuchow deleted the check_pvc_cleanup_job branch June 21, 2022 14:03
@amisevsk amisevsk mentioned this pull request Jul 4, 2022
Successfully merging this pull request may close these issues:

Check PVC cleanup pods for failure (events and status)