Report error when common PVC cleanup job hangs #846

Merged (2 commits into devfile:main on Jun 21, 2022)

Conversation

AObuchow (Collaborator)

What does this PR do?

This PR reuses the logic for checking failed workspace deployments to check the status and events of the common PVC cleanup job's pods. This allows detecting pod failure events and statuses, which can be used to determine whether the cleanup pod could not be scheduled.
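
Roughly, the added check does something like the following (a simplified sketch rather than the exact code; the helper name and signature here are illustrative assumptions):

    // Illustrative sketch only -- not the exact code added in this PR.
    package storage

    import (
        "context"
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // checkCleanupJobPods lists the pods created by the common PVC cleanup job and
    // returns a human-readable failure message if any of their containers are in a
    // bad state (for example, the pod could not be scheduled or started).
    func checkCleanupJobPods(ctx context.Context, c client.Client, namespace, jobName string) (string, error) {
        pods := &corev1.PodList{}
        // Pods created by a Job carry the standard "job-name" label.
        listOpts := []client.ListOption{client.InNamespace(namespace), client.MatchingLabels{"job-name": jobName}}
        if err := c.List(ctx, pods, listOpts...); err != nil {
            return "", err
        }
        for _, pod := range pods.Items {
            for _, containerStatus := range pod.Status.ContainerStatuses {
                if waiting := containerStatus.State.Waiting; waiting != nil {
                    return fmt.Sprintf("container %s has state %s", containerStatus.Name, waiting.Reason), nil
                }
                if terminated := containerStatus.State.Terminated; terminated != nil && terminated.ExitCode != 0 {
                    return fmt.Sprintf("container %s terminated with exit code %d", containerStatus.Name, terminated.ExitCode), nil
                }
            }
        }
        return "", nil
    }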

What issues does this PR fix or reference?

Fix #551

Is it tested? How?

In the PR's current state, the common PVC cleanup job spec has been modified so that the created cleanup pods fail:

	Args: []string{
		"-c",
- 		fmt.Sprintf(cleanupCommandFmt, path.Join(pvcClaimMountPath, workspaceId)),
+		 "exit 1",
	},

Though this doesn't create a case where the cleanup job's pods can't be scheduled, it does exercise the cleanup.go code that this PR adds.

To test this PR:

  1. Start up DWO
  2. Create 2 workspaces that use the common PVC storage strategy
  3. Delete one of the workspaces so that the common PVC cleanup job will be run
  4. Ensure DWO logs an error related to the cleanup job's status, similar to the following (the reported state will vary depending on the pod's state):
"level":"error","ts":1653943385.0445013,"logger":"controllers.DevWorkspace","msg":"Failed to clean up DevWorkspace storage","Request.Namespace":"devworkspace-controller","Request.Name":"theia-next-3","devworkspace_id":"workspace0192d636bd3f4b98","error":"DevWorkspace PVC cleanup job failed: see logs for job \"cleanup-workspace0192d636bd3f4b98\" for details. Additional information: Common PVC Cleanup related container cleanup-workspace0192d636bd3f4b98 has state ContainerCreating."

PR Checklist

  • E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
    • v8-devworkspace-operator-e2e: DevWorkspace e2e test
    • v8-che-happy-path: Happy path for verification integration with Che

pkg/library/status/check.go (two outdated review threads, resolved)
@@ -146,7 +167,8 @@ func getSpecCommonPVCCleanupJob(workspace *dw.DevWorkspace, clusterAPI sync.Clus
 			Command: []string{"/bin/sh"},
 			Args: []string{
 				"-c",
-				fmt.Sprintf(cleanupCommandFmt, path.Join(pvcClaimMountPath, workspaceId)),
+				//fmt.Sprintf(cleanupCommandFmt, path.Join(pvcClaimMountPath, workspaceId)),
AObuchow (Collaborator, Author):

These two lines should be reverted before merging, as they exist for testing purposes

Collaborator:

Don't forget :)

	for _, pod := range pods.Items {
		for _, containerStatus := range pod.Status.ContainerStatuses {
			if check.CheckContainerStatusForFailure(&containerStatus) {
AObuchow (Collaborator, Author):

Currently, it's possible for the container status to be set to waiting with reason ContainerCreating, which doesn't seem like it should return an error, though it will cause my patch to return a ProvisionError. This should be fixed IMO, though I'm not sure whether the fix should be specific to the cleanup job or part of check.CheckContainerStatusForFailure (which would impact workspace deployments).
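
A possible shape for that fix (a rough sketch with assumed names, not actual code from this PR) would be to treat transient startup reasons as non-failures inside the status helper:

    package status

    import corev1 "k8s.io/api/core/v1"

    // failedWaitingState reports whether a waiting container state should count as
    // a failure. Transient startup reasons are ignored so a pod that is merely
    // ContainerCreating does not surface as a ProvisionError.
    func failedWaitingState(waiting *corev1.ContainerStateWaiting) bool {
        if waiting == nil {
            return false
        }
        switch waiting.Reason {
        case "ContainerCreating", "PodInitializing":
            return false // still starting up; not a failure
        default:
            return true // e.g. CrashLoopBackOff, ErrImagePull, CreateContainerError
        }
    }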

@amisevsk (Collaborator) left a comment:

Generally looks good, a few comments. I'll test it out soon.

pkg/library/status/check.go (outdated review thread, resolved)
pkg/provision/storage/cleanup.go (outdated review thread, resolved)
pkg/provision/workspace/deployment.go (outdated review thread, resolved)
pkg/library/status/check.go (outdated review thread, resolved)
@amisevsk (Collaborator):

/ok-to-test

@@ -28,7 +28,6 @@ import (
 	corev1 "k8s.io/api/core/v1"
 	"k8s.io/apimachinery/pkg/fields"
 	k8sclient "sigs.k8s.io/controller-runtime/pkg/client"
-	runtimeClient "sigs.k8s.io/controller-runtime/pkg/client"
AObuchow (Collaborator, Author):

Removing GetPods also fixed this duplicate import.. nice :)

	for _, pod := range pods.Items {
		for _, containerStatus := range pod.Status.ContainerStatuses {
			if status.CheckContainerStatusForFailure(&containerStatus) {
				// TODO: Maybe move this logic into CheckContainerStatusForFailure and return bool, reason ?
@AObuchow (Collaborator, Author) commented Jun 1, 2022:

Also need to address this TODO before the PR is ready for merge: status.CheckContainerStatusForFailure currently assumes that if there is a failure, the container status state will be set to waiting, when it could also be set to terminated.
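
A rough sketch of what handling the terminated case could look like (the helper name is an assumption, not code from this PR):

    package status

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
    )

    // terminatedWithError reports whether a container finished with a non-zero exit
    // code, a case a waiting-only check would miss, along with a reason string.
    func terminatedWithError(containerStatus *corev1.ContainerStatus) (bool, string) {
        if t := containerStatus.State.Terminated; t != nil && t.ExitCode != 0 {
            return true, fmt.Sprintf("terminated with exit code %d (%s)", t.ExitCode, t.Reason)
        }
        return false, ""
    }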


	for _, pod := range pods.Items {
		for _, containerStatus := range pod.Status.ContainerStatuses {
			if status.CheckContainerStatusForFailure(&containerStatus) {
AObuchow (Collaborator, Author):

I knew something was wrong... this should be negated, i.e. !status.CheckContainerStatusForFailure(&containerStatus)

Comment on lines 238 to 240
	noFailure, reason := status.CheckContainerStatusForFailure(&containerStatus)
	if !noFailure {
		return fmt.Sprintf("Common PVC Cleanup related container %s has state %s.", containerStatus.Name, reason), nil
Collaborator:

I'd rename here to avoid the double negation ("if not no failure")
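
For example, assuming the helper keeps the (noFailure, reason) signature shown above, a small wrapper (sketch only, with an assumed name) would let the call site read "if failed" instead of "if !noFailure":

    package storage

    import (
        corev1 "k8s.io/api/core/v1"

        "github.com/devfile/devworkspace-operator/pkg/library/status"
    )

    // containerFailed inverts the existing helper so call sites avoid the double
    // negation. (The merged PR may instead rename the helper itself.)
    func containerFailed(containerStatus *corev1.ContainerStatus) (bool, string) {
        noFailure, reason := status.CheckContainerStatusForFailure(containerStatus)
        return !noFailure, reason
    }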

Comment on lines 244 to 249
	for _, initContainerStatus := range pod.Status.InitContainerStatuses {
		noFailure, reason := status.CheckContainerStatusForFailure(&initContainerStatus)
		if !noFailure {
			return fmt.Sprintf("Common PVC Cleanup related init container %s has state %s.", initContainerStatus.Name, reason), nil
		}
	}
Collaborator:

The cleanup job doesn't have any init containers as far as I know, so it's not clear why this check is necessary.

AObuchow (Collaborator, Author):

Good catch, thanks :)

pkg/library/status/check.go (review thread, resolved)
@ibuziuk (Contributor) commented Jun 14, 2022:

@AObuchow could you please confirm whether this is the expected flow? Currently, it looks like the error logs are streamed forever if the error happens:

[screenshot: cleanup job]

@AObuchow (Collaborator, Author):

@AObuchow could you please confirm whether this is the expected flow? Currently, it looks like the error logs are streamed forever if the error happens:

I believe the continuous stream of errors is related to #845 which was fixed on the main branch. I just rebased this PR to get the fix. When you get a chance, please let me know if this issue persists, as it is not the expected flow (The error should ideally be logged only once).

@ibuziuk (Contributor) left a comment:

@AObuchow after the rebase I'm seeing even weirder behavior:

  • create 2 devworkspaces
  • delete one of them
  • ERROR: the common PVC is terminating even though a devworkspace is still running:
    [screenshot: terminating PVC]

@amisevsk (Collaborator):

@ibuziuk That issue is on me -- caused by #858, fixed by #870. I guess when testing I didn't test specifically two workspaces -> deleting one of them.

Signed-off-by: Andrew Obuchowicz <aobuchow@redhat.com>
@amisevsk (Collaborator) left a comment:

Tested on OpenShift -- no issues. Well done 👍.

I did see one issue that should be addressed in a separate PR: if you have a workspace that failed to delete in this way and then delete all workspaces, the common PVC is deleted and all non-errored workspaces are removed, but workspaces that failed to clean up the PVC are still stuck in an errored state. This is likely because we don't process errored workspaces, so I don't have a good fix in mind.

To reproduce:

  1. oc apply -f samples/theia-next.yaml
  2. yq '.metadata.name="theia-next-2"' samples/theia-next.yaml | kubectl apply -f -
  3. Wait for workspaces to start/get finalizers at least
  4. oc delete dw theia-next
  5. Wait for deletion to hit error
  6. oc delete dw --all

This results in the shared PVC and the theia-next workspace being deleted, but the theia-next-2 workspace being left in its errored state. At this point we can technically remove the storage finalizer from the errored workspace, as the PVC we're waiting to clean up is gone.
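
As a rough illustration of that workaround (the finalizer name, types, and helper below are assumptions for the sake of the sketch, not part of this PR):

    package storage

    import (
        "context"

        dw "github.com/devfile/api/v2/pkg/apis/workspaces/v1alpha2"
        "sigs.k8s.io/controller-runtime/pkg/client"
    )

    // storageFinalizer is assumed to be the finalizer DWO places on workspaces that
    // still need PVC cleanup; the actual constant lives in the operator's code.
    const storageFinalizer = "storage.controller.devfile.io"

    // dropStorageFinalizer removes the storage finalizer from an errored workspace
    // so it can be garbage-collected once the shared PVC is already gone.
    func dropStorageFinalizer(ctx context.Context, c client.Client, workspace *dw.DevWorkspace) error {
        kept := workspace.Finalizers[:0]
        for _, f := range workspace.Finalizers {
            if f != storageFinalizer {
                kept = append(kept, f)
            }
        }
        workspace.Finalizers = kept
        return c.Update(ctx, workspace)
    }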

@@ -146,7 +167,8 @@ func getSpecCommonPVCCleanupJob(workspace *dw.DevWorkspace, clusterAPI sync.Clus
 			Command: []string{"/bin/sh"},
 			Args: []string{
 				"-c",
-				fmt.Sprintf(cleanupCommandFmt, path.Join(pvcClaimMountPath, workspaceId)),
+				//fmt.Sprintf(cleanupCommandFmt, path.Join(pvcClaimMountPath, workspaceId)),
Collaborator:

Don't forget :)

@amisevsk (Collaborator):

PR needs squash + signoff on all commits (and then re-run tests)

/ok-to-test

Fix devfile#551

Signed-off-by: Andrew Obuchowicz <aobuchow@redhat.com>
@ibuziuk (Contributor) left a comment:

@AObuchow LGTM
feel free to merge 👍

@openshift-ci bot commented Jun 20, 2022:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: amisevsk, AObuchow, ibuziuk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@AObuchow AObuchow merged commit f3e317a into devfile:main Jun 21, 2022
@AObuchow AObuchow deleted the check_pvc_cleanup_job branch June 21, 2022 14:03
@amisevsk amisevsk mentioned this pull request Jul 4, 2022
Successfully merging this pull request may close these issues:

Check PVC cleanup pods for failure (events and status)