Report error when common PVC cleanup job hangs #846
Conversation
pkg/provision/storage/cleanup.go
Outdated
@@ -146,7 +167,8 @@ func getSpecCommonPVCCleanupJob(workspace *dw.DevWorkspace, clusterAPI sync.Clus
 	Command: []string{"/bin/sh"},
 	Args: []string{
 		"-c",
-		fmt.Sprintf(cleanupCommandFmt, path.Join(pvcClaimMountPath, workspaceId)),
+		//fmt.Sprintf(cleanupCommandFmt, path.Join(pvcClaimMountPath, workspaceId)),
These two lines should be reverted before merging, as they exist for testing purposes
Don't forget :)
pkg/provision/storage/cleanup.go
Outdated
for _, pod := range pods.Items {
	for _, containerStatus := range pod.Status.ContainerStatuses {
		if check.CheckContainerStatusForFailure(&containerStatus) {
Currently, it's possible for the container status to be set to waiting with reason ContainerCreating, which doesn't seem like it should return an error, though it will cause my patch to return a ProvisionError. This should be fixed IMO, though I'm not sure if the fix should be specific to the cleanup job or part of check.CheckContainerStatusForFailure (which will impact workspace deployments).
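A minimal sketch of one possible fix, assuming the check is extended to skip transient waiting reasons (the reason strings are the standard kubelet ones; the function name mirrors the project's, but the body is hypothetical):

package status

import corev1 "k8s.io/api/core/v1"

// Waiting reasons that indicate normal pod startup rather than failure.
var transientWaitingReasons = map[string]bool{
	"ContainerCreating": true,
	"PodInitializing":   true,
}

// checkContainerStatusForFailure returns true if the container looks healthy.
// Hypothetical: transient waiting reasons are not treated as failures.
func checkContainerStatusForFailure(containerStatus *corev1.ContainerStatus) bool {
	if waiting := containerStatus.State.Waiting; waiting != nil {
		return transientWaitingReasons[waiting.Reason]
	}
	return true
}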
Generally looks good, a few comments. I'll test it out soon.
/ok-to-test
pkg/library/status/check.go
Outdated
@@ -28,7 +28,6 @@ import (
 	corev1 "k8s.io/api/core/v1"
 	"k8s.io/apimachinery/pkg/fields"
 	k8sclient "sigs.k8s.io/controller-runtime/pkg/client"
-	runtimeClient "sigs.k8s.io/controller-runtime/pkg/client"
Removing GetPods also fixed this duplicate import. Nice :)
pkg/provision/storage/cleanup.go
Outdated
for _, pod := range pods.Items {
	for _, containerStatus := range pod.Status.ContainerStatuses {
		if status.CheckContainerStatusForFailure(&containerStatus) {
			// TODO: Maybe move this logic into CheckContainerStatusForFailure and return bool, reason ?
Also need to address this TODO before the PR is ready for merge. status.CheckContainerStatusForFailure currently assumes that if there is a failure, the container status state will be set to waiting, when it could also be set to terminated.
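One hedged way to address this (the failure-reason list is an assumption, not the project's final code; assumes corev1 "k8s.io/api/core/v1" is imported): check both the waiting and the terminated states, and return the reason alongside the verdict.

// Hypothetical: returns ok == false plus a human-readable reason if the
// container is waiting on a known failure or terminated with a non-zero exit.
func checkContainerStatusForFailure(containerStatus *corev1.ContainerStatus) (ok bool, reason string) {
	if waiting := containerStatus.State.Waiting; waiting != nil {
		switch waiting.Reason {
		case "CrashLoopBackOff", "ImagePullBackOff", "ErrImagePull", "CreateContainerError":
			return false, waiting.Reason
		}
	}
	if terminated := containerStatus.State.Terminated; terminated != nil && terminated.ExitCode != 0 {
		return false, terminated.Reason
	}
	return true, ""
}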
pkg/provision/storage/cleanup.go
Outdated
for _, pod := range pods.Items {
	for _, containerStatus := range pod.Status.ContainerStatuses {
		if status.CheckContainerStatusForFailure(&containerStatus) {
I knew something was wrong... this should be negated, i.e. !status.CheckContainerStatusForFailure(&containerStatus)
pkg/provision/storage/cleanup.go
Outdated
noFailure, reason := status.CheckContainerStatusForFailure(&containerStatus)
if !noFailure {
	return fmt.Sprintf("Common PVC Cleanup related container %s has state %s.", containerStatus.Name, reason), nil
I'd rename here to avoid the double negation ("if not no failure").
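For example (a hypothetical rename only; the check itself is unchanged, but the call site now reads naturally):

ok, reason := status.CheckContainerStatusForFailure(&containerStatus)
if !ok {
	return fmt.Sprintf("Common PVC Cleanup related container %s has state %s.", containerStatus.Name, reason), nil
}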
pkg/provision/storage/cleanup.go
Outdated
for _, initContainerStatus := range pod.Status.InitContainerStatuses {
	noFailure, reason := status.CheckContainerStatusForFailure(&initContainerStatus)
	if !noFailure {
		return fmt.Sprintf("Common PVC Cleanup related init container %s has state %s.", initContainerStatus.Name, reason), nil
	}
}
The cleanup job doesn't have any init containers as far as I know, so it's not clear why this check is necessary.
Good catch, thanks :)
@AObuchow could you please check if this is the expected flow? Currently, it looks like the error logs are streamed forever if the error happens:
Force-pushed from a7b968b to 045870e
I believe the continuous stream of errors is related to #845, which was fixed on the main branch. I just rebased this PR to get the fix. When you get a chance, please let me know if this issue persists, as it is not the expected flow (the error should ideally be logged only once).
@AObuchow after the rebase I see even weirder behavior:
Force-pushed from 045870e to d31732c (Signed-off-by: Andrew Obuchowicz <aobuchow@redhat.com>)
Tested on OpenShift -- no issues. Well done 👍.
I did see one issue that should be addressed in a separate PR: if you have a workspace that failed to delete in this way and then delete all workspaces, the common PVC is deleted and all non-errored workspaces are removed, but workspaces that failed to clean up the PVC are still stuck in an errored state. This is likely due to us not processing errored workspaces, so I don't have a good fix in mind.
To reproduce:
- oc apply -f samples/theia-next.yaml
- yq '.metadata.name="theia-next-2"' samples/theia-next.yaml | kubectl apply -f -
- Wait for the workspaces to start, or at least get finalizers
- oc delete dw theia-next
- Wait for the deletion to hit the error
- oc delete dw --all
This results in the shared PVC and the theia-next workspace being deleted, but the theia-next-2 workspace is left in its errored state. At this point we can technically remove the storage finalizer from the errored workspace, as the PVC we're waiting to clean up is gone.
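A hedged sketch of what that manual unblocking could look like with controller-runtime (the finalizer name is an assumption; check the stuck workspace's metadata for the actual value):

import (
	"context"

	dw "github.com/devfile/api/v2/pkg/apis/workspaces/v1alpha2"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// removeStorageFinalizer drops the (assumed) storage finalizer from a
// workspace stuck in deletion after the shared PVC is already gone.
func removeStorageFinalizer(ctx context.Context, c client.Client, workspace *dw.DevWorkspace) error {
	controllerutil.RemoveFinalizer(workspace, "storage.controller.devfile.io") // finalizer name assumed
	return c.Update(ctx, workspace)
}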
PR needs squash + signoff on all commits (and then re-run tests). /ok-to-test
Force-pushed from d31732c to 9c13ba4 (Fix devfile#551, Signed-off-by: Andrew Obuchowicz <aobuchow@redhat.com>)
@AObuchow LGTM
feel free to merge 👍
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: amisevsk, AObuchow, ibuziuk
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What does this PR do?
This PR reuses the logic for checking failed workspace deployments to check the status and events of the common PVC cleanup job's pods. This allows detecting pod failure events and statuses, which can indicate whether the cleanup pod was unable to be scheduled.
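As a rough sketch of that approach (the function name and job label value are assumptions, not the PR's exact code): list the cleanup job's pods and run the shared failure check over each container status.

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"github.com/devfile/devworkspace-operator/pkg/library/status"
)

// Hypothetical sketch: report a failure message if any cleanup-job
// container is in a failed state; an empty message means no failure yet.
func checkCleanupPodsState(ctx context.Context, c client.Client, namespace string) (string, error) {
	pods := &corev1.PodList{}
	err := c.List(ctx, pods,
		client.InNamespace(namespace),
		client.MatchingLabels{"job-name": "cleanup-shared-pvc"}) // label value assumed
	if err != nil {
		return "", err
	}
	for _, pod := range pods.Items {
		for _, containerStatus := range pod.Status.ContainerStatuses {
			ok, reason := status.CheckContainerStatusForFailure(&containerStatus)
			if !ok {
				return fmt.Sprintf("Common PVC Cleanup related container %s has state %s.", containerStatus.Name, reason), nil
			}
		}
	}
	return "", nil
}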
What issues does this PR fix or reference?
Fix #551
Is it tested? How?
In the PR's current state, the common PVC cleanup job spec has been modified so that the created cleanup pods fail. Though this doesn't create a case where the cleanup job's pods are unable to schedule, it exercises the cleanup.go code that this PR adds.
To test this PR:
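For reference, a hedged example of the kind of temporary failure-inducing change mentioned above (the exact command is an assumption; as noted in the review, it must be reverted before merge):

// In getSpecCommonPVCCleanupJob: temporarily replace the real cleanup
// command so the pod always exits non-zero and triggers the failure path.
Command: []string{"/bin/sh"},
Args: []string{
	"-c",
	// fmt.Sprintf(cleanupCommandFmt, path.Join(pvcClaimMountPath, workspaceId)),
	"echo 'simulated cleanup failure' && exit 1", // testing only, revert before merge
},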
PR Checklist
- E2E tests pass (comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
  - v8-devworkspace-operator-e2e: DevWorkspace e2e test
  - v8-che-happy-path: Happy path for verification integration with Che