Boskos server fails to claim PV after restart/reschedule from the node #4571
The event is cryptic. @kubernetes/sig-storage-misc how can we debug this kind of issue?
You need to look for the volume in kubelet.log
(Hmm, seems it just happened again. cc @jingxu97, who was investigating offline to try to find some clues.)
@msau42 I think it's something else - this time it has been 6h and the pod is still not ready - let's get the StatefulSet conversion done ASAP.
From the log, I see the following issue:
Between the first pod and the second pod, there are lots of node-failure error messages. I am not sure what happened to the tests and whether this is normal.
Hmm, my guess for why this happened: since 1.5, we changed the node controller to not automatically clean up pods on unhealthy nodes. As part of that, cleanup happens only when the node is deleted from the API. So when a node goes unready/fails, its pods go into an "Unknown" state and stay stuck with the mounted volume. The controller knows to create replacements, but they cannot mount the volume. The fix would be detecting an unhealthy node and simply deleting it from the API (which GKE should do automagically).
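For illustration, a minimal sketch of the "delete the unhealthy node from the API" step using client-go. The node name and the surrounding detection logic are assumptions, and the call signatures follow recent client-go releases rather than the 1.7-era API:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// In-cluster config; a kubeconfig-based config would work the same way.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Deleting the Node object is what lets the pods on it be cleaned up,
	// which in turn lets the attach/detach controller release the volume.
	nodeName := "unhealthy-node-name" // hypothetical; detection logic not shown
	if err := cs.CoreV1().Nodes().Delete(context.TODO(), nodeName, metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("deleted node", nodeName)
}
```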
@foxish that makes sense; the detach code does not kick in unless the pod is in the Failed or Succeeded state or is deleted. So if it goes into the Unknown state, the volume will remain attached (the safe behavior).
Why didn't that happen in @krzyzacy's case?
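A rough sketch of the rule being described above, not the actual attach/detach controller code (the helper name and the deletedFromAPIServer flag are made up for illustration):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// keepVolumeAttached mirrors the behavior described above: keep the volume
// attached unless the pod has terminated or its object has been deleted from
// the API server. A pod that is merely in the "Unknown" state does not
// qualify, which is the safe-but-stuck case this issue ran into.
func keepVolumeAttached(pod *v1.Pod, deletedFromAPIServer bool) bool {
	if deletedFromAPIServer {
		return false // pod object is gone: safe to detach
	}
	switch pod.Status.Phase {
	case v1.PodFailed, v1.PodSucceeded:
		return false // terminated pods no longer need the volume
	default:
		return true // Running, Pending, Unknown: stay attached to protect data
	}
}
```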
For the replica_set controller, will it not delete the unknown-state pod before creating a new pod? How does the controller count the number of pods in the ReplicaSet; does it not count the unknown-status pod?
We have GKE node-auto-repair on, and that should auto-fix any unhealthy nodes?
Yeah, the pod will get the deletionTimestamp set.
We don't detach the volume from a "to-be-deleted" pod. Only after the pod is deleted from the API server will the controller detach.
@saad-ali Based on @foxish's description, I think there is a mismatch between the replica_set controller and the attach_detach controller. If a pod is in the "Unknown" state, the replica_set controller will not count it as a running pod and will start a new one. But the attach_detach controller will not detach until the "Unknown" pod is deleted from the API server. If no garbage collection happens for this "Unknown" pod for a long time, the new pod will never get the volume attached to the new node. But I assume that after the deletionTimestamp is set, the pod should be deleted soon from the API server. Anything missed here? cc @yujuhong
The attach/detach controller does react to pod deletion events: once the pod is deleted from the API server, it triggers detach. However, it does not trigger detach if the pod state is "Unknown", because we don't want to detach and potentially corrupt user data when the pod state is unknown. We depend on some outside entity (node repair tool?) stepping in and deleting these pods. The question is: if @krzyzacy has node-auto-repair on, why didn't it delete the pods?
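As a sketch of what that outside entity would have to do (the helper name and field-selector usage are assumptions; this is not the node-auto-repair implementation), force-deleting the pods bound to the unhealthy node is enough to make the controller see deletion events:

```go
package noderepair

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deletePodsOnNode force-deletes every pod bound to the given (unhealthy)
// node, so the attach/detach controller sees the deletions and can detach
// the volumes.
func deletePodsOnNode(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	pods, err := cs.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{
		// Select only the pods scheduled onto the unhealthy node.
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	grace := int64(0) // force delete: remove the pod object immediately
	for _, p := range pods.Items {
		if err := cs.CoreV1().Pods(p.Namespace).Delete(ctx, p.Name, metav1.DeleteOptions{
			GracePeriodSeconds: &grace,
		}); err != nil {
			return err
		}
	}
	return nil
}
```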
I checked the log again. I think there are a number of cases where the node is not marked as NotReady and there is no record of the pod status, but somehow the replica_set controller starts a new pod because the count of running pods is no longer 1. The new pod cannot attach the volume because it is still attached to a different node. Eventually the old pod is deleted, but that's because @krzyzacy manually deleted the deployment.
Yeah, PVs and ReplicaSets/Deployments don't play well together. @krzyzacy can we close this issue now that boskos moved to a StatefulSet?
yeah I'll close the issue, but we can keep the discussion going :-) |
@Kargakis But it seems like, for some reason, the replica_set controller tries to create a second pod even though the node status does not change to "NotReady". I am not sure of the pod status; we might need to check the kubelet log.
The ReplicaSet controller will create a replacement as soon as the previous pod is marked for deletion. RC/RS/D are optimized for availability.
I got more information from the kubelet log. The events happened as follows:
The following log is from the kubelet during eviction; note this log is in reversed order:
So the problem is that after the pod is evicted, the pod is not deleted from the API server (I think the pod is in an Evicted state). The attach_detach controller does not receive a pod-delete event from the API server, so it will not detach.
It should not be deleted from the API server when it is evicted; this allows introspection of the pod after it is evicted. I believe the correct behavior here is for the attach_detach controller to identify that Pod.Status.Phase == Failed or Succeeded. The only part I am not sure of is whether there is a requirement for the volume to be unmounted before it is detached. In that case it may be trickier for the attach_detach controller to identify that the volume has been unmounted.
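To illustrate the unmount caveat: the kubelet reports mounted volumes in node.Status.VolumesInUse, so a check along these lines (the helper itself is just a sketch) could tell the controller whether detaching is safe yet:

```go
package sketch

import v1 "k8s.io/api/core/v1"

// volumeStillMountedOnNode reports whether the kubelet on the given node
// still lists the volume as in use; detaching before it is unmounted risks
// data loss.
func volumeStillMountedOnNode(node *v1.Node, volName v1.UniqueVolumeName) bool {
	for _, inUse := range node.Status.VolumesInUse {
		if inUse == volName {
			return true
		}
	}
	return false
}
```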
I just recalled that we have already changed the attach_detach controller to handle terminated pods. Please see kubernetes/kubernetes#45286. The PR was merged on May 11 and is included after the 1.7.0 release.
/area boskos
/assign
Redeploying fixes it, but I still want to fix the actual issue.