Certain nodes cannot attach Trident volumes #811
Comments
I checked how long the terminating state of the trident controller lasted when the issue occurred. In the sample I captured, it took 22 seconds to complete the shutdown. I suspect that the longer the shutdown takes, the more often the issue occurs.
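Roughly how the Terminating duration can be observed; the namespace and label selector are assumptions for a default operator-based install:

```sh
# Watch the controller pods and note how long the old pod stays in Terminating.
# Namespace and label selector are assumptions for a default operator install.
kubectl get pods -n trident -l app=controller.csi.trident.netapp.io -w \
  | while read -r line; do echo "$(date +%T) $line"; done
```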
@tksm, thanks for posting this additional information. I believe you are right that the Trident controller is taking longer to terminate than it did previously. The team is considering how to fix this issue.
@tksm This makes a lot of sense. We definitely don't support having multiple Trident controller pods running at the same time. I think fundamentally this is a Kubernetes issue: it is not waiting for the first pod to fully terminate before creating the new pod. It should be waiting, since we use the "Recreate" strategy, but perhaps k8s has a bug when the pod is evicted. A second k8s issue is that the trident-csi-controller Service is sending traffic to a terminating pod, possibly related to kubernetes/kubernetes#110171. On the Trident side, I think there are a couple of straightforward mitigations we could implement to reduce the chance of this race condition.
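For reference, the strategy can be checked on the controller Deployment; the Deployment name and namespace below assume a default operator-based install:

```sh
# Confirm the controller Deployment uses the Recreate strategy (name/namespace assumed).
kubectl get deployment trident-controller -n trident -o jsonpath='{.spec.strategy.type}'
# Expected output: Recreate
```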
Another idea is to have the Trident controller reconcile its knowledge of nodes on a periodic basis; this may have other implications we would need to think through. Also, according to kubernetes/kubernetes#115819, k8s shouldn't even be waiting for the evicted Trident controller to terminate.
@ameade Hi, thanks for the information. I believe the periodic node reconciliation is a fundamental solution, but these mitigations look good for lowering the chance of hitting the issue.
As far as I know, this is documented behavior of the "Recreate" strategy rather than a bug. This behavior is noted in the Recreate Deployment section of the Kubernetes documentation as follows.
@tksm Ah, I see. I guess the Pod eviction falls under deleting the pod directly rather than a deployment upgrade, which makes sense. I have a fix for what is slowing down the Trident shutdown, which should practically remove this race condition. Perhaps another mitigation is changing the Trident controller to a StatefulSet.
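A minimal sketch of what the StatefulSet mitigation could look like; all names, labels, and the image tag are assumptions rather than Trident's actual manifests, and the real controller pod has several more containers:

```sh
# Hypothetical sketch only: a StatefulSet will not create a replacement for
# trident-controller-0 until the old pod object is fully deleted, which closes
# the window where two controllers run at once.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: trident-controller
  namespace: trident
spec:
  serviceName: trident-csi-controller   # Service name mentioned earlier in the thread
  replicas: 1
  selector:
    matchLabels:
      app: controller.csi.trident.netapp.io   # assumed label
  template:
    metadata:
      labels:
        app: controller.csi.trident.netapp.io
    spec:
      containers:
      - name: trident-main               # assumed container name
        image: netapp/trident:23.01.0    # assumed image tag
EOF
```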
Describe the bug
We occasionally have a problem where specific nodes cannot attach Trident volumes. Once it happens, the affected nodes never recover until the Trident node pods are recreated.
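The workaround we use is roughly the following; the DaemonSet name and namespace are assumptions for a recent operator-based install and may differ on older installs:

```sh
# Recreate the Trident node pods so they re-register with the current controller.
# DaemonSet name and namespace are assumptions; check with `kubectl get ds -n trident`.
kubectl rollout restart daemonset/trident-node-linux -n trident
```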
In this situation, pods with Trident volumes get stuck in the ContainerCreating state with the following error.
After the investigation, we found the following:

- "Added a new node" log messages for the affected nodes were found in the old trident-controller pod.

Having two trident-controller pods might be the cause of the problem. Some nodes might register with the old trident-controller, so the new trident-controller does not know about those nodes, causing the "not found" error.
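A sketch of how the registration messages can be checked in both controller pods; the pod placeholders, container name, and namespace are assumptions:

```sh
# Look for the node registration message in the old and new controller pods.
kubectl logs -n trident <old-trident-controller-pod> -c trident-main | grep "Added a new node"
kubectl logs -n trident <new-trident-controller-pod> -c trident-main | grep "Added a new node"
```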
Environment
Provide accurate information about the environment to help us reproduce the issue.
silenceAutosupport: true (Trident Operator)

To Reproduce
We can confirm that having multiple trident-controller pods causes the issue with the following steps (see the command sketch below the list).
1. Set up Trident with trident-operator v23.01.0.
2. Remove the trident-operator deployment.
3. Manually increase the replicas of the trident-controller deployment to 5 (we increase the replicas manually because the issue is otherwise difficult to reproduce due to its timing nature).
New nodes will register with one of the five trident-controller pods. Some of the pods that use Trident volumes will likely get stuck in the ContainerCreating state.
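A command-level sketch of the reproduction; the namespace and object names assume a default operator-based install and may differ:

```sh
# Force multiple controller pods to run concurrently (namespace/names assumed).
kubectl delete deployment trident-operator -n trident       # stop the operator from reverting changes
kubectl scale deployment trident-controller -n trident --replicas=5
# Then add new nodes and schedule pods that use Trident volumes on them.
```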
Expected behavior
All nodes can attach Trident volumes.
Additional context