Rework the Understanding Node Rebooting topic #23256
Conversation
The preview will be available shortly at:
@@ -13,12 +13,8 @@ With this in place, if only two infrastructure nodes are available and one is re
pod is prevented from running on the other node. `*oc get pods*` reports the pod as unready until a suitable node is available.
Once a node is available and all pods are back in ready state, the next node can be restarted.
Can `podAntiAffinity` actually block further node reboots? Linking the upstream docs too, in case that helps. My impression is that it just limits pod scheduling, and if you also have a pod disruption budget, perhaps the cluster would have to hold off on further node reboots until sufficient evicted pods had been rescheduled and readied elsewhere. Maybe this module is really about configuring robust workloads, and not specific to rebooting or nodes at all?
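As a sketch of the pattern under discussion (names and images here are illustrative, not from the PR or the product defaults): a `podAntiAffinity` rule keeps two replicas of a workload off the same node, which is the scheduling-only behavior the reviewer describes.

```yaml
# Illustrative only: a router-style Deployment whose replicas refuse to
# co-schedule on the same node via required anti-affinity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-router        # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-router
  template:
    metadata:
      labels:
        app: example-router
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: example-router
            # One replica per node, keyed on the node hostname label.
            topologyKey: kubernetes.io/hostname
      containers:
      - name: router
        image: example/router:latest   # placeholder image
```

Note that anti-affinity only constrains where pods may be scheduled; it does not by itself block a node drain or reboot, which matches the reviewer's reading. Blocking voluntary evictions is the job of a pod disruption budget.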
@@ -3,7 +3,7 @@
// * nodes/nodes-nodes-rebooting.adoc

[id="nodes-nodes-rebooting-router_{context}"]
-= Understanding how to reboot nodes running routers
+= About rebooting a node that is running a router
Is this module really an ops thing that cluster admins should care about? It seems more like "if you are creating your own router load balancer, here's a bit of context". Which would be an install-time thing. As it stands, it's not clear to me who reads this module, or what action we have enabled them to take post-reading.
-Another challenge is how to handle nodes that are running critical
-infrastructure such as the router or the registry. The same node evacuation
-process applies, though it is important to understand certain edge cases.
+Another challenge is how to handle xref:../../nodes/nodes/nodes-nodes-rebooting.html#nodes-nodes-rebooting-infrastructure_nodes-nodes-rebooting[rebooting master nodes], which are running critical infrastructure such as a router or the registry. The same node evacuation process applies, though it is important to xref:../../nodes/nodes/nodes-nodes-rebooting.html#nodes-nodes-rebooting-router_nodes-nodes-rebooting[understand certain edge cases].
We explicitly exclude the router pods from control-plane nodes by default because of an upstream bug (only recently fixed). Should we consolidate here to just talk about the registry? I'd expect the registry operator to have the appropriate pod disruption budgets in place already, and not be something we needed to walk folks through in docs. I'd also be very surprised if we run the registry on control-plane nodes by default, but I don't have source links to back that up right now.
I agree with Trevor. The documentation I originally wrote for version 3 shouldn't really be needed. Operators should ensure that rebooting a host is an easy and harmless thing to do.
@brenton Thank you for your comment. Just to be clear, the section you wrote, which you called Rebooting Nodes (and is renamed Understanding node rebooting) [1], can be removed from the 4.x docs because the operators do ensure that rebooting a host is easy, making these topics unnecessary?
I think it's still worth a section on (anti-)affinity, PDBs, etc. (or link out to the upstream docs), to remind folks of these tools to make their workloads robust. And probably also calling out the lack of an in-cluster reboot knob, and the safety of either hitting a provider-side reboot button or deleting the machine (linking over to the MachineHealthCheck docs). And the fact that you really don't want to break etcd quorum, so you should be very cautious about doing anything to the control-plane machines until we grow product-side MHCs for them.
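To make the PDB half of that suggestion concrete (an illustrative fragment, not part of the PR; the names match the hypothetical workload above): a `PodDisruptionBudget` is what makes a node drain wait until enough replicas are ready elsewhere, rather than evicting the last healthy pod.

```yaml
# Illustrative only: require at least one ready replica at all times.
# A drain that would violate this budget blocks until another replica
# is scheduled and ready on a different node.
apiVersion: policy/v1           # policy/v1beta1 on older clusters
kind: PodDisruptionBudget
metadata:
  name: example-router-pdb      # hypothetical name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: example-router       # must match the workload's pod labels
```

This is the mechanism behind "the cluster would have to hold off on further node reboots": the eviction API respects the budget, while anti-affinity only decides where the replacement pod may land.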
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close
@openshift-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
#23244