
Rework the Understanding Node Rebooting topic #23256

Closed
mburke5678 wants to merge 2 commits

Conversation

mburke5678 (Contributor)

@openshift-docs-preview-bot

The preview will be available shortly at:

@@ -13,12 +13,8 @@ With this in place, if only two infrastructure nodes are available and one is re
pod is prevented from running on the other node. `*oc get pods*` reports the pod as unready until a suitable node is available.
Once a node is available and all pods are back in ready state, the next node can be restarted.
Member

Can podAntiAffinity actually block further node reboots? Linking the upstream docs too, in case that helps. My impression is that it just limits pod scheduling, and if you also have a pod disruption budget, perhaps the cluster would have to hold off on further node reboots until sufficient evicted pods had been rescheduled and readied elsewhere. Maybe this module is really about configuring robust workloads, and not specific to rebooting or nodes at all?
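
For concreteness, a rough sketch of the combination being discussed: a required podAntiAffinity rule that spreads replicas across nodes, plus a PodDisruptionBudget so a drain cannot evict below one ready replica. All names and values here are illustrative, not taken from the module under review.

```yaml
# Illustrative sketch only: spread two replicas across nodes and limit voluntary evictions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-workload                          # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-workload
  template:
    metadata:
      labels:
        app: example-workload
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: example-workload
            topologyKey: kubernetes.io/hostname   # at most one replica per node
      containers:
      - name: example
        image: registry.example.com/example:latest   # placeholder image
---
apiVersion: policy/v1beta1                        # policy/v1 on newer clusters
kind: PodDisruptionBudget
metadata:
  name: example-workload-pdb
spec:
  minAvailable: 1                                 # an eviction must leave at least one replica ready
  selector:
    matchLabels:
      app: example-workload
```

If that reading is right, the anti-affinity rule only constrains where replacement pods can schedule; it is the disruption budget that makes a drain wait until enough replicas are ready elsewhere before the next eviction is allowed.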

@@ -3,7 +3,7 @@
// * nodes/nodes-nodes-rebooting.adoc

[id="nodes-nodes-rebooting-router_{context}"]
= Understanding how to reboot nodes running routers
= About rebooting a node that is running a router
Member

Is this module really an ops thing that cluster admins should care about? It seems more like "if you are creating your own router load balancer, here's a bit of context". Which would be an install-time thing. As it stands, it's not clear to me who reads this module, or what action we have enabled them to take post-reading.

Another challenge is how to handle nodes that are running critical
infrastructure such as the router or the registry. The same node evacuation
process applies, though it is important to understand certain edge cases.
Another challenge is how to handle xref:../../nodes/nodes/nodes-nodes-rebooting.html#nodes-nodes-rebooting-infrastructure_nodes-nodes-rebooting[rebooting master nodes], which run critical infrastructure such as a router or the registry. The same node evacuation process applies, though it is important to xref:../../nodes/nodes/nodes-nodes-rebooting.html#nodes-nodes-rebooting-router_nodes-nodes-rebooting[understand certain edge cases].
wking (Member) commented Jun 24, 2020

We explicitly exclude the router pods from control-plane nodes by default because of an upstream bug (only recently fixed). Should we consolidate here to just talk about the registry? I'd expect the registry operator to have the appropriate pod disruption budgets in place already, and not be something we needed to walk folks through in docs. I'd also be very surprised if we run the registry on control-plane nodes by default, but I don't have source links to back that up right now.

mburke5678 (Contributor, Author) commented Jun 26, 2020

@brenton If you don't mind a trip in the way-back machine, you added this section to the docs in July 2016 [1]. Are you able to help @wking with some of his concerns? I have made some attempts. Perhaps the original author might be a better source?

[1] e4ff0de

brenton commented Jun 26, 2020

I agree with Trevor. The documentation I originally wrote for version 3 shouldn't really be needed. Operators should ensure that rebooting a host is an easy and harmless thing to do.

mburke5678 (Contributor, Author)

@brenton Thank you for your comment. Just to be clear, the section you wrote, which you called Rebooting Nodes (and is renamed Understanding node rebooting) [1], can be removed from the 4.x docs because the operators do ensure that rebooting a host is easy, making these topics unnecessary?
[1] https://docs.openshift.com/container-platform/4.4/nodes/nodes/nodes-nodes-rebooting.html

wking (Member) commented Jun 27, 2020

I think it's still worth a section on (anti-)affinity, PDBs, etc. (or link out to the upstream docs), to remind folks of these tools to make their workloads robust. And probably also calling out the lack of an in-cluster reboot knob, and the safety of either hitting a provider-side reboot button or deleting the machine (linking over to the MachineHealthCheck docs). And the fact that you really don't want to break etcd quorum, so you should be very cautious about doing anything to the control-plane machines until we grow product-side MHCs for them.
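
To make the MachineHealthCheck pointer concrete, a minimal sketch of an MHC that remediates unready worker machines. The name, labels, and timeouts are illustrative assumptions (workers labeled machine.openshift.io/cluster-api-machine-role: worker), and control-plane machines are deliberately left out, per the etcd-quorum caution above.

```yaml
# Illustrative sketch only: replace worker machines whose nodes stay unready,
# while leaving control-plane machines alone to avoid risking etcd quorum.
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-health-check            # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker   # workers only
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: Unknown
    timeout: 300s
  maxUnhealthy: 40%                    # stop remediating if too many machines look unhealthy at once
```

This is only a sketch under those assumptions; the actual defaults and knobs would belong in the MachineHealthCheck docs being linked to.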

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci-robot added the lifecycle/stale label on Oct 24, 2020
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Nov 23, 2020
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closed this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mburke5678 mburke5678 deleted the issue-23244 branch January 14, 2021 15:45
Labels: branch/enterprise-4.2, branch/enterprise-4.3, branch/enterprise-4.4, branch/enterprise-4.5, do-not-merge/work-in-progress, lifecycle/rotten