
Rework the Understanding Node Rebooting topic #23256

Closed
mburke5678 wants to merge 2 commits

Conversation

mburke5678 (Contributor)

@openshift-docs-preview-bot

The preview will be available shortly at:

@@ -13,12 +13,8 @@ With this in place, if only two infrastructure nodes are available and one is re
pod is prevented from running on the other node. `*oc get pods*` reports the pod as unready until a suitable node is available.
Once a node is available and all pods are back in ready state, the next node can be restarted.
Member

Can podAntiAffinity actually block further node reboots? Linking the upstream docs too, in case that helps. My impression is that it just limits pod scheduling, and if you also have a pod disruption budget, perhaps the cluster would have to hold off on further node reboots until sufficient evicted pods had been rescheduled and readied elsewhere. Maybe this module is really about configuring robust workloads, and not specific to rebooting or nodes at all?
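
For concreteness, a rough sketch of the combination being discussed: a required podAntiAffinity rule that spreads replicas across nodes, plus a PodDisruptionBudget so a drain cannot evict below one ready replica. All names and values here are illustrative, not taken from the module under review.

```yaml
# Illustrative sketch only: spread two replicas across nodes and limit voluntary evictions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-workload                          # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-workload
  template:
    metadata:
      labels:
        app: example-workload
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: example-workload
            topologyKey: kubernetes.io/hostname   # at most one replica per node
      containers:
      - name: example
        image: registry.example.com/example:latest   # placeholder image
---
apiVersion: policy/v1beta1                        # policy/v1 on newer clusters
kind: PodDisruptionBudget
metadata:
  name: example-workload-pdb
spec:
  minAvailable: 1                                 # an eviction must leave at least one replica ready
  selector:
    matchLabels:
      app: example-workload
```

If that reading is right, the anti-affinity rule only constrains where replacement pods can schedule; it is the disruption budget that makes a drain wait until enough replicas are ready elsewhere before the next eviction is allowed.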

@@ -3,7 +3,7 @@
// * nodes/nodes-nodes-rebooting.adoc

[id="nodes-nodes-rebooting-router_{context}"]
= Understanding how to reboot nodes running routers
= About rebooting a node that is running a router
Member

Is this module really an ops thing that cluster admins should care about? It seems more like "if you are creating your own router load balancer, here's a bit of context". Which would be an install-time thing. As it stands, it's not clear to me who reads this module, or what action we have enabled them to take post-reading.

Another challenge is how to handle nodes that are running critical
infrastructure such as the router or the registry. The same node evacuation
process applies, though it is important to understand certain edge cases.
Another challenge is how to handle xref:../../nodes/nodes/nodes-nodes-rebooting.html#nodes-nodes-rebooting-infrastructure_nodes-nodes-rebooting[rebooting master nodes], which run critical infrastructure such as a router or the registry. The same node evacuation process applies, though it is important to xref:../../nodes/nodes/nodes-nodes-rebooting.html#nodes-nodes-rebooting-router_nodes-nodes-rebooting[understand certain edge cases].
wking (Member) commented Jun 24, 2020

We explicitly exclude the router pods from control-plane nodes by default because of an upstream bug (only recently fixed). Should we consolidate here to just talk about the registry? I'd expect the registry operator to have the appropriate pod disruption budgets in place already, and not be something we needed to walk folks through in docs. I'd also be very surprised if we run the registry on control-plane nodes by default, but I don't have source links to back that up right now.

mburke5678 (Contributor, Author) commented Jun 26, 2020

@brenton If you don't mind a trip in the way-back machine, you added this section to the docs in July 2016 [1]. Are you able to help @wking with some of his concerns? I have made some attempts. Perhaps the original author might be a better source?

[1] e4ff0de

brenton commented Jun 26, 2020

I agree with Trevor. The documentation I originally wrote for version 3 shouldn't really be needed. Operators should ensure that rebooting a host is an easy and harmless thing to do.

mburke5678 (Contributor, Author)

@brenton Thank you for your comment. Just to be clear, the section you wrote, which you called Rebooting Nodes (and is renamed Understanding node rebooting) [1], can be removed from the 4.x docs because the operators do ensure that rebooting a host is easy, making these topics unnecessary?
[1] https://docs.openshift.com/container-platform/4.4/nodes/nodes/nodes-nodes-rebooting.html

wking (Member) commented Jun 27, 2020

I think it's still worth a section on (anti-)affinity, PDBs, etc. (or link out to the upstream docs), to remind folks of these tools to make their workloads robust. And probably also calling out the lack of an in-cluster reboot knob, and the safety of either hitting a provider-side reboot button or deleting the machine (linking over to the MachineHealthCheck docs). And the fact that you really don't want to break etcd quorum, so you should be very cautious about doing anything to the control-plane machines until we grow product-side MHCs for them.
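
To make the MachineHealthCheck pointer concrete, a minimal sketch of an MHC that remediates unready worker machines. The name, labels, and timeouts are illustrative assumptions (workers labeled machine.openshift.io/cluster-api-machine-role: worker), and control-plane machines are deliberately left out, per the etcd-quorum caution above.

```yaml
# Illustrative sketch only: replace worker machines whose nodes stay unready,
# while leaving control-plane machines alone to avoid risking etcd quorum.
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-health-check            # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker   # workers only
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s
  - type: Ready
    status: Unknown
    timeout: 300s
  maxUnhealthy: 40%                    # stop remediating if too many machines look unhealthy at once
```

This is only a sketch under those assumptions; the actual defaults and knobs would belong in the MachineHealthCheck docs being linked to.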

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci-robot added the lifecycle/stale label on Oct 24, 2020
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Nov 23, 2020
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closed this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mburke5678 mburke5678 deleted the issue-23244 branch January 14, 2021 15:45
Labels: branch/enterprise-4.2, branch/enterprise-4.3, branch/enterprise-4.4, branch/enterprise-4.5, do-not-merge/work-in-progress, lifecycle/rotten