
Add node status to API object. #2315

Merged: 1 commit, Dec 11, 2014

Conversation

@ddysher (Contributor, Author) commented Nov 12, 2014

@smarterclayton @lavalamp I'm not sure if this is enough to get it to work. Beat me up.

@smarterclayton Should I also take care of v1beta3?

@smarterclayton (Contributor)

Add it to v1beta3 (in the new form) under Node. I would recommend not taking on any more of the v1beta3 changes in this pull - we should do the internal refactor of Node next but that can build on top of this.

@smarterclayton (Contributor)

I think this looks ok otherwise - I haven't followed any other discussions about node conditions. Is the scheduler expected to only schedule on nodes with a certain condition? In another issue we discussed the admin being explicitly able to mark a node as not accepting newly scheduled pods - is that something that would be reflected in Condition, or separate? Seems like you might still schedule onto vanished or unhealthy nodes in some cases.

@ddysher (Contributor, Author) commented Nov 12, 2014

The Condition here is meant to replace the node health check. I don't know which issue you are referring to, but we can add another condition value, or a new field in NodeStatus, to let the admin disable a node.

In what cases would we schedule onto a vanished/unhealthy node? I think at first we can have the apiserver filter out unhealthy/vanished nodes. Then we can move to selector-based lists, so the scheduler can list only healthy nodes, kubectl can list all nodes, etc.?

@smarterclayton (Contributor)

On Nov 11, 2014, at 11:55 PM, Deyuan Deng notifications@github.com wrote:

The Condition here is meant to replace the node health check. I don't know which issue you are referring to, but we can add another condition value, or a new field in NodeStatus, to let the admin disable a node.

It would be part of spec in v1beta3 (set by the user), but I don't think we have to handle it here.

In what cases would we schedule onto a vanished/unhealthy node? I think at first we can have the apiserver filter out unhealthy/vanished nodes. Then we can move to selector-based lists, so the scheduler can list only healthy nodes, kubectl can list all nodes, etc.?

It can. Are we planning on rigorously defining, as part of the API spec, what Vanished means to a client, like we do for pods? Or is it vaguer? The precision of PodCondition is very important, whereas the definition of Vanished seems like it could be very flexible.


@bgrant0607 (Member)

I will take a look at this this morning.

@bgrant0607 bgrant0607 self-assigned this Nov 12, 2014
@bgrant0607 bgrant0607 added the area/api label Nov 12, 2014
// NodeUnhealthy means the Node is running but unhealthy
NodeUnhealthy NodeCondition = "Unhealthy"
// NodeVanished means the Node is not reachable
NodeVanished NodeCondition = "Vanished"
Member:

How do you plan to define "Vanished", and what behavior do you plan to attach to it?

ddysher (Author):

Here "Vanished" is defined as not reachable from the cluster, i.e. connection refused from the node's kubelet port. Ocasional network failure or machine restarts can make a node unreachable, so we'll need a policy surround the definition, like N out of M tries, at most no connection in X min, etc. I plan to enforce this on node-controller.

@dchen1107 (Member)

Besides the questions on Vanished raised by others, should we define a condition for a node in a staging state: healthy, but not ready to provide any services yet? Or does this belong to the kubelet readiness state?

@@ -576,6 +594,8 @@ type Minion struct {
HostIP string `json:"hostIP,omitempty" yaml:"hostIP,omitempty"`
// Resources available on the node
NodeResources NodeResources `json:"resources,omitempty" yaml:"resources,omitempty"`
// Status describes the current status of a Node
Status NodeStatus `json:"status,omitempty" yaml:"status,omitempty"`
Member:

This new status field is fine, but FYI:

  • We're trying to move our internal representation towards v1beta3.
  • There should eventually be just TypeMeta, ObjectMeta, NodeSpec, and NodeStatus in Minion/Node (roughly as sketched below).
  • NodeResources was added to the wrong place in v1beta3 and HostIP was apparently dropped (not sure whether that was intentional or not).
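
For reference, a rough sketch of the eventual shape described in the second point above (only the four top-level members come from the comment; the tags are illustrative):

// Sketch only: the eventual v1beta3-style shape of Minion/Node. TypeMeta,
// ObjectMeta, NodeSpec, and NodeStatus are assumed to be the existing/planned
// types in the same API package.
type Node struct {
	TypeMeta   `json:",inline" yaml:",inline"`
	ObjectMeta `json:"metadata,omitempty" yaml:"metadata,omitempty"`

	// Spec defines the desired configuration of the node.
	Spec NodeSpec `json:"spec,omitempty" yaml:"spec,omitempty"`
	// Status describes the most recently observed status of the node.
	Status NodeStatus `json:"status,omitempty" yaml:"status,omitempty"`
}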

ddysher (Author):

Thanks!

Is there any particular reason we still leave NodeResources here, or is it just not cleaned up yet?

Contributor:

I would assume whoever does the internal refactor should clean it up.

ddysher (Author):

SG. I can investigate this, but I don't fully understand how we organize our API versions. @smarterclayton

For the NodeCondition change, I need to update all 3 versions and the internal version, but for dropping HostIP, it seems we only did that in v1beta3. What's the rationale behind this?

Contributor:

@bgrant0607 Should NodeResources be split into Spec and Status (Capacity in Spec, whatever represents the current state in Status)?

Member:

Short answer is yes.

See resourceCapacitySpec in https://github.com/GoogleCloudPlatform/kubernetes/blob/master/docs/resources.md. I updated it this week.

Status is more complicated. See Usage Data in the Appendix of resources.md.
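
A very rough sketch of the capacity half of that split, loosely following resources.md (ResourceList is assumed to be the resource-name-to-quantity map described there; usage data under status is deliberately left out):

// Sketch only: capacity belongs under the node's spec.
type NodeSpec struct {
	// Capacity is the total amount of each resource the node makes available
	// to pods (e.g. cpu, memory).
	Capacity ResourceList `json:"capacity,omitempty" yaml:"capacity,omitempty"`
}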

@bgrant0607 (Member)

Sorry, I haven't had time to follow node controller discussions, either.

@smarterclayton

Vanished probably comes from #1366.

We do eventually need a rigorous definition of vanished. There will be a standard definition, though it may be configurable per k8s cluster. As discussed in #1366, once a node is considered vanished, I'd like the node controller to kill the pods on that node. If we need more flexibility, we'd use forgiveness to override the default kill policy.

Schedulers should only schedule on Healthy nodes.

I expect to add more flavors of not schedulable in the future, such as Lame (node controller draining running pods gracefully -- deliberately unschedulable via API call to apiserver and/or to Kubelet), Disabled (deliberately kill pods and make unschedulable via API call to apiserver and/or to Kubelet), etc.

@bgrant0607 (Member)

@dchen1107 Good point. Not sure we want healthy, unhealthy, vanished to be the values of NodeCondition. Will think about this a bit.

@smarterclayton (Contributor)

Seems like something can be both disabled and unhealthy at the same time, e.g. when an ops team wants to take a node out of the scheduling pool and drain pods by deleting them slowly, but a temporary network problem occurs in the middle. Tooling shouldn't be confused about the state of the node if that happens.

@bgrant0607 (Member)

@smarterclayton Agree. For explicit disabling/laming, a NodeSpec field would be required.

@ddysher (Contributor, Author) commented Nov 13, 2014

@dchen1107 I believe we will eventually need such a staging state, especially when Kubernetes can scale nodes dynamically. At that point, we'll definitely need a rigorous definition, as the cutoff between staging and unhealthy can be subtle.

@bgrant0607 Thanks for reviewing, some comments:

  • Agreed with disabling/laming being another NodeSpec field. After moving the health check out of the minion registry, I'd like to implement a watch interface on the minion registry, which should make it easier to watch for any node spec changes.
  • Regarding @dchen1107's suggestion, why do you think healthy, unhealthy, and vanished may not be wanted?
  • The current NodeCondition is basically a rework of the minion health check. Is the Node lifecycle well enough understood in k8s that we can give a rigorous definition? If so, I can get on board with it; if not, can we finish the rework first?

@bgrant0607 (Member)

I think we want to distinguish lifecycle from recently observed behavior.

By analogy with pods, NodeCondition should reflect lifecycle. Proposed conditions could be Pending, Running, Shutdown.

Then similar to ContainerState, we could provide more detailed information, including liveness, readiness, and reachability, each with time of the last transition (e.g., live -> not live).

Any condition (in the general meaning of the word) that could have a fairly open-ended number of causes should have an associated Reason and possibly Message.

@ddysher (Contributor, Author) commented Nov 16, 2014

It's beyond the initial scope of NodeCondition, but I'll extend the idea here since it's desirable to have lifecycle management for nodes.

Pending and Running look good to me, but I'm not sure Shutdown is appropriate here. How would the controller know whether a node is shut down or not? I'd prefer Stopped or Vanished, which cover more possibilities. Also, I think we need an Unhealthy state for node status to be complete, though we could also fold it into the Stopped state. I've included a brief summary, with a rough Go sketch after it.

Pending

Pending means the node has been registered in Kubernetes but is not ready to accept new pods. Either the node-controller registers the node, or the kube admin does (via plugin or CLI). Pending is a transient state: a node in Pending will eventually become Running, Unhealthy, or Stopped.

Transition

  • If a node is just added to k8s, its condition becomes Pending.
  • If a node is re-enabled by the admin, it becomes Pending.
  • If, within X minutes, the node starts responding, it is marked as Running.
  • If, within X minutes, the node starts responding with errors, it is marked as Unhealthy.
  • After X minutes, the node is marked as Stopped.

Meta Data

  • Reason: This should probably be "Node not ready" or "Node being provisioned".

Running

Running means the node is running and ready to accept pods.

Transition
See others

Meta Data

  • StartedAt: The time the node started running.

Disabled

Disabled means the node is explicitly marked by the admin to not accept any pods.

Transition

  • An admin disables the node. When re-enabled, the node enters the Pending state.

Meta Data

  • DisabledAt: The time the node was disabled
  • Reason: The reason for disabling the node
  • Message: A detailed message.

Unhealthy

Unhealthy means the node is responding to health checks, but there is an error on the node.

Transition

  • If a node fails the health check due to an error status (OOM, etc.), it is marked as Unhealthy.
    • TODO: Define errors

Meta Data

  • LastHealthyAt: The latest time the node reported healthy
  • UnhealthyAt: The time the node became unhealthy.
  • Reason: The reason for the unhealthy state.
  • Message: A detailed message.

Stopped (Shutdown, Vanished)

Stopped means the node is not responding to health checks.

Transition

  • If a node fails the health check because it does not respond, it is marked as Stopped.
    • TODO: How about machine restart?

Meta Data

  • StoppedAt: The time the node stopped.
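
For concreteness, a rough Go sketch of the states summarized above (this is just the summary transcribed into code; none of the names or fields are final, and util.Time is assumed to be the API's existing time type):

// Sketch only: the states and per-state metadata from the summary above.
type NodeCondition string

const (
	NodePending   NodeCondition = "Pending"
	NodeRunning   NodeCondition = "Running"
	NodeDisabled  NodeCondition = "Disabled"
	NodeUnhealthy NodeCondition = "Unhealthy"
	NodeStopped   NodeCondition = "Stopped" // a.k.a. Shutdown / Vanished
)

type NodeStatus struct {
	Condition     NodeCondition
	StartedAt     util.Time // Running: when the node started running
	DisabledAt    util.Time // Disabled: when the admin disabled the node
	LastHealthyAt util.Time // Unhealthy: last time the node reported healthy
	UnhealthyAt   util.Time // Unhealthy: when the node became unhealthy
	StoppedAt     util.Time // Stopped: when the node stopped responding
	Reason        string    // short explanation, where applicable
	Message       string    // detailed message, where applicable
}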

@bgrant0607 (Member)

Lifecycle stages should be similar to pods. When created/added to the cluster, they start in a pending state, progress to an active state, and then end in a terminated state when the node is deleted or potentially when it disappears from the host provider (e.g., the VM is deleted from GCE). It shouldn't be possible to go back to a prior stage without re-creating the node object.

Everything else should be current status.

@bgrant0607 (Member)

I think we should have an array of status info, each with a kind of health, starting with Reachable, Live, Ready.

Additionally, in the spec, there should be a Schedulable field and a Runnable field. We'll want to make this fancier in various ways in the future.

Will address the rest in a bit.

@bgrant0607 (Member)

We should also represent Schedulable and Runnable in status, similar to the other properties, recording time of transition, reason, and message.
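
Taken together with the previous comment, a minimal sketch of what that could look like (the field and type names here are assumptions, not an agreed API):

// Sketch only: desired knobs in spec...
type NodeSpec struct {
	Schedulable bool `json:"schedulable,omitempty" yaml:"schedulable,omitempty"`
	Runnable    bool `json:"runnable,omitempty" yaml:"runnable,omitempty"`
}

// ...mirrored in status with transition time, reason, and message, alongside
// the observed properties (Reachable, Live, Ready).
type NodePropertyStatus struct {
	Kind               string // e.g. "Reachable", "Live", "Ready", "Schedulable", "Runnable"
	Value              bool
	LastTransitionTime util.Time
	Reason             string
	Message            string
}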

@bgrant0607 (Member)

Sigh. This discussion makes it clear what a mistake it was to use "Condition" for what is, essentially, a lifecycle phase (definition: a distinct period or stage in a process of change or forming part of something's development; or stage: a point, period, or step in a process or development).

At least we got Status right.

@smarterclayton @markturansky I feel bad for even considering this, but how would you feel about replacing Condition with Phase (which I prefer to Stage because the latter is used in other ways in deployment contexts)? It looks like Condition only appears in a couple dozen lines of code. That would allow us to repurpose Condition for things like Reachable, Live, Ready, Schedulable, Runnable.

@markturansky (Contributor)

@bgrant0607 yes, that PodCondition refactor was easy. I am happy to change it to PodPhase.

I'll have a new pull for that change and I'll be sure to reference this issue.

@ddysher (Contributor, Author) commented Nov 23, 2014

@bgrant0607 I'm not sure how Schedulable/Runnable fit into Reachable/Live/Ready in NodeStatus. We could also include each kind of unhealthy condition in the status info. So my interpretation of your suggestion is something like the following. (I didn't follow the Pod lifecycle discussion, so correct me if this is wrong.)

type NodeStatus struct {
    // NodePhase is the current phase of the Node, one of "Pending", "Running" and "Terminated". It
    // is populated based on the value of NodeCondition.
    Phase     NodePhase
    Condition NodeCondition
}

// NodeCondition is a detailed condition of the Node. Only one of the fields can be set at a given time.
type NodeCondition struct {
    UnReachable *NodeConditionUnReachable
    Reachable   *NodeConditionReachable
    Schedulable *NodeConditionSchedulable
    Live        *NodeConditionLive
    Ready       *NodeConditionReady
    NotReady    *NodeConditionNotReady
    ...
}

type NodeConditionUnReachable struct {
    Reason            string
    LastReachableTime Time
}
...

For example, when the node is unreachable and its LastReachableTime is unset, the node is Pending; otherwise, it's Terminated.
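
As an illustration of that last rule, a tiny sketch under the proposal above (the phase names are assumed, and Time is assumed to behave like time.Time):

// Sketch: derive the phase from the unreachable condition, per the example above.
func derivePhase(c NodeCondition) NodePhase {
	if c.UnReachable != nil {
		if c.UnReachable.LastReachableTime.IsZero() {
			return NodePending // never been reachable: still pending
		}
		return NodeTerminated // was reachable once, now gone
	}
	return NodeRunning
}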

The tricky part is how do we define different node conditions?

@bgrant0607 (Member)

The running phase may be populated based on condition, but the terminated phase should not be.

My proposal:

type NodeStatus struct {
    // NodePhase is the current lifecycle phase of the Node, one of "Pending", "Running" and "Terminated". 
    Phase     NodePhase
    Conditions []NodeCondition
}
type NodeConditionKind string
const (
  NodeReachable   NodeConditionKind = "Reachable"
  NodeLive        NodeConditionKind = "Live"
  NodeReady       NodeConditionKind = "Ready"
  NodeSchedulable NodeConditionKind = "Schedulable"
  NodeRunnable    NodeConditionKind = "Runnable"
)
type NodeCondition struct {
  Kind NodeConditionKind
  Status string // possible values: true, false, unknown
  LastTransitionTime util.Time
  Reason string
  Message string
}
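
As a usage example (not part of the proposal itself), a consumer such as the scheduler might look up a condition in the array form roughly like this:

// Sketch: find one condition kind in the proposed Conditions array.
func getCondition(conditions []NodeCondition, kind NodeConditionKind) *NodeCondition {
	for i := range conditions {
		if conditions[i].Kind == kind {
			return &conditions[i]
		}
	}
	return nil
}

// A scheduler could then consider only nodes whose Ready condition is "true".
func isReady(status NodeStatus) bool {
	c := getCondition(status.Conditions, NodeReady)
	return c != nil && c.Status == "true"
}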

@ddysher (Contributor, Author) commented Nov 25, 2014

OK, I think the idea is the same here: with NodeCondition.Status, unhealthy conditions are also included in the node status.

Currently, kubelet healthz is pretty weak; we cannot gather much information from it. The only information we can get is Reachable/Unreachable (whether the connection succeeds) and Ready/NotReady (whether the status is ok), so we'll start with a subset of the conditions.

I'm not sure what "Live" means here, or in general, the difference between Live/Reachable/Ready and the reason we include all them in node condition.

We'll want to make sure that, at any given time, all of the conditions exist for the node, so I'd say changing the slice to a map would make this easier to implement.

Lastly, I think we still need to populate NodePhase based on NodeCondition, even for the Terminated phase. NodePhase is really just a summary of NodeCondition, wrapped in our policy (which defines which combination of conditions belongs to which phase).

@kubernetes-bot

Can one of the admins verify this patch?

@bgrant0607 (Member)

Discussed IRL. Summary:

NodePhase should not be derived from NodeCondition -- I'm deliberately trying to keep them orthogonal. NodePhase is about provisioning status: Pending means the node has been created/added but not configured. (We need some mechanism to configure nodes on demand, to accommodate things like auto-scaled clusters.) Running means that it has been configured and has the Kubelet running. Terminated means the node has been removed from the cluster (e.g., its VM has been deleted).

I'm fine if we just start with one NodeCondition -- say, "Ready". We can add the others when we have more information.

Re. why array of conditions rather than map, see #2004.

I'm not completely opposed to aggregating default schedulability and runnability from NodeConditions, but as separate fields and not as part of NodePhase. However, such summary fields would become confusing later when we allow exceptions to the default policies, such as forgiveness (#1574).

@ddysher ddysher force-pushed the node-status branch 2 times, most recently from cf0f8b8 to 5d0fc30 on December 3, 2014
@ddysher (Contributor, Author) commented Dec 3, 2014

Updated PR based on discussion. Two changes made after @bgrant0607 's last comment:

  1. For NodeCondition, I start with "Reachable" and "Ready". The definitions are included in the code comments. For 'Reachable', we could probe at a lower level, but if HTTP is not working, the node is not reachable from the master's point of view anyway, so I'm sticking with HTTP for now.
  2. Created a new type for NodeCondition.Status

Now that we want to keep node phase separate from node condition, we definitely won't fold schedulability and runnability into NodePhase. The two properties can (I think) be summarized from NodeConditions, but we need to persist them to support forgiveness, e.g. we need to know when a node becomes unschedulable. I actually like the idea of having a separate field, but that should probably be discussed in another issue.
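
A minimal sketch of the HTTP check described in point 1 (the port, path, and helper names are assumptions, not the PR's actual code): "Reachable" means the kubelet answered at all, "Ready" means it answered /healthz with a 2xx status.

package nodecontroller

import (
	"fmt"
	"net/http"
	"time"
)

// probeNode sketches the HTTP-level check: any answer means Reachable,
// a 2xx answer on /healthz means Ready.
func probeNode(host string, port int) (reachable, ready bool) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s:%d/healthz", host, port))
	if err != nil {
		return false, false // connection failed: not reachable from the master
	}
	defer resp.Body.Close()
	return true, resp.StatusCode >= 200 && resp.StatusCode < 300
}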

@ddysher (Contributor, Author) commented Dec 9, 2014

@bgrant0607 What's the status of this? Are we going to merge it, or do we need more discussion?

@bgrant0607 (Member)

@ddysher Was busy all last week. Will try to look it over soon.

type NodeConditionStatus string

// These are valid condition status. "ConditionTrue" means node is in the condition;
// "ConditionFalse" means node is not in the condition; "ConditionUnknown" means kuernetes
Member:

typo: Kubernetes

@bgrant0607 (Member)

Looks pretty good. Thanks! Just a few minor comments.

@ddysher (Contributor, Author) commented Dec 11, 2014

Comments addressed. Thanks!

)

type NodeCondition struct {
Kind NodeConditionKind
Member:

Please add json and description tags to all the fields.

ddysher (Author):

Done.

Labels: area/api, area/nodecontroller
6 participants