Add spot-instances machine-api proposal #199
Conversation
this all sounds reasonable to me, just had a few questions inline
## Summary

Enable OCP users to leverage cheaper, non-guaranteed instances to back Machine API Machines.
would this be for all types of machines? (eg control and compute planes)
There's nothing that I'm aware of that would restrict users from using this for control plane machines; the Machine API doesn't really know which role a machine has.
I think we just strongly recommend in the docs that people use common sense and don't use it for control plane machines.
I'm not sure docs are enough for this. Etcd with two masters down bricks the cluster.
agree w/ @sttts there is no valid scenario for running the control plane in its current form on spot instances.
While I agree that there is no scenario where you would want to run the control-plane on spot instances, I think doing anything more than documenting that users shouldn't do this might add a lot of extra complexity to this proposal and MAPI in general.
As I understand it at the moment, the Machine API does not know what the machine it is creating is for. It gets userdata from MCO to apply to the instance when it creates it, but there's nothing in the spec of MAPI to say, "this one is a control-plane, do something special". The role of the machine is a black box to MAPI, so having some check to make sure control-plane machines don't run on spot instances would require adding some extra API field which says `control-plane: true`, but then what validates that that field is actually the truth? To do that, you'd have to understand and interpret the userdata, right?
Secondly, at the moment, there is no automation around control-plane machines; they are normally created by the installer, and if a user wants to add a new machine they can, but it requires several manual steps to reconfigure etcd and DNS. If we assume most people use the installer for creating control-plane machines, then, even if they did specify extra config for spot instance creation, this wouldn't be observed by the terraform parts of the installer, and as such, the control-plane machines would end up on on-demand instances anyway.
I'm relatively new to OpenShift and still learning though, my understanding may be incorrect and I'm definitely open to other ideas on how to solve this problem.
Since we don't create control plane machines as spot types, I think an end user ignoring documentation and creating control plane machines that are spot types (which would require multiple, very specific actions on their part) is along the same lines of foot-shooting as deleting random instances in your AWS account. We can't prevent users from doing super bad things.
We could detect after the fact that machines are control plane nodes, but as previously mentioned, the actual roles are determined via ignition payload from MCO, so it's not ideal to try to determine that ahead of time.
I agree. This would be the same as protecting against setting the right value for the credentials secret, the right ignition secret, etc., which we don't do. Master machines are known to be pets today, which lets the installer set the desired input. When/if a controlPlane controller exists we could enforce opinionated validations and gating there.
###### Termination Notices

Termination notices on AWS are provided via the EC2 metadata service up to 2 minutes before an instance is due to be preempted.
A DaemonSet should be deployed on the cluster which runs Pods on nodes labelled `machine.openshift.io/terminate-on-preemption: true`.
Can we specify:
- What would run this daemonSet? I'd assume MAO.
- Where does this controller implementation live? I'd assume a new controller in each provider actuator repository. This binary could then be run by MAO from the provider image, similar to https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/operator.go#L284 and https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/sync.go#L225-L237
I'm a little uncertain on how this would work at the moment, my current thinking is:
- New controller code for each termination controller goes into the respective actuator repositories
- This makes a new binary which goes into each mapi provider image for the cloud provider
- MAO syncs the daemonset in https://github.com/openshift/machine-api-operator/blob/e55d2b984245439082c3cd8f212b8c9c3380bbbe/pkg/operator/sync.go#L64-L74 if it is supported by the provider
- The daemonset sets its entrypoint to the second binary that has been added to the image
Does that make sense?
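For illustration only, a minimal Go sketch of what such a synced DaemonSet could look like. The object name, the `/termination-handler` entrypoint and the `k8s-app` labels are made-up placeholders; only the node label comes from the proposal text, and whether MAO would template this from a manifest or build it in code is an open implementation detail.

```go
package operator

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newTerminationHandlerDaemonSet sketches the DaemonSet MAO could sync.
// providerImage is the per-cloud provider image that already carries the
// actuator; "/termination-handler" stands in for the second binary.
func newTerminationHandlerDaemonSet(namespace, providerImage string) *appsv1.DaemonSet {
	labels := map[string]string{"k8s-app": "machine-api-termination-handler"}
	return &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "machine-api-termination-handler",
			Namespace: namespace,
		},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					// Only land on nodes that opted in via the label from the proposal.
					NodeSelector: map[string]string{
						"machine.openshift.io/terminate-on-preemption": "true",
					},
					// The cloud metadata services are link-local endpoints,
					// so run in the host network namespace.
					HostNetwork: true,
					Containers: []corev1.Container{{
						Name:    "termination-handler",
						Image:   providerImage,
						Command: []string{"/termination-handler"},
					}},
				},
			},
		},
	}
}
```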
cc @listey
### Goals

- Provide consistent behaviour of Machines running on non-guaranteed instances across cloud providers
This seems great as a goal... but I think it's going to be hard for anyone using this at scale to avoid needing to look at and understand provider-specific semantics and pricing.
Which, reading down, seems to be the case (the provider pricing bits are all provider-specific). We're mostly just exposing the baseline infrastructure for integrating this nicely into OpenShift.
Maybe just call out as a goal something like:
- Support sophisticated provider-specific automation driven by higher-level (3rd party) tooling, e.g. code that dynamically adjusts price targets based on workload
Yeah that's fair, I wrote the goals before doing most of the research 😅 Will reword
A DaemonSet should be deployed on the cluster which runs Pods on nodes labelled `machine.openshift.io/terminate-on-preemption: true`.
Users of Spot instances could optionally add this label to their Machine’s spec to opt in to this feature.

The DaemonSet itself would poll the EC2 metadata service for a termination notice.
If we give the daemonset privilege to delete the machine, we have a very highly privileged pod running on a worker.
Maybe set a health check that fails on the daemonset when we get the 200 OK from metadata, and have a controller watch pods in a particular namespace.
I think I like the approach of requiring an MHC if you want to use spot instances, with termination triggering the MHC to fail early, more than deleting the machine object.
Along those lines, the daemonset could set the node conditions the MHC is configured to watch.
I think there are several ways that this could be achieved, depending on how we want to break down the privileges that the pods have and how much complexity we want.
- Daemonset that has privileges to delete machines (as currently proposed)
  - Potentially dangerous as highly privileged, a bad actor could delete all machines
- Daemonset that has privileges to set a condition on any node, for MHC to notice
  - If the MHC sees this condition, should it terminate immediately? If not, can we afford a delay?
  - Is this any different from above? A bad actor could apply the condition to all nodes and, by proxy, delete all Machines
- Daemonset that uses the Node's credentials to add the condition to the Node
  - MHC would have to immediately act on this condition
  - Using the Node account limits a bad actor's capabilities to just the node that the pod is on, safer?
  - Is reusing the Node account ok? Grants more privileges than necessary, not intended use, could break in future?
- Some (to be defined) controller creates a Pod for each node with its own role and service account
  - Can limit privileges to only be able to delete the Machine for the Node the Pod is running on
  - Alternatively could only be able to update the status of the Node the Pod is running on
  - More controllers, more complexity than other solutions
I'm proposing a daemonset to schedule a pod on each spot node. The daemonset simply polls the metadata service. If it gets the shutdown notification, fail the readiness check for that pod. Have a controller watching pods in whatever namespace the daemonset is in. When a pod fails readiness (after an initial startup delay), that controller deletes the associated machine.
I think this option that you mentioned might work:
Daemonset that has privileges to set a condition on any node, for MHC to notice
The pod is going to have access to instance metadata, host networking, so it might as well have an account that lets it set conditions on the node. Is it possible to give it permissions to only set conditions on the local node? Having it able to mark this condition on other nodes would be very bad.
MHC would need to respond immediately without delay. This should be simple enough to configure.
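As a rough sketch of the detection half of this approach, assuming the documented AWS spot interruption metadata endpoint and a hypothetical readiness-marker file that a file-based readiness probe would check (the marker path and polling intervals are placeholders, not part of the proposal):

```go
package main

import (
	"io/ioutil"
	"log"
	"net/http"
	"os"
	"time"
)

// The EC2 metadata endpoint returns 404 until a spot interruption is
// scheduled, and 200 with a JSON body once the ~2 minute notice is issued.
const instanceActionURL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

// readyFile is a hypothetical marker consumed by a file-based readiness probe
// (e.g. `cat /tmp/ready`); removing it makes the probe fail so a watching
// controller, or an MHC-driven flow, can react.
const readyFile = "/tmp/ready"

func main() {
	if err := ioutil.WriteFile(readyFile, []byte("ok"), 0644); err != nil {
		log.Fatalf("failed to mark ready: %v", err)
	}

	client := &http.Client{Timeout: 2 * time.Second}
	for range time.Tick(5 * time.Second) {
		resp, err := client.Get(instanceActionURL)
		if err != nil {
			log.Printf("metadata poll failed: %v", err)
			continue
		}
		resp.Body.Close()

		if resp.StatusCode == http.StatusOK {
			log.Print("termination notice received, failing readiness")
			_ = os.Remove(readyFile)
			return
		}
	}
}
```

The GCP and Azure variants would presumably swap out the detection mechanism (GCP sends a signal, as noted below), which is part of why the binary is proposed to live with each provider's actuator.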
I'm proposing a daemonset to schedule a pod on each spot node. The daemonset simply polls the metadata service. If it gets the shutdown notification, fail the readiness check for that pod. Have a controller watching pods in whatever namespace the daemonset is in. When a pod fails readiness (after an initial startup delay), that controller deletes the associated machine.
My only concern with this is adding additional delays that cannot really be afforded when you have a short notice period. If the health check period for the pod is say, 5s, and the pod itself polls the metadata every 5s, you could easily lose 10s of a 30s period on GCP or Azure.
The pod is going to have access to instance metadata, host networking, so it might as well have an account that lets it set conditions on the node. Is it possible to give it permissions to only set conditions on the local node? Having it able to mark this condition on other nodes would be very bad.
The third and fourth options in my comment above have permissions restricted to only setting conditions on a single node. I think the fourth option is likely to be the most complex to implement, but has the best security properties, if we are giving any permissions at all to the pods doing the health checking.
MHC would need to respond immediately without delay. This should be simple enough to configure.
Ack, agreed!
GCP sends a signal, so presumably we'd catch that almost immediately.
I think we have a variety of options. Most of them are fine with me, as long as we're not running a really privileged pod on each node (eg, a pod that can delete machines). We'll probably figure out which method is best by implementing and iterating.
I've added a detailed explanation of the way I think we should proceed with this. It's a little more complex than I had originally hoped but should be relatively safe in terms of not giving pods too much privilege.
I am wondering if it should be a separate proposal now though due to its complexity 🤔
Let's try to keep this simple initially. I think the current system design already gives us all we need:
- Actuators must signal new machines/nodes as spot instances, i.e. label them with `machine.openshift.io/spot` (name to be decided).
- MAO deploys a new daemonSet with a nodeSelector that matches the well-known spot label.
- The provider image/binary to be run by the controller owned by the new daemonSet is chosen by MAO per cloud (as we already do for actuators; the source code can indeed live along with each actuator and the binary be packed in the same image).
- This controller does its provider-specific magic to realise imminent termination and signals the machine for deletion (or, if there are concerns with that, it can set a node condition and delegate to an optionally present MHC to trigger the deletion based on the condition. Though the mapi-controllers service account is privileged to manipulate machines by design).
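If the node-condition route were taken, the signalling half might look roughly like this sketch; the `TerminationScheduled` condition type and the `NODE_NAME` injection are invented for illustration, and a recent client-go `UpdateStatus` signature is assumed:

```go
package main

import (
	"context"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// markNodeTerminating sets a condition on the local node that a
// MachineHealthCheck could be configured to remediate on immediately.
// The condition type is illustrative, not something defined by MAPI today.
func markNodeTerminating(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Status.Conditions = append(node.Status.Conditions, corev1.NodeCondition{
		Type:               "TerminationScheduled",
		Status:             corev1.ConditionTrue,
		Reason:             "SpotInterruption",
		Message:            "cloud provider signalled imminent preemption",
		LastTransitionTime: metav1.NewTime(time.Now()),
	})
	_, err = client.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	// NODE_NAME is assumed to be injected into the pod via the downward API.
	nodeName := os.Getenv("NODE_NAME")
	if err := markNodeTerminating(context.Background(), kubernetes.NewForConfigOrDie(cfg), nodeName); err != nil {
		panic(err)
	}
}
```

Whether the pod is allowed to do this for any node or only its own is exactly the privilege trade-off discussed above.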
- The actuator should not attempt to verify that an instance can be created before attempting to create the instance
- If the cloud provider does not have capacity, the Machine Health Checker can (given required MHC) remove the Machine after a period.
MachineSet will ensure the correct number of Machines are created.

- There’s a [2 hours time frame](https://github.com/openshift/cluster-machine-approver/blob/master/csr_check.go#L38) in OpenShift for a machine to become a node.
If the Spot request is not satisfied during this time frame, it is assumed that the instance will never become a node.
This machine would be remediated by a given MHC.
given required MHC
It makes it sound like we can't use spot instances without an MHC because otherwise we'll never get replacement machines.
So are we going to validate that, if the spot feature is being used, the machineset must have an MHC attached?
I think we might want to start by documenting this. As we gather more real-world feedback on MHC, I'd like us to explore a more generic higher-level abstraction which would possibly own and run both machineSets and MHC underneath, e.g. machineDeployment/nodePool or similar.
If it is relevant: I utilized spotinst a few years ago when providing services, and it saved ~80% of the EC2 costs.
If a MachineHealthCheck were deployed, these Machines would be considered unhealthy and would be deleted, creating a new Machine in its place.

With or without an MHC, the scale request would never manifest in new compute capacity joining the cluster.
After a half hour period, the autoscaler considers the MachineSet to be out of sync and rescales it to remove the `Failed` instances.
After a half hour period
This is tied to your particular autoscaler settings.
the autoscaler would deem these Machines as having unregistered nodes and, after a 15 minute period,
would request these unregistered nodes be deleted, mark the MachineSet as unhealthy and attempt to scale an alternative MachineSet.
This, while still not perfect, is preferable to the current state of the autoscaler;
assuming there is an on-demand based MachineSet to fall back to, this would be tried as a backup.
Maybe add a note that this is actually the current behaviour for the AWS provider as well: kubernetes/autoscaler#2235
Based on the [working implementation](https://github.com/kubernetes/autoscaler/pull/2235/files) of Spot instances in the AWS autoscaler provider,
if the autoscaler Machine API implementation were to provide fake provider IDs for Machines that have failed ([example](https://github.com/JoelSpeed/autoscaler/commit/11ebd1ffdadebbb20d2fac9aae30646b4f47dfa9)),
the autoscaler would deem these Machines as having unregistered nodes and, after a 15 minute period,
after a 15 minute period,
This is tied to your particular autoscaler settings.
the autoscaler would deem these Machines as having unregistered nodes and, after a 15 minute period,
would request these unregistered nodes be deleted, mark the MachineSet as unhealthy and attempt to scale an alternative MachineSet.
This, while still not perfect, is preferable to the current state of the autoscaler;
assuming there is an on-demand based MachineSet to fall back to, this would be tried as a backup.
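As a rough illustration of the fake provider ID idea quoted above, sketched over plain strings rather than the real Machine API types (the `fake://` scheme and the helper name are hypothetical; the linked commit is the authoritative example):

```go
package main

import "fmt"

// providerIDOrPlaceholder sketches the idea behind the linked autoscaler change:
// if a Machine has gone Failed without ever registering a node, report a
// placeholder provider ID so the core autoscaler counts it as an unregistered
// node and removes it after its configured timeout, rather than waiting for
// capacity that will never arrive.
func providerIDOrPlaceholder(realProviderID, phase, namespace, name string) string {
	if realProviderID != "" {
		return realProviderID
	}
	if phase == "Failed" {
		return fmt.Sprintf("fake://%s/%s", namespace, name)
	}
	// Machine is still provisioning; report nothing yet.
	return ""
}
```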
Maybe mention somehow that this satisfies the expectations of the core autoscaler backoff/health-checking mechanism.
Maybe mention that, to let the autoscaler choose a spot instance nodeGroup over one with on-demand instances, pod affinity can be leveraged so that only workloads suitable for spot instances result in scale in/out.
Looks like all concerns have been addressed. This seems good to implement and iterate on based on tangible feedback. If any update or more granularity is required we'll follow up with a separate PR.
- If the cloud provider does not have capacity, the Machine Health Checker can (given required MHC) remove the Machine after a period.
MachineSet will ensure the correct number of Machines are created.

- There’s a [2 hours time frame](https://github.com/openshift/cluster-machine-approver/blob/master/csr_check.go#L38) in OpenShift for a machine to become a node.
Alternatively, if we considered this a permanent failure, the machine could go `Failed` right away.
To launch an instance as a Spot instance on AWS, a [SpotMarketOptions](https://docs.aws.amazon.com/sdk-for-go/api/service/ec2/#SpotMarketOptions)
needs to be added to the `RunInstancesInput`. Within this there are 3 options that matter:
Which particular fields would we be exposing in the user-facing API?
This is explained below on lines L#138-L#151
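For context, a minimal aws-sdk-go sketch of attaching spot options to a `RunInstancesInput`. The fields shown (`MaxPrice`, `SpotInstanceType`, `InstanceInterruptionBehavior`) are plausible candidates for the three options the proposal discusses, and which of them end up exposed in the user-facing API is exactly the question above; the other values are placeholders.

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// buildSpotRunInstancesInput sketches attaching spot market options to an
// otherwise normal RunInstances request. Omitting MaxPrice defaults the
// bid ceiling to the on-demand price.
func buildSpotRunInstancesInput(ami, instanceType, maxPrice string) *ec2.RunInstancesInput {
	return &ec2.RunInstancesInput{
		ImageId:      aws.String(ami),
		InstanceType: aws.String(instanceType),
		MinCount:     aws.Int64(1),
		MaxCount:     aws.Int64(1),
		InstanceMarketOptions: &ec2.InstanceMarketOptionsRequest{
			MarketType: aws.String(ec2.MarketTypeSpot),
			SpotOptions: &ec2.SpotMarketOptions{
				// Field names match the SDK; which ones the Machine API
				// exposes to users is the open design question.
				MaxPrice:                     aws.String(maxPrice),
				SpotInstanceType:             aws.String(ec2.SpotInstanceTypeOneTime),
				InstanceInterruptionBehavior: aws.String(ec2.InstanceInterruptionBehaviorTerminate),
			},
		},
	}
}
```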
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: enxebre, JoelSpeed.
Implemented as described in openshift/enhancements#199
This adds a proposal for supporting Spot Instances in AWS, GCP and Azure as part of the Machine API