
Add spot-instances machine-api proposal #199

Merged 1 commit into openshift:master on Mar 11, 2020

Conversation

JoelSpeed (Contributor):

This adds a proposal for supporting Spot Instances in AWS, GCP and Azure as part of the Machine API

@openshift-ci-robot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Feb 4, 2020
@elmiko (Contributor) left a comment:

this all sounds reasonable to me, just had a few questions inline


## Summary

Enable OCP users to leverage cheaper, non-guaranteed instances to back Machine API Machines.
Contributor:

would this be for all types of machines? (eg control and compute planes)

Contributor Author:

There's nothing that I'm aware of that would restrict users from using this for control plane machines; the Machine API doesn't really know which role each machine has.

I think we should just strongly recommend in the docs that people use common sense and don't use it for control plane machines.

Contributor:

Am not sure docs are enough for this. Etcd with two masters down bricks the cluster.

Member:

agree w/ @sttts there is no valid scenario for running the control plane in its current form on spot instances.

Contributor Author:

While I agree that there is no scenario where you would want to run the control plane on spot instances, I think doing anything more than documenting that users shouldn't do this would add a lot of extra complexity to this proposal and to MAPI in general.

As I understand it at the moment, the Machine API does not know what the machine it is creating is for. It gets userdata from MCO to apply to the instance when it creates it, but there's nothing in the spec of MAPI to say, "this one is a control-plane, do something special". The role of the machine is a black box to MAPI, so having some check to make sure control-plane machines don't run on spot instances would require adding some extra API field which says control-plane: true. But then what validates that that field is actually the truth? To do that, you'd have to understand and interpret the userdata, right?

Secondly, at the moment there is no automation around control-plane machines: they are normally created by the installer, and while a user can add a new machine, it requires several manual steps to reconfigure etcd and DNS. If we assume most people use the installer for creating control-plane machines, then even if they did specify extra config for spot instance creation, this wouldn't be observed by the Terraform parts of the installer, and as such the control-plane machines would end up on on-demand instances anyway.

I'm relatively new to OpenShift and still learning though; my understanding may be incorrect and I'm definitely open to other ideas on how to solve this problem.

Contributor:

Since we don't create control plane machines as spot types, I think an end user ignoring documentation and creating control plane machines that are spot types (which would require multiple, very specific actions on their part) is along the same lines of foot-shooting as deleting random instances in your AWS account. We can't prevent users from doing super bad things.

We could detect after the fact that machines are control plane nodes, but as previously mentioned, the actual roles are determined via ignition payload from MCO, so it's not ideal to try to determine that ahead of time.

Member:

I agree. This would be the same as validating that the right value is set for the credentials secret, the right ignition secret, etc., which we don't do. Master machines are known to be pets today, which lets the installer set a desired input. When/if a controlPlane controller exists we could enforce opinionated validations and gating there.


###### Termination Notices
Termination notices on AWS are provided via the EC2 metadata service up to 2 minutes before an instance is due to be preempted.
A DaemonSet should be deployed on the cluster which runs Pods on nodes labelled `machine.openshift.io/terminate-on-preemption: true`.
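
(Illustrative sketch, not part of the proposal text: a minimal Go version of the poll against the EC2 metadata service, assuming the documented spot `instance-action` endpoint; the polling interval and the callback are placeholder choices, not agreed design.)

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// Documented EC2 spot interruption endpoint; it returns 404 until a
// termination notice has been issued, then 200 with a small JSON body.
const instanceActionURL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

// pollTerminationNotice polls the metadata service and invokes onNotice once
// a termination notice is observed.
func pollTerminationNotice(interval time.Duration, onNotice func()) {
	client := &http.Client{Timeout: 2 * time.Second}
	for range time.Tick(interval) {
		resp, err := client.Get(instanceActionURL)
		if err != nil {
			log.Printf("metadata poll failed: %v", err)
			continue
		}
		resp.Body.Close()
		if resp.StatusCode == http.StatusOK {
			onNotice()
			return
		}
	}
}

func main() {
	pollTerminationNotice(5*time.Second, func() {
		log.Print("termination notice received; trigger machine remediation here")
	})
}
```
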
Member:

Can we specify:

Contributor Author:

I'm a little uncertain how this would work at the moment; my current thinking is:

Does that make sense?

@enxebre (Member) commented Feb 5, 2020:

cc @listey


### Goals

- Provide consistent behaviour of Machines running on non-guaranteed instances across cloud providers
Member:

This seems great as a goal...but I think it's going to be hard for anyone using this at scale to avoid needing to look at and understand provider-specific semantics and pricing.

Which, reading down seems to be the case (the provider pricing bits are all provider-specific). We're mostly just exposing the baseline infrastructure for integrating this nicely into OpenShift.

Maybe just call out as a goal something like

  • Support sophisticated provider-specific automation driven by higher-level (3rd-party) tooling, e.g. code that dynamically adjusts price targets based on workload

Contributor Author:

Yeah that's fair, I wrote the goals before doing most of the research 😅 Will reword

A DaemonSet should be deployed on the cluster which runs Pods on nodes labelled `machine.openshift.io/terminate-on-preemption: true`.
Users of Spot instances could optionally add this label to their Machine’s spec to opt in to this feature.

The DaemonSet itself would poll the EC2 metadata service for a termination notice.
Contributor:

If we give the daemonset privilege to delete the machine, we have a very highly privileged pod running on a worker.

Maybe set a health-check that fails on the daemonset when we get the 200 OK from metadata, and have a controller watch pods in a particular namespace.

Contributor:

I think I like the "require an MHC if you want to use spot instances, and termination triggers the MHC to fail early" approach more than deleting the Machine object.

Member:

Along those lines, the daemonset could set the node conditions the MHC is configured to watch.

Contributor Author:

I think there are several ways that this could be achieved, depending on how we want to break down the privileges that the pods have and how much complexity we want.

  • Daemonset that has privileges to delete machines (as currently proposed)
    • Potentially dangerous as highly privileged, bad actor could delete all machines
  • Daemonset that has privileges to set a condition on any node, for MHC to notice
    • If the MHC sees this condition, should it terminate immediately? If not, can we afford a delay?
    • Is this any different from above? Bad actor could apply condition to all nodes and by proxy, delete all Machines
  • Daemonset that uses the Node's credentials to add a condition to the Node (sketched after this list)
    • MHC would have to immediately act on this condition
    • Using the Node account limits a bad actor's capabilities to just the node that the pod is on, safer?
    • Is reusing the Node account OK? Grants more privileges than necessary, not intended use, could break in future?
  • Some (to be defined) controller creates a Pod for each node with its own role and service account
    • Can limit privileges to only be able to delete the Machine for the Node the Pod is running on
    • Alternatively could only be able to update status of Node Pod is running on
    • More controllers, more complexity than other solutions
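
(Illustrative sketch, not from the PR: a rough Go version of the third option above, in which the pod uses node-scoped credentials to set a condition on its own node for an MHC to act on. The condition type `Terminating`, the reason string, and the `NODE_NAME` downward-API variable are hypothetical names, assuming client-go.)

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// markNodePreempted patches a condition onto the local node so that a
// MachineHealthCheck configured to watch that condition can remediate the
// machine. "Terminating" and the reason string are hypothetical names.
func markNodePreempted(client kubernetes.Interface, nodeName string) error {
	patch := map[string]interface{}{
		"status": map[string]interface{}{
			"conditions": []corev1.NodeCondition{{
				Type:               "Terminating",
				Status:             corev1.ConditionTrue,
				Reason:             "SpotTerminationNotice",
				Message:            "cloud provider signalled imminent preemption",
				LastTransitionTime: metav1.NewTime(time.Now()),
			}},
		},
	}
	data, err := json.Marshal(patch)
	if err != nil {
		return err
	}
	_, err = client.CoreV1().Nodes().Patch(context.TODO(), nodeName,
		types.StrategicMergePatchType, data, metav1.PatchOptions{}, "status")
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// NODE_NAME would be injected into the DaemonSet pod via the downward API.
	if err := markNodePreempted(kubernetes.NewForConfigOrDie(cfg), os.Getenv("NODE_NAME")); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```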

Contributor:

I'm proposing a daemonset to schedule a pod on each spot node. The daemonset simply polls the metadata service. If it gets the shutdown notification, fail the readiness check for that pod. Have a controller watching pods in whatever namespace the daemonset is in. When a pod fails readiness (after an initial startup delay), that controller deletes the associated machine.

I think this option that you mentioned might work:

Daemonset that has privileges to set a condition on any node, for MHC to notice

The pod is going to have access to instance metadata and host networking, so it might as well have an account that lets it set conditions on the node. Is it possible to give it permission to only set conditions on the local node? Having it able to mark this condition on other nodes would be very bad.

MHC would need to respond immediately without delay. This should be simple enough to configure.
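
(Illustrative sketch, not from the PR: a minimal Go version of the readiness-probe variant discussed above, assuming the metadata poll flips a flag when a notice is seen; the endpoint path and port are arbitrary placeholders.)

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// terminationSeen flips to 1 once the metadata poll observes a termination
// notice; the readiness probe then fails, and an external controller can
// react to the pod going NotReady by deleting the associated Machine.
var terminationSeen int32

func readyz(w http.ResponseWriter, _ *http.Request) {
	if atomic.LoadInt32(&terminationSeen) == 1 {
		http.Error(w, "spot termination notice received", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/readyz", readyz)
	// The poll loop would call atomic.StoreInt32(&terminationSeen, 1) on notice.
	http.ListenAndServe(":8080", nil)
}
```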

Contributor Author:

I'm proposing a daemonset to schedule a pod on each spot node. The daemonset simply polls the metadata service. If it gets the shutdown notification, fail the readiness check for that pod. Have a controller watching pods in whatever namespace the daemonset is in. When a pod fails readiness (after an initial startup delay), that controller deletes the associated machine.

My only concern with this is adding additional delays that cannot really be afforded when you have a short notice period. If the health check period for the pod is, say, 5s, and the pod itself polls the metadata every 5s, you could easily lose 10s of a 30s period on GCP or Azure.

The pod is going to have access to instance metadata, host networking, so it might as well have an account that lets it set conditions on the node. Is it possible to give it permissions to only set conditions on the local node? Having it able to mark this condition on other nodes would be very bad.

The third and fourth options in my comment above can restrict permissions to just a single node. I think the fourth option is likely to be the most complex to implement, but has the best security properties, if we are giving any permissions to the pods doing the health checking, that is.

MHC would need to respond immediately without delay. This should be simple enough to configure.

Ack, agreed!

Contributor:

GCP sends a signal, so presumably we'd catch that almost immediately.

I think we have a variety of options. Most of them are fine with me, as long as we're not running a really privileged pod on each node (eg, a pod that can delete machines). We'll probably figure out which method is best by implementing and iterating.

Contributor Author:

I've added a detailed explanation of the way I think we should proceed with this. It's a little more complex than I had originally hoped but should be relatively safe in terms of not giving pods too much privilege.

I am wondering if it should be a separate proposal now though due to its complexity 🤔

@enxebre (Member), Feb 21, 2020:

Let's try to keep this simple initially. I think the current system design already gives us all we need:

  • Actuators must signal new machines/nodes as spot instances, i.e. label them with machine.openshift.io/spot (name to be decided).
  • MAO deploys a new daemonSet with a nodeSelector that matches the well-known spot label (see the sketch below).
  • The provider image/binary run by the controller owned by the new daemonSet is chosen by MAO per cloud (as we already do for actuators; the source code can live alongside each actuator and the binary can be packed into the same image).
  • This controller does its provider-specific magic to detect imminent termination and signal the machine for deletion (or, if there are concerns with that, it can set a node condition and delegate to the optional existence of an MHC to signal the deletion based on the conditions; the mapi-controllers service account is privileged to manipulate machines by design anyway).
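
(Illustrative sketch, not from the PR: what the DaemonSet from the second bullet could look like when built in Go, assuming the `machine.openshift.io/spot` label lands on spot nodes; the object name, namespace, and image are placeholders.)

```go
package mao

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// spotLabel is the (not yet decided) label the actuator would put on spot nodes.
const spotLabel = "machine.openshift.io/spot"

// terminationHandlerDaemonSet sketches the DaemonSet MAO could deploy; the
// image is chosen per cloud provider, names and namespace are placeholders.
func terminationHandlerDaemonSet(image string) *appsv1.DaemonSet {
	labels := map[string]string{"k8s-app": "spot-termination-handler"}
	return &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "spot-termination-handler",
			Namespace: "openshift-machine-api",
			Labels:    labels,
		},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					// Only schedule onto nodes the actuator labelled as spot instances.
					NodeSelector: map[string]string{spotLabel: "true"},
					Containers: []corev1.Container{{
						Name:  "termination-handler",
						Image: image,
					}},
				},
			},
		},
	}
}
```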

Comment on lines 71 to 77
- The actuator should not attempt to verify that an instance can be created before attempting to create the instance
- If the cloud provider does not have capacity, the Machine Health Checker can (given required MHC) remove the Machine after a period.
MachineSet will ensure the correct number of Machines are created.

- There’s a [2 hours time frame](https://github.com/openshift/cluster-machine-approver/blob/master/csr_check.go#L38) in OpenShift for a machine to become a node.
If the Spot request is not satisfied during this time frame, it is assumed that the instance will never become a node.
This machine would be remediated by a given MHC.
Contributor:

given required MHC
It makes it sound like we can't use spot instances without an MHC, because otherwise we'll never get replacement machines.

So are we going to validate that, if the spot feature is being used, the machineset must have an MHC attached?

@enxebre (Member), Feb 6, 2020:

I think we might want to start by documenting this. As we gather more real-world feedback on MHC, I'd like us to explore a more generic, higher-level abstraction which would possibly own and run both machineSets and MHC underneath, e.g. machineDeployment/nodePool or similar.

@lmilbaum commented Feb 6, 2020:

If it is relevant: I utilized spotinst a few years ago when I was providing services. It saved ~80% of the EC2 costs.

@derekwaynecarr (Member):

@staebler @dgoodwin please review; it will likely want to be exposed in Hive as well in the future.

If a MachineHealthCheck were deployed, these Machines would be considered unhealthy and would be deleted, creating a new Machine in its place.

With or without an MHC, the scale request would never manifest in new compute capacity joining the cluster.
After a half hour period, the autoscaler considers the MachineSet to be out of sync and rescales it to remove the `Failed` instances.
@enxebre (Member), Feb 21, 2020:

After a half hour period

This is tied to your particular autoscaler settings.

the autoscaler would deem these Machines as having unregistered nodes and, after a 15 minute period,
would request these unregistered nodes be deleted, mark the MachineSet as unhealthy and attempt to scale an alternative MachineSet.
This, while still not perfect, is preferable to the current state of the autoscaler;
assuming there is a on-demand based MachineSet to fall back to, this would be tried as a backup.
Member:

Maybe add a note that this is actually the current behaviour for the AWS provider as well: kubernetes/autoscaler#2235


Based on the [working implementation](https://github.com/kubernetes/autoscaler/pull/2235/files) of Spot instances in the AWS autoscaler provider,
if the autoscaler Machine API implementation were to provide fake provider IDs for Machines that have failed ([example](https://github.com/JoelSpeed/autoscaler/commit/11ebd1ffdadebbb20d2fac9aae30646b4f47dfa9)),
the autoscaler would deem these Machines as having unregistered nodes and, after a 15 minute period,
Member:

after a 15 minute period,

This is tied to your particular autoscaler settings.

the autoscaler would deem these Machines as having unregistered nodes and, after a 15 minute period,
would request these unregistered nodes be deleted, mark the MachineSet as unhealthy and attempt to scale an alternative MachineSet.
This, while still not perfect, is preferable to the current state of the autoscaler;
assuming there is a on-demand based MachineSet to fall back to, this would be tried as a backup.
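
(Illustrative sketch, not from the PR: the fake-provider-ID idea described in the excerpt above, roughly in Go. The import path and field shapes are assumptions based on the machine API v1beta1 types of that era, and the `fake://` scheme is an arbitrary placeholder, not the linked commit's exact code.)

```go
package provider

import (
	"fmt"

	machinev1 "github.com/openshift/machine-api-operator/pkg/apis/machine/v1beta1"
)

// providerIDForMachine returns the real provider ID when one exists; for
// Machines that failed before an instance was ever created it returns a
// synthetic ID, so the core autoscaler counts the node as registered but
// missing and eventually scales the node group back down.
func providerIDForMachine(machine *machinev1.Machine) string {
	if machine.Spec.ProviderID != nil && *machine.Spec.ProviderID != "" {
		return *machine.Spec.ProviderID
	}
	if machine.Status.Phase != nil && *machine.Status.Phase == "Failed" {
		return fmt.Sprintf("fake://%s/%s", machine.Namespace, machine.Name)
	}
	return ""
}
```
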
Member:

Maybe mention somehow that this satisfies the expectations of the core autoscaler's backoff/health-checking mechanism.

Member:

Maybe mention that, to let the autoscaler choose a spot-instance nodeGroup over one with on-demand instances, pod affinity can be leveraged so that only workloads suitable for spot instances result in scale in/out.

@enxebre (Member) commented Mar 10, 2020:

Looks like all concerns have been addressed. This seems good to implement and iterate on based on tangible feedback. If any update or more granular change is required, we'll follow up with a separate PR.
/approve

@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Mar 10, 2020
- If the cloud provider does not have capacity, the Machine Health Checker can (given required MHC) remove the Machine after a period.
MachineSet will ensure the correct number of Machines are created.

- There’s a [2 hours time frame](https://github.com/openshift/cluster-machine-approver/blob/master/csr_check.go#L38) in OpenShift for a machine to become a node.
Member:

Alternatively, if we considered this a permanent failure, the machine could go Failed right away.


To launch an instance as a Spot instance on AWS, a [SpotMarketOptions](https://docs.aws.amazon.com/sdk-for-go/api/service/ec2/#SpotMarketOptions)
needs to be added to the `RunInstancesInput`. Within this there are 3 options that matter:
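
(Illustrative sketch, not from the PR: how such a request could look with the AWS SDK for Go, assuming the v1 EC2 API shape where the spot options are nested under `InstanceMarketOptions` on `RunInstancesInput`. The three options the proposal goes on to list are not shown in this excerpt; the max-price handling here is a placeholder, not the proposal's chosen defaulting.)

```go
package aws

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// withSpotMarketOptions attaches spot options to a RunInstancesInput. Leaving
// MaxPrice nil asks AWS to cap the price at the on-demand rate.
func withSpotMarketOptions(input *ec2.RunInstancesInput, maxPrice *string) *ec2.RunInstancesInput {
	input.InstanceMarketOptions = &ec2.InstanceMarketOptionsRequest{
		MarketType: aws.String(ec2.MarketTypeSpot),
		SpotOptions: &ec2.SpotMarketOptions{
			MaxPrice: maxPrice,
		},
	}
	return input
}
```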

Member:

Which particular fields would we be exposing in the user-facing API?

Contributor Author:

This is explained below, on lines L138-L151.

@enxebre (Member) commented Mar 11, 2020:

/lgtm

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged.) on Mar 11, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre, JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit ac5ae47 into openshift:master on Mar 11, 2020
JoelSpeed added a commit to JoelSpeed/cluster-api-provider-aws that referenced this pull request Mar 23, 2020