Add spot-instances machine-api proposal #199
Conversation
this all sounds reasonable to me, just had a few questions inline
## Summary

Enable OCP users to leverage cheaper, non-guaranteed instances to back Machine API Machines.
would this be for all types of machines? (eg control and compute planes)
There's nothing that I'm aware of that would restrict users from using this for control plane machines; the Machine API doesn't really know which role a machine has.
I think we just strongly recommend in the docs that people use common sense and don't use it for control plane machines.
I'm not sure docs are enough for this. Etcd with two masters down bricks the cluster.
agree w/ @sttts there is no valid scenario for running the control plane in its current form on spot instances.
While I agree that there is no scenario where you would want to run the control-plane on spot instances, I think doing anything more than documenting that users shouldn't do this might add a lot of extra complexity to this proposal and MAPI in general.
As I understand it at the moment, the Machine API does not know what the machine it is creating is for. It gets userdata from MCO to apply to the instance when it creates it, but there's nothing in the spec of MAPI to say, "this one is a control-plane, do something special". The role of the machine is a black box to MAPI, so having some check to make sure control-plane machines don't run on spot instances would require adding some extra API field which says `control-plane: true`, but then what validates that that field is actually the truth? To do that, you'd have to understand and interpret the userdata, right?
Secondly, at the moment, there is no automation around control-plane machines; they are normally created by the installer, and if a user wants to add a new machine they can, but it requires several manual steps to reconfigure etcd and DNS. If we assume most people use the installer for creating control-plane machines, then, even if they did specify extra config for spot instance creation, this wouldn't be observed by the terraform parts of the installer, and as such, the control-plane machines would end up on on-demand instances anyway.
I'm relatively new to OpenShift and still learning though, my understanding may be incorrect and I'm definitely open to other ideas on how to solve this problem.
Since we don't create control plane machines as spot types, I think an end user ignoring documentation and creating control plane machines that are spot types (which would require multiple, very specific actions on their part) is along the same lines of foot-shooting as deleting random instances in your AWS account. We can't prevent users from doing super bad things.
We could detect after the fact that machines are control plane nodes, but as previously mentioned, the actual roles are determined via ignition payload from MCO, so it's not ideal to try to determine that ahead of time.
I agree. This would be the same as protecting against setting the right value for the credentials secret, the right ignition secret, etc., which we don't do. Master machines are known to be pets today, which lets the installer set the desired input. When/if a controlPlane controller exists we could enforce opinionated validations and gating there.
###### Termination Notices

Termination notices on AWS are provided via the EC2 metadata service up to 2 minutes before an instance is due to be preempted.
A DaemonSet should be deployed on the cluster which runs Pods on nodes labelled `machine.openshift.io/terminate-on-preemption: true`.
Can we specify:
- What would run this daemonSet? I'd assume MAO.
- Where does this controller implementation live? I'd assume a new controller in each provider actuator repository. This binary could then be run by MAO from the provider image, similar to https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/operator.go#L284 and https://github.com/openshift/machine-api-operator/blob/master/pkg/operator/sync.go#L225-L237
I'm a little uncertain on how this would work at the moment, my current thinking is:
- New controller code for each termination controller goes into the respective actuator repositories
- This makes a new binary which goes into each mapi provider image for the cloud provider
- MAO syncs the daemonset in https://github.com/openshift/machine-api-operator/blob/e55d2b984245439082c3cd8f212b8c9c3380bbbe/pkg/operator/sync.go#L64-L74 if it is supported by the provider
- The daemonset sets its entrypoint to the second binary that has been added to the image
Does that make sense?
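For illustration only, a minimal Go sketch of what such a synced DaemonSet could look like. The object name, the `/termination-handler` entrypoint and the `k8s-app` labels are made-up placeholders; only the node label comes from the proposal text, and whether MAO would template this from a manifest or build it in code is an open implementation detail.

```go
package operator

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newTerminationHandlerDaemonSet sketches the DaemonSet MAO could sync.
// providerImage is the per-cloud provider image that already carries the
// actuator; "/termination-handler" stands in for the second binary.
func newTerminationHandlerDaemonSet(namespace, providerImage string) *appsv1.DaemonSet {
	labels := map[string]string{"k8s-app": "machine-api-termination-handler"}
	return &appsv1.DaemonSet{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "machine-api-termination-handler",
			Namespace: namespace,
		},
		Spec: appsv1.DaemonSetSpec{
			Selector: &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					// Only land on nodes that opted in via the label from the proposal.
					NodeSelector: map[string]string{
						"machine.openshift.io/terminate-on-preemption": "true",
					},
					// The cloud metadata services are link-local endpoints,
					// so run in the host network namespace.
					HostNetwork: true,
					Containers: []corev1.Container{{
						Name:    "termination-handler",
						Image:   providerImage,
						Command: []string{"/termination-handler"},
					}},
				},
			},
		},
	}
}
```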
cc @listey
### Goals

- Provide consistent behaviour of Machines running on non-guaranteed instances across cloud providers
This seems great as a goal... but I think it's going to be hard for anyone using this at scale to avoid needing to look at and understand provider-specific semantics and pricing.
Which, reading down, seems to be the case (the provider pricing bits are all provider-specific). We're mostly just exposing the baseline infrastructure for integrating this nicely into OpenShift.
Maybe just call out as a goal something like:
- Support sophisticated provider-specific automation driven by higher-level (3rd party) tooling, e.g. code that dynamically adjusts price targets based on workload
Yeah that's fair, I wrote the goals before doing most of the research 😅 Will reword
A DaemonSet should be deployed on the cluster which runs Pods on nodes labelled `machine.openshift.io/terminate-on-preemption: true`.
Users of Spot instances could optionally add this label to their Machine’s spec to opt in to this feature.

The DaemonSet itself would poll the EC2 metadata service for a termination notice.
If we give the daemonset privilege to delete the machine, we have a very highly privileged pod running on a worker.
Maybe set a health check that fails on the daemonset when we get the 200 OK from metadata, and have a controller watch pods in a particular namespace.
I think I like the approach of requiring an MHC if you want to use spot instances, with termination triggering the MHC to fail early, more than deleting the machine object.
Along those lines, the daemonset could set the node conditions the MHC is configured to watch.
I think there are several ways that this could be achieved, depending on how we want to break down the privileges that the pods have and how much complexity we want.
- Daemonset that has privileges to delete machines (as currently proposed)
  - Potentially dangerous as highly privileged, a bad actor could delete all machines
- Daemonset that has privileges to set a condition on any node, for MHC to notice
  - If the MHC sees this condition, should it terminate immediately? If not, can we afford a delay?
  - Is this any different from above? A bad actor could apply the condition to all nodes and, by proxy, delete all Machines
- Daemonset that uses the Node's credentials to add the condition to the Node
  - MHC would have to immediately act on this condition
  - Using the Node account limits a bad actor's capabilities to just the node that the pod is on, safer?
  - Is reusing the Node account ok? Grants more privileges than necessary, not intended use, could break in future?
- Some (to be defined) controller creates a Pod for each node with its own role and service account
  - Can limit privileges to only be able to delete the Machine for the Node the Pod is running on
  - Alternatively could only be able to update the status of the Node the Pod is running on
  - More controllers, more complexity than other solutions
I'm proposing a daemonset to schedule a pod on each spot node. The daemonset simply polls the metadata service. If it gets the shutdown notification, fail the readiness check for that pod. Have a controller watching pods in whatever namespace the daemonset is in. When a pod fails readiness (after an initial startup delay), that controller deletes the associated machine.
I think this option that you mentioned might work:
Daemonset that has privileges to set a condition on any node, for MHC to notice
The pod is going to have access to instance metadata, host networking, so it might as well have an account that lets it set conditions on the node. Is it possible to give it permissions to only set conditions on the local node? Having it able to mark this condition on other nodes would be very bad.
MHC would need to respond immediately without delay. This should be simple enough to configure.
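As a rough sketch of the detection half of this approach, assuming the documented AWS spot interruption metadata endpoint and a hypothetical readiness-marker file that a file-based readiness probe would check (the marker path and polling intervals are placeholders, not part of the proposal):

```go
package main

import (
	"io/ioutil"
	"log"
	"net/http"
	"os"
	"time"
)

// The EC2 metadata endpoint returns 404 until a spot interruption is
// scheduled, and 200 with a JSON body once the ~2 minute notice is issued.
const instanceActionURL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

// readyFile is a hypothetical marker consumed by a file-based readiness probe
// (e.g. `cat /tmp/ready`); removing it makes the probe fail so a watching
// controller, or an MHC-driven flow, can react.
const readyFile = "/tmp/ready"

func main() {
	if err := ioutil.WriteFile(readyFile, []byte("ok"), 0644); err != nil {
		log.Fatalf("failed to mark ready: %v", err)
	}

	client := &http.Client{Timeout: 2 * time.Second}
	for range time.Tick(5 * time.Second) {
		resp, err := client.Get(instanceActionURL)
		if err != nil {
			log.Printf("metadata poll failed: %v", err)
			continue
		}
		resp.Body.Close()

		if resp.StatusCode == http.StatusOK {
			log.Print("termination notice received, failing readiness")
			_ = os.Remove(readyFile)
			return
		}
	}
}
```

The GCP and Azure variants would presumably swap out the detection mechanism (GCP sends a signal, as noted below), which is part of why the binary is proposed to live with each provider's actuator.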
I'm proposing a daemonset to schedule a pod on each spot node. The daemonset simply polls the metadata service. If it gets the shutdown notification, fail the readiness check for that pod. Have a controller watching pods in whatever namespace the daemonset is in. When a pod fails readiness (after an initial startup delay), that controller deletes the associated machine.
My only concern with this is adding additional delays that cannot really be afforded when you have a short notice period. If the health check period for the pod is say, 5s, and the pod itself polls the metadata every 5s, you could easily lose 10s of a 30s period on GCP or Azure.
The pod is going to have access to instance metadata, host networking, so it might as well have an account that lets it set conditions on the node. Is it possible to give it permissions to only set conditions on the local node? Having it able to mark this condition on other nodes would be very bad.
The third and fourth options in my comment above have permissions restricted to only setting conditions on a single node. I think the fourth option is likely to be the most complex to implement, but has the best security properties, if we are giving any permissions at all to the pods doing the health checking.
MHC would need to respond immediately without delay. This should be simple enough to configure.
Ack, agreed!
GCP sends a signal, so presumably we'd catch that almost immediately.
I think we have a variety of options. Most of them are fine with me, as long as we're not running a really privileged pod on each node (eg, a pod that can delete machines). We'll probably figure out which method is best by implementing and iterating.
I've added a detailed explanation of the way I think we should proceed with this. It's a little more complex than I had originally hoped but should be relatively safe in terms of not giving pods too much privilege.
I am wondering if it should be a separate proposal now though due to its complexity 🤔
Let's try to keep this simple initially. I think the current system design already gives us all we need:
- Actuators must signal new machines/nodes as spot instances, i.e. label them with `machine.openshift.io/spot` (name to be decided).
- MAO deploys a new daemonSet with a nodeSelector that matches the well-known spot label.
- The provider image/binary to be run by the controller owned by the new daemonSet is chosen by MAO per cloud (as we already do for actuators; the source code can indeed live along with each actuator and the binary be packed in the same image).
- This controller does its provider-specific magic to realise imminent termination and signals the machine for deletion (or, if there are concerns with that, it can set a node condition and delegate to an optionally present MHC to trigger the deletion based on the condition. Though the mapi-controllers service account is privileged to manipulate machines by design).
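If the node-condition route were taken, the signalling half might look roughly like this sketch; the `TerminationScheduled` condition type and the `NODE_NAME` injection are invented for illustration, and a recent client-go `UpdateStatus` signature is assumed:

```go
package main

import (
	"context"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// markNodeTerminating sets a condition on the local node that a
// MachineHealthCheck could be configured to remediate on immediately.
// The condition type is illustrative, not something defined by MAPI today.
func markNodeTerminating(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Status.Conditions = append(node.Status.Conditions, corev1.NodeCondition{
		Type:               "TerminationScheduled",
		Status:             corev1.ConditionTrue,
		Reason:             "SpotInterruption",
		Message:            "cloud provider signalled imminent preemption",
		LastTransitionTime: metav1.NewTime(time.Now()),
	})
	_, err = client.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	// NODE_NAME is assumed to be injected into the pod via the downward API.
	nodeName := os.Getenv("NODE_NAME")
	if err := markNodeTerminating(context.Background(), kubernetes.NewForConfigOrDie(cfg), nodeName); err != nil {
		panic(err)
	}
}
```

Whether the pod is allowed to do this for any node or only its own is exactly the privilege trade-off discussed above.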
- The actuator should not attempt to verify that an instance can be created before attempting to create the instance
- If the cloud provider does not have capacity, the Machine Health Checker can (given required MHC) remove the Machine after a period.
MachineSet will ensure the correct number of Machines are created.

- There’s a [2 hours time frame](https://github.com/openshift/cluster-machine-approver/blob/master/csr_check.go#L38) in OpenShift for a machine to become a node.
If the Spot request is not satisfied during this time frame, it is assumed that the instance will never become a node.
This machine would be remediated by a given MHC.
given required MHC
It makes it sound like we can't use spot instances without an MHC because otherwise we'll never get replacement machines.
So are we going to validate that, if the spot feature is being used, the machineset must have an MHC attached?
I think we might want to start by documenting this. As we gather more real-world feedback on MHC, I'd like us to explore a more generic higher-level abstraction which would possibly own and run both machineSets and MHC underneath, e.g. machineDeployment/nodePool or similar.
If it is relevant: I utilized spotinst a few years ago when providing services, and it saved ~80% of the EC2 costs.
If a MachineHealthCheck were deployed, these Machines would be considered unhealthy and would be deleted, creating a new Machine in its place.

With or without an MHC, the scale request would never manifest in new compute capacity joining the cluster.
After a half hour period, the autoscaler considers the MachineSet to be out of sync and rescales it to remove the `Failed` instances.
After a half hour period
This is tied to your particular autoscaler settings.
the autoscaler would deem these Machines as having unregistered nodes and, after a 15 minute period,
would request these unregistered nodes be deleted, mark the MachineSet as unhealthy and attempt to scale an alternative MachineSet.
This, while still not perfect, is preferable to the current state of the autoscaler;
assuming there is an on-demand based MachineSet to fall back to, this would be tried as a backup.
Maybe add a note that this is actually the current behaviour for the AWS provider as well: kubernetes/autoscaler#2235
Based on the [working implementation](https://github.com/kubernetes/autoscaler/pull/2235/files) of Spot instances in the AWS autoscaler provider,
if the autoscaler Machine API implementation were to provide fake provider IDs for Machines that have failed ([example](https://github.com/JoelSpeed/autoscaler/commit/11ebd1ffdadebbb20d2fac9aae30646b4f47dfa9)),
the autoscaler would deem these Machines as having unregistered nodes and, after a 15 minute period,
after a 15 minute period,
This is tied to your particular autoscaler settings.
the autoscaler would deem these Machines as having unregistered nodes and, after a 15 minute period,
would request these unregistered nodes be deleted, mark the MachineSet as unhealthy and attempt to scale an alternative MachineSet.
This, while still not perfect, is preferable to the current state of the autoscaler;
assuming there is an on-demand based MachineSet to fall back to, this would be tried as a backup.
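As a rough illustration of the fake provider ID idea quoted above, sketched over plain strings rather than the real Machine API types (the `fake://` scheme and the helper name are hypothetical; the linked commit is the authoritative example):

```go
package main

import "fmt"

// providerIDOrPlaceholder sketches the idea behind the linked autoscaler change:
// if a Machine has gone Failed without ever registering a node, report a
// placeholder provider ID so the core autoscaler counts it as an unregistered
// node and removes it after its configured timeout, rather than waiting for
// capacity that will never arrive.
func providerIDOrPlaceholder(realProviderID, phase, namespace, name string) string {
	if realProviderID != "" {
		return realProviderID
	}
	if phase == "Failed" {
		return fmt.Sprintf("fake://%s/%s", namespace, name)
	}
	// Machine is still provisioning; report nothing yet.
	return ""
}
```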
Maybe mention somehow that this satisfies the expectations of the core autoscaler backoff/health-checking mechanism.
Maybe mention that, to let the autoscaler choose a spot instance nodeGroup over one with on-demand instances, pod affinity can be leveraged so that only workloads suitable for spot instances result in scale in/out.
Looks like all concerns have been addressed. This seems good to implement and iterate on based on tangible feedback. If any update or more granularity is required we'll follow up with a separate PR.
- If the cloud provider does not have capacity, the Machine Health Checker can (given required MHC) remove the Machine after a period.
MachineSet will ensure the correct number of Machines are created.

- There’s a [2 hours time frame](https://github.com/openshift/cluster-machine-approver/blob/master/csr_check.go#L38) in OpenShift for a machine to become a node.
Alternatively, if we considered this a permanent failure, the machine could go `Failed` right away.
To launch an instance as a Spot instance on AWS, a [SpotMarketOptions](https://docs.aws.amazon.com/sdk-for-go/api/service/ec2/#SpotMarketOptions)
needs to be added to the `RunInstancesInput`. Within this there are 3 options that matter:
Which particular fields would we be exposing in the user-facing API?
This is explained below on lines L#138-L#151
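For context, a minimal aws-sdk-go sketch of attaching spot options to a `RunInstancesInput`. The fields shown (`MaxPrice`, `SpotInstanceType`, `InstanceInterruptionBehavior`) are plausible candidates for the three options the proposal discusses, and which of them end up exposed in the user-facing API is exactly the question above; the other values are placeholders.

```go
package main

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// buildSpotRunInstancesInput sketches attaching spot market options to an
// otherwise normal RunInstances request. Omitting MaxPrice defaults the
// bid ceiling to the on-demand price.
func buildSpotRunInstancesInput(ami, instanceType, maxPrice string) *ec2.RunInstancesInput {
	return &ec2.RunInstancesInput{
		ImageId:      aws.String(ami),
		InstanceType: aws.String(instanceType),
		MinCount:     aws.Int64(1),
		MaxCount:     aws.Int64(1),
		InstanceMarketOptions: &ec2.InstanceMarketOptionsRequest{
			MarketType: aws.String(ec2.MarketTypeSpot),
			SpotOptions: &ec2.SpotMarketOptions{
				// Field names match the SDK; which ones the Machine API
				// exposes to users is the open design question.
				MaxPrice:                     aws.String(maxPrice),
				SpotInstanceType:             aws.String(ec2.SpotInstanceTypeOneTime),
				InstanceInterruptionBehavior: aws.String(ec2.InstanceInterruptionBehaviorTerminate),
			},
		},
	}
}
```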
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: enxebre, JoelSpeed.
Implemented as described in openshift/enhancements#199
This adds a proposal for supporting Spot Instances in AWS, GCP and Azure as part of the Machine API