
cloudprovider: add DigitalOcean #2245

Merged (2 commits, Aug 9, 2019)

Conversation

@fatih (Contributor) commented Aug 8, 2019

This adds a new cluster autoscaler cloud provider for the DigitalOcean Kubernetes offering.

  • Because there is no native NodeGroup offering in DigitalOcean, this autoscaler can only be used within a managed DigitalOcean Kubernetes cluster.
  • Customers will be able to enable/disable it by adding tags to their node pools. Some of the valid tags are:
k8s-cluster-autoscaler-enabled:true
k8s-cluster-autoscaler-min:3
k8s-cluster-autoscaler-max:10
  • As suggested, I didn't vendor the client (cloudprovider/digitalocean/godo); instead, I added it as a sub-package.

For reviewers:

I've tried to keep each individual commit self-contained. The first commit (776bba5) adds our DigitalOcean API package called godo. The second and following commits are all related to the actual DigitalOcean cloud provider.

cc @timoreimann @snormore

cc @andrewsykim @MaciekPytel

Closes #254

@k8s-ci-robot added the size/XXL label (Denotes a PR that changes 1000+ lines, ignoring generated files.) on Aug 8, 2019
@fatih (Contributor Author) commented Aug 8, 2019

I've closed the previous PR as it was based on my personal repository. This new PR is the same, but the changes are based on https://github.com/digitalocean/autoscaler

@fejta-bot

Unknown CLA label state. Rechecking for CLA labels.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/check-cla

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) on Aug 8, 2019
@andrewsykim (Member):

/lgtm
/assign @MaciekPytel

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Aug 8, 2019

```
minimum number of nodes: 1
maximum number of nodes: 200
```
Contributor:

That is significantly different UX from other providers - I think all (?) existing providers require explicitly setting at least the max limit.

Contributor Author:

This is the default limit for a given node group right now; it's our upper limit. I can change it to 50 or something else if that makes more sense to you. In the UI (cloud.digitalocean.com) we're going to expose controls with sensible default values.

Contributor:

I don't have any specific recommendation for the value. I was referring to the fact that there is a default at all, rather than an explicit requirement to specify a value.

Contributor Author:

I see what you mean. This is just a way of limiting the hard maximum we allow. The user experience will be different and our API will validate requests.

@MaciekPytel (Contributor) left a comment:

Left a few comments, but overall looks quite good. Seems like @andrewsykim is also reviewing this? Happy to approve once he lgtms.

Also - feel free to add a digitalocean/OWNERS file with yourself and/or whoever is interested in maintaining this from the DO side. Core developers have only so much review capacity, and as the number of cloud providers grows we don't want to block internal fixes / improvements on our reviews.



```
make build-binary
```
Contributor:

I recommend building in docker using either make build-in-docker or make release. Building outside of docker is for developer convenience, but it's best effort only and we know it doesn't work in some environments.

Contributor Author:

I've just tested it and it seems like our API depends on https://github.com/google/go-querystring. Can I vendor this, or should I also copy it as a subfolder and rewrite the import paths of our API?

Contributor:

@mwielgus what do you think about this one?

Contributor:

So far we have been very strict about adding any cloud-provider-specific dependencies to the global CA vendor directory. To keep our integrity and not create a precedent (even for a small Google library), we would kindly ask you to contain everything your cloud provider needs in your own directory.

Contributor Author:

@mwielgus gotcha. I vendored everything locally into the digitalocean folder. I re-ran make build-in-docker and now it works without any issues. I've force-pushed the changes, as I rebased and squashed everything.

```
}

for _, node := range nodes {
	klog.V(5).Infof("checking node have: %q want: %q", node.Id, nodeID)
```
Contributor:

This will produce an incredible amount of logs in a decently sized cluster. There are some V(5) logs that are sometimes useful for debugging, but this makes V(5) unusable - can you make it V(6) or higher?

Contributor Author:

Made it V(6).
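
For reference, the change described above just bumps the verbosity of the existing log call, so the message is only emitted when the autoscaler runs at verbosity 6 or higher (e.g. --v=6):

```
// Same message as before, now only emitted at V(6) and above.
klog.V(6).Infof("checking node have: %q want: %q", node.Id, nodeID)
```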

```
	rl *cloudprovider.ResourceLimiter,
) cloudprovider.CloudProvider {
	var configFile io.ReadCloser
	if opts.CloudConfig != "" {
```
Contributor:

Shouldn't that error/fatal if CloudConfig == ""?

Contributor Author:

This follows the same layout as the other cloud providers. It doesn't have to error here, because in the next step newManager() will return an error due to the invalid configuration.
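
To make that flow concrete, here is a rough sketch of the builder pattern being described, assuming it mirrors other providers: an empty --cloud-config path is tolerated in the builder, and the manager constructor is the one that fails on the missing configuration. newManager, newDigitalOceanCloudProvider, and the exact option/import paths are illustrative assumptions, not the PR's actual code.

```
package digitalocean

import (
	"io"
	"os"

	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
	"k8s.io/autoscaler/cluster-autoscaler/config"
	"k8s.io/klog"
)

// buildDigitalOcean sketches the builder: it does not error on an empty
// CloudConfig; instead newManager (hypothetical) rejects the missing config.
func buildDigitalOcean(opts config.AutoscalingOptions, rl *cloudprovider.ResourceLimiter) cloudprovider.CloudProvider {
	var configFile io.ReadCloser
	if opts.CloudConfig != "" {
		var err error
		configFile, err = os.Open(opts.CloudConfig)
		if err != nil {
			klog.Fatalf("couldn't open cloud provider configuration %s: %v", opts.CloudConfig, err)
		}
		defer configFile.Close()
	}

	// With an empty CloudConfig, configFile stays nil and newManager is
	// expected to return an error for the missing configuration.
	manager, err := newManager(configFile)
	if err != nil {
		klog.Fatalf("failed to create DigitalOcean manager: %v", err)
	}
	return newDigitalOceanCloudProvider(manager, rl)
}
```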

```
# Configuration

The `cluster-autoscaler` dynamically runs based on tags associated with node
pools. These are the current valid tags:
```
Contributor:

This is completely different from how you run CA on other providers. Any particular reason for this design?

Contributor Author:

This lets us dynamically enable/disable the cluster-autoscaler and keeps it disabled by default; to opt in, you need to add those tags. Because our UI is not ready yet, this will allow some customers to enable it by adding the tags themselves. It also doesn't require us to fiddle with CA flag arguments. It's easier to maintain and ship the cluster-autoscaler, and it requires zero to minimal interaction once it's deployed. Is there something you believe is wrong with this approach?

Contributor:

Oh, I didn't realize you're planning to host the CA pod (or at least start it automatically). I don't think there is anything inherently wrong with your approach; it's just inconsistent with other providers.
Given that autodiscovery already works differently for each provider, I don't think it's a blocker for merging this either. Just a suboptimal experience for someone moving between clouds / running multicloud / etc.

```
min, err := strconv.Atoi(value)
if err != nil {
	return nil, fmt.Errorf("invalid minimum nodes: %q", value)
}
```
Contributor:

Explicitly return an error for min=0? Currently you just silently override it to 1. Many other providers support scale-to-0 so it's possible for someone to assume digitalocean also supports it and just set it.

Contributor Author:

We don't support 0 yet :/ Hence I return an error here. This is one of the shortcomings on our end; we're working on fixing it.

Contributor:

My comment is that you don't actually return any error. You just treat 0 as 'not set' and silently default to 1.

Contributor:

Interestingly it seems like you can also set max to a negative value as long as you don't set min?

Contributor Author (@fatih, Aug 9, 2019):

Yeah, you're right. I added this test case and it fails:

```
{
	name: "bad tags - max is set to negative, no min",
	tags: []string{
		"k8s-cluster-autoscaler-enabled:true",
		"k8s-cluster-autoscaler-max:-5",
	},
	wantErr: true,
},
```

I'll make sure to fix it.
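
A hedged sketch of the validation being discussed: reject a zero or negative minimum (scale-to-0 isn't supported yet) and a non-positive maximum instead of silently defaulting, and also reject max < min. The tag names come from the PR description and the defaults (min 1, max 200) from the earlier thread; the function and struct names are purely illustrative, not the PR's actual code.

```
package digitalocean

import (
	"fmt"
	"strconv"
	"strings"
)

// nodeGroupLimits is an illustrative struct, not the PR's actual type.
type nodeGroupLimits struct {
	minNodes int
	maxNodes int
}

// parseAutoscalerTags parses the k8s-cluster-autoscaler-min/-max tags and
// rejects values the provider cannot honor, instead of silently clamping.
func parseAutoscalerTags(tags []string) (*nodeGroupLimits, error) {
	limits := &nodeGroupLimits{minNodes: 1, maxNodes: 200}

	for _, tag := range tags {
		kv := strings.SplitN(tag, ":", 2)
		if len(kv) != 2 {
			continue
		}
		switch kv[0] {
		case "k8s-cluster-autoscaler-min":
			min, err := strconv.Atoi(kv[1])
			if err != nil || min < 1 {
				return nil, fmt.Errorf("invalid minimum nodes: %q", kv[1])
			}
			limits.minNodes = min
		case "k8s-cluster-autoscaler-max":
			max, err := strconv.Atoi(kv[1])
			if err != nil || max < 1 {
				return nil, fmt.Errorf("invalid maximum nodes: %q", kv[1])
			}
			limits.maxNodes = max
		}
	}

	if limits.maxNodes < limits.minNodes {
		return nil, fmt.Errorf("maximum nodes (%d) must not be less than minimum nodes (%d)",
			limits.maxNodes, limits.minNodes)
	}
	return limits, nil
}
```

With this shape, the failing test case above ("k8s-cluster-autoscaler-max:-5" with no min) is rejected by the max < 1 check rather than being accepted.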

```
		targetSize, delta, updatedNodePool.Count)
}

return nil
```
Contributor:

You seem to cache TargetSize, in which case you should update it after changing it.

```
		n.clusterID, n.id, nodeID, err)
	}
}
```

Contributor:

Update cached TargetSize?

```
return fmt.Errorf("couldn't increase size to %d (delta: %d). Current size is: %d",
	targetSize, delta, updatedNodePool.Count)
}
```

Contributor:

Update cached TargetSize?
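
Roughly what the reviewer is asking for, sketched against the IncreaseSize snippet quoted above: once the API accepts the resize, refresh the cached node pool immediately so TargetSize() doesn't return a stale value. The field and helper names (n.manager, n.nodePool, updateNodePoolCount) are assumptions for illustration, not the PR's actual API.

```
// Sketch of IncreaseSize with the cached target size kept in sync.
func (n *NodeGroup) IncreaseSize(delta int) error {
	targetSize := n.nodePool.Count + delta

	updatedNodePool, err := n.manager.updateNodePoolCount(n.clusterID, n.id, targetSize)
	if err != nil {
		return err
	}
	if updatedNodePool.Count != targetSize {
		return fmt.Errorf("couldn't increase size to %d (delta: %d). Current size is: %d",
			targetSize, delta, updatedNodePool.Count)
	}

	// Update the cached node pool so TargetSize() reflects the accepted count.
	n.nodePool = updatedNodePool
	return nil
}
```

The same pattern would apply to the delete path: after a node is successfully removed, decrement or re-fetch the cached count.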

```
if n.nodePool == nil {
	return nil, errors.New("node pool instance is not created")
}
return toInstances(n.nodePool.Nodes), nil
```
Contributor:

To get proper error handling you should return 'placeholder' instances for any instance that doesn't exist yet (n.nodePool.Count - len(n.nodePool.Nodes)). Unless this is always 0?

It's fine to add it in a separate PR later. #2235 is a recent example of this and could be used as a reference.
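
A hedged sketch of the placeholder approach from the referenced PR, applied to the Nodes() snippet above: report one synthetic instance per node that has been requested but not created yet, so the core autoscaler can track in-flight creations. The placeholder ID scheme and the use of InstanceCreating are assumptions about how this would be wired up here.

```
// Nodes returns an Instance for every node the pool should have; nodes that
// do not exist yet are reported as placeholders in the creating state.
func (n *NodeGroup) Nodes() ([]cloudprovider.Instance, error) {
	if n.nodePool == nil {
		return nil, errors.New("node pool instance is not created")
	}

	instances := toInstances(n.nodePool.Nodes)

	// One placeholder per node that was requested but doesn't exist yet.
	for i := len(n.nodePool.Nodes); i < n.nodePool.Count; i++ {
		instances = append(instances, cloudprovider.Instance{
			Id:     fmt.Sprintf("placeholder-%s-%d", n.id, i), // assumed ID scheme
			Status: &cloudprovider.InstanceStatus{State: cloudprovider.InstanceCreating},
		})
	}
	return instances, nil
}
```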

@andrewsykim (Member):

> Left a few comments, but overall looks quite good. Seems like @andrewsykim is also reviewing this? Happy to approve once he lgtms.

Left most of my review comments in the previous PR #2227. Overall PR LGTM as well.

> Also - feel free to add a digitalocean/OWNERS file with yourself and/or whoever is interested in maintaining this from the DO side. Core developers have only so much review capacity, and as the number of cloud providers grows we don't want to block internal fixes / improvements on our reviews.

@fatih feel free to add me here as well if you need someone in the OWNERS file (note you need to be a Kubernetes org member for this to work).

@fatih (Contributor Author) commented Aug 9, 2019

@andrewsykim sounds good, thanks! I've added you to the reviewers section, along with our @timoreimann, as I think he is a Kubernetes org member.

@k8s-ci-robot added the do-not-merge/invalid-owners-file label (Indicates that a PR should not merge because it has an invalid OWNERS file in it.) and removed the lgtm label on Aug 9, 2019
case "max":
max, err := strconv.Atoi(value)
if err != nil {
return nil, fmt.Errorf("invalid minimum nodes: %q", value)
Contributor:

s/minimum/maximum

Contributor Author:

Fixed

@MaciekPytel (Contributor):

Please squash your commits once you're ready to merge, so that there are just 2: adding client and implementation.

@fatih force-pushed the do-cloudprovider branch from ea1c84b to 18abbbe on August 9, 2019 11:01
@k8s-ci-robot removed the do-not-merge/invalid-owners-file label (Indicates that a PR should not merge because it has an invalid OWNERS file in it.) on Aug 9, 2019
@fatih (Contributor Author) commented Aug 9, 2019

> Please squash your commits once you're ready to merge, so that there are just 2: adding client and implementation.

@MaciekPytel just did it. The PR now only contains two commits, including recent fixes around zero minimum and negative maxes.

There is only one conversation that is still unresolved on our side: #2245 (comment). What should I do here?

@fatih force-pushed the do-cloudprovider branch from 18abbbe to c2e07c0 on August 9, 2019 13:15
@fatih (Contributor Author) commented Aug 9, 2019

@MaciekPytel everything is settled now. I've addressed all your comments (I don't think I missed anything). I also squashed and split all my changes into two commits: the first contains the client (including its dependencies) inside the digitalocean folder, and the second contains the actual cloud provider implementation. PTAL

@mwielgus (Contributor) left a comment:

/lgtm
/approve

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Aug 9, 2019
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mwielgus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Aug 9, 2019
@k8s-ci-robot merged commit 8303a23 into kubernetes:master on Aug 9, 2019
@micahhausler (Member):

I'm new to this project, but why is the whole godo source copied into a subdirectory instead of just adding godo as a vendored dependency? This seems to break go mod in awful ways.

@fatih (Contributor Author) commented Aug 30, 2019

@micahhausler I also wanted to use go mod, but the requirement was that new cloud providers are not allowed to vendor any code. Hence we had to copy all dependencies into subfolders.

@micahhausler (Member):

Hmm, that seems strange. Where is that documented?

@fatih (Contributor Author) commented Aug 30, 2019

I'm no longer involved with this project; please ask #sig-autoscaling in Slack or @MaciekPytel.

snormore pushed a commit to digitalocean/autoscaler that referenced this pull request Sep 17, 2019
snormore pushed a commit to digitalocean/autoscaler that referenced this pull request Sep 17, 2019
Labels
  • approved - Indicates a PR has been approved by an approver from all required OWNERS files.
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • lgtm - "Looks good to me", indicates that a PR is ready to be merged.
  • size/XXL - Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues:
  • Implement DigitalOcean cloud provider
7 participants