
cloudprovider: add DigitalOcean #2245

Merged (2 commits, Aug 9, 2019)

Conversation

@fatih (Contributor) commented Aug 8, 2019

This adds a new cluster autoscaler cloud provider for the DigitalOcean Kubernetes offering.

  • Because there is no native NodeGroup offering in DigitalOcean, this autoscaler can only be used within a managed DigitalOcean Kubernetes cluster.
  • Customers will be able to enable/disable it by adding tags to their node pools. Some of the valid tags are:
k8s-cluster-autoscaler-enabled:true
k8s-cluster-autoscaler-min:3
k8s-cluster-autoscaler-max:10
  • As suggested, I didn't vendor the client (cloudprovider/digitalocean/godo); instead, I added it as a sub-package.

For reviewers:

I've tried to keep each individual commit self-contained. The first commit (776bba5) adds our DigitalOcean API package called godo. The second and following commits are all related to the actual DigitalOcean cloud provider.

cc @timoreimann @snormore

cc @andrewsykim @MaciekPytel

Closes #254

@k8s-ci-robot added the size/XXL label (Denotes a PR that changes 1000+ lines, ignoring generated files.) on Aug 8, 2019
@fatih (Contributor Author) commented Aug 8, 2019

I've closed the previous PR as it was based on my personal repository. This new PR is the same, but the changes are based on https://github.com/digitalocean/autoscaler

@fejta-bot

Unknown CLA label state. Rechecking for CLA labels.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/check-cla

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) on Aug 8, 2019
@andrewsykim (Member):

/lgtm
/assign @MaciekPytel

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Aug 8, 2019

```
minimum number of nodes: 1
maximum number of nodes: 200
```
Contributor:

That is significantly different UX from other providers - I think all (?) existing providers require explicitly setting at least the max limit.

Contributor Author:

This is the default limit for a given node group right now; it's our upper limit. I can change it to 50 or something else if that makes more sense to you. In the UI (cloud.digitalocean.com) we're going to expose controls with sensible default values.

Contributor:

I don't have any specific recommendation for the value. I was referring to the fact that there is a default at all, rather than an explicit requirement to specify a value.

Contributor Author:

I see what you mean. This is just a way of limiting the hard maximum we allow. The user experience will be different and our API will validate requests.

@MaciekPytel (Contributor) left a comment:

Left a few comments, but overall looks quite good. Seems like @andrewsykim is also reviewing this? Happy to approve once he lgtms.

Also - feel free to add a digitalocean/OWNERS file with yourself and/or whoever is interested in maintaining this from the DO side. Core developers have only so much review capacity, and as the number of cloud providers grows we don't want to block internal fixes / improvements on our reviews.



```
make build-binary
```
Contributor:

I recommend building in docker using either make build-in-docker or make release. Building outside of docker is for developer convenience, but it's best effort only and we know it doesn't work in some environments.

Contributor Author:

I've just tested it and it seems like our API depends on https://github.com/google/go-querystring. Can I vendor this, or should I also copy it as a subfolder and rewrite the import paths of our API?

Contributor:

@mwielgus what do you think about this one?

Contributor:

So far we have been very strict about adding any cloud-provider-specific dependencies to the global CA vendor directory. To keep our integrity and not create a precedent (even for a small Google library), we would kindly ask you to contain everything your cloud provider needs in your own directory.

Contributor Author:

@mwielgus gotcha. I vendored everything locally into the digitalocean folder. I re-ran make build-in-docker and now it works without any issues. I've force-pushed the changes, as I rebased and squashed everything.

```
}

for _, node := range nodes {
	klog.V(5).Infof("checking node have: %q want: %q", node.Id, nodeID)
```
Contributor:

This will produce an incredible amount of logs in a decently sized cluster. There are some V(5) logs that are sometimes useful for debugging, but this makes V(5) unusable - can you make it V(6) or higher?

Contributor Author:

Made it V(6).
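
For reference, the change described above just bumps the verbosity of the existing log call, so the message is only emitted when the autoscaler runs at verbosity 6 or higher (e.g. --v=6):

```
// Same message as before, now only emitted at V(6) and above.
klog.V(6).Infof("checking node have: %q want: %q", node.Id, nodeID)
```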

```
	rl *cloudprovider.ResourceLimiter,
) cloudprovider.CloudProvider {
	var configFile io.ReadCloser
	if opts.CloudConfig != "" {
```
Contributor:

Shouldn't that error/fatal if CloudConfig == ""?

Contributor Author:

This follows the same layout as the other cloud providers. It doesn't have to error here, because in the next step newManager() will return an error due to the invalid configuration.
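
To make that flow concrete, here is a rough sketch of the builder pattern being described, assuming it mirrors other providers: an empty --cloud-config path is tolerated in the builder, and the manager constructor is the one that fails on the missing configuration. newManager, newDigitalOceanCloudProvider, and the exact option/import paths are illustrative assumptions, not the PR's actual code.

```
package digitalocean

import (
	"io"
	"os"

	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
	"k8s.io/autoscaler/cluster-autoscaler/config"
	"k8s.io/klog"
)

// buildDigitalOcean sketches the builder: it does not error on an empty
// CloudConfig; instead newManager (hypothetical) rejects the missing config.
func buildDigitalOcean(opts config.AutoscalingOptions, rl *cloudprovider.ResourceLimiter) cloudprovider.CloudProvider {
	var configFile io.ReadCloser
	if opts.CloudConfig != "" {
		var err error
		configFile, err = os.Open(opts.CloudConfig)
		if err != nil {
			klog.Fatalf("couldn't open cloud provider configuration %s: %v", opts.CloudConfig, err)
		}
		defer configFile.Close()
	}

	// With an empty CloudConfig, configFile stays nil and newManager is
	// expected to return an error for the missing configuration.
	manager, err := newManager(configFile)
	if err != nil {
		klog.Fatalf("failed to create DigitalOcean manager: %v", err)
	}
	return newDigitalOceanCloudProvider(manager, rl)
}
```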

```
# Configuration

The `cluster-autoscaler` dynamically runs based on tags associated with node
pools. These are the current valid tags:
```
Contributor:

This is completely different from how you run CA on other providers. Any particular reason for this design?

Contributor Author:

This lets us dynamically enable/disable the cluster-autoscaler and keeps it disabled by default; to opt in, you need to add those tags. Because our UI is not ready yet, this will allow some customers to enable it by adding the tags themselves. It also doesn't require us to fiddle with CA flag arguments. It's easier to maintain and ship the cluster-autoscaler, and it requires zero to minimal interaction once it's deployed. Is there something you believe is wrong with this approach?

Contributor:

Oh, I didn't realize you're planning to host the CA pod (or at least start it automatically). I don't think there is anything inherently wrong with your approach; it's just inconsistent with other providers.
Given that autodiscovery already works differently for each provider, I don't think it's a blocker for merging this either. Just a suboptimal experience for someone moving between clouds / running multicloud / etc.

```
min, err := strconv.Atoi(value)
if err != nil {
	return nil, fmt.Errorf("invalid minimum nodes: %q", value)
}
```
Contributor:

Explicitly return an error for min=0? Currently you just silently override it to 1. Many other providers support scale-to-0 so it's possible for someone to assume digitalocean also supports it and just set it.

Contributor Author:

We don't support 0 yet :/ Hence I return an error here. This is one of the shortcomings on our end; we're working on fixing it.

Contributor:

My comment is that you don't actually return any error. You just treat 0 as 'not set' and silently default to 1.

Contributor:

Interestingly it seems like you can also set max to a negative value as long as you don't set min?

Contributor Author (@fatih, Aug 9, 2019):

Yeah, you're right. I added this test case and it fails:

```
{
	name: "bad tags - max is set to negative, no min",
	tags: []string{
		"k8s-cluster-autoscaler-enabled:true",
		"k8s-cluster-autoscaler-max:-5",
	},
	wantErr: true,
},
```

I'll make sure to fix it.
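
A hedged sketch of the validation being discussed: reject a zero or negative minimum (scale-to-0 isn't supported yet) and a non-positive maximum instead of silently defaulting, and also reject max < min. The tag names come from the PR description and the defaults (min 1, max 200) from the earlier thread; the function and struct names are purely illustrative, not the PR's actual code.

```
package digitalocean

import (
	"fmt"
	"strconv"
	"strings"
)

// nodeGroupLimits is an illustrative struct, not the PR's actual type.
type nodeGroupLimits struct {
	minNodes int
	maxNodes int
}

// parseAutoscalerTags parses the k8s-cluster-autoscaler-min/-max tags and
// rejects values the provider cannot honor, instead of silently clamping.
func parseAutoscalerTags(tags []string) (*nodeGroupLimits, error) {
	limits := &nodeGroupLimits{minNodes: 1, maxNodes: 200}

	for _, tag := range tags {
		kv := strings.SplitN(tag, ":", 2)
		if len(kv) != 2 {
			continue
		}
		switch kv[0] {
		case "k8s-cluster-autoscaler-min":
			min, err := strconv.Atoi(kv[1])
			if err != nil || min < 1 {
				return nil, fmt.Errorf("invalid minimum nodes: %q", kv[1])
			}
			limits.minNodes = min
		case "k8s-cluster-autoscaler-max":
			max, err := strconv.Atoi(kv[1])
			if err != nil || max < 1 {
				return nil, fmt.Errorf("invalid maximum nodes: %q", kv[1])
			}
			limits.maxNodes = max
		}
	}

	if limits.maxNodes < limits.minNodes {
		return nil, fmt.Errorf("maximum nodes (%d) must not be less than minimum nodes (%d)",
			limits.maxNodes, limits.minNodes)
	}
	return limits, nil
}
```

With this shape, the failing test case above ("k8s-cluster-autoscaler-max:-5" with no min) is rejected by the max < 1 check rather than being accepted.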

```
		targetSize, delta, updatedNodePool.Count)
}

return nil
```
Contributor:

You seem to cache TargetSize, in which case you should update it after changing it.

```
		n.clusterID, n.id, nodeID, err)
	}
}
```

Contributor:

Update cached TargetSize?

```
return fmt.Errorf("couldn't increase size to %d (delta: %d). Current size is: %d",
	targetSize, delta, updatedNodePool.Count)
}
```

Contributor:

Update cached TargetSize?
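
Roughly what the reviewer is asking for, sketched against the IncreaseSize snippet quoted above: once the API accepts the resize, refresh the cached node pool immediately so TargetSize() doesn't return a stale value. The field and helper names (n.manager, n.nodePool, updateNodePoolCount) are assumptions for illustration, not the PR's actual API.

```
// Sketch of IncreaseSize with the cached target size kept in sync.
func (n *NodeGroup) IncreaseSize(delta int) error {
	targetSize := n.nodePool.Count + delta

	updatedNodePool, err := n.manager.updateNodePoolCount(n.clusterID, n.id, targetSize)
	if err != nil {
		return err
	}
	if updatedNodePool.Count != targetSize {
		return fmt.Errorf("couldn't increase size to %d (delta: %d). Current size is: %d",
			targetSize, delta, updatedNodePool.Count)
	}

	// Update the cached node pool so TargetSize() reflects the accepted count.
	n.nodePool = updatedNodePool
	return nil
}
```

The same pattern would apply to the delete path: after a node is successfully removed, decrement or re-fetch the cached count.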

```
if n.nodePool == nil {
	return nil, errors.New("node pool instance is not created")
}
return toInstances(n.nodePool.Nodes), nil
```
Contributor:

To get proper error handling you should return 'placeholder' instances for any instance that doesn't exist yet (n.nodePool.Count - len(n.nodePool.Nodes)). Unless this is always 0?

It's fine to add it in a separate PR later. #2235 is a recent example of this and could be used as a reference.
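
A hedged sketch of the placeholder approach from the referenced PR, applied to the Nodes() snippet above: report one synthetic instance per node that has been requested but not created yet, so the core autoscaler can track in-flight creations. The placeholder ID scheme and the use of InstanceCreating are assumptions about how this would be wired up here.

```
// Nodes returns an Instance for every node the pool should have; nodes that
// do not exist yet are reported as placeholders in the creating state.
func (n *NodeGroup) Nodes() ([]cloudprovider.Instance, error) {
	if n.nodePool == nil {
		return nil, errors.New("node pool instance is not created")
	}

	instances := toInstances(n.nodePool.Nodes)

	// One placeholder per node that was requested but doesn't exist yet.
	for i := len(n.nodePool.Nodes); i < n.nodePool.Count; i++ {
		instances = append(instances, cloudprovider.Instance{
			Id:     fmt.Sprintf("placeholder-%s-%d", n.id, i), // assumed ID scheme
			Status: &cloudprovider.InstanceStatus{State: cloudprovider.InstanceCreating},
		})
	}
	return instances, nil
}
```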

@andrewsykim (Member):

> Left a few comments, but overall looks quite good. Seems like @andrewsykim is also reviewing this? Happy to approve once he lgtms.

Left most of my review comments in the previous PR #2227. Overall PR LGTM as well.

> Also - feel free to add a digitalocean/OWNERS file with yourself and/or whoever is interested in maintaining this from the DO side. Core developers have only so much review capacity, and as the number of cloud providers grows we don't want to block internal fixes / improvements on our reviews.

@fatih feel free to add me here as well if you need someone in the OWNERS file (note you need to be a Kubernetes org member for this to work).

@fatih (Contributor Author) commented Aug 9, 2019

@andrewsykim sounds good, thanks! I've added you to the reviewers section, along with our @timoreimann, as I think he is a Kubernetes org member.

@k8s-ci-robot added the do-not-merge/invalid-owners-file label (Indicates that a PR should not merge because it has an invalid OWNERS file in it.) and removed the lgtm label on Aug 9, 2019
case "max":
max, err := strconv.Atoi(value)
if err != nil {
return nil, fmt.Errorf("invalid minimum nodes: %q", value)
Contributor:

s/minimum/maximum

Contributor Author:

Fixed

@MaciekPytel (Contributor):

Please squash your commits once you're ready to merge, so that there are just 2: adding client and implementation.

@fatih force-pushed the do-cloudprovider branch from ea1c84b to 18abbbe on August 9, 2019 11:01
@k8s-ci-robot removed the do-not-merge/invalid-owners-file label (Indicates that a PR should not merge because it has an invalid OWNERS file in it.) on Aug 9, 2019
@fatih (Contributor Author) commented Aug 9, 2019

> Please squash your commits once you're ready to merge, so that there are just 2: adding client and implementation.

@MaciekPytel just did it. The PR now only contains two commits, including recent fixes around zero minimum and negative maxes.

There is only one conversation that is still unresolved on our side: #2245 (comment). What should I do here?

@fatih force-pushed the do-cloudprovider branch from 18abbbe to c2e07c0 on August 9, 2019 13:15
@fatih (Contributor Author) commented Aug 9, 2019

@MaciekPytel everything is settled now. I've addressed all your comments (I don't think I missed anything). I also squashed and split all my changes into two commits: the first contains the client (including its dependencies) inside the digitalocean folder, and the second contains the actual cloud provider implementation. PTAL

@mwielgus (Contributor) left a comment:

/lgtm
/approve

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Aug 9, 2019
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mwielgus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Aug 9, 2019
@k8s-ci-robot merged commit 8303a23 into kubernetes:master on Aug 9, 2019
@micahhausler (Member):

I'm new to this project, but why is the whole godo source copied into a subdirectory instead of just adding godo as a vendored dependency? This seems to break go mod in awful ways.

@fatih (Contributor Author) commented Aug 30, 2019

@micahhausler I also wanted to use go mod, but the requirement was that new cloud providers are not allowed to vendor any code. Hence we had to copy all dependencies into subfolders.

@micahhausler (Member):

Hmm, that seems strange. Where is that documented?

@fatih (Contributor Author) commented Aug 30, 2019

I'm no longer involved with this project; please ask #sig-autoscaling in Slack or @MaciekPytel.

snormore pushed a commit to digitalocean/autoscaler that referenced this pull request Sep 17, 2019
snormore pushed a commit to digitalocean/autoscaler that referenced this pull request Sep 17, 2019
Labels
  • approved - Indicates a PR has been approved by an approver from all required OWNERS files.
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • lgtm - "Looks good to me", indicates that a PR is ready to be merged.
  • size/XXL - Denotes a PR that changes 1000+ lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues:
  • Implement DigitalOcean cloud provider
7 participants