This repository has been archived by the owner on Sep 4, 2021. It is now read-only.

multi-node/aws: Add support for HA controllers. #147

Closed
wants to merge 3 commits

Conversation

eliaslevy
Contributor

This

  • replaces the single controller with an autoscale group
  • replaces the static controller IP address and elastic controller IP address with an internal elastic load balancer and assigns it a domain name using Route 53
  • clusters etcd in the controller nodes using the monsantoco/etcd-aws-cluster container.

Podmaster was already configured in the controller, so the scheduler and controller-manager were already configured to support HA.

The monsantoco/etcd-aws-cluster container is used rather than etcd's discovery mechanism or a static configuration because it uses AWS tools to discover the members of the controllers' autoscale group. That allows the container not only to start a new cluster, but also to manage the removal and addition of cluster members as the membership of the autoscale group changes, for a fully automated solution.
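
For context, the container works by writing what it learns from the autoscale group to an environment file (per its documentation, /etc/sysconfig/etcd-peers), which etcd then consumes. A minimal sketch of that consumption side, assuming etcd2 and the documented file path; the drop-in name and unit wiring here are illustrative, not the exact cloud-config in this pull:

    - name: etcd2.service
      command: start
      drop-ins:
        - name: 30-etcd-peers.conf
          content: |
            [Unit]
            # Let etcd-peers.service (the etcd-aws-cluster run) write the
            # peer file before etcd starts.
            Requires=etcd-peers.service
            After=etcd-peers.service

            [Service]
            # The generated file supplies ETCD_NAME, ETCD_INITIAL_CLUSTER and
            # ETCD_INITIAL_CLUSTER_STATE ("new" or "existing").
            EnvironmentFile=/etc/sysconfig/etcd-peers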

You probably don't want to merge this as is. On my own branch I've moved to native AWS networking, so I've made no attempt to configure etcd on worker nodes. Flannel appeared to be the only etcd user on the workers, and with it gone on my branch, I've removed references to etcd from the workers' configuration. I've also closed incoming access over the Internet and use a VPN access server to reach the cluster's VPC, so the load balancer and the hostname assigned to it are internal in this pull request; you probably want to make them both internal and external if you are keeping external access to the cluster.

Conflicts:
	multi-node/aws/pkg/cluster/cluster.go
	multi-node/aws/pkg/cluster/stack_template.go
The apiserver load balancer is internal and attached to the Kubernetes subnet. An internal hostname for the LB is registered with Route 53. It has the form "kubernetes.<cluster_name>.cluster.local". This LB name is added to the apiserver TLS server certificate. The worker kubelets are pointed to the LB name.
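
Roughly, the pieces this adds to the CloudFormation stack look like the following. This is only a sketch, shown as CloudFormation YAML for readability; the actual template is generated from Go in stack_template.go, and the resource names, ports, and hosted zone below are illustrative, not what the template actually emits:

    Resources:
      ElbAPIServer:                       # internal ELB fronting the controller ASG
        Type: AWS::ElasticLoadBalancing::LoadBalancer
        Properties:
          Scheme: internal
          Subnets:
            - Ref: SubnetKubernetes       # hypothetical subnet resource
          SecurityGroups:
            - Ref: SecurityGroupController
          Listeners:
            - LoadBalancerPort: "443"
              InstancePort: "443"
              Protocol: TCP
          HealthCheck:
            Target: TCP:443
            HealthyThreshold: "3"
            UnhealthyThreshold: "3"
            Interval: "10"
            Timeout: "5"
      RecordSetAPIServer:                 # kubernetes.<cluster_name>.cluster.local -> ELB
        Type: AWS::Route53::RecordSet
        Properties:
          HostedZoneId:
            Ref: HostedZoneCluster        # hypothetical private zone for <cluster_name>.cluster.local
          Name: kubernetes.<cluster_name>.cluster.local.
          Type: CNAME
          TTL: "300"
          ResourceRecords:
            - Fn::GetAtt: [ElbAPIServer, DNSName]

The controller autoscale group then references the load balancer (e.g. via LoadBalancerNames), and the worker kubelets use the Route 53 name as their API endpoint, which is why that name also has to appear in the apiserver's TLS certificate.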

Conflicts:
	multi-node/aws/pkg/cluster/cluster.go
Rather than use etcd's discovery mechanism or a static configuration, this makes use of the monsantoco/etcd-aws-cluster container, which uses AWS tools to discover the members of the controllers' autoscale group. This container is capable not only of starting a new cluster, but also of managing the removal and addition of cluster members as the membership of the autoscale group changes, for a fully automated solution.

References and access to etcd are removed from worker nodes, as my branch no longer makes use of flannel, which was the only etcd client on the workers.

units:
- name: etcd-peers.service
Contributor

Were you able to explore any other options than this? Is there no way to assign static IPs at deploy time and have a static initial-cluster option?

Contributor Author

I didn't care much for using the etcd discovery service. I could have gone with elastic IPs, but after reading http://engineering.monsanto.com/2015/06/12/etcd-clustering/ this seemed like a better approach to me.

Contributor

It also seems like using the ASG would be racy, when we need absolute determinism to bootstrap this cluster. How does this work in a way that makes it safe to run in parallel on several machines?

Contributor Author

Given that we want to run in HA mode, by definition it has to work in parallel on several machines. ;-) So far we've had no issues with it, but I am guessing there may be some scenarios it does not handle. Alas, I've not explored them. You might want to ping the Monsanto folks who wrote it. They probably have experience running it on larger clusters for far longer and can give you a better idea about the failure scenarios.

Contributor

Ok, that's good to hear. I'm wary of relying on a tool for management of etcd that I have zero experience with, and haven't even begun to understand. I'm also thinking that the controller cluster should be separated from the etcd cluster soon. Have you considered this separation?

@mgoodness

@bcwaldon I'm using the Monsanto approach in exactly the scenario you described. The etcd ASG spins up first, using the etcd-aws-cluster container for node registration. Then the Kubernetes controller cluster is started, which also makes use of the container. The etcd ASG name is passed in through an environment variable, and etcd-aws-cluster configures proxy mode. Here's the relevant section of my controller cloud-config:

    - name: etcd-peers.service
      command: start
      content: |
        [Unit]
        Description=Write peers to file (for bootstrapping cluster)
        Requires=docker.service
        After=docker.service

        [Service]
        ExecStartPre=/usr/bin/docker pull monsantoco/etcd-aws-cluster:latest
        ExecStart=/usr/bin/docker run --rm -e PROXY_ASG=${ETCD_ASG} \
          -e ETCD_CLIENT_SCHEME=https -e ETCD_PEER_SCHEME=https \
          -e ETCD_CURLOPTS="--cacert /etc/ssl/etcd/ca.pem \
            --cert /etc/ssl/etcd/client.pem \
            --key /etc/ssl/etcd/client-key.pem" \
          -v /etc/sysconfig/:/etc/sysconfig/ \
          -v /etc/ssl/etcd:/etc/ssl/etcd \
          monsantoco/etcd-aws-cluster:latest
        ExecStartPost=/usr/bin/chown -R etcd /etc/ssl/etcd
        Restart=on-failure
        RestartSec=10

It's worked very well so far. There's another project out there that is using a single Go executable to do the same kind of dynamic bootstrapping; it was inspired by the Monsanto work. Not nearly as flexible, though.
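
One wiring detail: ${ETCD_ASG} has to come from somewhere, either substituted into the cloud-config at provisioning time or defined for the unit so that systemd can expand it, for example (the ASG name here is just a placeholder):

    [Service]
    Environment=ETCD_ASG=my-etcd-cluster-asg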

@danielwhatmuff

Is support for HA controllers coming soon?

mumoshu added a commit to mumoshu/coreos-kubernetes that referenced this pull request Apr 27, 2016
One step forward toward achieving high availability throughout the cluster.
This change allows you to specify multiple subnets in cluster.yaml so that the workers' ASG spreads instances over those subnets. Placing each subnet in a different availability zone results in H/A for the workers.
Beware that this change by itself does nothing for H/A of the masters.

Possibly relates to coreos#147, coreos#100
mumoshu added a commit to mumoshu/coreos-kubernetes that referenced this pull request May 18, 2016
One step forward toward achieving high availability throughout the cluster.
This change allows you to specify multiple subnets in cluster.yaml so that the workers' ASG spreads instances over those subnets. Placing each subnet in a different availability zone results in H/A for the workers.
Beware that this change by itself does nothing for H/A of the masters.

Possibly relates to coreos#147, coreos#100
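
For reference, the multi-subnet support in the commits above is configured in cluster.yaml along these lines. The key names (availabilityZone, instanceCIDR) and values are my best recollection of the schema and are illustrative; check the referenced commits for the exact form:

    # cluster.yaml (excerpt): one worker subnet per availability zone
    subnets:
      - availabilityZone: us-west-1a
        instanceCIDR: "10.0.1.0/24"
      - availabilityZone: us-west-1b
        instanceCIDR: "10.0.2.0/24"
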
@aaronlevy
Contributor

I am going to close this PR as the underlying functionality has changed significantly since the time this was proposed. If this should be re-opened, please let me know.

aaronlevy closed this Aug 11, 2016