Production Quality Deployment #9

pieterlange · 2016-10-29T13:41:03Z

pieterlange · 2016-10-29T13:55:54Z

Initial comments on this:

The etcd setup still needs some love. I have a monkeypatch at https://github.com/pieterlange/kube-aws/tree/feature/external-etcd for pointing the cluster to an external etcd cluster (i'm using https://crewjam.com/etcd-aws/). This is not a clean solution. I think we need to:

provide a decent method of referring to an external etcd cluster
backport some essential features like backup and cluster recovery from https://github.com/crewjam/etcd-aws

Some work is being done to have an entirely self-hosted kubernetes cluster (with etcd running as petset in kubernetes itself?) but from an ops PoV this feels like way too many moving parts at the moment.

As for elasticsearch/heapster i tend to move in the exact opposite direction: i'd rather host elasticsearch inside of the cluster. I'm also not sure if this should be part of a default installation.

camilb · 2016-10-29T15:59:45Z

@pieterlange Tried to integrate your work on "coreos/coreos-kubernetes#629" with "coreos/coreos-kubernetes#608" and https://github.com/crewjam/etcd-aws , a couple of weeks ago.

Current setup:

Controllers in ASG with External ELB
Workers in ASG.
ETCD nodes in ASG with Internal ELB.
Moved all the cloudformation definitions from etcd-aws into stack-template.json in kube-aws, so you can configure everything from kube-aws.

Without TLS it works great, the cluster recovers fine. With TLS, ETCD works fine but doesn't recover and also doesn't remove terminated nodes. Still need to fix these in etcd-aws, especially in the backup part.

Apart from this, the DNS record for ETCD internal ELB is still a manual process at the moment (I set an alias record after the ELB is created), but this can be quickly fixed after.

If anyone is interested working on this, maybe can pick some changes I already made on etcd-aws and kube-aws.

https://github.com/camilb/etcd-aws/tree/ssl

https://github.com/camilb/coreos-kubernetes/tree/etcd-asg

pieterlange · 2016-10-30T09:57:23Z

This is great! Thanks @camilb, this will definitely save time if the project goes in that direction.

For reference, there's some notes on self-hosted etcd in the self-hosted design docs. Maybe @aaronlevy can chip in if external-etcd is a good deployment strategy.

mumoshu · 2016-11-08T01:11:36Z

I've posted my thoughts on why I might want to have the "Dedicated controller subnets and routetables" thing at #35 (comment)

aholbreich · 2016-11-15T18:21:01Z

Support deploying to existing VPC (and maybe existing subnet as well?) -- DONE #346

where is this referenced? Where to read about?

pieterlange · 2016-11-15T18:30:33Z

This was not properly linked @aholbreich, but it's in coreos/coreos-kubernetes#346

Deploying to existing subnets was skipped, but if you think you need this please add usecases to #52.

mumoshu · 2016-11-17T02:25:35Z

Just curious but does everyone want auto-scaling of workers based on cluster-autoscaler to be in the list?
Or are you going with AWS native autoscaling?

Currently, cluster-autoscaler wouldn't work as might be expected in kube-aws created clusters.

The autoscaling group should span 1 availability zone for the cluster autoscaler to work. If you want to distribute workloads evenly across zones, set up multiple ASGs, with a cluster autoscaler for each ASG. At the time of writing this, cluster autoscaler is unaware of availability zones and although autoscaling groups can contain instances in multiple availability zones when configured so, the cluster autoscaler can't reliably add nodes to desired zones. That's because AWS AutoScaling determines which zone to add nodes which is out of the control of the cluster autoscaler. For more information, see kubernetes-retired/contrib#1552 (comment).

https://github.com/kubernetes/contrib/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#deployment-specification

Actually, that's why I've originally started the work for #46.

AlmogBaku · 2017-01-15T16:52:36Z

Regarding heapster/elasticsearch:
Currently, in order to use heapster+elasticsearch you can use the Managed ES by AWS; however, the managed ES doesn't support scripted fields which are very useful(and almost required) for understanding the metrics and settings alarms.

mumoshu · 2017-03-22T07:32:43Z

Set up controller and worker AutoscalingGroups to recover from ec2 instance failures

I believe this is already supported in kube-aws as of today

mumoshu · 2017-03-22T07:34:17Z

Dedicated controller subnets and routetables TBD

This is supported since v0.9.4

mumoshu · 2017-03-22T07:36:04Z

Kubelet TLS bootstrapping

This is WIP in #414

mumoshu · 2017-03-22T07:37:15Z

Node pools #46
Cloudformation update/create success signalling #49

Already supported

mumoshu · 2017-03-22T08:34:01Z

@AlmogBaku Thanks for the information!
That's a huge restriction 😢

mumoshu · 2017-03-22T08:36:14Z

Provision AWS ElasticSearch cluster

Btw I'm using GCP Stackdriver Logging for aggregating log messages from my production kube-aws clusters. When there're much nicer alternatives like Stackdriver, do we really need to support ES out-of-box in kube-aws?

mumoshu · 2017-03-22T08:37:45Z

Automatically remove nodes when instances are rotated out of ASG

@pieterlange Is the above sentence meant for rolling-updates of worker/controller/etcd nodes?

pieterlange · 2017-03-22T09:52:40Z

I don't think we need to support ES in kube-aws, but we could have some recommendations.

Automatically remove nodes when instances are rotated out of ASG

I think this referred to removing the nodes from the cluster state, where still required (eg etcd member lists). Removing/draining kubelets is already supported 👍 .

danielfm · 2017-03-22T10:01:24Z

Figure out what we're going to do about automated CSR signing in kube-aws (necessary for self-healing and autoscaling)

I think we should take the same approach taken by kube-adm, which is to automatically approve all requests sent via a specific bootstrap token, making sure this token can only be used for CSR via RBAC (already supported by kube-aws).

danielfm · 2017-03-22T10:12:35Z

Provision AWS ElasticSearch cluster

I have a working solution for ingesting cluster-wide logs to Sumologic. When I have some time, I could add this to kube-aws as an experimental feature.

The same could be done for GCP Stackdriver Logging, and other vendors.

cknowles · 2017-03-22T14:18:09Z

One potential problem with logging is the number of solutions out there. I can recommend fluentd-kubernetes-daemonset and will likely be helping add GCP Stackdriver support to that soon. However, I know some have strong opinions on using other logging tools/frameworks. If might be good to provide some recommendations in the docs.

@noissue

… to hcom-flavour * commit '28b893f91b55ad07545bcf7c871bccad7be1bbd9': @noissue Bump kubbe/coros version

mumoshu · 2017-07-13T03:06:33Z

Enable decommissioning of kubelets when instances are rotated out of the ASG (experimental support for node draining is included now)
Automatically remove nodes when instances are rotated out of ASG ASG

I believe these two in the description are now resolved thanks to @danielfm - a node now pend the rolling-update of an ASG while the node drainer drains pods.

amitkumarj441 · 2017-10-13T16:26:47Z

@pieterlange I'm willing to take Provision AWS Elasticsearch cluster task. Can you give update on this task? so that I can start working on it and can do PRs soon.

pieterlange · 2017-10-13T18:59:10Z

@amitkumarj441 i'm not very active in kube-aws anymore and will close the issue as most of the items have been fixed nowadays.

I am personally running my elasticsearch clusters inside of kubernetes and i also think thats the best way to go forward, but knock yourself out ;-).

amitkumarj441 · 2017-10-13T19:00:13Z

Thanks @pieterlange for letting me know about this.

pieterlange mentioned this issue Oct 29, 2016

[WIP] Selfmanaged etcd cluster coreos/coreos-kubernetes#629

Closed

camilb mentioned this issue Nov 2, 2016

Discrete etcd cluster coreos/coreos-kubernetes#544

Closed

This was referenced Nov 2, 2016

etcd management #27

Closed

move kube-aws development to dedicated repository coreos/coreos-kubernetes#751

Merged

linki mentioned this issue Nov 16, 2016

Consider adopting kube-aws zalando-incubator/kubernetes-on-aws#114

Closed

jollinshead referenced this issue in HotelsDotCom/kube-aws Jul 4, 2017

Merge pull request #9 in RUN/kube-aws from feature/bump-kube-to-1.6.4…

684d2bf

… to hcom-flavour * commit '28b893f91b55ad07545bcf7c871bccad7be1bbd9': @noissue Bump kubbe/coros version

pieterlange closed this as completed Oct 13, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Production Quality Deployment #9

Production Quality Deployment #9

pieterlange commented Oct 29, 2016 •

edited by mumoshu

Loading

pieterlange commented Oct 29, 2016

camilb commented Oct 29, 2016

pieterlange commented Oct 30, 2016

mumoshu commented Nov 8, 2016

aholbreich commented Nov 15, 2016 •

edited by mumoshu

Loading

pieterlange commented Nov 15, 2016

mumoshu commented Nov 17, 2016

AlmogBaku commented Jan 15, 2017

mumoshu commented Mar 22, 2017

mumoshu commented Mar 22, 2017

mumoshu commented Mar 22, 2017

mumoshu commented Mar 22, 2017

mumoshu commented Mar 22, 2017

mumoshu commented Mar 22, 2017

mumoshu commented Mar 22, 2017

pieterlange commented Mar 22, 2017

danielfm commented Mar 22, 2017

danielfm commented Mar 22, 2017 •

edited

Loading

cknowles commented Mar 22, 2017

mumoshu commented Jul 13, 2017

amitkumarj441 commented Oct 13, 2017

pieterlange commented Oct 13, 2017

amitkumarj441 commented Oct 13, 2017

Production Quality Deployment #9

Production Quality Deployment #9

Comments

pieterlange commented Oct 29, 2016 • edited by mumoshu Loading

pieterlange commented Oct 29, 2016

camilb commented Oct 29, 2016

pieterlange commented Oct 30, 2016

mumoshu commented Nov 8, 2016

aholbreich commented Nov 15, 2016 • edited by mumoshu Loading

pieterlange commented Nov 15, 2016

mumoshu commented Nov 17, 2016

AlmogBaku commented Jan 15, 2017

mumoshu commented Mar 22, 2017

mumoshu commented Mar 22, 2017

mumoshu commented Mar 22, 2017

mumoshu commented Mar 22, 2017

mumoshu commented Mar 22, 2017

mumoshu commented Mar 22, 2017

mumoshu commented Mar 22, 2017

pieterlange commented Mar 22, 2017

danielfm commented Mar 22, 2017

danielfm commented Mar 22, 2017 • edited Loading

cknowles commented Mar 22, 2017

mumoshu commented Jul 13, 2017

amitkumarj441 commented Oct 13, 2017

pieterlange commented Oct 13, 2017

amitkumarj441 commented Oct 13, 2017

pieterlange commented Oct 29, 2016 •

edited by mumoshu

Loading

aholbreich commented Nov 15, 2016 •

edited by mumoshu

Loading

danielfm commented Mar 22, 2017 •

edited

Loading