etcd management #27
I should add that I've been running a test cluster with my monkeypatch branch and it seems to perform well. I've been seeing a lot of warnings like these though:
I think it's because the controller is configured to talk to the internal ELB and we might be hitting an etcd node that's lagging on replication, but I'm not sure. It doesn't seem to break anything though.
@pieterlange made some progress on
Yes, this is obviously where this is all going eventually, but I think it might be many moons before all components in that deployment scenario reach a stable state. It would be nice to get some clarity from the CoreOS crew on deployment plans for this in case I'm completely wrong 😬 Edit: they have, I missed their latest blogs:
Hey, thanks for bringing up the discussion! @pieterlange I agree with every missing feature you've listed in the first comment. I believe we should eventually provide standard way(s) to cover those. To begin with, as I am looking forward to introducing @crewjam's etcd-aws (I was very impressed when I first read his blog post describing it, btw) as a viable option, I'd like to see your branch https://github.com/pieterlange/kube-aws/tree/feature/external-etcd merged into master. Would you mind pull-requesting it so that everyone can start to experiment and give feedback (I believe this is what we'd like to have 😄) more easily using rc binaries of kube-aws?
@camilb Hi, I've just read through your great work in your etcd-asg branch! Though I'm not sure whether we can include the etcd-aws part entirely in kube-aws soon enough or not, I'd like to merge the significant parts of your work supporting SSL for external etcd in a calico-enabled setup (camilb/coreos-kubernetes@a7a14a2...29d538b) into kube-aws. (Have I missed something? Let me know!) That way, combined with @pieterlange's work, we can also encourage everyone to try out a calico + etcd-aws + SSL setup more easily than now, using rc binaries of kube-aws. Would you mind pull-requesting that part of your work and collaborating more with us? 😃
I'm not sure we'd want to merge my branch in yet. It's fairly intrusive. I'm also still digesting the etcd-operator news. Can we work on a
@pieterlange Sure. Anyway, I've created that branch, hoping it becomes a possible place to merge our efforts 👍 https://github.com/coreos/kube-aws/tree/experimental/external-etcd
@mumoshu I'll be glad to. The
@pieterlange I know you're already working on this, but would you let me explicitly assign this issue to you just for clarity? 🙇
@camilb Any chance you could take a look at this recently?
FYI I've written #298 (comment) about the general concerns around H/A of an etcd cluster and why we don't collocate etcd on controller nodes.
#332, for the ASG-EBS-EIP/ENI-per-etcd-node strategy of achieving H/A and rolling-updates of an etcd cluster, is almost finished 😃
FYI: @hjacobs kindly shared with me that his company uses https://github.com/zalando-incubator/stups-etcd-cluster for running dedicated etcd clusters, separate from k8s clusters.
@gianrubio Thanks for the info!
FYI, my question is, briefly:
This change is basically for achieving a "Managed HA etcd cluster" with private IPs resolved via public EC2 hostnames, stabilized with a pool of EBS and EIP pairs for etcd nodes. After this change, EC2 instances backing "virtual" etcd nodes are managed by an ASG.

Supported use-cases:
* Automatic recovery from temporary etcd node failures
  * Even if all the nodes went down, the cluster recovers eventually as long as the EBS volumes aren't corrupted
* Rolling-update of the instance type for etcd nodes without downtime
* Scaling-out of etcd nodes, NOT by modifying the ASG directly BUT indirectly via CloudFormation stack updates
* Other use-cases implied by the fact that the nodes are managed by ASGs

You can choose "eip" or "eni" for etcd node (= etcd member) identity via the `etcd.memberIdentityProvider` key in cluster.yaml:
* `"eip"`, which is the default setting, is recommended
* If you want, choose `"eni"`
  * If you choose `"eni"` and your region has fewer than 3 AZs, setting `etcd.internalDomainName` to something other than the default is HIGHLY RECOMMENDED to prepare for disaster recovery
  * As an advanced option, DNS other than Amazon DNS can be used (when `memberIdentityProvider` is `"eni"`, `internalDomainName` is set, `manageRecordSets` is `false`, and every EC2 instance has a custom DNS capable of resolving FQDNs under `internalDomainName`)

Unsupported use-cases:
* Automatic recovery from more than `(N-1)/2` permanent etcd node failures
  * Requires etcd backups and automatic determination of whether a new etcd cluster should be created or not via `ETCD_INITIAL_CLUSTER_STATE`
* Scaling-in of etcd nodes
  * It just remains untested because it isn't my primary focus in this area. Contributions are welcome

Relevant issues to be (partly) resolved via this PR:
* Part(s) of kubernetes-retired#27
* Wait signal for etcd nodes. See kubernetes-retired#49
* Probably kubernetes-retired#189 and kubernetes-retired#260, as this relies on stable EC2 public hostnames and AWS DNS for peer communication and discovery, regardless of whether an EC2 instance relies on a custom domain/hostname or not

The general idea is to make etcd nodes "virtual" by retaining the state and the identity of an etcd node in an EBS volume and an EIP or ENI, respectively. This way, we can recover/recreate/rolling-update the EC2 instances backing etcd nodes without additional moving parts like external apps, ASG lifecycle hooks, SQS queues, SNS topics, etc. Unlike well-known etcd HA solutions such as crewjam/etcd-aws and MonsantoCo/etcd-aws-cluster, this is intended to be a less flexible but simpler alternative, or the basis for introducing similar solutions.

Caveats:
* If you rely on Route 53 record sets, don't modify the ones initially created by CloudFormation
  * Doing so breaks CloudFormation stack deletions, because CloudFormation has no way to know about the modified record sets and therefore can't cleanly remove them
* To prepare for disaster recovery of a single-AZ etcd cluster (possible when the user relies on an AWS region with 2 or fewer AZs), use Route 53 record sets or EIPs to retain network identities across AZs
  * ENIs and EBS volumes can't be moved to another AZ
  * An EBS volume can, however, be transferred via a snapshot

Strategies considered for stable etcd member identity:
* Static private IPs via a pool of ENIs dynamically assigned to EC2 instances under control of a single ASG
  * ENIs can't move across AZs. What happens when you have 2 ENIs in one AZ and 1 ENI in another, and the former AZ goes down? Nothing, until the AZ comes back up! That isn't the degree of H/A I wish to have at all!
* Dynamic private IPs via stable hostnames, using a pool of EIP+EBS pairs and a single ASG
  * EBS is required in order to achieve "locking" of the pair associated with an etcd instance
  * First of all, identify a "free" pair by filtering available EBS volumes and try to associate it with the EC2 instance
  * Successful association of an EBS volume means that the paired EIP can also be associated with the instance without race conditions
  * EBS volumes can't move across AZs either. What happens when you have 2 pairs in AZ 1 and 1 pair in AZ 2? Once AZ 1 goes down, the options you can take are (1) manually altering AZ 2 to have 3 etcd nodes and then manually electing a new leader, or (2) recreating the etcd cluster within AZ 2 by modifying `etcd.subnets[]` to point to AZ 2 in cluster.yaml, running `kube-aws update`, ssh-ing into one of the nodes, and restoring the etcd state from a backup. Neither is automatic.
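For reference, a minimal sketch of how the relevant `cluster.yaml` section could look under this scheme. Only `memberIdentityProvider`, `internalDomainName`, `manageRecordSets`, and `subnets[]` are named above; the remaining keys and all values (e.g. `count`, the subnet names, the domain) are illustrative assumptions rather than confirmed kube-aws settings.

```yaml
# Hypothetical cluster.yaml fragment for EIP/EBS- or ENI-backed etcd nodes.
etcd:
  count: 3                      # assumed key: number of etcd members
  memberIdentityProvider: eip   # "eip" (default, recommended) or "eni"
  # The two keys below matter mainly for the "eni" provider:
  internalDomainName: internal.example.com  # override the default, especially in regions with < 3 AZs
  manageRecordSets: false       # false = bring your own DNS that resolves FQDNs under internalDomainName
  subnets:                      # assumed subnet references, per etcd.subnets[]
    - name: ManagedPrivateSubnet1
    - name: ManagedPrivateSubnet2
    - name: ManagedPrivateSubnet3
```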
FYI, my question was answered thanks to @hongchaodeng 😄
In a nutshell, I'd like to start with a POC which supports:
And then extend it to also support:
Finally extend it to also support:
It turns out that, when recreating an etcd cluster from a backup, we need a single founding member, and all subsequent etcd members need to be added one by one. This (i.e. dynamic reconfiguration of the etcd cluster) diverges greatly from the static configuration of the etcd cluster we're currently relying on. Supporting both static and dynamic configuration of etcd members does complicate the implementation.
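To make the divergence concrete, the dynamic (member-by-member) recovery flow looks roughly like the sketch below. This is not what kube-aws generates; hostnames, IPs, and data-dir paths are illustrative assumptions.

```sh
# 1. On the founding member: start a brand-new single-member cluster
#    from the restored data directory.
etcd --name etcd0 \
  --data-dir /var/lib/etcd-restored \
  --force-new-cluster \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://10.0.0.10:2379

# 2. Register the next member with the running cluster; etcd prints the
#    settings the new member must be started with.
etcdctl --endpoints http://10.0.0.10:2379 member add etcd1 http://10.0.0.11:2380

# 3. On the new member: join as part of the *existing* cluster.
#    Repeat steps 2-3 for each additional member, one at a time.
etcd --name etcd1 \
  --data-dir /var/lib/etcd \
  --initial-cluster "etcd0=http://10.0.0.10:2380,etcd1=http://10.0.0.11:2380" \
  --initial-cluster-state existing \
  --initial-advertise-peer-urls http://10.0.0.11:2380 \
  --listen-peer-urls http://10.0.0.11:2380 \
  --listen-client-urls http://0.0.0.0:2379 \
  --advertise-client-urls http://10.0.0.11:2379
```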
Etcd v3, AFAIK supported in k8s 1.6, seems to provide a way of restoring an etcd cluster that is much closer to the static configuration, according to etcd-io/etcd#2366 (comment).
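For comparison, a sketch of the v3-style flow, where every member can be restored up front with the full static member list (endpoints, names, the cluster token, and paths here are assumptions, not kube-aws defaults):

```sh
# Take a snapshot from a live member (etcd v3 API).
ETCDCTL_API=3 etcdctl --endpoints http://10.0.0.10:2379 snapshot save /tmp/etcd-snapshot.db

# On each member, pre-populate a fresh data dir from the snapshot, passing
# the complete (static) member list; then start etcd pointing at that data dir.
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd-snapshot.db \
  --name etcd0 \
  --data-dir /var/lib/etcd \
  --initial-cluster "etcd0=http://10.0.0.10:2380,etcd1=http://10.0.0.11:2380,etcd2=http://10.0.0.12:2380" \
  --initial-cluster-token etcd-cluster-restored \
  --initial-advertise-peer-urls http://10.0.0.10:2380
```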
Submitted #417 for this.
Hello. I am trying to restore a k8s cluster from an etcd data backup. I am able to restore the etcd cluster to its original state (it has all the k8s info). However, when I get the k8s cluster's rc, services, deployments, etc., they are all gone. The k8s cluster is not like it was before the restore.
Hi @ReneSaenz, thanks for trying kube-aws 👍 Several questions came to my mind at this point:
@ReneSaenz did you finish the etcd backup restore before any controller-managers were started? And you're certainly not doing the restore while controller-managers are running?
@mumoshu I am not using AWS. Sorry for the confusion. The etcd backup is done like this. @redbaron Before making a backup, I stopped the etcd process, made the backup, and then restarted the etcd process.
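In other words, something along these lines (a sketch only; the systemd unit name and paths depend on the OS and setup, so treat them as assumptions):

```sh
# Stop etcd, copy the data directory while nothing is writing to it,
# then restart etcd. "etcd-member.service" is the Container Linux unit
# name; adjust it (and the paths) for your environment.
sudo systemctl stop etcd-member.service
sudo cp -a /var/lib/etcd "/var/backups/etcd-$(date +%Y%m%d%H%M%S)"
sudo systemctl start etcd-member.service
```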
There are currently some people forking because we're not sure about the current etcd solution. Let's discuss the issues in this topic. A lot of us seem to be centering around @crewjam's etcd solution, but there are also others:
I have a personal preference for https://crewjam.com/etcd-aws/ (https://github.com/crewjam/etcd-aws), but we should definitely have this conversation as a community (as I tried in the old repo coreos/coreos-kubernetes#629).
Let's combine our efforts, @colhom @camilb @dzavalkinolx.
Branches for inspiration:
Currently missing features:
As noted in the overall production-readiness issue #9, there's also work being done on hosting etcd inside Kubernetes itself, which is probably where all of this is going in the end.