This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

etcd management #27

Closed
4 tasks
pieterlange opened this issue Nov 2, 2016 · 26 comments · Fixed by #417

@pieterlange
Contributor

There are currently some people forking because we're not sure about the current etcd solution. Let's discuss the issues in this topic. A lot of us seem to center around @crewjam's etcd solution, but there are also others:

I have a personal preference for https://crewjam.com/etcd-aws/ (https://github.com/crewjam/etcd-aws), but we should definitely have this conversation as a community (as I tried in the old repo coreos/coreos-kubernetes#629).
Let's combine our efforts @colhom @camilb @dzavalkinolx

Branches for inspiration:

Currently missing features:

  • backup/restore
  • node cycling
  • node discovery from ASG
  • cluster recovery from complete failure

As noted in the overall production readiness issue #9, there's also work being done on hosting etcd inside Kubernetes itself, which is probably where all of this is going in the end.

@pieterlange
Contributor Author

I should add that I've been running a test cluster with my monkeypatch branch and it seems to perform well.

I've been seeing a lot of warnings like these though:

Nov 02 10:54:33 ip-10-50-44-174.eu-west-1.compute.internal kubelet-wrapper[1983]: W1102 10:54:33.799124    1983 reflector.go:330] pkg/kubelet/kubelet.go:384: watch of *api.Service ended with: too old resource version: 30825 (30896)
Nov 02 10:54:54 ip-10-50-44-174.eu-west-1.compute.internal kubelet-wrapper[1983]: W1102 10:54:54.711865    1983 reflector.go:330] pkg/kubelet/config/apiserver.go:43: watch of *api.Pod ended with: too old resource version: 30847 (30919)

I think it's because the controller is configured to talk to the internal ELB and we might be hitting an etcd node that's lagging on replication, but I'm not sure. It doesn't seem to break anything though.

@camilb
Contributor

camilb commented Nov 3, 2016

@pieterlange I made some progress on etcd-aws with TLS and just saw this:
https://github.com/coreos/etcd-operator

@pieterlange
Contributor Author

pieterlange commented Nov 3, 2016

Yes, this is obviously where this is all going eventually, but I think it might be many moons before all components in that deployment scenario reach a stable state. It would be nice to get some clarity from the CoreOS crew on deployment plans for this in case I'm completely wrong 😬

Edit: they have; I missed their latest blog posts:

@mumoshu
Contributor

mumoshu commented Nov 4, 2016

Hey, thanks for bringing up the discussion!

@pieterlange I agree with every missing feature you've listed in the first comment. I believe we should eventually provide standard way(s) to cover those.

To start, as I'm looking forward to introducing @crewjam's etcd-aws (I was very impressed when I first read his blog post describing it, btw) as a viable option, I'd like to see your branch https://github.com/pieterlange/kube-aws/tree/feature/external-etcd get merged into master.

Would you mind opening a pull request so that everyone can start to experiment and give feedback (I believe this is what we'd like to have 😄) more easily, using RC binaries of kube-aws?

@mumoshu
Contributor

mumoshu commented Nov 4, 2016

@camilb Hi, I've just read through your great work in your etcd-asg branch!

Though I'm not sure whether we can include the etcd-aws part entirely in kube-aws soon enough, I'd like to merge the significant parts of your work that support SSL to external etcd in a Calico-enabled setup (camilb/coreos-kubernetes@a7a14a2...29d538b) into kube-aws. (Have I missed something? Let me know!)

That way, combined with @pieterlange's work, we can also encourage everyone to try out a calico+etcd-aws+SSL setup more easily than now, using RC binaries of kube-aws.

Would you mind opening a pull request for that part of your work and collaborating more with us? 😃

@pieterlange
Contributor Author

I'm not sure we'd want to merge my branch in yet. It's fairly intrusive. I'm also still digesting the etcd-operator news...

Can we work on an experimental/external-etcd branch for this?

@mumoshu
Contributor

mumoshu commented Nov 4, 2016

@pieterlange Sure. Anyway, I've created that branch hoping it becomes a place to merge our efforts 👍 https://github.com/coreos/kube-aws/tree/experimental/external-etcd

@camilb
Contributor

camilb commented Nov 4, 2016

@mumoshu I'll be glad to. The etcd-aws part still requires more work, doubles the size of the stack-template.json template, and still requires some manual steps to get it working. I will prepare the PR for Calico with SSL.

@mumoshu
Contributor

mumoshu commented Dec 1, 2016

@pieterlange I know you're already working on this, but would you let me explicitly assign this issue to you just for clarity? 🙇
There are many ongoing issues here and there recently, so I'd like more clarity 😉

@mumoshu
Contributor

mumoshu commented Dec 1, 2016

@camilb Any chance you could take a look into this sometime soon?

@mumoshu
Contributor

mumoshu commented Feb 15, 2017

FYI I've written #298 (comment) about the general concerns around H/A of an etcd cluster and why we don't colocate etcd on controller nodes.

This was referenced Feb 15, 2017
@mumoshu
Contributor

mumoshu commented Feb 22, 2017

#332, for the ASG-EBS-EIP/ENI-per-etcd-node strategy of achieving H/A and rolling updates of an etcd cluster, is almost finished 😃
Is anyone working on the backup/restore feature?

@mumoshu
Contributor

mumoshu commented Feb 24, 2017

FYI: @hjacobs kindly shared with me that his company uses https://github.com/zalando-incubator/stups-etcd-cluster for running dedicated etcd clusters, separately from k8s clusters.
A multi-region etcd cluster sounds exciting 😄

@mumoshu
Contributor

mumoshu commented Feb 24, 2017

#332 is generally working, with the ability to choose which strategy is used for node identity and optional support for DNS other than Amazon DNS (related issue: #189).

@gianrubio
Contributor

@mumoshu
Contributor

mumoshu commented Feb 28, 2017

@gianrubio Thanks for the info!
FYI I've commented on the etcd v3 draft documentation referred to from there with a question: https://docs.google.com/document/d/16ES7N51Xj8r1P5ITan3gPRvwMzoavvX9TqoN8IpEOU8/edit?disco=AAAABHpLq1U

@mumoshu
Contributor

mumoshu commented Feb 28, 2017

FYI, my question is, briefly:

mumoshu added a commit to mumoshu/kube-aws that referenced this issue Mar 1, 2017
This change is basically for achieving a "managed HA etcd cluster" whose private IPs are resolved via public EC2 hostnames, stabilized with a pool of EBS and EIP pairs for etcd nodes.

After this change, EC2 instances backing "virtual" etcd nodes are managed by an ASG.

Supported use-cases:

* Automatic recovery from temporary Etcd node failures
  * Even if all the nodes went down, the cluster recovers eventually as long as the EBS volumes aren't corrupted
* Rolling-update of the instance type for etcd nodes without downtime
  * i.e. scaling out etcd nodes not by modifying the ASG directly but indirectly via CloudFormation stack updates
* Other use-cases implied by the fact that the nodes are managed by ASGs
* You can choose "eip" or "eni" for etcd node (= etcd member) identity via the `etcd.memberIdentityProvider` key in cluster.yaml
  * `"eip"`, which is the default setting, is recommended
  * If you want, choose `"eni"`.
  * If you choose `"eni"` and your region has fewer than 3 AZs, setting `etcd.internalDomainName` to something other than the default is HIGHLY RECOMMENDED to prepare for disaster recovery
  * It is an advanced option, but DNS other than Amazon DNS can be used (when `memberIdentityProvider` is `"eni"`, `internalDomainName` is set, `manageRecordSets` is `false`, and every EC2 instance has a custom DNS capable of resolving FQDNs under `internalDomainName`)

Unsupported use-cases:

* Automatic recovery from the permanent failure of more than `(N-1)/2` etcd nodes.
  * Requires etcd backups and automatic determination, via `ETCD_INITIAL_CLUSTER_STATE`, of whether a new etcd cluster should be created
* Scaling-in of etcd nodes
  * Just remains untested because it isn't my primary focus in this area. Contributions are welcome

Relevant issues to be (partly) resolved via this PR:

* Part(s) of kubernetes-retired#27
* Wait signal for etcd nodes. See kubernetes-retired#49
* Probably kubernetes-retired#189 kubernetes-retired#260 as this relies on stable EC2 public hostnames and AWS DNS for peer communication and discovery regardless of whether an EC2 instance relies on a custom domain/hostname or not

The general idea is to make Etcd nodes "virtual" by retaining the state and the identity of an etcd node in a pair of an EBS volume and an EIP or an ENI, respectively.
This way, we can recover/recreate/rolling-update EC2 instances backing etcd nodes without other moving parts like external apps and/or ASG lifecycle hooks, SQS queues, SNS topics, etc.

Unlike well-known etcd HA solutions like crewjam/etcd-aws and MonsantoCo/etcd-aws-cluster, this is intended to be a less flexible but simpler alternative, or the basis for introducing similar solutions.

* If you rely on Route 53 record sets, don't modify the ones initially created by CloudFormation
   * Doing so breaks CloudFormation stack deletions, because CloudFormation has no way to know about modified record sets and therefore can't cleanly remove them.
* To prepare for disaster recovery of a single-AZ etcd cluster (possible when the user relies on an AWS region with 2 or fewer AZs), use Route 53 record sets or EIPs to retain network identities across AZs
   * ENIs and EBS volumes can't be moved to another AZ
   * An EBS volume can, however, be transferred using a snapshot

* Static private IPs via a pool of ENIs dynamically assigned to EC2 instances under control of a single ASG
  * ENIs can't move across AZs. What happens when you have 2 ENIs in one AZ and 1 ENI in another, and the former AZ goes down? Nothing, until the AZ comes back up! That isn't the degree of H/A I wish to have at all!
* Dynamic private IPs via stable hostnames using a pool of EIP&EBS pairs, single ASG (see the sketch after this list)
  * EBS is required in order to achieve "locking" of the pair associated to an etcd instance
    * First, identify a "free" pair by filtering available EBS volumes and try to attach it to the EC2 instance
    * Successful attachment of the EBS volume means that the paired EIP can also be associated to the instance without race conditions
  * EBS volumes can't move across AZs. What happens when you have 2 pairs in AZ 1 and 1 pair in AZ 2? Once AZ 1 goes down, the options you can take are (1) manually altering AZ 2 to have 3 etcd nodes and then manually electing a new leader, or (2) recreating the etcd cluster within AZ 2 by modifying `etcd.subnets[]` to point to AZ 2 in cluster.yaml, running `kube-aws update`, then sshing into one of the nodes and restoring the etcd state from a backup. Neither is automatic.
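
For illustration only, here's a minimal sketch of the EBS-based "locking" flow described above, using the AWS CLI. The tag keys, device path, and the way the paired EIP is looked up are assumptions for the example, not what the actual implementation does:

```bash
#!/bin/bash
# Sketch: claim a free EBS+EIP pair for this etcd member. Tag keys are hypothetical.
set -euo pipefail

INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)

# 1. Find an available (unattached) etcd data volume in this AZ.
VOLUME_ID=$(aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
            "Name=availability-zone,Values=${AZ}" \
            "Name=tag:kube-aws:role,Values=etcd" \
  --query 'Volumes[0].VolumeId' --output text)

# 2. Attaching the volume acts as the lock: only one instance can succeed for a given volume.
aws ec2 attach-volume --volume-id "${VOLUME_ID}" \
  --instance-id "${INSTANCE_ID}" --device /dev/xvdf

# 3. Having won the volume, the paired EIP (looked up here via a hypothetical tag on the
#    volume) can be associated without racing other instances.
ALLOCATION_ID=$(aws ec2 describe-volumes --volume-ids "${VOLUME_ID}" \
  --query 'Volumes[0].Tags[?Key==`kube-aws:eip-allocation-id`].Value' --output text)
aws ec2 associate-address --instance-id "${INSTANCE_ID}" --allocation-id "${ALLOCATION_ID}"
```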
@mumoshu
Contributor

mumoshu commented Mar 2, 2017

FYI my question was answered thanks to @hongchaodeng 😄

@mumoshu
Contributor

mumoshu commented Mar 2, 2017

In a nutshell, I'd like to start with a POC which supports:

  1. Automated backup of etcd cluster
  • Probably by periodically running etcdctl backup and then aws s3 cp <backup dir> <etcd backup s3 uri>/<storage revision> on the node "next" to the "leader" node at that time (a minimal sketch follows this list)
  • <etcd backup s3 uri> would be specified via cluster.yaml
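
A minimal sketch of such a backup job, assuming etcd v2's `etcdctl` and a hypothetical S3 URI; using a timestamp instead of the storage revision is a simplification:

```bash
#!/bin/bash
# Sketch only: periodic etcd v2 backup shipped to S3. Bucket URI and paths are assumptions.
set -euo pipefail

ETCD_DATA_DIR=/var/lib/etcd2
ETCD_BACKUP_S3_URI=s3://my-kube-aws-bucket/etcd-backups   # hypothetical value from cluster.yaml
BACKUP_DIR=$(mktemp -d)

# etcdctl (v2) copies the data dir into a consistent backup directory.
etcdctl backup --data-dir "${ETCD_DATA_DIR}" --backup-dir "${BACKUP_DIR}"

# The POC proposes keying backups by <storage revision>; a timestamp is used here for brevity.
aws s3 cp --recursive "${BACKUP_DIR}" "${ETCD_BACKUP_S3_URI}/$(date +%s)/"
```

Something like this could run from cron or a systemd timer on whichever node is elected to do the backup.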

And then extend it to also support:

  2. Manual recovery of etcd node data
  • If an EBS data volume for an etcd node is not initialized yet, look for the freshest backup of the etcd cluster, which would have the highest <storage revision>, under <etcd backup s3 uri>/*
  • And then run etcd with a data dir recovered from it (sketched below)
  • The easiest way of manually recovering the etcd cluster is to recreate your kube-aws cluster completely after adding <etcd backup s3 uri> to your cluster.yaml
  • Or otherwise you may also trigger this procedure via systemctl
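
And a hedged sketch of that manual recovery path, again with a hypothetical S3 URI and data dir; the restore step itself (a single founding member started with `--force-new-cluster`) follows the documented etcd v2 procedure:

```bash
#!/bin/bash
# Sketch only: restore an uninitialized etcd node from the freshest S3 backup.
set -euo pipefail

ETCD_BACKUP_S3_URI=s3://my-kube-aws-bucket/etcd-backups   # hypothetical value from cluster.yaml
ETCD_DATA_DIR=/var/lib/etcd2

if [ ! -d "${ETCD_DATA_DIR}/member" ]; then
  # Pick the freshest backup, i.e. the highest <storage revision> (or timestamp) prefix.
  LATEST=$(aws s3 ls "${ETCD_BACKUP_S3_URI}/" | awk '{print $NF}' | sort -n | tail -1)
  aws s3 cp --recursive "${ETCD_BACKUP_S3_URI}/${LATEST}" "${ETCD_DATA_DIR}"

  # etcd v2 recovers from a backup dir by starting a single founding member with
  # --force-new-cluster (ETCD_FORCE_NEW_CLUSTER=true), then re-adding the other members.
  ETCD_FORCE_NEW_CLUSTER=true etcd --data-dir "${ETCD_DATA_DIR}"
fi
```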

Finally extend it to also support:

  3. (Experimental) Automatic recovery of etcd node data
  • If an EBS data volume seems to be corrupted, i.e. etcd2.service exits early producing a specific error, wipe the data dir and the WAL dir and then do 2.

@mumoshu
Contributor

mumoshu commented Mar 6, 2017

It turns out that, when recreating an etcd cluster from a backup, we need a single founding member, and all the subsequent etcd members need to be added one by one. This (i.e. dynamic reconfiguration of the etcd cluster) diverges greatly from the static configuration of the etcd cluster we're currently relying on.

Supporting both static and dynamic configuration of etcd members does complicate the implementation.
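
For illustration, a minimal sketch of the one-by-one flow described above, with hypothetical member names and endpoints:

```bash
# 1. The founding member is started from the restored data dir with --force-new-cluster.
# 2. Each additional member is first registered against the running cluster, then started
#    with ETCD_INITIAL_CLUSTER_STATE=existing. Repeat per member.
etcdctl --endpoints http://etcd0.internal:2379 member add etcd1 http://etcd1.internal:2380

ETCD_INITIAL_CLUSTER="etcd0=http://etcd0.internal:2380,etcd1=http://etcd1.internal:2380" \
ETCD_INITIAL_CLUSTER_STATE=existing \
etcd --name etcd1 --data-dir /var/lib/etcd2 \
  --listen-peer-urls http://0.0.0.0:2380 \
  --initial-advertise-peer-urls http://etcd1.internal:2380
```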

@mumoshu
Contributor

mumoshu commented Mar 6, 2017

Etcd v3, AFAIK supported in k8s 1.6, seems to provide a new way of restoring an etcd cluster that is more similar to the static configuration, according to etcd-io/etcd#2366 (comment)
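
For reference, the v3 flow looks roughly like this (endpoints, paths, and the cluster token below are made up); every member can be restored from the same snapshot with an identical, static `--initial-cluster`:

```bash
# Take a snapshot from a running member (etcd v3 API).
ETCDCTL_API=3 etcdctl --endpoints https://etcd0.internal:2379 snapshot save /tmp/etcd.db

# Restore a member's data dir from that snapshot; run once per member with its own --name
# and --initial-advertise-peer-urls, keeping --initial-cluster identical everywhere.
ETCDCTL_API=3 etcdctl snapshot restore /tmp/etcd.db \
  --name etcd0 \
  --data-dir /var/lib/etcd \
  --initial-cluster "etcd0=https://etcd0.internal:2380,etcd1=https://etcd1.internal:2380,etcd2=https://etcd2.internal:2380" \
  --initial-cluster-token my-etcd-cluster \
  --initial-advertise-peer-urls https://etcd0.internal:2380
```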

@mumoshu
Contributor

mumoshu commented Mar 16, 2017

Submitted #417 for this.

  • etcd is now v3
  • automatic recovery from any number of permanently failed etcd nodes is enabled by default (for now)
    • Please suggest a way to toggle it if you'd like to turn it off optionally or by default!

@ReneSaenz

Hello. I am trying to restore a k8s cluster from an etcd data backup. I am able to restore the etcd cluster to its original state (it has all the k8s info). However, when I get the k8s cluster rc, services, deployments, etc., they are all gone. The k8s cluster is not like it was before the restore.
What am I missing? Can someone point me in the right direction?

@mumoshu
Contributor

mumoshu commented Jun 26, 2017

Hi @ReneSaenz, thanks for trying kube-aws 👍

Several questions came to my mind at this point:

  • How did you save & restore your k8s cluster?
  • Which parts of your cluster were successfully restored, i.e. what has not gone missing? Only pods, secrets, etc.?

@redbaron
Contributor

@ReneSaenz did you finish the etcd backup restore before any controller-managers were started? And you're certainly not doing a restore while controller-managers are running?

@ReneSaenz

@mumoshu I am not using AWS. Sorry for the confusion. The etcd backup is done like this:
etcdctl backup --data-dir /var/lib/etcd --backup-dir /var/tmp/etcdbackup
Then I create a tarball of that directory. The resulting file is then sent to an external NFS server.

@redbaron Before making a backup, I stopped the etcd process, made the backup, and then restarted the etcd process again.
When I restored, I did not stop any k8s-related processes, so the restore was done while kube-apiserver, kube-controller-manager, and the scheduler were running.
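
For comparison, here's a sketch of the ordering @redbaron is asking about, with hypothetical unit names and paths; the key point is that nothing should be talking to etcd while its data is being replaced:

```bash
#!/bin/bash
# Sketch only: unit names and paths are hypothetical; the point is the ordering.
set -euo pipefail

# 1. Stop everything that talks to etcd before touching its data.
systemctl stop kube-apiserver kube-controller-manager kube-scheduler
systemctl stop etcd2

# 2. Unpack the backup that was made with `etcdctl backup --backup-dir /var/tmp/etcdbackup`.
tar xzf /mnt/nfs/etcdbackup.tar.gz -C /var/tmp

# 3. A directory produced by `etcdctl backup` has its original node and cluster identity
#    stripped, so the first start must be a single founding member with --force-new-cluster.
etcd --data-dir /var/tmp/etcdbackup --force-new-cluster &

# 4. Only once etcd reports healthy, bring the control plane back, so the
#    controller-manager never acts on half-restored state.
until etcdctl cluster-health; do sleep 2; done
systemctl start kube-apiserver kube-controller-manager kube-scheduler
```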

davidmccormick pushed a commit to HotelsDotCom/kube-aws that referenced this issue Mar 21, 2018
…avour-0.9.9 to hcom-flavour

* commit '0e116d72ead70121c730d3bc4009f8d562e16912': (24 commits)
  RUN-788 Add kubectl run parameters
  Allow toggling Metrics Server installation
  Correct values for the `kubernetes.io/cluster/<Cluster ID>` tags Resolves kubernetes-retired#1025
  Fix dashboard doco links
  Fix install-kube-system when node drainer is enabled Follow-up for kubernetes-retired#1043
  Two fixes to 0.9.9 rc.3 (kubernetes-retired#1043)
  Update the documentation for Kubernetes Dashboard.
  Improve the configuration for Kubernetes Dashboard.
  Fix the creation of all metrics-server resources.
  Use templated image for metrics-server.
  Follow-ups for Kubernetes 1.8
  Metrics Server addon. (kubernetes-retired#973)
  Quick start and high availability guides
  Add rkt container cleanup to journald-cloudwatch-logs service
  Update Tiller image to v2.7.2
  Update kube-dns 1.14.7
  Bump Cluster Autoscaler version to 1.0.3
  Bump Kubernetes and ETCD version.
  Support EC2 instance tags per node role This feature will be handy when e.g. your monitoring tools discovers EC2 instances and then groups resource metrics with EC2 instance tags.
  Fix the default FleetIamRole Closes kubernetes-retired#1022
  ...
kylehodgetts pushed a commit to HotelsDotCom/kube-aws that referenced this issue Mar 27, 2018