This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Managed HA etcd cluster #332

Merged

mumoshu merged 3 commits into kubernetes-retired:master from ha-etcd on Mar 1, 2017

Conversation

mumoshu
Contributor

@mumoshu mumoshu commented Feb 20, 2017

This is a WIP pull request to achieve a "Managed HA etcd cluster", with private IPs resolved via public EC2 hostnames that are stabilized by a pool of EBS-and-EIP pairs for etcd nodes.

After this change, EC2 instances backing "virtual" etcd nodes are managed by an ASG.

Supported use-cases:

  • Automatic recovery from temporary Etcd node failures
    • Even if all the nodes went down, the cluster recovers eventually as long as the EBS volumes aren't corrupted
  • Rolling-update of the instance type for etcd nodes without downtime
    • = Scaling-out of etcd nodes, done NOT by modifying the ASG directly BUT indirectly via CloudFormation stack updates
  • Other use-cases implied by the fact that the nodes are managed by ASGs
  • You can choose "eip" or "eni" for etcd node (=etcd member) identity via the etcd.memberIdentityProvider key in cluster.yaml
    • "eip", which is the default setting, is recommended
    • If you want, choose "eni"
    • If you choose "eni" and your region has fewer than 3 AZs, setting etcd.internalDomainName to something other than the default is HIGHLY RECOMMENDED to prepare for disaster recovery
    • It is an advanced option, but a DNS other than the Amazon DNS can be used (when memberIdentityProvider is "eni", internalDomainName is set, manageRecordSets is false, and every EC2 instance has a custom DNS capable of resolving FQDNs under internalDomainName)

Unsupported use-cases:

  • Automatic recovery from permanent failure of more than (N-1)/2 etcd nodes
    • Requires etcd backups and automatic determination, via ETCD_INITIAL_CLUSTER_STATE, of whether a new etcd cluster should be created or not
  • Scaling-in of etcd nodes
    • This just remains untested because it isn't my primary focus in this area. Contributions are welcome

Relevant issues to be (partly) resolved via this PR:

The general idea is to make etcd nodes "virtual" by retaining the state and the identity of an etcd node in a pair of an EBS volume and an EIP (or an ENI), respectively.
This way, we can recover/recreate/rolling-update the EC2 instances backing etcd nodes without other moving parts like external apps, ASG lifecycle hooks, SQS queues, SNS topics, etc.

Unlike well-known etcd HA solutions like crewjam/etcd-aws and MonsantoCo/etcd-aws-cluster, this is intended to be a less flexible but simpler alternative, or the basis for introducing similar solutions to those.

Implementation notes

General rules

  • If you rely on Route 53 record sets, don't modify ones initially created by CloudFormation
    • Doing so breaks CloudFormation stack deletions because it has no way to know about modified record sets and therefore can't cleanly remove them.
  • To prepare for disaster recovery of a single-AZ etcd cluster (possible when the user relies on an AWS region with 2 or fewer AZs), use Route 53 record sets or EIPs to retain network identities across AZs
    • ENIs and EBS volumes can't be moved to another AZ
    • An EBS volume's data can, however, be transferred using a snapshot

Examples of experimented but not employed strategies

  • Static private IPs via a pool of ENIs dynamically assigned to EC2 instances under control of a single ASG
    • ENIs can't move across AZs. What happens when you have 2 ENIs in one AZ and 1 ENI in another, and the former AZ goes down? Nothing, until the AZ comes back up! That isn't the degree of H/A I wish to have at all!
  • Dynamic private IPs via stable hostnames using a pool of EIP&EBS pairs, single ASG
    • An EBS volume is required in order to achieve "locking" of the pair associated with an etcd node (see the sketch after this list)
      • First of all, identify a "free" pair by filtering available EBS volumes, then try to attach its volume to the EC2 instance
      • Successful attachment of the EBS volume means that the paired EIP can also be associated with the instance without race conditions
    • EBS volumes can't move across AZs either. What happens when you have 2 pairs in AZ 1 and 1 pair in AZ 2? Once AZ 1 goes down, the options you can take are: (1) manually alter AZ 2 to have 3 etcd nodes and then manually elect a new leader, or (2) recreate the etcd cluster within AZ 2 by modifying etcd.subnets[] in cluster.yaml to point at AZ 2, running kube-aws update, SSHing into one of the nodes, and restoring the etcd state from a backup. Neither is automatic.
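
A minimal sketch of the lock-then-associate flow described above, assuming the tag names used in this PR (kube-aws:owner:role=etcd and kube-aws:etcd:eip-alloc-id), a hypothetical region, and /dev/xvdf as the device path; the actual userdata in this PR differs in details:

#!/bin/bash -e
region=us-west-2   # hypothetical; the stack template would fill in {{.Region}}
instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
az=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)

# 1. Find a "free" (status=available) etcd volume in this AZ
vol=$(aws ec2 describe-volumes --region "$region" \
  --filters "Name=tag:kube-aws:owner:role,Values=etcd" \
            "Name=availability-zone,Values=$az" \
            "Name=status,Values=available" \
  --output json | jq -r '.Volumes[0]')
vol_id=$(echo "$vol" | jq -r '.VolumeId')

# 2. Attaching the volume acts as the lock: only one instance can win the attachment
aws ec2 attach-volume --region "$region" \
  --volume-id "$vol_id" --instance-id "$instance_id" --device /dev/xvdf

# 3. The winner can now associate the paired EIP without racing other instances
eip_alloc_id=$(echo "$vol" | jq -r '.Tags[] | select(.Key == "kube-aws:etcd:eip-alloc-id").Value')
aws ec2 associate-address --region "$region" \
  --instance-id "$instance_id" --allocation-id "$eip_alloc_id"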

TODOs

  • Move userdata for etcd nodes to S3 (to work around the cfn limit of 16KB on userdata size, like we've done for worker and controller nodes)
  • Make EBS volumes created but not attached on stack creation
  • Make EIPs created as part of the control-plane stack but not yet associated
  • Associate a pair of an EBS volume and an EIP before starting the etcd process
  • Make each etcd node be managed under a dedicated ASG
    • Just change the EC2::Instance resources in the cfn stack templates to a corresponding pair of a launch configuration and an ASG
  • Make each etcd ASG depend on the next etcd ASG
    • So that we can achieve a rolling-update of etcd ASGs, and hence etcd nodes
  • Trigger cfn-signal for the etcd ASG managing an etcd EC2 instance (see the sketch after this list)
    • Inject the new env vars KUBE_AWS_ETCD_INDEX=<$etcdIndex> and KUBE_AWS_STACK_NAME from the stack template via embedded EC2 userdata
    • cfn-signal -e 0 --region {{.Region}} --resource {{.Etcd.LogicalName}}${{.EtcdIndexEnvVarName}} --stack ${{.StackNameEnvVarName}}
  • Tweak the ExecStartPre commands of the cfn-signal.service unit if necessary
    • cfn-signal.service for etcd nodes is set to Wants=etcd2.service and After=etcd2.service
    • Are these enough to ensure that etcd2.service is up and running?
      • When a rolling-update is in progress, don't we need to wait for a newly recreated etcd member with outdated data (=persisted in the EBS volume which had been attached to the previously terminated instance, now replaced by the newly created instance) to catch up with the latest data from the running etcd cluster?
    • I'll leave this for further improvement(s)
  • Security/Fail-proof: Prevent attaching/associating wrong EBS volumes and EIPs
  • Various clean-ups
  • Fix tests
  • Pass E2e tests
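
Roughly, the signalling described in the list above might look like this (a sketch only; KUBE_AWS_ETCD_INDEX and KUBE_AWS_STACK_NAME are assumed to be injected via the embedded userdata, and "Etcd" stands in for {{.Etcd.LogicalName}}):

# Sketch: run after etcd2.service is up, e.g. from cfn-signal.service
systemctl is-active etcd2.service
cfn-signal -e $? \
  --region {{.Region}} \
  --resource "Etcd${KUBE_AWS_ETCD_INDEX}" \
  --stack "${KUBE_AWS_STACK_NAME}"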

Non-TODOs (for now)

  • Graceful termination of etcd nodes
    • like kube-node-drainer.service for worker nodes
    • to elect a new leader when the terminating node was the former leader

@mumoshu mumoshu added this to the v0.9.5-rc.1 milestone Feb 20, 2017
@mumoshu
Contributor Author

mumoshu commented Feb 20, 2017

Just realized that what I'm trying to achieve is similar in concept to simlodon mentioned in kubernetes/kops#772 😃

@mumoshu
Contributor Author

mumoshu commented Feb 20, 2017

Ah, I had somehow been forgetting the fact that an EBS volume can't move across AZs, too 😢

@mumoshu
Contributor Author

mumoshu commented Feb 20, 2017

I'm going to create a dedicated ASG for each etcd instance.
Each etcd ASG would depend on the next ASG so that we can hopefully do rolling-updates of ASGs.

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

I've verified the current implementation by triggering a rolling-update of the instance type for etcd nodes while running the k8s conformance test.
The conformance test passed without any failures, so I'd say no visible downtime was observed 😉

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

The biggest concern I've left untouched so far is:

When a rolling-update is in progress, don't we need to wait for a newly recreated etcd member with possibly outdated data (=persisted in the EBS volume which had been attached to the previously terminated instance, now replaced by the newly created instance) to catch up with the latest data from the running etcd cluster?

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

Hi @pieterlange @camilb, could I have your comments/requests regarding the TODOs and Non-TODOs in the description of this PR, if any? 😃

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

Hi @redbaron, I've implemented my POC to make the etcd cluster a bit more H/A.
Since I believe you're experienced in this area, would you mind leaving your comments/requests/etc. regarding the TODOs and Non-TODOs in the description of this PR? 😃

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

Updated the supported use-cases in the description:

  • Automatic recovery from temporary Etcd node failures
    • Even if all the nodes went down, the cluster recovers eventually as long as the EBS volumes aren't corrupted


aws ec2 describe-volumes \
--region {{.Region}} \
--filters Name=tag:aws:cloudformation:stack-name,Values=$stack_name Name=tag:kube-aws:owner:role,Values=etcd Name=status,Values=available \
Contributor

You can add Name=availability-zone,Values=$az here and drop the jq filter later

Contributor Author

Good catch 👍 Thanks
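
A sketch of the revised query with the suggested filter (assuming $az has been read from instance metadata, as elsewhere in this userdata):

aws ec2 describe-volumes \
  --region {{.Region}} \
  --filters Name=tag:aws:cloudformation:stack-name,Values=$stack_name \
            Name=tag:kube-aws:owner:role,Values=etcd \
            Name=availability-zone,Values=$az \
            Name=status,Values=available \
  --output json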

echo "no etcd volume available in availability zone $az"
fi

vol_id=$(echo "$vol" | jq -r ".VolumeId")
Contributor

@redbaron redbaron Feb 21, 2017

This will break if more than one volume is found. It shouldn't happen normally, but better to guard against it. Missed that you do it already.

Contributor Author

@mumoshu mumoshu Feb 21, 2017

Thanks, but I believe I've guarded against it properly at line 189 with the final [0] in the jq expression?

Contributor Author

Just read your updated comment now 😉

@redbaron
Contributor

If you have an ASG per AZ, then why not tag the ASG with the volume ID and ENI ID to use, instead of going through a lengthy and error-prone self-discovery process?

@@ -284,6 +284,10 @@ func (c *Cluster) SetDefaults() {
c.Etcd.Subnets = c.PublicSubnets()
}
}

if c.Etcd.InternalDomainName == "" {
c.Etcd.InternalDomainName = fmt.Sprintf("%s.internal", c.ClusterName)
Contributor

not .compute.internal?

Contributor Author

This is now dead code which had been used for the ENI + Route 53 record set way of implementing the network identity for etcd nodes.
I can revive it if you really need it rather than the current EIP-based way.

Contributor Author

I revived this to implement memberIdentityProvider: ENI 😃

@redbaron
Contributor

is EIP the only way to go? it will be a no-go for us :(

@redbaron
Contributor

instead of EIP you can attach ENI in the same way as you do EBS

vol=$(echo "$describe_volume_result" | jq -r ".Volumes[0]")
fi
vol_id=$(echo "$vol" | jq -r ".VolumeId")
eip_alloc_id=$(echo "$vol" | jq -r ".Tags[] | select(.Key == \"kube-aws:etcd:eip-alloc-id\").Value")
Contributor

I like how you tied data (EBS) with network identity (EIP) by tagging EBS.
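
For reference, the pairing tag itself could be created like this when provisioning the pool (a sketch with hypothetical IDs; in this PR the tag is expected to be set by the CloudFormation stack):

# hypothetical volume ID and EIP allocation ID
aws ec2 create-tags --region us-west-2 \
  --resources vol-0123456789abcdef0 \
  --tags Key=kube-aws:etcd:eip-alloc-id,Value=eipalloc-0badc0de12345678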

# Doing so under an EC2 instance under an auto-scaling group would achieve automatic recovery from etcd node failures.

# TODO: Dynamically attach an EBS volume to /dev/xvdf before var-lib-etcd2.mount happens.
# Probably we cant achieve it here but in the "bootstrap" cloud-config embedded in stack-template.json
Contributor Author

These are unnecessary, already-addressed TODO comments.

#!/bin/bash -vxe

instance_id=$(curl http://169.254.169.254/latest/meta-data/instance-id)
private_ip=$(curl http://169.254.169.254/latest/meta-data/local-ipv4)
Contributor Author

private_ip is probably an unused shell variable

echo $assumed_hostname > /var/run/coreos/assumed_hostname
echo $eip_alloc_id > /var/run/coreos/eip_alloc_id

- path: /opt/bin/assume-etcd-hostname-with-private-ip
Contributor

if you go with ENI, then IP addresses will be fixed and Route53 records can be created once and for all by CF

Contributor Author

@mumoshu mumoshu Feb 21, 2017

Certainly, but it makes disaster recovery a bit more difficult (not impossible, just slower) as described in #332 (comment)

"Name=key,Values=aws:cloudformation:stack-name" \
--output json \
| jq -r ".Tags[].Value"
)
Contributor Author

nit: stack_name should be taken from the env var KUBE_AWS_STACK_NAME injected via the embedded userdata in stack-template.json because it is a more portable way to get the stack name. More concretely, the aws:cloudformation:stack-name and other tags won't be populated when the EC2 instance is created via a spot fleet.
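
A sketch of the suggested change (assuming $instance_id is already set from instance metadata): prefer the injected env var and fall back to the tag only when it is missing.

# KUBE_AWS_STACK_NAME is assumed to be injected via the embedded userdata;
# tag lookup is kept only as a fallback (it doesn't work for spot fleet instances)
stack_name="${KUBE_AWS_STACK_NAME:-$(
  aws ec2 describe-tags --region {{.Region}} \
    --filters "Name=resource-id,Values=$instance_id" \
              "Name=key,Values=aws:cloudformation:stack-name" \
    --output json | jq -r '.Tags[].Value'
)}"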

@redbaron
Contributor

Automatic recovery from temporary Etcd node failures
Even if all the nodes went down, the cluster recovers eventually as long as the EBS volumes aren't corrupted

Is it so? AFAIK if all nodes go down, then quorum is lost and it requires the following disaster-recovery process: https://github.com/coreos/etcd/blob/master/Documentation/v2/admin_guide.md#disaster-recovery

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

instead of EIP you can attach ENI in the same way as you do EBS

I'm almost OK with ENI instead of EIP, but the ENI approach is less complete than the EIP one.
I just wanted to go with the more complete option.

My concern is the fact that an ENI can't move between AZs.
It implies that, for a user like me who is in an AWS region with only 2 AZs available and hence must rely on a single-AZ etcd cluster, we have to recreate all the ENIs (with different private IPs), and therefore all the etcd members and the cluster, in another live AZ when the single AZ goes down.

Using EIPs instead allows us to retain and reassign those IPs, and hence the etcd members and the etcd cluster, while recreating all the backing ASGs and EC2 instances for etcd in another AZ.

// BTW, in both cases, we need EBS snapshots to recreate EBS volumes with equivalent data in another AZ
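
For reference, moving an EBS volume's data to another AZ goes through a snapshot, roughly like this (hypothetical IDs and AZs):

snap_id=$(aws ec2 create-snapshot --region us-west-2 \
  --volume-id vol-0123456789abcdef0 \
  --query SnapshotId --output text)
aws ec2 wait snapshot-completed --region us-west-2 --snapshot-ids "$snap_id"
aws ec2 create-volume --region us-west-2 \
  --snapshot-id "$snap_id" --availability-zone us-west-2b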

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

Is it so? AFAIK if all nodes go down, then quorum is lost and it requires following disaster-recovery process: https://github.com/coreos/etcd/blob/master/Documentation/v2/admin_guide.md#disaster-recovery

Yes, because we retain data in the EBS volumes.
There's a brief downtime (approx. 5 min) between when all the etcd nodes are terminated and when they are recreated, but the cluster starts functioning again after that.

@redbaron
Contributor

Yes, because we retain data in the EBS volumes.
There's a brief downtime (approx. 5 min) between when all the etcd nodes are terminated and when they are recreated, but the cluster starts functioning again after that.

Probably that works only for a clean shutdown; the etcd2 docs say that if quorum is lost then no auto-recovery is possible:

However, in extreme circumstances, a cluster might permanently lose enough members such that quorum is irrevocably lost. For example, if a three-node cluster suffered two simultaneous and unrecoverable machine failures, it would be normally impossible for the cluster to restore quorum and continue functioning.

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

@redbaron AFAIK, the "unrecoverable machine failures" in this case means corrupted EBS volumes.
As long as the data in the EBS volumes isn't corrupted, it doesn't count as one of the "extreme circumstances"; it is a recoverable failure.

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

is EIP the only way to go? it will be a no-go for us :(

Would you mind sharing why EIPs aren't an option for you?
You may already know, but the EIPs are used just to stabilize hostnames, which are eventually resolved to the "private IPs" of the etcd nodes (by the Amazon DNS).

@redbaron
Contributor

@redbaron AFAIK, the unrecoverable machine failures in this case means corrupted EBS volumes.
As long as the data in the EBS volumes isn't corrupted, it doesn't count as one of the "extreme circumstances"; it is a recoverable failure.

I can't see how it can work if the AZ with the last available leader is lost AND quorum is lost too. So in your 2-AZ case, if the AZ with 2 nodes goes down and one of them was the leader when it happened, even if you restore the EBS into the healthy AZ, etcd shouldn't come up; otherwise it would allow silent data loss without explicit permission from the operator, which I doubt very much.

@@ -157,7 +157,7 @@ func (c *Cluster) Assets() (cfnstack.Assets, error) {

return cfnstack.NewAssetsBuilder(c.StackName(), c.StackConfig.S3URI).
Add("userdata-controller", c.UserDataController).
Add("userdata-worker", c.UserDataWorker).
Contributor

Is there a mistake here?

Contributor Author

No!
Although it is required only by a node pool stack, we had been unnecessarily uploading userdata-worker here for the control-plane stack, too.

"cloudconfig" : "{{.UserDataEtcd}}"
}
}
},
"Resources": {
"{{.Controller.LogicalName}}": {
"Type": "AWS::AutoScaling::AutoScalingGroup",
Contributor

👍 for etcd in asg


+1 for etcd in ASG also.

@mumoshu
Contributor Author

mumoshu commented Feb 21, 2017

I can't see how it can work if the AZ with the last available leader is lost AND quorum is lost too. So in your 2-AZ case, if the AZ with 2 nodes goes down and one of them was the leader when it happened, even if you restore the EBS into the healthy AZ, etcd shouldn't come up; otherwise it would allow silent data loss without explicit permission from the operator, which I doubt very much.

Thanks!
That's why I'm no longer going to deploy a 2-AZ etcd cluster.
As long as your cluster is distributed across an odd number of AZs, I believe my comment at #332 (comment) applies.

@mumoshu
Contributor Author

mumoshu commented Mar 1, 2017

Rebased.

@mumoshu
Contributor Author

mumoshu commented Mar 1, 2017

E2E tests passed.

@mumoshu mumoshu merged commit 54eab73 into kubernetes-retired:master Mar 1, 2017
@mumoshu mumoshu deleted the ha-etcd branch March 1, 2017 06:54
@redbaron
Contributor

redbaron commented Mar 1, 2017

@mumoshu , this is incredible, thank you

@cknowles
Contributor

cknowles commented Mar 13, 2017

@mumoshu I've just managed to upgrade the etcd instances to a larger type without downtime, thanks to you! 🎉

It seems there was a slight pause in etcd responding in a 3-node cluster when the state was:

  • first new node was up, old node terminated
  • second new node was running but possibly not quite fully linked into cluster yet, old node terminated
  • third new node was not up, old node still running

The pause was circa 20 seconds, and various processes including kubectl and the dashboard became unresponsive momentarily. I just wanted to check whether anyone has seen anything similar before I try to diagnose further. Each of the wait signals passed after around 5 minutes, so it looks like this was etcd-related somehow.

@mumoshu
Contributor Author

mumoshu commented Mar 13, 2017

@c-knowles I greatly appreciate this kind of feedback! Thanks.
I'm still looking for a better way to reduce possible downtime like that.

AFAIK, each etcd2 member (=the etcd2 process inside a rkt pod) doesn't wait on startup until it is connected and ready to serve requests, and there's no way to know when the member is actually ready.

For example, running etcdctl --peers <first etcd member's advertised peer url> cluster-health would block until enough of the remaining etcd members are up to meet quorum (2 for your cluster). An incomplete solution like that hits a chicken-and-egg problem and breaks the wait signals. That's why it doesn't wait for an etcd2 member to be ready, and hence downtime can't be avoided completely.

For me, the downtime was less than 1 second when I first tried, but I suspect the result varies from run to run, hence your case.
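
One possible mitigation (not what this PR implements) would be to poll the member's health endpoint with a bounded timeout before signalling, so the wait signal neither skips the check entirely nor blocks forever:

# Sketch: wait up to ~5 minutes for the etcd2 member to report healthy;
# the local endpoint/port are assumptions
for i in $(seq 1 60); do
  if curl -fs http://127.0.0.1:2379/health | jq -e '.health == "true"' >/dev/null; then
    echo "etcd member is healthy"
    break
  fi
  sleep 5
done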

@cknowles
Contributor

cknowles commented Mar 13, 2017

@mumoshu Yeah, it probably varies a little bit. Is there an issue to track this as a known issue? I had a look but didn't find one unless it's part of etcd v2. If not, we should make one for other users who come across this.

@mumoshu
Contributor Author

mumoshu commented Mar 13, 2017

@c-knowles There's no GitHub issue I'm aware of 😢
Btw, etcd3 seems to signal systemd for readiness when its systemd unit is set to Type=notify.
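
A sketch of what that could look like on Container Linux (assuming the etcd-member.service unit name there): with Type=notify, a unit ordered After= it, such as cfn-signal.service, would only start once etcd has signalled readiness.

mkdir -p /etc/systemd/system/etcd-member.service.d
cat <<'EOF' > /etc/systemd/system/etcd-member.service.d/30-notify.conf
[Service]
Type=notify
EOF
systemctl daemon-reload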

@redbaron
Contributor

Can't we draw dependencies between the ASGs, so CF will roll them one by one? It would allow quorum to be maintained all the time.

@cknowles
Contributor

I've separated the issue out as above so we can track any progress we make on this.

@mumoshu
Contributor Author

mumoshu commented Mar 13, 2017

Hi @redbaron!
It is already implemented that way. ASGs (and hence the etcd nodes managed by them) do get replaced one by one according to the dependencies among ASGs combined with wait signals. Therefore my guess is that we're proceeding to the next node too early because of insufficient wait signals (#411), which would result in a temporary loss of quorum in an extreme case (=the rolling-update happened faster than I'd expected).

@mumoshu
Contributor Author

mumoshu commented Mar 13, 2017

@c-knowles Thanks for the good writeup!

@trinitronx
Contributor

@mumoshu:

Hello, I've just tested out the new kube-aws v0.9.5-rc.6 yesterday and I was unable to get it to work due to an error: The following resource(s) failed to create: [Controlplane].

After diving further into this, I found that the reason it was failing was really that we had reached our EIP limit. The real error was: The maximum number of addresses has been reached.

So I have a question about this seemingly new kube-aws memberIdentityProvider option for Etcd2. We are already using our full allotment of EIPs from AWS, and this requirement is new to us. Why is this now necessary instead of using an SRV record for the Etcd2 nodes?

It seems to me that SRV record discovery would be cleaner than having to use EIPs for the Etcd2 nodes, and more dynamic, as it would not require hardcoding IPs into the Etcd2 nodes in a static config file. This forces the state of the ETCD_INITIAL_CLUSTER variable into files stored on the EBS volume. Having to manage EIPs and EBS volumes in case of failure or scaling seems a bit backwards and would easily hit the default limit of 5 EIPs that AWS imposes, given a larger Etcd2 cluster.

My thought was that SRV discovery could allow the CloudFormation template from kube-aws to then just manage a single Route53 SRV record instead based on the Etcd2 AutoScalingGroup. The current implementation of writing a bunch of hardcoded IPs to /var/run/coreos/etcd-environment seems like a bit of an unscalable hack IMHO (no offense intended).

Has this option for Etcd2 been explored yet, or are there any pitfalls / pros / cons involved with this type of configuration?
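
For reference, etcd's DNS SRV discovery (which this PR does not implement) works roughly like the following, assuming a private zone example.internal with _etcd-server._tcp SRV records pointing at per-member A records:

# Each member only needs its own name plus the discovery domain;
# the zone name and member hostnames here are hypothetical
ETCD_NAME=etcd0 \
ETCD_DISCOVERY_SRV=example.internal \
ETCD_INITIAL_ADVERTISE_PEER_URLS=http://etcd0.example.internal:2380 \
ETCD_ADVERTISE_CLIENT_URLS=http://etcd0.example.internal:2379 \
ETCD_LISTEN_PEER_URLS=http://0.0.0.0:2380 \
ETCD_LISTEN_CLIENT_URLS=http://0.0.0.0:2379 \
etcd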

@trinitronx
Contributor

@mumoshu: Also perhaps related: Would waiting for the PR from #417 change the picture any by choosing to avoid Etcd2 for a new cluster, and simply go with Kubernetes v1.6 and Etcd3?

After reading a bit of the surrounding information regarding disaster recovery, this choice seems to be a pivotal moment that would seal our cluster's fate & ease of future maintenance in the event of a disaster. Any thoughts or recommendations here?

@billyteves

@mumoshu Regarding the etcd cluster, is the data replicated? If one server goes down, is the data still intact on the other servers? Can you also suggest a recommended etcd storage size?

@jsravn

jsravn commented Apr 5, 2017

A bit late to the party, but we've built something similar to the Monsanto solution at https://github.com/sky-uk/etcd-bootstrap. It handles node replacement, cluster expansion & shrinking, and new clusters, and we've been using it in prod for a while now. It's self-contained: just a Docker wrapper around etcd that queries the local ASG for the other instances in the group. If you run the apiserver on the same node, it's easy to run this and have the apiserver hit localhost, with an ELB on the ASG to load-balance to the apiservers.

The benefit of managing the EBS/ENI separately isn't entirely clear to me: why not just rebuild the node, including its EBS/ENI? Is that in case the entire cluster dies?

@Vince-Cercury

With ENI as the memberIdentityProvider and 3 private subnets in 3 different AZs, how many etcd instances (per AZ?) should we run, in order to:

  • keep the cluster running if one AZ goes down?
  • keep the cluster running if two AZs go down?

How many instances do we need at minimum if we are OK with some etcd downtime (until the ASG re-creates the instances and re-attaches the EBS and ENI)?

Alternatively, if you could point me to docs, I can do the maths. I'd be happy to read any doc that helps me understand how kube-aws solved the problem.

@redbaron
Contributor

@Vincemd, etcd needs to maintain quorum at all times to keep the cluster running. Quorum is N/2+1, where N is the number of etcd nodes.

Therefore, if you run 1 node per AZ, you'll continue to maintain quorum in case one AZ goes down.

To tolerate a two-AZ failure, you'd need to span your cluster across 5 AZs, 2 of which would probably have to be in another region. I don't have information on how happy an etcd cluster is when it sees a significant latency increase for certain members of the cluster.
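
Concretely: 3 members give a quorum of 2 and tolerate 1 failure; 5 members give a quorum of 3 and tolerate 2; 4 members also need a quorum of 3, so an even member count tolerates no more failures than the next lower odd count.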

@Vince-Cercury

Thanks @redbaron. So I will run 3 nodes, one per AZ we have available in the Sydney region. This will tolerate 1 AZ failure.

In case 2 nodes from 2 different AZs are down at the same time, would the ASG + ENI + EBS solution from @mumoshu allow the etcd cluster to recover automatically, with some downtime? Assuming the ASG is able to create an EC2 instance again in the same AZs that were affected (since an ENI cannot be moved from one AZ to another) and re-attach the EBS and ENI fine.
-> I'm just trying to put the case for ENI forward. I somewhat understand the idea, but need to explain it better to my peers.

We had that situation in Sydney recently when 1 AZ was down and another one was affected temporarily. It's also not impossible to see 2 instances fail at the same time in 2 AZs. Rare but not impossible.

If, for some reason, ENI+EBS does not help/work, would a manual intervention allow recovery of the cluster by cleaning up and electing a new leader? I think we are fine with downtime in the case of 2 nodes being down, as it's very unlikely. The apps will still run; Kubernetes just won't be able to manage the pods until etcd is fixed, I assume.

@redbaron
Contributor

There were some bugs in automatic recovery, which hopefully have been ironed out, but in theory yes, it recovers once the AZ is back.

Why do you push for ENI and not the default EIP? EIP should allow you to restore quorum without waiting for the AZ to become available again.

@Vince-Cercury

Everything has to be private (private subnets only and no EIP allowed).

@redbaron
Contributor

The trick is that even if an EIP is used, it is resolved to a private IP address, and therefore can be used inside private subnets.

@Vince-Cercury

I got an AWS error when I tried it, though, because there was no Internet Gateway in my subnet, since the subnet is private and I'm not allowed to use an IGW.

@redbaron
Contributor

Are you using amazon DNS?

kylehodgetts pushed a commit to HotelsDotCom/kube-aws that referenced this pull request Mar 27, 2018