Just realized that what I'm trying to achieve is similar in concept to what simlodon mentioned in kubernetes/kops#772 😃
Ah, I had somehow been forgetting the fact that an EBS volume can't move around AZs, too 😢
I'm going to create a dedicated ASG for each etcd instance.
I've verified the current implementation by triggering a rolling update of instance types for etcd nodes while running the k8s conformance test.
My biggest remaining concern is now:

Hi @pieterlange @camilb, could I have your comments/requests regarding the TODOs and Non-TODOs in the description of this PR, if any? 😃
Hi @redbaron, I've implemented my POC to make the etcd cluster a bit more H/A.
Updated the supported use-cases in the description:
aws ec2 describe-volumes \
  --region {{.Region}} \
  --filters Name=tag:aws:cloudformation:stack-name,Values=$stack_name Name=tag:kube-aws:owner:role,Values=etcd Name=status,Values=available \
you can add `Name=availability-zone,Values=$az` here and drop the jq filter later
Good catch 👍 Thanks
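For reference, the combined filter might look something like this (a sketch, assuming `$az` is computed earlier in the script):

```bash
# Filter volumes by AZ server-side so the jq-based AZ filter can be dropped.
describe_volume_result=$(
  aws ec2 describe-volumes \
    --region {{.Region}} \
    --filters \
      "Name=tag:aws:cloudformation:stack-name,Values=$stack_name" \
      "Name=tag:kube-aws:owner:role,Values=etcd" \
      "Name=status,Values=available" \
      "Name=availability-zone,Values=$az" \
    --output json
)
```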
echo "no etcd volume available in availability zone $az" | ||
fi | ||
|
||
vol_id=$(echo "$vol" | jq -r ".VolumeId") |
This will break if more than 1 volume is found. It shouldn't happen normally, but it's better to guard against it. Edit: missed that you do it already.
Thanks, but I believe I've guarded it properly at line 189 with the final `[0]` in the jq expression?
Just read your updated comment now 😉
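For context, the guard being referred to is the trailing `[0]`, roughly:

```bash
# .Volumes[0] picks exactly one volume even if describe-volumes
# unexpectedly returns several, so the later jq lookups never see an array.
vol=$(echo "$describe_volume_result" | jq -r ".Volumes[0]")
```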
If you have an ASG per AZ, then why not tag the ASG with the volume id and ENI id to use, instead of going through a lengthy and error-prone self-discovery process?
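A sketch of what that alternative might look like (the tag key below is hypothetical, not part of this PR):

```bash
# Hypothetical: look up this instance's ASG, then read a pre-assigned
# volume id from the ASG's tags instead of self-discovering one.
asg_name=$(aws autoscaling describe-auto-scaling-instances \
  --region {{.Region}} --instance-ids "$instance_id" --output json \
  | jq -r ".AutoScalingInstances[0].AutoScalingGroupName")
vol_id=$(aws autoscaling describe-tags \
  --region {{.Region}} \
  --filters "Name=auto-scaling-group,Values=$asg_name" \
            "Name=key,Values=kube-aws:etcd:volume-id" \
  --output json | jq -r ".Tags[0].Value")
```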
core/controlplane/config/config.go
@@ -284,6 +284,10 @@ func (c *Cluster) SetDefaults() {
		c.Etcd.Subnets = c.PublicSubnets()
	}
}

if c.Etcd.InternalDomainName == "" {
	c.Etcd.InternalDomainName = fmt.Sprintf("%s.internal", c.ClusterName)
not `.compute.internal`?
This is now dead code which had been used for the ENI + Route53 record set way of implementing the network identity for etcd nodes.
I can revive it if you really need it rather than the current EIP way.
I revived this to implement `memberIdentityProvider: ENI` 😃
Is EIP the only way to go? It will be a no-go for us :(
Instead of an EIP you can attach an ENI in the same way as you do EBS.
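For illustration, attaching an ENI is a single API call once its id is known (a sketch; `$eni_id` would come from a tag lookup similar to the EBS volume discovery):

```bash
# Attach a pre-created ENI as a secondary interface instead of
# associating an EIP; the ENI keeps its private IP across instances.
aws ec2 attach-network-interface \
  --region {{.Region}} \
  --network-interface-id "$eni_id" \
  --instance-id "$instance_id" \
  --device-index 1
```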
vol=$(echo "$describe_volume_result" | jq -r ".Volumes[0]")
fi
vol_id=$(echo "$vol" | jq -r ".VolumeId")
eip_alloc_id=$(echo "$vol" | jq -r ".Tags[] | select(.Key == \"kube-aws:etcd:eip-alloc-id\").Value")
I like how you tied data (EBS) with network identity (EIP) by tagging EBS.
# Doing so under an EC2 instance under an auto-scaling group would achieve automatic recovery from etcd node failures.

# TODO: Dynamically attach an EBS volume to /dev/xvdf before var-lib-etcd2.mount happens.
# Probably we can't achieve it here but in the "bootstrap" cloud-config embedded in stack-template.json
Unnecessary; these TODOs in comments have already been addressed.
#!/bin/bash -vxe

instance_id=$(curl http://169.254.169.254/latest/meta-data/instance-id)
private_ip=$(curl http://169.254.169.254/latest/meta-data/local-ipv4)
`private_ip` is probably an unused shell variable
echo $assumed_hostname > /var/run/coreos/assumed_hostname
echo $eip_alloc_id > /var/run/coreos/eip_alloc_id

- path: /opt/bin/assume-etcd-hostname-with-private-ip
if you go with ENI, then IP addresses will be fixed and Route53 records can be created once and for all by CF
Certainly, but it makes disaster recovery a bit more difficult (though not impossible, just slower) as described in #332 (comment)
"Name=key,Values=aws:cloudformation:stack-name" \ | ||
--output json \ | ||
| jq -r ".Tags[].Value" | ||
) |
nit: `stack_name` should be taken from the env var `KUBE_AWS_STACK_NAME` injected via the embedded userdata in stack-template.json, because it is a more portable way to get the stack name. More concretely, the `aws:cloudformation:stack-name` and other tags won't be populated when the EC2 instance is created via a spot fleet.
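A sketch of the suggested change:

```bash
# Prefer the env var injected via the embedded userdata over an EC2 tag
# lookup, since spot-fleet instances won't carry the CloudFormation tags.
stack_name="${KUBE_AWS_STACK_NAME:?expected KUBE_AWS_STACK_NAME to be injected via userdata}"
```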
Is it so? AFAIK, if all nodes go down, then quorum is lost and it requires following the disaster-recovery process: https://github.com/coreos/etcd/blob/master/Documentation/v2/admin_guide.md#disaster-recovery
I'm almost OK with the ENI instead of the EIP, but the ENI way isn't more complete than the EIP. My concern is the fact that an ENI can't move between different AZs. Using EIPs instead allows us to retain and reassign these IPs, hence etcd members and the etcd cluster, while recreating all the backing ASGs and EC2 instances for etcd in another AZ. // BTW, in both cases, we need EBS snapshots to recreate EBS volumes with equivalent data in another AZ.
Yes, because we retain data in EBS volumes.
Probably that works only for a clean shutdown; the etcd2 docs say that if quorum is lost then no auto-recovery is possible:

@redbaron AFAIK, the "unrecoverable machine failures" in this case means corrupted EBS volumes.
Would you mind sharing why EIP isn't the way to go for you?
I can't see how it can work if the AZ with the last available leader is lost AND quorum is lost too. So in your 2-AZ case, if the AZ with 2 nodes goes down and one of them was the leader when it happened, even if you restore EBS into the healthy AZ, etcd shouldn't come up; otherwise it would allow silent data loss without explicit permission from the operator, which I doubt very much.
@@ -157,7 +157,7 @@ func (c *Cluster) Assets() (cfnstack.Assets, error) {

return cfnstack.NewAssetsBuilder(c.StackName(), c.StackConfig.S3URI).
	Add("userdata-controller", c.UserDataController).
	Add("userdata-worker", c.UserDataWorker).
Is there a mistake here?
No! Although it is required only by a node pool stack, we had been unnecessarily uploading `userdata-worker` here for the control-plane stack, too.
"cloudconfig" : "{{.UserDataEtcd}}" | ||
} | ||
} | ||
}, | ||
"Resources": { | ||
"{{.Controller.LogicalName}}": { | ||
"Type": "AWS::AutoScaling::AutoScalingGroup", |
👍 for etcd in asg
+1 for etcd in ASG also.
Thanks!
Rebased.
E2E tests have passed.
@mumoshu, this is incredible, thank you
@mumoshu I've just managed to upgrade etcd instances to a larger type as an uptime upgrade thanks to you! 🎉 It seems there was a slight pause in etcd responding in a 3-node cluster when the state was:
The pause was circa 20 seconds, and various processes including kubectl and the dashboard became unresponsive momentarily. I just wanted to check if anyone has seen anything similar before trying to diagnose more? Each of the wait signals was passing after around 5 minutes, so it looks like this was etcd related somehow.
@c-knowles I greatly appreciate this kind of feedback! Thanks. AFAIK, each etcd2 member (= the etcd2 process inside a rkt pod) doesn't wait until the member becomes connected and ready to serve requests on its startup, and there's no way to know the member is actually ready. For example, running
For me, the down time was less than 1 sec when I first tried, but I suspect the result varies from time to time, hence your case.
@mumoshu Yeah, it probably varies a little bit. Is there an issue to track this as a known issue? I had a look but didn't find one, unless it's part of etcd v2. If not, we should make one for other users who come across this.
@c-knowles There's no GitHub issue I'm aware of 😢
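One conceivable mitigation would be to gate the node's success signal on cluster health (a rough sketch only; TLS flags are omitted, and `$assumed_hostname` reuses the variable written out by the userdata above):

```bash
# Block until this member reports healthy before signalling success,
# so a rolling update doesn't proceed while a member is still syncing.
until etcdctl --endpoints "https://${assumed_hostname}:2379" cluster-health; do
  echo "etcd member not ready yet; retrying" >&2
  sleep 5
done
```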
Can't we draw dependencies between ASGs? Then CF would roll them one by one, which would allow quorum to be maintained all the time.
I've separated the issue out as above so we can track any progress we make on this.
Hi @redbaron!
@c-knowles Thanks for the good writeup!
Hello, I've just tested out the new
After diving further into this, I found that the reason it was failing was really because we had reached our EIP limit. The real error was:
So I have a question about this seemingly new
It seems to me that
My thought was that
Has this option for Etcd2 been explored yet, or are there any pitfalls / pros / cons involved with this type of configuration?
@mumoshu: Also perhaps related: would waiting for the PR from #417 change the picture at all, by choosing to avoid Etcd2 for a new cluster and simply going with Kubernetes
After reading a bit of the surrounding information regarding disaster recovery, this choice seems to be a pivotal moment that would seal our cluster's fate and ease of future maintenance in the event of a disaster. Any thoughts or recommendations here?
@mumoshu Regarding the etcd cluster, is the data replicated? If one server goes down, is the data still intact on the other servers? Can you also suggest a recommended etcd storage size?
A bit late to the party, but we've built something similar to the Monsanto solution at https://github.com/sky-uk/etcd-bootstrap. It handles node replacement / cluster expansion & shrinking / new clusters, and we've been using it in prod for a while now. It's self-contained, just a Docker wrapper around etcd that queries the local ASG for the other instances in the group. If you run the apiserver on the same node, it's easy to run this and have the apiserver hit localhost, with an ELB on the ASG to load balance to the apiservers. It's not entirely clear to me what the benefit of managing EBS/ENI separately is - why not just rebuild the node including EBS/ENI? Is that in case the entire cluster dies?
With ENI as the memberIdentityProvider and 3 private subnets in 3 different AZs, how many etcd instances (per AZ?) should we run, in order to:
How many instances minimum do we need if we are OK with etcd downtime (until the ASG re-creates the instances and re-attaches the EBS and ENI)? Alternatively, if you could point me to a doc, I can do the maths. I'd be happy to read any doc to help me understand how kube-aws solved the problem.
@Vincemd, etcd needs to maintain quorum at all times to keep the cluster running. Quorum is N/2+1 where N is the number of etcd nodes; with N=3, for example, quorum is 2, so the cluster tolerates one failed node. Therefore if you run 1 node per AZ, you'll continue to maintain quorum in case one AZ goes down. To tolerate two AZ failures, you'd need to span your cluster across 5 AZs, 2 of which will probably be in another region. I don't have information on how happy an etcd cluster will be when it sees a significant latency increase for certain members of the cluster.
Thanks @redbaron. So I will run 3 nodes, one per AZ we have available in the Sydney region. This will allow 1 AZ failure. In case 2 nodes are down at the same time from 2 different AZs, would the ASG + ENI + EBS solution from @mumoshu allow the etcd cluster to recover automatically, with some downtime? Assuming the ASG is able to create an EC2 instance again in the same AZs that were affected (since an ENI cannot be moved from one AZ to another) and re-attach the EBS and ENI fine. We had that situation in Sydney recently when 1 AZ was down and another one was affected temporarily. It's also not impossible to see 2 instances fail at the same time from 2 AZs. Rare, but not impossible. If, for some reason, ENI+EBS does not help/work, would a manual intervention allow recovery of the cluster by cleaning up and allowing a new leader to be elected? I think we are fine with downtime in case of 2 nodes being down, as it's very unlikely. The apps will still run; Kubernetes just won't be able to manage the pods until etcd is fixed, I assume.
There were some bugs in automatic recovery, which hopefully have been ironed out, but in theory yes, it recovers once the AZ is back. Why do you push for ENI and not the default EIP? An EIP should allow you to restore quorum without waiting for the AZ to become available.
Everything has to be private (private subnets only and no EIP allowed).
The trick is that even if an EIP is used, it resolves to a private IP address, and therefore can be used inside private subnets.
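For illustration (hostname and addresses below are made up), from a host inside the VPC using the Amazon-provided DNS:

```bash
# The EIP's public DNS name resolves to the instance's *private* IP
# when queried from inside the VPC, so private subnets can still use it.
$ dig +short ec2-203-0-113-10.ap-southeast-2.compute.amazonaws.com
10.0.1.23
```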
I got an AWS error though when I tried it, because there was no Internet Gateway in my subnet; the subnet is private and I'm not allowed to use an IGW.
Are you using Amazon DNS?
Managed HA etcd cluster
This is a WIP pull request to achieve a "Managed HA etcd cluster": private IPs are resolved via public EC2 hostnames, stabilized with a pool of EBS and EIP pairs for etcd nodes.
After this change, EC2 instances backing "virtual" etcd nodes are managed by an ASG.
Supported use-cases:

- Choosing how each etcd member retains its identity, via the `etcd.memberIdentityProvider` key in cluster.yaml
  - `"eip"`, which is the default setting, is recommended over `"eni"`.
  - If you choose `"eni"` and your region has less than 3 AZs, setting `etcd.internalDomainName` to something other than the default is HIGHLY RECOMMENDED to prepare for disaster recovery (the case where `memberIdentityProvider` is `"eni"`, `internalDomainName` is set, `manageRecordSets` is `false`, and every EC2 instance has a custom DNS which is capable of resolving FQDNs under `internalDomainName`)

Unsupported use-cases:

- Recovery from more than `(N-1)/2` permanent etcd node failures.
- `ETCD_INITIAL_CLUSTER_STATE`

Relevant issues to be (partly) resolved via this PR:
The general idea is to make etcd nodes "virtual" by retaining the state and the identity of each etcd node in a pair of an EBS volume and an EIP or an ENI, respectively.
This way, we can recover/recreate/rolling-update the EC2 instances backing etcd nodes without additional moving parts like external apps, ASG lifecycle hooks, SQS queues, SNS topics, etc.
Unlike well-known etcd HA solutions like crewjam/etcd-aws and MonsantoCo/etcd-aws-cluster, this is intended to be a less flexible but simpler alternative, or the basis for introducing similar solutions.
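A condensed sketch of the per-node bootstrap flow this implies, assembled from the snippets reviewed above (variable names are illustrative):

```bash
# Find an available etcd volume in this AZ, attach it, then assume the
# network identity (EIP) recorded in the volume's tags.
vol=$(aws ec2 describe-volumes --region {{.Region}} \
  --filters "Name=tag:kube-aws:owner:role,Values=etcd" \
            "Name=status,Values=available" \
            "Name=availability-zone,Values=$az" \
  --output json | jq -r ".Volumes[0]")
vol_id=$(echo "$vol" | jq -r ".VolumeId")
eip_alloc_id=$(echo "$vol" | jq -r '.Tags[] | select(.Key == "kube-aws:etcd:eip-alloc-id").Value')
aws ec2 attach-volume --region {{.Region}} \
  --volume-id "$vol_id" --instance-id "$instance_id" --device /dev/xvdf
aws ec2 associate-address --region {{.Region}} \
  --instance-id "$instance_id" --allocation-id "$eip_alloc_id"
```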
Implementation notes

General rules

Examples of experimented but not employed strategies:

- Making `etcd.subnets[]` point to AZ 2 in cluster.yaml and running `kube-aws update`, then ssh-ing into one of the nodes and restoring etcd state from a backup. Neither is automatic.

TODOs

- Change `EC2::Instance` resources in cfn stack templates to a corresponding pair of a launch configuration and an ASG
- `KUBE_AWS_ETCD_INDEX=<$etcdIndex>` and `KUBE_AWS_STACK_NAME` from the stack template via embedded EC2 userdata
  - `cfn-signal -e 0 --region {{.Region}} --resource {{.Etcd.LogicalName}}${{.EtcdIndexEnvVarName}} --stack ${{.StackNameEnvVarName}}`
- Tweak `ExecStartPre`s of the cfn-signal.service unit if necessary
  - `cfn-signal.service` for etcd nodes is set to `Wants=etcd2.service` and `After=etcd2.service`
- Security/Fail-proof: Prevent attaching/associating wrong EBS volumes and EIPs
  - Using resource-level permissions for `ec2:AttachVolume`

Non-TODOs (for now)

- `kube-node-drainer.service` for worker nodes