Would love to test this once you're happy with it, if you're looking for volunteers.
@wolfeidau the TLS stuff still isn't done yet, so it's not usable. I've been slacking on this one but will circle back very soon. If you have some time, I'm accepting sub-PRs on
@colhom What a PR! Great job. Btw, would you mind sharing which part of the TLS stuff isn't done? (I've read through your code but couldn't figure it out.)
@@ -60,6 +60,8 @@ kmsKeyArn: "{{.KMSKeyARN}}"
# Price (Dollars) to bid for spot instances. Omit for on-demand instances.
# workerSpotPrice: "0.05"

# Instance type for etcd node
nit: Missing descriptions and examples for newly added parameters
#etcdInstanceType: "m3.medium"
# Number of etcd nodes (should be 1, 3, 5, and so on to prevent split-brain)
#etcdCount: 1
# Disk size (GiB) for etcd node root volume
#etcdRootVolumeSize: 30
# Disk size (GiB) for etcd node data volume. This one is used by etcd to store its key-value data and write-ahead logs
#etcdDataVolumeSize: 30
	dnsSuffix = "ec2.internal"
} else {
	dnsSuffix = fmt.Sprintf("%s.compute.internal", config.Region)
}
Encountered this yesterday and had patched it exactly the same way you did 😄 Great.
@wolfeidau go ahead and give it a whirl! I recommend jumping ahead to #596 and grabbing the HA control plane stuff as well on top of this PR- it should all be functional.
Hi everyone- update on things. I've been experimenting with full cluster upgrades, and frankly, statically defined etcd clusters (i.e. this PR) do not work well with CloudFormation upgrades. The only strategy that really works well "in general" with CloudFormation updates is doing a rolling replacement on an autoscaling group, and currently we don't have a good solution for running etcd robustly in an ASG.

We're currently investigating various possibilities for implementing a dynamic/managed etcd cluster- maybe in an ASG, or more optimally hosting etcd on Kubernetes and greatly simplifying the CloudFormation stack as a result. The eventual goal is to entirely decouple Kubernetes cluster-level upgrades (control plane, kubelet, etcd) from updates to the underlying AWS virtual hardware. Etcd on Kubernetes would be a step in the right direction. Input?
The band-aid for now would be to manually ensure that the etcd instances are never updated by CloudFormation after they are created. This is crappy for a lot of reasons, but it's the only workable solution until we can have etcd nodes that aren't pets. At least CoreOS can take care of doing the OS upgrades 👍

It's ugly, but it's a grokable hack that would allow us to limp along and take care of our named etcd pets until we have etcd on Kubernetes ready for prime time. \cc @mumoshu @pieterlange
@colhom I'm personally looking into implementing https://crewjam.com/etcd-aws/ as my backing cluster. I actually had a bad experience today, losing etcd state across an autoscaling group managed by Monsanto's solution- and that was after 3+ months of etcd-aws-cluster running perfectly, btw. There's a lot of work involved in getting software production ready, and @crewjam's solution seems the most fleshed out, so after tryouts chances are I'll be going with that. The integrated backup (recovery from complete cluster failure) is the selling point for me after today.
@pieterlange sorry to hear about your data loss. Out of morbid curiosity- did your etcd instances fail health checks due to load, causing the ASG to replace them one-by-one until your data was gone?
Nothing like that, but I did run them (3) on t2.nanos (with 20 workers and 1 master). I spun up the etcd cluster separately and never really "managed" it... it's a small miracle I got away with it as long as I did :)
@pieterlange and everyone -- I'd be happy to work with you on this. We did a bit of research and planning on production use of etcd on AWS, which resulted in the repo & blog describing it. One caveat though -- for reasons not at all related to the work, our focus shifted a bit, and so we haven't actually put it into production. We got close, and we are still planning to, but it hasn't happened just yet.
@dghubble @aaronlevy @pbx0 PTAL
	Resources map[string]map[string]interface{} `json:"Resources"`
}

func (c *Cluster) lockEtcdResources(cfSvc *cloudformation.CloudFormation, stackBody string) (string, error) {
Can you add a bit more documentation explaining why this is necessary / how the functionality will be replaced (for those of us unfamiliar with cloudformation).
Problem: We must ensure that a CloudFormation update never affects the etcd cluster instances.
Reason: When updating EC2 instances, CloudFormation may choose a destroy-and-replace strategy, which has the potential to corrupt the etcd cluster.
Solution: lockEtcdResources is a 50-line function which only runs on cluster update- it queries the current state of the etcd cluster and ensures the definition does not change across CloudFormation updates.
Concerns over fragility: This is a simple, mechanical operation- but even those can go wrong. Assuming:
- lockEtcdResources() errors out: the update will fail.
- lockEtcdResources() munges the etcd cluster definition: in that case, CloudFormation will attempt to update the etcd cluster instances and fail, because they are marked as "non-updatable" when first created. The CloudFormation update will then roll back, and the user will be left with a non-updatable cluster.
In either case, the failure mode is "cluster non-updatable"- which is exactly where every user is today anyway.
^^ Something to that effect?
sgtm
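For readers unfamiliar with CloudFormation, here is a minimal Go sketch of the masking idea described above. This is a hypothetical illustration, not the actual kube-aws code: the `InstanceEtcd` logical-name prefix is an assumption, and the real lockEtcdResources also queries the deployed stack via the CloudFormation API.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// lockEtcdSketch overwrites the etcd resource subtrees in the proposed stack
// body with the definitions from the currently-deployed stack, so a
// CloudFormation update computes an empty diff for those instances.
func lockEtcdSketch(deployedBody, proposedBody string) (string, error) {
	var deployed, proposed map[string]interface{}
	if err := json.Unmarshal([]byte(deployedBody), &deployed); err != nil {
		return "", err
	}
	if err := json.Unmarshal([]byte(proposedBody), &proposed); err != nil {
		return "", err
	}
	depRes, ok1 := deployed["Resources"].(map[string]interface{})
	newRes, ok2 := proposed["Resources"].(map[string]interface{})
	if !ok1 || !ok2 {
		return "", fmt.Errorf("stack body is missing a Resources section")
	}
	for name, def := range depRes {
		// "InstanceEtcd" is an assumed naming convention for illustration.
		if strings.HasPrefix(name, "InstanceEtcd") {
			newRes[name] = def // keep the deployed definition verbatim
		}
	}
	out, err := json.Marshal(proposed)
	return string(out), err
}
```

The failure modes then match the comment above: an error aborts the update, and a munged definition is rejected by CloudFormation because the instances are marked non-updatable.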
Is the separate etcd cluster optional? Are you planning to remove/merge this feature with self-hosted Kubernetes? They seem at odds. Also, the addition of etcd TLS here seems orthogonal.
Self-hosted would be really nice, but speaking as a lowly sysadmin: there are enough moving parts in this system already. I need to be able to rely on/trust a stable etcd. I'm even proposing to leave this outside of the management of kube-aws altogether in #629!
# Decrypt each KMS-encrypted TLS asset into a temp file via the awscli container.
for encKey in $(find /etc/etcd2/ssl/*.pem.enc); do
  tmpPath="/tmp/$(basename $encKey).tmp"
  docker run --net host --rm -v /etc/etcd2/ssl:/etc/etcd2/ssl quay.io/coreos/awscli aws --region {{.Region}} kms decrypt --ciphertext-blob fileb://$encKey --output text --query Plaintext | base64 --decode > $tmpPath
would prefer that we use rkt here
No, it's very much mandatory. If etcd is not on separate machines, Kubernetes EC2 instances cannot be updated, which is a big deal. It's a pre-req for dynamically scaling compute resources for both the control plane and worker pool- something AWS users (including me!) are very excited about. Until self-hosted etcd is stable and production ready, this is a requirement.
Bootkube currently provisions a static single-node etcd instance and points the control plane at it- and that's really it. A single point of non-fault-tolerant failure for your otherwise fault-tolerant control plane. If anything, this would complement self-hosted by making etcd more robust (>1 node, multi-zone, etc.) and really not change anything about how self-hosted works currently, as neither bootkube nor this PR deals at all with self-hosting etcd.
If an etcd interface is exposed to any sort of network, TLS encryption is a must. We have just exposed etcd to the subnet, hence encryption becomes mandatory. If it's only listening on localhost, it's not as big of a deal.
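To make that concrete, here is a minimal Go sketch of a client speaking mutual TLS to an etcd member using the kind of material the decrypt loop above produces. The file names under /etc/etcd2/ssl and the member address are assumptions for illustration, not the exact assets this PR generates:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	// Client certificate presented to etcd for mutual authentication.
	cert, err := tls.LoadX509KeyPair("/etc/etcd2/ssl/etcd-client.pem", "/etc/etcd2/ssl/etcd-client-key.pem")
	if err != nil {
		log.Fatal(err)
	}
	caPEM, err := ioutil.ReadFile("/etc/etcd2/ssl/ca.pem")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		log.Fatal("no CA certificates parsed")
	}
	client := &http.Client{Transport: &http.Transport{TLSClientConfig: &tls.Config{
		Certificates: []tls.Certificate{cert}, // client cert for mutual auth
		RootCAs:      pool,                    // trust only the cluster CA
	}}}
	resp, err := client.Get("https://10.0.0.50:2379/v2/keys") // assumed etcd member address
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```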
Update- just did some fishing around in bootkube- it appears the multi-node bootkube example actually uses a discrete/external multi-node etcd cluster- just like this PR! So, in reality, this would make bootkube on AWS look almost exactly like the multi-node vagrant setup, just plus TLS!
@colhom that review pass is done. Mostly questions at this stage.

@dghubble @pieterlange I'm personally on the fence about this functionality as well - but @colhom is championing for this to be included, and I won't explicitly block on "I feel iffy about this" if @colhom is willing to support it long-term.

My general concerns:

This feels somewhat like a big workaround. If CloudFormation does not support this type of deployment, why are we fighting it to try? Parsing out the etcd sections to keep them from being changed really does seem fragile to me - but to be fair, I'm not very familiar with CloudFormation.

We have near-term plans to move toward self-hosted installations, and this diverges from that work. This may be fine; it just means we will likely end up with a separate tool (which, to be fair, may have happened anyway).

kube-aws could generate multiple cloudformation stacks in the case that you want a larger etcd cluster. I'm still unclear what we gain from trying to force this all into a single deployment (when we have to work around the fact that it doesn't actually work natively). The single master likely works well for a lot of people (kube-up, gke, kops, kube-aws (until now)).

I also still question that in-place updates cannot occur - can you remind me what the issue is here? Something about EBS volumes not being reattached after a CloudFormation update?
Also, fwiw, etcd access was always available to the worker nodes (for flannel) -- so the TLS stuff isn't just now necessary due to running multiple etcd nodes -- it's always been a need. But along those same lines: as of v1.4 (and self-hosted) we're moving toward no worker etcd access at all - so another temporary divergence here.
owner: root:root
content: |
  #!/bin/bash -e
  if [[ "$(wipefs -n -p $1 | grep ext4)" == "" ]];then
If you actually want to do this (it scares me), use full paths to binaries and long flag names.
I'm trying to get a less sketchy solution in place.
@@ -61,7 +61,7 @@ coreos:
     [Service]
     Type=oneshot
     RemainAfterExit=yes
-    ExecStart=/opt/bin/ext4-format-volume-once /dev/xvdf
+    ExecStart=/usr/sbin/mkfs.ext4 -p /dev/xvdf
@aaronlevy final answer- wdyt?
disregard my previous comment- after talking to crawford I've reverted to first checking with blkid, and then leaving the -p arg to mkfs for extra safety.
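A sketch of that final "format once" logic, transliterated to Go for clarity. This is a hypothetical illustration under stated assumptions (the blkid path and device name are assumed; the real unit just runs the shell commands directly, and -p is kept per the discussion above):

```go
package main

import (
	"log"
	"os/exec"
)

// formatVolumeOnce probes the device with blkid first and only runs mkfs.ext4
// when no filesystem signature is present, so an existing volume is never
// reformatted across reboots.
func formatVolumeOnce(device string) error {
	// blkid exits non-zero when it finds no recognizable filesystem signature.
	if err := exec.Command("/usr/sbin/blkid", device).Run(); err == nil {
		return nil // a filesystem already exists; leave it alone
	}
	out, err := exec.Command("/usr/sbin/mkfs.ext4", "-p", device).CombinedOutput()
	if err != nil {
		log.Printf("mkfs.ext4 output: %s", out)
	}
	return err
}

func main() {
	if err := formatVolumeOnce("/dev/xvdf"); err != nil {
		log.Fatal(err)
	}
}
```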
@colhom Sure thing. It sounds perfectly reasonable for kube-aws to focus on the needs of this particular cluster design, platform, and users. My notes just highlight conflicts between different cluster designs, but it sounds like that's a non-goal, so please disregard.
@aaronlevy any more review items? I've addressed the conditional
I do not wish to delay this feature, because I think the work is just fine. But maybe it would be better to spin up etcd as a separate cloudformation template - this is exactly what I've been doing and what's reflected in my PR (#629). I guess it kind of depends on whether you can live with having to spin up multiple cloudformation templates. I can :-)
@pieterlange I have experimented with the separate cf stack approach for etcd- there is too much added complexity in cross-correlating the resources and really not much to be gained (in my opinion). I've discussed with @aaronlevy, and this will not merge into master until after the v1.4 hyperkube and associated changes are in (weeks). To unblock further development in this area, we will be merging this PR, HA control plane, cluster upgrades, and @mumoshu's

I'll update the community once that branch is ready and passing conformance/e2e tests.
I am still wondering about this:

Also, can you expand on:

Is it not just etcd network addressability (hostnames/ips) that needs to be shared?
That's basically all I'm doing in the PoC PR. Obviously the situation is a little more hairy than that, and my PR is just a complete hack. But it worked fine for me.
If a CloudFormation update decides it needs to replace a machine, there is no way to ensure the new machine receives the same EBS volume as the old machine. So our single controller/EBS-backed etcd combo is non-updatable via CloudFormation. More generally, any machine running etcd is non-updatable, whether or not it's EBS backed. That's why this PR separates etcd out and quarantines the instances off from updates, until we can host etcd on Kubernetes.
Beyond what @pieterlange mentioned, additional points of difficulty:
In general, breaking the "everything is in this single cloudformation stack" invariant greatly changes the nature of kube-aws, and is out of scope for a problem that will be solved in a better way within the next 6 months (or so ™️). The hackery necessary to quarantine etcd instances from cloudformation updates boils down to masking updates to a few select subtrees of a cloudformation json document. It fits neatly within the confines of a single function which is referenced in one place. You can literally comment out the function and its one reference, and it's gone. \cc @aaronlevy
Against my best intentions, I feel like my comment has started to derail/undermine the good work being done here. I trust @colhom's judgment on this item. I suggest discussing the pros and cons of managing etcd under a separate cloudformation stack under my PR or a new issue, so we don't further pollute this thread. I like the idea of hosting etcd separately and robustly, but that's only because of 1 nasty prior experience.
We created our fork of kube-aws and merged this request. After ~3 months of experience with it, I have to admit that this is not the way to go in production. Because the 3 etcd nodes have static IPs, there is no option to update them without some manual interaction. Since there is no backup of etcd state to somewhere outside the cluster (for example S3), manual interactions are quite risky. I decided not to upgrade etcd nodes (by manually editing the generated stack template json file). As a result, controller and worker nodes in one of the clusters I manage are now running the latest CoreOS AMI with the Dirty COW bug fixed, but the etcd nodes run an old AMI. I believe the right approach is described in this article: https://crewjam.com/etcd-aws/, i.e. a separate ASG for etcd nodes and etcd state mirroring/backup to S3. In this case, etcd nodes can also be upgraded by running a cloudformation stack update without any manual interaction.
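For reference, a minimal sketch of the backup half of that approach: snapshot with etcd v2's `etcdctl backup` and ship the archive to S3 with aws-sdk-go. The bucket, key, and local paths are hypothetical, and a real setup would run this on a timer:

```go
package main

import (
	"log"
	"os"
	"os/exec"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	// Snapshot the etcd v2 data dir into a backup dir (paths are assumptions).
	if out, err := exec.Command("etcdctl", "backup",
		"--data-dir", "/var/lib/etcd2",
		"--backup-dir", "/tmp/etcd-backup").CombinedOutput(); err != nil {
		log.Fatalf("etcdctl backup failed: %v: %s", err, out)
	}
	if out, err := exec.Command("tar", "czf", "/tmp/etcd-backup.tar.gz",
		"-C", "/tmp", "etcd-backup").CombinedOutput(); err != nil {
		log.Fatalf("tar failed: %v: %s", err, out)
	}

	f, err := os.Open("/tmp/etcd-backup.tar.gz")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Upload to S3 so a wholly-failed cluster can be restored from outside.
	sess := session.Must(session.NewSession())
	uploader := s3manager.NewUploader(sess)
	if _, err := uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String("my-etcd-backups"), // hypothetical bucket
		Key:    aws.String("cluster1/etcd-backup.tar.gz"),
		Body:   f,
	}); err != nil {
		log.Fatal(err)
	}
	log.Println("backup uploaded")
}
```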
@dzavalkinolx I also forked
@dzavalkinolx @camilb I made a new issue for this at kubernetes-retired/kube-aws#27. I think we're not far off from a solution, but it's nice to hear everyone's input on this.
Work in progress- no review necessary yet (unless you're bored ;)