[docs] node affinity docs; 0.2.0 release prep #133
```diff
@@ -1,14 +1,13 @@
 # Dockerfile for building docs is stored in a separate dir from the docs,
 # otherwise the generated site will unnecessarily contain the Dockerfile

-FROM python:3.5-alpine
+FROM python:3.6-alpine3.9
 LABEL maintainer="The M3DB Authors <m3db@googlegroups.com>"

 WORKDIR /m3db
 EXPOSE 8000

 # mkdocs needs git-fast-import which was stripped from the default git package
 # by default to reduce size
-RUN pip install mkdocs==0.17.3 mkdocs-material==2.7.3 && \
+RUN pip install mkdocs==0.17.3 mkdocs-material==2.7.3 Pygments>=2.2 pymdown-extensions>=4.11 && \
     apk add --no-cache git-fast-import openssh-client
 ENTRYPOINT [ "/bin/ash", "-c" ]
```
@@ -0,0 +1,192 @@
# Node Affinity & Cluster Topology

## Node Affinity

Kubernetes allows pods to be assigned to nodes based on various criteria through [node affinity][k8s-node-affinity].

M3DB was built with failure tolerance as a core feature. M3DB's [isolation groups][m3db-isogroups] allow shards to be
placed across failure domains such that the loss of no single domain can cause the cluster to lose quorum. More details
on M3DB's resiliency can be found in the [deployment docs][m3db-deployment].

By leveraging Kubernetes' node affinity and M3DB's isolation groups, the operator can guarantee that M3DB pods are
distributed across failure domains. For example, in a Kubernetes cluster spread across 3 zones in a cloud region, the
`isolationGroups` configuration below would guarantee that no single zone failure could degrade the M3DB cluster.

M3DB is unaware of the underlying zone topology: it just views the isolation groups as `group1`, `group2`, `group3` in
its [placement][m3db-placement]. Thanks to the Kubernetes scheduler, however, these groups are actually scheduled across
separate failure domains.

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
...
spec:
  replicationFactor: 3
  isolationGroups:
  - name: group1
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-b
  - name: group2
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-c
  - name: group3
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-d
```
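
Each group's `nodeAffinityTerms` are turned into node affinity on the pods of that group. As a rough, illustrative
sketch (not the operator's literal output), the `group1` entry above constrains its pods approximately like the
following `affinity` stanza of a plain Kubernetes pod spec:

```yaml
# Illustrative sketch only: roughly how the group1 term above maps onto
# standard Kubernetes node affinity; the operator manages the real pod specs.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: failure-domain.beta.kubernetes.io/zone
          operator: In
          values:
          - us-east1-b
```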

## Tolerations

In addition to allowing pods to be assigned to certain nodes via node affinity, Kubernetes allows pods to be _repelled_
from nodes through [taints][k8s-taints] if they don't tolerate the taint. For example, the following config would ensure:

1. Pods are spread across zones.

2. Pods are only assigned to nodes in the `m3db-dedicated-pool` pool.

3. No other pods could be assigned to those nodes (assuming they were tainted with a taint using the key
   `m3db-dedicated`; see the node sketch after the example below).

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
...
spec:
  replicationFactor: 3
  isolationGroups:
  - name: group1
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-b
    - key: nodepool
      values:
      - m3db-dedicated-pool
  - name: group2
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-c
    - key: nodepool
      values:
      - m3db-dedicated-pool
  - name: group3
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-d
    - key: nodepool
      values:
      - m3db-dedicated-pool
  tolerations:
  - key: m3db-dedicated
    effect: NoSchedule
    operator: Exists
```
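
For the toleration above to have any effect, the nodes in the dedicated pool must carry a matching taint (and, for the
`nodepool` affinity term, a matching label). Taints are normally applied with `kubectl taint`, but expressed as part of
a Node manifest the relevant pieces would look roughly like the sketch below; the node name is hypothetical:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: m3db-dedicated-node-1       # hypothetical node name
  labels:
    nodepool: m3db-dedicated-pool   # matched by the nodepool nodeAffinityTerms above
spec:
  taints:
  - key: m3db-dedicated             # matched by the toleration above
    effect: NoSchedule
```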

## Example Affinity Configurations

### Zonal Cluster

The examples so far have focused on multi-zone Kubernetes clusters. Some users may only have a cluster in a single zone
and accept the reduced fault tolerance. The following configuration shows how to configure the operator in a zonal
cluster.

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
...
spec:
  replicationFactor: 3
  isolationGroups:
  - name: group1
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-b
  - name: group2
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-b
  - name: group3
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-b
```

### 6 Zone Cluster

In the above examples we created clusters with 1 isolation group in each of 3 zones. Because `values` within a single
[NodeAffinityTerm][node-affinity-term] are OR'd, we can also spread an isolation group across multiple zones. For
example, if we had 6 zones available to us:

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
...
spec:
  replicationFactor: 3
  isolationGroups:
  - name: group1
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-a
      - us-east1-b
  - name: group2
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-c
      - us-east1-d
  - name: group3
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-e
      - us-east1-f
```

### No Affinity

If there are no failure domains available, one can create a cluster with no affinity terms, in which case the pods will be scheduled just as Kubernetes would place them by default:

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
...
spec:
  replicationFactor: 3
  isolationGroups:
  - name: group1
    numInstances: 3
  - name: group2
    numInstances: 3
  - name: group3
    numInstances: 3
```

[k8s-node-affinity]: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
[k8s-taints]: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
[m3db-deployment]: https://docs.m3db.io/operational_guide/replication_and_deployment_in_zones/
[m3db-isogroups]: https://docs.m3db.io/operational_guide/placement_configuration/#isolation-group
[m3db-placement]: https://docs.m3db.io/operational_guide/placement/
[node-affinity-term]: ../api/#nodeaffinityterm