[docs] node affinity docs; 0.2.0 release prep
- Changelog entries for 0.2.0
- Docs on new features (tolerations, expanded node affinity)
schallert committed May 1, 2019
1 parent 795973f commit 8a0d867
Showing 10 changed files with 411 additions and 88 deletions.
55 changes: 52 additions & 3 deletions CHANGELOG.md
@@ -1,8 +1,42 @@
# Changelog

## 0.2.0

The theme of this release is usability improvements and more granular control over node placement.

Features such as specifying etcd endpoints directly on the cluster spec eliminate the need to provide a manual
configuration just to point M3DB at custom etcd endpoints. Per-cluster etcd environments allow users to colocate
multiple M3DB clusters on a single etcd cluster.
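
As a rough sketch of the first of these features (assuming the `etcdEndpoints` field added in [#99][99]; consult the
API docs for the authoritative field name and shape), pointing a cluster at an external etcd might look like:

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
metadata:
  name: cluster-a
spec:
  # Endpoints of an existing etcd cluster; with these set, a hand-written
  # configmap is no longer required just to point M3DB at custom etcd.
  # Endpoint addresses below are illustrative.
  etcdEndpoints:
    - http://etcd-0.etcd:2379
    - http://etcd-1.etcd:2379
    - http://etcd-2.etcd:2379
  ...
```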

Users can now specify more complex affinity terms, as well as taints that their cluster tolerates, allowing specific
nodes to be dedicated to M3DB. See the [affinity docs][affinity-docs] for more.

* [FEATURE] Allow specifying etcd endpoints on the M3DBCluster spec ([#99][99])
* [FEATURE] Allow specifying security contexts for M3DB pods ([#107][107])
* [FEATURE] Allow specifying tolerations of m3db pods ([#111][111])
* [FEATURE] Allow specifying pod priority classes ([#119][119])
* [FEATURE] Use a dedicated etcd-environment per-cluster to support sharing etcd clusters ([#99][99])
* [FEATURE] Support more granular node affinity per-isolation group ([#106][106]) ([#131][131])
* [ENHANCEMENT] Change default M3DB bootstrapper config to recover more easily when an entire cluster is taken down
([#112][112])
* [ENHANCEMENT] Build + release with Go 1.12 ([#114][114])
* [ENHANCEMENT] Continuously reconcile configmaps ([#118][118])
* [BUGFIX] Allow unknown protobuf fields to be unmarshalled ([#117][117])
* [BUGFIX] Fix pod removal when removing more than 1 pod at a time ([#125][125])

### Breaking Changes

0.2.0 changes how M3DB stores its cluster topology in etcd to allow for multiple M3DB clusters to share an etcd cluster.
A [migration script][etcd-migrate] is provided to copy etcd data from the old format to the new format. If migrating an
operated cluster, run that script (see the script for instructions) and then perform a rolling restart of your M3DB
pods by deleting them one at a time.

If using a custom configmap, this same change will require a modification to your configmap. See the
[warning][configmap-warning] in the docs about how to ensure your configmap is compatible.

## 0.1.4

* [ENHANCEMENT] Added the ability to use a specific StorageClass per-isolation group (StatefulSet) for clusters without
* [FEATURE] Added the ability to use a specific StorageClass per-isolation group (StatefulSet) for clusters without
topology aware volume provisioning ([#98][98])
* [BUGFIX] Fixed a bug where pods were incorrectly selected if the cluster had labels ([#100][100])

@@ -18,13 +52,28 @@

## 0.1.1

* TODO
* Fix helm manifests.

## 0.1.0

* TODO
* Initial release.

[affinity-docs]: https://operator.m3db.io/configuration/node_affinity/
[etcd-migrate]: https://github.com/m3db/m3db-operator/blob/master/scripts/migrate_etcd_0.1_0.2.sh
[configmap-warning]: https://operator.m3db.io/configuration/configuring_m3db/#environment-warning

[94]: https://github.com/m3db/m3db-operator/pull/94
[97]: https://github.com/m3db/m3db-operator/pull/97
[98]: https://github.com/m3db/m3db-operator/pull/98
[100]: https://github.com/m3db/m3db-operator/pull/100
[106]: https://github.com/m3db/m3db-operator/pull/106
[107]: https://github.com/m3db/m3db-operator/pull/107
[111]: https://github.com/m3db/m3db-operator/pull/111
[112]: https://github.com/m3db/m3db-operator/pull/112
[114]: https://github.com/m3db/m3db-operator/pull/114
[117]: https://github.com/m3db/m3db-operator/pull/117
[118]: https://github.com/m3db/m3db-operator/pull/118
[119]: https://github.com/m3db/m3db-operator/pull/119
[99]: https://github.com/m3db/m3db-operator/pull/99
[125]: https://github.com/m3db/m3db-operator/pull/125
[131]: https://github.com/m3db/m3db-operator/pull/131
2 changes: 1 addition & 1 deletion README.md
@@ -65,7 +65,7 @@ kubectl apply -f https://raw.githubusercontent.com/m3db/m3db-operator/v0.1.4/exa

Apply manifest with your zones specified for isolation groups:

```
```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
metadata:
5 changes: 2 additions & 3 deletions docs/Dockerfile
@@ -1,14 +1,13 @@
# Dockerfile for building docs is stored in a separate dir from the docs,
# otherwise the generated site will unnecessarily contain the Dockerfile

FROM python:3.5-alpine
FROM python:3.6-alpine3.9
LABEL maintainer="The M3DB Authors <m3db@googlegroups.com>"

WORKDIR /m3db
EXPOSE 8000

# mkdocs needs git-fast-import, which is stripped from the default git package
# to reduce image size
RUN pip install mkdocs==0.17.3 mkdocs-material==2.7.3 && \
RUN pip install mkdocs==0.17.3 mkdocs-material==2.7.3 Pygments>=2.2 pymdown-extensions>=4.11 && \
apk add --no-cache git-fast-import openssh-client
ENTRYPOINT [ "/bin/ash", "-c" ]
19 changes: 19 additions & 0 deletions docs/configuration/configuring_m3db.md
@@ -7,4 +7,23 @@ Prometheus reads/writes to the cluster. This template can be found
To apply a custom configuration for the M3DB cluster, one can set the `configMapName` parameter of the cluster [spec] to
an existing configmap.

## Environment Warning

If providing a custom config map, the `env` you specify in your [config][config] **must** be `$NAMESPACE/$NAME`, where
`$NAMESPACE` is the Kubernetes namespace your cluster is in and `$NAME` is the name of the cluster. For example, with
the following cluster:

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
metadata:
name: cluster-a
namespace: production
...
```

The value of `env` in your config **MUST** be `production/cluster-a`. This restriction allows multiple M3DB clusters to
safely share the same etcd cluster.
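
As a hypothetical excerpt, the etcd service section of a custom configmap for the cluster above might look like the
following sketch. The surrounding keys mirror the operator's default config of this release and may differ in your
version, so treat everything except the `env` value as illustrative:

```yaml
# Hypothetical excerpt of a custom M3DB configmap; only the env value is the point here.
db:
  config:
    service:
      env: production/cluster-a        # must be $NAMESPACE/$NAME
      zone: embedded
      service: m3db
      cacheDir: /var/lib/m3kv
      etcdClusters:
        - zone: embedded
          endpoints:
            # illustrative etcd endpoints; replace with your own
            - http://etcd-0.etcd:2379
            - http://etcd-1.etcd:2379
            - http://etcd-2.etcd:2379
```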

[spec]: ../api
[config]: https://github.com/m3db/m3db-operator/blob/795973f3329437ced3ac942da440810cd0865235/assets/default-config.yaml#L77
32 changes: 28 additions & 4 deletions docs/configuration/namespaces.md
@@ -12,7 +12,7 @@ Namespaces are configured as part of an `m3dbcluster` [spec][api-namespaces].

This preset will store metrics at 10 second resolution for 2 days. For example, in your cluster spec:

```
```yaml
spec:
...
namespaces:
@@ -24,7 +24,7 @@

This preset will store metrics at 1 minute resolution for 40 days.

```
```yaml
spec:
...
namespaces:
@@ -34,8 +34,32 @@

## Custom Namespaces

You can also define your own custom namespaces by setting the `NamespaceOptions` within a cluster spec. See the
[API][api-ns-options] for all the available fields.
You can also define your own custom namespaces by setting the `NamespaceOptions` within a cluster spec. The
[API][api-ns-options] lists all available fields. As an example, a namespace to store 7 days of data may look like:
```yaml
...
spec:
...
namespaces:
- name: custom-7d
options:
bootstrapEnabled: true
flushEnabled: true
writesToCommitLog: true
cleanupEnabled: true
snapshotEnabled: true
repairEnabled: false
retentionOptions:
retentionPeriodDuration: 168h
blockSizeDuration: 12h
bufferFutureDuration: 20m
bufferPastDuration: 20m
blockDataExpiry: true
blockDataExpiryAfterNotAccessPeriodDuration: 5m
indexOptions:
enabled: true
blockSizeDuration: 12h
```


[api-namespaces]: ../api#namespace
192 changes: 192 additions & 0 deletions docs/configuration/node_affinity.md
@@ -0,0 +1,192 @@
# Node Affinity & Cluster Topology

## Node Affinity

Kubernetes allows pods to be assigned to nodes based on various criteria through [node affinity][k8s-node-affinity].

M3DB was built with failure tolerance as a core feature. M3DB's [isolation groups][m3db-isogroups] allow replicas of a
shard to be placed across failure domains such that the loss of any single domain cannot cause the cluster to lose
quorum. More details on M3DB's resiliency can be found in the [deployment docs][m3db-deployment].

By leveraging Kubernetes' node affinity and M3DB's isolation groups, the operator can guarantee that M3DB pods are
distributed across failure domains. For example, in a Kubernetes cluster spread across 3 zones in a cloud region, the
`isolationGroups` config below would guarantee that no single zone failure could degrade the M3DB cluster.

M3DB is unaware of the underlying zone topology: it just views the isolation groups as `group1`, `group2`, `group3` in
its [placement][m3db-placement]. Thanks to the Kubernetes scheduler, however, these groups are actually scheduled across
separate failure domains.

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
...
spec:
replicationFactor: 3
isolationGroups:
- name: group1
numInstances: 3
nodeAffinityTerms:
- key: failure-domain.beta.kubernetes.io/zone
values:
- us-east1-b
- name: group2
numInstances: 3
nodeAffinityTerms:
- key: failure-domain.beta.kubernetes.io/zone
values:
- us-east1-c
- name: group3
numInstances: 3
nodeAffinityTerms:
- key: failure-domain.beta.kubernetes.io/zone
values:
- us-east1-d
```

## Tolerations

In addition to allowing pods to be assigned to certain nodes via node affinity, Kubernetes allows pods to be _repelled_
from nodes through [taints][k8s-taints] if they don't tolerate the taint. For example, the following config would ensure:

1. Pods are spread across zones.
2. Pods are only assigned to nodes in the `m3db-dedicated-pool` pool.
3. No other pods can be assigned to those nodes (assuming they were tainted with the `m3db-dedicated` taint).

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
...
spec:
replicationFactor: 3
isolationGroups:
- name: group1
numInstances: 3
nodeAffinityTerms:
- key: failure-domain.beta.kubernetes.io/zone
values:
- us-east1-b
- key: nodepool
values:
- m3db-dedicated-pool
- name: group2
numInstances: 3
nodeAffinityTerms:
- key: failure-domain.beta.kubernetes.io/zone
values:
- us-east1-c
- key: nodepool
values:
- m3db-dedicated-pool
- name: group3
numInstances: 3
nodeAffinityTerms:
- key: failure-domain.beta.kubernetes.io/zone
values:
- us-east1-d
- key: nodepool
values:
- m3db-dedicated-pool
tolerations:
- key: m3db-dedicated
effect: NoSchedule
operator: Exists
```
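
For completeness, nodes in the dedicated pool would carry a matching taint and the labels selected by the affinity
terms above. The following is a hypothetical node manifest excerpt (node name and labels are illustrative; in practice
the taint is usually applied by your cloud provider's node-pool tooling or `kubectl taint`):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: m3db-dedicated-node-1                      # hypothetical node name
  labels:
    failure-domain.beta.kubernetes.io/zone: us-east1-b
    nodepool: m3db-dedicated-pool                  # matched by the nodeAffinityTerms above
spec:
  taints:
    # Repels all pods that do not tolerate this taint; the toleration above uses
    # operator Exists, so it matches this key regardless of the taint's value.
    - key: m3db-dedicated
      effect: NoSchedule
```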

## Example Affinity Configurations

### Zonal Cluster

The examples so far have focused on multi-zone Kubernetes clusters. Some users may only have a cluster in a single zone
and accept the reduced fault tolerance. The following configuration shows how to configure the operator in a zonal
cluster.

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
...
spec:
replicationFactor: 3
isolationGroups:
- name: group1
numInstances: 3
nodeAffinityTerms:
- key: failure-domain.beta.kubernetes.io/zone
values:
- us-east1-b
- name: group2
numInstances: 3
nodeAffinityTerms:
- key: failure-domain.beta.kubernetes.io/zone
values:
- us-east1-b
- name: group3
numInstances: 3
nodeAffinityTerms:
- key: failure-domain.beta.kubernetes.io/zone
values:
- us-east1-b
```

### 6 Zone Cluster

In the above examples we created clusters with 1 isolation group in each of 3 zones. Because `values` within a single
[NodeAffinityTerm][node-affinity-term] are OR'd, we can also spread an isolation group across multiple zones. For
example, if we had 6 zones available to us:

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
...
spec:
replicationFactor: 3
isolationGroups:
- name: group1
numInstances: 3
nodeAffinityTerms:
- key: failure-domain.beta.kubernetes.io/zone
values:
- us-east1-a
- us-east1-b
- name: group2
numInstances: 3
nodeAffinityTerms:
- key: failure-domain.beta.kubernetes.io/zone
values:
- us-east1-c
- us-east1-d
- name: group3
numInstances: 3
nodeAffinityTerms:
- key: failure-domain.beta.kubernetes.io/zone
values:
- us-east1-e
- us-east1-f
```

### No Affinity

If no failure domains are available, one can create a cluster with no affinity, in which case pods will be scheduled as Kubernetes would place them by default:

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
...
spec:
replicationFactor: 3
isolationGroups:
- name: group1
numInstances: 3
- name: group2
numInstances: 3
- name: group3
numInstances: 3
```

[k8s-node-affinity]: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
[k8s-taints]: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
[m3db-deployment]: https://docs.m3db.io/operational_guide/replication_and_deployment_in_zones/
[m3db-isogroups]: https://docs.m3db.io/operational_guide/placement_configuration/#isolation-group
[m3db-placement]: https://docs.m3db.io/operational_guide/placement/
[node-affinity-term]: ../api/#nodeaffinityterm
