# [docs] node affinity docs; 0.2.0 release prep #133

Merged (3 commits) on May 13, 2019
59 changes: 54 additions & 5 deletions CHANGELOG.md
@@ -1,8 +1,42 @@
# Changelog

## 0.2.0

The theme of this release is usability improvements and more granular control over node placement.

Features such as specifying etcd endpoints directly on the cluster spec eliminate the need to provide a manual
configuration for custom etcd endpoints. Per-cluster etcd environments allow users to colocate multiple M3DB
clusters on a single etcd cluster.
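
As a minimal sketch (assuming the `etcdEndpoints` field added in [#99][99]; the cluster name and endpoint addresses
are placeholders), pointing a cluster at an external etcd cluster might look like:

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
metadata:
  name: cluster-a              # illustrative cluster name
spec:
  etcdEndpoints:               # assumed field name from #99; endpoints below are placeholders
  - http://etcd-0.etcd:2379
  - http://etcd-1.etcd:2379
  - http://etcd-2.etcd:2379
  ...
```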

Users can now specify more complex affinity terms, as well as taints that their cluster tolerates, making it possible
to dedicate specific nodes to M3DB. See the [affinity docs][affinity-docs] for more.

* [FEATURE] Allow specifying etcd endpoints on the M3DBCluster spec ([#99][99])
* [FEATURE] Allow specifying security contexts for M3DB pods ([#107][107])
* [FEATURE] Allow specifying tolerations of M3DB pods ([#111][111])
* [FEATURE] Allow specifying pod priority classes ([#119][119])
* [FEATURE] Use a dedicated etcd-environment per-cluster to support sharing etcd clusters ([#99][99])
* [FEATURE] Support more granular node affinity per-isolation group ([#106][106]) ([#131][131])
* [ENHANCEMENT] Change default M3DB bootstrapper config to recover more easily when an entire cluster is taken down
([#112][112])
* [ENHANCEMENT] Build + release with Go 1.12 ([#114][114])
* [ENHANCEMENT] Continuously reconcile configmaps ([#118][118])
* [BUGFIX] Allow unknown protobuf fields to be unmarshalled ([#117][117])
* [BUGFIX] Fix pod removal when removing more than 1 pod at a time ([#125][125])

### Breaking Changes

0.2.0 changes how M3DB stores its cluster topology in etcd so that multiple M3DB clusters can share a single etcd
cluster. A [migration script][etcd-migrate] is provided to copy etcd data from the old format to the new format. If
migrating an operated cluster, run that script (see the script for instructions) and then perform a rolling restart of
your M3DB pods by deleting them one at a time.

If using a custom configmap, this same change will require a modification to your configmap. See the
[warning][configmap-warning] in the docs about how to ensure your configmap is compatible.

## 0.1.4

* [ENHANCEMENT] Added the ability to use a specific StorageClass per-isolation group (StatefulSet) for clusters without
* [FEATURE] Added the ability to use a specific StorageClass per-isolation group (StatefulSet) for clusters without
topology aware volume provisioning ([#98][98])
* [BUGFIX] Fixed a bug where pods were incorrectly selected if the cluster had labels ([#100][100])

@@ -13,18 +47,33 @@

## 0.1.2

* Update default cluster ConfigMap to include parameters required by latest m3db.
* Update default cluster ConfigMap to include parameters required by latest M3DB.
* Add event `patch` permission to default RBAC role.

## 0.1.0
## 0.1.1

* TODO
* Fix helm manifests.
Review comment (Collaborator): Should the header for this be 0.1.1?


## 0.1.0

* TODO
* Initial release.

[affinity-docs]: https://operator.m3db.io/configuration/node_affinity/
[etcd-migrate]: https://github.com/m3db/m3db-operator/blob/master/scripts/migrate_etcd_0.1_0.2.sh
[configmap-warning]: https://operator.m3db.io/configuration/configuring_m3db/#environment-warning

[94]: https://github.com/m3db/m3db-operator/pull/94
[97]: https://github.com/m3db/m3db-operator/pull/97
[98]: https://github.com/m3db/m3db-operator/pull/98
[99]: https://github.com/m3db/m3db-operator/pull/99
[100]: https://github.com/m3db/m3db-operator/pull/100
[106]: https://github.com/m3db/m3db-operator/pull/106
[107]: https://github.com/m3db/m3db-operator/pull/107
[111]: https://github.com/m3db/m3db-operator/pull/111
[112]: https://github.com/m3db/m3db-operator/pull/112
[114]: https://github.com/m3db/m3db-operator/pull/114
[117]: https://github.com/m3db/m3db-operator/pull/117
[118]: https://github.com/m3db/m3db-operator/pull/118
[119]: https://github.com/m3db/m3db-operator/pull/119
[125]: https://github.com/m3db/m3db-operator/pull/125
[131]: https://github.com/m3db/m3db-operator/pull/131
2 changes: 1 addition & 1 deletion README.md
@@ -65,7 +65,7 @@ kubectl apply -f https://raw.githubusercontent.com/m3db/m3db-operator/v0.1.4/exa

Apply manifest with your zones specified for isolation groups:

```
```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
metadata:
5 changes: 2 additions & 3 deletions docs/Dockerfile
@@ -1,14 +1,13 @@
# Dockerfile for building docs is stored in a separate dir from the docs,
# otherwise the generated site will unnecessarily contain the Dockerfile

FROM python:3.5-alpine
FROM python:3.6-alpine3.9
LABEL maintainer="The M3DB Authors <m3db@googlegroups.com>"

WORKDIR /m3db
EXPOSE 8000

# mkdocs needs git-fast-import which was stripped from the default git package
# by default to reduce size
RUN pip install mkdocs==0.17.3 mkdocs-material==2.7.3 && \
RUN pip install mkdocs==0.17.3 mkdocs-material==2.7.3 Pygments>=2.2 pymdown-extensions>=4.11 && \
apk add --no-cache git-fast-import openssh-client
ENTRYPOINT [ "/bin/ash", "-c" ]
19 changes: 19 additions & 0 deletions docs/configuration/configuring_m3db.md
@@ -7,4 +7,23 @@ Prometheus reads/writes to the cluster. This template can be found
To apply a custom configuration for the M3DB cluster, one can set the `configMapName` parameter of the cluster [spec] to
an existing configmap.
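
As a minimal sketch (the configmap name `my-m3db-config` is hypothetical), referencing an existing configmap from the
cluster spec might look like:

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
metadata:
  name: cluster-a
  namespace: production
spec:
  configMapName: my-m3db-config  # hypothetical name of an existing configmap
  ...
```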

## Environment Warning

If providing a custom configmap, the `env` you specify in your [config][config] **must** be `$NAMESPACE/$NAME`, where
`$NAMESPACE` is the Kubernetes namespace your cluster is in and `$NAME` is the name of the cluster. For example, with
the following cluster:

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
metadata:
  name: cluster-a
  namespace: production
...
```

The value of `env` in your config **MUST** be `production/cluster-a`. This restriction allows multiple M3DB clusters to
safely share the same etcd cluster.
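
As a sketch, the relevant portion of a custom config for this cluster might look like the following (structure
abbreviated from the linked [default config][config]; the exact nesting and etcd endpoints shown here are assumptions
for illustration):

```yaml
db:
  config:
    service:
      env: production/cluster-a      # must be $NAMESPACE/$NAME for this cluster
      zone: embedded
      service: m3db
      etcdClusters:
      - zone: embedded
        endpoints:
        - http://etcd-0.etcd:2379    # placeholder etcd endpoints
        - http://etcd-1.etcd:2379
        - http://etcd-2.etcd:2379
```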

[spec]: ../api
[config]: https://github.com/m3db/m3db-operator/blob/795973f3329437ced3ac942da440810cd0865235/assets/default-config.yaml#L77
32 changes: 28 additions & 4 deletions docs/configuration/namespaces.md
@@ -12,7 +12,7 @@ Namespaces are configured as part of an `m3dbcluster` [spec][api-namespaces].

This preset will store metrics at 10 second resolution for 2 days. For example, in your cluster spec:

```
```yaml
spec:
...
  namespaces:
@@ -24,7 +24,7 @@ spec:

This preset will store metrics at 1 minute resolution for 40 days.

```
```yaml
spec:
...
  namespaces:
@@ -34,8 +34,32 @@ spec:

## Custom Namespaces

You can also define your own custom namespaces by setting the `NamespaceOptions` within a cluster spec. See the
[API][api-ns-options] for all the available fields.
You can also define your own custom namespaces by setting the `NamespaceOptions` within a cluster spec. The
[API][api-ns-options] lists all available fields. As an example, a namespace to store 7 days of data may look like:
```yaml
...
spec:
...
  namespaces:
  - name: custom-7d
    options:
      bootstrapEnabled: true
      flushEnabled: true
      writesToCommitLog: true
      cleanupEnabled: true
      snapshotEnabled: true
      repairEnabled: false
      retentionOptions:
        retentionPeriodDuration: 168h
        blockSizeDuration: 12h
        bufferFutureDuration: 20m
        bufferPastDuration: 20m
        blockDataExpiry: true
        blockDataExpiryAfterNotAccessPeriodDuration: 5m
      indexOptions:
        enabled: true
        blockSizeDuration: 12h
```


[api-namespaces]: ../api#namespace
192 changes: 192 additions & 0 deletions docs/configuration/node_affinity.md
@@ -0,0 +1,192 @@
# Node Affinity & Cluster Topology

## Node Affinity

Kubernetes allows pods to be assigned to nodes based on various criteria through [node affinity][k8s-node-affinity].

M3DB was built with failure tolerance as a core feature. M3DB's [isolation groups][m3db-isogroups] allow shards to be
placed across failure domains such that the loss of any single domain cannot cause the cluster to lose quorum. More details
on M3DB's resiliency can be found in the [deployment docs][m3db-deployment].

By leveraging Kubernetes' node affinity and M3DB's isolation groups, the operator can guarantee that M3DB pods are
distributed across failure domains. For example, in a Kubernetes cluster spread across 3 zones in a cloud region, the
`isolationGroups` configuration below would guarantee that no single zone failure could degrade the M3DB cluster.

M3DB is unaware of the underlying zone topology: it just views the isolation groups as `group1`, `group2`, `group3` in
its [placement][m3db-placement]. Thanks to the Kubernetes scheduler, however, these groups are actually scheduled across
separate failure domains.

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
...
spec:
  replicationFactor: 3
  isolationGroups:
  - name: group1
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-b
  - name: group2
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-c
  - name: group3
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-d
```

## Tolerations

In addition to allowing pods to be assigned to certain nodes via node affinity, Kubernetes allows pods to be _repelled_
from nodes through [taints][k8s-taints] unless they tolerate those taints. For example, the following config would ensure:

1. Pods are spread across zones.

2. Pods are only assigned to nodes in the `m3db-dedicated-pool` pool.

3. No other pods can be scheduled onto those nodes (assuming the nodes carry the `m3db-dedicated` taint, which is
   independent of the `m3db-dedicated-pool` node pool label; see the node sketch after the example below).

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
...
spec:
  replicationFactor: 3
  isolationGroups:
  - name: group1
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-b
    - key: nodepool
      values:
      - m3db-dedicated-pool
  - name: group2
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-c
    - key: nodepool
      values:
      - m3db-dedicated-pool
  - name: group3
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-d
    - key: nodepool
      values:
      - m3db-dedicated-pool
  tolerations:
  - key: m3db-dedicated
    # Review comment (Collaborator): Should this be m3db-dedicated-pool, or are
    # m3db-dedicated and m3db-dedicated-pool somehow related?
    # Reply (Collaborator Author): m3db-dedicated-pool (the node pool) and m3db-dedicated
    # (the taint) are two separate things; they could be named differently (even foo and bar).
    # Maybe I should make them even more different to clarify that, or just mention it.
    effect: NoSchedule
    operator: Exists
```
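
For reference, a node in the dedicated pool would carry both the `nodepool: m3db-dedicated-pool` label (matched by the
`nodeAffinityTerms` above) and the `m3db-dedicated` taint (matched by the toleration); the two names are independent of
each other. A hypothetical node might look like:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: node-1                                         # hypothetical node name
  labels:
    nodepool: m3db-dedicated-pool                      # matched by the nodeAffinityTerms
    failure-domain.beta.kubernetes.io/zone: us-east1-b
spec:
  taints:
  - key: m3db-dedicated                                # matched by the toleration
    effect: NoSchedule
```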

## Example Affinity Configurations

### Zonal Cluster

The examples so far have focused on multi-zone Kubernetes clusters. Some users may only have a cluster in a single zone
and accept the reduced fault tolerance. The following example shows how to configure the operator for such a
single-zone cluster.

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
...
spec:
  replicationFactor: 3
  isolationGroups:
  - name: group1
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-b
  - name: group2
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-b
  - name: group3
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-b
```

### 6 Zone Cluster

In the above examples we created clusters with 1 isolation group in each of 3 zones. Because `values` within a single
[NodeAffinityTerm][node-affinity-term] are OR'd, we can also spread an isolation group across multiple zones. For
example, if we had 6 zones available to us:

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
...
spec:
  replicationFactor: 3
  isolationGroups:
  - name: group1
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-a
      - us-east1-b
  - name: group2
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-c
      - us-east1-d
  - name: group3
    numInstances: 3
    nodeAffinityTerms:
    - key: failure-domain.beta.kubernetes.io/zone
      values:
      - us-east1-e
      - us-east1-f
```

### No Affinity

If no failure domains are available, one can create a cluster with no affinity rules; pods will then be scheduled
however Kubernetes would place them by default:

```yaml
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
...
spec:
  replicationFactor: 3
  isolationGroups:
  - name: group1
    numInstances: 3
  - name: group2
    numInstances: 3
  - name: group3
    numInstances: 3
```

[k8s-node-affinity]: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#affinity-and-anti-affinity
[k8s-taints]: https://kubernetes.io/docs/concepts/configuration/taint-and-toleration/
[m3db-deployment]: https://docs.m3db.io/operational_guide/replication_and_deployment_in_zones/
[m3db-isogroups]: https://docs.m3db.io/operational_guide/placement_configuration/#isolation-group
[m3db-placement]: https://docs.m3db.io/operational_guide/placement/
[node-affinity-term]: ../api/#nodeaffinityterm