From 621774a84fcd5f0870b5999d045015a8e571de68 Mon Sep 17 00:00:00 2001 From: David Turner Date: Mon, 22 Oct 2018 16:42:48 +0100 Subject: [PATCH 001/106] Add some docs on cluster coordination --- docs/reference/modules.asciidoc | 6 + docs/reference/modules/coordination.asciidoc | 461 +++++++++++++++++++ 2 files changed, 467 insertions(+) create mode 100644 docs/reference/modules/coordination.asciidoc diff --git a/docs/reference/modules.asciidoc b/docs/reference/modules.asciidoc index 2346fdb4c2b01..f7b8d69338894 100644 --- a/docs/reference/modules.asciidoc +++ b/docs/reference/modules.asciidoc @@ -26,6 +26,10 @@ The modules in this section are: How nodes discover each other to form a cluster. +<>:: + + How the cluster elects a master node and manages the cluster state + <>:: How many nodes need to join the cluster before recovery can start. @@ -85,6 +89,8 @@ include::modules/cluster.asciidoc[] include::modules/discovery.asciidoc[] +include::modules/coordination.asciidoc[] + include::modules/gateway.asciidoc[] include::modules/http.asciidoc[] diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc new file mode 100644 index 0000000000000..d8fc1172647d7 --- /dev/null +++ b/docs/reference/modules/coordination.asciidoc @@ -0,0 +1,461 @@ +[[modules-cluster-coordination]] +== Cluster coordination + +The cluster coordination module is responsible for electing a master node and +managing changes to the cluster state. + +[float] +=== Quorum-based decision making + +Electing a master node and changing the cluster state both work robustly by +using multiple nodes, only considering each action to have succeeded on receipt +of responses from a majority of the master-eligible nodes in the cluster. The +advantage of requiring a majority of nodes to respond is that it allows for +nearly half of the master-eligible nodes to fail without preventing the cluster +from making progress, but also does not allow the cluster to "split brain", +i.e. to be partitioned into two pieces each of which may make decisions that +are inconsistent with those of the other piece. + +Elasticsearch allows you to add and remove master-eligible nodes to a running +cluster. In many cases you can do this simply by starting or stopping the nodes +as required, as described in more detail below. As nodes are added or removed +Elasticsearch maintains an optimal level of fault tolerance by updating the +cluster's _configuration_, which is the set of master-eligible nodes whose +responses are counted when making decisions such as electing a new master or +committing a new cluster state. A decision is only made once more than half of +the nodes _in the configuration_ have responded. Usually the cluster +configuration is the same as the set of all the master-eligible nodes that are +currently in the cluster, but in some situations they may be different. As long +as more than half of the nodes in the configuration are still healthy then the +cluster can still make progress. The configuration is managed automatically by +Elasticsearch and stored in the cluster state so you can inspect its current +contents by (TODO API call?). + +The way that the configuration is managed is controlled by the following +settings. (TODO maybe not settings?) + +`cluster.master_nodes_failure_tolerance`:: + + Sets the number of master-eligible nodes whose simultaneous failure the + cluster should be able to tolerate. This imposes a lower bound on the size + of the configuration. 
Elasticsearch will not remove nodes from the + configuration if their removal would break this bound, and will not permit + this bound to be increased to a value that is too large for the current + configuration. + +The relationship between the number of master-eligible nodes in your cluster, +the size of a majority, and the appropriate value for the +`cluster.master_nodes_failure_tolerance` setting is shown below. + +[cols="<,<,<",options="header",] +|======================================================================================= +|Number of master-eligible nodes |Majority size |`cluster.master_nodes_failure_tolerance` +|1 |1 |0 +|2 |2 |0 +|**3 (recommended)** |**2** |**1** +|4 |3 |1 +|5 |3 |2 +|6 |4 |2 +|7 |4 |3 +|======================================================================================= + +The minimum configuration size is `2 * cluster.master_nodes_failure_tolerance + 1`: + +[cols="<,<",options="header",] +|==================================================================== +|`cluster.master_nodes_failure_tolerance` |Minimum configuration size +|0 |1 +|**1** |**3** +|2 |5 +|3 |7 +|==================================================================== + +It is permissible, but not recommended, to set +`cluster.master_nodes_failure_tolerance` too low for your cluster or, put +differently, to have more master-eligible nodes than the minimum configuration +size. To do so is _safe_ in the sense that the cluster will not suffer from a +split-brain however this setting is configured, but if +`cluster.master_nodes_failure_tolerance` is too low then your cluster may not +tolerate as many failures as expected. + +[float] +==== Even numbers of master-eligible nodes + +There should normally be an odd number of master-eligible nodes in a cluster. +If there is an even number then Elasticsearch will put all but one of them into +the configuration to ensure that the configuration has an odd size. This does +not decrease the failure-tolerance of the cluster, and in fact improves it +slightly: if the cluster is partitioned into two even halves then one of the +halves will contain a majority of the masters and will be able to keep +operating, whereas if all of the master-eligible nodes' votes were counted then +neither side could make any progress in this situation. + +[float] +==== Adding master-eligible nodes + +If you wish to add some master-eligible nodes to your cluster, simply configure +the new nodes to find the existing cluster and start them up. Once the new +nodes have joined the cluster, you may be able to increase the +`cluster.master_nodes_failure_tolerance` setting to match. (TODO do we log +info/warn messages about this?) + +[float] +==== Removing master-eligible nodes + +If you wish to remove some of the master-eligible nodes in your cluster, you +must first reduce the `cluster.master_nodes_failure_tolerance` setting to match +the target cluster size before removing the extraneous nodes. This temporary +situation is the only case in which `cluster.master_nodes_failure_tolerance` +should be set lower than the recommended values above. + +You must also be careful not to remove too many master-eligible nodes all at +the same time. For instance, if you currently have seven master-eligible nodes +and you wish to reduce this to three, you cannot simply stop four of the nodes +all at the same time: to do so would leave only three nodes remaining, which is +less than half of the cluster, which means it cannot take any further actions. 
+You should remove the nodes one-at-a-time and verify that each node has been +removed from the configuration before moving onto the next one, using the +`await_removal` API: + +[source,js] +-------------------------------------------------- +# Explicit timeout of one minute +GET /_nodes/node_name/await_removal?timeout=1m +# Default timeout of 30 seconds +GET /_nodes/node_name/await_removal +-------------------------------------------------- +// CONSOLE + +The node (or nodes) for whose removal to wait are specified using +<> in place of `node_name` here. + +A special case of this is the case where there are only two master-eligible +nodes and you wish to remove one of them. In this case neither node can be +safely shut down since both nodes are required to reliably make progress, so +you must first explicitly _retire_ one of the nodes. A retired node still works +normally, but Elasticsearch will try and transfer it out of the current +configuration so its vote is no longer required, and will never move a retired +node back into the configuration. Once a node has been retired and is removed +from the configuration, it is safe to shut it down. A node can be retired using +the retirement API: + +[source,js] +-------------------------------------------------- +# Retire node and wait for its removal up to the default timeout of 30 seconds +POST /_nodes/node_name/retire +# Retire node and wait for its removal up to one minute +POST /_nodes/node_name/retire?timeout=1m +-------------------------------------------------- +// CONSOLE + +The node to retire is specified using <> in place +of `node_name` here. If a call to the retirement API fails then the call can +safely be retried. However if a retirement fails then it's possible the node +cannot be removed from the configuration due to the +`cluster.master_nodes_failure_tolerance` setting, so verify that this is set +correctly first. A successful response guarantees that the node has been +removed from the configuration and will not be reinstated. + +A node (or nodes) can be brought back out of retirement using the `unretire` +API: + +[source,js] +-------------------------------------------------- +POST /_nodes/node_name/unretire +-------------------------------------------------- +// CONSOLE + +The node (or nodes) to reinstate are specified using <> in place of `node_name` here. After being brought back out of +retirement they may not immediately be added to the configuration. + +[float] +=== Cluster bootstrapping + +A major risk when starting up a brand-new cluster is that you accidentally form +two separate clusters instead of one. This could lead to data loss: you might +start using both clusters before noticing that anything had gone wrong, and it +might then be impossible to merge them together later. + +To illustrate how this could happen, imagine starting up a three-node cluster +in which each node knows that it is going to be part of a three-node cluster. A +majority of three nodes is two, so normally the first two nodes to discover +each other will form a cluster and the third node will join them a short time +later. However, imagine that four nodes were accidentally started instead of +three: in this case there are enough nodes to form two separate clusters. Of +course if each node is started manually then it's unlikely that too many nodes +are started, but it's certainly possible to get into this situation if using a +more automated orchestrator, particularly if a network partition happens at the +wrong time. 
+ +We avoid this by requiring a separate _cluster bootstrap_ process to take place +on every brand-new cluster. This is only required the first time the whole +cluster starts up: new nodes joining an established cluster can safely obtain +all the information they need from the elected master, and nodes that have +previously been part of a cluster will have stored to disk all the information +required when restarting. + +The simplest way to bootstrap a cluster is to use the +`elasticsearch-bootstrap-cluster` command-line tool: + +[source,txt] +-------------------------------------------------- +$ bin/elasticsearch-bootstrap-cluster --failure-tolerance 1 \ + --node http://10.0.12.1:9200/ --node http://10.0.13.1:9200/ \ + --node https://10.0.14.1:9200/ +-------------------------------------------------- + +The arguments to this tool are the target failure tolerance of the cluster and +the addresses of (some of) its master-eligible nodes. + +If it is not possible to use this tool, you can also bootstrap the cluster via +the API as described here. There are two steps to the bootstrapping process. +Firstly, after all the nodes have started up, created their persistent node +IDs, and discovered each other, the first step is to request a bootstrap +document: + +[source,js] +-------------------------------------------------- +# Return the current bootstrap document immediately +GET /_cluster/bootstrap +# Wait until the node has discovered at least 3 nodes, or 60 seconds has elapsed, +# and then return the bootstrap document +GET /_cluster/bootstrap?wait_for_nodes=3&timeout=60s +-------------------------------------------------- +// CONSOLE + +The boostrap document contains information that the cluster needs to start up, +and looks like the following. + +[source,js] +-------------------------------------------------- +{ + "master_nodes_failure_tolerance": 1, + "master_nodes":[ + {"id":"USpTGYaBSIKbgSUJR2Z9lg"}, + {"id":"gSUJR2Z9lgUSpTGYaBSIKb"}, + {"id":"2Z9lgUSpTgSUYaBSIKbJRG"} + ] +} +-------------------------------------------------- + +It is safe to repeatedly call `GET /_cluster/bootstrap`, and to call it on +different nodes concurrently. This API will yield an error if the receiving +node has already been bootstrapped or has joined an existing cluster. + +Once a bootstrap document has been received, it must then be sent back to the +cluster to finish the bootstrapping process as follows: + +[source,js] +-------------------------------------------------- +# send the bootstrap document back to the cluster +POST /_cluster/bootstrap +{ + "master_nodes_failure_tolerance": 1, + "master_nodes":[ + {"id":"USpTGYaBSIKbgSUJR2Z9lg"}, + {"id":"gSUJR2Z9lgUSpTGYaBSIKb"}, + {"id":"2Z9lgUSpTgSUYaBSIKbJRG"} + ] +} +-------------------------------------------------- +// CONSOLE + +It is safe to repeatedly call `POST /_cluster/bootstrap`, and to call it on +different nodes concurrently, but **it is vitally important** to use the same +bootstrap document in each call. + +It is also possible to select the initial set of nodes in terms of their names +rather than their IDs as follows. 
+ +[source,js] +-------------------------------------------------- +# send the bootstrap document back to the cluster +POST /_cluster/bootstrap +{ + "master_nodes_failure_tolerance": 1, + "master_nodes":[ + {"name":"master-a"}, + {"name":"master-b"}, + {"name":"master-c"} + ] +} +-------------------------------------------------- +// CONSOLE + +This can be useful if the node names are known (and known to be unique) in +advance, and means that the first `GET /_cluster/bootstrap` call is not +necessary. As above, it is safe to repeatedly call `POST /_cluster/bootstrap`, +and to call it on different nodes concurrently, but **it is vitally important** +to use the same bootstrap document in each call. + +[float] +=== Manually-triggered elections + +It is possible to request that a particular node takes over from the elected +master as follows: + +[source,js] +-------------------------------------------------- +POST /_nodes/node_name/start_election +-------------------------------------------------- +// CONSOLE + +Elections are not guaranteed to succeed, and a new leader may be elected at any +time so even if this election does succeed then there may be another election, +so there is no guarantee that the chosen node will be the elected master for +any length of time. + +[float] +=== Unsafe disaster recovery + +In a disaster situation a cluster may have lost half or more of its +master-eligible nodes and therefore be in a state in which it cannot elect a +master. There is no way to recover from this situation without risking data +loss, but if there is no other viable path forwards then this may be necessary. +This can be performed with the following command on a surviving node: + +[source,js] +-------------------------------------------------- +POST /_nodes/_local/force_become_leader +-------------------------------------------------- +// CONSOLE + +This works by reducing `cluster.master_nodes_failure_tolerance` to 0 and then +forcibly overriding the current configuration with one in which the handling +node is the only voting master, so that it forms a quorum on its own. Because +there is a risk of data loss when performing this command it requires the +`accept_data_loss` parameter to be set to `true` in the URL. + +[float] +=== Election scheduling + +Elasticsearch uses an election process to agree on an elected master node, both +at startup and if the existing elected master fails. Any master-eligible node +can start an election, and normally the first election that takes place will +succeed. Elections only usually fail when two nodes both happen to start their +elections at about the same time, so elections are scheduled randomly on each +node to avoid this happening. Nodes will retry elections until a master is +elected, backing off on failure, so that eventually an election will succeed +with arbitrarily high probability. The following settings control the +scheduling of elections. + +`cluster.election.initial_timeout`:: + + Sets the upper bound on how long a node will wait initially, or after a + leader failure, before attempting its first election. This defaults to + `100ms`. + +`cluster.election.back_off_time`:: + + Sets the amount to increase the upper bound on the wait before an election + on each election failure. Note that this is _linear_ backoff. 
This defaults
+    to `100ms`.
+
+`cluster.election.max_timeout`::
+
+    Sets the maximum upper bound on how long a node will wait before attempting
+    a first election, so that a network partition that lasts for a long time
+    does not result in excessively sparse elections. This defaults to `10s`.
+
+`cluster.election.duration`::
+
+    Sets how long each election is allowed to take before a node considers it
+    to have failed and schedules a retry. This defaults to `500ms`.
+
+[float]
+=== Fault detection
+
+An elected master periodically checks each of its followers in order to ensure
+that they are still connected and healthy, and in turn each follower
+periodically checks the health of the elected master. Elasticsearch allows
+these checks occasionally to fail or time out without taking any action, and
+will only consider a node to be truly faulty after a number of consecutive
+checks have failed. The following settings control the behaviour of fault
+detection.
+
+`cluster.fault_detection.follower_check.interval`::
+
+    Sets how long the elected master waits between checks of its followers.
+    Defaults to `1s`.
+
+`cluster.fault_detection.follower_check.timeout`::
+
+    Sets how long the elected master waits for a response to a follower check
+    before considering it to have failed. Defaults to `30s`.
+
+`cluster.fault_detection.follower_check.retry_count`::
+
+    Sets how many consecutive follower check failures must occur before the
+    elected master considers a follower node to be faulty and removes it from
+    the cluster. Defaults to `3`.
+
+`cluster.fault_detection.leader_check.interval`::
+
+    Sets how long each follower node waits between checks of its leader.
+    Defaults to `1s`.
+
+`cluster.fault_detection.leader_check.timeout`::
+
+    Sets how long each follower node waits for a response to a leader check
+    before considering it to have failed. Defaults to `30s`.
+
+`cluster.fault_detection.leader_check.retry_count`::
+
+    Sets how many consecutive leader check failures must occur before a
+    follower node considers the elected master to be faulty and attempts to
+    find or elect a new master. Defaults to `3`.
+
+[float]
+=== Discovery settings
+
+TODO move this to the discovery module docs
+
+Discovery operates in two phases: first, each node "probes" the addresses of
+all known nodes by connecting to each address and attempting to identify the
+node to which it is connected. Secondly, it shares with the remote node a list
+of all of its peers and the remote node responds with _its_ peers in turn. The
+node then probes all the new nodes that it has just discovered, requests their
+peers, and so on, until it has discovered an elected master node or enough
+other masterless nodes that it can perform an election. If neither of these
+occurs quickly enough then it tries again. This process is controlled by the
+following settings.
+
+`discovery.probe.connect_timeout`::
+
+    Sets how long to wait when attempting to connect to each address. Defaults
+    to `3s`.
+
+`discovery.probe.handshake_timeout`::
+
+    Sets how long to wait when attempting to identify the remote node via a
+    handshake. Defaults to `1s`.
+
+`discovery.find_peers_interval`::
+
+    Sets how long a node will wait before attempting another discovery round.
+
+`discovery.request_peers_timeout`::
+
+    Sets how long a node will wait after asking its peers again before
+    considering the request to have failed.
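+
+For illustration only, a handful of these timeouts could be overridden together
+in `elasticsearch.yml`; this is just a sketch, and the values shown are simply
+the defaults documented above:
+
+[source,yaml]
+--------------------------------------------------
+# Probing timeouts used while discovering other nodes
+discovery.probe.connect_timeout: 3s
+discovery.probe.handshake_timeout: 1s
+# Election scheduling: cap the randomised wait before an election attempt
+cluster.election.max_timeout: 10s
+# Fault detection between the elected master and its followers
+cluster.fault_detection.follower_check.interval: 1s
+cluster.fault_detection.leader_check.retry_count: 3
+--------------------------------------------------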
+ +[float] +=== Miscellaneous timeouts + +`cluster.join.timeout`:: + + Sets how long a node will wait after sending a request to join a cluster + before it considers the request to have failed and retries. Defaults to + `60s`. + +`cluster.publish.timeout`:: + + Sets how long the elected master will wait after publishing a cluster state + update to receive acknowledgements from all its followers. If this timeout + occurs then the elected master may start to calculate and publish a + subsequent cluster state update, as long as it received enough + acknowledgements to know that the previous publication was committed; if it + did not receive enough acknowledgements to commit the update then it stands + down as the elected leader. From 56d050f51e54d75ce24c53519bdafe280d4b4775 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 23 Oct 2018 08:16:07 +0100 Subject: [PATCH 002/106] Review/rework --- docs/reference/modules/coordination.asciidoc | 95 ++++++++++++-------- 1 file changed, 59 insertions(+), 36 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index d8fc1172647d7..a9f98e6c63652 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -7,14 +7,17 @@ managing changes to the cluster state. [float] === Quorum-based decision making -Electing a master node and changing the cluster state both work robustly by -using multiple nodes, only considering each action to have succeeded on receipt -of responses from a majority of the master-eligible nodes in the cluster. The -advantage of requiring a majority of nodes to respond is that it allows for -nearly half of the master-eligible nodes to fail without preventing the cluster -from making progress, but also does not allow the cluster to "split brain", -i.e. to be partitioned into two pieces each of which may make decisions that -are inconsistent with those of the other piece. +Electing a master node and changing the cluster state are the two fundamental +tasks that master-eligible nodes must work together to perform. It is important +that these activities work robustly even if some nodes have failed, and +Elasticsearch achieves this robustness by involving multiple nodes and only +considering each action to have succeeded on receipt of responses from a +majority of the master-eligible nodes in the cluster. The advantage of +requiring a majority of nodes to respond is that it allows for nearly half of +the master-eligible nodes to fail without preventing the cluster from making +progress, but also does not allow the cluster to "split brain", i.e. to be +partitioned into two pieces each of which may make decisions that are +inconsistent with those of the other piece. Elasticsearch allows you to add and remove master-eligible nodes to a running cluster. In many cases you can do this simply by starting or stopping the nodes @@ -23,16 +26,16 @@ Elasticsearch maintains an optimal level of fault tolerance by updating the cluster's _configuration_, which is the set of master-eligible nodes whose responses are counted when making decisions such as electing a new master or committing a new cluster state. A decision is only made once more than half of -the nodes _in the configuration_ have responded. Usually the cluster +the nodes in the configuration have responded. Usually the cluster configuration is the same as the set of all the master-eligible nodes that are -currently in the cluster, but in some situations they may be different. 
As long -as more than half of the nodes in the configuration are still healthy then the -cluster can still make progress. The configuration is managed automatically by -Elasticsearch and stored in the cluster state so you can inspect its current -contents by (TODO API call?). +currently in the cluster, but there are some situations in which they may be +different. As long as more than half of the nodes in the configuration are +still healthy then the cluster can still make progress. The configuration is +managed automatically by Elasticsearch and stored in the cluster state so you +can inspect its current contents by (TODO API call?). The way that the configuration is managed is controlled by the following -settings. (TODO maybe not settings?) +setting. (TODO maybe not a setting?) `cluster.master_nodes_failure_tolerance`:: @@ -93,6 +96,12 @@ neither side could make any progress in this situation. [float] ==== Adding master-eligible nodes +It is recommended to have a small, fixed, number of master-eligible nodes in a +cluster, and to scale the cluster up and down by adding and removing +non-master-eligible nodes only. However there are situations in which it may be +necessary to add some master-eligible nodes to a cluster, such as when +migrating a cluster onto a new set of nodes without downtime. + If you wish to add some master-eligible nodes to your cluster, simply configure the new nodes to find the existing cluster and start them up. Once the new nodes have joined the cluster, you may be able to increase the @@ -102,6 +111,12 @@ info/warn messages about this?) [float] ==== Removing master-eligible nodes +It is recommended to have a small, fixed, number of master-eligible nodes in a +cluster, and to scale the cluster up and down by adding and removing +non-master-eligible nodes only. However there are situations in which it may be +necessary to remove some master-eligible nodes to a cluster, such as when +migrating a cluster onto a new set of nodes without downtime. + If you wish to remove some of the master-eligible nodes in your cluster, you must first reduce the `cluster.master_nodes_failure_tolerance` setting to match the target cluster size before removing the extraneous nodes. This temporary @@ -175,7 +190,7 @@ retirement they may not immediately be added to the configuration. A major risk when starting up a brand-new cluster is that you accidentally form two separate clusters instead of one. This could lead to data loss: you might start using both clusters before noticing that anything had gone wrong, and it -might then be impossible to merge them together later. +will then be impossible to merge them together later. To illustrate how this could happen, imagine starting up a three-node cluster in which each node knows that it is going to be part of a three-node cluster. A @@ -206,7 +221,7 @@ $ bin/elasticsearch-bootstrap-cluster --failure-tolerance 1 \ -------------------------------------------------- The arguments to this tool are the target failure tolerance of the cluster and -the addresses of (some of) its master-eligible nodes. +the addresses of (some, preferably all, of) its master-eligible nodes. If it is not possible to use this tool, you can also bootstrap the cluster via the API as described here. There are two steps to the bootstrapping process. @@ -232,9 +247,9 @@ and looks like the following. 
{ "master_nodes_failure_tolerance": 1, "master_nodes":[ - {"id":"USpTGYaBSIKbgSUJR2Z9lg"}, - {"id":"gSUJR2Z9lgUSpTGYaBSIKb"}, - {"id":"2Z9lgUSpTgSUYaBSIKbJRG"} + {"id":"USpTGYaBSIKbgSUJR2Z9lg","name":"master-a"}, + {"id":"gSUJR2Z9lgUSpTGYaBSIKb","name":"master-b"}, + {"id":"2Z9lgUSpTgSUYaBSIKbJRG","name":"master-c"} ] } -------------------------------------------------- @@ -253,20 +268,22 @@ POST /_cluster/bootstrap { "master_nodes_failure_tolerance": 1, "master_nodes":[ - {"id":"USpTGYaBSIKbgSUJR2Z9lg"}, - {"id":"gSUJR2Z9lgUSpTGYaBSIKb"}, - {"id":"2Z9lgUSpTgSUYaBSIKbJRG"} + {"id":"USpTGYaBSIKbgSUJR2Z9lg","name":"master-a"}, + {"id":"gSUJR2Z9lgUSpTGYaBSIKb","name":"master-b"}, + {"id":"2Z9lgUSpTgSUYaBSIKbJRG","name":"master-c"} ] } -------------------------------------------------- // CONSOLE -It is safe to repeatedly call `POST /_cluster/bootstrap`, and to call it on -different nodes concurrently, but **it is vitally important** to use the same -bootstrap document in each call. +This only needs to occur once, on a single master-eligible node in the cluster, +but for robustness it is afe to repeatedly call `POST /_cluster/bootstrap`, and +to call it on different nodes concurrently. However **it is vitally important** +to use the same bootstrap document in each call. -It is also possible to select the initial set of nodes in terms of their names -rather than their IDs as follows. +It is also possible to construct a bootstrap document manually and to specify +the initial set of nodes in terms of their names alone, rather than needing to +know their IDs too: [source,js] -------------------------------------------------- @@ -285,9 +302,10 @@ POST /_cluster/bootstrap This can be useful if the node names are known (and known to be unique) in advance, and means that the first `GET /_cluster/bootstrap` call is not -necessary. As above, it is safe to repeatedly call `POST /_cluster/bootstrap`, -and to call it on different nodes concurrently, but **it is vitally important** -to use the same bootstrap document in each call. +necessary. As above, only a single such call is required but it is safe to +repeatedly call `POST /_cluster/bootstrap`, and to call it on different nodes +concurrently, but **it is vitally important** to use the same bootstrap +document in each call. [float] === Manually-triggered elections @@ -301,10 +319,13 @@ POST /_nodes/node_name/start_election -------------------------------------------------- // CONSOLE -Elections are not guaranteed to succeed, and a new leader may be elected at any -time so even if this election does succeed then there may be another election, -so there is no guarantee that the chosen node will be the elected master for -any length of time. +This request is handled on a best-effort basis only. Handling it involves +cooperation from the currently-elected master and the selected node, and it +will be rejected if it would destabilise the cluster. Also, elections are not +guaranteed to succeed, and a new leader may be elected at any time so even if +this election does succeed then there may be another election soon afterwards. +Therefore there is no guarantee that the chosen node will be the elected master +for any length of time. [float] === Unsafe disaster recovery @@ -325,7 +346,9 @@ This works by reducing `cluster.master_nodes_failure_tolerance` to 0 and then forcibly overriding the current configuration with one in which the handling node is the only voting master, so that it forms a quorum on its own. 
Because there is a risk of data loss when performing this command it requires the -`accept_data_loss` parameter to be set to `true` in the URL. +`accept_data_loss` parameter to be set to `true` in the URL. Afterwards, once +the cluster has successfully formed, `cluster.master_nodes_failure_tolerance` +should be increased to a suitable value. [float] === Election scheduling From aa6df51bca5d6073c8408e101f63187e2e8d317e Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 23 Oct 2018 08:32:59 +0100 Subject: [PATCH 003/106] More review feedback --- docs/reference/modules/coordination.asciidoc | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index a9f98e6c63652..dca8a28acac2e 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -32,7 +32,13 @@ currently in the cluster, but there are some situations in which they may be different. As long as more than half of the nodes in the configuration are still healthy then the cluster can still make progress. The configuration is managed automatically by Elasticsearch and stored in the cluster state so you -can inspect its current contents by (TODO API call?). +can inspect its current contents as follows: + +[source,js] +-------------------------------------------------- +GET /_cluster/state?filter_path=TODO +-------------------------------------------------- +// CONSOLE The way that the configuration is managed is controlled by the following setting. (TODO maybe not a setting?) @@ -184,6 +190,15 @@ The node (or nodes) to reinstate are specified using <> in place of `node_name` here. After being brought back out of retirement they may not immediately be added to the configuration. +The current set of retired nodes is stored in the cluster state and can be +inspected as follows: + +[source,js] +-------------------------------------------------- +GET /_cluster/state?filter_path=TODO +-------------------------------------------------- +// CONSOLE + [float] === Cluster bootstrapping @@ -360,7 +375,7 @@ succeed. Elections only usually fail when two nodes both happen to start their elections at about the same time, so elections are scheduled randomly on each node to avoid this happening. Nodes will retry elections until a master is elected, backing off on failure, so that eventually an election will succeed -with arbitrarily high probability. The following settings control the +(with arbitrarily high probability). The following settings control the scheduling of elections. `cluster.election.initial_timeout`:: From 7c9db23149f593dfa5a9a4ab1fb72df5d39bc27d Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 23 Oct 2018 08:49:17 +0100 Subject: [PATCH 004/106] Bootstrapping explanation as NOTE --- docs/reference/modules/coordination.asciidoc | 32 ++++++++++---------- 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index dca8a28acac2e..4950a1284377b 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -202,24 +202,24 @@ GET /_cluster/state?filter_path=TODO [float] === Cluster bootstrapping -A major risk when starting up a brand-new cluster is that you accidentally form -two separate clusters instead of one. 
This could lead to data loss: you might
-start using both clusters before noticing that anything had gone wrong, and it
-will then be impossible to merge them together later.
-
-To illustrate how this could happen, imagine starting up a three-node cluster
-in which each node knows that it is going to be part of a three-node cluster. A
-majority of three nodes is two, so normally the first two nodes to discover
-each other will form a cluster and the third node will join them a short time
-later. However, imagine that four nodes were accidentally started instead of
-three: in this case there are enough nodes to form two separate clusters. Of
-course if each node is started manually then it's unlikely that too many nodes
-are started, but it's certainly possible to get into this situation if using a
-more automated orchestrator, particularly if a network partition happens at the
-wrong time.
+There is a risk when starting up a brand-new cluster that you accidentally
+form two separate clusters instead of one. This could lead to data loss: you
+might start using both clusters before noticing that anything had gone wrong,
+and it will then be impossible to merge them together later.
+
+NOTE: To illustrate how this could happen, imagine starting up a three-node
+cluster in which each node knows that it is going to be part of a three-node
+cluster. A majority of three nodes is two, so normally the first two nodes to
+discover each other will form a cluster and the third node will join them a
+short time later. However, imagine that four nodes were accidentally started
+instead of three: in this case there are enough nodes to form two separate
+clusters. Of course if each node is started manually then it's unlikely that
+too many nodes are started, but it's certainly possible to get into this
+situation if using a more automated orchestrator, particularly if a network
+partition happens at the wrong time.

 We avoid this by requiring a separate _cluster bootstrap_ process to take place
-on every brand-new cluster. This is only required the first time the whole
+on every brand-new cluster.
This is only required the very first time the whole cluster starts up: new nodes joining an established cluster can safely obtain all the information they need from the elected master, and nodes that have previously been part of a cluster will have stored to disk all the information From 830eca7576e1ed47f2617733564830847fc967f2 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 23 Oct 2018 09:12:31 +0100 Subject: [PATCH 005/106] Rename to 'POST /_cluster/force_local_node_takeover' --- docs/reference/modules/coordination.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 4950a1284377b..89a902083a044 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -353,7 +353,7 @@ This can be performed with the following command on a surviving node: [source,js] -------------------------------------------------- -POST /_nodes/_local/force_become_leader +POST /_cluster/force_local_node_takeover -------------------------------------------------- // CONSOLE From d91c924e9b6b514b8952c88bfa03cae9314f2890 Mon Sep 17 00:00:00 2001 From: David Turner Date: Thu, 25 Oct 2018 08:19:25 +0100 Subject: [PATCH 006/106] WIP rolling restarts --- docs/reference/modules/coordination.asciidoc | 31 +++++++++++++++----- 1 file changed, 23 insertions(+), 8 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 89a902083a044..48c2c572be45e 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -10,14 +10,13 @@ managing changes to the cluster state. Electing a master node and changing the cluster state are the two fundamental tasks that master-eligible nodes must work together to perform. It is important that these activities work robustly even if some nodes have failed, and -Elasticsearch achieves this robustness by involving multiple nodes and only -considering each action to have succeeded on receipt of responses from a -majority of the master-eligible nodes in the cluster. The advantage of -requiring a majority of nodes to respond is that it allows for nearly half of -the master-eligible nodes to fail without preventing the cluster from making -progress, but also does not allow the cluster to "split brain", i.e. to be -partitioned into two pieces each of which may make decisions that are -inconsistent with those of the other piece. +Elasticsearch achieves this robustness by only considering each action to have +succeeded on receipt of responses from a majority of the master-eligible nodes +in the cluster. The advantage of requiring a majority of nodes to respond is +that it allows for nearly half of the master-eligible nodes to fail without +preventing the cluster from making progress, but also does not allow the +cluster to "split brain", i.e. to be partitioned into two pieces each of which +may make decisions that are inconsistent with those of the other piece. Elasticsearch allows you to add and remove master-eligible nodes to a running cluster. In many cases you can do this simply by starting or stopping the nodes @@ -199,6 +198,22 @@ GET /_cluster/state?filter_path=TODO -------------------------------------------------- // CONSOLE +[float] +=== Rolling restarts and migrations + +It is possible to perform some cluster maintenance tasks without taking the +whole cluster offline, such as a <>. 
A +rolling restart does not require any special handling for the master nodes or +any use of the APIs described here. During a rolling restart the restarting +node will be offline, and this reduces the cluster's ability to tolerate faults +of its other nodes. If it is necessary to avoid this, you can temporarily add a +new master node to the cluster as described above, perform the rolling restart, +and then remove the extra master node again. + +It is also possible to perform a migration of a cluster onto entirely new nodes +without taking the cluster offline. A _rolling migration_ is similar to a +rolling restart, in that it is performed one node at a time. + [float] === Cluster bootstrapping From d03103a3bdabaf459339c84a0b2b7c672c530916 Mon Sep 17 00:00:00 2001 From: David Turner Date: Thu, 25 Oct 2018 11:20:51 +0100 Subject: [PATCH 007/106] Reorder bootstrap section --- docs/reference/modules/coordination.asciidoc | 131 ++++++++----------- 1 file changed, 58 insertions(+), 73 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 48c2c572be45e..3479c9a8d264b 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -233,44 +233,18 @@ too many nodes are started, but it's certainly possible to get into this situation if using a more automated orchestrator, particularly if a network partition happens at the wrong time. -We avoid this by requiring a separate _cluster bootstrap_ process to take place -on every brand-new cluster. This is only required the very first time the whole -cluster starts up: new nodes joining an established cluster can safely obtain -all the information they need from the elected master, and nodes that have -previously been part of a cluster will have stored to disk all the information -required when restarting. - -The simplest way to bootstrap a cluster is to use the -`elasticsearch-bootstrap-cluster` command-line tool: - -[source,txt] --------------------------------------------------- -$ bin/elasticsearch-bootstrap-cluster --failure-tolerance 1 \ - --node http://10.0.12.1:9200/ --node http://10.0.13.1:9200/ \ - --node https://10.0.14.1:9200/ --------------------------------------------------- - -The arguments to this tool are the target failure tolerance of the cluster and -the addresses of (some, preferably all, of) its master-eligible nodes. - -If it is not possible to use this tool, you can also bootstrap the cluster via -the API as described here. There are two steps to the bootstrapping process. -Firstly, after all the nodes have started up, created their persistent node -IDs, and discovered each other, the first step is to request a bootstrap -document: - -[source,js] --------------------------------------------------- -# Return the current bootstrap document immediately -GET /_cluster/bootstrap -# Wait until the node has discovered at least 3 nodes, or 60 seconds has elapsed, -# and then return the bootstrap document -GET /_cluster/bootstrap?wait_for_nodes=3&timeout=60s --------------------------------------------------- -// CONSOLE - -The boostrap document contains information that the cluster needs to start up, -and looks like the following. +We avoid this by requiring a separate _cluster bootstrapping_ process to take +place on every brand-new cluster. 
This is only required the very first time the +whole cluster starts up: new nodes joining an established cluster can safely +obtain all the information they need from the elected master, and nodes that +have previously been part of a cluster will have stored to disk all the +information required when restarting. + +A cluster can be bootstrapped by sending a _bootstrap warrant_ to any of its +master-eligible nodes. A bootstrap warrant is a document that contains the +information that the cluster needs to finish forming, including the identities +of the master-eligible nodes that form its first voting configuration, and +looks like this: [source,js] -------------------------------------------------- @@ -284,16 +258,13 @@ and looks like the following. } -------------------------------------------------- -It is safe to repeatedly call `GET /_cluster/bootstrap`, and to call it on -different nodes concurrently. This API will yield an error if the receiving -node has already been bootstrapped or has joined an existing cluster. - -Once a bootstrap document has been received, it must then be sent back to the -cluster to finish the bootstrapping process as follows: +To bootstrap a cluster, the administrator must identify a suitable set of +master-eligible nodes, construct a bootstrap warrant, and pass the warrant to +the `POST /_cluster/bootstrap` API: [source,js] -------------------------------------------------- -# send the bootstrap document back to the cluster +# send the bootstrap warrant back to the cluster POST /_cluster/bootstrap { "master_nodes_failure_tolerance": 1, @@ -307,17 +278,40 @@ POST /_cluster/bootstrap // CONSOLE This only needs to occur once, on a single master-eligible node in the cluster, -but for robustness it is afe to repeatedly call `POST /_cluster/bootstrap`, and -to call it on different nodes concurrently. However **it is vitally important** -to use the same bootstrap document in each call. +but for robustness it is safe to repeatedly call `POST /_cluster/bootstrap`, +and to call it on different nodes concurrently. However **it is vitally +important** to use the same bootstrap warrant in each call. -It is also possible to construct a bootstrap document manually and to specify -the initial set of nodes in terms of their names alone, rather than needing to -know their IDs too: +WARNING: You must pass the same bootstrap warrant to each call to `POST +/_cluster/bootstrap` in order to be sure that only a single cluster forms +during bootstrapping. + +The simplest and safest way to construct a bootstrap warrant is to use the `GET +/_cluster/bootstrap` API: + +[source,js] +-------------------------------------------------- +# Immediately return a bootstrap warrant based on the nodes discovered so far +GET /_cluster/bootstrap +# Wait until the node has discovered at least 3 nodes, or 60 seconds has elapsed, +# and then return the resulting bootstrap warrant +GET /_cluster/bootstrap?wait_for_nodes=3&timeout=60s +-------------------------------------------------- +// CONSOLE + +This API returns a properly-constructed bootstrap warrant that is ready to pass +to the `POST /_cluster/bootstrap` API. It includes all of the master-eligible +nodes that the handling node has discovered via the gossip-based discovery +protocol, and returns an error if fewer nodes have been discovered than +expected. 
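+
+For illustration only, the two steps might be combined from a shell using
+`curl`, assuming a master-eligible node is reachable at
+`http://10.0.12.1:9200/` (an address used purely as an example):
+
+[source,txt]
+--------------------------------------------------
+# Ask a node for a bootstrap warrant once it has discovered 3 master-eligible nodes
+$ curl -s 'http://10.0.12.1:9200/_cluster/bootstrap?wait_for_nodes=3&timeout=60s' > warrant.json
+# Send exactly the same warrant back to finish bootstrapping the cluster
+$ curl -s -H 'Content-Type: application/json' -X POST \
+    'http://10.0.12.1:9200/_cluster/bootstrap' --data-binary @warrant.json
+--------------------------------------------------
+
+Capturing the warrant in a file makes it easy to retry the `POST` against
+another node with the same warrant if the first attempt fails.
+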
+ +It is also possible to construct a bootstrap warrant manually and to specify +the initial set of nodes in terms of their names alone, rather than including +their IDs too: [source,js] -------------------------------------------------- -# send the bootstrap document back to the cluster +# send the bootstrap warrant back to the cluster POST /_cluster/bootstrap { "master_nodes_failure_tolerance": 1, @@ -330,32 +324,23 @@ POST /_cluster/bootstrap -------------------------------------------------- // CONSOLE -This can be useful if the node names are known (and known to be unique) in -advance, and means that the first `GET /_cluster/bootstrap` call is not -necessary. As above, only a single such call is required but it is safe to -repeatedly call `POST /_cluster/bootstrap`, and to call it on different nodes -concurrently, but **it is vitally important** to use the same bootstrap -document in each call. - -[float] -=== Manually-triggered elections +It is safer to include the node IDs, in case two nodes are accidentally started +with the same name. -It is possible to request that a particular node takes over from the elected -master as follows: +This process is implemented in the `elasticsearch-bootstrap-cluster` +command-line tool: -[source,js] +[source,txt] -------------------------------------------------- -POST /_nodes/node_name/start_election +$ bin/elasticsearch-bootstrap-cluster --failure-tolerance 1 \ + --node http://10.0.12.1:9200/ --node http://10.0.13.1:9200/ \ + --node https://10.0.14.1:9200/ -------------------------------------------------- -// CONSOLE -This request is handled on a best-effort basis only. Handling it involves -cooperation from the currently-elected master and the selected node, and it -will be rejected if it would destabilise the cluster. Also, elections are not -guaranteed to succeed, and a new leader may be elected at any time so even if -this election does succeed then there may be another election soon afterwards. -Therefore there is no guarantee that the chosen node will be the elected master -for any length of time. +The arguments to this tool are the target failure tolerance of the cluster and +the addresses of (some, preferably all, of) its master-eligible nodes. The tool +will construct a bootstrap warrant and then bootstrap the cluster, retrying +safely if any step fails. [float] === Unsafe disaster recovery From c762dbafd16303f6f088091d5b9dcbb86a8c4106 Mon Sep 17 00:00:00 2001 From: David Turner Date: Thu, 25 Oct 2018 11:30:23 +0100 Subject: [PATCH 008/106] Finish section on migration/restarts --- docs/reference/modules/coordination.asciidoc | 25 ++++++++++++++------ 1 file changed, 18 insertions(+), 7 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 3479c9a8d264b..bb6eabe3f1081 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -203,16 +203,27 @@ GET /_cluster/state?filter_path=TODO It is possible to perform some cluster maintenance tasks without taking the whole cluster offline, such as a <>. A -rolling restart does not require any special handling for the master nodes or -any use of the APIs described here. During a rolling restart the restarting -node will be offline, and this reduces the cluster's ability to tolerate faults -of its other nodes. 
If it is necessary to avoid this, you can temporarily add a -new master node to the cluster as described above, perform the rolling restart, -and then remove the extra master node again. +rolling restart does not require any special handling for the master-eligible +nodes or any use of the APIs described here. During a rolling restart the +restarting node will be offline, and this reduces the cluster's ability to +tolerate faults of its other nodes. If it is necessary to avoid this, you can +temporarily add a new master node to the cluster as described above, perform +the rolling restart, and then remove the extra master node again. It is also possible to perform a migration of a cluster onto entirely new nodes without taking the cluster offline. A _rolling migration_ is similar to a -rolling restart, in that it is performed one node at a time. +rolling restart, in that it is performed one node at a time, and also requires +no special handling for the master-eligible nodes. + +Alternatively a migration can be performed by starting up all the new nodes at +once, each configured to join the existing cluster, migrating all the data +using <>, and then +shutting down the old nodes. Care must be taken not to shut the master-eligible +nodes down too quickly since a majority of the voting configuration is always +required to keep the cluster alive. This can be done with the `GET +/_nodes/node_name/await_removal` API described above, or else the old +master-eligible nodes may all be retired so that they do not form part of the +voting configuration. [float] === Cluster bootstrapping From 8ca2e75000d092d18ea8f0cec926d06b34fa9dbf Mon Sep 17 00:00:00 2001 From: David Turner Date: Thu, 25 Oct 2018 14:55:10 +0100 Subject: [PATCH 009/106] Different auto-config heuristics --- docs/reference/modules/coordination.asciidoc | 199 +++++++++---------- 1 file changed, 92 insertions(+), 107 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index bb6eabe3f1081..7b5a0aaa9afa1 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -22,16 +22,22 @@ Elasticsearch allows you to add and remove master-eligible nodes to a running cluster. In many cases you can do this simply by starting or stopping the nodes as required, as described in more detail below. As nodes are added or removed Elasticsearch maintains an optimal level of fault tolerance by updating the -cluster's _configuration_, which is the set of master-eligible nodes whose -responses are counted when making decisions such as electing a new master or -committing a new cluster state. A decision is only made once more than half of -the nodes in the configuration have responded. Usually the cluster +cluster's _voting configuration_, which is the set of master-eligible nodes +whose responses are counted when making decisions such as electing a new master +or committing a new cluster state. A decision is only made once more than half +of the nodes in the voting configuration have responded. Usually the voting configuration is the same as the set of all the master-eligible nodes that are currently in the cluster, but there are some situations in which they may be -different. As long as more than half of the nodes in the configuration are -still healthy then the cluster can still make progress. 
The configuration is -managed automatically by Elasticsearch and stored in the cluster state so you -can inspect its current contents as follows: +different. As long as more than half of the nodes in the voting configuration +are still healthy then the cluster can still make progress. + +[float] +==== Auto-reconfiguration + +Nodes may join or leave the cluster, and Elasticsearch reacts by making +corresponding changes to the voting configuration in order to ensure that the +cluster is as resilient as possible. The current voting configuration is +stored in the cluster state so you can inspect its current contents as follows: [source,js] -------------------------------------------------- @@ -39,64 +45,51 @@ GET /_cluster/state?filter_path=TODO -------------------------------------------------- // CONSOLE -The way that the configuration is managed is controlled by the following -setting. (TODO maybe not a setting?) - -`cluster.master_nodes_failure_tolerance`:: - - Sets the number of master-eligible nodes whose simultaneous failure the - cluster should be able to tolerate. This imposes a lower bound on the size - of the configuration. Elasticsearch will not remove nodes from the - configuration if their removal would break this bound, and will not permit - this bound to be increased to a value that is too large for the current - configuration. - -The relationship between the number of master-eligible nodes in your cluster, -the size of a majority, and the appropriate value for the -`cluster.master_nodes_failure_tolerance` setting is shown below. - -[cols="<,<,<",options="header",] -|======================================================================================= -|Number of master-eligible nodes |Majority size |`cluster.master_nodes_failure_tolerance` -|1 |1 |0 -|2 |2 |0 -|**3 (recommended)** |**2** |**1** -|4 |3 |1 -|5 |3 |2 -|6 |4 |2 -|7 |4 |3 -|======================================================================================= - -The minimum configuration size is `2 * cluster.master_nodes_failure_tolerance + 1`: - -[cols="<,<",options="header",] -|==================================================================== -|`cluster.master_nodes_failure_tolerance` |Minimum configuration size -|0 |1 -|**1** |**3** -|2 |5 -|3 |7 -|==================================================================== - -It is permissible, but not recommended, to set -`cluster.master_nodes_failure_tolerance` too low for your cluster or, put -differently, to have more master-eligible nodes than the minimum configuration -size. To do so is _safe_ in the sense that the cluster will not suffer from a -split-brain however this setting is configured, but if -`cluster.master_nodes_failure_tolerance` is too low then your cluster may not -tolerate as many failures as expected. +Larger voting configurations are usually more resilient, so Elasticsearch will +normally prefer to add nodes to the voting configuration once they have joined +the cluster. Similarly, if a node in the voting configuration leaves the +cluster and there is another node in the cluster that is not in the voting +configuration then it is preferable to swap these two nodes in the voting +configuration, leaving its size unchanged but increasing its resilience. 
+ +It is not so straightforward to automatically remove nodes from the voting +configuration after they have left the cluster, and different strategies have +different benefits and drawbacks, so the right choice depends on how the +cluster will be used and is controlled by the following setting. + +`cluster.automatically_shrink_voting_configuration`:: + + Defaults to `true`, meaning that the voting configuration will + automatically shrink, shedding departed nodes, as long as it still contains + at least 3 nodes. If set to `false`, the voting configuration never + automatically shrinks; departed nodes must be removed manually using the + retirement API described below. + +If `cluster.automatically_shrink_voting_configuration` is set to `true`, the +recommended and default setting, and there are at least three master-eligible +nodes in the cluster, then Elasticsearch remains capable of processing +cluster-state updates as long as all but one of its master-eligible nodes are +healthy. There are situations in which it might tolerate the loss of multiple +nodes, but this is not guaranteed under all sequences of cascading failures. If +this setting is set to `false` then departed nodes must be removed from the +voting configuration manually, using the retirement API described below, to +achieve the desired level of resilience. + +Note that Elasticsearch will not suffer from a "split-brain" inconsistency +however it is configured. This setting only affects its availability in the +event of some node failures. [float] ==== Even numbers of master-eligible nodes There should normally be an odd number of master-eligible nodes in a cluster. -If there is an even number then Elasticsearch will put all but one of them into -the configuration to ensure that the configuration has an odd size. This does -not decrease the failure-tolerance of the cluster, and in fact improves it -slightly: if the cluster is partitioned into two even halves then one of the -halves will contain a majority of the masters and will be able to keep -operating, whereas if all of the master-eligible nodes' votes were counted then -neither side could make any progress in this situation. +If there is an even number then Elasticsearch will leave one of them out of the +voting configuration to ensure that it has an odd size. This does not decrease +the failure-tolerance of the cluster, and in fact improves it slightly: if the +cluster is partitioned into two even halves then one of the halves will contain +a majority of the voting configuration and will be able to keep operating, +whereas if all of the master-eligible nodes' votes were counted then neither +side could make any progress in this situation. [float] ==== Adding master-eligible nodes @@ -108,10 +101,9 @@ necessary to add some master-eligible nodes to a cluster, such as when migrating a cluster onto a new set of nodes without downtime. If you wish to add some master-eligible nodes to your cluster, simply configure -the new nodes to find the existing cluster and start them up. Once the new -nodes have joined the cluster, you may be able to increase the -`cluster.master_nodes_failure_tolerance` setting to match. (TODO do we log -info/warn messages about this?) +the new nodes to find the existing cluster and start them up. Elasticsearch +will add the new nodes to the voting configuration if it is appropriate to do +so. [float] ==== Removing master-eligible nodes @@ -122,20 +114,14 @@ non-master-eligible nodes only. 
However there are situations in which it may be necessary to remove some master-eligible nodes to a cluster, such as when migrating a cluster onto a new set of nodes without downtime. -If you wish to remove some of the master-eligible nodes in your cluster, you -must first reduce the `cluster.master_nodes_failure_tolerance` setting to match -the target cluster size before removing the extraneous nodes. This temporary -situation is the only case in which `cluster.master_nodes_failure_tolerance` -should be set lower than the recommended values above. - -You must also be careful not to remove too many master-eligible nodes all at -the same time. For instance, if you currently have seven master-eligible nodes -and you wish to reduce this to three, you cannot simply stop four of the nodes -all at the same time: to do so would leave only three nodes remaining, which is -less than half of the cluster, which means it cannot take any further actions. -You should remove the nodes one-at-a-time and verify that each node has been -removed from the configuration before moving onto the next one, using the -`await_removal` API: +You must be careful not to remove too many master-eligible nodes all at the +same time. For instance, if you currently have seven master-eligible nodes and +you wish to reduce this to three, you cannot simply stop four of the nodes all +at the same time: to do so would leave only three nodes remaining, which is +less than half of the voting configuration, which means it cannot take any +further actions. You should remove the nodes one-at-a-time and verify that +each node has been removed from the voting configuration before moving onto the +next one, using the `await_removal` API: [source,js] -------------------------------------------------- @@ -153,11 +139,11 @@ A special case of this is the case where there are only two master-eligible nodes and you wish to remove one of them. In this case neither node can be safely shut down since both nodes are required to reliably make progress, so you must first explicitly _retire_ one of the nodes. A retired node still works -normally, but Elasticsearch will try and transfer it out of the current +normally, but Elasticsearch will try and transfer it out of the voting configuration so its vote is no longer required, and will never move a retired -node back into the configuration. Once a node has been retired and is removed -from the configuration, it is safe to shut it down. A node can be retired using -the retirement API: +node back into the voting configuration. Once a node has been successfully +retired, it is safe to shut it down. A node can be retired using the following +API: [source,js] -------------------------------------------------- @@ -170,11 +156,14 @@ POST /_nodes/node_name/retire?timeout=1m The node to retire is specified using <> in place of `node_name` here. If a call to the retirement API fails then the call can -safely be retried. However if a retirement fails then it's possible the node -cannot be removed from the configuration due to the -`cluster.master_nodes_failure_tolerance` setting, so verify that this is set -correctly first. A successful response guarantees that the node has been -removed from the configuration and will not be reinstated. +safely be retried. A successful response guarantees that the node has been +removed from the voting configuration and will not be reinstated. 
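+
+Because a failed call to this API can safely be retried, a thin client-side
+wrapper can simply repeat the request until it succeeds. The following sketch
+is only an illustration (it is not part of Elasticsearch): it drives the
+retirement API described above from Python using the `requests` library, and
+the helper name, node name and address are invented for this example.
+
+[source,python]
+--------------------------------------------------
+import time
+import requests
+
+def retire_node(base_url, node_name, timeout="1m", attempts=5):
+    """Retire a node, retrying on failure; a 2xx response means the node is
+    out of the voting configuration and can safely be shut down."""
+    url = f"{base_url}/_nodes/{node_name}/retire"
+    for attempt in range(attempts):
+        try:
+            response = requests.post(url, params={"timeout": timeout})
+            if response.ok:
+                return response.json()
+        except requests.ConnectionError:
+            pass  # the contacted node may be briefly unreachable
+        time.sleep(2 ** attempt)  # back off before retrying
+    raise RuntimeError(f"could not retire node {node_name}")
+
+# retire_node("http://10.0.12.1:9200", "old-master")
+--------------------------------------------------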
+ +Although the retirement API is most useful for removing one node from a +two-node cluster, it is also possible to use it to remove nodes from larger +clusters. In the example described above, shrinking a seven-master-node cluster +down to only have three master nodes, you could retire four of the nodes and +then shut them down simultaneously. A node (or nodes) can be brought back out of retirement using the `unretire` API: @@ -187,7 +176,8 @@ POST /_nodes/node_name/unretire The node (or nodes) to reinstate are specified using <> in place of `node_name` here. After being brought back out of -retirement they may not immediately be added to the configuration. +retirement they might or might not immediately be added to the voting +configuration. The current set of retired nodes is stored in the cluster state and can be inspected as follows: @@ -260,7 +250,6 @@ looks like this: [source,js] -------------------------------------------------- { - "master_nodes_failure_tolerance": 1, "master_nodes":[ {"id":"USpTGYaBSIKbgSUJR2Z9lg","name":"master-a"}, {"id":"gSUJR2Z9lgUSpTGYaBSIKb","name":"master-b"}, @@ -278,7 +267,6 @@ the `POST /_cluster/bootstrap` API: # send the bootstrap warrant back to the cluster POST /_cluster/bootstrap { - "master_nodes_failure_tolerance": 1, "master_nodes":[ {"id":"USpTGYaBSIKbgSUJR2Z9lg","name":"master-a"}, {"id":"gSUJR2Z9lgUSpTGYaBSIKb","name":"master-b"}, @@ -295,7 +283,7 @@ important** to use the same bootstrap warrant in each call. WARNING: You must pass the same bootstrap warrant to each call to `POST /_cluster/bootstrap` in order to be sure that only a single cluster forms -during bootstrapping. +during bootstrapping and therefore to avoid the risk of data loss. The simplest and safest way to construct a bootstrap warrant is to use the `GET /_cluster/bootstrap` API: @@ -325,7 +313,6 @@ their IDs too: # send the bootstrap warrant back to the cluster POST /_cluster/bootstrap { - "master_nodes_failure_tolerance": 1, "master_nodes":[ {"name":"master-a"}, {"name":"master-b"}, @@ -343,15 +330,13 @@ command-line tool: [source,txt] -------------------------------------------------- -$ bin/elasticsearch-bootstrap-cluster --failure-tolerance 1 \ - --node http://10.0.12.1:9200/ --node http://10.0.13.1:9200/ \ - --node https://10.0.14.1:9200/ +$ bin/elasticsearch-bootstrap-cluster --node http://10.0.12.1:9200/ \ + --node http://10.0.13.1:9200/ --node https://10.0.14.1:9200/ -------------------------------------------------- -The arguments to this tool are the target failure tolerance of the cluster and -the addresses of (some, preferably all, of) its master-eligible nodes. The tool -will construct a bootstrap warrant and then bootstrap the cluster, retrying -safely if any step fails. +The arguments to this tool are the addresses of (some, preferably all, of) its +master-eligible nodes. The tool will construct a bootstrap warrant and then +bootstrap the cluster, retrying safely if any step fails. [float] === Unsafe disaster recovery @@ -368,13 +353,13 @@ POST /_cluster/force_local_node_takeover -------------------------------------------------- // CONSOLE -This works by reducing `cluster.master_nodes_failure_tolerance` to 0 and then -forcibly overriding the current configuration with one in which the handling -node is the only voting master, so that it forms a quorum on its own. Because -there is a risk of data loss when performing this command it requires the -`accept_data_loss` parameter to be set to `true` in the URL. 
Afterwards, once -the cluster has successfully formed, `cluster.master_nodes_failure_tolerance` -should be increased to a suitable value. +This works by forcibly overriding the current voting configuration with one in +which the handling node is the only voting master, so that it forms a quorum on +its own. Because there is a risk of data loss when performing this command it +requires the `accept_data_loss` parameter to be set to `true` in the URL. +Afterwards, once the cluster has successfully formed, +`cluster.master_nodes_failure_tolerance` should be increased to a suitable +value. [float] === Election scheduling From d8ec40b06b58e1ba2c2e504122e9a21a382f2feb Mon Sep 17 00:00:00 2001 From: David Turner Date: Thu, 25 Oct 2018 16:25:45 +0100 Subject: [PATCH 010/106] Move/rework section on cluster maintenance --- docs/reference/modules/coordination.asciidoc | 135 +++++++++---------- 1 file changed, 67 insertions(+), 68 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 7b5a0aaa9afa1..66e029786746d 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -31,6 +31,28 @@ currently in the cluster, but there are some situations in which they may be different. As long as more than half of the nodes in the voting configuration are still healthy then the cluster can still make progress. +[float] +=== Cluster maintenance, rolling restarts and migrations + +Many cluster maintenance tasks involve temporarily shutting down one or more +nodes and then starting them back up again. By default Elasticsearch can remain +available if one of its master-eligible nodes is taken offline, such as during +a <>, and if multiple nodes are stopped and +then started again then it will automatically recover, such as during a +<>. There is no need to take any further +action with the APIs described here in these cases, because the set of master +nodes is not changing permanently. + +It is also possible to perform a migration of a cluster onto entirely new nodes +without taking the cluster offline, via a _rolling migration_. A rolling +migration is similar to a rolling restart, in that it is performed one node at +a time, and also requires no special handling for the master-eligible nodes as +long as there are at least two of them at all times. + +TODO the above is only true if the maintenance happens slowly enough, otherwise +the configuration might not catch up. Need to add this to the rolling restart +docs. + [float] ==== Auto-reconfiguration @@ -49,8 +71,8 @@ Larger voting configurations are usually more resilient, so Elasticsearch will normally prefer to add nodes to the voting configuration once they have joined the cluster. Similarly, if a node in the voting configuration leaves the cluster and there is another node in the cluster that is not in the voting -configuration then it is preferable to swap these two nodes in the voting -configuration, leaving its size unchanged but increasing its resilience. +configuration then it is preferable to swap these two nodes over, leaving the +size of the voting configuration unchanged but increasing its resilience. It is not so straightforward to automatically remove nodes from the voting configuration after they have left the cluster, and different strategies have @@ -65,19 +87,21 @@ cluster will be used and is controlled by the following setting. 
automatically shrinks; departed nodes must be removed manually using the retirement API described below. -If `cluster.automatically_shrink_voting_configuration` is set to `true`, the -recommended and default setting, and there are at least three master-eligible -nodes in the cluster, then Elasticsearch remains capable of processing -cluster-state updates as long as all but one of its master-eligible nodes are -healthy. There are situations in which it might tolerate the loss of multiple -nodes, but this is not guaranteed under all sequences of cascading failures. If -this setting is set to `false` then departed nodes must be removed from the -voting configuration manually, using the retirement API described below, to -achieve the desired level of resilience. +NOTE: If `cluster.automatically_shrink_voting_configuration` is set to `true`, +the recommended and default setting, and there are at least three +master-eligible nodes in the cluster, then Elasticsearch remains capable of +processing cluster-state updates as long as all but one of its master-eligible +nodes are healthy. + +There are situations in which Elasticsearch might tolerate the loss of multiple +nodes, but this is not guaranteed under all sequences of failures. If this +setting is set to `false` then departed nodes must be removed from the voting +configuration manually, using the retirement API described below, to achieve +the desired level of resilience. Note that Elasticsearch will not suffer from a "split-brain" inconsistency however it is configured. This setting only affects its availability in the -event of some node failures. +event of the failure of some of its nodes. [float] ==== Even numbers of master-eligible nodes @@ -97,8 +121,7 @@ side could make any progress in this situation. It is recommended to have a small, fixed, number of master-eligible nodes in a cluster, and to scale the cluster up and down by adding and removing non-master-eligible nodes only. However there are situations in which it may be -necessary to add some master-eligible nodes to a cluster, such as when -migrating a cluster onto a new set of nodes without downtime. +desirable to add extra master-eligible nodes to a cluster. If you wish to add some master-eligible nodes to your cluster, simply configure the new nodes to find the existing cluster and start them up. Elasticsearch @@ -111,37 +134,25 @@ so. It is recommended to have a small, fixed, number of master-eligible nodes in a cluster, and to scale the cluster up and down by adding and removing non-master-eligible nodes only. However there are situations in which it may be -necessary to remove some master-eligible nodes to a cluster, such as when -migrating a cluster onto a new set of nodes without downtime. +desirable to remove some master-eligible nodes from a cluster. You must be careful not to remove too many master-eligible nodes all at the same time. For instance, if you currently have seven master-eligible nodes and you wish to reduce this to three, you cannot simply stop four of the nodes all at the same time: to do so would leave only three nodes remaining, which is less than half of the voting configuration, which means it cannot take any -further actions. 
You should remove the nodes one-at-a-time and verify that -each node has been removed from the voting configuration before moving onto the -next one, using the `await_removal` API: - -[source,js] --------------------------------------------------- -# Explicit timeout of one minute -GET /_nodes/node_name/await_removal?timeout=1m -# Default timeout of 30 seconds -GET /_nodes/node_name/await_removal --------------------------------------------------- -// CONSOLE - -The node (or nodes) for whose removal to wait are specified using -<> in place of `node_name` here. - -A special case of this is the case where there are only two master-eligible -nodes and you wish to remove one of them. In this case neither node can be -safely shut down since both nodes are required to reliably make progress, so -you must first explicitly _retire_ one of the nodes. A retired node still works -normally, but Elasticsearch will try and transfer it out of the voting -configuration so its vote is no longer required, and will never move a retired -node back into the voting configuration. Once a node has been successfully +further actions. + +As long as there are at least three master-eligible nodes in the cluster, as a +general rule it is best to remove nodes one-at-a-time, allowing enough time for +the auto-reconfiguration to take effect after each removal. + +If there are only two master-eligible nodes then neither node can be safely +removed since both are required to reliably make progress, so you must first +explicitly _retire_ one of the nodes. A retired node still works normally, but +Elasticsearch will try and remove it from the voting configuration so its vote +is no longer required, and will never move a retired node back into the voting +configuration after it has been removed. Once a node has been successfully retired, it is safe to shut it down. A node can be retired using the following API: @@ -159,11 +170,15 @@ of `node_name` here. If a call to the retirement API fails then the call can safely be retried. A successful response guarantees that the node has been removed from the voting configuration and will not be reinstated. -Although the retirement API is most useful for removing one node from a -two-node cluster, it is also possible to use it to remove nodes from larger -clusters. In the example described above, shrinking a seven-master-node cluster -down to only have three master nodes, you could retire four of the nodes and -then shut them down simultaneously. +Although the retirement API is most useful for removing a node from a two-node +cluster, it is also possible to use it to remove nodes from larger clusters. If +removing multiple nodes from a cluster it is important not to remove too many +voting nodes too quickly, so that the voting configuration can be updated +between each removal, and this can be achieved by retiring the nodes too to +obtain confirmation that they are no longer in the voting configuration. In +the example described above, shrinking a seven-master-node cluster down to only +have three master nodes, you could retire four of the nodes and then shut them +down simultaneously. 
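+
+The arithmetic behind the one-at-a-time rule can be sketched as follows. This
+is only a rough model for illustration, not Elasticsearch code: it assumes the
+voting configuration simply tracks the master-eligible nodes, staying at an odd
+size of at least three, as described above.
+
+[source,python]
+--------------------------------------------------
+def majority(n):
+    """The smallest number of votes that is more than half of n."""
+    return n // 2 + 1
+
+def voting_configuration_size(master_nodes):
+    """Rough model of auto-reconfiguration: odd size, never below three."""
+    size = max(master_nodes, 3)
+    return size if size % 2 == 1 else size - 1
+
+# Stopping four of seven master nodes at once strands the three survivors,
+# because a seven-node voting configuration needs at least four votes:
+assert majority(7) == 4
+
+# Removing one node at a time, and waiting for the configuration to catch up
+# after each step, keeps a majority available throughout:
+masters = 7
+while masters > 3:
+    masters -= 1                                 # take one node offline
+    config = voting_configuration_size(masters)  # wait for auto-reconfiguration
+    assert masters >= majority(config)
+--------------------------------------------------
+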
A node (or nodes) can be brought back out of retirement using the `unretire` API: @@ -188,32 +203,16 @@ GET /_cluster/state?filter_path=TODO -------------------------------------------------- // CONSOLE -[float] -=== Rolling restarts and migrations +This set is limited in size by the following setting: -It is possible to perform some cluster maintenance tasks without taking the -whole cluster offline, such as a <>. A -rolling restart does not require any special handling for the master-eligible -nodes or any use of the APIs described here. During a rolling restart the -restarting node will be offline, and this reduces the cluster's ability to -tolerate faults of its other nodes. If it is necessary to avoid this, you can -temporarily add a new master node to the cluster as described above, perform -the rolling restart, and then remove the extra master node again. +`cluster.max_retired_nodes`:: -It is also possible to perform a migration of a cluster onto entirely new nodes -without taking the cluster offline. A _rolling migration_ is similar to a -rolling restart, in that it is performed one node at a time, and also requires -no special handling for the master-eligible nodes. - -Alternatively a migration can be performed by starting up all the new nodes at -once, each configured to join the existing cluster, migrating all the data -using <>, and then -shutting down the old nodes. Care must be taken not to shut the master-eligible -nodes down too quickly since a majority of the voting configuration is always -required to keep the cluster alive. This can be done with the `GET -/_nodes/node_name/await_removal` API described above, or else the old -master-eligible nodes may all be retired so that they do not form part of the -voting configuration. + Sets a limits on the number of retired nodes at any one time. Defaults to + `10`. + +Because there can only be a limited number of retired nodes at once, once a +retired node has been destroyed its entry should be removed from the set of +retired nodes using the unretire API. [float] === Cluster bootstrapping From a498e4296922595391627eabe5bdd252dd2262da Mon Sep 17 00:00:00 2001 From: David Turner Date: Thu, 25 Oct 2018 16:49:26 +0100 Subject: [PATCH 011/106] More rewording --- docs/reference/modules/coordination.asciidoc | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 66e029786746d..77528b7e40b34 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -58,8 +58,10 @@ docs. Nodes may join or leave the cluster, and Elasticsearch reacts by making corresponding changes to the voting configuration in order to ensure that the -cluster is as resilient as possible. The current voting configuration is -stored in the cluster state so you can inspect its current contents as follows: +cluster is as resilient as possible. The default auto-reconfiguration behaviour +is expected to give the best results for almost all use-cases. The current +voting configuration is stored in the cluster state so you can inspect its +current contents as follows: [source,js] -------------------------------------------------- @@ -136,12 +138,12 @@ cluster, and to scale the cluster up and down by adding and removing non-master-eligible nodes only. However there are situations in which it may be desirable to remove some master-eligible nodes from a cluster. 
-You must be careful not to remove too many master-eligible nodes all at the -same time. For instance, if you currently have seven master-eligible nodes and -you wish to reduce this to three, you cannot simply stop four of the nodes all -at the same time: to do so would leave only three nodes remaining, which is -less than half of the voting configuration, which means it cannot take any -further actions. +It is important not to remove too many master-eligible nodes all at the same +time. For instance, if there are currently seven master-eligible nodes and you +wish to reduce this to three, it is not possible simply to stop four of the +nodes all at the same time: to do so would leave only three nodes remaining, +which is less than half of the voting configuration, which means the cluster +cannot take any further actions. As long as there are at least three master-eligible nodes in the cluster, as a general rule it is best to remove nodes one-at-a-time, allowing enough time for From eb0aa2f830877cdc1e4e8bca3f0c7b56567d22e0 Mon Sep 17 00:00:00 2001 From: David Turner Date: Thu, 25 Oct 2018 18:43:49 +0100 Subject: [PATCH 012/106] Moar reword --- docs/reference/modules/coordination.asciidoc | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 77528b7e40b34..28b78510305b3 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -11,12 +11,13 @@ Electing a master node and changing the cluster state are the two fundamental tasks that master-eligible nodes must work together to perform. It is important that these activities work robustly even if some nodes have failed, and Elasticsearch achieves this robustness by only considering each action to have -succeeded on receipt of responses from a majority of the master-eligible nodes -in the cluster. The advantage of requiring a majority of nodes to respond is -that it allows for nearly half of the master-eligible nodes to fail without -preventing the cluster from making progress, but also does not allow the -cluster to "split brain", i.e. to be partitioned into two pieces each of which -may make decisions that are inconsistent with those of the other piece. +succeeded on receipt of responses from a _quorum_, a subset of the +master-eligible nodes in the cluster. The advantage of requiring only a subset +of the nodes to respond is that it allows for some of the nodes to fail without +preventing the cluster from making progress, and the quorums are carefully +chosen so as not to allow the cluster to "split brain", i.e. to be partitioned +into two pieces each of which may make decisions that are inconsistent with +those of the other piece. Elasticsearch allows you to add and remove master-eligible nodes to a running cluster. 
In many cases you can do this simply by starting or stopping the nodes From ebcbe3c7b378455877672225db91957d30f9a1d1 Mon Sep 17 00:00:00 2001 From: David Turner Date: Wed, 31 Oct 2018 10:41:24 +0000 Subject: [PATCH 013/106] Review feedback --- docs/reference/modules/coordination.asciidoc | 222 ++++++++++--------- 1 file changed, 115 insertions(+), 107 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 28b78510305b3..eee10c811af8e 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -48,7 +48,7 @@ It is also possible to perform a migration of a cluster onto entirely new nodes without taking the cluster offline, via a _rolling migration_. A rolling migration is similar to a rolling restart, in that it is performed one node at a time, and also requires no special handling for the master-eligible nodes as -long as there are at least two of them at all times. +long as there are at least two of them available at all times. TODO the above is only true if the maintenance happens slowly enough, otherwise the configuration might not catch up. Need to add this to the rolling restart @@ -60,9 +60,9 @@ docs. Nodes may join or leave the cluster, and Elasticsearch reacts by making corresponding changes to the voting configuration in order to ensure that the cluster is as resilient as possible. The default auto-reconfiguration behaviour -is expected to give the best results for almost all use-cases. The current -voting configuration is stored in the cluster state so you can inspect its -current contents as follows: +is expected to give the best results in most situation. The current voting +configuration is stored in the cluster state so you can inspect its current +contents as follows: [source,js] -------------------------------------------------- @@ -70,19 +70,29 @@ GET /_cluster/state?filter_path=TODO -------------------------------------------------- // CONSOLE +NOTE: The current voting configuration is not necessarily the same as the set +of all available master-eligible nodes in the cluster. Altering the voting +configuration itself involves taking a vote, so it takes some time to adjust +the configuration as nodes join or leave the cluster. Also, there are +situations where the most resilient configuration includes unavailable nodes, +or does not include some available nodes, and in these situations the voting +configuration will differ from the set of available master-eligible nodes in +the cluster. + Larger voting configurations are usually more resilient, so Elasticsearch will -normally prefer to add nodes to the voting configuration once they have joined -the cluster. Similarly, if a node in the voting configuration leaves the -cluster and there is another node in the cluster that is not in the voting -configuration then it is preferable to swap these two nodes over, leaving the -size of the voting configuration unchanged but increasing its resilience. +normally prefer to add master-eligible nodes to the voting configuration once +they have joined the cluster. Similarly, if a node in the voting configuration +leaves the cluster and there is another master-eligible node in the cluster +that is not in the voting configuration then it is preferable to swap these two +nodes over, leaving the size of the voting configuration unchanged but +increasing its resilience. 
It is not so straightforward to automatically remove nodes from the voting configuration after they have left the cluster, and different strategies have different benefits and drawbacks, so the right choice depends on how the cluster will be used and is controlled by the following setting. -`cluster.automatically_shrink_voting_configuration`:: +`cluster.auto_shrink_voting_configuration`:: Defaults to `true`, meaning that the voting configuration will automatically shrink, shedding departed nodes, as long as it still contains @@ -90,11 +100,11 @@ cluster will be used and is controlled by the following setting. automatically shrinks; departed nodes must be removed manually using the retirement API described below. -NOTE: If `cluster.automatically_shrink_voting_configuration` is set to `true`, -the recommended and default setting, and there are at least three -master-eligible nodes in the cluster, then Elasticsearch remains capable of -processing cluster-state updates as long as all but one of its master-eligible -nodes are healthy. +NOTE: If `cluster.auto_shrink_voting_configuration` is set to `true`, the +recommended and default setting, and there are at least three master-eligible +nodes in the cluster, then Elasticsearch remains capable of processing +cluster-state updates as long as all but one of its master-eligible nodes are +healthy. There are situations in which Elasticsearch might tolerate the loss of multiple nodes, but this is not guaranteed under all sequences of failures. If this @@ -104,7 +114,8 @@ the desired level of resilience. Note that Elasticsearch will not suffer from a "split-brain" inconsistency however it is configured. This setting only affects its availability in the -event of the failure of some of its nodes. +event of the failure of some of its nodes, and the administrative tasks that +must be performed as nodes join and leave the cluster. [float] ==== Even numbers of master-eligible nodes @@ -118,31 +129,35 @@ a majority of the voting configuration and will be able to keep operating, whereas if all of the master-eligible nodes' votes were counted then neither side could make any progress in this situation. +For instance if there are four master-eligible nodes in the cluster and the +voting configuration contained all of them then any quorum-based decision would +require votes from at least three of them, which means that the cluster can +only tolerate the loss of a single master-eligible node. If this cluster were +split into two equal halves then neither half would contain three +master-eligible nodes so would not be able to make any progress. However if the +voting configuration contains only three of the four master-eligible nodes then +the cluster is still only fully tolerant to the loss of one node, but +quorum-based decisions require votes from two of the three voting nodes. In the +event of an even split, one half will contain two of the three voting nodes so +will remain available. + [float] -==== Adding master-eligible nodes +==== Adding and removing master-eligible nodes -It is recommended to have a small, fixed, number of master-eligible nodes in a -cluster, and to scale the cluster up and down by adding and removing +It is recommended to have a small and fixed number of master-eligible nodes in +a cluster, and to scale the cluster up and down by adding and removing non-master-eligible nodes only. However there are situations in which it may be -desirable to add extra master-eligible nodes to a cluster. 
+desirable to add or remove some master-eligible nodes to or from a cluster. If you wish to add some master-eligible nodes to your cluster, simply configure the new nodes to find the existing cluster and start them up. Elasticsearch will add the new nodes to the voting configuration if it is appropriate to do so. -[float] -==== Removing master-eligible nodes - -It is recommended to have a small, fixed, number of master-eligible nodes in a -cluster, and to scale the cluster up and down by adding and removing -non-master-eligible nodes only. However there are situations in which it may be -desirable to remove some master-eligible nodes from a cluster. - -It is important not to remove too many master-eligible nodes all at the same -time. For instance, if there are currently seven master-eligible nodes and you -wish to reduce this to three, it is not possible simply to stop four of the -nodes all at the same time: to do so would leave only three nodes remaining, +When removing master-eligible nodes, it is important not to remove too many all +at the same time. For instance, if there are currently seven master-eligible +nodes and you wish to reduce this to three, it is not possible simply to stop +four of the nodes at once: to do so would leave only three nodes remaining, which is less than half of the voting configuration, which means the cluster cannot take any further actions. @@ -174,14 +189,12 @@ safely be retried. A successful response guarantees that the node has been removed from the voting configuration and will not be reinstated. Although the retirement API is most useful for removing a node from a two-node -cluster, it is also possible to use it to remove nodes from larger clusters. If -removing multiple nodes from a cluster it is important not to remove too many -voting nodes too quickly, so that the voting configuration can be updated -between each removal, and this can be achieved by retiring the nodes too to -obtain confirmation that they are no longer in the voting configuration. In -the example described above, shrinking a seven-master-node cluster down to only -have three master nodes, you could retire four of the nodes and then shut them -down simultaneously. +cluster, it is also possible to use it to remove multiple nodes from larger +clusters all at the same time. Retiring a node, or a set of nodes, confirms +that it is no longer part of the voting configuration and can therefore safely +be shut down. In the example described above, shrinking a seven-master-node +cluster down to only have three master nodes, you could retire four of the +nodes and then shut them down simultaneously. A node (or nodes) can be brought back out of retirement using the `unretire` API: @@ -220,53 +233,47 @@ retired nodes using the unretire API. [float] === Cluster bootstrapping -There is a risk when starting up a brand-new cluster is that you accidentally -form two separate clusters instead of one. This could lead to data loss: you -might start using both clusters before noticing that anything had gone wrong, -and it will then be impossible to merge them together later. - -NOTE: To illustrate how this could happen, imagine starting up a three-node -cluster in which each node knows that it is going to be part of a three-node -cluster. A majority of three nodes is two, so normally the first two nodes to -discover each other will form a cluster and the third node will join them a -short time later. 
However, imagine that four nodes were accidentally started
-instead of three: in this case there are enough nodes to form two separate
-clusters. Of course if each node is started manually then it's unlikely that
-too many nodes are started, but it's certainly possible to get into this
-situation if using a more automated orchestrator, particularly if a network
-partition happens at the wrong time.
-
-We avoid this by requiring a separate _cluster bootstrapping_ process to take
-place on every brand-new cluster. This is only required the very first time the
+When a brand-new cluster starts up for the first time, one of the tasks it must
+perform is to elect its first master node, for which it needs to know the set
+of master-eligible nodes whose votes should count in this first election. This
+initial voting configuration is known as the _bootstrap configuration_.
+
+It is important that the bootstrap configuration identifies exactly which nodes
+should vote in the first election, and it is not sufficient to configure each
+node with an expectation of how many nodes there should be in the cluster. It
+is also important to note that the bootstrap configuration must come from
+outside the cluster: there is no safe way for the cluster to determine the
+bootstrap configuration correctly on its own.
+
+If the bootstrap configuration is not set correctly then there is a risk that,
+when starting up a brand-new cluster, you accidentally form two separate
+clusters instead of one. This could lead to data loss: you might start using
+both clusters before noticing that anything had gone wrong, and it will then be
+impossible to merge them together later.
+
+NOTE: To illustrate the problem with configurting each node to expect a certain
+cluster size, imagine starting up a three-node cluster in which each node knows
+that it is going to be part of a three-node cluster. A majority of three nodes
+is two, so normally the first two nodes to discover each other will form a
+cluster and the third node will join them a short time later. However, imagine
+that four nodes were erroneously started instead of three: in this case there
+are enough nodes to form two separate clusters. Of course if each node is
+started manually then it's unlikely that too many nodes are started, but it's
+certainly possible to get into this situation if using a more automated
+orchestrator, particularly if the orchestrator is not resilient to failures
+such as network partitions.
+
+The cluster bootstrapping process is only required the very first time a
 whole cluster starts up: new nodes joining an established cluster can safely
 obtain all the information they need from the elected master, and nodes that
 have previously been part of a cluster will have stored to disk all the
 information required when restarting.

-A cluster can be bootstrapped by sending a _bootstrap warrant_ to any of its
-master-eligible nodes. 
A bootstrap warrant is a document that contains the -information that the cluster needs to finish forming, including the identities -of the master-eligible nodes that form its first voting configuration, and -looks like this: - -[source,js] --------------------------------------------------- -{ - "master_nodes":[ - {"id":"USpTGYaBSIKbgSUJR2Z9lg","name":"master-a"}, - {"id":"gSUJR2Z9lgUSpTGYaBSIKb","name":"master-b"}, - {"id":"2Z9lgUSpTgSUYaBSIKbJRG","name":"master-c"} - ] -} --------------------------------------------------- - -To bootstrap a cluster, the administrator must identify a suitable set of -master-eligible nodes, construct a bootstrap warrant, and pass the warrant to -the `POST /_cluster/bootstrap` API: +A cluster can be bootstrapped by sending the _bootstrap configuration_ to any +of its master-eligible nodes via the `POST /_cluster/bootstrap` API: [source,js] -------------------------------------------------- -# send the bootstrap warrant back to the cluster POST /_cluster/bootstrap { "master_nodes":[ @@ -281,38 +288,37 @@ POST /_cluster/bootstrap This only needs to occur once, on a single master-eligible node in the cluster, but for robustness it is safe to repeatedly call `POST /_cluster/bootstrap`, and to call it on different nodes concurrently. However **it is vitally -important** to use the same bootstrap warrant in each call. +important** to use exactly the same bootstrap configuration in each call. -WARNING: You must pass the same bootstrap warrant to each call to `POST -/_cluster/bootstrap` in order to be sure that only a single cluster forms +WARNING: You must pass exactly the same bootstrap configuration to each call to +`POST /_cluster/bootstrap` in order to be sure that only a single cluster forms during bootstrapping and therefore to avoid the risk of data loss. -The simplest and safest way to construct a bootstrap warrant is to use the `GET -/_cluster/bootstrap` API: +The simplest and safest way to construct a bootstrap configuration is to use +the `GET /_cluster/bootstrap` API: [source,js] -------------------------------------------------- -# Immediately return a bootstrap warrant based on the nodes discovered so far +# Immediately return a bootstrap configuration based on the nodes discovered so far GET /_cluster/bootstrap -# Wait until the node has discovered at least 3 nodes, or 60 seconds has elapsed, -# and then return the resulting bootstrap warrant +# Wait for up to 60 seconds until the node has discovered at least 3 nodes, +# before returning a bootstrap configuration GET /_cluster/bootstrap?wait_for_nodes=3&timeout=60s -------------------------------------------------- // CONSOLE -This API returns a properly-constructed bootstrap warrant that is ready to pass -to the `POST /_cluster/bootstrap` API. It includes all of the master-eligible -nodes that the handling node has discovered via the gossip-based discovery -protocol, and returns an error if fewer nodes have been discovered than -expected. +This API returns a properly-constructed bootstrap configuration that is ready +to pass back to the `POST /_cluster/bootstrap` API. It includes all of the +master-eligible nodes that the handling node has discovered via the +gossip-based discovery protocol, and returns an error if fewer nodes have been +discovered than expected. 
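+
+For illustration only, the two steps described above can be driven from a
+simple external script. The sketch below is not part of Elasticsearch: it uses
+the Python `requests` library against the draft APIs described in this
+section, and the helper name and node addresses are invented for this example.
+
+[source,python]
+--------------------------------------------------
+import requests
+
+def bootstrap_cluster(node_urls, expected_masters=3):
+    """Fetch one bootstrap configuration, then send that same document to
+    every master-eligible node; repeated calls are safe."""
+    # Construct the configuration once, from a single node.
+    response = requests.get(
+        f"{node_urls[0]}/_cluster/bootstrap",
+        params={"wait_for_nodes": expected_masters, "timeout": "60s"},
+    )
+    response.raise_for_status()
+    bootstrap_configuration = response.json()
+
+    # Send exactly the same document to each node.
+    for url in node_urls:
+        requests.post(f"{url}/_cluster/bootstrap", json=bootstrap_configuration)
+
+# bootstrap_cluster(["http://10.0.12.1:9200", "http://10.0.13.1:9200",
+#                    "https://10.0.14.1:9200"])
+--------------------------------------------------
+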
-It is also possible to construct a bootstrap warrant manually and to specify -the initial set of nodes in terms of their names alone, rather than including -their IDs too: +It is also possible to construct a bootstrap configuration manually and to +specify the initial set of nodes in terms of their names alone, rather than +including their IDs too: [source,js] -------------------------------------------------- -# send the bootstrap warrant back to the cluster POST /_cluster/bootstrap { "master_nodes":[ @@ -327,8 +333,12 @@ POST /_cluster/bootstrap It is safer to include the node IDs, in case two nodes are accidentally started with the same name. -This process is implemented in the `elasticsearch-bootstrap-cluster` -command-line tool: +[float] +==== Cluster bootstrapping tool + +A simpler way to bootstrap a cluster is to use the +`elasticsearch-bootstrap-cluster` command-line tool which implements the +process described here: [source,txt] -------------------------------------------------- @@ -346,8 +356,9 @@ bootstrap the cluster, retrying safely if any step fails. In a disaster situation a cluster may have lost half or more of its master-eligible nodes and therefore be in a state in which it cannot elect a master. There is no way to recover from this situation without risking data -loss, but if there is no other viable path forwards then this may be necessary. -This can be performed with the following command on a surviving node: +loss (including the loss of indexed documents) but if there is no other viable +path forwards then this may be necessary. This can be performed with the +following command on a surviving node: [source,js] -------------------------------------------------- @@ -355,13 +366,10 @@ POST /_cluster/force_local_node_takeover -------------------------------------------------- // CONSOLE -This works by forcibly overriding the current voting configuration with one in -which the handling node is the only voting master, so that it forms a quorum on -its own. Because there is a risk of data loss when performing this command it -requires the `accept_data_loss` parameter to be set to `true` in the URL. -Afterwards, once the cluster has successfully formed, -`cluster.master_nodes_failure_tolerance` should be increased to a suitable -value. +This forcibly overrides the current voting configuration with one in which the +handling node is the only voting master, so that it forms a quorum on its own. +Because there is a risk of data loss when performing this command it requires +the `accept_data_loss` parameter to be set to `true` in the URL. [float] === Election scheduling From 27a9ffbb7e17ac2255c6960f65db36d9efa9d1cd Mon Sep 17 00:00:00 2001 From: David Turner Date: Wed, 31 Oct 2018 10:42:37 +0000 Subject: [PATCH 014/106] Split sentence --- docs/reference/modules/coordination.asciidoc | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index eee10c811af8e..75dbddd30494f 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -38,11 +38,11 @@ are still healthy then the cluster can still make progress. Many cluster maintenance tasks involve temporarily shutting down one or more nodes and then starting them back up again. 
By default Elasticsearch can remain available if one of its master-eligible nodes is taken offline, such as during -a <>, and if multiple nodes are stopped and -then started again then it will automatically recover, such as during a -<>. There is no need to take any further -action with the APIs described here in these cases, because the set of master -nodes is not changing permanently. +a <>. Furthermore, if multiple nodes are +stopped and then started again then it will automatically recover, such as +during a <>. There is no need to take any +further action with the APIs described here in these cases, because the set of +master nodes is not changing permanently. It is also possible to perform a migration of a cluster onto entirely new nodes without taking the cluster offline, via a _rolling migration_. A rolling From f43414de6ee5c123dcb8a40edc5d52490a48eebb Mon Sep 17 00:00:00 2001 From: David Turner Date: Fri, 2 Nov 2018 08:26:40 +0000 Subject: [PATCH 015/106] Retire -> withdraw vote --- docs/reference/modules/coordination.asciidoc | 99 +++++++++++--------- 1 file changed, 53 insertions(+), 46 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 75dbddd30494f..8421eedd7fc4a 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -98,7 +98,7 @@ cluster will be used and is controlled by the following setting. automatically shrink, shedding departed nodes, as long as it still contains at least 3 nodes. If set to `false`, the voting configuration never automatically shrinks; departed nodes must be removed manually using the - retirement API described below. + vote withdrawal API described below. NOTE: If `cluster.auto_shrink_voting_configuration` is set to `true`, the recommended and default setting, and there are at least three master-eligible @@ -109,7 +109,7 @@ healthy. There are situations in which Elasticsearch might tolerate the loss of multiple nodes, but this is not guaranteed under all sequences of failures. If this setting is set to `false` then departed nodes must be removed from the voting -configuration manually, using the retirement API described below, to achieve +configuration manually, using the vote withdrawal API described below, to achieve the desired level of resilience. Note that Elasticsearch will not suffer from a "split-brain" inconsistency @@ -167,69 +167,76 @@ the auto-reconfiguration to take effect after each removal. If there are only two master-eligible nodes then neither node can be safely removed since both are required to reliably make progress, so you must first -explicitly _retire_ one of the nodes. A retired node still works normally, but -Elasticsearch will try and remove it from the voting configuration so its vote -is no longer required, and will never move a retired node back into the voting -configuration after it has been removed. Once a node has been successfully -retired, it is safe to shut it down. A node can be retired using the following -API: +inform Elasticsearch that one of the nodes should have its vote withdrawn, +transferring all the voting power to the other node and allowing the node with +the withdrawn vote to be taken offline without preventing the other node from +making progress. 
A node whose vote has been withdrawn still works normally, +but Elasticsearch will try and remove it from the voting configuration so its +vote is no longer required, and will never automatically move such a node back +into the voting configuration after it has been removed. Once a node's vote has +been successfully withdrawn, it is safe to shut it down with affecting the +cluster's availability. A node's vote can be withdrawn using the following API: [source,js] -------------------------------------------------- -# Retire node and wait for its removal up to the default timeout of 30 seconds -POST /_nodes/node_name/retire -# Retire node and wait for its removal up to one minute -POST /_nodes/node_name/retire?timeout=1m +# Withdraw vote from node and wait for its removal from the voting +# configuration up to the default timeout of 30 seconds +POST /_nodes/node_name/withdraw_vote +# Withdraw vote from node and wait for its removal from the voting +# configuration up to one minute +POST /_nodes/node_name/withdraw_vote?timeout=1m -------------------------------------------------- // CONSOLE -The node to retire is specified using <> in place -of `node_name` here. If a call to the retirement API fails then the call can -safely be retried. A successful response guarantees that the node has been -removed from the voting configuration and will not be reinstated. - -Although the retirement API is most useful for removing a node from a two-node -cluster, it is also possible to use it to remove multiple nodes from larger -clusters all at the same time. Retiring a node, or a set of nodes, confirms -that it is no longer part of the voting configuration and can therefore safely -be shut down. In the example described above, shrinking a seven-master-node -cluster down to only have three master nodes, you could retire four of the -nodes and then shut them down simultaneously. - -A node (or nodes) can be brought back out of retirement using the `unretire` -API: +The node whose vote should be withdrawn is specified using <> in place of `node_name` here. If a call to the vote withdrawal API +fails then the call can safely be retried. A successful response guarantees +that the node has been removed from the voting configuration and will not be +reinstated. + +Although the vote withdrawal API is most useful for removing a node from a +two-node cluster, it is also possible to use it to remove multiple nodes from +larger clusters all at the same time. Withdrawing the vote from a set of nodes +confirms that this set is no longer part of the voting configuration and can +therefore safely be shut down. In the example described above, shrinking a +seven-master-node cluster down to only have three master nodes, you could +withdraw the vote from four of the nodes and then shut them down +simultaneously. + +Withdrawing the vote from a node creates a _voting tombstone_ for that node, +which prevents it from returning to the voting configuration once it has +removed. The current set of voting tombstones is stored in the cluster state +and can be inspected as follows: [source,js] -------------------------------------------------- -POST /_nodes/node_name/unretire +GET /_cluster/state?filter_path=TODO -------------------------------------------------- // CONSOLE -The node (or nodes) to reinstate are specified using <> in place of `node_name` here. After being brought back out of -retirement they might or might not immediately be added to the voting -configuration. 
+This set is limited in size by the following setting: + +`cluster.max_voting_tombstones`:: + + Sets a limits on the number of voting tombstones at any one time. Defaults + to `10`. -The current set of retired nodes is stored in the cluster state and can be -inspected as follows: +Since voting tombstones are persistent and limited in number, they must be +cleaned up from time to time. If a node's vote is withdrawn because it is to be +shut down permanently then its tombstone can be removed once it is certain +never to return to the cluster. Tombstones can also be removed if they were +created in error or were only required temporarily: [source,js] -------------------------------------------------- -GET /_cluster/state?filter_path=TODO +# Allow the selected nodes back into the voting configuration by removing their +# tombstones +DELETE /_nodes/node_name/withdraw_vote +# Remove all voting tombstones +DELETE /_nodes/_all/withdraw_vote -------------------------------------------------- // CONSOLE -This set is limited in size by the following setting: - -`cluster.max_retired_nodes`:: - - Sets a limits on the number of retired nodes at any one time. Defaults to - `10`. - -Because there can only be a limited number of retired nodes at once, once a -retired node has been destroyed its entry should be removed from the set of -retired nodes using the unretire API. - [float] === Cluster bootstrapping From 1fef44e51c51d465c4a42a366331eebf23ec33c5 Mon Sep 17 00:00:00 2001 From: David Turner Date: Fri, 2 Nov 2018 08:44:32 +0000 Subject: [PATCH 016/106] Typo, and better UUIDs --- docs/reference/modules/coordination.asciidoc | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 8421eedd7fc4a..ba3821ad5aabd 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -258,7 +258,7 @@ clusters instead of one. This could lead to data loss: you might start using both clusters before noticing that anything had gone wrong, and it will then be impossible to merge them together later. -NOTE: To illustrate the problem with configurting each node to expect a certain +NOTE: To illustrate the problem with configuring each node to expect a certain cluster size, imagine starting up a three-node cluster in which each node knows that it is going to be part of a three-node cluster. 
A majority of three nodes is two, so normally the first two nodes to discover each other will form a @@ -284,9 +284,9 @@ of its master-eligible nodes via the `POST /_cluster/bootstrap` API: POST /_cluster/bootstrap { "master_nodes":[ - {"id":"USpTGYaBSIKbgSUJR2Z9lg","name":"master-a"}, - {"id":"gSUJR2Z9lgUSpTGYaBSIKb","name":"master-b"}, - {"id":"2Z9lgUSpTgSUYaBSIKbJRG","name":"master-c"} + {"id":"gAMDNeJRTX6A_VelgSb84g","name":"master-a"}, + {"id":"t3LZCVGxTf-idQIC8z4A1A","name":"master-b"}, + {"id":"GfwXZYVVSFCOWNT0zcDixQ","name":"master-c"} ] } -------------------------------------------------- From 10020dbcd97907a31359af67bf20fd31ad86d2f1 Mon Sep 17 00:00:00 2001 From: David Turner Date: Fri, 2 Nov 2018 09:26:51 +0000 Subject: [PATCH 017/106] Width --- docs/reference/modules/coordination.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index ba3821ad5aabd..ee927b44875ad 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -229,8 +229,8 @@ created in error or were only required temporarily: [source,js] -------------------------------------------------- -# Allow the selected nodes back into the voting configuration by removing their -# tombstones +# Allow the selected nodes back into the voting configuration by +# removing their tombstones DELETE /_nodes/node_name/withdraw_vote # Remove all voting tombstones DELETE /_nodes/_all/withdraw_vote From 05dc68a14ffd0217cf388630331bf02feeb2e4d1 Mon Sep 17 00:00:00 2001 From: David Turner Date: Fri, 2 Nov 2018 09:30:58 +0000 Subject: [PATCH 018/106] Comments & width --- docs/reference/modules/coordination.asciidoc | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index ee927b44875ad..573e46eed1658 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -306,11 +306,16 @@ the `GET /_cluster/bootstrap` API: [source,js] -------------------------------------------------- -# Immediately return a bootstrap configuration based on the nodes discovered so far +# Immediately return a bootstrap configuration based on the nodes +# discovered so far. GET /_cluster/bootstrap -# Wait for up to 60 seconds until the node has discovered at least 3 nodes, -# before returning a bootstrap configuration -GET /_cluster/bootstrap?wait_for_nodes=3&timeout=60s +# Return a bootstrap configuration of at least three nodes, or return an +# error if fewer than three nodes have been discovered. +GET /_cluster/bootstrap?wait_for_nodes=3 +# Return a bootstrap configuration of at least three nodes, waiting for +# up to a minute for this many nodes to be discovered before returning +# an error. 
+GET /_cluster/bootstrap?wait_for_nodes=3&timeout=1m -------------------------------------------------- // CONSOLE From 529a94a55bae03186a8e8a7b2546e31eb1cc95dc Mon Sep 17 00:00:00 2001 From: David Turner Date: Fri, 2 Nov 2018 09:34:56 +0000 Subject: [PATCH 019/106] Reorder --- docs/reference/modules/coordination.asciidoc | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 573e46eed1658..19b33d0c9e883 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -302,7 +302,11 @@ WARNING: You must pass exactly the same bootstrap configuration to each call to during bootstrapping and therefore to avoid the risk of data loss. The simplest and safest way to construct a bootstrap configuration is to use -the `GET /_cluster/bootstrap` API: +the `GET /_cluster/bootstrap` API. This API returns a properly-constructed +bootstrap configuration that is ready to pass back to the `POST +/_cluster/bootstrap` API. It includes all of the master-eligible nodes that the +handling node has discovered via the gossip-based discovery protocol, and can +return an error if fewer nodes have been discovered than expected. [source,js] -------------------------------------------------- @@ -319,12 +323,6 @@ GET /_cluster/bootstrap?wait_for_nodes=3&timeout=1m -------------------------------------------------- // CONSOLE -This API returns a properly-constructed bootstrap configuration that is ready -to pass back to the `POST /_cluster/bootstrap` API. It includes all of the -master-eligible nodes that the handling node has discovered via the -gossip-based discovery protocol, and returns an error if fewer nodes have been -discovered than expected. 
- It is also possible to construct a bootstrap configuration manually and to specify the initial set of nodes in terms of their names alone, rather than including their IDs too: From dd3515927c62e512d7b8707da460ee6c95813bd1 Mon Sep 17 00:00:00 2001 From: David Turner Date: Fri, 2 Nov 2018 09:37:58 +0000 Subject: [PATCH 020/106] Reformat JSON --- docs/reference/modules/coordination.asciidoc | 31 +++++++++++++++----- 1 file changed, 23 insertions(+), 8 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 19b33d0c9e883..a305bf57b6e5f 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -283,10 +283,19 @@ of its master-eligible nodes via the `POST /_cluster/bootstrap` API: -------------------------------------------------- POST /_cluster/bootstrap { - "master_nodes":[ - {"id":"gAMDNeJRTX6A_VelgSb84g","name":"master-a"}, - {"id":"t3LZCVGxTf-idQIC8z4A1A","name":"master-b"}, - {"id":"GfwXZYVVSFCOWNT0zcDixQ","name":"master-c"} + "master_nodes": [ + { + "id": "gAMDNeJRTX6A_VelgSb84g", + "name": "master-a" + }, + { + "id": "t3LZCVGxTf-idQIC8z4A1A", + "name": "master-b" + }, + { + "id": "GfwXZYVVSFCOWNT0zcDixQ", + "name": "master-c" + } ] } -------------------------------------------------- @@ -331,10 +340,16 @@ including their IDs too: -------------------------------------------------- POST /_cluster/bootstrap { - "master_nodes":[ - {"name":"master-a"}, - {"name":"master-b"}, - {"name":"master-c"} + "master_nodes": [ + { + "name": "master-a" + }, + { + "name": "master-b" + }, + { + "name": "master-c" + } ] } -------------------------------------------------- From 40649bd189c59912e8e0bbf93d46311a383312e0 Mon Sep 17 00:00:00 2001 From: David Turner Date: Fri, 2 Nov 2018 09:58:11 +0000 Subject: [PATCH 021/106] Better API for bootstrapping --- docs/reference/modules/coordination.asciidoc | 36 +++++++++++--------- 1 file changed, 20 insertions(+), 16 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index a305bf57b6e5f..9d3c768cbfef9 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -277,11 +277,12 @@ have previously been part of a cluster will have stored to disk all the information required when restarting. A cluster can be bootstrapped by sending the _bootstrap configuration_ to any -of its master-eligible nodes via the `POST /_cluster/bootstrap` API: +of its master-eligible nodes via the `POST /_cluster/bootstrap_configuration` +API: [source,js] -------------------------------------------------- -POST /_cluster/bootstrap +POST /_cluster/bootstrap_configuration { "master_nodes": [ { @@ -302,33 +303,36 @@ POST /_cluster/bootstrap // CONSOLE This only needs to occur once, on a single master-eligible node in the cluster, -but for robustness it is safe to repeatedly call `POST /_cluster/bootstrap`, -and to call it on different nodes concurrently. However **it is vitally -important** to use exactly the same bootstrap configuration in each call. +but for robustness it is safe to repeatedly call `POST +/_cluster/bootstrap_configuration`, and to call it on different nodes +concurrently. However **it is vitally important** to use exactly the same +bootstrap configuration in each call. 
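Editor's note: the calling pattern described above can be sketched in a few lines of client code. The sketch below is illustrative only and is not part of this change; it assumes the `POST /_cluster/bootstrap_configuration` endpoint exactly as documented at this point in the series (the API was still evolving, so the path may differ in released versions), the third-party `requests` package, and the example node addresses and IDs used elsewhere in this document. Because the same body is sent in every call, repeating it or running it against several nodes concurrently is safe, as the text explains.

[source,python]
--------------------------------------------------
import time

import requests

# The bootstrap configuration from the example above; node IDs and names are
# the illustrative values used elsewhere in this document.
BOOTSTRAP_CONFIGURATION = {
    "master_nodes": [
        {"id": "gAMDNeJRTX6A_VelgSb84g", "name": "master-a"},
        {"id": "t3LZCVGxTf-idQIC8z4A1A", "name": "master-b"},
        {"id": "GfwXZYVVSFCOWNT0zcDixQ", "name": "master-c"},
    ]
}

# Example addresses; calling more than one node is safe as long as every call
# uses exactly the same configuration.
NODES = [
    "http://10.0.12.1:9200",
    "http://10.0.13.1:9200",
    "http://10.0.14.1:9200",
]


def bootstrap(address, attempts=5):
    """POST the fixed bootstrap configuration, retrying on transient errors."""
    for attempt in range(attempts):
        try:
            response = requests.post(
                address + "/_cluster/bootstrap_configuration",
                json=BOOTSTRAP_CONFIGURATION,
                timeout=10,
            )
            response.raise_for_status()
            return True
        except requests.RequestException:
            time.sleep(2 ** attempt)  # back off and retry
    return False


for node in NODES:
    print(node, "ok" if bootstrap(node) else "failed")
--------------------------------------------------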
WARNING: You must pass exactly the same bootstrap configuration to each call to -`POST /_cluster/bootstrap` in order to be sure that only a single cluster forms -during bootstrapping and therefore to avoid the risk of data loss. +`POST /_cluster/bootstrap_configuration` in order to be sure that only a single +cluster forms during bootstrapping and therefore to avoid the risk of data +loss. The simplest and safest way to construct a bootstrap configuration is to use -the `GET /_cluster/bootstrap` API. This API returns a properly-constructed -bootstrap configuration that is ready to pass back to the `POST -/_cluster/bootstrap` API. It includes all of the master-eligible nodes that the -handling node has discovered via the gossip-based discovery protocol, and can -return an error if fewer nodes have been discovered than expected. +the `GET /_cluster/bootstrap_configuration` API. This API returns a +properly-constructed bootstrap configuration that is ready to pass back to the +`POST /_cluster/bootstrap_configuration` API. It includes all of the +master-eligible nodes that the handling node has discovered via the +gossip-based discovery protocol, and can return an error if fewer nodes have +been discovered than expected. [source,js] -------------------------------------------------- # Immediately return a bootstrap configuration based on the nodes # discovered so far. -GET /_cluster/bootstrap +GET /_cluster/bootstrap_configuration # Return a bootstrap configuration of at least three nodes, or return an # error if fewer than three nodes have been discovered. -GET /_cluster/bootstrap?wait_for_nodes=3 +GET /_cluster/bootstrap_configuration?min_size=3 # Return a bootstrap configuration of at least three nodes, waiting for # up to a minute for this many nodes to be discovered before returning # an error. -GET /_cluster/bootstrap?wait_for_nodes=3&timeout=1m +GET /_cluster/bootstrap_configuration?min_size=3&timeout=1m -------------------------------------------------- // CONSOLE @@ -338,7 +342,7 @@ including their IDs too: [source,js] -------------------------------------------------- -POST /_cluster/bootstrap +POST /_cluster/bootstrap_configuration { "master_nodes": [ { From e88656cba71ed62b23907ea611c6420d0a376876 Mon Sep 17 00:00:00 2001 From: David Turner Date: Fri, 2 Nov 2018 10:00:53 +0000 Subject: [PATCH 022/106] Rewording --- docs/reference/modules/coordination.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 9d3c768cbfef9..4d7e87bdf0f22 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -318,8 +318,8 @@ the `GET /_cluster/bootstrap_configuration` API. This API returns a properly-constructed bootstrap configuration that is ready to pass back to the `POST /_cluster/bootstrap_configuration` API. It includes all of the master-eligible nodes that the handling node has discovered via the -gossip-based discovery protocol, and can return an error if fewer nodes have -been discovered than expected. 
+gossip-based discovery protocol, and returns an error if fewer nodes have been +discovered than required: [source,js] -------------------------------------------------- From 8de22f188c9900e647926428f6eeb49525ffa54d Mon Sep 17 00:00:00 2001 From: David Turner Date: Mon, 26 Nov 2018 14:19:26 +0000 Subject: [PATCH 023/106] Update APIs --- docs/reference/modules/coordination.asciidoc | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 4d7e87bdf0f22..d257a4fc5a73e 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -66,7 +66,7 @@ contents as follows: [source,js] -------------------------------------------------- -GET /_cluster/state?filter_path=TODO +GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config -------------------------------------------------- // CONSOLE @@ -181,10 +181,10 @@ cluster's availability. A node's vote can be withdrawn using the following API: -------------------------------------------------- # Withdraw vote from node and wait for its removal from the voting # configuration up to the default timeout of 30 seconds -POST /_nodes/node_name/withdraw_vote +POST /_cluster/withdrawn_votes/node_name # Withdraw vote from node and wait for its removal from the voting # configuration up to one minute -POST /_nodes/node_name/withdraw_vote?timeout=1m +POST /_cluster/withdrawn_votes/node_name?timeout=1m -------------------------------------------------- // CONSOLE @@ -210,7 +210,7 @@ and can be inspected as follows: [source,js] -------------------------------------------------- -GET /_cluster/state?filter_path=TODO +GET /_cluster/state?filter_path=metadata.cluster_coordination.voting_tombstones -------------------------------------------------- // CONSOLE @@ -229,11 +229,9 @@ created in error or were only required temporarily: [source,js] -------------------------------------------------- -# Allow the selected nodes back into the voting configuration by -# removing their tombstones -DELETE /_nodes/node_name/withdraw_vote -# Remove all voting tombstones -DELETE /_nodes/_all/withdraw_vote +# Remove all previous withdrawals of votes, allowing any node to return to the +# voting configuration in future. +DELETE /_cluster/withdrawn_votes -------------------------------------------------- // CONSOLE From b54c0c1546efcf9076403303161de3c9159ebc9c Mon Sep 17 00:00:00 2001 From: David Turner Date: Mon, 26 Nov 2018 17:14:15 +0000 Subject: [PATCH 024/106] isn't --- docs/reference/modules/coordination.asciidoc | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index d257a4fc5a73e..a8ea2b8090c3b 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -268,11 +268,11 @@ certainly possible to get into this situation if using a more automated orchestrator, particularly if the orchestrator is not resilient to failures such as network partitions. -The cluster bootstrapping process is is only required the very first time a -whole cluster starts up: new nodes joining an established cluster can safely -obtain all the information they need from the elected master, and nodes that -have previously been part of a cluster will have stored to disk all the -information required when restarting. 
+The cluster bootstrapping process is only required the very first time a whole +cluster starts up: new nodes joining an established cluster can safely obtain +all the information they need from the elected master, and nodes that have +previously been part of a cluster will have stored to disk all the information +required when restarting. A cluster can be bootstrapped by sending the _bootstrap configuration_ to any of its master-eligible nodes via the `POST /_cluster/bootstrap_configuration` From dfd64f9ddaa300e7746ae333988266fff2a53629 Mon Sep 17 00:00:00 2001 From: David Turner Date: Thu, 29 Nov 2018 09:06:24 +0000 Subject: [PATCH 025/106] Add wait_for_removal parameter --- docs/reference/modules/coordination.asciidoc | 19 ++++++++++++++----- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index a8ea2b8090c3b..1c92e865dd70e 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -222,16 +222,25 @@ This set is limited in size by the following setting: to `10`. Since voting tombstones are persistent and limited in number, they must be -cleaned up from time to time. If a node's vote is withdrawn because it is to be -shut down permanently then its tombstone can be removed once it is certain -never to return to the cluster. Tombstones can also be removed if they were -created in error or were only required temporarily: +cleaned up. Normally a vote is withdrawn when performing some maintenance on +the cluster, and the voting tombstones should be cleaned up when the +maintenance is complete. Clusters should have no voting tombstones in normal +operation. + +If a node's vote is withdrawn because it is to be shut down permanently then +its tombstone can be removed once it has shut down and been removed from the +cluster. Tombstones can also be removed if they were created in error or were +only required temporarily: [source,js] -------------------------------------------------- -# Remove all previous withdrawals of votes, allowing any node to return to the +# Wait for all the nodes with withdrawn votes to be removed from the cluster +# and then remove all the voting tombstones, allowing any node to return to the # voting configuration in future. DELETE /_cluster/withdrawn_votes +# Immediately remove all the voting tombstones, allowing any node to return to +# the voting configuration in future. +DELETE /_cluster/withdrawn_votes?wait_for_removal=false -------------------------------------------------- // CONSOLE From 0973de867780d039972bf79c8640c7b897a929c4 Mon Sep 17 00:00:00 2001 From: Yannick Welsch Date: Tue, 4 Dec 2018 18:30:53 +0100 Subject: [PATCH 026/106] rename withdrawal to exclusions --- docs/reference/modules/coordination.asciidoc | 104 ++++++++++--------- 1 file changed, 53 insertions(+), 51 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 1c92e865dd70e..44b7471a46ab6 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -167,80 +167,82 @@ the auto-reconfiguration to take effect after each removal. 
If there are only two master-eligible nodes then neither node can be safely removed since both are required to reliably make progress, so you must first -inform Elasticsearch that one of the nodes should have its vote withdrawn, -transferring all the voting power to the other node and allowing the node with -the withdrawn vote to be taken offline without preventing the other node from -making progress. A node whose vote has been withdrawn still works normally, -but Elasticsearch will try and remove it from the voting configuration so its -vote is no longer required, and will never automatically move such a node back -into the voting configuration after it has been removed. Once a node's vote has -been successfully withdrawn, it is safe to shut it down with affecting the -cluster's availability. A node's vote can be withdrawn using the following API: +inform Elasticsearch that one of the nodes should not be part of the voting +configuration, and that the voting power should instead be given to other +nodes, allowing the excluded node to be taken offline without preventing +the other node from making progress. A node who is added to a voting +configuration exclusion list still works normally, but Elasticsearch will try +and remove it from the voting configuration so its vote is no longer required, +and will never automatically move such a node back into the voting configuration +after it has been removed. Once a node's has been successfully reconfigured out +of the voting configuration, it is safe to shut it down with affecting the +cluster's availability. A node can be added to the voting configuration exclusion +list using the following API: [source,js] -------------------------------------------------- -# Withdraw vote from node and wait for its removal from the voting -# configuration up to the default timeout of 30 seconds -POST /_cluster/withdrawn_votes/node_name -# Withdraw vote from node and wait for its removal from the voting -# configuration up to one minute -POST /_cluster/withdrawn_votes/node_name?timeout=1m +# Add node to voting configuration exclusions list and wait for the system to +# auto-reconfigure the node from the voting configuration up to the default +# timeout of 30 seconds +POST /_cluster/voting_config_exclusions/node_name +# Add node to voting configuration exclusions list and wait for +# auto-reconfiguration up to one minute +POST /_cluster/voting_config_exclusions/node_name?timeout=1m -------------------------------------------------- // CONSOLE -The node whose vote should be withdrawn is specified using <> in place of `node_name` here. If a call to the vote withdrawal API -fails then the call can safely be retried. A successful response guarantees -that the node has been removed from the voting configuration and will not be -reinstated. - -Although the vote withdrawal API is most useful for removing a node from a -two-node cluster, it is also possible to use it to remove multiple nodes from -larger clusters all at the same time. Withdrawing the vote from a set of nodes -confirms that this set is no longer part of the voting configuration and can -therefore safely be shut down. In the example described above, shrinking a -seven-master-node cluster down to only have three master nodes, you could -withdraw the vote from four of the nodes and then shut them down -simultaneously. - -Withdrawing the vote from a node creates a _voting tombstone_ for that node, -which prevents it from returning to the voting configuration once it has -removed. 
The current set of voting tombstones is stored in the cluster state -and can be inspected as follows: +The node who should be excluded from the voting configuration is specified +using <> in place of `node_name` here. If a call +to the voting config exclusions API fails then the call can safely be retried. +A successful response guarantees that the node has been removed from the voting +configuration and will not be reinstated. + +Although the vote configuration exclusions API is most useful for down-scaling +a two-node to a one-node cluster, it is also possible to use it to remove multiple +nodes from larger clusters all at the same time. Adding multiple nodes to the +exclusions list has the system try to auto-reconfigure all of these nodes from +the voting configuration, allowing them to be safely shut down while keeping the +cluster available. In the example described above, shrinking a seven-master-node +cluster down to only have three master nodes, you could add four nodes to the +exclusions list, wait for confirmation, and then shut them down simultaneously. + +Excluding a node from the voting configuration creates an entry for that node +in the voting configuration exclusions list, which prevents it from returning +to the voting configuration once it has removed. The current set of exclusions +is stored in the cluster state and can be inspected as follows: [source,js] -------------------------------------------------- -GET /_cluster/state?filter_path=metadata.cluster_coordination.voting_tombstones +GET /_cluster/state?filter_path=metadata.cluster_coordination.voting_config_exclusions -------------------------------------------------- // CONSOLE -This set is limited in size by the following setting: +This list is limited in size by the following setting: -`cluster.max_voting_tombstones`:: +`cluster.max_voting_config_exclusions`:: - Sets a limits on the number of voting tombstones at any one time. Defaults + Sets a limits on the number of voting config exclusions at any one time. Defaults to `10`. -Since voting tombstones are persistent and limited in number, they must be -cleaned up. Normally a vote is withdrawn when performing some maintenance on -the cluster, and the voting tombstones should be cleaned up when the -maintenance is complete. Clusters should have no voting tombstones in normal -operation. +Since voting configuration exclusions are persistent and limited in number, they +must be cleaned up. Normally an exclusion is added when performing some maintenance on +the cluster, and the exclusions should be cleaned up when the maintenance is complete. +Clusters should have no exclusions in normal operation. -If a node's vote is withdrawn because it is to be shut down permanently then -its tombstone can be removed once it has shut down and been removed from the -cluster. Tombstones can also be removed if they were created in error or were +If a node is excluded from the voting configuration because it is to be shut down +permanently then its exclusion can be removed once it has shut down and been removed +from the cluster. 
Exclusions can also be cleared if they were created in error or were only required temporarily: [source,js] -------------------------------------------------- -# Wait for all the nodes with withdrawn votes to be removed from the cluster -# and then remove all the voting tombstones, allowing any node to return to the +# Wait for all the nodes with voting config exclusions to be removed from the cluster +# and then remove all the exclusions, allowing any node to return to the # voting configuration in future. -DELETE /_cluster/withdrawn_votes -# Immediately remove all the voting tombstones, allowing any node to return to +DELETE /_cluster/voting_config_exclusions +# Immediately remove all the voting config exclusions, allowing any node to return to # the voting configuration in future. -DELETE /_cluster/withdrawn_votes?wait_for_removal=false +DELETE /_cluster/voting_config_exclusions?wait_for_removal=false -------------------------------------------------- // CONSOLE From e8d9656d073627dd8eba4721e473c52f755b2c8b Mon Sep 17 00:00:00 2001 From: Yannick Welsch Date: Tue, 4 Dec 2018 18:37:08 +0100 Subject: [PATCH 027/106] Rename tombstones to exclusions --- docs/reference/modules/coordination.asciidoc | 29 ++++++++++---------- 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 44b7471a46ab6..49e0b75b23445 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -182,7 +182,7 @@ list using the following API: [source,js] -------------------------------------------------- # Add node to voting configuration exclusions list and wait for the system to -# auto-reconfigure the node from the voting configuration up to the default +# auto-reconfigure the node out of the voting configuration up to the default # timeout of 30 seconds POST /_cluster/voting_config_exclusions/node_name # Add node to voting configuration exclusions list and wait for @@ -191,13 +191,13 @@ POST /_cluster/voting_config_exclusions/node_name?timeout=1m -------------------------------------------------- // CONSOLE -The node who should be excluded from the voting configuration is specified -using <> in place of `node_name` here. If a call -to the voting config exclusions API fails then the call can safely be retried. +The node who should be added to the exclusions list is specified using +<> in place of `node_name` here. If a call to the +voting configuration exclusions API fails then the call can safely be retried. A successful response guarantees that the node has been removed from the voting configuration and will not be reinstated. -Although the vote configuration exclusions API is most useful for down-scaling +Although the voting configuration exclusions API is most useful for down-scaling a two-node to a one-node cluster, it is also possible to use it to remove multiple nodes from larger clusters all at the same time. Adding multiple nodes to the exclusions list has the system try to auto-reconfigure all of these nodes from @@ -206,8 +206,9 @@ cluster available. In the example described above, shrinking a seven-master-node cluster down to only have three master nodes, you could add four nodes to the exclusions list, wait for confirmation, and then shut them down simultaneously. 
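Editor's note: the seven-to-three example in the paragraph above can be sketched as follows. This is illustrative only and not part of this change; it assumes the `POST /_cluster/voting_config_exclusions/<node_name>` endpoint as documented here, the third-party `requests` package, and hypothetical node names and cluster address. As the text notes, each call can simply be retried on failure because a successful response is the confirmation that the node is out of the voting configuration.

[source,python]
--------------------------------------------------
import requests

CLUSTER = "http://localhost:9200"  # example address

# Hypothetical names of the four master-eligible nodes being removed in the
# seven-to-three example above.
NODES_TO_REMOVE = ["master-d", "master-e", "master-f", "master-g"]

for name in NODES_TO_REMOVE:
    # A successful response means the node is no longer in the voting
    # configuration; the call is safe to retry if it fails or times out.
    response = requests.post(
        CLUSTER + "/_cluster/voting_config_exclusions/" + name,
        params={"timeout": "1m"},
        timeout=90,
    )
    response.raise_for_status()

print("All four nodes are excluded and can be shut down simultaneously.")
--------------------------------------------------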
-Excluding a node from the voting configuration creates an entry for that node -in the voting configuration exclusions list, which prevents it from returning +Adding an exclusion for a node creates an entry for that node in the voting +configuration exclusions list, which has the system automatically try to reconfigure +the voting configuration to remove that node and prevents it from returning to the voting configuration once it has removed. The current set of exclusions is stored in the cluster state and can be inspected as follows: @@ -221,8 +222,8 @@ This list is limited in size by the following setting: `cluster.max_voting_config_exclusions`:: - Sets a limits on the number of voting config exclusions at any one time. Defaults - to `10`. + Sets a limits on the number of voting configuration exclusions at any one time. + Defaults to `10`. Since voting configuration exclusions are persistent and limited in number, they must be cleaned up. Normally an exclusion is added when performing some maintenance on @@ -236,12 +237,12 @@ only required temporarily: [source,js] -------------------------------------------------- -# Wait for all the nodes with voting config exclusions to be removed from the cluster -# and then remove all the exclusions, allowing any node to return to the -# voting configuration in future. +# Wait for all the nodes with voting configuration exclusions to be removed from the +# cluster and then remove all the exclusions, allowing any node to return to the +# voting configuration in the future. DELETE /_cluster/voting_config_exclusions -# Immediately remove all the voting config exclusions, allowing any node to return to -# the voting configuration in future. +# Immediately remove all the voting configuration exclusions, allowing any node to +# return to the voting configuration in the future. DELETE /_cluster/voting_config_exclusions?wait_for_removal=false -------------------------------------------------- // CONSOLE From 208d463bfa1d62e4f4f233940bf97ccbbf2326b6 Mon Sep 17 00:00:00 2001 From: David Turner Date: Wed, 5 Dec 2018 12:52:16 +0000 Subject: [PATCH 028/106] Reformat --- docs/reference/modules/coordination.asciidoc | 72 ++++++++++---------- 1 file changed, 37 insertions(+), 35 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 49e0b75b23445..0bfa12e8a6b24 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -169,15 +169,15 @@ If there are only two master-eligible nodes then neither node can be safely removed since both are required to reliably make progress, so you must first inform Elasticsearch that one of the nodes should not be part of the voting configuration, and that the voting power should instead be given to other -nodes, allowing the excluded node to be taken offline without preventing -the other node from making progress. A node who is added to a voting +nodes, allowing the excluded node to be taken offline without preventing the +other node from making progress. A node which is added to a voting configuration exclusion list still works normally, but Elasticsearch will try and remove it from the voting configuration so its vote is no longer required, -and will never automatically move such a node back into the voting configuration -after it has been removed. Once a node's has been successfully reconfigured out -of the voting configuration, it is safe to shut it down with affecting the -cluster's availability. 
A node can be added to the voting configuration exclusion -list using the following API: +and will never automatically move such a node back into the voting +configuration after it has been removed. Once a node has been successfully +reconfigured out of the voting configuration, it is safe to shut it down +without affecting the cluster's availability. A node can be added to the voting +configuration exclusion list using the following API: [source,js] -------------------------------------------------- @@ -191,26 +191,27 @@ POST /_cluster/voting_config_exclusions/node_name?timeout=1m -------------------------------------------------- // CONSOLE -The node who should be added to the exclusions list is specified using +The node that should be added to the exclusions list is specified using <> in place of `node_name` here. If a call to the voting configuration exclusions API fails then the call can safely be retried. A successful response guarantees that the node has been removed from the voting configuration and will not be reinstated. -Although the voting configuration exclusions API is most useful for down-scaling -a two-node to a one-node cluster, it is also possible to use it to remove multiple -nodes from larger clusters all at the same time. Adding multiple nodes to the -exclusions list has the system try to auto-reconfigure all of these nodes from -the voting configuration, allowing them to be safely shut down while keeping the -cluster available. In the example described above, shrinking a seven-master-node -cluster down to only have three master nodes, you could add four nodes to the -exclusions list, wait for confirmation, and then shut them down simultaneously. +Although the voting configuration exclusions API is most useful for +down-scaling a two-node to a one-node cluster, it is also possible to use it to +remove multiple nodes from larger clusters all at the same time. Adding +multiple nodes to the exclusions list has the system try to auto-reconfigure +all of these nodes out of the voting configuration, allowing them to be safely +shut down while keeping the cluster available. In the example described above, +shrinking a seven-master-node cluster down to only have three master nodes, you +could add four nodes to the exclusions list, wait for confirmation, and then +shut them down simultaneously. Adding an exclusion for a node creates an entry for that node in the voting -configuration exclusions list, which has the system automatically try to reconfigure -the voting configuration to remove that node and prevents it from returning -to the voting configuration once it has removed. The current set of exclusions -is stored in the cluster state and can be inspected as follows: +configuration exclusions list, which has the system automatically try to +reconfigure the voting configuration to remove that node and prevents it from +returning to the voting configuration once it has removed. The current set of +exclusions is stored in the cluster state and can be inspected as follows: [source,js] -------------------------------------------------- @@ -222,27 +223,28 @@ This list is limited in size by the following setting: `cluster.max_voting_config_exclusions`:: - Sets a limits on the number of voting configuration exclusions at any one time. - Defaults to `10`. + Sets a limits on the number of voting configuration exclusions at any one + time. Defaults to `10`. -Since voting configuration exclusions are persistent and limited in number, they -must be cleaned up. 
Normally an exclusion is added when performing some maintenance on -the cluster, and the exclusions should be cleaned up when the maintenance is complete. -Clusters should have no exclusions in normal operation. +Since voting configuration exclusions are persistent and limited in number, +they must be cleaned up. Normally an exclusion is added when performing some +maintenance on the cluster, and the exclusions should be cleaned up when the +maintenance is complete. Clusters should have no voting configuration +exclusions in normal operation. -If a node is excluded from the voting configuration because it is to be shut down -permanently then its exclusion can be removed once it has shut down and been removed -from the cluster. Exclusions can also be cleared if they were created in error or were -only required temporarily: +If a node is excluded from the voting configuration because it is to be shut +down permanently then its exclusion can be removed once it has shut down and +been removed from the cluster. Exclusions can also be cleared if they were +created in error or were only required temporarily: [source,js] -------------------------------------------------- -# Wait for all the nodes with voting configuration exclusions to be removed from the -# cluster and then remove all the exclusions, allowing any node to return to the -# voting configuration in the future. -DELETE /_cluster/voting_config_exclusions -# Immediately remove all the voting configuration exclusions, allowing any node to +# Wait for all the nodes with voting configuration exclusions to be removed +# from the cluster and then remove all the exclusions, allowing any node to # return to the voting configuration in the future. +DELETE /_cluster/voting_config_exclusions +# Immediately remove all the voting configuration exclusions, allowing any node +# to return to the voting configuration in the future. DELETE /_cluster/voting_config_exclusions?wait_for_removal=false -------------------------------------------------- // CONSOLE From de98cba8aa6c0820c390cf2827599ab5c44101f0 Mon Sep 17 00:00:00 2001 From: David Turner Date: Wed, 5 Dec 2018 13:05:00 +0000 Subject: [PATCH 029/106] Expand section about quorums --- docs/reference/modules/coordination.asciidoc | 32 ++++++++++++++------ 1 file changed, 22 insertions(+), 10 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 0bfa12e8a6b24..b0d2a25d6d92a 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -21,16 +21,28 @@ those of the other piece. Elasticsearch allows you to add and remove master-eligible nodes to a running cluster. In many cases you can do this simply by starting or stopping the nodes -as required, as described in more detail below. As nodes are added or removed -Elasticsearch maintains an optimal level of fault tolerance by updating the -cluster's _voting configuration_, which is the set of master-eligible nodes -whose responses are counted when making decisions such as electing a new master -or committing a new cluster state. A decision is only made once more than half -of the nodes in the voting configuration have responded. Usually the voting -configuration is the same as the set of all the master-eligible nodes that are -currently in the cluster, but there are some situations in which they may be -different. 
As long as more than half of the nodes in the voting configuration -are still healthy then the cluster can still make progress. +as required, as described in more detail below. + +As nodes are added or removed Elasticsearch maintains an optimal level of fault +tolerance by updating the cluster's _voting configuration_, which is the set of +master-eligible nodes whose responses are counted when making decisions such as +electing a new master or committing a new cluster state. A decision is only +made once more than half of the nodes in the voting configuration have +responded. Usually the voting configuration is the same as the set of all the +master-eligible nodes that are currently in the cluster, but there are some +situations in which they may be different. + +To be sure that the cluster remains available you **must not stop half or more +of the nodes in the voting configuration at the same time**. As long as more +than half of the voting nodes are available the cluster can still work +normally. This means that if there are three or four master-eligible nodes then +the cluster can tolerate one of them being unavailable; if there are two or +fewer master-eligible nodes then they must all remain available. + +After a node has joined or left the cluster the elected master must issue a +cluster-state update that adjusts the voting configuration to match, and this +can take a short time to complete. It is important to wait for this adjustment +to complete before removing more nodes from the cluster. [float] === Cluster maintenance, rolling restarts and migrations From 12c2b4b9c08726c867fb9516c4c10306bc1a9373 Mon Sep 17 00:00:00 2001 From: David Turner Date: Wed, 5 Dec 2018 13:13:14 +0000 Subject: [PATCH 030/106] Simplify bootstrapping docs --- docs/reference/modules/coordination.asciidoc | 115 +++--------------- .../coordination/ClusterBootstrapService.java | 76 +++++++++++- 2 files changed, 94 insertions(+), 97 deletions(-) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index b0d2a25d6d92a..6d4d271932019 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -300,108 +300,31 @@ all the information they need from the elected master, and nodes that have previously been part of a cluster will have stored to disk all the information required when restarting. -A cluster can be bootstrapped by sending the _bootstrap configuration_ to any -of its master-eligible nodes via the `POST /_cluster/bootstrap_configuration` -API: +A cluster can be bootstrapped by setting the names or addresses of the initial +set of master nodes in the `elasticsearch.yml` file: -[source,js] --------------------------------------------------- -POST /_cluster/bootstrap_configuration -{ - "master_nodes": [ - { - "id": "gAMDNeJRTX6A_VelgSb84g", - "name": "master-a" - }, - { - "id": "t3LZCVGxTf-idQIC8z4A1A", - "name": "master-b" - }, - { - "id": "GfwXZYVVSFCOWNT0zcDixQ", - "name": "master-c" - } - ] -} --------------------------------------------------- -// CONSOLE - -This only needs to occur once, on a single master-eligible node in the cluster, -but for robustness it is safe to repeatedly call `POST -/_cluster/bootstrap_configuration`, and to call it on different nodes -concurrently. However **it is vitally important** to use exactly the same -bootstrap configuration in each call. 
- -WARNING: You must pass exactly the same bootstrap configuration to each call to -`POST /_cluster/bootstrap_configuration` in order to be sure that only a single -cluster forms during bootstrapping and therefore to avoid the risk of data -loss. - -The simplest and safest way to construct a bootstrap configuration is to use -the `GET /_cluster/bootstrap_configuration` API. This API returns a -properly-constructed bootstrap configuration that is ready to pass back to the -`POST /_cluster/bootstrap_configuration` API. It includes all of the -master-eligible nodes that the handling node has discovered via the -gossip-based discovery protocol, and returns an error if fewer nodes have been -discovered than required: - -[source,js] +[source] -------------------------------------------------- -# Immediately return a bootstrap configuration based on the nodes -# discovered so far. -GET /_cluster/bootstrap_configuration -# Return a bootstrap configuration of at least three nodes, or return an -# error if fewer than three nodes have been discovered. -GET /_cluster/bootstrap_configuration?min_size=3 -# Return a bootstrap configuration of at least three nodes, waiting for -# up to a minute for this many nodes to be discovered before returning -# an error. -GET /_cluster/bootstrap_configuration?min_size=3&timeout=1m +cluster.initial_master_nodes: + - master-a + - master-b + - master-c -------------------------------------------------- -// CONSOLE -It is also possible to construct a bootstrap configuration manually and to -specify the initial set of nodes in terms of their names alone, rather than -including their IDs too: +This only needs to be set on a single master-eligible node in the cluster, but +for robustness it is safe to set this on every node in the cluster. However +**it is vitally important** to use exactly the same set of nodes in each +configuration file. -[source,js] --------------------------------------------------- -POST /_cluster/bootstrap_configuration -{ - "master_nodes": [ - { - "name": "master-a" - }, - { - "name": "master-b" - }, - { - "name": "master-c" - } - ] -} --------------------------------------------------- -// CONSOLE - -It is safer to include the node IDs, in case two nodes are accidentally started -with the same name. - -[float] -==== Cluster bootstrapping tool - -A simpler way to bootstrap a cluster is to use the -`elasticsearch-bootstrap-cluster` command-line tool which implements the -process described here: - -[source,txt] --------------------------------------------------- -$ bin/elasticsearch-bootstrap-cluster --node http://10.0.12.1:9200/ \ - --node http://10.0.13.1:9200/ --node https://10.0.14.1:9200/ --------------------------------------------------- +WARNING: You must put exactly the same set of master nodes in each +configuration file in order to be sure that only a single cluster forms during +bootstrapping and therefore to avoid the risk of data loss. -The arguments to this tool are the addresses of (some, preferably all, of) its -master-eligible nodes. The tool will construct a bootstrap warrant and then -bootstrap the cluster, retrying safely if any step fails. +If the cluster is running with a completely default configuration then it will +automatically bootstrap based on the nodes that could be discovered within a +short time after startup. Since nodes may not always reliably discover each +other quickly enough this automatic bootstrapping is not always reliable and +cannot be used in production deployments. 
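Editor's note: a simple way to respect the warning above is to check, before first start-up, that every node's configuration file names exactly the same initial master nodes. The sketch below is illustrative only and not part of this change; the file paths are hypothetical, the PyYAML package is assumed, and only the flat `cluster.initial_master_nodes: [...]` (or comma-separated string) form of the setting is handled.

[source,python]
--------------------------------------------------
import yaml  # PyYAML, assumed to be available

# Hypothetical locations of each node's configuration file.
CONFIG_FILES = [
    "/etc/elasticsearch/master-a/elasticsearch.yml",
    "/etc/elasticsearch/master-b/elasticsearch.yml",
    "/etc/elasticsearch/master-c/elasticsearch.yml",
]


def initial_master_nodes(path):
    """Read cluster.initial_master_nodes, assuming the flat key form."""
    with open(path) as handle:
        settings = yaml.safe_load(handle) or {}
    value = settings.get("cluster.initial_master_nodes", [])
    if isinstance(value, str):  # also accept a comma-separated string
        value = [item.strip() for item in value.split(",")]
    return tuple(sorted(value))


found = {path: initial_master_nodes(path) for path in CONFIG_FILES}
if len(set(found.values())) != 1:
    raise SystemExit("cluster.initial_master_nodes differs between nodes: %r" % found)
print("All configuration files agree on the initial set of master nodes.")
--------------------------------------------------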
[float] === Unsafe disaster recovery diff --git a/server/src/main/java/org/elasticsearch/cluster/coordination/ClusterBootstrapService.java b/server/src/main/java/org/elasticsearch/cluster/coordination/ClusterBootstrapService.java index 572631a8f61cd..51206ff563e4b 100644 --- a/server/src/main/java/org/elasticsearch/cluster/coordination/ClusterBootstrapService.java +++ b/server/src/main/java/org/elasticsearch/cluster/coordination/ClusterBootstrapService.java @@ -29,6 +29,7 @@ import org.elasticsearch.action.admin.cluster.bootstrap.GetDiscoveredNodesRequest; import org.elasticsearch.action.admin.cluster.bootstrap.GetDiscoveredNodesResponse; import org.elasticsearch.cluster.node.DiscoveryNode; +import org.elasticsearch.common.Nullable; import org.elasticsearch.common.io.stream.StreamInput; import org.elasticsearch.common.settings.Setting; import org.elasticsearch.common.settings.Setting.Property; @@ -41,6 +42,13 @@ import org.elasticsearch.transport.TransportService; import java.io.IOException; +import java.util.Collections; +import java.util.List; +import java.util.function.Function; +import java.util.stream.Stream; + +import static org.elasticsearch.discovery.DiscoveryModule.DISCOVERY_HOSTS_PROVIDER_SETTING; +import static org.elasticsearch.discovery.zen.SettingsBasedHostsProvider.DISCOVERY_ZEN_PING_UNICAST_HOSTS_SETTING; public class ClusterBootstrapService { @@ -51,20 +59,85 @@ public class ClusterBootstrapService { public static final Setting INITIAL_MASTER_NODE_COUNT_SETTING = Setting.intSetting("cluster.unsafe_initial_master_node_count", 0, 0, Property.NodeScope); + public static final Setting> INITIAL_MASTER_NODES_SETTING = + Setting.listSetting("cluster.initial_master_nodes", Collections.emptyList(), Function.identity(), Property.NodeScope); + + public static final Setting UNCONFIGURED_BOOTSTRAP_TIMEOUT_SETTING = + Setting.timeSetting("discovery.unconfigured_bootstrap_timeout", + TimeValue.timeValueSeconds(3), TimeValue.timeValueMillis(1), Property.NodeScope); + private final int initialMasterNodeCount; + private final List initialMasterNodes; + @Nullable + private final TimeValue unconfiguredBootstrapTimeout; private final TransportService transportService; private volatile boolean running; public ClusterBootstrapService(Settings settings, TransportService transportService) { initialMasterNodeCount = INITIAL_MASTER_NODE_COUNT_SETTING.get(settings); + initialMasterNodes = INITIAL_MASTER_NODES_SETTING.get(settings); + unconfiguredBootstrapTimeout = discoveryIsConfigured(settings) ? 
null : UNCONFIGURED_BOOTSTRAP_TIMEOUT_SETTING.get(settings); this.transportService = transportService; } + public static boolean discoveryIsConfigured(Settings settings) { + return Stream.of(DISCOVERY_HOSTS_PROVIDER_SETTING, DISCOVERY_ZEN_PING_UNICAST_HOSTS_SETTING, + INITIAL_MASTER_NODE_COUNT_SETTING, INITIAL_MASTER_NODES_SETTING).anyMatch(s -> s.exists(settings)); + } + public void start() { assert running == false; running = true; - if (initialMasterNodeCount > 0 && transportService.getLocalNode().isMasterNode()) { + if (transportService.getLocalNode().isMasterNode() == false) { + return; + } + + if (unconfiguredBootstrapTimeout != null) { + logger.info("no discovery configuration found, will perform best-effort cluster bootstrapping after [{}] " + + "unless existing master is discovered", unconfiguredBootstrapTimeout); + final ThreadContext threadContext = transportService.getThreadPool().getThreadContext(); + try (ThreadContext.StoredContext ignore = threadContext.stashContext()) { + threadContext.markAsSystemContext(); + + transportService.getThreadPool().scheduleUnlessShuttingDown(unconfiguredBootstrapTimeout, Names.SAME, new Runnable() { + @Override + public void run() { + final GetDiscoveredNodesRequest request = new GetDiscoveredNodesRequest(); + logger.trace("sending {}", request); + transportService.sendRequest(transportService.getLocalNode(), GetDiscoveredNodesAction.NAME, request, + new TransportResponseHandler() { + @Override + public void handleResponse(GetDiscoveredNodesResponse response) { + logger.debug("discovered {}, starting to bootstrap", response.getNodes()); + awaitBootstrap(response.getBootstrapConfiguration()); + } + + @Override + public void handleException(TransportException exp) { + logger.warn("discovery attempt failed", exp); + } + + @Override + public String executor() { + return Names.SAME; + } + + @Override + public GetDiscoveredNodesResponse read(StreamInput in) throws IOException { + return new GetDiscoveredNodesResponse(in); + } + }); + } + + @Override + public String toString() { + return "unconfigured-discovery delayed bootstrap"; + } + }); + + } + } else if (initialMasterNodeCount > 0) { logger.debug("unsafely waiting for discovery of [{}] master-eligible nodes", initialMasterNodeCount); final ThreadContext threadContext = transportService.getThreadPool().getThreadContext(); @@ -73,6 +146,7 @@ public void start() { final GetDiscoveredNodesRequest request = new GetDiscoveredNodesRequest(); request.setWaitForNodes(initialMasterNodeCount); + request.setRequiredNodes(initialMasterNodes); request.setTimeout(null); logger.trace("sending {}", request); transportService.sendRequest(transportService.getLocalNode(), GetDiscoveredNodesAction.NAME, request, From d4763eea3332c5dfe5e5d52c1acba6bee09367e3 Mon Sep 17 00:00:00 2001 From: David Turner Date: Wed, 5 Dec 2018 13:30:54 +0000 Subject: [PATCH 031/106] Oops --- .../coordination/ClusterBootstrapService.java | 76 +------------------ 1 file changed, 1 insertion(+), 75 deletions(-) diff --git a/server/src/main/java/org/elasticsearch/cluster/coordination/ClusterBootstrapService.java b/server/src/main/java/org/elasticsearch/cluster/coordination/ClusterBootstrapService.java index 51206ff563e4b..572631a8f61cd 100644 --- a/server/src/main/java/org/elasticsearch/cluster/coordination/ClusterBootstrapService.java +++ b/server/src/main/java/org/elasticsearch/cluster/coordination/ClusterBootstrapService.java @@ -29,7 +29,6 @@ import org.elasticsearch.action.admin.cluster.bootstrap.GetDiscoveredNodesRequest; import 
org.elasticsearch.action.admin.cluster.bootstrap.GetDiscoveredNodesResponse; import org.elasticsearch.cluster.node.DiscoveryNode; -import org.elasticsearch.common.Nullable; import org.elasticsearch.common.io.stream.StreamInput; import org.elasticsearch.common.settings.Setting; import org.elasticsearch.common.settings.Setting.Property; @@ -42,13 +41,6 @@ import org.elasticsearch.transport.TransportService; import java.io.IOException; -import java.util.Collections; -import java.util.List; -import java.util.function.Function; -import java.util.stream.Stream; - -import static org.elasticsearch.discovery.DiscoveryModule.DISCOVERY_HOSTS_PROVIDER_SETTING; -import static org.elasticsearch.discovery.zen.SettingsBasedHostsProvider.DISCOVERY_ZEN_PING_UNICAST_HOSTS_SETTING; public class ClusterBootstrapService { @@ -59,85 +51,20 @@ public class ClusterBootstrapService { public static final Setting INITIAL_MASTER_NODE_COUNT_SETTING = Setting.intSetting("cluster.unsafe_initial_master_node_count", 0, 0, Property.NodeScope); - public static final Setting> INITIAL_MASTER_NODES_SETTING = - Setting.listSetting("cluster.initial_master_nodes", Collections.emptyList(), Function.identity(), Property.NodeScope); - - public static final Setting UNCONFIGURED_BOOTSTRAP_TIMEOUT_SETTING = - Setting.timeSetting("discovery.unconfigured_bootstrap_timeout", - TimeValue.timeValueSeconds(3), TimeValue.timeValueMillis(1), Property.NodeScope); - private final int initialMasterNodeCount; - private final List initialMasterNodes; - @Nullable - private final TimeValue unconfiguredBootstrapTimeout; private final TransportService transportService; private volatile boolean running; public ClusterBootstrapService(Settings settings, TransportService transportService) { initialMasterNodeCount = INITIAL_MASTER_NODE_COUNT_SETTING.get(settings); - initialMasterNodes = INITIAL_MASTER_NODES_SETTING.get(settings); - unconfiguredBootstrapTimeout = discoveryIsConfigured(settings) ? 
null : UNCONFIGURED_BOOTSTRAP_TIMEOUT_SETTING.get(settings); this.transportService = transportService; } - public static boolean discoveryIsConfigured(Settings settings) { - return Stream.of(DISCOVERY_HOSTS_PROVIDER_SETTING, DISCOVERY_ZEN_PING_UNICAST_HOSTS_SETTING, - INITIAL_MASTER_NODE_COUNT_SETTING, INITIAL_MASTER_NODES_SETTING).anyMatch(s -> s.exists(settings)); - } - public void start() { assert running == false; running = true; - if (transportService.getLocalNode().isMasterNode() == false) { - return; - } - - if (unconfiguredBootstrapTimeout != null) { - logger.info("no discovery configuration found, will perform best-effort cluster bootstrapping after [{}] " + - "unless existing master is discovered", unconfiguredBootstrapTimeout); - final ThreadContext threadContext = transportService.getThreadPool().getThreadContext(); - try (ThreadContext.StoredContext ignore = threadContext.stashContext()) { - threadContext.markAsSystemContext(); - - transportService.getThreadPool().scheduleUnlessShuttingDown(unconfiguredBootstrapTimeout, Names.SAME, new Runnable() { - @Override - public void run() { - final GetDiscoveredNodesRequest request = new GetDiscoveredNodesRequest(); - logger.trace("sending {}", request); - transportService.sendRequest(transportService.getLocalNode(), GetDiscoveredNodesAction.NAME, request, - new TransportResponseHandler() { - @Override - public void handleResponse(GetDiscoveredNodesResponse response) { - logger.debug("discovered {}, starting to bootstrap", response.getNodes()); - awaitBootstrap(response.getBootstrapConfiguration()); - } - - @Override - public void handleException(TransportException exp) { - logger.warn("discovery attempt failed", exp); - } - - @Override - public String executor() { - return Names.SAME; - } - - @Override - public GetDiscoveredNodesResponse read(StreamInput in) throws IOException { - return new GetDiscoveredNodesResponse(in); - } - }); - } - - @Override - public String toString() { - return "unconfigured-discovery delayed bootstrap"; - } - }); - - } - } else if (initialMasterNodeCount > 0) { + if (initialMasterNodeCount > 0 && transportService.getLocalNode().isMasterNode()) { logger.debug("unsafely waiting for discovery of [{}] master-eligible nodes", initialMasterNodeCount); final ThreadContext threadContext = transportService.getThreadPool().getThreadContext(); @@ -146,7 +73,6 @@ public String toString() { final GetDiscoveredNodesRequest request = new GetDiscoveredNodesRequest(); request.setWaitForNodes(initialMasterNodeCount); - request.setRequiredNodes(initialMasterNodes); request.setTimeout(null); logger.trace("sending {}", request); transportService.sendRequest(transportService.getLocalNode(), GetDiscoveredNodesAction.NAME, request, From 3fc691f87d1baaf94224da021644d25884882779 Mon Sep 17 00:00:00 2001 From: David Turner Date: Fri, 7 Dec 2018 17:56:21 +0000 Subject: [PATCH 032/106] Command line also ok --- docs/reference/modules/coordination.asciidoc | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 6d4d271932019..8f8b64a57d3e3 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -320,6 +320,15 @@ WARNING: You must put exactly the same set of master nodes in each configuration file in order to be sure that only a single cluster forms during bootstrapping and therefore to avoid the risk of data loss. 
+It is also possible to set the initial set of master nodes on the command-line +used to start Elasticsearch: + +[source] +-------------------------------------------------- +$ bin/elasticsearch -Ecluster.initial_master_nodes=master-a,master-b,master-c +-------------------------------------------------- + + If the cluster is running with a completely default configuration then it will automatically bootstrap based on the nodes that could be discovered within a short time after startup. Since nodes may not always reliably discover each From ca73f1f825252a510b73ea0e5541be0510cabc7c Mon Sep 17 00:00:00 2001 From: Yannick Welsch Date: Sun, 9 Dec 2018 23:38:15 +0100 Subject: [PATCH 033/106] Refactor docs --- docs/plugins/discovery.asciidoc | 5 +- docs/reference/modules.asciidoc | 16 +- docs/reference/modules/cluster.asciidoc | 2 +- docs/reference/modules/coordination.asciidoc | 384 ------------------ docs/reference/modules/discovery.asciidoc | 104 ++++- .../discovery/auto-reconfiguration.asciidoc | 112 +++++ .../modules/discovery/azure.asciidoc | 5 - .../discovery/bootstrap-cluster.asciidoc | 67 +++ docs/reference/modules/discovery/ec2.asciidoc | 4 - docs/reference/modules/discovery/gce.asciidoc | 6 - .../discovery/hosts-providers.asciidoc | 150 +++++++ .../discovery/master-election.asciidoc | 97 +++++ .../modules/discovery/quorums.asciidoc | 187 +++++++++ docs/reference/modules/discovery/zen.asciidoc | 226 ----------- docs/reference/modules/node.asciidoc | 72 +--- .../reference/setup/bootstrap-checks.asciidoc | 1 + .../discovery-settings.asciidoc | 36 +- 17 files changed, 727 insertions(+), 747 deletions(-) create mode 100644 docs/reference/modules/discovery/auto-reconfiguration.asciidoc delete mode 100644 docs/reference/modules/discovery/azure.asciidoc create mode 100644 docs/reference/modules/discovery/bootstrap-cluster.asciidoc delete mode 100644 docs/reference/modules/discovery/ec2.asciidoc delete mode 100644 docs/reference/modules/discovery/gce.asciidoc create mode 100644 docs/reference/modules/discovery/hosts-providers.asciidoc create mode 100644 docs/reference/modules/discovery/master-election.asciidoc create mode 100644 docs/reference/modules/discovery/quorums.asciidoc delete mode 100644 docs/reference/modules/discovery/zen.asciidoc diff --git a/docs/plugins/discovery.asciidoc b/docs/plugins/discovery.asciidoc index 46b61146b128d..fb77e60898ff9 100644 --- a/docs/plugins/discovery.asciidoc +++ b/docs/plugins/discovery.asciidoc @@ -1,8 +1,8 @@ [[discovery]] == Discovery Plugins -Discovery plugins extend Elasticsearch by adding new discovery mechanisms that -can be used instead of {ref}/modules-discovery-zen.html[Zen Discovery]. +Discovery plugins extend Elasticsearch by adding new host providers that +can be used to extend the {ref}/modules-discovery-zen.html[cluster formation module]. 
[float] ==== Core discovery plugins @@ -26,7 +26,6 @@ The Google Compute Engine discovery plugin uses the GCE API for unicast discover A number of discovery plugins have been contributed by our community: -* https://github.com/shikhar/eskka[eskka Discovery Plugin] (by Shikhar Bhushan) * https://github.com/fabric8io/elasticsearch-cloud-kubernetes[Kubernetes Discovery Plugin] (by Jimmi Dyson, http://fabric8.io[fabric8]) include::discovery-ec2.asciidoc[] diff --git a/docs/reference/modules.asciidoc b/docs/reference/modules.asciidoc index 45eac7ba53165..f8b6c2784a075 100644 --- a/docs/reference/modules.asciidoc +++ b/docs/reference/modules.asciidoc @@ -18,17 +18,13 @@ These settings can be dynamically updated on a live cluster with the The modules in this section are: -<>:: +<>:: - Settings to control where, when, and how shards are allocated to nodes. - -<>:: + How nodes discover each other, elect a master and form a cluster. - How nodes discover each other to form a cluster. +<>:: -<>:: - - How the cluster elects a master node and manages the cluster state + Settings to control where, when, and how shards are allocated to nodes. <>:: @@ -89,11 +85,9 @@ The modules in this section are: -- -include::modules/cluster.asciidoc[] - include::modules/discovery.asciidoc[] -include::modules/coordination.asciidoc[] +include::modules/cluster.asciidoc[] include::modules/gateway.asciidoc[] diff --git a/docs/reference/modules/cluster.asciidoc b/docs/reference/modules/cluster.asciidoc index c4b6445292726..810ed7c4a34b4 100644 --- a/docs/reference/modules/cluster.asciidoc +++ b/docs/reference/modules/cluster.asciidoc @@ -1,5 +1,5 @@ [[modules-cluster]] -== Cluster +== Shard allocation and cluster-level routing One of the main roles of the master is to decide which shards to allocate to which nodes, and when to move shards between nodes in order to rebalance the diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc index 8f8b64a57d3e3..5b38ab38f2f50 100644 --- a/docs/reference/modules/coordination.asciidoc +++ b/docs/reference/modules/coordination.asciidoc @@ -4,262 +4,9 @@ The cluster coordination module is responsible for electing a master node and managing changes to the cluster state. -[float] -=== Quorum-based decision making - -Electing a master node and changing the cluster state are the two fundamental -tasks that master-eligible nodes must work together to perform. It is important -that these activities work robustly even if some nodes have failed, and -Elasticsearch achieves this robustness by only considering each action to have -succeeded on receipt of responses from a _quorum_, a subset of the -master-eligible nodes in the cluster. The advantage of requiring only a subset -of the nodes to respond is that it allows for some of the nodes to fail without -preventing the cluster from making progress, and the quorums are carefully -chosen so as not to allow the cluster to "split brain", i.e. to be partitioned -into two pieces each of which may make decisions that are inconsistent with -those of the other piece. - -Elasticsearch allows you to add and remove master-eligible nodes to a running -cluster. In many cases you can do this simply by starting or stopping the nodes -as required, as described in more detail below. 
- -As nodes are added or removed Elasticsearch maintains an optimal level of fault -tolerance by updating the cluster's _voting configuration_, which is the set of -master-eligible nodes whose responses are counted when making decisions such as -electing a new master or committing a new cluster state. A decision is only -made once more than half of the nodes in the voting configuration have -responded. Usually the voting configuration is the same as the set of all the -master-eligible nodes that are currently in the cluster, but there are some -situations in which they may be different. - -To be sure that the cluster remains available you **must not stop half or more -of the nodes in the voting configuration at the same time**. As long as more -than half of the voting nodes are available the cluster can still work -normally. This means that if there are three or four master-eligible nodes then -the cluster can tolerate one of them being unavailable; if there are two or -fewer master-eligible nodes then they must all remain available. - -After a node has joined or left the cluster the elected master must issue a -cluster-state update that adjusts the voting configuration to match, and this -can take a short time to complete. It is important to wait for this adjustment -to complete before removing more nodes from the cluster. - -[float] -=== Cluster maintenance, rolling restarts and migrations - -Many cluster maintenance tasks involve temporarily shutting down one or more -nodes and then starting them back up again. By default Elasticsearch can remain -available if one of its master-eligible nodes is taken offline, such as during -a <>. Furthermore, if multiple nodes are -stopped and then started again then it will automatically recover, such as -during a <>. There is no need to take any -further action with the APIs described here in these cases, because the set of -master nodes is not changing permanently. - -It is also possible to perform a migration of a cluster onto entirely new nodes -without taking the cluster offline, via a _rolling migration_. A rolling -migration is similar to a rolling restart, in that it is performed one node at -a time, and also requires no special handling for the master-eligible nodes as -long as there are at least two of them available at all times. - -TODO the above is only true if the maintenance happens slowly enough, otherwise -the configuration might not catch up. Need to add this to the rolling restart -docs. - -[float] -==== Auto-reconfiguration -Nodes may join or leave the cluster, and Elasticsearch reacts by making -corresponding changes to the voting configuration in order to ensure that the -cluster is as resilient as possible. The default auto-reconfiguration behaviour -is expected to give the best results in most situation. The current voting -configuration is stored in the cluster state so you can inspect its current -contents as follows: -[source,js] --------------------------------------------------- -GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config --------------------------------------------------- -// CONSOLE - -NOTE: The current voting configuration is not necessarily the same as the set -of all available master-eligible nodes in the cluster. Altering the voting -configuration itself involves taking a vote, so it takes some time to adjust -the configuration as nodes join or leave the cluster. 
Also, there are -situations where the most resilient configuration includes unavailable nodes, -or does not include some available nodes, and in these situations the voting -configuration will differ from the set of available master-eligible nodes in -the cluster. - -Larger voting configurations are usually more resilient, so Elasticsearch will -normally prefer to add master-eligible nodes to the voting configuration once -they have joined the cluster. Similarly, if a node in the voting configuration -leaves the cluster and there is another master-eligible node in the cluster -that is not in the voting configuration then it is preferable to swap these two -nodes over, leaving the size of the voting configuration unchanged but -increasing its resilience. - -It is not so straightforward to automatically remove nodes from the voting -configuration after they have left the cluster, and different strategies have -different benefits and drawbacks, so the right choice depends on how the -cluster will be used and is controlled by the following setting. -`cluster.auto_shrink_voting_configuration`:: - - Defaults to `true`, meaning that the voting configuration will - automatically shrink, shedding departed nodes, as long as it still contains - at least 3 nodes. If set to `false`, the voting configuration never - automatically shrinks; departed nodes must be removed manually using the - vote withdrawal API described below. - -NOTE: If `cluster.auto_shrink_voting_configuration` is set to `true`, the -recommended and default setting, and there are at least three master-eligible -nodes in the cluster, then Elasticsearch remains capable of processing -cluster-state updates as long as all but one of its master-eligible nodes are -healthy. - -There are situations in which Elasticsearch might tolerate the loss of multiple -nodes, but this is not guaranteed under all sequences of failures. If this -setting is set to `false` then departed nodes must be removed from the voting -configuration manually, using the vote withdrawal API described below, to achieve -the desired level of resilience. - -Note that Elasticsearch will not suffer from a "split-brain" inconsistency -however it is configured. This setting only affects its availability in the -event of the failure of some of its nodes, and the administrative tasks that -must be performed as nodes join and leave the cluster. - -[float] -==== Even numbers of master-eligible nodes - -There should normally be an odd number of master-eligible nodes in a cluster. -If there is an even number then Elasticsearch will leave one of them out of the -voting configuration to ensure that it has an odd size. This does not decrease -the failure-tolerance of the cluster, and in fact improves it slightly: if the -cluster is partitioned into two even halves then one of the halves will contain -a majority of the voting configuration and will be able to keep operating, -whereas if all of the master-eligible nodes' votes were counted then neither -side could make any progress in this situation. - -For instance if there are four master-eligible nodes in the cluster and the -voting configuration contained all of them then any quorum-based decision would -require votes from at least three of them, which means that the cluster can -only tolerate the loss of a single master-eligible node. If this cluster were -split into two equal halves then neither half would contain three -master-eligible nodes so would not be able to make any progress. 
However if the -voting configuration contains only three of the four master-eligible nodes then -the cluster is still only fully tolerant to the loss of one node, but -quorum-based decisions require votes from two of the three voting nodes. In the -event of an even split, one half will contain two of the three voting nodes so -will remain available. - -[float] -==== Adding and removing master-eligible nodes - -It is recommended to have a small and fixed number of master-eligible nodes in -a cluster, and to scale the cluster up and down by adding and removing -non-master-eligible nodes only. However there are situations in which it may be -desirable to add or remove some master-eligible nodes to or from a cluster. - -If you wish to add some master-eligible nodes to your cluster, simply configure -the new nodes to find the existing cluster and start them up. Elasticsearch -will add the new nodes to the voting configuration if it is appropriate to do -so. - -When removing master-eligible nodes, it is important not to remove too many all -at the same time. For instance, if there are currently seven master-eligible -nodes and you wish to reduce this to three, it is not possible simply to stop -four of the nodes at once: to do so would leave only three nodes remaining, -which is less than half of the voting configuration, which means the cluster -cannot take any further actions. - -As long as there are at least three master-eligible nodes in the cluster, as a -general rule it is best to remove nodes one-at-a-time, allowing enough time for -the auto-reconfiguration to take effect after each removal. - -If there are only two master-eligible nodes then neither node can be safely -removed since both are required to reliably make progress, so you must first -inform Elasticsearch that one of the nodes should not be part of the voting -configuration, and that the voting power should instead be given to other -nodes, allowing the excluded node to be taken offline without preventing the -other node from making progress. A node which is added to a voting -configuration exclusion list still works normally, but Elasticsearch will try -and remove it from the voting configuration so its vote is no longer required, -and will never automatically move such a node back into the voting -configuration after it has been removed. Once a node has been successfully -reconfigured out of the voting configuration, it is safe to shut it down -without affecting the cluster's availability. A node can be added to the voting -configuration exclusion list using the following API: - -[source,js] --------------------------------------------------- -# Add node to voting configuration exclusions list and wait for the system to -# auto-reconfigure the node out of the voting configuration up to the default -# timeout of 30 seconds -POST /_cluster/voting_config_exclusions/node_name -# Add node to voting configuration exclusions list and wait for -# auto-reconfiguration up to one minute -POST /_cluster/voting_config_exclusions/node_name?timeout=1m --------------------------------------------------- -// CONSOLE - -The node that should be added to the exclusions list is specified using -<> in place of `node_name` here. If a call to the -voting configuration exclusions API fails then the call can safely be retried. -A successful response guarantees that the node has been removed from the voting -configuration and will not be reinstated. 
- -Although the voting configuration exclusions API is most useful for -down-scaling a two-node to a one-node cluster, it is also possible to use it to -remove multiple nodes from larger clusters all at the same time. Adding -multiple nodes to the exclusions list has the system try to auto-reconfigure -all of these nodes out of the voting configuration, allowing them to be safely -shut down while keeping the cluster available. In the example described above, -shrinking a seven-master-node cluster down to only have three master nodes, you -could add four nodes to the exclusions list, wait for confirmation, and then -shut them down simultaneously. - -Adding an exclusion for a node creates an entry for that node in the voting -configuration exclusions list, which has the system automatically try to -reconfigure the voting configuration to remove that node and prevents it from -returning to the voting configuration once it has removed. The current set of -exclusions is stored in the cluster state and can be inspected as follows: - -[source,js] --------------------------------------------------- -GET /_cluster/state?filter_path=metadata.cluster_coordination.voting_config_exclusions --------------------------------------------------- -// CONSOLE - -This list is limited in size by the following setting: - -`cluster.max_voting_config_exclusions`:: - - Sets a limits on the number of voting configuration exclusions at any one - time. Defaults to `10`. - -Since voting configuration exclusions are persistent and limited in number, -they must be cleaned up. Normally an exclusion is added when performing some -maintenance on the cluster, and the exclusions should be cleaned up when the -maintenance is complete. Clusters should have no voting configuration -exclusions in normal operation. - -If a node is excluded from the voting configuration because it is to be shut -down permanently then its exclusion can be removed once it has shut down and -been removed from the cluster. Exclusions can also be cleared if they were -created in error or were only required temporarily: - -[source,js] --------------------------------------------------- -# Wait for all the nodes with voting configuration exclusions to be removed -# from the cluster and then remove all the exclusions, allowing any node to -# return to the voting configuration in the future. -DELETE /_cluster/voting_config_exclusions -# Immediately remove all the voting configuration exclusions, allowing any node -# to return to the voting configuration in the future. -DELETE /_cluster/voting_config_exclusions?wait_for_removal=false --------------------------------------------------- -// CONSOLE [float] === Cluster bootstrapping @@ -356,135 +103,4 @@ handling node is the only voting master, so that it forms a quorum on its own. Because there is a risk of data loss when performing this command it requires the `accept_data_loss` parameter to be set to `true` in the URL. -[float] -=== Election scheduling - -Elasticsearch uses an election process to agree on an elected master node, both -at startup and if the existing elected master fails. Any master-eligible node -can start an election, and normally the first election that takes place will -succeed. Elections only usually fail when two nodes both happen to start their -elections at about the same time, so elections are scheduled randomly on each -node to avoid this happening. 
Nodes will retry elections until a master is -elected, backing off on failure, so that eventually an election will succeed -(with arbitrarily high probability). The following settings control the -scheduling of elections. - -`cluster.election.initial_timeout`:: - - Sets the upper bound on how long a node will wait initially, or after a - leader failure, before attempting its first election. This defaults to - `100ms`. - -`cluster.election.back_off_time`:: - - Sets the amount to increase the upper bound on the wait before an election - on each election failure. Note that this is _linear_ backoff. This defaults - to `100ms` - -`cluster.election.max_timeout`:: - - Sets the maximum upper bound on how long a node will wait before attempting - an first election, so that an network partition that lasts for a long time - does not result in excessively sparse elections. This defaults to `10s` - -`cluster.election.duration`:: - - Sets how long each election is allowed to take before a node considers it - to have failed and schedules a retry. This defaults to `500ms`. - -[float] -=== Fault detection - -An elected master periodically checks each of its followers in order to ensure -that they are still connected and healthy, and in turn each follower -periodically checks the health of the elected master. Elasticsearch allows for -these checks occasionally to fail or timeout without taking any action, and -will only consider a node to be truly faulty after a number of consecutive -checks have failed. The following settings control the behaviour of fault -detection. - -`cluster.fault_detection.follower_check.interval`:: - - Sets how long the elected master waits between checks of its followers. - Defaults to `1s`. - -`cluster.fault_detection.follower_check.timeout`:: - - Sets how long the elected master waits for a response to a follower check - before considering it to have failed. Defaults to `30s`. - -`cluster.fault_detection.follower_check.retry_count`:: - - Sets how many consecutive follower check failures must occur before the - elected master considers a follower node to be faulty and removes it from - the cluster. Defaults to `3`. - -`cluster.fault_detection.leader_check.interval`:: - - Sets how long each follower node waits between checks of its leader. - Defaults to `1s`. - -`cluster.fault_detection.leader_check.timeout`:: - - Sets how long each follower node waits for a response to a leader check - before considering it to have failed. Defaults to `30s`. - -`cluster.fault_detection.leader_check.retry_count`:: - - Sets how many consecutive leader check failures must occur before a - follower node considers the elected master to be faulty and attempts to - find or elect a new master. Defaults to `3`. - - -[float] -=== Discovery settings - -TODO move this to the discovery module docs - -Discovery operates in two phases: First, each node "probes" the addresses of -all known nodes by connecting to each address and attempting to identify the -node to which it is connected. Secondly it shares with the remote node a list -of all of its peers and the remote node responds with _its_ peers in turn. The -node then probes all the new nodes about which it just discovered, requests -their peers, and so on, until it has discovered an elected master node or -enough other masterless nodes that it can perform an election. If neither of -these occur quickly enough then it tries again. This process is controlled by -the following settings. 
- -`discovery.probe.connect_timeout`:: - - Sets how long to wait when attempting to connect to each address. Defaults - to `3s`. - -`discovery.probe.handshake_timeout`:: - - Sets how long to wait when attempting to identify the remote node via a - handshake. Defaults to `1s`. - -`discovery.find_peers_interval`:: - - Sets how long a node will wait before attempting another discovery round. - -`discovery.request_peers_timeout`:: - - Sets how long a node will wait after asking its peers again before - considering the request to have failed. - -[float] -=== Miscellaneous timeouts - -`cluster.join.timeout`:: - - Sets how long a node will wait after sending a request to join a cluster - before it considers the request to have failed and retries. Defaults to - `60s`. - -`cluster.publish.timeout`:: - Sets how long the elected master will wait after publishing a cluster state - update to receive acknowledgements from all its followers. If this timeout - occurs then the elected master may start to calculate and publish a - subsequent cluster state update, as long as it received enough - acknowledgements to know that the previous publication was committed; if it - did not receive enough acknowledgements to commit the update then it stands - down as the elected leader. diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 292748d1d7b90..191c9b0295bea 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -1,8 +1,68 @@ [[modules-discovery]] -== Discovery +== Discovery and cluster formation -The discovery module is responsible for discovering nodes within a -cluster, as well as electing a master node. +The discovery and cluster formation module is responsible for discovering nodes, +electing a master, and publishing the cluster state. + +This module is integrated with other modules, for example, +all communication between nodes is done using the <> module. + +It is separated into several sub modules, which are explained below: + +[float] +=== Discovery + +The discovery sub-module uses a list of _seed_ nodes in order to start +off the discovery process. At startup, or when disconnected from a master, +Elasticsearch tries to connect to each seed node in its list, and holds a +gossip-like conversation with them to find other nodes and to build a complete +picture of the master-eligible nodes in the cluster. + +include::discovery/hosts-providers.asciidoc[] + +[float] +=== Bootstrapping a cluster + +Starting an Elasticsearch cluster for the very first time requires a +cluster bootstrapping step. In <>, +with no discovery settings configured, this step is automatically +performed by the nodes themselves. As this auto-bootstrapping is +<>, running a node in <> +requires an explicit cluster bootstrapping step. + +include::discovery/bootstrap-cluster.asciidoc[] + +[float] +==== Adding and removing nodes + +It is recommended to have a small and fixed number of master-eligible nodes in +a cluster, and to scale the cluster up and down by adding and removing +non-master-eligible nodes only. However there are situations in which it may be +desirable to add or remove some master-eligible nodes to or from a cluster. + +Elasticsearch supports dynamically adding and removing master-eligible nodes, +but under certain conditions, special care must be taken. + +[float] +==== Cluster state publishing + +The master node is the only node in a cluster that can make changes to the +cluster state. 
The master node processes one cluster state update at a time, +applies the required changes and publishes the updated cluster state to all the +other nodes in the cluster. Each node receives the publish message, acknowledges +it, but does *not* yet apply it. If the master does not receive acknowledgement +from enough nodes within a certain time +(controlled by the `cluster.publish.timeout` setting and defaults to 30 +seconds) the cluster state change is rejected. + +Once enough nodes have responded, the cluster state is committed and a message +will be sent to all the nodes. The nodes then proceed to apply the new cluster +state to their internal state. The master node waits for all nodes to respond, +up to a timeout, before going ahead processing the next updates in the queue. +The `cluster.publish.timeout` is set by default to 30 seconds and is +measured from the moment the publishing started. + +TODO add lag detection Note, Elasticsearch is a peer to peer based system, nodes communicate with one another directly if operations are delegated / broadcast. All @@ -14,17 +74,37 @@ the other nodes in the cluster (the manner depends on the actual discovery implementation). [float] -=== Settings +[[no-master-block]] +==== No master block -The `cluster.name` allows to create separated clusters from one another. -The default value for the cluster name is `elasticsearch`, though it is -recommended to change this to reflect the logical group name of the -cluster running. +For the cluster to be fully operational, it must have an active master. +The `discovery.zen.no_master_block` settings controls what operations should be +rejected when there is no active master. -include::discovery/azure.asciidoc[] +The `discovery.zen.no_master_block` setting has two valid options: -include::discovery/ec2.asciidoc[] +[horizontal] +`all`:: All operations on the node--i.e. both read & writes--will be rejected. +This also applies for api cluster state read or write operations, like the get +index settings, put mapping and cluster state api. +`write`:: (default) Write operations will be rejected. Read operations will +succeed, based on the last known cluster configuration. This may result in +partial reads of stale data as this node may be isolated from the rest of the +cluster. -include::discovery/gce.asciidoc[] +The `discovery.zen.no_master_block` setting doesn't apply to nodes-based apis +(for example cluster stats, node info and node stats apis). Requests to these +apis will not be blocked and can run on any available node. + +[float] +==== Master election and fault detection + +The master election and fault detection sub modules cover advanced settings +to influence the election and fault detection processes. + +include::discovery/master-election.asciidoc[] + +[float] +=== Quorum-based decision making -include::discovery/zen.asciidoc[] +include::discovery/quorums.asciidoc[] diff --git a/docs/reference/modules/discovery/auto-reconfiguration.asciidoc b/docs/reference/modules/discovery/auto-reconfiguration.asciidoc new file mode 100644 index 0000000000000..b73681650d5f4 --- /dev/null +++ b/docs/reference/modules/discovery/auto-reconfiguration.asciidoc @@ -0,0 +1,112 @@ +[float] +==== Adding and removing nodes + +As nodes are added or removed Elasticsearch maintains an optimal level of fault +tolerance by updating the cluster's _voting configuration_, which is the set of +master-eligible nodes whose responses are counted when making decisions such as +electing a new master or committing a new cluster state. 
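+
+For reference, the voting configuration that is currently in force can be
+inspected through the cluster state API, using the same filter that is shown in
+the quorum-based decision making docs:
+
+[source,js]
+--------------------------------------------------
+GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config
+--------------------------------------------------
+// CONSOLE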
+ +It is recommended to have a small and fixed number of master-eligible nodes in +a cluster, and to scale the cluster up and down by adding and removing +non-master-eligible nodes only. However there are situations in which it may be +desirable to add or remove some master-eligible nodes to or from a cluster. + +If you wish to add some master-eligible nodes to your cluster, simply configure +the new nodes to find the existing cluster and start them up. Elasticsearch +will add the new nodes to the voting configuration if it is appropriate to do +so. + +When removing master-eligible nodes, it is important not to remove too many all +at the same time. For instance, if there are currently seven master-eligible +nodes and you wish to reduce this to three, it is not possible simply to stop +four of the nodes at once: to do so would leave only three nodes remaining, +which is less than half of the voting configuration, which means the cluster +cannot take any further actions. + +As long as there are at least three master-eligible nodes in the cluster, as a +general rule it is best to remove nodes one-at-a-time, allowing enough time for +the auto-reconfiguration to take effect after each removal. + +If there are only two master-eligible nodes then neither node can be safely +removed since both are required to reliably make progress, so you must first +inform Elasticsearch that one of the nodes should not be part of the voting +configuration, and that the voting power should instead be given to other +nodes, allowing the excluded node to be taken offline without preventing the +other node from making progress. A node which is added to a voting +configuration exclusion list still works normally, but Elasticsearch will try +and remove it from the voting configuration so its vote is no longer required, +and will never automatically move such a node back into the voting +configuration after it has been removed. Once a node has been successfully +reconfigured out of the voting configuration, it is safe to shut it down +without affecting the cluster's availability. A node can be added to the voting +configuration exclusion list using the following API: + +[source,js] +-------------------------------------------------- +# Add node to voting configuration exclusions list and wait for the system to +# auto-reconfigure the node out of the voting configuration up to the default +# timeout of 30 seconds +POST /_cluster/voting_config_exclusions/node_name +# Add node to voting configuration exclusions list and wait for +# auto-reconfiguration up to one minute +POST /_cluster/voting_config_exclusions/node_name?timeout=1m +-------------------------------------------------- +// CONSOLE + +The node that should be added to the exclusions list is specified using +<> in place of `node_name` here. If a call to the +voting configuration exclusions API fails then the call can safely be retried. +A successful response guarantees that the node has been removed from the voting +configuration and will not be reinstated. + +Although the voting configuration exclusions API is most useful for +down-scaling a two-node to a one-node cluster, it is also possible to use it to +remove multiple nodes from larger clusters all at the same time. Adding +multiple nodes to the exclusions list has the system try to auto-reconfigure +all of these nodes out of the voting configuration, allowing them to be safely +shut down while keeping the cluster available. 
In the example described above,
+where a seven-master-node cluster is shrunk down to only three master nodes, you
+could add four nodes to the exclusions list, wait for confirmation, and then
+shut them down simultaneously.
+
+Adding an exclusion for a node creates an entry for that node in the voting
+configuration exclusions list, which causes the system to automatically try to
+reconfigure the voting configuration to remove that node, and prevents it from
+returning to the voting configuration once it has been removed. The current set
+of exclusions is stored in the cluster state and can be inspected as follows:
+
+[source,js]
+--------------------------------------------------
+GET /_cluster/state?filter_path=metadata.cluster_coordination.voting_config_exclusions
+--------------------------------------------------
+// CONSOLE
+
+This list is limited in size by the following setting:
+
+`cluster.max_voting_config_exclusions`::
+
+    Sets a limit on the number of voting configuration exclusions at any one
+    time. Defaults to `10`.
+
+Since voting configuration exclusions are persistent and limited in number,
+they must be cleaned up. Normally an exclusion is added when performing some
+maintenance on the cluster, and the exclusions should be cleaned up when the
+maintenance is complete. Clusters should have no voting configuration
+exclusions in normal operation.
+
+If a node is excluded from the voting configuration because it is to be shut
+down permanently then its exclusion can be removed once it has shut down and
+been removed from the cluster. Exclusions can also be cleared if they were
+created in error or were only required temporarily:
+
+[source,js]
+--------------------------------------------------
+# Wait for all the nodes with voting configuration exclusions to be removed
+# from the cluster and then remove all the exclusions, allowing any node to
+# return to the voting configuration in the future.
+DELETE /_cluster/voting_config_exclusions
+# Immediately remove all the voting configuration exclusions, allowing any node
+# to return to the voting configuration in the future.
+DELETE /_cluster/voting_config_exclusions?wait_for_removal=false
+--------------------------------------------------
+// CONSOLE
diff --git a/docs/reference/modules/discovery/azure.asciidoc b/docs/reference/modules/discovery/azure.asciidoc
deleted file mode 100644
index 1343819b02a85..0000000000000
--- a/docs/reference/modules/discovery/azure.asciidoc
+++ /dev/null
@@ -1,5 +0,0 @@
-[[modules-discovery-azure-classic]]
-=== Azure Classic Discovery
-
-Azure classic discovery allows to use the Azure Classic APIs to perform automatic discovery (similar to multicast).
-It is available as a plugin. See {plugins}/discovery-azure-classic.html[discovery-azure-classic] for more information.
diff --git a/docs/reference/modules/discovery/bootstrap-cluster.asciidoc b/docs/reference/modules/discovery/bootstrap-cluster.asciidoc
new file mode 100644
index 0000000000000..1217233417117
--- /dev/null
+++ b/docs/reference/modules/discovery/bootstrap-cluster.asciidoc
@@ -0,0 +1,67 @@
+[[modules-discovery-bootstrap-cluster]]
+=== Bootstrapping a cluster
+
+Starting an Elasticsearch cluster for the very first time requires a
+cluster bootstrapping step.
+
+The simplest way to bootstrap a cluster is by specifying the node names
+or transport addresses of a non-empty subset of the master-eligible nodes
+before start-up. 
The node setting `cluster.initial_master_nodes`, which
+takes a list of node names or transport addresses, can either be specified
+on the command line when starting up the nodes, or be added to the node
+configuration file `elasticsearch.yml`.
+
+For a cluster with 3 master-eligible nodes (named master-a, master-b, and master-c)
+the configuration could look as follows. Note that if you have not explicitly
+configured a node name, this name defaults to the host name, so using the host
+names will work as well:
+
+[source,yaml]
+--------------------------------------------------
+cluster.initial_master_nodes:
+  - master-a
+  - master-b
+  - master-c
+--------------------------------------------------
+
+TODO provide another example with ip addresses (+ possibly port)
+
+While it is sufficient to set this on a single master-eligible node
+in the cluster, and only mention a single master-eligible node, using
+multiple nodes for bootstrapping allows the bootstrap process to go
+through even if not all nodes are available. In any case, when
+specifying the list of initial master nodes, **it is vitally important**
+to configure each node with exactly the same list of nodes, to prevent
+two independent clusters from forming. Typically you will set this
+on the nodes that are mentioned in the list of initial master nodes.
+
+WARNING: You must put exactly the same set of initial master nodes in each
+  configuration file in order to be sure that only a single cluster forms during
+  bootstrapping and therefore to avoid the risk of data loss.
+
+
+It is also possible to set the initial set of master nodes on the
+command line used to start Elasticsearch:
+
+[source,bash]
+--------------------------------------------------
+$ bin/elasticsearch -Ecluster.initial_master_nodes=master-a,master-b,master-c
+--------------------------------------------------
+
+Just as with the config file, this additional command-line parameter
+can be removed once a cluster has successfully formed.
+
+[float]
+==== Choosing a cluster name
+
+The `cluster.name` setting allows you to create multiple clusters which are
+separated from one another. The default value for the cluster name is
+`elasticsearch`, though it is recommended to change this to reflect the
+logical group name of the cluster running.
+
+
+[float]
+==== Auto-bootstrapping in development mode
+
+If the cluster is running with a completely default configuration then it will
+automatically bootstrap based on the nodes that could be discovered within a
+short time after startup. Since nodes may not always reliably discover each
+other quickly enough, this automatic bootstrapping is not always reliable and
+cannot be used in production deployments.
diff --git a/docs/reference/modules/discovery/ec2.asciidoc b/docs/reference/modules/discovery/ec2.asciidoc
deleted file mode 100644
index ba15f6bffa4cd..0000000000000
--- a/docs/reference/modules/discovery/ec2.asciidoc
+++ /dev/null
@@ -1,4 +0,0 @@
-[[modules-discovery-ec2]]
-=== EC2 Discovery
-
-EC2 discovery is available as a plugin. See {plugins}/discovery-ec2.html[discovery-ec2] for more information.
diff --git a/docs/reference/modules/discovery/gce.asciidoc b/docs/reference/modules/discovery/gce.asciidoc
deleted file mode 100644
index ea367d52ceb75..0000000000000
--- a/docs/reference/modules/discovery/gce.asciidoc
+++ /dev/null
@@ -1,6 +0,0 @@
-[[modules-discovery-gce]]
-=== Google Compute Engine Discovery
-
-Google Compute Engine (GCE) discovery allows to use the GCE APIs to perform automatic discovery (similar to multicast).
-It is available as a plugin. 
See {plugins}/discovery-gce.html[discovery-gce] for more information. - diff --git a/docs/reference/modules/discovery/hosts-providers.asciidoc b/docs/reference/modules/discovery/hosts-providers.asciidoc new file mode 100644 index 0000000000000..ecaf32e37992a --- /dev/null +++ b/docs/reference/modules/discovery/hosts-providers.asciidoc @@ -0,0 +1,150 @@ +[[modules-discovery-hosts-providers]] +=== Discovery + +The cluster formation module uses a list of _seed_ nodes in order to start +off the discovery process. At startup, or when disconnected from a master, +Elasticsearch tries to connect to each seed node in its list, and holds a +gossip-like conversation with them to find other nodes and to build a complete +picture of the master-eligible nodes in the cluster. By default the cluster formation +module offers two hosts providers to configure the list of seed nodes: +a _settings-based_ and a _file-based_ hosts provider, but can be extended to +support cloud environments and other forms of host providers via plugins. +Host providers are configured using the `discovery.zen.hosts_provider` setting, +which defaults to the _settings-based_ hosts provider. Multiple hosts providers +can be specified as a list. + +[float] +[[settings-based-hosts-provider]] +===== Settings-based hosts provider + +The settings-based hosts provider use a node setting to configure a static +list of hosts to use as seed nodes. These hosts can be specified as hostnames +or IP addresses; hosts specified as hostnames are resolved to IP addresses +during each round of pinging. Note that if you are in an environment where +DNS resolutions vary with time, you might need to adjust your +<>. + +The list of hosts is set using the `discovery.zen.ping.unicast.hosts` static +setting. This is either an array of hosts or a comma-delimited string. Each +value should be in the form of `host:port` or `host` (where `port` defaults to +the setting `transport.profiles.default.port` falling back to +`transport.tcp.port` if not set). Note that IPv6 hosts must be bracketed. The +default for this setting is `127.0.0.1, [::1]` + +Additionally, the `discovery.zen.ping.unicast.hosts.resolve_timeout` configures the +amount of time to wait for DNS lookups on each round of pinging. This is +specified as a <> and defaults to 5s. + +Unicast discovery uses the <> module to perform the +discovery. + +[float] +[[file-based-hosts-provider]] +===== File-based hosts provider + +The file-based hosts provider configures a list of hosts via an external file. +Elasticsearch reloads this file when it changes, so that the list of seed nodes +can change dynamically without needing to restart each node. For example, this +gives a convenient mechanism for an Elasticsearch instance that is run in a +Docker container to be dynamically supplied with a list of IP addresses to +connect to when those IP addresses may not be known at node startup. + +To enable file-based discovery, configure the `file` hosts provider as follows: + +[source,txt] +---------------------------------------------------------------- +discovery.zen.hosts_provider: file +---------------------------------------------------------------- + +Then create a file at `$ES_PATH_CONF/unicast_hosts.txt` in the format described +below. Any time a change is made to the `unicast_hosts.txt` file the new +changes will be picked up by Elasticsearch and the new hosts list will be used. 
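+
+As an illustration only (the addresses are placeholders, and this assumes the
+`ES_PATH_CONF` environment variable points at the node's config directory), the
+file could be created or updated on a running node as follows and will be
+picked up without a restart:
+
+[source,bash]
+--------------------------------------------------
+# write two seed addresses to the dynamically-reloaded hosts file
+cat <<EOF > "$ES_PATH_CONF/unicast_hosts.txt"
+10.10.10.5
+10.10.10.6:9305
+EOF
+--------------------------------------------------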
+
+Note that the file-based discovery plugin augments the unicast hosts list in
+`elasticsearch.yml`: if there are valid unicast host entries in
+`discovery.zen.ping.unicast.hosts` then they will be used in addition to those
+supplied in `unicast_hosts.txt`.
+
+The `discovery.zen.ping.unicast.hosts.resolve_timeout` setting also applies to DNS
+lookups for nodes specified by address via file-based discovery. This is
+specified as a <> and defaults to 5s.
+
+The format of the file is to specify one node entry per line. Each node entry
+consists of the host (host name or IP address) and an optional transport port
+number. If the port number is specified, it must come immediately after the
+host (on the same line) separated by a `:`. If the port number is not
+specified, a default value of 9300 is used.
+
+For example, here is the content of `unicast_hosts.txt` for a cluster with four
+nodes that participate in unicast discovery, some of which are not running on
+the default port:
+
+[source,txt]
+----------------------------------------------------------------
+10.10.10.5
+10.10.10.6:9305
+10.10.10.5:10005
+# an IPv6 address
+[2001:0db8:85a3:0000:0000:8a2e:0370:7334]:9301
+----------------------------------------------------------------
+
+Host names are allowed instead of IP addresses (similar to
+`discovery.zen.ping.unicast.hosts`), and IPv6 addresses must be specified in
+brackets with the port coming after the brackets.
+
+It is also possible to add comments to this file. All comments must appear on
+their own lines, starting with `#` (i.e. comments cannot start in the middle
+of a line).
+
+[float]
+[[ec2-hosts-provider]]
+===== EC2 hosts provider
+
+The {plugins}/discovery-ec2.html[EC2 discovery plugin] adds a hosts provider
+that uses the https://github.com/aws/aws-sdk-java[AWS API] to find a list of seed nodes.
+
+[float]
+[[azure-classic-hosts-provider]]
+===== Azure Classic hosts provider
+
+The {plugins}/discovery-azure-classic.html[Azure Classic discovery plugin] adds a hosts provider
+that uses the Azure Classic API to find a list of seed nodes.
+
+[float]
+[[gce-hosts-provider]]
+===== Google Compute Engine hosts provider
+
+The {plugins}/discovery-gce.html[GCE discovery plugin] adds a hosts provider
+that uses the GCE API to find a list of seed nodes.
+
+[float]
+==== Discovery settings
+
+Discovery operates in two phases: First, each node "probes" the addresses of
+all known nodes by connecting to each address and attempting to identify the
+node to which it is connected. Secondly, it shares with the remote node a list
+of all of its peers and the remote node responds with _its_ peers in turn. The
+node then probes all the new nodes that it has just discovered, requests
+their peers, and so on, until it has discovered an elected master node or
+enough other masterless nodes that it can perform an election. If neither of
+these occurs quickly enough then it tries again. This process is controlled by
+the following settings.
+
+`discovery.probe.connect_timeout`::
+
+    Sets how long to wait when attempting to connect to each address. Defaults
+    to `3s`.
+
+`discovery.probe.handshake_timeout`::
+
+    Sets how long to wait when attempting to identify the remote node via a
+    handshake. Defaults to `1s`.
+
+`discovery.find_peers_interval`::
+
+    Sets how long a node will wait before attempting another discovery round.
+
+`discovery.request_peers_timeout`::
+
+    Sets how long a node will wait after asking its peers again before
+    considering the request to have failed. 
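+
+These discovery settings are configured per node, typically in
+`elasticsearch.yml`. As a sketch only (the values are arbitrary examples, not
+recommendations), a node on a slow or congested network might be given more
+time to probe its peers:
+
+[source,yaml]
+--------------------------------------------------
+# allow more time to connect to and identify each probed address
+discovery.probe.connect_timeout: 10s
+discovery.probe.handshake_timeout: 3s
+--------------------------------------------------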
diff --git a/docs/reference/modules/discovery/master-election.asciidoc b/docs/reference/modules/discovery/master-election.asciidoc
new file mode 100644
index 0000000000000..4591720ea5963
--- /dev/null
+++ b/docs/reference/modules/discovery/master-election.asciidoc
@@ -0,0 +1,97 @@
+[float]
+[[master-election]]
+==== Master Election
+
+Elasticsearch uses an election process to agree on an elected master node, both
+at startup and if the existing elected master fails. Any master-eligible node
+can start an election, and normally the first election that takes place will
+succeed. Elections usually fail only when two nodes both happen to start their
+elections at about the same time, so elections are scheduled randomly on each
+node to avoid this happening. Nodes will retry elections until a master is
+elected, backing off on failure, so that eventually an election will succeed
+(with arbitrarily high probability). The following settings control the
+scheduling of elections.
+
+`cluster.election.initial_timeout`::
+
+    Sets the upper bound on how long a node will wait initially, or after a
+    leader failure, before attempting its first election. This defaults to
+    `100ms`.
+
+`cluster.election.back_off_time`::
+
+    Sets the amount by which to increase the upper bound on the wait before an
+    election on each election failure. Note that this is _linear_ backoff. This
+    defaults to `100ms`.
+
+`cluster.election.max_timeout`::
+
+    Sets the maximum upper bound on how long a node will wait before attempting
+    a first election, so that a network partition that lasts for a long time
+    does not result in excessively sparse elections. This defaults to `10s`.
+
+`cluster.election.duration`::
+
+    Sets how long each election is allowed to take before a node considers it
+    to have failed and schedules a retry. This defaults to `500ms`.
+
+
+[float]
+[[node-joining]]
+==== Joining an elected master
+
+During master election, or when joining an existing cluster, a node will send
+a join request to the master in order to be officially added to the cluster. This join
+process can be configured with the following settings.
+
+`cluster.join.timeout`::
+
+    Sets how long a node will wait after sending a request to join a cluster
+    before it considers the request to have failed and retries. Defaults to
+    `60s`.
+
+[float]
+[[fault-detection]]
+==== Fault Detection
+
+An elected master periodically checks each of its followers in order to ensure
+that they are still connected and healthy, and in turn each follower
+periodically checks the health of the elected master. Elasticsearch allows
+these checks occasionally to fail or time out without taking any action, and
+will only consider a node to be truly faulty after a number of consecutive
+checks have failed. The following settings control the behaviour of fault
+detection.
+
+`cluster.fault_detection.follower_check.interval`::
+
+    Sets how long the elected master waits between checks of its followers.
+    Defaults to `1s`.
+
+`cluster.fault_detection.follower_check.timeout`::
+
+    Sets how long the elected master waits for a response to a follower check
+    before considering it to have failed. Defaults to `30s`.
+
+`cluster.fault_detection.follower_check.retry_count`::
+
+    Sets how many consecutive follower check failures must occur before the
+    elected master considers a follower node to be faulty and removes it from
+    the cluster. Defaults to `3`. 
+ +`cluster.fault_detection.leader_check.interval`:: + + Sets how long each follower node waits between checks of its leader. + Defaults to `1s`. + +`cluster.fault_detection.leader_check.timeout`:: + + Sets how long each follower node waits for a response to a leader check + before considering it to have failed. Defaults to `30s`. + +`cluster.fault_detection.leader_check.retry_count`:: + + Sets how many consecutive leader check failures must occur before a + follower node considers the elected master to be faulty and attempts to + find or elect a new master. Defaults to `3`. + +TODO add lag detection \ No newline at end of file diff --git a/docs/reference/modules/discovery/quorums.asciidoc b/docs/reference/modules/discovery/quorums.asciidoc new file mode 100644 index 0000000000000..a8b30d6e79a96 --- /dev/null +++ b/docs/reference/modules/discovery/quorums.asciidoc @@ -0,0 +1,187 @@ +[[modules-discovery-quorums]] +=== Quorum-based decision making + +Electing a master node and changing the cluster state are the two fundamental +tasks that master-eligible nodes must work together to perform. It is important +that these activities work robustly even if some nodes have failed, and +Elasticsearch achieves this robustness by only considering each action to have +succeeded on receipt of responses from a _quorum_, a subset of the +master-eligible nodes in the cluster. The advantage of requiring only a subset +of the nodes to respond is that it allows for some of the nodes to fail without +preventing the cluster from making progress, and the quorums are carefully +chosen so as not to allow the cluster to "split brain", i.e. to be partitioned +into two pieces each of which may make decisions that are inconsistent with +those of the other piece. + +Elasticsearch allows you to add and remove master-eligible nodes to a running +cluster. In many cases you can do this simply by starting or stopping the nodes +as required, as described in more detail below. + +As nodes are added or removed Elasticsearch maintains an optimal level of fault +tolerance by updating the cluster's _voting configuration_, which is the set of +master-eligible nodes whose responses are counted when making decisions such as +electing a new master or committing a new cluster state. A decision is only +made once more than half of the nodes in the voting configuration have +responded. Usually the voting configuration is the same as the set of all the +master-eligible nodes that are currently in the cluster, but there are some +situations in which they may be different. + +To be sure that the cluster remains available you **must not stop half or more +of the nodes in the voting configuration at the same time**. As long as more +than half of the voting nodes are available the cluster can still work +normally. This means that if there are three or four master-eligible nodes then +the cluster can tolerate one of them being unavailable; if there are two or +fewer master-eligible nodes then they must all remain available. + +After a node has joined or left the cluster the elected master must issue a +cluster-state update that adjusts the voting configuration to match, and this +can take a short time to complete. It is important to wait for this adjustment +to complete before removing more nodes from the cluster. 
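+
+For example, after taking one master-eligible node offline and before touching
+the next, you might wait until the cluster reports the expected number of nodes
+again (a sketch using the cluster health API; adjust the node count and timeout
+to your cluster):
+
+[source,js]
+--------------------------------------------------
+GET /_cluster/health?wait_for_nodes=3&timeout=30s
+--------------------------------------------------
+// CONSOLE
+
+This only confirms that the membership change has been observed; the voting
+configuration itself can be inspected directly using the cluster state API
+described under auto-reconfiguration below.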
+ +[float] +=== Getting the initial quorum + +When a brand-new cluster starts up for the first time, one of the tasks it must +perform is to elect its first master node, for which it needs to know the set +of master-eligible nodes whose votes should count in this first election. This +initial voting configuration is known as the _bootstrap configuration_. + +It is important that the bootstrap configuration identifies exactly which nodes +should vote in the first election, and it is not sufficient to configure each +node with an expectation of how many nodes there should be in the cluster. It +is also important to note that the bootstrap configuration must come from +outside the cluster: there is no safe way for the cluster to determine the +bootstrap configuration correctly on its own. + +If the bootstrap configuration is not set correctly then there is a risk when +starting up a brand-new cluster is that you accidentally form two separate +clusters instead of one. This could lead to data loss: you might start using +both clusters before noticing that anything had gone wrong, and it will then be +impossible to merge them together later. + +NOTE: To illustrate the problem with configuring each node to expect a certain +cluster size, imagine starting up a three-node cluster in which each node knows +that it is going to be part of a three-node cluster. A majority of three nodes +is two, so normally the first two nodes to discover each other will form a +cluster and the third node will join them a short time later. However, imagine +that four nodes were erroneously started instead of three: in this case there +are enough nodes to form two separate clusters. Of course if each node is +started manually then it's unlikely that too many nodes are started, but it's +certainly possible to get into this situation if using a more automated +orchestrator, particularly if the orchestrator is not resilient to failures +such as network partitions. + +The cluster bootstrapping process is only required the very first time a whole +cluster starts up: new nodes joining an established cluster can safely obtain +all the information they need from the elected master, and nodes that have +previously been part of a cluster will have stored to disk all the information +required when restarting. + +[float] +=== Cluster maintenance, rolling restarts and migrations + +Many cluster maintenance tasks involve temporarily shutting down one or more +nodes and then starting them back up again. By default Elasticsearch can remain +available if one of its master-eligible nodes is taken offline, such as during +a <>. Furthermore, if multiple nodes are +stopped and then started again then it will automatically recover, such as +during a <>. There is no need to take any +further action with the APIs described here in these cases, because the set of +master nodes is not changing permanently. + +It is also possible to perform a migration of a cluster onto entirely new nodes +without taking the cluster offline, via a _rolling migration_. A rolling +migration is similar to a rolling restart, in that it is performed one node at +a time, and also requires no special handling for the master-eligible nodes as +long as there are at least two of them available at all times. + +TODO the above is only true if the maintenance happens slowly enough, otherwise +the configuration might not catch up. Need to add this to the rolling restart +docs. 
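+
+During such maintenance it can also be helpful to check which node is currently
+the elected master before taking a master-eligible node offline, for example
+with the cat API (shown purely as an illustration):
+
+[source,js]
+--------------------------------------------------
+GET /_cat/master?v
+--------------------------------------------------
+// CONSOLE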
+ +[float] +==== Auto-reconfiguration + +Nodes may join or leave the cluster, and Elasticsearch reacts by making +corresponding changes to the voting configuration in order to ensure that the +cluster is as resilient as possible. The default auto-reconfiguration behaviour +is expected to give the best results in most situation. The current voting +configuration is stored in the cluster state so you can inspect its current +contents as follows: + +[source,js] +-------------------------------------------------- +GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config +-------------------------------------------------- +// CONSOLE + +NOTE: The current voting configuration is not necessarily the same as the set +of all available master-eligible nodes in the cluster. Altering the voting +configuration itself involves taking a vote, so it takes some time to adjust +the configuration as nodes join or leave the cluster. Also, there are +situations where the most resilient configuration includes unavailable nodes, +or does not include some available nodes, and in these situations the voting +configuration will differ from the set of available master-eligible nodes in +the cluster. + +Larger voting configurations are usually more resilient, so Elasticsearch will +normally prefer to add master-eligible nodes to the voting configuration once +they have joined the cluster. Similarly, if a node in the voting configuration +leaves the cluster and there is another master-eligible node in the cluster +that is not in the voting configuration then it is preferable to swap these two +nodes over, leaving the size of the voting configuration unchanged but +increasing its resilience. + +It is not so straightforward to automatically remove nodes from the voting +configuration after they have left the cluster, and different strategies have +different benefits and drawbacks, so the right choice depends on how the +cluster will be used and is controlled by the following setting. + +`cluster.auto_shrink_voting_configuration`:: + + Defaults to `true`, meaning that the voting configuration will + automatically shrink, shedding departed nodes, as long as it still contains + at least 3 nodes. If set to `false`, the voting configuration never + automatically shrinks; departed nodes must be removed manually using the + vote withdrawal API described below. + +NOTE: If `cluster.auto_shrink_voting_configuration` is set to `true`, the +recommended and default setting, and there are at least three master-eligible +nodes in the cluster, then Elasticsearch remains capable of processing +cluster-state updates as long as all but one of its master-eligible nodes are +healthy. + +There are situations in which Elasticsearch might tolerate the loss of multiple +nodes, but this is not guaranteed under all sequences of failures. If this +setting is set to `false` then departed nodes must be removed from the voting +configuration manually, using the vote withdrawal API described below, to achieve +the desired level of resilience. + +Note that Elasticsearch will not suffer from a "split-brain" inconsistency +however it is configured. This setting only affects its availability in the +event of the failure of some of its nodes, and the administrative tasks that +must be performed as nodes join and leave the cluster. + +[float] +==== Even numbers of master-eligible nodes + +There should normally be an odd number of master-eligible nodes in a cluster. 
+If there is an even number then Elasticsearch will leave one of them out of the +voting configuration to ensure that it has an odd size. This does not decrease +the failure-tolerance of the cluster, and in fact improves it slightly: if the +cluster is partitioned into two even halves then one of the halves will contain +a majority of the voting configuration and will be able to keep operating, +whereas if all of the master-eligible nodes' votes were counted then neither +side could make any progress in this situation. + +For instance if there are four master-eligible nodes in the cluster and the +voting configuration contained all of them then any quorum-based decision would +require votes from at least three of them, which means that the cluster can +only tolerate the loss of a single master-eligible node. If this cluster were +split into two equal halves then neither half would contain three +master-eligible nodes so would not be able to make any progress. However if the +voting configuration contains only three of the four master-eligible nodes then +the cluster is still only fully tolerant to the loss of one node, but +quorum-based decisions require votes from two of the three voting nodes. In the +event of an even split, one half will contain two of the three voting nodes so +will remain available. diff --git a/docs/reference/modules/discovery/zen.asciidoc b/docs/reference/modules/discovery/zen.asciidoc deleted file mode 100644 index e9be7aa52e890..0000000000000 --- a/docs/reference/modules/discovery/zen.asciidoc +++ /dev/null @@ -1,226 +0,0 @@ -[[modules-discovery-zen]] -=== Zen Discovery - -Zen discovery is the built-in, default, discovery module for Elasticsearch. It -provides unicast and file-based discovery, and can be extended to support cloud -environments and other forms of discovery via plugins. - -Zen discovery is integrated with other modules, for example, all communication -between nodes is done using the <> module. - -It is separated into several sub modules, which are explained below: - -[float] -[[ping]] -==== Ping - -This is the process where a node uses the discovery mechanisms to find other -nodes. - -[float] -[[discovery-seed-nodes]] -==== Seed nodes - -Zen discovery uses a list of _seed_ nodes in order to start off the discovery -process. At startup, or when electing a new master, Elasticsearch tries to -connect to each seed node in its list, and holds a gossip-like conversation with -them to find other nodes and to build a complete picture of the cluster. By -default there are two methods for configuring the list of seed nodes: _unicast_ -and _file-based_. It is recommended that the list of seed nodes comprises the -list of master-eligible nodes in the cluster. - -[float] -[[unicast]] -===== Unicast - -Unicast discovery configures a static list of hosts for use as seed nodes. -These hosts can be specified as hostnames or IP addresses; hosts specified as -hostnames are resolved to IP addresses during each round of pinging. Note that -if you are in an environment where DNS resolutions vary with time, you might -need to adjust your <>. - -The list of hosts is set using the `discovery.zen.ping.unicast.hosts` static -setting. This is either an array of hosts or a comma-delimited string. Each -value should be in the form of `host:port` or `host` (where `port` defaults to -the setting `transport.profiles.default.port` falling back to -`transport.tcp.port` if not set). Note that IPv6 hosts must be bracketed. 
The -default for this setting is `127.0.0.1, [::1]` - -Additionally, the `discovery.zen.ping.unicast.resolve_timeout` configures the -amount of time to wait for DNS lookups on each round of pinging. This is -specified as a <> and defaults to 5s. - -Unicast discovery uses the <> module to perform the -discovery. - -[float] -[[file-based-hosts-provider]] -===== File-based - -In addition to hosts provided by the static `discovery.zen.ping.unicast.hosts` -setting, it is possible to provide a list of hosts via an external file. -Elasticsearch reloads this file when it changes, so that the list of seed nodes -can change dynamically without needing to restart each node. For example, this -gives a convenient mechanism for an Elasticsearch instance that is run in a -Docker container to be dynamically supplied with a list of IP addresses to -connect to for Zen discovery when those IP addresses may not be known at node -startup. - -To enable file-based discovery, configure the `file` hosts provider as follows: - -[source,txt] ----------------------------------------------------------------- -discovery.zen.hosts_provider: file ----------------------------------------------------------------- - -Then create a file at `$ES_PATH_CONF/unicast_hosts.txt` in the format described -below. Any time a change is made to the `unicast_hosts.txt` file the new -changes will be picked up by Elasticsearch and the new hosts list will be used. - -Note that the file-based discovery plugin augments the unicast hosts list in -`elasticsearch.yml`: if there are valid unicast host entries in -`discovery.zen.ping.unicast.hosts` then they will be used in addition to those -supplied in `unicast_hosts.txt`. - -The `discovery.zen.ping.unicast.resolve_timeout` setting also applies to DNS -lookups for nodes specified by address via file-based discovery. This is -specified as a <> and defaults to 5s. - -The format of the file is to specify one node entry per line. Each node entry -consists of the host (host name or IP address) and an optional transport port -number. If the port number is specified, is must come immediately after the -host (on the same line) separated by a `:`. If the port number is not -specified, a default value of 9300 is used. - -For example, this is an example of `unicast_hosts.txt` for a cluster with four -nodes that participate in unicast discovery, some of which are not running on -the default port: - -[source,txt] ----------------------------------------------------------------- -10.10.10.5 -10.10.10.6:9305 -10.10.10.5:10005 -# an IPv6 address -[2001:0db8:85a3:0000:0000:8a2e:0370:7334]:9301 ----------------------------------------------------------------- - -Host names are allowed instead of IP addresses (similar to -`discovery.zen.ping.unicast.hosts`), and IPv6 addresses must be specified in -brackets with the port coming after the brackets. - -It is also possible to add comments to this file. All comments must appear on -their lines starting with `#` (i.e. comments cannot start in the middle of a -line). - -[float] -[[master-election]] -==== Master Election - -As part of the ping process a master of the cluster is either elected or joined -to. This is done automatically. The `discovery.zen.ping_timeout` (which defaults -to `3s`) determines how long the node will wait before deciding on starting an -election or joining an existing cluster. Three pings will be sent over this -timeout interval. In case where no decision can be reached after the timeout, -the pinging process restarts. 
In slow or congested networks, three seconds -might not be enough for a node to become aware of the other nodes in its -environment before making an election decision. Increasing the timeout should -be done with care in that case, as it will slow down the election process. Once -a node decides to join an existing formed cluster, it will send a join request -to the master (`discovery.zen.join_timeout`) with a timeout defaulting at 20 -times the ping timeout. - -When the master node stops or has encountered a problem, the cluster nodes start -pinging again and will elect a new master. This pinging round also serves as a -protection against (partial) network failures where a node may unjustly think -that the master has failed. In this case the node will simply hear from other -nodes about the currently active master. - -If `discovery.zen.master_election.ignore_non_master_pings` is `true`, pings from -nodes that are not master eligible (nodes where `node.master` is `false`) are -ignored during master election; the default value is `false`. - -Nodes can be excluded from becoming a master by setting `node.master` to -`false`. - -The `discovery.zen.minimum_master_nodes` sets the minimum number of master -eligible nodes that need to join a newly elected master in order for an election -to complete and for the elected node to accept its mastership. The same setting -controls the minimum number of active master eligible nodes that should be a -part of any active cluster. If this requirement is not met the active master -node will step down and a new master election will begin. - -This setting must be set to a <> of your master -eligible nodes. It is recommended to avoid having only two master eligible -nodes, since a quorum of two is two. Therefore, a loss of either master eligible -node will result in an inoperable cluster. - -[float] -[[fault-detection]] -==== Fault Detection - -There are two fault detection processes running. The first is by the master, to -ping all the other nodes in the cluster and verify that they are alive. And on -the other end, each node pings to master to verify if its still alive or an -election process needs to be initiated. - -The following settings control the fault detection process using the -`discovery.zen.fd` prefix: - -[cols="<,<",options="header",] -|======================================================================= -|Setting |Description -|`ping_interval` |How often a node gets pinged. Defaults to `1s`. - -|`ping_timeout` |How long to wait for a ping response, defaults to -`30s`. - -|`ping_retries` |How many ping failures / timeouts cause a node to be -considered failed. Defaults to `3`. -|======================================================================= - -[float] -==== Cluster state updates - -The master node is the only node in a cluster that can make changes to the -cluster state. The master node processes one cluster state update at a time, -applies the required changes and publishes the updated cluster state to all the -other nodes in the cluster. Each node receives the publish message, acknowledges -it, but does *not* yet apply it. If the master does not receive acknowledgement -from at least `discovery.zen.minimum_master_nodes` nodes within a certain time -(controlled by the `discovery.zen.commit_timeout` setting and defaults to 30 -seconds) the cluster state change is rejected. - -Once enough nodes have responded, the cluster state is committed and a message -will be sent to all the nodes. 
The nodes then proceed to apply the new cluster -state to their internal state. The master node waits for all nodes to respond, -up to a timeout, before going ahead processing the next updates in the queue. -The `discovery.zen.publish_timeout` is set by default to 30 seconds and is -measured from the moment the publishing started. Both timeout settings can be -changed dynamically through the <> - -[float] -[[no-master-block]] -==== No master block - -For the cluster to be fully operational, it must have an active master and the -number of running master eligible nodes must satisfy the -`discovery.zen.minimum_master_nodes` setting if set. The -`discovery.zen.no_master_block` settings controls what operations should be -rejected when there is no active master. - -The `discovery.zen.no_master_block` setting has two valid options: - -[horizontal] -`all`:: All operations on the node--i.e. both read & writes--will be rejected. -This also applies for api cluster state read or write operations, like the get -index settings, put mapping and cluster state api. -`write`:: (default) Write operations will be rejected. Read operations will -succeed, based on the last known cluster configuration. This may result in -partial reads of stale data as this node may be isolated from the rest of the -cluster. - -The `discovery.zen.no_master_block` setting doesn't apply to nodes-based apis -(for example cluster stats, node info and node stats apis). Requests to these -apis will not be blocked and can run on any available node. diff --git a/docs/reference/modules/node.asciidoc b/docs/reference/modules/node.asciidoc index 9287e171129ff..a94f76c55de1f 100644 --- a/docs/reference/modules/node.asciidoc +++ b/docs/reference/modules/node.asciidoc @@ -19,7 +19,7 @@ purpose: <>:: A node that has `node.master` set to `true` (default), which makes it eligible -to be <>, which controls +to be <>, which controls the cluster. <>:: @@ -69,7 +69,7 @@ and deciding which shards to allocate to which nodes. It is important for cluster health to have a stable master node. Any master-eligible node (all nodes by default) may be elected to become the -master node by the <>. +master node by the <>. IMPORTANT: Master nodes must have access to the `data/` directory (just like `data` nodes) as this is where the cluster state is persisted between node restarts. @@ -105,74 +105,6 @@ NOTE: These settings apply only when {xpack} is not installed. To create a dedicated master-eligible node when {xpack} is installed, see <>. endif::include-xpack[] - -[float] -[[split-brain]] -==== Avoiding split brain with `minimum_master_nodes` - -To prevent data loss, it is vital to configure the -`discovery.zen.minimum_master_nodes` setting (which defaults to `1`) so that -each master-eligible node knows the _minimum number of master-eligible nodes_ -that must be visible in order to form a cluster. - -To explain, imagine that you have a cluster consisting of two master-eligible -nodes. A network failure breaks communication between these two nodes. Each -node sees one master-eligible node... itself. With `minimum_master_nodes` set -to the default of `1`, this is sufficient to form a cluster. Each node elects -itself as the new master (thinking that the other master-eligible node has -died) and the result is two clusters, or a _split brain_. These two nodes -will never rejoin until one node is restarted. Any data that has been written -to the restarted node will be lost. 
- -Now imagine that you have a cluster with three master-eligible nodes, and -`minimum_master_nodes` set to `2`. If a network split separates one node from -the other two nodes, the side with one node cannot see enough master-eligible -nodes and will realise that it cannot elect itself as master. The side with -two nodes will elect a new master (if needed) and continue functioning -correctly. As soon as the network split is resolved, the single node will -rejoin the cluster and start serving requests again. - -This setting should be set to a _quorum_ of master-eligible nodes: - - (master_eligible_nodes / 2) + 1 - -In other words, if there are three master-eligible nodes, then minimum master -nodes should be set to `(3 / 2) + 1` or `2`: - -[source,yaml] ----------------------------- -discovery.zen.minimum_master_nodes: 2 <1> ----------------------------- -<1> Defaults to `1`. - -To be able to remain available when one of the master-eligible nodes fails, -clusters should have at least three master-eligible nodes, with -`minimum_master_nodes` set accordingly. A <>, -performed without any downtime, also requires at least three master-eligible -nodes to avoid the possibility of data loss if a network split occurs while the -upgrade is in progress. - -This setting can also be changed dynamically on a live cluster with the -<>: - -[source,js] ----------------------------- -PUT _cluster/settings -{ - "transient": { - "discovery.zen.minimum_master_nodes": 2 - } -} ----------------------------- -// CONSOLE -// TEST[skip:Test use Zen2 now so we can't test Zen1 behaviour here] - -TIP: An advantage of splitting the master and data roles between dedicated -nodes is that you can have just three master-eligible nodes and set -`minimum_master_nodes` to `2`. You never have to change this setting, no -matter how many dedicated data nodes you add to the cluster. - - [float] [[data-node]] === Data Node diff --git a/docs/reference/setup/bootstrap-checks.asciidoc b/docs/reference/setup/bootstrap-checks.asciidoc index 9cf3620636a41..34b39546324d1 100644 --- a/docs/reference/setup/bootstrap-checks.asciidoc +++ b/docs/reference/setup/bootstrap-checks.asciidoc @@ -21,6 +21,7 @@ Elasticsearch from running with incompatible settings. These checks are documented individually. [float] +[[dev-vs-prod-mode]] === Development vs. production mode By default, Elasticsearch binds to loopback addresses for <> diff --git a/docs/reference/setup/important-settings/discovery-settings.asciidoc b/docs/reference/setup/important-settings/discovery-settings.asciidoc index e0c67ffb22da8..94d95f866ba4e 100644 --- a/docs/reference/setup/important-settings/discovery-settings.asciidoc +++ b/docs/reference/setup/important-settings/discovery-settings.asciidoc @@ -31,28 +31,14 @@ discovery.zen.ping.unicast.hosts: addresses. [float] -[[minimum_master_nodes]] -==== `discovery.zen.minimum_master_nodes` - -To prevent data loss, it is vital to configure the -`discovery.zen.minimum_master_nodes` setting so that each master-eligible node -knows the _minimum number of master-eligible nodes_ that must be visible in -order to form a cluster. - -Without this setting, a cluster that suffers a network failure is at risk of -having the cluster split into two independent clusters -- a split brain -- which -will lead to data loss. A more detailed explanation is provided in -<>. 
- -To avoid a split brain, this setting should be set to a _quorum_ of -master-eligible nodes: - - (master_eligible_nodes / 2) + 1 - -In other words, if there are three master-eligible nodes, then minimum master -nodes should be set to `(3 / 2) + 1` or `2`: - -[source,yaml] --------------------------------------------------- -discovery.zen.minimum_master_nodes: 2 --------------------------------------------------- +[[initial_master_nodes]] +==== `cluster.initial_master_nodes` + +Starting an Elasticsearch cluster for the very first time requires a +<> step. +In <>, +with no discovery settings configured, this step is automatically +performed by the nodes themselves. As this auto-bootstrapping is +<>, running a node in +<> requires an explicit cluster +bootstrapping step. From f3a8b93b2d74e4db52e26a853b94f3eacb93381c Mon Sep 17 00:00:00 2001 From: Yannick Welsch Date: Mon, 10 Dec 2018 10:57:49 +0100 Subject: [PATCH 034/106] put all in one doc --- docs/reference/modules/discovery.asciidoc | 646 +++++++++++++++++- .../discovery/auto-reconfiguration.asciidoc | 112 --- .../discovery/bootstrap-cluster.asciidoc | 67 -- .../discovery/hosts-providers.asciidoc | 150 ---- .../discovery/master-election.asciidoc | 97 --- .../modules/discovery/quorums.asciidoc | 187 ----- 6 files changed, 623 insertions(+), 636 deletions(-) delete mode 100644 docs/reference/modules/discovery/auto-reconfiguration.asciidoc delete mode 100644 docs/reference/modules/discovery/bootstrap-cluster.asciidoc delete mode 100644 docs/reference/modules/discovery/hosts-providers.asciidoc delete mode 100644 docs/reference/modules/discovery/master-election.asciidoc delete mode 100644 docs/reference/modules/discovery/quorums.asciidoc diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 191c9b0295bea..e6e10c96659d1 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -2,49 +2,378 @@ == Discovery and cluster formation The discovery and cluster formation module is responsible for discovering nodes, -electing a master, and publishing the cluster state. +electing a master, and publishing the cluster state. It is integrated with other +modules, for example, all communication between nodes is done using the +<> module. -This module is integrated with other modules, for example, -all communication between nodes is done using the <> module. +It is separated into several sections, which are explained below: + +* <> is the process where nodes find + each other when starting up, or when losing a master. +* <> is a configuration step that's required + when an Elasticsearch starts up for the very first time. + In <>, with no discovery settings configured, + this step is automatically performed by the nodes themselves. As this + auto-bootstrapping is <>, running + a node in <> requires an explicit cluster + bootstrapping step. +* It is recommended to have a small and fixed number of master-eligible nodes in + a cluster, and to scale the cluster up and down by adding and removing + non-master-eligible nodes only. However there are situations in which it may be + desirable to add or remove some master-eligible nodes to or from a cluster. + A section on <> + describes how Elasticsearch supports dynamically adding and removing + master-eligible nodes where, under certain conditions, special care must be taken. +* <> covers how a master + publishes cluster states to the other nodes in the cluster. 
+* <> describes what operations should be rejected when there is + no active master. +* <> and <> sections cover advanced settings + to influence the election and fault detection processes. +* <> explains the design + behind the master election and auto-reconfiguration logic. -It is separated into several sub modules, which are explained below: [float] +[[modules-discovery-hosts-providers]] === Discovery -The discovery sub-module uses a list of _seed_ nodes in order to start +The cluster formation module uses a list of _seed_ nodes in order to start off the discovery process. At startup, or when disconnected from a master, Elasticsearch tries to connect to each seed node in its list, and holds a gossip-like conversation with them to find other nodes and to build a complete -picture of the master-eligible nodes in the cluster. +picture of the master-eligible nodes in the cluster. By default the cluster formation +module offers two hosts providers to configure the list of seed nodes: +a _settings-based_ and a _file-based_ hosts provider, but can be extended to +support cloud environments and other forms of host providers via plugins. +Host providers are configured using the `discovery.zen.hosts_provider` setting, +which defaults to the _settings-based_ hosts provider. Multiple hosts providers +can be specified as a list. + +[float] +[[settings-based-hosts-provider]] +===== Settings-based hosts provider + +The settings-based hosts provider use a node setting to configure a static +list of hosts to use as seed nodes. These hosts can be specified as hostnames +or IP addresses; hosts specified as hostnames are resolved to IP addresses +during each round of pinging. Note that if you are in an environment where +DNS resolutions vary with time, you might need to adjust your +<>. + +The list of hosts is set using the `discovery.zen.ping.unicast.hosts` static +setting. This is either an array of hosts or a comma-delimited string. Each +value should be in the form of `host:port` or `host` (where `port` defaults to +the setting `transport.profiles.default.port` falling back to +`transport.tcp.port` if not set). Note that IPv6 hosts must be bracketed. The +default for this setting is `127.0.0.1, [::1]` -include::discovery/hosts-providers.asciidoc[] +Additionally, the `discovery.zen.ping.unicast.hosts.resolve_timeout` configures the +amount of time to wait for DNS lookups on each round of pinging. This is +specified as a <> and defaults to 5s. + +Unicast discovery uses the <> module to perform the +discovery. [float] +[[file-based-hosts-provider]] +===== File-based hosts provider + +The file-based hosts provider configures a list of hosts via an external file. +Elasticsearch reloads this file when it changes, so that the list of seed nodes +can change dynamically without needing to restart each node. For example, this +gives a convenient mechanism for an Elasticsearch instance that is run in a +Docker container to be dynamically supplied with a list of IP addresses to +connect to when those IP addresses may not be known at node startup. + +To enable file-based discovery, configure the `file` hosts provider as follows: + +[source,txt] +---------------------------------------------------------------- +discovery.zen.hosts_provider: file +---------------------------------------------------------------- + +Then create a file at `$ES_PATH_CONF/unicast_hosts.txt` in the format described +below. 
Any time a change is made to the `unicast_hosts.txt` file the new +changes will be picked up by Elasticsearch and the new hosts list will be used. + +Note that the file-based discovery plugin augments the unicast hosts list in +`elasticsearch.yml`: if there are valid unicast host entries in +`discovery.zen.ping.unicast.hosts` then they will be used in addition to those +supplied in `unicast_hosts.txt`. + +The `discovery.zen.ping.unicast.hosts.resolve_timeout` setting also applies to DNS +lookups for nodes specified by address via file-based discovery. This is +specified as a <> and defaults to 5s. + +The format of the file is to specify one node entry per line. Each node entry +consists of the host (host name or IP address) and an optional transport port +number. If the port number is specified, is must come immediately after the +host (on the same line) separated by a `:`. If the port number is not +specified, a default value of 9300 is used. + +For example, this is an example of `unicast_hosts.txt` for a cluster with four +nodes that participate in unicast discovery, some of which are not running on +the default port: + +[source,txt] +---------------------------------------------------------------- +10.10.10.5 +10.10.10.6:9305 +10.10.10.5:10005 +# an IPv6 address +[2001:0db8:85a3:0000:0000:8a2e:0370:7334]:9301 +---------------------------------------------------------------- + +Host names are allowed instead of IP addresses (similar to +`discovery.zen.ping.unicast.hosts`), and IPv6 addresses must be specified in +brackets with the port coming after the brackets. + +It is also possible to add comments to this file. All comments must appear on +their lines starting with `#` (i.e. comments cannot start in the middle of a +line). + +[float] +[[ec2-hosts-provider]] +===== EC2 hosts provider + +The {plugins}/discovery-ec2.html[EC2 discovery plugin] adds a hosts provider +that uses the https://github.com/aws/aws-sdk-java[AWS API] to find a list of seed nodes. + +[float] +[[azure-classic-hosts-provider]] +===== Azure Classic hosts provider + +The {plugins}/discovery-azure-classic.html[Azure Classic discovery plugin] adds a hosts provider +that uses the Azure Classic API find a list of seed nodes. + +[float] +[[gce-hosts-provider]] +===== Google Compute Engine hosts provider + +The {plugins}/discovery-gce.html[GCE discovery plugin] adds a hosts provider +that uses the GCE API find a list of seed nodes. + +[float] +==== Discovery settings + +Discovery operates in two phases: First, each node "probes" the addresses of +all known nodes by connecting to each address and attempting to identify the +node to which it is connected. Secondly it shares with the remote node a list +of all of its peers and the remote node responds with _its_ peers in turn. The +node then probes all the new nodes about which it just discovered, requests +their peers, and so on, until it has discovered an elected master node or +enough other masterless nodes that it can perform an election. If neither of +these occur quickly enough then it tries again. This process is controlled by +the following settings. + +`discovery.probe.connect_timeout`:: + + Sets how long to wait when attempting to connect to each address. Defaults + to `3s`. + +`discovery.probe.handshake_timeout`:: + + Sets how long to wait when attempting to identify the remote node via a + handshake. Defaults to `1s`. + +`discovery.find_peers_interval`:: + + Sets how long a node will wait before attempting another discovery round. 
+ +`discovery.request_peers_timeout`:: + + Sets how long a node will wait after asking its peers again before + considering the request to have failed. + + +[float] +[[modules-discovery-bootstrap-cluster]] === Bootstrapping a cluster Starting an Elasticsearch cluster for the very first time requires a -cluster bootstrapping step. In <>, -with no discovery settings configured, this step is automatically -performed by the nodes themselves. As this auto-bootstrapping is -<>, running a node in <> -requires an explicit cluster bootstrapping step. +cluster bootstrapping step. + +The simplest way to bootstrap a cluster is by specifying the node names +or transport addresses of at least a non-empty subset of the master-eligible nodes +before start-up. The node setting `cluster.initial_master_nodes`, which +takes a list of node names or transport addresses, can be either specified +on the command line when starting up the nodes, or be added to the node +configuration file `elasticsearch.yml`. + +For a cluster with 3 master-eligible nodes (named master-a, master-b, and master-c) +the configuration will look as follows: + +[source,yaml] +-------------------------------------------------- +cluster.initial_master_nodes: + - master-a + - master-b + - master-c +-------------------------------------------------- + +TODO provide another example with ip addresses (+ possibly port) + +Note that if you have not explicitly configured a node name, this +name defaults to the host name, so using the host names will work as well. +While it is sufficient to set this on a single master-eligible node +in the cluster, and only mention a single master-eligible node, using +multiple nodes for bootstrapping allows the bootstrap process to go +through even if not all nodes are available. In any case, when +specifying the list of initial master nodes, **it is vitally important** +to configure each node with exactly the same list of nodes, to prevent +two independent clusters from forming. Typically you will set this +on the nodes that are mentioned in the list of initial master nodes. + +WARNING: You must put exactly the same set of initial master nodes in each + configuration file in order to be sure that only a single cluster forms during + bootstrapping and therefore to avoid the risk of data loss. + + +It is also possible to set the initial set of master nodes on the +command-line used to start Elasticsearch: + +[source,bash] +-------------------------------------------------- +$ bin/elasticsearch -Ecluster.initial_master_nodes=master-a,master-b,master-c +-------------------------------------------------- -include::discovery/bootstrap-cluster.asciidoc[] +Just as with the config file, this additional command-line parameter +can be removed once a cluster has successfully formed. [float] -==== Adding and removing nodes +==== Choosing a cluster name +The `cluster.name` allows to create separated clusters from one another. +The default value for the cluster name is `elasticsearch`, though it is +recommended to change this to reflect the logical group name of the +cluster running. + +[float] +==== Auto-bootstrapping in development mode + +If the cluster is running with a completely default configuration then it will +automatically bootstrap based on the nodes that could be discovered within a +short time after startup. Since nodes may not always reliably discover each +other quickly enough this automatic bootstrapping is not always reliable and +cannot be used in production deployments. 
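+
+In production deployments you must therefore list the initial master nodes
+explicitly, as shown earlier in this section. As a purely illustrative sketch,
+the bootstrap list can also be given as transport addresses rather than node
+names; the addresses and port below are placeholders only and should be
+replaced with the real transport addresses of your own master-eligible nodes:
+
+[source,yaml]
+--------------------------------------------------
+cluster.initial_master_nodes:
+  - 10.10.10.1
+  - 10.10.10.2:9300
+  - 10.10.10.3
+--------------------------------------------------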
+ +[float] +[[modules-discovery-adding-removing-nodes]] +=== Adding and removing nodes + +As nodes are added or removed Elasticsearch maintains an optimal level of fault +tolerance by updating the cluster's _voting configuration_, which is the set of +master-eligible nodes whose responses are counted when making decisions such as +electing a new master or committing a new cluster state. It is recommended to have a small and fixed number of master-eligible nodes in a cluster, and to scale the cluster up and down by adding and removing non-master-eligible nodes only. However there are situations in which it may be desirable to add or remove some master-eligible nodes to or from a cluster. -Elasticsearch supports dynamically adding and removing master-eligible nodes, -but under certain conditions, special care must be taken. +If you wish to add some master-eligible nodes to your cluster, simply configure +the new nodes to find the existing cluster and start them up. Elasticsearch +will add the new nodes to the voting configuration if it is appropriate to do +so. + +When removing master-eligible nodes, it is important not to remove too many all +at the same time. For instance, if there are currently seven master-eligible +nodes and you wish to reduce this to three, it is not possible simply to stop +four of the nodes at once: to do so would leave only three nodes remaining, +which is less than half of the voting configuration, which means the cluster +cannot take any further actions. + +As long as there are at least three master-eligible nodes in the cluster, as a +general rule it is best to remove nodes one-at-a-time, allowing enough time for +the auto-reconfiguration to take effect after each removal. + +If there are only two master-eligible nodes then neither node can be safely +removed since both are required to reliably make progress, so you must first +inform Elasticsearch that one of the nodes should not be part of the voting +configuration, and that the voting power should instead be given to other +nodes, allowing the excluded node to be taken offline without preventing the +other node from making progress. A node which is added to a voting +configuration exclusion list still works normally, but Elasticsearch will try +and remove it from the voting configuration so its vote is no longer required, +and will never automatically move such a node back into the voting +configuration after it has been removed. Once a node has been successfully +reconfigured out of the voting configuration, it is safe to shut it down +without affecting the cluster's availability. A node can be added to the voting +configuration exclusion list using the following API: + +[source,js] +-------------------------------------------------- +# Add node to voting configuration exclusions list and wait for the system to +# auto-reconfigure the node out of the voting configuration up to the default +# timeout of 30 seconds +POST /_cluster/voting_config_exclusions/node_name +# Add node to voting configuration exclusions list and wait for +# auto-reconfiguration up to one minute +POST /_cluster/voting_config_exclusions/node_name?timeout=1m +-------------------------------------------------- +// CONSOLE + +The node that should be added to the exclusions list is specified using +<> in place of `node_name` here. If a call to the +voting configuration exclusions API fails then the call can safely be retried. 
+A successful response guarantees that the node has been removed from the voting
+configuration and will not be reinstated.
+
+Although the voting configuration exclusions API is most useful for
+down-scaling a two-node cluster to a one-node cluster, it is also possible to
+use it to remove multiple nodes from larger clusters all at the same time.
+Adding multiple nodes to the exclusions list makes the system try to
+auto-reconfigure all of these nodes out of the voting configuration, allowing
+them to be safely shut down while keeping the cluster available. In the example
+described above, shrinking a seven-master-node cluster down to only have three
+master nodes, you could add four nodes to the exclusions list, wait for
+confirmation, and then shut them down simultaneously.
+
+Adding an exclusion for a node creates an entry for that node in the voting
+configuration exclusions list, which makes the system automatically try to
+reconfigure the voting configuration to remove that node, and prevents it from
+returning to the voting configuration once it has been removed. The current set
+of exclusions is stored in the cluster state and can be inspected as follows:
+
+[source,js]
+--------------------------------------------------
+GET /_cluster/state?filter_path=metadata.cluster_coordination.voting_config_exclusions
+--------------------------------------------------
+// CONSOLE
+
+This list is limited in size by the following setting:
+
+`cluster.max_voting_config_exclusions`::
+
+    Sets a limit on the number of voting configuration exclusions at any one
+    time. Defaults to `10`.
+
+Since voting configuration exclusions are persistent and limited in number,
+they must be cleaned up. Normally an exclusion is added when performing some
+maintenance on the cluster, and the exclusions should be cleaned up when the
+maintenance is complete. Clusters should have no voting configuration
+exclusions in normal operation.
+
+If a node is excluded from the voting configuration because it is to be shut
+down permanently then its exclusion can be removed once it has shut down and
+been removed from the cluster. Exclusions can also be cleared if they were
+created in error or were only required temporarily:
+
+[source,js]
+--------------------------------------------------
+# Wait for all the nodes with voting configuration exclusions to be removed
+# from the cluster and then remove all the exclusions, allowing any node to
+# return to the voting configuration in the future.
+DELETE /_cluster/voting_config_exclusions
+# Immediately remove all the voting configuration exclusions, allowing any node
+# to return to the voting configuration in the future.
+DELETE /_cluster/voting_config_exclusions?wait_for_removal=false
+--------------------------------------------------
+// CONSOLE

 [float]
-==== Cluster state publishing
+[[cluster-state-publishing]]
+=== Cluster state publishing

 The master node is the only node in a cluster that can make changes to the
 cluster state. The master node processes one cluster state update at a time,
@@ -75,7 +404,7 @@ discovery implementation).

 [float]
 [[no-master-block]]
-==== No master block
+=== No master block

 For the cluster to be fully operational, it must have an active master. The
 `discovery.zen.no_master_block` settings controls what operations should be
@@ -97,14 +426,285 @@ The `discovery.zen.no_master_block` setting doesn't apply to nodes-based apis
 apis will not be blocked and can run on any available node.
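+
+For example, a cluster that should reject reads as well as writes while there
+is no elected master could set this block to `all` in `elasticsearch.yml`. This
+is only an illustrative sketch; the default of `write` is usually the
+appropriate choice:
+
+[source,yaml]
+--------------------------------------------------
+discovery.zen.no_master_block: all
+--------------------------------------------------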
[float] -==== Master election and fault detection +[[master-election]] +=== Master Election + +Elasticsearch uses an election process to agree on an elected master node, both +at startup and if the existing elected master fails. Any master-eligible node +can start an election, and normally the first election that takes place will +succeed. Elections only usually fail when two nodes both happen to start their +elections at about the same time, so elections are scheduled randomly on each +node to avoid this happening. Nodes will retry elections until a master is +elected, backing off on failure, so that eventually an election will succeed +(with arbitrarily high probability). The following settings control the +scheduling of elections. + +`cluster.election.initial_timeout`:: + + Sets the upper bound on how long a node will wait initially, or after a + leader failure, before attempting its first election. This defaults to + `100ms`. + +`cluster.election.back_off_time`:: + + Sets the amount to increase the upper bound on the wait before an election + on each election failure. Note that this is _linear_ backoff. This defaults + to `100ms` + +`cluster.election.max_timeout`:: + + Sets the maximum upper bound on how long a node will wait before attempting + an first election, so that an network partition that lasts for a long time + does not result in excessively sparse elections. This defaults to `10s` + +`cluster.election.duration`:: + + Sets how long each election is allowed to take before a node considers it + to have failed and schedules a retry. This defaults to `500ms`. + + +[float] +==== Joining an elected master + +During master election, or when joining an existing formed cluster, a node will send +a join request to the master in order to be officially added to the cluster. This join +process can be configured with the following settings. + +`cluster.join.timeout`:: + + Sets how long a node will wait after sending a request to join a cluster + before it considers the request to have failed and retries. Defaults to + `60s`. + +[float] +[[fault-detection]] +=== Fault Detection + +An elected master periodically checks each of its followers in order to ensure +that they are still connected and healthy, and in turn each follower +periodically checks the health of the elected master. Elasticsearch allows for +these checks occasionally to fail or timeout without taking any action, and +will only consider a node to be truly faulty after a number of consecutive +checks have failed. The following settings control the behaviour of fault +detection. + +`cluster.fault_detection.follower_check.interval`:: + + Sets how long the elected master waits between checks of its followers. + Defaults to `1s`. + +`cluster.fault_detection.follower_check.timeout`:: + + Sets how long the elected master waits for a response to a follower check + before considering it to have failed. Defaults to `30s`. + +`cluster.fault_detection.follower_check.retry_count`:: + + Sets how many consecutive follower check failures must occur before the + elected master considers a follower node to be faulty and removes it from + the cluster. Defaults to `3`. + +`cluster.fault_detection.leader_check.interval`:: + + Sets how long each follower node waits between checks of its leader. + Defaults to `1s`. + +`cluster.fault_detection.leader_check.timeout`:: + + Sets how long each follower node waits for a response to a leader check + before considering it to have failed. Defaults to `30s`. 
-The master election and fault detection sub modules cover advanced settings -to influence the election and fault detection processes. +`cluster.fault_detection.leader_check.retry_count`:: -include::discovery/master-election.asciidoc[] + Sets how many consecutive leader check failures must occur before a + follower node considers the elected master to be faulty and attempts to + find or elect a new master. Defaults to `3`. [float] +[[modules-discovery-quorums]] === Quorum-based decision making -include::discovery/quorums.asciidoc[] +Electing a master node and changing the cluster state are the two fundamental +tasks that master-eligible nodes must work together to perform. It is important +that these activities work robustly even if some nodes have failed, and +Elasticsearch achieves this robustness by only considering each action to have +succeeded on receipt of responses from a _quorum_, a subset of the +master-eligible nodes in the cluster. The advantage of requiring only a subset +of the nodes to respond is that it allows for some of the nodes to fail without +preventing the cluster from making progress, and the quorums are carefully +chosen so as not to allow the cluster to "split brain", i.e. to be partitioned +into two pieces each of which may make decisions that are inconsistent with +those of the other piece. + +Elasticsearch allows you to add and remove master-eligible nodes to a running +cluster. In many cases you can do this simply by starting or stopping the nodes +as required, as described in more detail below. + +As nodes are added or removed Elasticsearch maintains an optimal level of fault +tolerance by updating the cluster's _voting configuration_, which is the set of +master-eligible nodes whose responses are counted when making decisions such as +electing a new master or committing a new cluster state. A decision is only +made once more than half of the nodes in the voting configuration have +responded. Usually the voting configuration is the same as the set of all the +master-eligible nodes that are currently in the cluster, but there are some +situations in which they may be different. + +To be sure that the cluster remains available you **must not stop half or more +of the nodes in the voting configuration at the same time**. As long as more +than half of the voting nodes are available the cluster can still work +normally. This means that if there are three or four master-eligible nodes then +the cluster can tolerate one of them being unavailable; if there are two or +fewer master-eligible nodes then they must all remain available. + +After a node has joined or left the cluster the elected master must issue a +cluster-state update that adjusts the voting configuration to match, and this +can take a short time to complete. It is important to wait for this adjustment +to complete before removing more nodes from the cluster. + +[float] +==== Getting the initial quorum + +When a brand-new cluster starts up for the first time, one of the tasks it must +perform is to elect its first master node, for which it needs to know the set +of master-eligible nodes whose votes should count in this first election. This +initial voting configuration is known as the _bootstrap configuration_. + +It is important that the bootstrap configuration identifies exactly which nodes +should vote in the first election, and it is not sufficient to configure each +node with an expectation of how many nodes there should be in the cluster. 
It
+is also important to note that the bootstrap configuration must come from
+outside the cluster: there is no safe way for the cluster to determine the
+bootstrap configuration correctly on its own.
+
+If the bootstrap configuration is not set correctly then there is a risk that,
+when starting up a brand-new cluster, you accidentally form two separate
+clusters instead of one. This could lead to data loss: you might start using
+both clusters before noticing that anything had gone wrong, and it will then be
+impossible to merge them together later.
+
+NOTE: To illustrate the problem with configuring each node to expect a certain
+cluster size, imagine starting up a three-node cluster in which each node knows
+that it is going to be part of a three-node cluster. A majority of three nodes
+is two, so normally the first two nodes to discover each other will form a
+cluster and the third node will join them a short time later. However, imagine
+that four nodes were erroneously started instead of three: in this case there
+are enough nodes to form two separate clusters. Of course if each node is
+started manually then it's unlikely that too many nodes are started, but it's
+certainly possible to get into this situation if using a more automated
+orchestrator, particularly if the orchestrator is not resilient to failures
+such as network partitions.
+
+The <> is
+only required the very first time a whole cluster starts up: new nodes joining
+an established cluster can safely obtain all the information they need from
+the elected master, and nodes that have previously been part of a cluster
+will have stored to disk all the information required when restarting.
+
+[float]
+==== Cluster maintenance, rolling restarts and migrations
+
+Many cluster maintenance tasks involve temporarily shutting down one or more
+nodes and then starting them back up again. By default Elasticsearch can remain
+available if one of its master-eligible nodes is taken offline, such as during
+a <>. Furthermore, if multiple nodes are
+stopped and then started again then it will automatically recover, such as
+during a <>. There is no need to take any
+further action with the APIs described here in these cases, because the set of
+master nodes is not changing permanently.
+
+It is also possible to perform a migration of a cluster onto entirely new nodes
+without taking the cluster offline, via a _rolling migration_. A rolling
+migration is similar to a rolling restart, in that it is performed one node at
+a time, and also requires no special handling for the master-eligible nodes as
+long as there are at least two of them available at all times.
+
+TODO the above is only true if the maintenance happens slowly enough, otherwise
+the configuration might not catch up. Need to add this to the rolling restart
+docs.
+
+[float]
+==== Auto-reconfiguration
+
+Nodes may join or leave the cluster, and Elasticsearch reacts by making
+corresponding changes to the voting configuration in order to ensure that the
+cluster is as resilient as possible. The default auto-reconfiguration behaviour
+is expected to give the best results in most situations.
The current voting
+configuration is stored in the cluster state so you can inspect its current
+contents as follows:
+
+[source,js]
+--------------------------------------------------
+GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config
+--------------------------------------------------
+// CONSOLE
+
+NOTE: The current voting configuration is not necessarily the same as the set
+of all available master-eligible nodes in the cluster. Altering the voting
+configuration itself involves taking a vote, so it takes some time to adjust
+the configuration as nodes join or leave the cluster. Also, there are
+situations where the most resilient configuration includes unavailable nodes,
+or does not include some available nodes, and in these situations the voting
+configuration will differ from the set of available master-eligible nodes in
+the cluster.
+
+Larger voting configurations are usually more resilient, so Elasticsearch will
+normally prefer to add master-eligible nodes to the voting configuration once
+they have joined the cluster. Similarly, if a node in the voting configuration
+leaves the cluster and there is another master-eligible node in the cluster
+that is not in the voting configuration then it is preferable to swap these two
+nodes over, leaving the size of the voting configuration unchanged but
+increasing its resilience.
+
+It is not so straightforward to automatically remove nodes from the voting
+configuration after they have left the cluster, and different strategies have
+different benefits and drawbacks, so the right choice depends on how the
+cluster will be used and is controlled by the following setting.
+
+`cluster.auto_shrink_voting_configuration`::
+
+    Defaults to `true`, meaning that the voting configuration will
+    automatically shrink, shedding departed nodes, as long as it still contains
+    at least 3 nodes. If set to `false`, the voting configuration never
+    automatically shrinks; departed nodes must be removed manually using the
+    <>.
+
+NOTE: If `cluster.auto_shrink_voting_configuration` is set to `true`, the
+recommended and default setting, and there are at least three master-eligible
+nodes in the cluster, then Elasticsearch remains capable of processing
+cluster-state updates as long as all but one of its master-eligible nodes are
+healthy.
+
+There are situations in which Elasticsearch might tolerate the loss of multiple
+nodes, but this is not guaranteed under all sequences of failures. If this
+setting is set to `false` then departed nodes must be removed from the voting
+configuration manually, using the voting configuration exclusions API described
+above, to achieve the desired level of resilience.
+
+Note that Elasticsearch will not suffer from a "split-brain" inconsistency
+however it is configured. This setting only affects its availability in the
+event of the failure of some of its nodes, and the administrative tasks that
+must be performed as nodes join and leave the cluster.
+
+[float]
+==== Even numbers of master-eligible nodes
+
+There should normally be an odd number of master-eligible nodes in a cluster.
+If there is an even number then Elasticsearch will leave one of them out of the
+voting configuration to ensure that it has an odd size.
This does not decrease +the failure-tolerance of the cluster, and in fact improves it slightly: if the +cluster is partitioned into two even halves then one of the halves will contain +a majority of the voting configuration and will be able to keep operating, +whereas if all of the master-eligible nodes' votes were counted then neither +side could make any progress in this situation. + +For instance if there are four master-eligible nodes in the cluster and the +voting configuration contained all of them then any quorum-based decision would +require votes from at least three of them, which means that the cluster can +only tolerate the loss of a single master-eligible node. If this cluster were +split into two equal halves then neither half would contain three +master-eligible nodes so would not be able to make any progress. However if the +voting configuration contains only three of the four master-eligible nodes then +the cluster is still only fully tolerant to the loss of one node, but +quorum-based decisions require votes from two of the three voting nodes. In the +event of an even split, one half will contain two of the three voting nodes so +will remain available. diff --git a/docs/reference/modules/discovery/auto-reconfiguration.asciidoc b/docs/reference/modules/discovery/auto-reconfiguration.asciidoc deleted file mode 100644 index b73681650d5f4..0000000000000 --- a/docs/reference/modules/discovery/auto-reconfiguration.asciidoc +++ /dev/null @@ -1,112 +0,0 @@ -[float] -==== Adding and removing nodes - -As nodes are added or removed Elasticsearch maintains an optimal level of fault -tolerance by updating the cluster's _voting configuration_, which is the set of -master-eligible nodes whose responses are counted when making decisions such as -electing a new master or committing a new cluster state. - -It is recommended to have a small and fixed number of master-eligible nodes in -a cluster, and to scale the cluster up and down by adding and removing -non-master-eligible nodes only. However there are situations in which it may be -desirable to add or remove some master-eligible nodes to or from a cluster. - -If you wish to add some master-eligible nodes to your cluster, simply configure -the new nodes to find the existing cluster and start them up. Elasticsearch -will add the new nodes to the voting configuration if it is appropriate to do -so. - -When removing master-eligible nodes, it is important not to remove too many all -at the same time. For instance, if there are currently seven master-eligible -nodes and you wish to reduce this to three, it is not possible simply to stop -four of the nodes at once: to do so would leave only three nodes remaining, -which is less than half of the voting configuration, which means the cluster -cannot take any further actions. - -As long as there are at least three master-eligible nodes in the cluster, as a -general rule it is best to remove nodes one-at-a-time, allowing enough time for -the auto-reconfiguration to take effect after each removal. - -If there are only two master-eligible nodes then neither node can be safely -removed since both are required to reliably make progress, so you must first -inform Elasticsearch that one of the nodes should not be part of the voting -configuration, and that the voting power should instead be given to other -nodes, allowing the excluded node to be taken offline without preventing the -other node from making progress. 
A node which is added to a voting -configuration exclusion list still works normally, but Elasticsearch will try -and remove it from the voting configuration so its vote is no longer required, -and will never automatically move such a node back into the voting -configuration after it has been removed. Once a node has been successfully -reconfigured out of the voting configuration, it is safe to shut it down -without affecting the cluster's availability. A node can be added to the voting -configuration exclusion list using the following API: - -[source,js] --------------------------------------------------- -# Add node to voting configuration exclusions list and wait for the system to -# auto-reconfigure the node out of the voting configuration up to the default -# timeout of 30 seconds -POST /_cluster/voting_config_exclusions/node_name -# Add node to voting configuration exclusions list and wait for -# auto-reconfiguration up to one minute -POST /_cluster/voting_config_exclusions/node_name?timeout=1m --------------------------------------------------- -// CONSOLE - -The node that should be added to the exclusions list is specified using -<> in place of `node_name` here. If a call to the -voting configuration exclusions API fails then the call can safely be retried. -A successful response guarantees that the node has been removed from the voting -configuration and will not be reinstated. - -Although the voting configuration exclusions API is most useful for -down-scaling a two-node to a one-node cluster, it is also possible to use it to -remove multiple nodes from larger clusters all at the same time. Adding -multiple nodes to the exclusions list has the system try to auto-reconfigure -all of these nodes out of the voting configuration, allowing them to be safely -shut down while keeping the cluster available. In the example described above, -shrinking a seven-master-node cluster down to only have three master nodes, you -could add four nodes to the exclusions list, wait for confirmation, and then -shut them down simultaneously. - -Adding an exclusion for a node creates an entry for that node in the voting -configuration exclusions list, which has the system automatically try to -reconfigure the voting configuration to remove that node and prevents it from -returning to the voting configuration once it has removed. The current set of -exclusions is stored in the cluster state and can be inspected as follows: - -[source,js] --------------------------------------------------- -GET /_cluster/state?filter_path=metadata.cluster_coordination.voting_config_exclusions --------------------------------------------------- -// CONSOLE - -This list is limited in size by the following setting: - -`cluster.max_voting_config_exclusions`:: - - Sets a limits on the number of voting configuration exclusions at any one - time. Defaults to `10`. - -Since voting configuration exclusions are persistent and limited in number, -they must be cleaned up. Normally an exclusion is added when performing some -maintenance on the cluster, and the exclusions should be cleaned up when the -maintenance is complete. Clusters should have no voting configuration -exclusions in normal operation. - -If a node is excluded from the voting configuration because it is to be shut -down permanently then its exclusion can be removed once it has shut down and -been removed from the cluster. 
Exclusions can also be cleared if they were -created in error or were only required temporarily: - -[source,js] --------------------------------------------------- -# Wait for all the nodes with voting configuration exclusions to be removed -# from the cluster and then remove all the exclusions, allowing any node to -# return to the voting configuration in the future. -DELETE /_cluster/voting_config_exclusions -# Immediately remove all the voting configuration exclusions, allowing any node -# to return to the voting configuration in the future. -DELETE /_cluster/voting_config_exclusions?wait_for_removal=false --------------------------------------------------- -// CONSOLE diff --git a/docs/reference/modules/discovery/bootstrap-cluster.asciidoc b/docs/reference/modules/discovery/bootstrap-cluster.asciidoc deleted file mode 100644 index 1217233417117..0000000000000 --- a/docs/reference/modules/discovery/bootstrap-cluster.asciidoc +++ /dev/null @@ -1,67 +0,0 @@ -[[modules-discovery-bootstrap-cluster]] -=== Bootstrapping a cluster - -Starting an Elasticsearch cluster for the very first time requires a -cluster bootstrapping step. - -The simplest way to bootstrap a cluster is by specifying the node names -or transport addresses of at least a non-empty subset of the master-eligible nodes -before start-up. The node setting `cluster.initial_master_nodes`, which -takes a list of node names or transport addresses, can be either specified -on the command line when starting up the nodes, or be added to the node -configuration file `elasticsearch.yml`. - -For a cluster with 3 master-eligible nodes (named master-a, master-b, and master-c) and -Note that if you have not explicitly configured a node name, this -name defaults to the host name, so using the host names will work as well: - -[source,yaml] --------------------------------------------------- -cluster.initial_master_nodes: - - master-a - - master-b - - master-c --------------------------------------------------- - -TODO provide another example with ip addresses (+ possibly port) - -While it is sufficient to set this on a single master-eligible node -in the cluster, and only mention a single maser-eligible node, using -multiple nodes for bootstrapping allows the bootstrap process to go -through even if not all nodes are avilable. In any case, when -specifying the list of initial master nodes, **it is vitally important** -to configure each node with exactly the same list of nodes, to prevent -two independent clusters from forming. Typically you will set this -on the nodes that are mentioned in the list of initial master nodes. - -WARNING: You must put exactly the same set of initial master nodes in each - configuration file in order to be sure that only a single cluster forms during - bootstrapping and therefore to avoid the risk of data loss. - - -It is also possible to set the initial set of master nodes on the -command-line used to start Elasticsearch: - -[source,bash] --------------------------------------------------- -$ bin/elasticsearch -Ecluster.initial_master_nodes=master-a,master-b,master-c --------------------------------------------------- - -Just as with the config file, this additional command-line parameter -can be removed once a cluster has successfully formed. - -[float] -==== Choosing a cluster name -The `cluster.name` allows to create separated clusters from one another. -The default value for the cluster name is `elasticsearch`, though it is -recommended to change this to reflect the logical group name of the -cluster running. 
- - -==== Auto-bootstrapping in development mode - -If the cluster is running with a completely default configuration then it will -automatically bootstrap based on the nodes that could be discovered within a -short time after startup. Since nodes may not always reliably discover each -other quickly enough this automatic bootstrapping is not always reliable and -cannot be used in production deployments. diff --git a/docs/reference/modules/discovery/hosts-providers.asciidoc b/docs/reference/modules/discovery/hosts-providers.asciidoc deleted file mode 100644 index ecaf32e37992a..0000000000000 --- a/docs/reference/modules/discovery/hosts-providers.asciidoc +++ /dev/null @@ -1,150 +0,0 @@ -[[modules-discovery-hosts-providers]] -=== Discovery - -The cluster formation module uses a list of _seed_ nodes in order to start -off the discovery process. At startup, or when disconnected from a master, -Elasticsearch tries to connect to each seed node in its list, and holds a -gossip-like conversation with them to find other nodes and to build a complete -picture of the master-eligible nodes in the cluster. By default the cluster formation -module offers two hosts providers to configure the list of seed nodes: -a _settings-based_ and a _file-based_ hosts provider, but can be extended to -support cloud environments and other forms of host providers via plugins. -Host providers are configured using the `discovery.zen.hosts_provider` setting, -which defaults to the _settings-based_ hosts provider. Multiple hosts providers -can be specified as a list. - -[float] -[[settings-based-hosts-provider]] -===== Settings-based hosts provider - -The settings-based hosts provider use a node setting to configure a static -list of hosts to use as seed nodes. These hosts can be specified as hostnames -or IP addresses; hosts specified as hostnames are resolved to IP addresses -during each round of pinging. Note that if you are in an environment where -DNS resolutions vary with time, you might need to adjust your -<>. - -The list of hosts is set using the `discovery.zen.ping.unicast.hosts` static -setting. This is either an array of hosts or a comma-delimited string. Each -value should be in the form of `host:port` or `host` (where `port` defaults to -the setting `transport.profiles.default.port` falling back to -`transport.tcp.port` if not set). Note that IPv6 hosts must be bracketed. The -default for this setting is `127.0.0.1, [::1]` - -Additionally, the `discovery.zen.ping.unicast.hosts.resolve_timeout` configures the -amount of time to wait for DNS lookups on each round of pinging. This is -specified as a <> and defaults to 5s. - -Unicast discovery uses the <> module to perform the -discovery. - -[float] -[[file-based-hosts-provider]] -===== File-based hosts provider - -The file-based hosts provider configures a list of hosts via an external file. -Elasticsearch reloads this file when it changes, so that the list of seed nodes -can change dynamically without needing to restart each node. For example, this -gives a convenient mechanism for an Elasticsearch instance that is run in a -Docker container to be dynamically supplied with a list of IP addresses to -connect to when those IP addresses may not be known at node startup. 
- -To enable file-based discovery, configure the `file` hosts provider as follows: - -[source,txt] ----------------------------------------------------------------- -discovery.zen.hosts_provider: file ----------------------------------------------------------------- - -Then create a file at `$ES_PATH_CONF/unicast_hosts.txt` in the format described -below. Any time a change is made to the `unicast_hosts.txt` file the new -changes will be picked up by Elasticsearch and the new hosts list will be used. - -Note that the file-based discovery plugin augments the unicast hosts list in -`elasticsearch.yml`: if there are valid unicast host entries in -`discovery.zen.ping.unicast.hosts` then they will be used in addition to those -supplied in `unicast_hosts.txt`. - -The `discovery.zen.ping.unicast.hosts.resolve_timeout` setting also applies to DNS -lookups for nodes specified by address via file-based discovery. This is -specified as a <> and defaults to 5s. - -The format of the file is to specify one node entry per line. Each node entry -consists of the host (host name or IP address) and an optional transport port -number. If the port number is specified, is must come immediately after the -host (on the same line) separated by a `:`. If the port number is not -specified, a default value of 9300 is used. - -For example, this is an example of `unicast_hosts.txt` for a cluster with four -nodes that participate in unicast discovery, some of which are not running on -the default port: - -[source,txt] ----------------------------------------------------------------- -10.10.10.5 -10.10.10.6:9305 -10.10.10.5:10005 -# an IPv6 address -[2001:0db8:85a3:0000:0000:8a2e:0370:7334]:9301 ----------------------------------------------------------------- - -Host names are allowed instead of IP addresses (similar to -`discovery.zen.ping.unicast.hosts`), and IPv6 addresses must be specified in -brackets with the port coming after the brackets. - -It is also possible to add comments to this file. All comments must appear on -their lines starting with `#` (i.e. comments cannot start in the middle of a -line). - -[float] -[[ec2-hosts-provider]] -===== EC2 hosts provider - -The {plugins}/discovery-ec2.html[EC2 discovery plugin] adds a hosts provider -that uses the https://github.com/aws/aws-sdk-java[AWS API] to find a list of seed nodes. - -[float] -[[azure-classic-hosts-provider]] -===== Azure Classic hosts provider - -The {plugins}/discovery-azure-classic.html[Azure Classic discovery plugin] adds a hosts provider -that uses the Azure Classic API find a list of seed nodes. - -[float] -[[gce-hosts-provider]] -===== Google Compute Engine hosts provider - -The {plugins}/discovery-gce.html[GCE discovery plugin] adds a hosts provider -that uses the GCE API find a list of seed nodes. - -[float] -==== Discovery settings - -Discovery operates in two phases: First, each node "probes" the addresses of -all known nodes by connecting to each address and attempting to identify the -node to which it is connected. Secondly it shares with the remote node a list -of all of its peers and the remote node responds with _its_ peers in turn. The -node then probes all the new nodes about which it just discovered, requests -their peers, and so on, until it has discovered an elected master node or -enough other masterless nodes that it can perform an election. If neither of -these occur quickly enough then it tries again. This process is controlled by -the following settings. 
- -`discovery.probe.connect_timeout`:: - - Sets how long to wait when attempting to connect to each address. Defaults - to `3s`. - -`discovery.probe.handshake_timeout`:: - - Sets how long to wait when attempting to identify the remote node via a - handshake. Defaults to `1s`. - -`discovery.find_peers_interval`:: - - Sets how long a node will wait before attempting another discovery round. - -`discovery.request_peers_timeout`:: - - Sets how long a node will wait after asking its peers again before - considering the request to have failed. diff --git a/docs/reference/modules/discovery/master-election.asciidoc b/docs/reference/modules/discovery/master-election.asciidoc deleted file mode 100644 index 4591720ea5963..0000000000000 --- a/docs/reference/modules/discovery/master-election.asciidoc +++ /dev/null @@ -1,97 +0,0 @@ -[float] -[[master-election]] -==== Master Election - -Elasticsearch uses an election process to agree on an elected master node, both -at startup and if the existing elected master fails. Any master-eligible node -can start an election, and normally the first election that takes place will -succeed. Elections only usually fail when two nodes both happen to start their -elections at about the same time, so elections are scheduled randomly on each -node to avoid this happening. Nodes will retry elections until a master is -elected, backing off on failure, so that eventually an election will succeed -(with arbitrarily high probability). The following settings control the -scheduling of elections. - -`cluster.election.initial_timeout`:: - - Sets the upper bound on how long a node will wait initially, or after a - leader failure, before attempting its first election. This defaults to - `100ms`. - -`cluster.election.back_off_time`:: - - Sets the amount to increase the upper bound on the wait before an election - on each election failure. Note that this is _linear_ backoff. This defaults - to `100ms` - -`cluster.election.max_timeout`:: - - Sets the maximum upper bound on how long a node will wait before attempting - an first election, so that an network partition that lasts for a long time - does not result in excessively sparse elections. This defaults to `10s` - -`cluster.election.duration`:: - - Sets how long each election is allowed to take before a node considers it - to have failed and schedules a retry. This defaults to `500ms`. - - -[float] -[[node-joining]] -==== Joining an elected master - -During master election, or when joining an existing formed cluster, a node will send -a join request to the master in order to be officially added to the cluster. This join -process can be configured with the following settings. - -`cluster.join.timeout`:: - - Sets how long a node will wait after sending a request to join a cluster - before it considers the request to have failed and retries. Defaults to - `60s`. - -[float] -[[fault-detection]] -==== Fault Detection - -An elected master periodically checks each of its followers in order to ensure -that they are still connected and healthy, and in turn each follower -periodically checks the health of the elected master. Elasticsearch allows for -these checks occasionally to fail or timeout without taking any action, and -will only consider a node to be truly faulty after a number of consecutive -checks have failed. The following settings control the behaviour of fault -detection. - -`cluster.fault_detection.follower_check.interval`:: - - Sets how long the elected master waits between checks of its followers. - Defaults to `1s`. 
- -`cluster.fault_detection.follower_check.timeout`:: - - Sets how long the elected master waits for a response to a follower check - before considering it to have failed. Defaults to `30s`. - -`cluster.fault_detection.follower_check.retry_count`:: - - Sets how many consecutive follower check failures must occur before the - elected master considers a follower node to be faulty and removes it from - the cluster. Defaults to `3`. - -`cluster.fault_detection.leader_check.interval`:: - - Sets how long each follower node waits between checks of its leader. - Defaults to `1s`. - -`cluster.fault_detection.leader_check.timeout`:: - - Sets how long each follower node waits for a response to a leader check - before considering it to have failed. Defaults to `30s`. - -`cluster.fault_detection.leader_check.retry_count`:: - - Sets how many consecutive leader check failures must occur before a - follower node considers the elected master to be faulty and attempts to - find or elect a new master. Defaults to `3`. - -TODO add lag detection \ No newline at end of file diff --git a/docs/reference/modules/discovery/quorums.asciidoc b/docs/reference/modules/discovery/quorums.asciidoc deleted file mode 100644 index a8b30d6e79a96..0000000000000 --- a/docs/reference/modules/discovery/quorums.asciidoc +++ /dev/null @@ -1,187 +0,0 @@ -[[modules-discovery-quorums]] -=== Quorum-based decision making - -Electing a master node and changing the cluster state are the two fundamental -tasks that master-eligible nodes must work together to perform. It is important -that these activities work robustly even if some nodes have failed, and -Elasticsearch achieves this robustness by only considering each action to have -succeeded on receipt of responses from a _quorum_, a subset of the -master-eligible nodes in the cluster. The advantage of requiring only a subset -of the nodes to respond is that it allows for some of the nodes to fail without -preventing the cluster from making progress, and the quorums are carefully -chosen so as not to allow the cluster to "split brain", i.e. to be partitioned -into two pieces each of which may make decisions that are inconsistent with -those of the other piece. - -Elasticsearch allows you to add and remove master-eligible nodes to a running -cluster. In many cases you can do this simply by starting or stopping the nodes -as required, as described in more detail below. - -As nodes are added or removed Elasticsearch maintains an optimal level of fault -tolerance by updating the cluster's _voting configuration_, which is the set of -master-eligible nodes whose responses are counted when making decisions such as -electing a new master or committing a new cluster state. A decision is only -made once more than half of the nodes in the voting configuration have -responded. Usually the voting configuration is the same as the set of all the -master-eligible nodes that are currently in the cluster, but there are some -situations in which they may be different. - -To be sure that the cluster remains available you **must not stop half or more -of the nodes in the voting configuration at the same time**. As long as more -than half of the voting nodes are available the cluster can still work -normally. This means that if there are three or four master-eligible nodes then -the cluster can tolerate one of them being unavailable; if there are two or -fewer master-eligible nodes then they must all remain available. 
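As a quick check of how many master-eligible nodes a cluster currently has, and which of them is
the elected master, a `_cat/nodes` request along these lines may help (a sketch; the column names
assume the standard `_cat` API):

[source,js]
--------------------------------------------------
GET /_cat/nodes?v&h=name,node.role,master
--------------------------------------------------
// CONSOLE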
- -After a node has joined or left the cluster the elected master must issue a -cluster-state update that adjusts the voting configuration to match, and this -can take a short time to complete. It is important to wait for this adjustment -to complete before removing more nodes from the cluster. - -[float] -=== Getting the initial quorum - -When a brand-new cluster starts up for the first time, one of the tasks it must -perform is to elect its first master node, for which it needs to know the set -of master-eligible nodes whose votes should count in this first election. This -initial voting configuration is known as the _bootstrap configuration_. - -It is important that the bootstrap configuration identifies exactly which nodes -should vote in the first election, and it is not sufficient to configure each -node with an expectation of how many nodes there should be in the cluster. It -is also important to note that the bootstrap configuration must come from -outside the cluster: there is no safe way for the cluster to determine the -bootstrap configuration correctly on its own. - -If the bootstrap configuration is not set correctly then there is a risk when -starting up a brand-new cluster is that you accidentally form two separate -clusters instead of one. This could lead to data loss: you might start using -both clusters before noticing that anything had gone wrong, and it will then be -impossible to merge them together later. - -NOTE: To illustrate the problem with configuring each node to expect a certain -cluster size, imagine starting up a three-node cluster in which each node knows -that it is going to be part of a three-node cluster. A majority of three nodes -is two, so normally the first two nodes to discover each other will form a -cluster and the third node will join them a short time later. However, imagine -that four nodes were erroneously started instead of three: in this case there -are enough nodes to form two separate clusters. Of course if each node is -started manually then it's unlikely that too many nodes are started, but it's -certainly possible to get into this situation if using a more automated -orchestrator, particularly if the orchestrator is not resilient to failures -such as network partitions. - -The cluster bootstrapping process is only required the very first time a whole -cluster starts up: new nodes joining an established cluster can safely obtain -all the information they need from the elected master, and nodes that have -previously been part of a cluster will have stored to disk all the information -required when restarting. - -[float] -=== Cluster maintenance, rolling restarts and migrations - -Many cluster maintenance tasks involve temporarily shutting down one or more -nodes and then starting them back up again. By default Elasticsearch can remain -available if one of its master-eligible nodes is taken offline, such as during -a <>. Furthermore, if multiple nodes are -stopped and then started again then it will automatically recover, such as -during a <>. There is no need to take any -further action with the APIs described here in these cases, because the set of -master nodes is not changing permanently. - -It is also possible to perform a migration of a cluster onto entirely new nodes -without taking the cluster offline, via a _rolling migration_. 
A rolling -migration is similar to a rolling restart, in that it is performed one node at -a time, and also requires no special handling for the master-eligible nodes as -long as there are at least two of them available at all times. - -TODO the above is only true if the maintenance happens slowly enough, otherwise -the configuration might not catch up. Need to add this to the rolling restart -docs. - -[float] -==== Auto-reconfiguration - -Nodes may join or leave the cluster, and Elasticsearch reacts by making -corresponding changes to the voting configuration in order to ensure that the -cluster is as resilient as possible. The default auto-reconfiguration behaviour -is expected to give the best results in most situation. The current voting -configuration is stored in the cluster state so you can inspect its current -contents as follows: - -[source,js] --------------------------------------------------- -GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config --------------------------------------------------- -// CONSOLE - -NOTE: The current voting configuration is not necessarily the same as the set -of all available master-eligible nodes in the cluster. Altering the voting -configuration itself involves taking a vote, so it takes some time to adjust -the configuration as nodes join or leave the cluster. Also, there are -situations where the most resilient configuration includes unavailable nodes, -or does not include some available nodes, and in these situations the voting -configuration will differ from the set of available master-eligible nodes in -the cluster. - -Larger voting configurations are usually more resilient, so Elasticsearch will -normally prefer to add master-eligible nodes to the voting configuration once -they have joined the cluster. Similarly, if a node in the voting configuration -leaves the cluster and there is another master-eligible node in the cluster -that is not in the voting configuration then it is preferable to swap these two -nodes over, leaving the size of the voting configuration unchanged but -increasing its resilience. - -It is not so straightforward to automatically remove nodes from the voting -configuration after they have left the cluster, and different strategies have -different benefits and drawbacks, so the right choice depends on how the -cluster will be used and is controlled by the following setting. - -`cluster.auto_shrink_voting_configuration`:: - - Defaults to `true`, meaning that the voting configuration will - automatically shrink, shedding departed nodes, as long as it still contains - at least 3 nodes. If set to `false`, the voting configuration never - automatically shrinks; departed nodes must be removed manually using the - vote withdrawal API described below. - -NOTE: If `cluster.auto_shrink_voting_configuration` is set to `true`, the -recommended and default setting, and there are at least three master-eligible -nodes in the cluster, then Elasticsearch remains capable of processing -cluster-state updates as long as all but one of its master-eligible nodes are -healthy. - -There are situations in which Elasticsearch might tolerate the loss of multiple -nodes, but this is not guaranteed under all sequences of failures. If this -setting is set to `false` then departed nodes must be removed from the voting -configuration manually, using the vote withdrawal API described below, to achieve -the desired level of resilience. 
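If the default behaviour is not appropriate, the setting can in principle be changed like other
cluster-wide settings; the following is a sketch only, and assumes that
`cluster.auto_shrink_voting_configuration` is dynamically updatable via the cluster settings API:

[source,js]
--------------------------------------------------
PUT /_cluster/settings
{
  "persistent": {
    "cluster.auto_shrink_voting_configuration": false
  }
}
--------------------------------------------------
// CONSOLE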
- -Note that Elasticsearch will not suffer from a "split-brain" inconsistency -however it is configured. This setting only affects its availability in the -event of the failure of some of its nodes, and the administrative tasks that -must be performed as nodes join and leave the cluster. - -[float] -==== Even numbers of master-eligible nodes - -There should normally be an odd number of master-eligible nodes in a cluster. -If there is an even number then Elasticsearch will leave one of them out of the -voting configuration to ensure that it has an odd size. This does not decrease -the failure-tolerance of the cluster, and in fact improves it slightly: if the -cluster is partitioned into two even halves then one of the halves will contain -a majority of the voting configuration and will be able to keep operating, -whereas if all of the master-eligible nodes' votes were counted then neither -side could make any progress in this situation. - -For instance if there are four master-eligible nodes in the cluster and the -voting configuration contained all of them then any quorum-based decision would -require votes from at least three of them, which means that the cluster can -only tolerate the loss of a single master-eligible node. If this cluster were -split into two equal halves then neither half would contain three -master-eligible nodes so would not be able to make any progress. However if the -voting configuration contains only three of the four master-eligible nodes then -the cluster is still only fully tolerant to the loss of one node, but -quorum-based decisions require votes from two of the three voting nodes. In the -event of an even split, one half will contain two of the three voting nodes so -will remain available. From e10d76072f2771dc374cb6b9f18821027eec557b Mon Sep 17 00:00:00 2001 From: Yannick Welsch Date: Mon, 10 Dec 2018 12:10:16 +0100 Subject: [PATCH 035/106] remove coordination.asciidoc --- docs/reference/modules/coordination.asciidoc | 106 ------------------- 1 file changed, 106 deletions(-) delete mode 100644 docs/reference/modules/coordination.asciidoc diff --git a/docs/reference/modules/coordination.asciidoc b/docs/reference/modules/coordination.asciidoc deleted file mode 100644 index 5b38ab38f2f50..0000000000000 --- a/docs/reference/modules/coordination.asciidoc +++ /dev/null @@ -1,106 +0,0 @@ -[[modules-cluster-coordination]] -== Cluster coordination - -The cluster coordination module is responsible for electing a master node and -managing changes to the cluster state. - - - - - -[float] -=== Cluster bootstrapping - -When a brand-new cluster starts up for the first time, one of the tasks it must -perform is to elect its first master node, for which it needs to know the set -of master-eligible nodes whose votes should count in this first election. This -initial voting configuration is known as the _bootstrap configuration_. - -It is important that the bootstrap configuration identifies exactly which nodes -should vote in the first election, and it is not sufficient to configure each -node with an expectation of how many nodes there should be in the cluster. It -is also important to note that the bootstrap configuration must come from -outside the cluster: there is no safe way for the cluster to determine the -bootstrap configuration correctly on its own. - -If the bootstrap configuration is not set correctly then there is a risk when -starting up a brand-new cluster is that you accidentally form two separate -clusters instead of one. 
This could lead to data loss: you might start using -both clusters before noticing that anything had gone wrong, and it will then be -impossible to merge them together later. - -NOTE: To illustrate the problem with configuring each node to expect a certain -cluster size, imagine starting up a three-node cluster in which each node knows -that it is going to be part of a three-node cluster. A majority of three nodes -is two, so normally the first two nodes to discover each other will form a -cluster and the third node will join them a short time later. However, imagine -that four nodes were erroneously started instead of three: in this case there -are enough nodes to form two separate clusters. Of course if each node is -started manually then it's unlikely that too many nodes are started, but it's -certainly possible to get into this situation if using a more automated -orchestrator, particularly if the orchestrator is not resilient to failures -such as network partitions. - -The cluster bootstrapping process is only required the very first time a whole -cluster starts up: new nodes joining an established cluster can safely obtain -all the information they need from the elected master, and nodes that have -previously been part of a cluster will have stored to disk all the information -required when restarting. - -A cluster can be bootstrapped by setting the names or addresses of the initial -set of master nodes in the `elasticsearch.yml` file: - -[source] --------------------------------------------------- -cluster.initial_master_nodes: - - master-a - - master-b - - master-c --------------------------------------------------- - -This only needs to be set on a single master-eligible node in the cluster, but -for robustness it is safe to set this on every node in the cluster. However -**it is vitally important** to use exactly the same set of nodes in each -configuration file. - -WARNING: You must put exactly the same set of master nodes in each -configuration file in order to be sure that only a single cluster forms during -bootstrapping and therefore to avoid the risk of data loss. - -It is also possible to set the initial set of master nodes on the command-line -used to start Elasticsearch: - -[source] --------------------------------------------------- -$ bin/elasticsearch -Ecluster.initial_master_nodes=master-a,master-b,master-c --------------------------------------------------- - - -If the cluster is running with a completely default configuration then it will -automatically bootstrap based on the nodes that could be discovered within a -short time after startup. Since nodes may not always reliably discover each -other quickly enough this automatic bootstrapping is not always reliable and -cannot be used in production deployments. - -[float] -=== Unsafe disaster recovery - -In a disaster situation a cluster may have lost half or more of its -master-eligible nodes and therefore be in a state in which it cannot elect a -master. There is no way to recover from this situation without risking data -loss (including the loss of indexed documents) but if there is no other viable -path forwards then this may be necessary. 
This can be performed with the -following command on a surviving node: - -[source,js] --------------------------------------------------- -POST /_cluster/force_local_node_takeover --------------------------------------------------- -// CONSOLE - -This forcibly overrides the current voting configuration with one in which the -handling node is the only voting master, so that it forms a quorum on its own. -Because there is a risk of data loss when performing this command it requires -the `accept_data_loss` parameter to be set to `true` in the URL. - - From 024c9b236203eecbf11dc8005a46a228964f05ad Mon Sep 17 00:00:00 2001 From: Yannick Welsch Date: Tue, 11 Dec 2018 10:18:17 +0100 Subject: [PATCH 036/106] Adapt docker instructions --- .../configuring-tls-docker.asciidoc | 6 ++-- docs/reference/setup/install/docker.asciidoc | 28 +++++++++++-------- 2 files changed, 19 insertions(+), 15 deletions(-) diff --git a/docs/reference/security/securing-communications/configuring-tls-docker.asciidoc b/docs/reference/security/securing-communications/configuring-tls-docker.asciidoc index e7e1a00208adc..6c56e0377b135 100644 --- a/docs/reference/security/securing-communications/configuring-tls-docker.asciidoc +++ b/docs/reference/security/securing-communications/configuring-tls-docker.asciidoc @@ -106,7 +106,7 @@ services: image: {docker-image} environment: - node.name=es01 - - discovery.zen.minimum_master_nodes=2 + - cluster.initial_master_nodes=es01,es02 - ELASTIC_PASSWORD=$ELASTIC_PASSWORD <1> - "ES_JAVA_OPTS=-Xms512m -Xmx512m" - xpack.license.self_generated.type=trial <2> @@ -131,9 +131,9 @@ services: image: {docker-image} environment: - node.name=es02 - - discovery.zen.minimum_master_nodes=2 - - ELASTIC_PASSWORD=$ELASTIC_PASSWORD - discovery.zen.ping.unicast.hosts=es01 + - cluster.initial_master_nodes=es01,es02 + - ELASTIC_PASSWORD=$ELASTIC_PASSWORD - "ES_JAVA_OPTS=-Xms512m -Xmx512m" - xpack.license.self_generated.type=trial - xpack.security.enabled=true diff --git a/docs/reference/setup/install/docker.asciidoc b/docs/reference/setup/install/docker.asciidoc index 6eba32ba33202..267ea14420921 100644 --- a/docs/reference/setup/install/docker.asciidoc +++ b/docs/reference/setup/install/docker.asciidoc @@ -142,12 +142,12 @@ endif::[] Instructions for installing it can be found on the https://docs.docker.com/compose/install/#install-using-pip[Docker Compose webpage]. -The node `elasticsearch` listens on `localhost:9200` while `elasticsearch2` -talks to `elasticsearch` over a Docker network. +The node `es01` listens on `localhost:9200` while `es02` +talks to `es01` over a Docker network. This example also uses https://docs.docker.com/engine/tutorials/dockervolumes[Docker named volumes], -called `esdata1` and `esdata2` which will be created if not already present. +called `esdata01` and `esdata02` which will be created if not already present. 
[[docker-prod-cluster-composefile]] `docker-compose.yml`: @@ -163,10 +163,12 @@ ifeval::["{release-state}"!="unreleased"] -------------------------------------------- version: '2.2' services: - elasticsearch: + es01: image: {docker-image} - container_name: elasticsearch + container_name: es01 environment: + - node.name=es01 + - cluster.initial_master_nodes=es01,es02 - cluster.name=docker-cluster - bootstrap.memory_lock=true - "ES_JAVA_OPTS=-Xms512m -Xmx512m" @@ -175,32 +177,34 @@ services: soft: -1 hard: -1 volumes: - - esdata1:/usr/share/elasticsearch/data + - esdata01:/usr/share/elasticsearch/data ports: - 9200:9200 networks: - esnet - elasticsearch2: + es02: image: {docker-image} - container_name: elasticsearch2 + container_name: es02 environment: + - node.name=es02 + - discovery.zen.ping.unicast.hosts=es01 + - cluster.initial_master_nodes=es01,es02 - cluster.name=docker-cluster - bootstrap.memory_lock=true - "ES_JAVA_OPTS=-Xms512m -Xmx512m" - - "discovery.zen.ping.unicast.hosts=elasticsearch" ulimits: memlock: soft: -1 hard: -1 volumes: - - esdata2:/usr/share/elasticsearch/data + - esdata02:/usr/share/elasticsearch/data networks: - esnet volumes: - esdata1: + esdata01: driver: local - esdata2: + esdata02: driver: local networks: From 02b607c30e5f4092aa59fb4d2d8451136d461070 Mon Sep 17 00:00:00 2001 From: Yannick Welsch Date: Tue, 11 Dec 2018 10:27:12 +0100 Subject: [PATCH 037/106] adapt other uses of minimum_master_nodes --- distribution/src/config/elasticsearch.yml | 6 +++--- docs/reference/modules/snapshots.asciidoc | 5 ++--- 2 files changed, 5 insertions(+), 6 deletions(-) diff --git a/distribution/src/config/elasticsearch.yml b/distribution/src/config/elasticsearch.yml index 445c6f5c07fce..869692d01c06d 100644 --- a/distribution/src/config/elasticsearch.yml +++ b/distribution/src/config/elasticsearch.yml @@ -67,11 +67,11 @@ ${path.logs} # #discovery.zen.ping.unicast.hosts: ["host1", "host2"] # -# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1): +# Bootstrap the cluster using an initial set of master-eligible nodes: # -#discovery.zen.minimum_master_nodes: +#cluster.initial_master_nodes: ["node1", "node2"] # -# For more information, consult the zen discovery module documentation. +# For more information, consult the discovery and cluster formation module documentation. # # ---------------------------------- Gateway ----------------------------------- # diff --git a/docs/reference/modules/snapshots.asciidoc b/docs/reference/modules/snapshots.asciidoc index 48ae41ded2a78..7ee545d66cf0f 100644 --- a/docs/reference/modules/snapshots.asciidoc +++ b/docs/reference/modules/snapshots.asciidoc @@ -597,9 +597,8 @@ if the new cluster doesn't contain nodes with appropriate attributes that a rest index will not be successfully restored unless these index allocation settings are changed during restore operation. The restore operation also checks that restored persistent settings are compatible with the current cluster to avoid accidentally -restoring an incompatible settings such as `discovery.zen.minimum_master_nodes` and as a result disable a smaller cluster until the -required number of master eligible nodes is added. If you need to restore a snapshot with incompatible persistent settings, try -restoring it without the global cluster state. +restoring incompatible settings. If you need to restore a snapshot with incompatible persistent settings, try restoring it without +the global cluster state. 
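For example, a restore request that skips the global cluster state (and with it any incompatible
persistent settings) might be sketched as follows; the repository name `my_backup` and snapshot
name `snapshot_1` are placeholders:

[source,js]
--------------------------------------------------
POST /_snapshot/my_backup/snapshot_1/_restore
{
  "include_global_state": false
}
--------------------------------------------------
// CONSOLE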
[float] === Snapshot status From 60d64b46761800c5d11b5d9163d415e23847b28f Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 10:35:18 +0000 Subject: [PATCH 038/106] Whitespace --- docs/reference/modules/discovery.asciidoc | 391 +++++++++++----------- 1 file changed, 192 insertions(+), 199 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index e6e10c96659d1..2578e965e6273 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -8,58 +8,58 @@ modules, for example, all communication between nodes is done using the It is separated into several sections, which are explained below: -* <> is the process where nodes find - each other when starting up, or when losing a master. -* <> is a configuration step that's required - when an Elasticsearch starts up for the very first time. - In <>, with no discovery settings configured, - this step is automatically performed by the nodes themselves. As this - auto-bootstrapping is <>, running - a node in <> requires an explicit cluster - bootstrapping step. -* It is recommended to have a small and fixed number of master-eligible nodes in - a cluster, and to scale the cluster up and down by adding and removing - non-master-eligible nodes only. However there are situations in which it may be - desirable to add or remove some master-eligible nodes to or from a cluster. - A section on <> - describes how Elasticsearch supports dynamically adding and removing - master-eligible nodes where, under certain conditions, special care must be taken. +* <> is the process where nodes + find each other when starting up, or when losing a master. +* <> is a configuration step that's + required when an Elasticsearch starts up for the very first time. In + <>, with no discovery settings + configured, this step is automatically performed by the nodes themselves. As + this auto-bootstrapping is <>, + running a node in <> requires an explicit + cluster bootstrapping step. +* It is recommended to have a small and fixed number of master-eligible nodes + in a cluster, and to scale the cluster up and down by adding and removing + non-master-eligible nodes only. However there are situations in which it may + be desirable to add or remove some master-eligible nodes to or from a + cluster. A section on <> describes how Elasticsearch supports dynamically adding and + removing master-eligible nodes where, under certain conditions, special care + must be taken. * <> covers how a master publishes cluster states to the other nodes in the cluster. -* <> describes what operations should be rejected when there is - no active master. +* <> describes what operations should be rejected when there + is no active master. * <> and <> sections cover advanced settings to influence the election and fault detection processes. -* <> explains the design - behind the master election and auto-reconfiguration logic. - +* <> explains the + design behind the master election and auto-reconfiguration logic. [float] [[modules-discovery-hosts-providers]] === Discovery -The cluster formation module uses a list of _seed_ nodes in order to start -off the discovery process. At startup, or when disconnected from a master, +The cluster formation module uses a list of _seed_ nodes in order to start off +the discovery process. 
At startup, or when disconnected from a master, Elasticsearch tries to connect to each seed node in its list, and holds a gossip-like conversation with them to find other nodes and to build a complete -picture of the master-eligible nodes in the cluster. By default the cluster formation -module offers two hosts providers to configure the list of seed nodes: +picture of the master-eligible nodes in the cluster. By default the cluster +formation module offers two hosts providers to configure the list of seed nodes: a _settings-based_ and a _file-based_ hosts provider, but can be extended to -support cloud environments and other forms of host providers via plugins. -Host providers are configured using the `discovery.zen.hosts_provider` setting, -which defaults to the _settings-based_ hosts provider. Multiple hosts providers -can be specified as a list. +support cloud environments and other forms of host providers via plugins. Host +providers are configured using the `discovery.zen.hosts_provider` setting, which +defaults to the _settings-based_ hosts provider. Multiple hosts providers can be +specified as a list. [float] [[settings-based-hosts-provider]] ===== Settings-based hosts provider -The settings-based hosts provider use a node setting to configure a static -list of hosts to use as seed nodes. These hosts can be specified as hostnames -or IP addresses; hosts specified as hostnames are resolved to IP addresses -during each round of pinging. Note that if you are in an environment where -DNS resolutions vary with time, you might need to adjust your -<>. +The settings-based hosts provider use a node setting to configure a static list +of hosts to use as seed nodes. These hosts can be specified as hostnames or IP +addresses; hosts specified as hostnames are resolved to IP addresses during each +round of pinging. Note that if you are in an environment where DNS resolutions +vary with time, you might need to adjust your <>. The list of hosts is set using the `discovery.zen.ping.unicast.hosts` static setting. This is either an array of hosts or a comma-delimited string. Each @@ -68,8 +68,8 @@ the setting `transport.profiles.default.port` falling back to `transport.tcp.port` if not set). Note that IPv6 hosts must be bracketed. The default for this setting is `127.0.0.1, [::1]` -Additionally, the `discovery.zen.ping.unicast.hosts.resolve_timeout` configures the -amount of time to wait for DNS lookups on each round of pinging. This is +Additionally, the `discovery.zen.ping.unicast.hosts.resolve_timeout` configures +the amount of time to wait for DNS lookups on each round of pinging. This is specified as a <> and defaults to 5s. Unicast discovery uses the <> module to perform the @@ -94,16 +94,16 @@ discovery.zen.hosts_provider: file ---------------------------------------------------------------- Then create a file at `$ES_PATH_CONF/unicast_hosts.txt` in the format described -below. Any time a change is made to the `unicast_hosts.txt` file the new -changes will be picked up by Elasticsearch and the new hosts list will be used. +below. Any time a change is made to the `unicast_hosts.txt` file the new changes +will be picked up by Elasticsearch and the new hosts list will be used. Note that the file-based discovery plugin augments the unicast hosts list in `elasticsearch.yml`: if there are valid unicast host entries in `discovery.zen.ping.unicast.hosts` then they will be used in addition to those supplied in `unicast_hosts.txt`. 
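To illustrate how the two sources combine, a node configured along these lines (host names are
placeholders) would seed discovery with the hosts listed in `elasticsearch.yml` as well as those
read from `unicast_hosts.txt`:

[source,yaml]
--------------------------------------------------
discovery.zen.hosts_provider: file
discovery.zen.ping.unicast.hosts:
  - seed-host-1.example.com
  - seed-host-2.example.com
--------------------------------------------------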
-The `discovery.zen.ping.unicast.hosts.resolve_timeout` setting also applies to DNS -lookups for nodes specified by address via file-based discovery. This is +The `discovery.zen.ping.unicast.hosts.resolve_timeout` setting also applies to +DNS lookups for nodes specified by address via file-based discovery. This is specified as a <> and defaults to 5s. The format of the file is to specify one node entry per line. Each node entry @@ -138,14 +138,15 @@ line). ===== EC2 hosts provider The {plugins}/discovery-ec2.html[EC2 discovery plugin] adds a hosts provider -that uses the https://github.com/aws/aws-sdk-java[AWS API] to find a list of seed nodes. +that uses the https://github.com/aws/aws-sdk-java[AWS API] to find a list of +seed nodes. [float] [[azure-classic-hosts-provider]] ===== Azure Classic hosts provider -The {plugins}/discovery-azure-classic.html[Azure Classic discovery plugin] adds a hosts provider -that uses the Azure Classic API find a list of seed nodes. +The {plugins}/discovery-azure-classic.html[Azure Classic discovery plugin] adds +a hosts provider that uses the Azure Classic API find a list of seed nodes. [float] [[gce-hosts-provider]] @@ -157,15 +158,15 @@ that uses the GCE API find a list of seed nodes. [float] ==== Discovery settings -Discovery operates in two phases: First, each node "probes" the addresses of -all known nodes by connecting to each address and attempting to identify the -node to which it is connected. Secondly it shares with the remote node a list -of all of its peers and the remote node responds with _its_ peers in turn. The -node then probes all the new nodes about which it just discovered, requests -their peers, and so on, until it has discovered an elected master node or -enough other masterless nodes that it can perform an election. If neither of -these occur quickly enough then it tries again. This process is controlled by -the following settings. +Discovery operates in two phases: First, each node "probes" the addresses of all +known nodes by connecting to each address and attempting to identify the node to +which it is connected. Secondly it shares with the remote node a list of all of +its peers and the remote node responds with _its_ peers in turn. The node then +probes all the new nodes about which it just discovered, requests their peers, +and so on, until it has discovered an elected master node or enough other +masterless nodes that it can perform an election. If neither of these occur +quickly enough then it tries again. This process is controlled by the following +settings. `discovery.probe.connect_timeout`:: @@ -186,23 +187,22 @@ the following settings. Sets how long a node will wait after asking its peers again before considering the request to have failed. - [float] [[modules-discovery-bootstrap-cluster]] === Bootstrapping a cluster -Starting an Elasticsearch cluster for the very first time requires a -cluster bootstrapping step. +Starting an Elasticsearch cluster for the very first time requires a cluster +bootstrapping step. -The simplest way to bootstrap a cluster is by specifying the node names -or transport addresses of at least a non-empty subset of the master-eligible nodes -before start-up. The node setting `cluster.initial_master_nodes`, which -takes a list of node names or transport addresses, can be either specified -on the command line when starting up the nodes, or be added to the node -configuration file `elasticsearch.yml`. 
+The simplest way to bootstrap a cluster is by specifying the node names or +transport addresses of at least a non-empty subset of the master-eligible nodes +before start-up. The node setting `cluster.initial_master_nodes`, which takes a +list of node names or transport addresses, can be either specified on the +command line when starting up the nodes, or be added to the node configuration +file `elasticsearch.yml`. -For a cluster with 3 master-eligible nodes (named master-a, master-b, and master-c) -the configuration will look as follows: +For a cluster with 3 master-eligible nodes (named master-a, master-b, and +master-c) the configuration will look as follows: [source,yaml] -------------------------------------------------- @@ -214,39 +214,37 @@ cluster.initial_master_nodes: TODO provide another example with ip addresses (+ possibly port) -Note that if you have not explicitly configured a node name, this -name defaults to the host name, so using the host names will work as well. -While it is sufficient to set this on a single master-eligible node -in the cluster, and only mention a single master-eligible node, using -multiple nodes for bootstrapping allows the bootstrap process to go -through even if not all nodes are available. In any case, when -specifying the list of initial master nodes, **it is vitally important** -to configure each node with exactly the same list of nodes, to prevent -two independent clusters from forming. Typically you will set this -on the nodes that are mentioned in the list of initial master nodes. +Note that if you have not explicitly configured a node name, this name defaults +to the host name, so using the host names will work as well. While it is +sufficient to set this on a single master-eligible node in the cluster, and only +mention a single master-eligible node, using multiple nodes for bootstrapping +allows the bootstrap process to go through even if not all nodes are available. +In any case, when specifying the list of initial master nodes, **it is vitally +important** to configure each node with exactly the same list of nodes, to +prevent two independent clusters from forming. Typically you will set this on +the nodes that are mentioned in the list of initial master nodes. WARNING: You must put exactly the same set of initial master nodes in each - configuration file in order to be sure that only a single cluster forms during - bootstrapping and therefore to avoid the risk of data loss. - + configuration file in order to be sure that only a single cluster forms during + bootstrapping and therefore to avoid the risk of data loss. -It is also possible to set the initial set of master nodes on the -command-line used to start Elasticsearch: +It is also possible to set the initial set of master nodes on the command-line +used to start Elasticsearch: [source,bash] -------------------------------------------------- $ bin/elasticsearch -Ecluster.initial_master_nodes=master-a,master-b,master-c -------------------------------------------------- -Just as with the config file, this additional command-line parameter -can be removed once a cluster has successfully formed. +Just as with the config file, this additional command-line parameter can be +removed once a cluster has successfully formed. [float] ==== Choosing a cluster name -The `cluster.name` allows to create separated clusters from one another. -The default value for the cluster name is `elasticsearch`, though it is -recommended to change this to reflect the logical group name of the -cluster running. 
+
+The `cluster.name` setting allows clusters to be separated from one another. The
+default value for the cluster name is `elasticsearch`, though it is recommended
+to change this to reflect the logical group name of the cluster running.
 
 [float]
 ==== Auto-bootstrapping in development mode
@@ -266,15 +264,14 @@ tolerance by updating the cluster's _voting configuration_, which is the set of
 master-eligible nodes whose responses are counted when making decisions such as
 electing a new master or committing a new cluster state.
 
-It is recommended to have a small and fixed number of master-eligible nodes in
-a cluster, and to scale the cluster up and down by adding and removing
+It is recommended to have a small and fixed number of master-eligible nodes in a
+cluster, and to scale the cluster up and down by adding and removing
 non-master-eligible nodes only. However there are situations in which it may be
 desirable to add or remove some master-eligible nodes to or from a cluster.
 
 If you wish to add some master-eligible nodes to your cluster, simply configure
-the new nodes to find the existing cluster and start them up. Elasticsearch
-will add the new nodes to the voting configuration if it is appropriate to do
-so.
+the new nodes to find the existing cluster and start them up. Elasticsearch will
+add the new nodes to the voting configuration if it is appropriate to do so.
 
 When removing master-eligible nodes, it is important not to remove too many all
 at the same time. For instance, if there are currently seven master-eligible
@@ -290,16 +287,16 @@ the auto-reconfiguration to take effect after each removal.
 
 If there are only two master-eligible nodes then neither node can be safely
 removed since both are required to reliably make progress, so you must first
 inform Elasticsearch that one of the nodes should not be part of the voting
-configuration, and that the voting power should instead be given to other
-nodes, allowing the excluded node to be taken offline without preventing the
-other node from making progress. A node which is added to a voting
-configuration exclusion list still works normally, but Elasticsearch will try
-and remove it from the voting configuration so its vote is no longer required,
-and will never automatically move such a node back into the voting
-configuration after it has been removed. Once a node has been successfully
-reconfigured out of the voting configuration, it is safe to shut it down
-without affecting the cluster's availability. A node can be added to the voting
-configuration exclusion list using the following API:
+configuration, and that the voting power should instead be given to other nodes,
+allowing the excluded node to be taken offline without preventing the other node
+from making progress. A node which is added to a voting configuration exclusion
+list still works normally, but Elasticsearch will try to remove it from the
+voting configuration so its vote is no longer required, and will never
+automatically move such a node back into the voting configuration after it has
+been removed. Once a node has been successfully reconfigured out of the voting
+configuration, it is safe to shut it down without affecting the cluster's
+availability. A node can be added to the voting configuration exclusion list
+using the following API:
 
 [source,js]
 --------------------------------------------------
@@ -319,15 +316,15 @@ voting configuration exclusions API fails then the call can safely be retried.
A successful response guarantees that the node has been removed from the voting configuration and will not be reinstated. -Although the voting configuration exclusions API is most useful for -down-scaling a two-node to a one-node cluster, it is also possible to use it to -remove multiple nodes from larger clusters all at the same time. Adding -multiple nodes to the exclusions list has the system try to auto-reconfigure -all of these nodes out of the voting configuration, allowing them to be safely -shut down while keeping the cluster available. In the example described above, -shrinking a seven-master-node cluster down to only have three master nodes, you -could add four nodes to the exclusions list, wait for confirmation, and then -shut them down simultaneously. +Although the voting configuration exclusions API is most useful for down-scaling +a two-node to a one-node cluster, it is also possible to use it to remove +multiple nodes from larger clusters all at the same time. Adding multiple nodes +to the exclusions list has the system try to auto-reconfigure all of these nodes +out of the voting configuration, allowing them to be safely shut down while +keeping the cluster available. In the example described above, shrinking a +seven-master-node cluster down to only have three master nodes, you could add +four nodes to the exclusions list, wait for confirmation, and then shut them +down simultaneously. Adding an exclusion for a node creates an entry for that node in the voting configuration exclusions list, which has the system automatically try to @@ -348,11 +345,11 @@ This list is limited in size by the following setting: Sets a limits on the number of voting configuration exclusions at any one time. Defaults to `10`. -Since voting configuration exclusions are persistent and limited in number, -they must be cleaned up. Normally an exclusion is added when performing some +Since voting configuration exclusions are persistent and limited in number, they +must be cleaned up. Normally an exclusion is added when performing some maintenance on the cluster, and the exclusions should be cleaned up when the -maintenance is complete. Clusters should have no voting configuration -exclusions in normal operation. +maintenance is complete. Clusters should have no voting configuration exclusions +in normal operation. If a node is excluded from the voting configuration because it is to be shut down permanently then its exclusion can be removed once it has shut down and @@ -361,9 +358,9 @@ created in error or were only required temporarily: [source,js] -------------------------------------------------- -# Wait for all the nodes with voting configuration exclusions to be removed -# from the cluster and then remove all the exclusions, allowing any node to -# return to the voting configuration in the future. +# Wait for all the nodes with voting configuration exclusions to be removed from +# the cluster and then remove all the exclusions, allowing any node to return to +# the voting configuration in the future. DELETE /_cluster/voting_config_exclusions # Immediately remove all the voting configuration exclusions, allowing any node # to return to the voting configuration in the future. @@ -380,34 +377,33 @@ cluster state. The master node processes one cluster state update at a time, applies the required changes and publishes the updated cluster state to all the other nodes in the cluster. Each node receives the publish message, acknowledges it, but does *not* yet apply it. 
If the master does not receive acknowledgement -from enough nodes within a certain time -(controlled by the `cluster.publish.timeout` setting and defaults to 30 -seconds) the cluster state change is rejected. +from enough nodes within a certain time (controlled by the +`cluster.publish.timeout` setting and defaults to 30 seconds) the cluster state +change is rejected. Once enough nodes have responded, the cluster state is committed and a message will be sent to all the nodes. The nodes then proceed to apply the new cluster state to their internal state. The master node waits for all nodes to respond, up to a timeout, before going ahead processing the next updates in the queue. -The `cluster.publish.timeout` is set by default to 30 seconds and is -measured from the moment the publishing started. +The `cluster.publish.timeout` is set by default to 30 seconds and is measured +from the moment the publishing started. TODO add lag detection -Note, Elasticsearch is a peer to peer based system, nodes communicate -with one another directly if operations are delegated / broadcast. All -the main APIs (index, delete, search) do not communicate with the master -node. The responsibility of the master node is to maintain the global -cluster state, and act if nodes join or leave the cluster by reassigning -shards. Each time a cluster state is changed, the state is made known to -the other nodes in the cluster (the manner depends on the actual -discovery implementation). +Note, Elasticsearch is a peer to peer based system, nodes communicate with one +another directly if operations are delegated / broadcast. All the main APIs +(index, delete, search) do not communicate with the master node. The +responsibility of the master node is to maintain the global cluster state, and +act if nodes join or leave the cluster by reassigning shards. Each time a +cluster state is changed, the state is made known to the other nodes in the +cluster (the manner depends on the actual discovery implementation). [float] [[no-master-block]] === No master block -For the cluster to be fully operational, it must have an active master. -The `discovery.zen.no_master_block` settings controls what operations should be +For the cluster to be fully operational, it must have an active master. The +`discovery.zen.no_master_block` settings controls what operations should be rejected when there is no active master. The `discovery.zen.no_master_block` setting has two valid options: @@ -459,16 +455,15 @@ scheduling of elections. `cluster.election.duration`:: - Sets how long each election is allowed to take before a node considers it - to have failed and schedules a retry. This defaults to `500ms`. - + Sets how long each election is allowed to take before a node considers it to + have failed and schedules a retry. This defaults to `500ms`. [float] ==== Joining an elected master -During master election, or when joining an existing formed cluster, a node will send -a join request to the master in order to be officially added to the cluster. This join -process can be configured with the following settings. +During master election, or when joining an existing formed cluster, a node will +send a join request to the master in order to be officially added to the +cluster. This join process can be configured with the following settings. `cluster.join.timeout`:: @@ -483,10 +478,9 @@ process can be configured with the following settings. 
An elected master periodically checks each of its followers in order to ensure that they are still connected and healthy, and in turn each follower periodically checks the health of the elected master. Elasticsearch allows for -these checks occasionally to fail or timeout without taking any action, and -will only consider a node to be truly faulty after a number of consecutive -checks have failed. The following settings control the behaviour of fault -detection. +these checks occasionally to fail or timeout without taking any action, and will +only consider a node to be truly faulty after a number of consecutive checks +have failed. The following settings control the behaviour of fault detection. `cluster.fault_detection.follower_check.interval`:: @@ -516,9 +510,9 @@ detection. `cluster.fault_detection.leader_check.retry_count`:: - Sets how many consecutive leader check failures must occur before a - follower node considers the elected master to be faulty and attempts to - find or elect a new master. Defaults to `3`. + Sets how many consecutive leader check failures must occur before a follower + node considers the elected master to be faulty and attempts to find or elect + a new master. Defaults to `3`. [float] [[modules-discovery-quorums]] @@ -543,18 +537,18 @@ as required, as described in more detail below. As nodes are added or removed Elasticsearch maintains an optimal level of fault tolerance by updating the cluster's _voting configuration_, which is the set of master-eligible nodes whose responses are counted when making decisions such as -electing a new master or committing a new cluster state. A decision is only -made once more than half of the nodes in the voting configuration have -responded. Usually the voting configuration is the same as the set of all the +electing a new master or committing a new cluster state. A decision is only made +once more than half of the nodes in the voting configuration have responded. +Usually the voting configuration is the same as the set of all the master-eligible nodes that are currently in the cluster, but there are some situations in which they may be different. To be sure that the cluster remains available you **must not stop half or more of the nodes in the voting configuration at the same time**. As long as more -than half of the voting nodes are available the cluster can still work -normally. This means that if there are three or four master-eligible nodes then -the cluster can tolerate one of them being unavailable; if there are two or -fewer master-eligible nodes then they must all remain available. +than half of the voting nodes are available the cluster can still work normally. +This means that if there are three or four master-eligible nodes then the +cluster can tolerate one of them being unavailable; if there are two or fewer +master-eligible nodes then they must all remain available. After a node has joined or left the cluster the elected master must issue a cluster-state update that adjusts the voting configuration to match, and this @@ -565,16 +559,16 @@ to complete before removing more nodes from the cluster. ==== Getting the initial quorum When a brand-new cluster starts up for the first time, one of the tasks it must -perform is to elect its first master node, for which it needs to know the set -of master-eligible nodes whose votes should count in this first election. This +perform is to elect its first master node, for which it needs to know the set of +master-eligible nodes whose votes should count in this first election. 
This initial voting configuration is known as the _bootstrap configuration_. It is important that the bootstrap configuration identifies exactly which nodes should vote in the first election, and it is not sufficient to configure each -node with an expectation of how many nodes there should be in the cluster. It -is also important to note that the bootstrap configuration must come from -outside the cluster: there is no safe way for the cluster to determine the -bootstrap configuration correctly on its own. +node with an expectation of how many nodes there should be in the cluster. It is +also important to note that the bootstrap configuration must come from outside +the cluster: there is no safe way for the cluster to determine the bootstrap +configuration correctly on its own. If the bootstrap configuration is not set correctly then there is a risk when starting up a brand-new cluster is that you accidentally form two separate @@ -591,31 +585,31 @@ that four nodes were erroneously started instead of three: in this case there are enough nodes to form two separate clusters. Of course if each node is started manually then it's unlikely that too many nodes are started, but it's certainly possible to get into this situation if using a more automated -orchestrator, particularly if the orchestrator is not resilient to failures -such as network partitions. +orchestrator, particularly if the orchestrator is not resilient to failures such +as network partitions. The <> is only required the very first time a whole cluster starts up: new nodes joining -an established cluster can safely obtain all the information they need from -the elected master, and nodes that have previously been part of a cluster -will have stored to disk all the information required when restarting. +an established cluster can safely obtain all the information they need from the +elected master, and nodes that have previously been part of a cluster will have +stored to disk all the information required when restarting. [float] ==== Cluster maintenance, rolling restarts and migrations Many cluster maintenance tasks involve temporarily shutting down one or more nodes and then starting them back up again. By default Elasticsearch can remain -available if one of its master-eligible nodes is taken offline, such as during -a <>. Furthermore, if multiple nodes are -stopped and then started again then it will automatically recover, such as -during a <>. There is no need to take any -further action with the APIs described here in these cases, because the set of -master nodes is not changing permanently. +available if one of its master-eligible nodes is taken offline, such as during a +<>. Furthermore, if multiple nodes are stopped +and then started again then it will automatically recover, such as during a +<>. There is no need to take any further +action with the APIs described here in these cases, because the set of master +nodes is not changing permanently. It is also possible to perform a migration of a cluster onto entirely new nodes without taking the cluster offline, via a _rolling migration_. A rolling -migration is similar to a rolling restart, in that it is performed one node at -a time, and also requires no special handling for the master-eligible nodes as +migration is similar to a rolling restart, in that it is performed one node at a +time, and also requires no special handling for the master-eligible nodes as long as there are at least two of them available at all times. 
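Between the steps of a rolling restart or rolling migration it can be useful to confirm that the
voting configuration has settled before taking the next node offline. The request below, which
also appears later in this section, retrieves the last-committed voting configuration from the
cluster state:

[source,js]
--------------------------------------------------
GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config
--------------------------------------------------
// CONSOLE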
TODO the above is only true if the maintenance happens slowly enough, otherwise @@ -638,34 +632,33 @@ GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_con -------------------------------------------------- // CONSOLE -NOTE: The current voting configuration is not necessarily the same as the set -of all available master-eligible nodes in the cluster. Altering the voting -configuration itself involves taking a vote, so it takes some time to adjust -the configuration as nodes join or leave the cluster. Also, there are -situations where the most resilient configuration includes unavailable nodes, -or does not include some available nodes, and in these situations the voting -configuration will differ from the set of available master-eligible nodes in -the cluster. +NOTE: The current voting configuration is not necessarily the same as the set of +all available master-eligible nodes in the cluster. Altering the voting +configuration itself involves taking a vote, so it takes some time to adjust the +configuration as nodes join or leave the cluster. Also, there are situations +where the most resilient configuration includes unavailable nodes, or does not +include some available nodes, and in these situations the voting configuration +will differ from the set of available master-eligible nodes in the cluster. Larger voting configurations are usually more resilient, so Elasticsearch will normally prefer to add master-eligible nodes to the voting configuration once they have joined the cluster. Similarly, if a node in the voting configuration -leaves the cluster and there is another master-eligible node in the cluster -that is not in the voting configuration then it is preferable to swap these two -nodes over, leaving the size of the voting configuration unchanged but -increasing its resilience. +leaves the cluster and there is another master-eligible node in the cluster that +is not in the voting configuration then it is preferable to swap these two nodes +over, leaving the size of the voting configuration unchanged but increasing its +resilience. It is not so straightforward to automatically remove nodes from the voting configuration after they have left the cluster, and different strategies have -different benefits and drawbacks, so the right choice depends on how the -cluster will be used and is controlled by the following setting. +different benefits and drawbacks, so the right choice depends on how the cluster +will be used and is controlled by the following setting. `cluster.auto_shrink_voting_configuration`:: - Defaults to `true`, meaning that the voting configuration will - automatically shrink, shedding departed nodes, as long as it still contains - at least 3 nodes. If set to `false`, the voting configuration never - automatically shrinks; departed nodes must be removed manually using the + Defaults to `true`, meaning that the voting configuration will automatically + shrink, shedding departed nodes, as long as it still contains at least 3 + nodes. If set to `false`, the voting configuration never automatically + shrinks; departed nodes must be removed manually using the <>. NOTE: If `cluster.auto_shrink_voting_configuration` is set to `true`, the @@ -677,8 +670,8 @@ healthy. There are situations in which Elasticsearch might tolerate the loss of multiple nodes, but this is not guaranteed under all sequences of failures. 
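Operators who prefer to manage the voting configuration entirely by hand can disable the
automatic shrinking described above. This is a minimal sketch of the override; the default of
`true` is normally the better choice.

[source,yaml]
--------------------------------------------------
# Keep departed nodes in the voting configuration until they are removed
# explicitly via the voting configuration exclusions API.
cluster.auto_shrink_voting_configuration: false
--------------------------------------------------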
If this setting is set to `false` then departed nodes must be removed from the voting -configuration manually, using the vote withdrawal API described below, to achieve -the desired level of resilience. +configuration manually, using the vote withdrawal API described below, to +achieve the desired level of resilience. Note that Elasticsearch will not suffer from a "split-brain" inconsistency however it is configured. This setting only affects its availability in the @@ -699,12 +692,12 @@ side could make any progress in this situation. For instance if there are four master-eligible nodes in the cluster and the voting configuration contained all of them then any quorum-based decision would -require votes from at least three of them, which means that the cluster can -only tolerate the loss of a single master-eligible node. If this cluster were -split into two equal halves then neither half would contain three -master-eligible nodes so would not be able to make any progress. However if the -voting configuration contains only three of the four master-eligible nodes then -the cluster is still only fully tolerant to the loss of one node, but -quorum-based decisions require votes from two of the three voting nodes. In the -event of an even split, one half will contain two of the three voting nodes so -will remain available. +require votes from at least three of them, which means that the cluster can only +tolerate the loss of a single master-eligible node. If this cluster were split +into two equal halves then neither half would contain three master-eligible +nodes so would not be able to make any progress. However if the voting +configuration contains only three of the four master-eligible nodes then the +cluster is still only fully tolerant to the loss of one node, but quorum-based +decisions require votes from two of the three voting nodes. In the event of an +even split, one half will contain two of the three voting nodes so will remain +available. From 6102d5b7438c1ed430bf38bc055f98382d655d8e Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 10:36:38 +0000 Subject: [PATCH 039/106] Cluster formation module forms clusters --- docs/reference/modules/discovery.asciidoc | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 2578e965e6273..43bcb964fc770 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -1,10 +1,11 @@ [[modules-discovery]] == Discovery and cluster formation -The discovery and cluster formation module is responsible for discovering nodes, -electing a master, and publishing the cluster state. It is integrated with other -modules, for example, all communication between nodes is done using the -<> module. +The discovery and cluster formation module is responsible for discovering +nodes, electing a master, forming a cluster, and publishing the cluster state +each time it changes. It is integrated with other modules, for example, all +communication between nodes is done using the <> +module. 
It is separated into several sections, which are explained below: From 9fa08448c1af06356bc72fdd30acf68952eca505 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 10:42:27 +0000 Subject: [PATCH 040/106] Rewording of summary --- docs/reference/modules/discovery.asciidoc | 29 +++++++++++++---------- 1 file changed, 16 insertions(+), 13 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 43bcb964fc770..5d1a781426d6b 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -10,26 +10,29 @@ module. It is separated into several sections, which are explained below: * <> is the process where nodes - find each other when starting up, or when losing a master. -* <> is a configuration step that's - required when an Elasticsearch starts up for the very first time. In - <>, with no discovery settings - configured, this step is automatically performed by the nodes themselves. As - this auto-bootstrapping is <>, - running a node in <> requires an explicit - cluster bootstrapping step. + find each other when the master is unknown, such as when a node has just + started up or when the previous master has failed. +* <> is required when an Elasticsearch + cluster starts up for the very first time. In <>, with no discovery settings configured, this is automatically + performed by the nodes themselves. As this auto-bootstrapping is + <>, running a node in + <> requires bootstrapping to be explicitly + configured via the `cluster.initial_master_nodes` setting. * It is recommended to have a small and fixed number of master-eligible nodes in a cluster, and to scale the cluster up and down by adding and removing non-master-eligible nodes only. However there are situations in which it may be desirable to add or remove some master-eligible nodes to or from a - cluster. A section on <> describes how Elasticsearch supports dynamically adding and - removing master-eligible nodes where, under certain conditions, special care - must be taken. + removing master-eligible nodes and also describes the extra steps that need + to be performed when removing more than half of the master-eligible nodes at + the same time. * <> covers how a master publishes cluster states to the other nodes in the cluster. -* <> describes what operations should be rejected when there - is no active master. +* The <> is put in place when there is no + known elected master, and can be configured to determine which operations + should be rejected when it is in place. * <> and <> sections cover advanced settings to influence the election and fault detection processes. * <> explains the From fb1e7d35e80fd7cf2d866fe85022026e9ab794d0 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 10:59:40 +0000 Subject: [PATCH 041/106] Link to plugins page --- docs/reference/modules/discovery.asciidoc | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 5d1a781426d6b..0800d968e7f4f 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -47,12 +47,13 @@ the discovery process. At startup, or when disconnected from a master, Elasticsearch tries to connect to each seed node in its list, and holds a gossip-like conversation with them to find other nodes and to build a complete picture of the master-eligible nodes in the cluster. 
By default the cluster -formation module offers two hosts providers to configure the list of seed nodes: -a _settings-based_ and a _file-based_ hosts provider, but can be extended to -support cloud environments and other forms of host providers via plugins. Host -providers are configured using the `discovery.zen.hosts_provider` setting, which -defaults to the _settings-based_ hosts provider. Multiple hosts providers can be -specified as a list. +formation module offers two hosts providers to configure the list of seed +nodes: a _settings-based_ and a _file-based_ hosts provider, but can be +extended to support cloud environments and other forms of host providers via +{plugins}/discovery.html[discovery plugins]. Host providers are configured +using the `discovery.zen.hosts_provider` setting, which defaults to the +_settings-based_ hosts provider. Multiple hosts providers can be specified as a +list. [float] [[settings-based-hosts-provider]] From bfc7d16aa093708f1a3cacd624b365c25c0557c1 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 10:59:52 +0000 Subject: [PATCH 042/106] Tweaks to discovery section --- docs/reference/modules/discovery.asciidoc | 40 ++++++++++++++--------- 1 file changed, 25 insertions(+), 15 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 0800d968e7f4f..31a55bc1ab245 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -163,15 +163,25 @@ that uses the GCE API find a list of seed nodes. [float] ==== Discovery settings -Discovery operates in two phases: First, each node "probes" the addresses of all -known nodes by connecting to each address and attempting to identify the node to -which it is connected. Secondly it shares with the remote node a list of all of -its peers and the remote node responds with _its_ peers in turn. The node then -probes all the new nodes about which it just discovered, requests their peers, -and so on, until it has discovered an elected master node or enough other -masterless nodes that it can perform an election. If neither of these occur -quickly enough then it tries again. This process is controlled by the following -settings. +Discovery operates in two phases: First, each node probes the addresses of all +known master-eligible nodes by connecting to each address and attempting to +identify the node to which it is connected. Secondly it shares with the remote +node a list of all of its known master-eligible peers and the remote node +responds with _its_ peers in turn. The node then probes all the new nodes about +which it just discovered, requests their peers, and so on, until it has +discovered an elected master node or enough other masterless master-eligible +nodes that it can perform an election. If neither of these occur quickly enough +then it tries again. This process is controlled by the following settings. + +`discovery.find_peers_interval`:: + + Sets how long a node will wait before attempting another discovery round. + Defaults to `1s`. + +`discovery.request_peers_timeout`:: + + Sets how long a node will wait after asking its peers again before + considering the request to have failed. Defaults to `3s`. `discovery.probe.connect_timeout`:: @@ -183,14 +193,14 @@ settings. Sets how long to wait when attempting to identify the remote node via a handshake. Defaults to `1s`. 
-`discovery.find_peers_interval`:: +`discovery.cluster_formation_warning_timeout`:: - Sets how long a node will wait before attempting another discovery round. + Sets how long a node will try to form a cluster before logging a warning + that the cluster did not form. Defaults to `10s`. -`discovery.request_peers_timeout`:: - - Sets how long a node will wait after asking its peers again before - considering the request to have failed. +If a cluster has not formed after `discovery.cluster_formation_warning_timeout` +has elapsed then the node will log a warning message starting with `master not +discovered` which describes the current state of the discovery module. [float] [[modules-discovery-bootstrap-cluster]] From 2ed2c39f13428476b6a91340c224ec53f8ba85c3 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 11:15:01 +0000 Subject: [PATCH 043/106] More on bootstrapping --- docs/reference/modules/discovery.asciidoc | 68 ++++++++++++++--------- 1 file changed, 42 insertions(+), 26 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 31a55bc1ab245..9a9602c0a78a4 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -206,18 +206,23 @@ discovered` which describes the current state of the discovery module. [[modules-discovery-bootstrap-cluster]] === Bootstrapping a cluster -Starting an Elasticsearch cluster for the very first time requires a cluster -bootstrapping step. +Starting an Elasticsearch cluster for the very first time requires the initial +set of master-eligible nodes to be explicitly set on one or more of the +master-eligible nodes in the cluster using this setting: -The simplest way to bootstrap a cluster is by specifying the node names or -transport addresses of at least a non-empty subset of the master-eligible nodes -before start-up. The node setting `cluster.initial_master_nodes`, which takes a -list of node names or transport addresses, can be either specified on the -command line when starting up the nodes, or be added to the node configuration -file `elasticsearch.yml`. +`cluster.initial_master_nodes`:: -For a cluster with 3 master-eligible nodes (named master-a, master-b, and -master-c) the configuration will look as follows: + Sets a list of the node names or transport addresses of the initial set of + master-eligible nodes in a brand-new cluster. By default this list is + empty, meaning that this node expects to join a cluster that has already + been bootstrapped. + +This setting can be given on the command line when starting up each node, or +added to the `elasticsearch.yml` configuration file. Once the cluster has +formed this setting is no longer required and should be removed. + +For a cluster with 3 master-eligible nodes (named `master-a`, `master-b` and +`master-c`) the configuration will look as follows: [source,yaml] -------------------------------------------------- @@ -227,21 +232,18 @@ cluster.initial_master_nodes: - master-c -------------------------------------------------- -TODO provide another example with ip addresses (+ possibly port) - -Note that if you have not explicitly configured a node name, this name defaults -to the host name, so using the host names will work as well. While it is -sufficient to set this on a single master-eligible node in the cluster, and only -mention a single master-eligible node, using multiple nodes for bootstrapping -allows the bootstrap process to go through even if not all nodes are available. 
-In any case, when specifying the list of initial master nodes, **it is vitally -important** to configure each node with exactly the same list of nodes, to -prevent two independent clusters from forming. Typically you will set this on -the nodes that are mentioned in the list of initial master nodes. +Alternatively the IP addresses or hostnames of the nodes can be used. If there +is more than one Elasticsearch node with the same IP address or hostname then +the transport ports must also be given -WARNING: You must put exactly the same set of initial master nodes in each - configuration file in order to be sure that only a single cluster forms during - bootstrapping and therefore to avoid the risk of data loss. +[source,yaml] +-------------------------------------------------- +cluster.initial_master_nodes: + - 10.0.10.101 + - 10.0.10.102:9300 + - 10.0.10.102:9301 + - master-node-hostname +-------------------------------------------------- It is also possible to set the initial set of master nodes on the command-line used to start Elasticsearch: @@ -251,8 +253,22 @@ used to start Elasticsearch: $ bin/elasticsearch -Ecluster.initial_master_nodes=master-a,master-b,master-c -------------------------------------------------- -Just as with the config file, this additional command-line parameter can be -removed once a cluster has successfully formed. +It is technically sufficient to set this on a single master-eligible node in +the cluster, and only to mention a single master-eligible node, but this does +not allow for this single node to fail before the cluster has fully formed. It +is therefore better to bootstrap using multiple master-eligible-nodes. In any +case, when specifying the list of initial master nodes, **it is vitally +important** to configure each node with exactly the same list of nodes, to +prevent two independent clusters from forming. Typically you will set this on +the nodes that are mentioned in the list of initial master nodes. + +NOTE: In alpha releases, all listed master-eligible nodes are required to be + discovered before bootstrapping can take place. This requirement will be + relaxed in production-ready releases. + +WARNING: You must put exactly the same set of initial master nodes in each + configuration file in order to be sure that only a single cluster forms during + bootstrapping and therefore to avoid the risk of data loss. [float] ==== Choosing a cluster name From 68d9ef57a599e5ebbcb672de9553eac860aaf727 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 11:21:02 +0000 Subject: [PATCH 044/106] Expand on cluster name --- docs/reference/modules/discovery.asciidoc | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 9a9602c0a78a4..76ce0939508e1 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -273,9 +273,12 @@ WARNING: You must put exactly the same set of initial master nodes in each [float] ==== Choosing a cluster name -The `cluster.name` allows to create separated clusters from one another. The -default value for the cluster name is `elasticsearch`, though it is recommended -to change this to reflect the logical group name of the cluster running. +The `cluster.name` allows you to create multiple clusters which are separated +from each other. 
Nodes verify that they agree on their cluster name when they +first connect to each other, and if two nodes have different cluster names then +they will not communicate meaningfully and will not belong to the same cluster. +The default value for the cluster name is `elasticsearch`, but it is +recommended to change this to reflect the logical name of the cluster. [float] ==== Auto-bootstrapping in development mode From a2b4d3895565c57ec1cc2a1cde84e44fc42e9122 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 11:21:15 +0000 Subject: [PATCH 045/106] Expand on 'default configuration' for auto-bootstrapping --- docs/reference/modules/discovery.asciidoc | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 76ce0939508e1..50c08d6e612b6 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -289,6 +289,15 @@ short time after startup. Since nodes may not always reliably discover each other quickly enough this automatic bootstrapping is not always reliable and cannot be used in production deployments. +If any of the following settings are configured then auto-bootstrapping will +not take place, and you must configure `cluster.initial_master_nodes` as +described in the <>: + +* `discovery.zen.hosts_provider` +* `discovery.zen.ping.unicast.hosts` +* `cluster.initial_master_nodes` + [float] [[modules-discovery-adding-removing-nodes]] === Adding and removing nodes From bb6ef8ee199965b0f188eaeedecdccb128a49f93 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 11:24:52 +0000 Subject: [PATCH 046/106] Master-ineligible --- docs/reference/modules/discovery.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 50c08d6e612b6..38c099ecb71b8 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -21,7 +21,7 @@ It is separated into several sections, which are explained below: configured via the `cluster.initial_master_nodes` setting. * It is recommended to have a small and fixed number of master-eligible nodes in a cluster, and to scale the cluster up and down by adding and removing - non-master-eligible nodes only. However there are situations in which it may + master-ineligible nodes only. However there are situations in which it may be desirable to add or remove some master-eligible nodes to or from a cluster. A section on <> describes how Elasticsearch supports dynamically adding and @@ -309,7 +309,7 @@ electing a new master or committing a new cluster state. It is recommended to have a small and fixed number of master-eligible nodes in a cluster, and to scale the cluster up and down by adding and removing -non-master-eligible nodes only. However there are situations in which it may be +master-ineligible nodes only. However there are situations in which it may be desirable to add or remove some master-eligible nodes to or from a cluster. 
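As a concrete illustration of the settings above, a master-eligible node that is joining an
existing, already-bootstrapped cluster only needs the cluster name and a way to discover its
peers. The sketch below uses hypothetical addresses and deliberately omits
`cluster.initial_master_nodes`, because the cluster has already been bootstrapped:

[source,yaml]
--------------------------------------------------
cluster.name: my-cluster            # hypothetical cluster name
discovery.zen.ping.unicast.hosts:   # addresses of some existing nodes
   - 10.0.10.101
   - 10.0.10.102
--------------------------------------------------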
If you wish to add some master-eligible nodes to your cluster, simply configure From cbd33fff45a2d9dd80866918347df681b1e804c9 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 11:25:06 +0000 Subject: [PATCH 047/106] Emphasize when you need voting exclusions --- docs/reference/modules/discovery.asciidoc | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 38c099ecb71b8..296dae1b4cf15 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -369,6 +369,11 @@ seven-master-node cluster down to only have three master nodes, you could add four nodes to the exclusions list, wait for confirmation, and then shut them down simultaneously. +NOTE: Voting exclusions are only required when removing at least half of the +master-eligible nodes from a cluster in a short time period. They are not +required when removing master-ineligible nodes, nor are they required when +removing fewer than half of the master-eligible nodes. + Adding an exclusion for a node creates an entry for that node in the voting configuration exclusions list, which has the system automatically try to reconfigure the voting configuration to remove that node and prevents it from From b91519cd1b4e2b3ea0748dfe19ed63ad3004b143 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 11:27:35 +0000 Subject: [PATCH 048/106] More on publishing --- docs/reference/modules/discovery.asciidoc | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 296dae1b4cf15..4fe0321d97232 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -425,15 +425,15 @@ cluster state. The master node processes one cluster state update at a time, applies the required changes and publishes the updated cluster state to all the other nodes in the cluster. Each node receives the publish message, acknowledges it, but does *not* yet apply it. If the master does not receive acknowledgement -from enough nodes within a certain time (controlled by the -`cluster.publish.timeout` setting and defaults to 30 seconds) the cluster state -change is rejected. - -Once enough nodes have responded, the cluster state is committed and a message -will be sent to all the nodes. The nodes then proceed to apply the new cluster -state to their internal state. The master node waits for all nodes to respond, -up to a timeout, before going ahead processing the next updates in the queue. -The `cluster.publish.timeout` is set by default to 30 seconds and is measured +from enough master-eligible nodes within a certain time (controlled by the +`cluster.publish.timeout` setting which defaults to 30 seconds) the cluster +state change is rejected. + +Once enough nodes have responded, the cluster state is committed and a commit +message is sent to all the nodes. The nodes then proceed to apply the new +cluster state to their internal state. The master node waits for all nodes to +respond, or until `cluster.publish.timeout` has elapsed, before starting to +process the next update in the queue. The `cluster.publish.timeout` is measured from the moment the publishing started. 
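As an illustration, on a cluster whose nodes are known to apply cluster states slowly, the
publish timeout could be raised as shown below. The value is purely illustrative; the 30-second
default is normally appropriate.

[source,yaml]
--------------------------------------------------
# Allow more time for publication before a cluster state update is rejected.
cluster.publish.timeout: 60s
--------------------------------------------------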
TODO add lag detection From 7540e4e0474c4f58d25f5a0fe37002f1e4a13c83 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 11:35:36 +0000 Subject: [PATCH 049/106] Add lag detection bit --- docs/reference/modules/discovery.asciidoc | 22 +++++++++++++--------- 1 file changed, 13 insertions(+), 9 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 4fe0321d97232..d41ff85cf6d2c 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -436,15 +436,19 @@ respond, or until `cluster.publish.timeout` has elapsed, before starting to process the next update in the queue. The `cluster.publish.timeout` is measured from the moment the publishing started. -TODO add lag detection - -Note, Elasticsearch is a peer to peer based system, nodes communicate with one -another directly if operations are delegated / broadcast. All the main APIs -(index, delete, search) do not communicate with the master node. The -responsibility of the master node is to maintain the global cluster state, and -act if nodes join or leave the cluster by reassigning shards. Each time a -cluster state is changed, the state is made known to the other nodes in the -cluster (the manner depends on the actual discovery implementation). +If a node fails to apply a cluster state update within the +`cluster.publish.timeout` timeout then its cluster state lags behind the most +recently-published state from the master. The master waits for a further +timeout, `cluster.follower_lag.timeout`, which defaults to 90 seconds, and if +the node has still not successfully applied the cluster state update then it is +removed from the cluster. + +NOTE: Elasticsearch is a peer to peer based system, in which nodes communicate +with one another directly. The high-throughput APIs (index, delete, search) do +not normally interact with the master node. The responsibility of the master +node is to maintain the global cluster state, and act if nodes join or leave the +cluster by reassigning shards. Each time the cluster state is changed, the new +state is published to all nodes in the cluster as described above. [float] [[no-master-block]] From d04b7ad4358ea48e4c1cf3ed9c6dd57e16475015 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 11:38:45 +0000 Subject: [PATCH 050/106] Tweaks --- docs/reference/modules/discovery.asciidoc | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index d41ff85cf6d2c..f18fe52b4e39d 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -458,7 +458,7 @@ For the cluster to be fully operational, it must have an active master. The `discovery.zen.no_master_block` settings controls what operations should be rejected when there is no active master. -The `discovery.zen.no_master_block` setting has two valid options: +The `discovery.zen.no_master_block` setting has two valid values: [horizontal] `all`:: All operations on the node--i.e. both read & writes--will be rejected. @@ -469,9 +469,9 @@ succeed, based on the last known cluster configuration. This may result in partial reads of stale data as this node may be isolated from the rest of the cluster. -The `discovery.zen.no_master_block` setting doesn't apply to nodes-based apis -(for example cluster stats, node info and node stats apis). 
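For example, a cluster that should reject every operation while there is no elected master could
be configured as follows; the default of `write` is usually preferable.

[source,yaml]
--------------------------------------------------
# Reject both reads and writes while there is no active master.
discovery.zen.no_master_block: all
--------------------------------------------------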
Requests to these -apis will not be blocked and can run on any available node. +The `discovery.zen.no_master_block` setting doesn't apply to nodes-based APIs +(for example cluster stats, node info, and node stats APIs). Requests to these +APIs will not be blocked and can run on any available node. [float] [[master-election]] @@ -584,7 +584,9 @@ those of the other piece. Elasticsearch allows you to add and remove master-eligible nodes to a running cluster. In many cases you can do this simply by starting or stopping the nodes -as required, as described in more detail below. +as required, as described in more detail in the +<>. As nodes are added or removed Elasticsearch maintains an optimal level of fault tolerance by updating the cluster's _voting configuration_, which is the set of @@ -608,7 +610,7 @@ can take a short time to complete. It is important to wait for this adjustment to complete before removing more nodes from the cluster. [float] -==== Getting the initial quorum +==== Setting the initial quorum When a brand-new cluster starts up for the first time, one of the tasks it must perform is to elect its first master node, for which it needs to know the set of From 8635a4181b6867973da5938819f834bef7152b87 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 11:44:39 +0000 Subject: [PATCH 051/106] Hyphen? --- docs/reference/modules/discovery.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index f18fe52b4e39d..51f25723ef145 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -256,7 +256,7 @@ $ bin/elasticsearch -Ecluster.initial_master_nodes=master-a,master-b,master-c It is technically sufficient to set this on a single master-eligible node in the cluster, and only to mention a single master-eligible node, but this does not allow for this single node to fail before the cluster has fully formed. It -is therefore better to bootstrap using multiple master-eligible-nodes. In any +is therefore better to bootstrap using multiple master-eligible nodes. In any case, when specifying the list of initial master nodes, **it is vitally important** to configure each node with exactly the same list of nodes, to prevent two independent clusters from forming. Typically you will set this on From 118944060fd67121933c9658e06ba45db9f24dcc Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 11:47:39 +0000 Subject: [PATCH 052/106] Consistentify with the `node.name` setting. --- distribution/src/config/elasticsearch.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/distribution/src/config/elasticsearch.yml b/distribution/src/config/elasticsearch.yml index 869692d01c06d..ceb1fc078648f 100644 --- a/distribution/src/config/elasticsearch.yml +++ b/distribution/src/config/elasticsearch.yml @@ -69,7 +69,7 @@ ${path.logs} # # Bootstrap the cluster using an initial set of master-eligible nodes: # -#cluster.initial_master_nodes: ["node1", "node2"] +#cluster.initial_master_nodes: ["node-1", "node-2"] # # For more information, consult the discovery and cluster formation module documentation. 
# From 7c7e7aff5342cd7fc4b70a27fa4a27884f943141 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 12:34:36 +0000 Subject: [PATCH 053/106] Add note on disconnections bypassing fault detection --- docs/reference/modules/discovery.asciidoc | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 51f25723ef145..a801ba9bc86cd 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -566,6 +566,14 @@ have failed. The following settings control the behaviour of fault detection. node considers the elected master to be faulty and attempts to find or elect a new master. Defaults to `3`. +If the elected master detects that a follower has disconnected then this is +treated as an immediate failure, bypassing the timeouts and retries listed +above, and the master attempts to remove the node from the cluster. Similarly, +if a follower detects that the elected master has disconnected then this is +treated as an immediate failure, bypassing the timeouts and retries listed +above, and the follower restarts its discovery phase to try and find or elect a +new master. + [float] [[modules-discovery-quorums]] === Quorum-based decision making From d48eccc6e96f7cf2aba46aeee9e50f0ead74a4c2 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 13:11:09 +0000 Subject: [PATCH 054/106] Add breaking changes --- docs/reference/migration/migrate_7_0.asciidoc | 2 + .../migration/migrate_7_0/cluster.asciidoc | 9 ----- .../migration/migrate_7_0/discovery.asciidoc | 38 +++++++++++++++++++ 3 files changed, 40 insertions(+), 9 deletions(-) create mode 100644 docs/reference/migration/migrate_7_0/discovery.asciidoc diff --git a/docs/reference/migration/migrate_7_0.asciidoc b/docs/reference/migration/migrate_7_0.asciidoc index 45f383435e4bc..9f99604318aa9 100644 --- a/docs/reference/migration/migrate_7_0.asciidoc +++ b/docs/reference/migration/migrate_7_0.asciidoc @@ -11,6 +11,7 @@ See also <> and <>. * <> * <> +* <> * <> * <> * <> @@ -44,6 +45,7 @@ Elasticsearch 6.x in order to be readable by Elasticsearch 7.x. include::migrate_7_0/aggregations.asciidoc[] include::migrate_7_0/analysis.asciidoc[] include::migrate_7_0/cluster.asciidoc[] +include::migrate_7_0/discovery.asciidoc[] include::migrate_7_0/indices.asciidoc[] include::migrate_7_0/mappings.asciidoc[] include::migrate_7_0/search.asciidoc[] diff --git a/docs/reference/migration/migrate_7_0/cluster.asciidoc b/docs/reference/migration/migrate_7_0/cluster.asciidoc index 732270706ff3d..bfe7d5df2d094 100644 --- a/docs/reference/migration/migrate_7_0/cluster.asciidoc +++ b/docs/reference/migration/migrate_7_0/cluster.asciidoc @@ -25,12 +25,3 @@ Clusters now have soft limits on the total number of open shards in the cluster based on the number of nodes and the `cluster.max_shards_per_node` cluster setting, to prevent accidental operations that would destabilize the cluster. More information can be found in the <>. 
- -[float] -==== Discovery configuration is required in production -Production deployments of Elasticsearch now require at least one of the following settings -to be specified in the `elasticsearch.yml` configuration file: - -- `discovery.zen.ping.unicast.hosts` -- `discovery.zen.hosts_provider` -- `cluster.initial_master_nodes` diff --git a/docs/reference/migration/migrate_7_0/discovery.asciidoc b/docs/reference/migration/migrate_7_0/discovery.asciidoc new file mode 100644 index 0000000000000..3d59442e965bd --- /dev/null +++ b/docs/reference/migration/migrate_7_0/discovery.asciidoc @@ -0,0 +1,38 @@ +[float] +[[breaking_70_discovery_changes]] +=== Discovery changes + +[float] +==== Cluster bootstrapping is required if discovery is configured + +The first time a cluster is started, `cluster.initial_master_nodes` must be set +to perform cluster bootstrapping. It should contain the names of the +master-eligible nodes in the initial cluster and be defined on every +master-eligible node in the cluster. The +<> describes this +setting in more detail. + +The `discovery.zen.minimum_master_nodes` setting is required during a rolling +upgrade from 6.x, but can be removed in all other circumstances. + +[float] +==== Removing master-eligible nodes sometimes requires voting exclusions + +If you wish to remove half or more of the master-eligible nodes from a cluster, +you must first exclude the affected nodes from the voting configuration using +the <>. If +removing fewer than half of the master-eligible nodes at once then this is not +required. This is also not required when removing master-ineligible nodes such +as data-only nodes or coordinating-only nodes. Finally, no special action is +required when adding nodes to the cluster, only when removing them. + +[float] +==== Discovery configuration is required in production + +Production deployments of Elasticsearch now require at least one of the +following settings to be specified in the `elasticsearch.yml` configuration +file: + +- `discovery.zen.ping.unicast.hosts` +- `discovery.zen.hosts_provider` +- `cluster.initial_master_nodes` From 43a6dcc5a7e5673ea3bf8458041ed16ef932c4c1 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 13:21:26 +0000 Subject: [PATCH 055/106] Reword --- .../migration/migrate_7_0/discovery.asciidoc | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/reference/migration/migrate_7_0/discovery.asciidoc b/docs/reference/migration/migrate_7_0/discovery.asciidoc index 3d59442e965bd..b9fdd79d4c037 100644 --- a/docs/reference/migration/migrate_7_0/discovery.asciidoc +++ b/docs/reference/migration/migrate_7_0/discovery.asciidoc @@ -9,8 +9,8 @@ The first time a cluster is started, `cluster.initial_master_nodes` must be set to perform cluster bootstrapping. It should contain the names of the master-eligible nodes in the initial cluster and be defined on every master-eligible node in the cluster. The -<> describes this -setting in more detail. +<> describes this setting in more detail. The `discovery.zen.minimum_master_nodes` setting is required during a rolling upgrade from 6.x, but can be removed in all other circumstances. @@ -20,10 +20,10 @@ upgrade from 6.x, but can be removed in all other circumstances. If you wish to remove half or more of the master-eligible nodes from a cluster, you must first exclude the affected nodes from the voting configuration using -the <>. If -removing fewer than half of the master-eligible nodes at once then this is not -required. 
This is also not required when removing master-ineligible nodes such -as data-only nodes or coordinating-only nodes. Finally, no special action is +the <>. +This is not required if removing fewer than half of the master-eligible nodes +at once. This is also not required when only removing master-ineligible nodes +such as data-only nodes or coordinating-only nodes. Finally, this is not required when adding nodes to the cluster, only when removing them. [float] From e6087e95dde43744ded7263aa609916d368faf4b Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 15:33:31 +0000 Subject: [PATCH 056/106] Split up discovery depending on master-eligibility --- docs/reference/modules/discovery.asciidoc | 25 ++++++++++++++++------- 1 file changed, 18 insertions(+), 7 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index a801ba9bc86cd..c7ffdb22c8f94 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -167,11 +167,21 @@ Discovery operates in two phases: First, each node probes the addresses of all known master-eligible nodes by connecting to each address and attempting to identify the node to which it is connected. Secondly it shares with the remote node a list of all of its known master-eligible peers and the remote node -responds with _its_ peers in turn. The node then probes all the new nodes about -which it just discovered, requests their peers, and so on, until it has -discovered an elected master node or enough other masterless master-eligible -nodes that it can perform an election. If neither of these occur quickly enough -then it tries again. This process is controlled by the following settings. +responds with _its_ peers in turn. The node then probes all the new nodes that +it just discovered, requests their peers, and so on. + +If the node is not master-eligible then it continues this discovery process +until it has discovered an elected master node. If no elected master is +discovered then the node will retry after `discovery.find_peers_interval` which +defaults to `1s`. + +If the node is master-eligible then it continues this discovery process until it +has either discovered an elected master node or else it has discovered enough +masterless master-eligible nodes to complete an election. If neither of these +occur quickly enough then the node will retry after +`discovery.find_peers_interval` which defaults to `1s`. + +The discovery process is controlled by the following settings. `discovery.find_peers_interval`:: @@ -199,8 +209,9 @@ then it tries again. This process is controlled by the following settings. that the cluster did not form. Defaults to `10s`. If a cluster has not formed after `discovery.cluster_formation_warning_timeout` -has elapsed then the node will log a warning message starting with `master not -discovered` which describes the current state of the discovery module. +has elapsed then the node will log a warning message that starts with the phrase +`master not discovered` which describes the current state of the discovery +process. 
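To illustrate the discovery timings above, the sketch below shortens the retry interval and
lengthens the cluster formation warning timeout. The values are illustrative only and the
defaults are usually sensible.

[source,yaml]
--------------------------------------------------
# Retry discovery more often than the default of 1s ...
discovery.find_peers_interval: 500ms
# ... and wait longer than the default of 10s before warning that no cluster
# has formed.
discovery.cluster_formation_warning_timeout: 30s
--------------------------------------------------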
[float] [[modules-discovery-bootstrap-cluster]] From 02b7ebd86e020e4caf1c7c3fc1abfe00c5802efd Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 11 Dec 2018 15:48:22 +0000 Subject: [PATCH 057/106] Use the leader/follower terminology less --- docs/reference/modules/discovery.asciidoc | 58 ++++++++++++----------- 1 file changed, 30 insertions(+), 28 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index c7ffdb22c8f94..83b187047ea0e 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -500,9 +500,9 @@ scheduling of elections. `cluster.election.initial_timeout`:: - Sets the upper bound on how long a node will wait initially, or after a - leader failure, before attempting its first election. This defaults to - `100ms`. + Sets the upper bound on how long a node will wait initially, or after the + elected master fails, before attempting its first election. This defaults + to `100ms`. `cluster.election.back_off_time`:: @@ -538,17 +538,20 @@ cluster. This join process can be configured with the following settings. [[fault-detection]] === Fault Detection -An elected master periodically checks each of its followers in order to ensure -that they are still connected and healthy, and in turn each follower -periodically checks the health of the elected master. Elasticsearch allows for -these checks occasionally to fail or timeout without taking any action, and will -only consider a node to be truly faulty after a number of consecutive checks -have failed. The following settings control the behaviour of fault detection. +An elected master periodically checks each of the nodes in the cluster in order +to ensure that they are still connected and healthy, and in turn each node in +the cluster periodically checks the health of the elected master. These checks +are known respectively as _follower checks_ and _leader checks_. + +Elasticsearch allows for these checks occasionally to fail or timeout without +taking any action, and will only consider a node to be truly faulty after a +number of consecutive checks have failed. The following settings control the +behaviour of fault detection. `cluster.fault_detection.follower_check.interval`:: - Sets how long the elected master waits between checks of its followers. - Defaults to `1s`. + Sets how long the elected master waits between follower checks to each + other node in the cluster. Defaults to `1s`. `cluster.fault_detection.follower_check.timeout`:: @@ -557,33 +560,32 @@ have failed. The following settings control the behaviour of fault detection. `cluster.fault_detection.follower_check.retry_count`:: - Sets how many consecutive follower check failures must occur before the - elected master considers a follower node to be faulty and removes it from - the cluster. Defaults to `3`. + Sets how many consecutive follower check failures must occur to each node + before the elected master considers that node to be faulty and removes it + from the cluster. Defaults to `3`. `cluster.fault_detection.leader_check.interval`:: - Sets how long each follower node waits between checks of its leader. + Sets how long each node waits between checks of the elected master. Defaults to `1s`. `cluster.fault_detection.leader_check.timeout`:: - Sets how long each follower node waits for a response to a leader check - before considering it to have failed. Defaults to `30s`. 
+ Sets how long each node waits for a response to a leader check from the + elected master before considering it to have failed. Defaults to `30s`. `cluster.fault_detection.leader_check.retry_count`:: - Sets how many consecutive leader check failures must occur before a follower - node considers the elected master to be faulty and attempts to find or elect - a new master. Defaults to `3`. - -If the elected master detects that a follower has disconnected then this is -treated as an immediate failure, bypassing the timeouts and retries listed -above, and the master attempts to remove the node from the cluster. Similarly, -if a follower detects that the elected master has disconnected then this is -treated as an immediate failure, bypassing the timeouts and retries listed -above, and the follower restarts its discovery phase to try and find or elect a -new master. + Sets how many consecutive leader check failures must occur before a node + considers the elected master to be faulty and attempts to find or elect a + new master. Defaults to `3`. + +If the elected master detects that a node has disconnected then this is treated +as an immediate failure, bypassing the timeouts and retries listed above, and +the master attempts to remove the node from the cluster. Similarly, if a node +detects that the elected master has disconnected then this is treated as an +immediate failure, bypassing the timeouts and retries listed above, and the +follower restarts its discovery phase to try and find or elect a new master. [float] [[modules-discovery-quorums]] From 7714003fc7aee217b037d001c9e14a1ecee2ca92 Mon Sep 17 00:00:00 2001 From: Yannick Welsch Date: Tue, 11 Dec 2018 22:40:45 +0100 Subject: [PATCH 058/106] fix link --- docs/plugins/discovery.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/plugins/discovery.asciidoc b/docs/plugins/discovery.asciidoc index fb77e60898ff9..17b223478eb51 100644 --- a/docs/plugins/discovery.asciidoc +++ b/docs/plugins/discovery.asciidoc @@ -2,7 +2,7 @@ == Discovery Plugins Discovery plugins extend Elasticsearch by adding new host providers that -can be used to extend the {ref}/modules-discovery-zen.html[cluster formation module]. +can be used to extend the {ref}/modules-discovery.html[cluster formation module]. [float] ==== Core discovery plugins From dddc3cf5ba5103906bb6f29020116c60cce7958f Mon Sep 17 00:00:00 2001 From: Yannick Welsch Date: Wed, 12 Dec 2018 09:59:26 +0100 Subject: [PATCH 059/106] smaller changes --- docs/reference/modules/discovery.asciidoc | 89 ++++++++++++++--------- 1 file changed, 53 insertions(+), 36 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 83b187047ea0e..e670a5496bd9b 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -24,8 +24,7 @@ It is separated into several sections, which are explained below: master-ineligible nodes only. However there are situations in which it may be desirable to add or remove some master-eligible nodes to or from a cluster. A section on <> describes how Elasticsearch supports dynamically adding and - removing master-eligible nodes and also describes the extra steps that need + removing nodes>> describes this process as well as the extra steps that need to be performed when removing more than half of the master-eligible nodes at the same time. 
* <> covers how a master @@ -33,8 +32,9 @@ It is separated into several sections, which are explained below: * The <> is put in place when there is no known elected master, and can be configured to determine which operations should be rejected when it is in place. -* <> and <> sections cover advanced settings - to influence the election and fault detection processes. +* <> and <> + sections cover advanced settings to influence the election and fault + detection processes. * <> explains the design behind the master election and auto-reconfiguration logic. @@ -48,11 +48,11 @@ Elasticsearch tries to connect to each seed node in its list, and holds a gossip-like conversation with them to find other nodes and to build a complete picture of the master-eligible nodes in the cluster. By default the cluster formation module offers two hosts providers to configure the list of seed -nodes: a _settings-based_ and a _file-based_ hosts provider, but can be -extended to support cloud environments and other forms of host providers via -{plugins}/discovery.html[discovery plugins]. Host providers are configured +nodes: a _settings_-based and a _file_-based hosts provider, but can be +extended to support cloud environments and other forms of hosts providers via +{plugins}/discovery.html[discovery plugins]. Hosts providers are configured using the `discovery.zen.hosts_provider` setting, which defaults to the -_settings-based_ hosts provider. Multiple hosts providers can be specified as a +_settings_-based hosts provider. Multiple hosts providers can be specified as a list. [float] @@ -62,7 +62,7 @@ list. The settings-based hosts provider use a node setting to configure a static list of hosts to use as seed nodes. These hosts can be specified as hostnames or IP addresses; hosts specified as hostnames are resolved to IP addresses during each -round of pinging. Note that if you are in an environment where DNS resolutions +round of discovery. Note that if you are in an environment where DNS resolutions vary with time, you might need to adjust your <>. @@ -73,8 +73,20 @@ the setting `transport.profiles.default.port` falling back to `transport.tcp.port` if not set). Note that IPv6 hosts must be bracketed. The default for this setting is `127.0.0.1, [::1]` +[source,yaml] +-------------------------------------------------- +discovery.zen.ping.unicast.hosts: + - 192.168.1.10:9300 + - 192.168.1.11 <1> + - seeds.mydomain.com <2> +-------------------------------------------------- +<1> The port will default to `transport.profiles.default.port` and fallback to + `transport.tcp.port` if not specified. +<2> A hostname that resolves to multiple IP addresses will try all resolved + addresses. + Additionally, the `discovery.zen.ping.unicast.hosts.resolve_timeout` configures -the amount of time to wait for DNS lookups on each round of pinging. This is +the amount of time to wait for DNS lookups on each round of discovery. This is specified as a <> and defaults to 5s. Unicast discovery uses the <> module to perform the @@ -223,17 +235,17 @@ master-eligible nodes in the cluster using this setting: `cluster.initial_master_nodes`:: - Sets a list of the node names or transport addresses of the initial set of - master-eligible nodes in a brand-new cluster. By default this list is - empty, meaning that this node expects to join a cluster that has already - been bootstrapped. + Sets a list of the <> or transport addresses of the + initial set of master-eligible nodes in a brand-new cluster. 
By default + this list is empty, meaning that this node expects to join a cluster that + has already been bootstrapped. This setting can be given on the command line when starting up each node, or added to the `elasticsearch.yml` configuration file. Once the cluster has formed this setting is no longer required and should be removed. -For a cluster with 3 master-eligible nodes (named `master-a`, `master-b` and -`master-c`) the configuration will look as follows: +For a cluster with 3 master-eligible nodes (with <> +`master-a`, `master-b` and `master-c`) the configuration will look as follows: [source,yaml] -------------------------------------------------- @@ -243,7 +255,8 @@ cluster.initial_master_nodes: - master-c -------------------------------------------------- -Alternatively the IP addresses or hostnames of the nodes can be used. If there +Alternatively the IP addresses or hostnames +(<>) can be used. If there is more than one Elasticsearch node with the same IP address or hostname then the transport ports must also be given @@ -256,8 +269,8 @@ cluster.initial_master_nodes: - master-node-hostname -------------------------------------------------- -It is also possible to set the initial set of master nodes on the command-line -used to start Elasticsearch: +Like all node settings, it is also possible to specify the initial set of +master nodes on the command-line that is used to start Elasticsearch: [source,bash] -------------------------------------------------- @@ -265,10 +278,10 @@ $ bin/elasticsearch -Ecluster.initial_master_nodes=master-a,master-b,master-c -------------------------------------------------- It is technically sufficient to set this on a single master-eligible node in -the cluster, and only to mention a single master-eligible node, but this does -not allow for this single node to fail before the cluster has fully formed. It -is therefore better to bootstrap using multiple master-eligible nodes. In any -case, when specifying the list of initial master nodes, **it is vitally +the cluster, and only to mention that single node in the setting, but this +provides no fault tolerance before the cluster has fully formed. It +is therefore better to bootstrap using at least three master-eligible nodes. +In any case, when specifying the list of initial master nodes, **it is vitally important** to configure each node with exactly the same list of nodes, to prevent two independent clusters from forming. Typically you will set this on the nodes that are mentioned in the list of initial master nodes. @@ -278,14 +291,15 @@ NOTE: In alpha releases, all listed master-eligible nodes are required to be relaxed in production-ready releases. WARNING: You must put exactly the same set of initial master nodes in each - configuration file in order to be sure that only a single cluster forms during - bootstrapping and therefore to avoid the risk of data loss. + configuration file (or leave the configuration empty) in order to be sure + that only a single cluster forms during bootstrapping and therefore to + avoid the risk of data loss. [float] ==== Choosing a cluster name The `cluster.name` allows you to create multiple clusters which are separated -from each other. Nodes verify that they agree on their cluster name when they +from each other. Nodes verify that they agree on their cluster name when they first connect to each other, and if two nodes have different cluster names then they will not communicate meaningfully and will not belong to the same cluster. 
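For example, every node in a given cluster might carry the same entry in its
`elasticsearch.yml`; the cluster name `logging-prod` below is purely
illustrative:

[source,yaml]
--------------------------------------------------
cluster.name: logging-prod
--------------------------------------------------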
The default value for the cluster name is `elasticsearch`, but it is @@ -336,11 +350,12 @@ cannot take any further actions. As long as there are at least three master-eligible nodes in the cluster, as a general rule it is best to remove nodes one-at-a-time, allowing enough time for -the auto-reconfiguration to take effect after each removal. +the cluster to <> the voting +configuration and adapt the fault tolerance level to the new set of nodes. -If there are only two master-eligible nodes then neither node can be safely -removed since both are required to reliably make progress, so you must first -inform Elasticsearch that one of the nodes should not be part of the voting +If there are only two master-eligible nodes remaining then neither node can be +safely removed since both are required to reliably make progress, so you must +first inform Elasticsearch that one of the nodes should not be part of the voting configuration, and that the voting power should instead be given to other nodes, allowing the excluded node to be taken offline without preventing the other node from making progress. A node which is added to a voting configuration exclusion @@ -349,8 +364,8 @@ voting configuration so its vote is no longer required, and will never automatically move such a node back into the voting configuration after it has been removed. Once a node has been successfully reconfigured out of the voting configuration, it is safe to shut it down without affecting the cluster's -availability. A node can be added to the voting configuration exclusion list -using the following API: +master-level availability. A node can be added to the voting configuration +exclusion list using the following API: [source,js] -------------------------------------------------- @@ -358,6 +373,7 @@ using the following API: # auto-reconfigure the node out of the voting configuration up to the default # timeout of 30 seconds POST /_cluster/voting_config_exclusions/node_name + # Add node to voting configuration exclusions list and wait for # auto-reconfiguration up to one minute POST /_cluster/voting_config_exclusions/node_name?timeout=1m @@ -367,12 +383,12 @@ POST /_cluster/voting_config_exclusions/node_name?timeout=1m The node that should be added to the exclusions list is specified using <> in place of `node_name` here. If a call to the voting configuration exclusions API fails then the call can safely be retried. -A successful response guarantees that the node has been removed from the voting -configuration and will not be reinstated. +Only a successful response guarantees that the node has actually been removed +from the voting configuration and will not be reinstated. Although the voting configuration exclusions API is most useful for down-scaling a two-node to a one-node cluster, it is also possible to use it to remove -multiple nodes from larger clusters all at the same time. Adding multiple nodes +multiple master-eligible nodes all at the same time. Adding multiple nodes to the exclusions list has the system try to auto-reconfigure all of these nodes out of the voting configuration, allowing them to be safely shut down while keeping the cluster available. In the example described above, shrinking a @@ -388,7 +404,7 @@ removing fewer than half of the master-eligible nodes. 
Adding an exclusion for a node creates an entry for that node in the voting configuration exclusions list, which has the system automatically try to reconfigure the voting configuration to remove that node and prevents it from -returning to the voting configuration once it has removed. The current set of +returning to the voting configuration once it has removed. The current list of exclusions is stored in the cluster state and can be inspected as follows: [source,js] @@ -421,6 +437,7 @@ created in error or were only required temporarily: # the cluster and then remove all the exclusions, allowing any node to return to # the voting configuration in the future. DELETE /_cluster/voting_config_exclusions + # Immediately remove all the voting configuration exclusions, allowing any node # to return to the voting configuration in the future. DELETE /_cluster/voting_config_exclusions?wait_for_removal=false From 1888c97fa6bd0b93a473cc5ebe8be6f6991ab6a4 Mon Sep 17 00:00:00 2001 From: David Turner Date: Wed, 12 Dec 2018 11:07:34 +0000 Subject: [PATCH 060/106] Rewrite publishing bit --- docs/reference/modules/discovery.asciidoc | 53 +++++++++++++---------- 1 file changed, 30 insertions(+), 23 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index e670a5496bd9b..e98ea46492d4a 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -450,33 +450,40 @@ DELETE /_cluster/voting_config_exclusions?wait_for_removal=false The master node is the only node in a cluster that can make changes to the cluster state. The master node processes one cluster state update at a time, -applies the required changes and publishes the updated cluster state to all the -other nodes in the cluster. Each node receives the publish message, acknowledges -it, but does *not* yet apply it. If the master does not receive acknowledgement -from enough master-eligible nodes within a certain time (controlled by the -`cluster.publish.timeout` setting which defaults to 30 seconds) the cluster -state change is rejected. - -Once enough nodes have responded, the cluster state is committed and a commit -message is sent to all the nodes. The nodes then proceed to apply the new -cluster state to their internal state. The master node waits for all nodes to -respond, or until `cluster.publish.timeout` has elapsed, before starting to -process the next update in the queue. The `cluster.publish.timeout` is measured -from the moment the publishing started. - -If a node fails to apply a cluster state update within the -`cluster.publish.timeout` timeout then its cluster state lags behind the most -recently-published state from the master. The master waits for a further -timeout, `cluster.follower_lag.timeout`, which defaults to 90 seconds, and if -the node has still not successfully applied the cluster state update then it is -removed from the cluster. +applying the required changes and publishing the updated cluster state to all +the other nodes in the cluster. Each publication starts with the master +broadcasting the updated cluster state to all nodes in the cluster, to which +each node responds with an acknowledgement but does not yet apply the +newly-received state. Once the master has collected acknowledgements from +enough master-eligible nodes the new cluster state is said to be _committed_, +and the master broadcasts another message instructing nodes to apply the +now-committed state. 
Each node receives this message, applies the updated +state, and then sends an acknowledgement back to the master. + +The master waits to receive this second acknowledgement from all nodes for some +time, defined by `cluster.publish.timeout`, which defaults to `30s`, measured +from the moment that publishing started. If this time is reached before the new +cluster state was committed then the master node failed to contact enough other +master-eligible nodes, so the cluster state change is rejected and the master +stands down and starts a new election process. + +If the `cluster.publish.timeout` time is reached after the new cluster state +was committed (but before all acknowledgements have been received) then the +master node starts processing and publishing the next cluster state update, +even though some nodes have not yet confirmed that they have applied the +current one. These nodes are said to be _lagging_ since their cluster states +are no longer up-to-date. The master waits for the lagging nodes to catch up +for a further time, `cluster.follower_lag.timeout`, which defaults to `90s`, +and if a node has still not successfully applied the cluster state update +within this time then it is removed from the cluster to prevent it from +disrupting the rest of the cluster. NOTE: Elasticsearch is a peer to peer based system, in which nodes communicate with one another directly. The high-throughput APIs (index, delete, search) do not normally interact with the master node. The responsibility of the master -node is to maintain the global cluster state, and act if nodes join or leave the -cluster by reassigning shards. Each time the cluster state is changed, the new -state is published to all nodes in the cluster as described above. +node is to maintain the global cluster state, and act if nodes join or leave +the cluster by reassigning shards. Each time the cluster state is changed, the +new state is published to all nodes in the cluster as described above. [float] [[no-master-block]] From b1e98bdde95aea78398b4e511167eb3230e3a45c Mon Sep 17 00:00:00 2001 From: David Turner Date: Wed, 12 Dec 2018 13:48:35 +0100 Subject: [PATCH 061/106] Skip attempts to destroy the test cluster --- docs/reference/modules/discovery.asciidoc | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index e98ea46492d4a..33dd9679740a2 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -379,6 +379,7 @@ POST /_cluster/voting_config_exclusions/node_name POST /_cluster/voting_config_exclusions/node_name?timeout=1m -------------------------------------------------- // CONSOLE +// TEST[skip:this would break the test cluster if executed] The node that should be added to the exclusions list is specified using <> in place of `node_name` here. 
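For instance, to exclude a node whose node name happens to be `node-1` (an
illustrative name, substituted for `node_name` above), the request might look
like this:

[source,js]
--------------------------------------------------
POST /_cluster/voting_config_exclusions/node-1?timeout=1m
--------------------------------------------------
// CONSOLE
// TEST[skip:this would break the test cluster if executed]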
If a call to the From a1c984316aa999b62c756510f5a3059871d5fb15 Mon Sep 17 00:00:00 2001 From: David Turner Date: Wed, 12 Dec 2018 13:31:21 +0000 Subject: [PATCH 062/106] Rewording --- docs/reference/modules/discovery.asciidoc | 40 +++++++++++------------ 1 file changed, 20 insertions(+), 20 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 33dd9679740a2..bf59a8c09d7a7 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -451,7 +451,7 @@ DELETE /_cluster/voting_config_exclusions?wait_for_removal=false The master node is the only node in a cluster that can make changes to the cluster state. The master node processes one cluster state update at a time, -applying the required changes and publishing the updated cluster state to all +computing the required changes and publishing the updated cluster state to all the other nodes in the cluster. Each publication starts with the master broadcasting the updated cluster state to all nodes in the cluster, to which each node responds with an acknowledgement but does not yet apply the @@ -459,25 +459,25 @@ newly-received state. Once the master has collected acknowledgements from enough master-eligible nodes the new cluster state is said to be _committed_, and the master broadcasts another message instructing nodes to apply the now-committed state. Each node receives this message, applies the updated -state, and then sends an acknowledgement back to the master. - -The master waits to receive this second acknowledgement from all nodes for some -time, defined by `cluster.publish.timeout`, which defaults to `30s`, measured -from the moment that publishing started. If this time is reached before the new -cluster state was committed then the master node failed to contact enough other -master-eligible nodes, so the cluster state change is rejected and the master -stands down and starts a new election process. - -If the `cluster.publish.timeout` time is reached after the new cluster state -was committed (but before all acknowledgements have been received) then the -master node starts processing and publishing the next cluster state update, -even though some nodes have not yet confirmed that they have applied the -current one. These nodes are said to be _lagging_ since their cluster states -are no longer up-to-date. The master waits for the lagging nodes to catch up -for a further time, `cluster.follower_lag.timeout`, which defaults to `90s`, -and if a node has still not successfully applied the cluster state update -within this time then it is removed from the cluster to prevent it from -disrupting the rest of the cluster. +state, and then sends a second acknowledgement back to the master. + +The master allows a limited amount of time for each cluster state update to be +completely published to all nodes, defined by `cluster.publish.timeout`, which +defaults to `30s`, measured from the time the publication started. If this time +is reached before the new cluster state is committed then the cluster state +change is rejected, the master considers itself to have failed, stands down, +and starts trying to elect a new master. 
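If necessary, `cluster.publish.timeout` can also be set explicitly in
`elasticsearch.yml`, although the default is appropriate for most clusters; the
value below is purely illustrative:

[source,yaml]
--------------------------------------------------
cluster.publish.timeout: 45s
--------------------------------------------------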
+ +However, if the new cluster state is committed before `cluster.publish.timeout` +has elapsed, but before all acknowledgements have been received, then the +master node considers the change to have succeeded and starts processing and +publishing the next cluster state update, even though some nodes have not yet +confirmed that they have applied the current one. These nodes are said to be +_lagging_ since their cluster states have fallen behind the master's latest +state. The master waits for the lagging nodes to catch up for a further time, +`cluster.follower_lag.timeout`, which defaults to `90s`, and if a node has +still not successfully applied the cluster state update within this time then +it is considered to have failed and is removed from the cluster. NOTE: Elasticsearch is a peer to peer based system, in which nodes communicate with one another directly. The high-throughput APIs (index, delete, search) do From 4b40b34ceaba31b432f819dcb201568a66a72401 Mon Sep 17 00:00:00 2001 From: David Turner Date: Thu, 13 Dec 2018 10:28:02 +0000 Subject: [PATCH 063/106] Weaken recommendation for removing bootstrap setting --- docs/reference/modules/discovery.asciidoc | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index bf59a8c09d7a7..4f2fb8d2c4d88 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -240,9 +240,11 @@ master-eligible nodes in the cluster using this setting: this list is empty, meaning that this node expects to join a cluster that has already been bootstrapped. -This setting can be given on the command line when starting up each node, or -added to the `elasticsearch.yml` configuration file. Once the cluster has -formed this setting is no longer required and should be removed. +This setting can be given on the command line when starting up each +master-eligible node, or added to the `elasticsearch.yml` configuration file on +those nodes. Once the cluster has formed this setting is no longer required and +may be removed. It should be omitted on master-ineligible nodes, and on +master-eligible nodes that are started to join an existing cluster. For a cluster with 3 master-eligible nodes (with <> `master-a`, `master-b` and `master-c`) the configuration will look as follows: From 041494cba57d27d1da751f285ac6a04302aa838a Mon Sep 17 00:00:00 2001 From: David Turner Date: Mon, 17 Dec 2018 09:55:39 +0000 Subject: [PATCH 064/106] Rework discovery settings --- .../discovery-settings.asciidoc | 56 ++++++++++++------- 1 file changed, 35 insertions(+), 21 deletions(-) diff --git a/docs/reference/setup/important-settings/discovery-settings.asciidoc b/docs/reference/setup/important-settings/discovery-settings.asciidoc index 94d95f866ba4e..ad0deea96e4cd 100644 --- a/docs/reference/setup/important-settings/discovery-settings.asciidoc +++ b/docs/reference/setup/important-settings/discovery-settings.asciidoc @@ -1,9 +1,9 @@ [[discovery-settings]] -=== Discovery settings +=== Discovery and cluster formation settings -Elasticsearch uses a custom discovery implementation called "Zen Discovery" for -node-to-node clustering and master election. There are two important discovery -settings that should be configured before going to production. +There are two important discovery and cluster formation settings that should be +configured before going to production so that nodes in the cluster can discover +each other and elect a master node. 
[float] [[unicast.hosts]] @@ -14,9 +14,27 @@ the available loopback addresses and will scan ports 9300 to 9305 to try to connect to other nodes running on the same server. This provides an auto- clustering experience without having to do any configuration. -When the moment comes to form a cluster with nodes on other servers, you have to -provide a seed list of other nodes in the cluster that are likely to be live and -contactable. This can be specified as follows: +When the moment comes to form a cluster with nodes on other servers, you have +to provide a seed list of other nodes in the cluster that are master-eligible +and likely to be live and contactable, using the +`discovery.zen.ping.unicast.hosts` setting. This setting should normally +contain the addresses of all the master-eligible nodes in the cluster. + +[float] +[[initial_master_nodes]] +==== `cluster.initial_master_nodes` + +Starting a brand-new Elasticsearch cluster for the very first time requires a +<> step to determine +the set of master-eligible nodes whose votes should be counted in the very +first election. In <>, with no discovery +settings configured, this step is automatically performed by the nodes +themselves. As this auto-bootstrapping is +<>, starting a brand-new cluster +in <> requires an explicit list of the names +or IP addresses of the master-eligible nodes whose votes should be counted in +the very first election. This list is set using the +`cluster.initial_master_nodes` setting. [source,yaml] -------------------------------------------------- @@ -24,21 +42,17 @@ discovery.zen.ping.unicast.hosts: - 192.168.1.10:9300 - 192.168.1.11 <1> - seeds.mydomain.com <2> +cluster.initial_master_nodes: + - master-node-a <3> + - 192.168.1.12 <4> + - 192.168.1.13:9301 <5> -------------------------------------------------- <1> The port will default to `transport.profiles.default.port` and fallback to `transport.tcp.port` if not specified. -<2> A hostname that resolves to multiple IP addresses will try all resolved - addresses. - -[float] -[[initial_master_nodes]] -==== `cluster.initial_master_nodes` +<2> If a hostname resolves to multiple IP addresses then the node will attempt to + discover other nodes at all resolved addresses. +<3> Initial master nodes can be identified by their <>. +<4> Initial master nodes can also be identified by their IP address. +<5> If multiple master nodes share an IP address then the port must be used to + disambiguate them. -Starting an Elasticsearch cluster for the very first time requires a -<> step. -In <>, -with no discovery settings configured, this step is automatically -performed by the nodes themselves. As this auto-bootstrapping is -<>, running a node in -<> requires an explicit cluster -bootstrapping step. From f438a28bd0fa2a36df7b7ecc50fe99f7a8a93328 Mon Sep 17 00:00:00 2001 From: David Turner Date: Mon, 17 Dec 2018 09:57:30 +0000 Subject: [PATCH 065/106] Add link to discovery settings docs --- docs/reference/migration/migrate_7_0/discovery.asciidoc | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/reference/migration/migrate_7_0/discovery.asciidoc b/docs/reference/migration/migrate_7_0/discovery.asciidoc index b9fdd79d4c037..870d69f30eb14 100644 --- a/docs/reference/migration/migrate_7_0/discovery.asciidoc +++ b/docs/reference/migration/migrate_7_0/discovery.asciidoc @@ -8,7 +8,8 @@ The first time a cluster is started, `cluster.initial_master_nodes` must be set to perform cluster bootstrapping. 
It should contain the names of the master-eligible nodes in the initial cluster and be defined on every -master-eligible node in the cluster. The +master-eligible node in the cluster. See <> for an example, and the <> describes this setting in more detail. From 4180e0093a12f4900cf55a04e50aa5f6f58f2d4d Mon Sep 17 00:00:00 2001 From: David Turner Date: Mon, 17 Dec 2018 10:01:45 +0000 Subject: [PATCH 066/106] Emphasize again that this is only for new clusters --- docs/reference/modules/discovery.asciidoc | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 4f2fb8d2c4d88..73be65c256145 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -231,7 +231,11 @@ process. Starting an Elasticsearch cluster for the very first time requires the initial set of master-eligible nodes to be explicitly set on one or more of the -master-eligible nodes in the cluster using this setting: +master-eligible nodes in the cluster. This is only required the very first time +the cluster starts up: nodes that have already joined a cluster will store this +information in their data folder, and freshly-started nodes that are intended +to join an existing cluster will obtain this information from the cluster's +elected master. This information is given using this setting: `cluster.initial_master_nodes`:: From 76ec76ceb1ae4d54985e420677e94231cd78bd4f Mon Sep 17 00:00:00 2001 From: David Turner Date: Mon, 17 Dec 2018 10:01:55 +0000 Subject: [PATCH 067/106] Reformat --- docs/reference/modules/discovery.asciidoc | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 73be65c256145..0b06c8d71a64a 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -261,10 +261,10 @@ cluster.initial_master_nodes: - master-c -------------------------------------------------- -Alternatively the IP addresses or hostnames -(<>) can be used. If there -is more than one Elasticsearch node with the same IP address or hostname then -the transport ports must also be given +Alternatively the IP addresses or hostnames (<>) can be used. If there is more than one Elasticsearch node +with the same IP address or hostname then the transport ports must also be +given to specify exactly which node is meant: [source,yaml] -------------------------------------------------- From 8d1b118404c898fda00ed71f2b57e50ae7f576f0 Mon Sep 17 00:00:00 2001 From: David Turner Date: Mon, 17 Dec 2018 10:03:24 +0000 Subject: [PATCH 068/106] Define 'cluster bootstrapping' --- docs/reference/modules/discovery.asciidoc | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 0b06c8d71a64a..080e673ee1e27 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -231,11 +231,12 @@ process. Starting an Elasticsearch cluster for the very first time requires the initial set of master-eligible nodes to be explicitly set on one or more of the -master-eligible nodes in the cluster. 
This is only required the very first time -the cluster starts up: nodes that have already joined a cluster will store this -information in their data folder, and freshly-started nodes that are intended -to join an existing cluster will obtain this information from the cluster's -elected master. This information is given using this setting: +master-eligible nodes in the cluster. This is known as _cluster bootstrapping_. +This is only required the very first time the cluster starts up: nodes that +have already joined a cluster will store this information in their data folder, +and freshly-started nodes that are intended to join an existing cluster will +obtain this information from the cluster's elected master. This information is +given using this setting: `cluster.initial_master_nodes`:: From 2df3878b7746bd48888219560dd3b0e950a8b477 Mon Sep 17 00:00:00 2001 From: David Turner Date: Mon, 17 Dec 2018 10:08:39 +0000 Subject: [PATCH 069/106] Weaken recommendation further, with more qualification --- docs/reference/modules/discovery.asciidoc | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 080e673ee1e27..798edd43b1b6c 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -248,8 +248,12 @@ given using this setting: This setting can be given on the command line when starting up each master-eligible node, or added to the `elasticsearch.yml` configuration file on those nodes. Once the cluster has formed this setting is no longer required and -may be removed. It should be omitted on master-ineligible nodes, and on -master-eligible nodes that are started to join an existing cluster. +is ignored. It need not be set on master-ineligible nodes, nor on +master-eligible nodes that are started to join an existing cluster. Note that +master-eligible nodes should use storage that persists across restarts. If they +do not, and `cluster.initial_master_nodes` is set, and a full cluster restart +occurs, then another brand-new cluster will form and this may result in data +loss. For a cluster with 3 master-eligible nodes (with <> `master-a`, `master-b` and `master-c`) the configuration will look as follows: From 17be8bb2494cbbee7c52cdb5f120d2348b45a640 Mon Sep 17 00:00:00 2001 From: David Turner Date: Mon, 17 Dec 2018 10:11:45 +0000 Subject: [PATCH 070/106] Clarify that auto-bootstrapping will only find local nodes --- docs/reference/modules/discovery.asciidoc | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 798edd43b1b6c..30f0541ee2a2e 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -320,10 +320,13 @@ recommended to change this to reflect the logical name of the cluster. ==== Auto-bootstrapping in development mode If the cluster is running with a completely default configuration then it will -automatically bootstrap based on the nodes that could be discovered within a -short time after startup. Since nodes may not always reliably discover each -other quickly enough this automatic bootstrapping is not always reliable and -cannot be used in production deployments. +automatically bootstrap a cluster based on the nodes that could be discovered +to be running on the same host within a short time after startup. 
This means +that by default it is possible to start up several nodes on a single machine +and have them automatically form a cluster which is very useful for development +environments and experimentation. However, since nodes may not always +successfully discover each other quickly enough this automatic bootstrapping +cannot be relied upon and cannot be used in production deployments. If any of the following settings are configured then auto-bootstrapping will not take place, and you must configure `cluster.initial_master_nodes` as From 7ca6cc8ce3d461475b34f8c95235011f346b8c5c Mon Sep 17 00:00:00 2001 From: David Turner Date: Mon, 17 Dec 2018 10:12:16 +0000 Subject: [PATCH 071/106] +automatically --- docs/reference/modules/discovery.asciidoc | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 30f0541ee2a2e..333f65028367d 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -342,9 +342,9 @@ bootstrapping>>: === Adding and removing nodes As nodes are added or removed Elasticsearch maintains an optimal level of fault -tolerance by updating the cluster's _voting configuration_, which is the set of -master-eligible nodes whose responses are counted when making decisions such as -electing a new master or committing a new cluster state. +tolerance by automatically updating the cluster's _voting configuration_, which +is the set of master-eligible nodes whose responses are counted when making +decisions such as electing a new master or committing a new cluster state. It is recommended to have a small and fixed number of master-eligible nodes in a cluster, and to scale the cluster up and down by adding and removing From 2fdb92f7eb997d9255b1bf821395c5b6a1f5acc2 Mon Sep 17 00:00:00 2001 From: David Turner Date: Mon, 17 Dec 2018 10:14:09 +0000 Subject: [PATCH 072/106] Shorter sentences --- docs/reference/modules/discovery.asciidoc | 23 ++++++++++++----------- 1 file changed, 12 insertions(+), 11 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 333f65028367d..5c0435eae29ac 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -369,17 +369,18 @@ configuration and adapt the fault tolerance level to the new set of nodes. If there are only two master-eligible nodes remaining then neither node can be safely removed since both are required to reliably make progress, so you must -first inform Elasticsearch that one of the nodes should not be part of the voting -configuration, and that the voting power should instead be given to other nodes, -allowing the excluded node to be taken offline without preventing the other node -from making progress. A node which is added to a voting configuration exclusion -list still works normally, but Elasticsearch will try and remove it from the -voting configuration so its vote is no longer required, and will never -automatically move such a node back into the voting configuration after it has -been removed. Once a node has been successfully reconfigured out of the voting -configuration, it is safe to shut it down without affecting the cluster's -master-level availability. 
A node can be added to the voting configuration -exclusion list using the following API: +first inform Elasticsearch that one of the nodes should not be part of the +voting configuration, and that the voting power should instead be given to +other nodes, allowing the excluded node to be taken offline without preventing +the other node from making progress. A node which is added to a voting +configuration exclusion list still works normally, but Elasticsearch will try +and remove it from the voting configuration so its vote is no longer required. +Importantly, Elasticsearch will never automatically move a node on the voting +exclusions list back into the voting configuration. Once an excluded node has +been successfully auto-reconfigured out of the voting configuration, it is safe +to shut it down without affecting the cluster's master-level availability. A +node can be added to the voting configuration exclusion list using the +following API: [source,js] -------------------------------------------------- From c4fd3352e0983d5302b96c91dd38da05ff8b9120 Mon Sep 17 00:00:00 2001 From: David Turner Date: Mon, 17 Dec 2018 10:14:46 +0000 Subject: [PATCH 073/106] Add 'batch of' --- docs/reference/modules/discovery.asciidoc | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 5c0435eae29ac..aeb39ca5e7eed 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -465,9 +465,9 @@ DELETE /_cluster/voting_config_exclusions?wait_for_removal=false === Cluster state publishing The master node is the only node in a cluster that can make changes to the -cluster state. The master node processes one cluster state update at a time, -computing the required changes and publishing the updated cluster state to all -the other nodes in the cluster. Each publication starts with the master +cluster state. The master node processes one batch of cluster state updates at +a time, computing the required changes and publishing the updated cluster state +to all the other nodes in the cluster. Each publication starts with the master broadcasting the updated cluster state to all nodes in the cluster, to which each node responds with an acknowledgement but does not yet apply the newly-received state. Once the master has collected acknowledgements from From 1d69b0a268d585c5fd280157458de44fbabd91bc Mon Sep 17 00:00:00 2001 From: David Turner Date: Mon, 17 Dec 2018 10:19:06 +0000 Subject: [PATCH 074/106] Link up bootstrapping/setting initial quorum sections a bit --- docs/reference/modules/discovery.asciidoc | 32 ++++++++++++----------- 1 file changed, 17 insertions(+), 15 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index aeb39ca5e7eed..68be52db04926 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -674,16 +674,18 @@ to complete before removing more nodes from the cluster. ==== Setting the initial quorum When a brand-new cluster starts up for the first time, one of the tasks it must -perform is to elect its first master node, for which it needs to know the set of -master-eligible nodes whose votes should count in this first election. This -initial voting configuration is known as the _bootstrap configuration_. 
+perform is to elect its first master node, for which it needs to know the set +of master-eligible nodes whose votes should count in this first election. This +initial voting configuration is known as the _bootstrap configuration_ and is +set in the <>. It is important that the bootstrap configuration identifies exactly which nodes should vote in the first election, and it is not sufficient to configure each -node with an expectation of how many nodes there should be in the cluster. It is -also important to note that the bootstrap configuration must come from outside -the cluster: there is no safe way for the cluster to determine the bootstrap -configuration correctly on its own. +node with an expectation of how many nodes there should be in the cluster. It +is also important to note that the bootstrap configuration must come from +outside the cluster: there is no safe way for the cluster to determine the +bootstrap configuration correctly on its own. If the bootstrap configuration is not set correctly then there is a risk when starting up a brand-new cluster is that you accidentally form two separate @@ -700,14 +702,14 @@ that four nodes were erroneously started instead of three: in this case there are enough nodes to form two separate clusters. Of course if each node is started manually then it's unlikely that too many nodes are started, but it's certainly possible to get into this situation if using a more automated -orchestrator, particularly if the orchestrator is not resilient to failures such -as network partitions. - -The <> is -only required the very first time a whole cluster starts up: new nodes joining -an established cluster can safely obtain all the information they need from the -elected master, and nodes that have previously been part of a cluster will have -stored to disk all the information required when restarting. +orchestrator, particularly if the orchestrator is not resilient to failures +such as network partitions. + +The initial quorum is only required the very first time a whole cluster starts +up: new nodes joining an established cluster can safely obtain all the +information they need from the elected master, and nodes that have previously +been part of a cluster will have stored to disk all the information required +when restarting. [float] ==== Cluster maintenance, rolling restarts and migrations From 9003c047e7ce73112c83662e52842a00db4f17fb Mon Sep 17 00:00:00 2001 From: David Turner Date: Mon, 17 Dec 2018 10:24:41 +0000 Subject: [PATCH 075/106] Remove note on migration and TODO --- docs/reference/modules/discovery.asciidoc | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 68be52db04926..25bf4a3018d17 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -723,16 +723,6 @@ and then started again then it will automatically recover, such as during a action with the APIs described here in these cases, because the set of master nodes is not changing permanently. -It is also possible to perform a migration of a cluster onto entirely new nodes -without taking the cluster offline, via a _rolling migration_. A rolling -migration is similar to a rolling restart, in that it is performed one node at a -time, and also requires no special handling for the master-eligible nodes as -long as there are at least two of them available at all times. 
- -TODO the above is only true if the maintenance happens slowly enough, otherwise -the configuration might not catch up. Need to add this to the rolling restart -docs. - [float] ==== Auto-reconfiguration From 771cf61ce69c10d2a1c1c583128a6c2a54373809 Mon Sep 17 00:00:00 2001 From: David Turner Date: Mon, 17 Dec 2018 10:27:07 +0000 Subject: [PATCH 076/106] Fix ref to voting exclusions --- docs/reference/modules/discovery.asciidoc | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 25bf4a3018d17..5eefcc46957ac 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -777,8 +777,9 @@ healthy. There are situations in which Elasticsearch might tolerate the loss of multiple nodes, but this is not guaranteed under all sequences of failures. If this setting is set to `false` then departed nodes must be removed from the voting -configuration manually, using the vote withdrawal API described below, to -achieve the desired level of resilience. +configuration manually, using the +<>, to achieve +the desired level of resilience. Note that Elasticsearch will not suffer from a "split-brain" inconsistency however it is configured. This setting only affects its availability in the From 14de23ca5edcbd7bb69e106c81a1de2abe21f0d1 Mon Sep 17 00:00:00 2001 From: Lisa Cawley Date: Tue, 18 Dec 2018 11:32:49 +0000 Subject: [PATCH 077/106] Apply suggestions from code review Better wording in breaking changes docs Co-Authored-By: DaveCTurner --- docs/reference/migration/migrate_7_0/discovery.asciidoc | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/reference/migration/migrate_7_0/discovery.asciidoc b/docs/reference/migration/migrate_7_0/discovery.asciidoc index 870d69f30eb14..c532cff16ea5e 100644 --- a/docs/reference/migration/migrate_7_0/discovery.asciidoc +++ b/docs/reference/migration/migrate_7_0/discovery.asciidoc @@ -22,10 +22,10 @@ upgrade from 6.x, but can be removed in all other circumstances. If you wish to remove half or more of the master-eligible nodes from a cluster, you must first exclude the affected nodes from the voting configuration using the <>. -This is not required if removing fewer than half of the master-eligible nodes -at once. This is also not required when only removing master-ineligible nodes -such as data-only nodes or coordinating-only nodes. Finally, this is not -required when adding nodes to the cluster, only when removing them. +If you remove fewer than half of the master-eligible nodes at the same time, voting exclusions are not required. +If you remove only master-ineligible nodes +such as data-only nodes or coordinating-only nodes, voting exclusions are not required. Likewise, if you +add nodes to the cluster, voting exclusions are not required. [float] ==== Discovery configuration is required in production From 58c2a528803f25af7286997cc0f626789b340a55 Mon Sep 17 00:00:00 2001 From: Lisa Cawley Date: Tue, 18 Dec 2018 11:34:19 +0000 Subject: [PATCH 078/106] Apply suggestions from code review Updates to discovery settings docs. 
Co-Authored-By: DaveCTurner --- .../discovery-settings.asciidoc | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/reference/setup/important-settings/discovery-settings.asciidoc b/docs/reference/setup/important-settings/discovery-settings.asciidoc index ad0deea96e4cd..0825cde2f627f 100644 --- a/docs/reference/setup/important-settings/discovery-settings.asciidoc +++ b/docs/reference/setup/important-settings/discovery-settings.asciidoc @@ -14,24 +14,24 @@ the available loopback addresses and will scan ports 9300 to 9305 to try to connect to other nodes running on the same server. This provides an auto- clustering experience without having to do any configuration. -When the moment comes to form a cluster with nodes on other servers, you have -to provide a seed list of other nodes in the cluster that are master-eligible -and likely to be live and contactable, using the -`discovery.zen.ping.unicast.hosts` setting. This setting should normally +When the moment comes to form a cluster with nodes on other servers, you must +use the `discovery.zen.ping.unicast.hosts` setting to provide a seed list of other nodes in the cluster that are master-eligible +and likely to be live and contactable. +This setting should normally contain the addresses of all the master-eligible nodes in the cluster. [float] [[initial_master_nodes]] ==== `cluster.initial_master_nodes` -Starting a brand-new Elasticsearch cluster for the very first time requires a -<> step to determine -the set of master-eligible nodes whose votes should be counted in the very +When you start a brand new Elasticsearch cluster for the very first time, there is a +<> step, which determines +the set of master-eligible nodes whose votes are counted in the very first election. In <>, with no discovery settings configured, this step is automatically performed by the nodes themselves. As this auto-bootstrapping is <>, starting a brand-new cluster -in <> requires an explicit list of the names +in <>, you must explicitly list the names or IP addresses of the master-eligible nodes whose votes should be counted in the very first election. This list is set using the `cluster.initial_master_nodes` setting. From 98f1485d54dbfa431937179f768a1a80213c7a7a Mon Sep 17 00:00:00 2001 From: Lisa Cawley Date: Tue, 18 Dec 2018 11:35:00 +0000 Subject: [PATCH 079/106] FIXUP missed suggestion Co-Authored-By: DaveCTurner --- .../setup/important-settings/discovery-settings.asciidoc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/reference/setup/important-settings/discovery-settings.asciidoc b/docs/reference/setup/important-settings/discovery-settings.asciidoc index 0825cde2f627f..c1480d1c7d742 100644 --- a/docs/reference/setup/important-settings/discovery-settings.asciidoc +++ b/docs/reference/setup/important-settings/discovery-settings.asciidoc @@ -30,7 +30,7 @@ the set of master-eligible nodes whose votes are counted in the very first election. In <>, with no discovery settings configured, this step is automatically performed by the nodes themselves. As this auto-bootstrapping is -<>, starting a brand-new cluster +<>, when you start a brand new cluster in <>, you must explicitly list the names or IP addresses of the master-eligible nodes whose votes should be counted in the very first election. 
This list is set using the From ebe1a1f83037b836227ff40ff06514ec9bc7fff6 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 18 Dec 2018 11:36:37 +0000 Subject: [PATCH 080/106] Reformat --- .../migration/migrate_7_0/discovery.asciidoc | 9 ++++--- .../discovery-settings.asciidoc | 25 +++++++++---------- 2 files changed, 17 insertions(+), 17 deletions(-) diff --git a/docs/reference/migration/migrate_7_0/discovery.asciidoc b/docs/reference/migration/migrate_7_0/discovery.asciidoc index c532cff16ea5e..d568e7fe32c25 100644 --- a/docs/reference/migration/migrate_7_0/discovery.asciidoc +++ b/docs/reference/migration/migrate_7_0/discovery.asciidoc @@ -22,10 +22,11 @@ upgrade from 6.x, but can be removed in all other circumstances. If you wish to remove half or more of the master-eligible nodes from a cluster, you must first exclude the affected nodes from the voting configuration using the <>. -If you remove fewer than half of the master-eligible nodes at the same time, voting exclusions are not required. -If you remove only master-ineligible nodes -such as data-only nodes or coordinating-only nodes, voting exclusions are not required. Likewise, if you -add nodes to the cluster, voting exclusions are not required. +If you remove fewer than half of the master-eligible nodes at the same time, +voting exclusions are not required. If you remove only master-ineligible nodes +such as data-only nodes or coordinating-only nodes, voting exclusions are not +required. Likewise, if you add nodes to the cluster, voting exclusions are not +required. [float] ==== Discovery configuration is required in production diff --git a/docs/reference/setup/important-settings/discovery-settings.asciidoc b/docs/reference/setup/important-settings/discovery-settings.asciidoc index c1480d1c7d742..9d65f6d67ddd2 100644 --- a/docs/reference/setup/important-settings/discovery-settings.asciidoc +++ b/docs/reference/setup/important-settings/discovery-settings.asciidoc @@ -15,26 +15,25 @@ connect to other nodes running on the same server. This provides an auto- clustering experience without having to do any configuration. When the moment comes to form a cluster with nodes on other servers, you must -use the `discovery.zen.ping.unicast.hosts` setting to provide a seed list of other nodes in the cluster that are master-eligible -and likely to be live and contactable. -This setting should normally -contain the addresses of all the master-eligible nodes in the cluster. +use the `discovery.zen.ping.unicast.hosts` setting to provide a seed list of +other nodes in the cluster that are master-eligible and likely to be live and +contactable. This setting should normally contain the addresses of all the +master-eligible nodes in the cluster. [float] [[initial_master_nodes]] ==== `cluster.initial_master_nodes` -When you start a brand new Elasticsearch cluster for the very first time, there is a -<> step, which determines -the set of master-eligible nodes whose votes are counted in the very +When you start a brand new Elasticsearch cluster for the very first time, there +is a <> step, which +determines the set of master-eligible nodes whose votes are counted in the very first election. In <>, with no discovery settings configured, this step is automatically performed by the nodes -themselves. As this auto-bootstrapping is -<>, when you start a brand new cluster -in <>, you must explicitly list the names -or IP addresses of the master-eligible nodes whose votes should be counted in -the very first election. 
This list is set using the -`cluster.initial_master_nodes` setting. +themselves. As this auto-bootstrapping is <>, when you start a brand new cluster in <>, you must explicitly list the names or IP addresses of the +master-eligible nodes whose votes should be counted in the very first election. +This list is set using the `cluster.initial_master_nodes` setting. [source,yaml] -------------------------------------------------- From 2cac91f6848a562d6aa78b0832f09f4a2b755b64 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 18 Dec 2018 11:38:20 +0000 Subject: [PATCH 081/106] local ports --- .../setup/important-settings/discovery-settings.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/reference/setup/important-settings/discovery-settings.asciidoc b/docs/reference/setup/important-settings/discovery-settings.asciidoc index 9d65f6d67ddd2..978480eb3086c 100644 --- a/docs/reference/setup/important-settings/discovery-settings.asciidoc +++ b/docs/reference/setup/important-settings/discovery-settings.asciidoc @@ -10,8 +10,8 @@ each other and elect a master node. ==== `discovery.zen.ping.unicast.hosts` Out of the box, without any network configuration, Elasticsearch will bind to -the available loopback addresses and will scan ports 9300 to 9305 to try to -connect to other nodes running on the same server. This provides an auto- +the available loopback addresses and will scan local ports 9300 to 9305 to try +to connect to other nodes running on the same server. This provides an auto- clustering experience without having to do any configuration. When the moment comes to form a cluster with nodes on other servers, you must From b4dd8741d0d4b055fb787036a4019e1fbeebdee9 Mon Sep 17 00:00:00 2001 From: Lisa Cawley Date: Tue, 18 Dec 2018 11:40:46 +0000 Subject: [PATCH 082/106] Apply suggestions from code review Co-Authored-By: DaveCTurner --- docs/reference/modules/discovery.asciidoc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 5eefcc46957ac..ca1b0496f3660 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -3,7 +3,7 @@ The discovery and cluster formation module is responsible for discovering nodes, electing a master, forming a cluster, and publishing the cluster state -each time it changes. It is integrated with other modules, for example, all +each time it changes. It is integrated with other modules. For example, all communication between nodes is done using the <> module. @@ -48,7 +48,7 @@ Elasticsearch tries to connect to each seed node in its list, and holds a gossip-like conversation with them to find other nodes and to build a complete picture of the master-eligible nodes in the cluster. By default the cluster formation module offers two hosts providers to configure the list of seed -nodes: a _settings_-based and a _file_-based hosts provider, but can be +nodes: a _settings_-based and a _file_-based hosts provider. It can be extended to support cloud environments and other forms of hosts providers via {plugins}/discovery.html[discovery plugins]. 
Hosts providers are configured using the `discovery.zen.hosts_provider` setting, which defaults to the From 9d787f9bd7378b4199ab7936478a13096e065bf5 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 18 Dec 2018 11:45:14 +0000 Subject: [PATCH 083/106] Reword 'at startup' --- docs/reference/modules/discovery.asciidoc | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index ca1b0496f3660..8b095683ac748 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -43,13 +43,14 @@ It is separated into several sections, which are explained below: === Discovery The cluster formation module uses a list of _seed_ nodes in order to start off -the discovery process. At startup, or when disconnected from a master, -Elasticsearch tries to connect to each seed node in its list, and holds a -gossip-like conversation with them to find other nodes and to build a complete -picture of the master-eligible nodes in the cluster. By default the cluster -formation module offers two hosts providers to configure the list of seed -nodes: a _settings_-based and a _file_-based hosts provider. It can be -extended to support cloud environments and other forms of hosts providers via +the discovery process. When you start an Elasticsearch node, or when a node +believes the master node to have failed, that node tries to connect to each +seed node in its list, and once connected the two nodes will repeatedly share +information about the other known master-eligible nodes in the cluster in order +to build a complete picture of the cluster. By default the cluster formation +module offers two hosts providers to configure the list of seed nodes: a +_settings_-based and a _file_-based hosts provider. It can be extended to +support cloud environments and other forms of hosts providers via {plugins}/discovery.html[discovery plugins]. Hosts providers are configured using the `discovery.zen.hosts_provider` setting, which defaults to the _settings_-based hosts provider. Multiple hosts providers can be specified as a From 400b2e46d68832d6944d4d17fdfad15544cf532f Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 18 Dec 2018 12:11:48 +0000 Subject: [PATCH 084/106] Add redirects --- docs/reference/redirects.asciidoc | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) diff --git a/docs/reference/redirects.asciidoc b/docs/reference/redirects.asciidoc index f07d1d09747e7..fe2954b015a02 100644 --- a/docs/reference/redirects.asciidoc +++ b/docs/reference/redirects.asciidoc @@ -560,3 +560,19 @@ See <>. The standard token filter has been removed. +[role="exclude",id="modules-discovery-azure-classic"] + +See <>. + +[role="exclude",id="modules-discovery-ec2"] + +See <>. + +[role="exclude",id="modules-discovery-gce"] + +See <>. + +[role="exclude",id="modules-discovery-zen"] + +Zen discovery is replaced by the <>. 
From 802a413cb80b6f4ba2346441189928a63e2fc96d Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 18 Dec 2018 12:28:21 +0000 Subject: [PATCH 085/106] Split up monolith --- docs/reference/modules/discovery.asciidoc | 778 +----------------- .../discovery/adding-removing-nodes.asciidoc | 121 +++ .../modules/discovery/bootstrapping.asciidoc | 110 +++ .../modules/discovery/discovery.asciidoc | 187 +++++ .../discovery/fault-detection.asciidoc | 52 ++ .../discovery/master-election.asciidoc | 48 ++ .../discovery/no-master-block.asciidoc | 22 + .../modules/discovery/publishing.asciidoc | 39 + .../modules/discovery/quorums.asciidoc | 181 ++++ 9 files changed, 774 insertions(+), 764 deletions(-) create mode 100644 docs/reference/modules/discovery/adding-removing-nodes.asciidoc create mode 100644 docs/reference/modules/discovery/bootstrapping.asciidoc create mode 100644 docs/reference/modules/discovery/discovery.asciidoc create mode 100644 docs/reference/modules/discovery/fault-detection.asciidoc create mode 100644 docs/reference/modules/discovery/master-election.asciidoc create mode 100644 docs/reference/modules/discovery/no-master-block.asciidoc create mode 100644 docs/reference/modules/discovery/publishing.asciidoc create mode 100644 docs/reference/modules/discovery/quorums.asciidoc diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 8b095683ac748..0acceb04595ef 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -12,6 +12,7 @@ It is separated into several sections, which are explained below: * <> is the process where nodes find each other when the master is unknown, such as when a node has just started up or when the previous master has failed. + * <> is required when an Elasticsearch cluster starts up for the very first time. In <>, with no discovery settings configured, this is automatically @@ -19,6 +20,7 @@ It is separated into several sections, which are explained below: <>, running a node in <> requires bootstrapping to be explicitly configured via the `cluster.initial_master_nodes` setting. + * It is recommended to have a small and fixed number of master-eligible nodes in a cluster, and to scale the cluster up and down by adding and removing master-ineligible nodes only. However there are situations in which it may @@ -27,786 +29,34 @@ It is separated into several sections, which are explained below: removing nodes>> describes this process as well as the extra steps that need to be performed when removing more than half of the master-eligible nodes at the same time. + * <> covers how a master publishes cluster states to the other nodes in the cluster. + * The <> is put in place when there is no known elected master, and can be configured to determine which operations should be rejected when it is in place. + * <> and <> sections cover advanced settings to influence the election and fault detection processes. + * <> explains the design behind the master election and auto-reconfiguration logic. -[float] -[[modules-discovery-hosts-providers]] -=== Discovery - -The cluster formation module uses a list of _seed_ nodes in order to start off -the discovery process. 
When you start an Elasticsearch node, or when a node -believes the master node to have failed, that node tries to connect to each -seed node in its list, and once connected the two nodes will repeatedly share -information about the other known master-eligible nodes in the cluster in order -to build a complete picture of the cluster. By default the cluster formation -module offers two hosts providers to configure the list of seed nodes: a -_settings_-based and a _file_-based hosts provider. It can be extended to -support cloud environments and other forms of hosts providers via -{plugins}/discovery.html[discovery plugins]. Hosts providers are configured -using the `discovery.zen.hosts_provider` setting, which defaults to the -_settings_-based hosts provider. Multiple hosts providers can be specified as a -list. - -[float] -[[settings-based-hosts-provider]] -===== Settings-based hosts provider - -The settings-based hosts provider use a node setting to configure a static list -of hosts to use as seed nodes. These hosts can be specified as hostnames or IP -addresses; hosts specified as hostnames are resolved to IP addresses during each -round of discovery. Note that if you are in an environment where DNS resolutions -vary with time, you might need to adjust your <>. - -The list of hosts is set using the `discovery.zen.ping.unicast.hosts` static -setting. This is either an array of hosts or a comma-delimited string. Each -value should be in the form of `host:port` or `host` (where `port` defaults to -the setting `transport.profiles.default.port` falling back to -`transport.tcp.port` if not set). Note that IPv6 hosts must be bracketed. The -default for this setting is `127.0.0.1, [::1]` - -[source,yaml] --------------------------------------------------- -discovery.zen.ping.unicast.hosts: - - 192.168.1.10:9300 - - 192.168.1.11 <1> - - seeds.mydomain.com <2> --------------------------------------------------- -<1> The port will default to `transport.profiles.default.port` and fallback to - `transport.tcp.port` if not specified. -<2> A hostname that resolves to multiple IP addresses will try all resolved - addresses. - -Additionally, the `discovery.zen.ping.unicast.hosts.resolve_timeout` configures -the amount of time to wait for DNS lookups on each round of discovery. This is -specified as a <> and defaults to 5s. - -Unicast discovery uses the <> module to perform the -discovery. - -[float] -[[file-based-hosts-provider]] -===== File-based hosts provider - -The file-based hosts provider configures a list of hosts via an external file. -Elasticsearch reloads this file when it changes, so that the list of seed nodes -can change dynamically without needing to restart each node. For example, this -gives a convenient mechanism for an Elasticsearch instance that is run in a -Docker container to be dynamically supplied with a list of IP addresses to -connect to when those IP addresses may not be known at node startup. - -To enable file-based discovery, configure the `file` hosts provider as follows: - -[source,txt] ----------------------------------------------------------------- -discovery.zen.hosts_provider: file ----------------------------------------------------------------- - -Then create a file at `$ES_PATH_CONF/unicast_hosts.txt` in the format described -below. Any time a change is made to the `unicast_hosts.txt` file the new changes -will be picked up by Elasticsearch and the new hosts list will be used. 
- -Note that the file-based discovery plugin augments the unicast hosts list in -`elasticsearch.yml`: if there are valid unicast host entries in -`discovery.zen.ping.unicast.hosts` then they will be used in addition to those -supplied in `unicast_hosts.txt`. - -The `discovery.zen.ping.unicast.hosts.resolve_timeout` setting also applies to -DNS lookups for nodes specified by address via file-based discovery. This is -specified as a <> and defaults to 5s. - -The format of the file is to specify one node entry per line. Each node entry -consists of the host (host name or IP address) and an optional transport port -number. If the port number is specified, is must come immediately after the -host (on the same line) separated by a `:`. If the port number is not -specified, a default value of 9300 is used. - -For example, this is an example of `unicast_hosts.txt` for a cluster with four -nodes that participate in unicast discovery, some of which are not running on -the default port: - -[source,txt] ----------------------------------------------------------------- -10.10.10.5 -10.10.10.6:9305 -10.10.10.5:10005 -# an IPv6 address -[2001:0db8:85a3:0000:0000:8a2e:0370:7334]:9301 ----------------------------------------------------------------- - -Host names are allowed instead of IP addresses (similar to -`discovery.zen.ping.unicast.hosts`), and IPv6 addresses must be specified in -brackets with the port coming after the brackets. - -It is also possible to add comments to this file. All comments must appear on -their lines starting with `#` (i.e. comments cannot start in the middle of a -line). - -[float] -[[ec2-hosts-provider]] -===== EC2 hosts provider - -The {plugins}/discovery-ec2.html[EC2 discovery plugin] adds a hosts provider -that uses the https://github.com/aws/aws-sdk-java[AWS API] to find a list of -seed nodes. - -[float] -[[azure-classic-hosts-provider]] -===== Azure Classic hosts provider - -The {plugins}/discovery-azure-classic.html[Azure Classic discovery plugin] adds -a hosts provider that uses the Azure Classic API find a list of seed nodes. - -[float] -[[gce-hosts-provider]] -===== Google Compute Engine hosts provider - -The {plugins}/discovery-gce.html[GCE discovery plugin] adds a hosts provider -that uses the GCE API find a list of seed nodes. - -[float] -==== Discovery settings - -Discovery operates in two phases: First, each node probes the addresses of all -known master-eligible nodes by connecting to each address and attempting to -identify the node to which it is connected. Secondly it shares with the remote -node a list of all of its known master-eligible peers and the remote node -responds with _its_ peers in turn. The node then probes all the new nodes that -it just discovered, requests their peers, and so on. - -If the node is not master-eligible then it continues this discovery process -until it has discovered an elected master node. If no elected master is -discovered then the node will retry after `discovery.find_peers_interval` which -defaults to `1s`. - -If the node is master-eligible then it continues this discovery process until it -has either discovered an elected master node or else it has discovered enough -masterless master-eligible nodes to complete an election. If neither of these -occur quickly enough then the node will retry after -`discovery.find_peers_interval` which defaults to `1s`. - -The discovery process is controlled by the following settings. 
- -`discovery.find_peers_interval`:: - - Sets how long a node will wait before attempting another discovery round. - Defaults to `1s`. - -`discovery.request_peers_timeout`:: - - Sets how long a node will wait after asking its peers again before - considering the request to have failed. Defaults to `3s`. - -`discovery.probe.connect_timeout`:: - - Sets how long to wait when attempting to connect to each address. Defaults - to `3s`. - -`discovery.probe.handshake_timeout`:: - - Sets how long to wait when attempting to identify the remote node via a - handshake. Defaults to `1s`. - -`discovery.cluster_formation_warning_timeout`:: - - Sets how long a node will try to form a cluster before logging a warning - that the cluster did not form. Defaults to `10s`. - -If a cluster has not formed after `discovery.cluster_formation_warning_timeout` -has elapsed then the node will log a warning message that starts with the phrase -`master not discovered` which describes the current state of the discovery -process. - -[float] -[[modules-discovery-bootstrap-cluster]] -=== Bootstrapping a cluster - -Starting an Elasticsearch cluster for the very first time requires the initial -set of master-eligible nodes to be explicitly set on one or more of the -master-eligible nodes in the cluster. This is known as _cluster bootstrapping_. -This is only required the very first time the cluster starts up: nodes that -have already joined a cluster will store this information in their data folder, -and freshly-started nodes that are intended to join an existing cluster will -obtain this information from the cluster's elected master. This information is -given using this setting: - -`cluster.initial_master_nodes`:: - - Sets a list of the <> or transport addresses of the - initial set of master-eligible nodes in a brand-new cluster. By default - this list is empty, meaning that this node expects to join a cluster that - has already been bootstrapped. - -This setting can be given on the command line when starting up each -master-eligible node, or added to the `elasticsearch.yml` configuration file on -those nodes. Once the cluster has formed this setting is no longer required and -is ignored. It need not be set on master-ineligible nodes, nor on -master-eligible nodes that are started to join an existing cluster. Note that -master-eligible nodes should use storage that persists across restarts. If they -do not, and `cluster.initial_master_nodes` is set, and a full cluster restart -occurs, then another brand-new cluster will form and this may result in data -loss. - -For a cluster with 3 master-eligible nodes (with <> -`master-a`, `master-b` and `master-c`) the configuration will look as follows: - -[source,yaml] --------------------------------------------------- -cluster.initial_master_nodes: - - master-a - - master-b - - master-c --------------------------------------------------- - -Alternatively the IP addresses or hostnames (<>) can be used. 
If there is more than one Elasticsearch node -with the same IP address or hostname then the transport ports must also be -given to specify exactly which node is meant: - -[source,yaml] --------------------------------------------------- -cluster.initial_master_nodes: - - 10.0.10.101 - - 10.0.10.102:9300 - - 10.0.10.102:9301 - - master-node-hostname --------------------------------------------------- - -Like all node settings, it is also possible to specify the initial set of -master nodes on the command-line that is used to start Elasticsearch: - -[source,bash] --------------------------------------------------- -$ bin/elasticsearch -Ecluster.initial_master_nodes=master-a,master-b,master-c --------------------------------------------------- - -It is technically sufficient to set this on a single master-eligible node in -the cluster, and only to mention that single node in the setting, but this -provides no fault tolerance before the cluster has fully formed. It -is therefore better to bootstrap using at least three master-eligible nodes. -In any case, when specifying the list of initial master nodes, **it is vitally -important** to configure each node with exactly the same list of nodes, to -prevent two independent clusters from forming. Typically you will set this on -the nodes that are mentioned in the list of initial master nodes. - -NOTE: In alpha releases, all listed master-eligible nodes are required to be - discovered before bootstrapping can take place. This requirement will be - relaxed in production-ready releases. - -WARNING: You must put exactly the same set of initial master nodes in each - configuration file (or leave the configuration empty) in order to be sure - that only a single cluster forms during bootstrapping and therefore to - avoid the risk of data loss. - -[float] -==== Choosing a cluster name - -The `cluster.name` allows you to create multiple clusters which are separated -from each other. Nodes verify that they agree on their cluster name when they -first connect to each other, and if two nodes have different cluster names then -they will not communicate meaningfully and will not belong to the same cluster. -The default value for the cluster name is `elasticsearch`, but it is -recommended to change this to reflect the logical name of the cluster. - -[float] -==== Auto-bootstrapping in development mode - -If the cluster is running with a completely default configuration then it will -automatically bootstrap a cluster based on the nodes that could be discovered -to be running on the same host within a short time after startup. This means -that by default it is possible to start up several nodes on a single machine -and have them automatically form a cluster which is very useful for development -environments and experimentation. However, since nodes may not always -successfully discover each other quickly enough this automatic bootstrapping -cannot be relied upon and cannot be used in production deployments. 
- -If any of the following settings are configured then auto-bootstrapping will -not take place, and you must configure `cluster.initial_master_nodes` as -described in the <>: - -* `discovery.zen.hosts_provider` -* `discovery.zen.ping.unicast.hosts` -* `cluster.initial_master_nodes` - -[float] -[[modules-discovery-adding-removing-nodes]] -=== Adding and removing nodes - -As nodes are added or removed Elasticsearch maintains an optimal level of fault -tolerance by automatically updating the cluster's _voting configuration_, which -is the set of master-eligible nodes whose responses are counted when making -decisions such as electing a new master or committing a new cluster state. - -It is recommended to have a small and fixed number of master-eligible nodes in a -cluster, and to scale the cluster up and down by adding and removing -master-ineligible nodes only. However there are situations in which it may be -desirable to add or remove some master-eligible nodes to or from a cluster. - -If you wish to add some master-eligible nodes to your cluster, simply configure -the new nodes to find the existing cluster and start them up. Elasticsearch will -add the new nodes to the voting configuration if it is appropriate to do so. - -When removing master-eligible nodes, it is important not to remove too many all -at the same time. For instance, if there are currently seven master-eligible -nodes and you wish to reduce this to three, it is not possible simply to stop -four of the nodes at once: to do so would leave only three nodes remaining, -which is less than half of the voting configuration, which means the cluster -cannot take any further actions. - -As long as there are at least three master-eligible nodes in the cluster, as a -general rule it is best to remove nodes one-at-a-time, allowing enough time for -the cluster to <> the voting -configuration and adapt the fault tolerance level to the new set of nodes. - -If there are only two master-eligible nodes remaining then neither node can be -safely removed since both are required to reliably make progress, so you must -first inform Elasticsearch that one of the nodes should not be part of the -voting configuration, and that the voting power should instead be given to -other nodes, allowing the excluded node to be taken offline without preventing -the other node from making progress. A node which is added to a voting -configuration exclusion list still works normally, but Elasticsearch will try -and remove it from the voting configuration so its vote is no longer required. -Importantly, Elasticsearch will never automatically move a node on the voting -exclusions list back into the voting configuration. Once an excluded node has -been successfully auto-reconfigured out of the voting configuration, it is safe -to shut it down without affecting the cluster's master-level availability. 
A -node can be added to the voting configuration exclusion list using the -following API: - -[source,js] --------------------------------------------------- -# Add node to voting configuration exclusions list and wait for the system to -# auto-reconfigure the node out of the voting configuration up to the default -# timeout of 30 seconds -POST /_cluster/voting_config_exclusions/node_name - -# Add node to voting configuration exclusions list and wait for -# auto-reconfiguration up to one minute -POST /_cluster/voting_config_exclusions/node_name?timeout=1m --------------------------------------------------- -// CONSOLE -// TEST[skip:this would break the test cluster if executed] - -The node that should be added to the exclusions list is specified using -<> in place of `node_name` here. If a call to the -voting configuration exclusions API fails then the call can safely be retried. -Only a successful response guarantees that the node has actually been removed -from the voting configuration and will not be reinstated. - -Although the voting configuration exclusions API is most useful for down-scaling -a two-node to a one-node cluster, it is also possible to use it to remove -multiple master-eligible nodes all at the same time. Adding multiple nodes -to the exclusions list has the system try to auto-reconfigure all of these nodes -out of the voting configuration, allowing them to be safely shut down while -keeping the cluster available. In the example described above, shrinking a -seven-master-node cluster down to only have three master nodes, you could add -four nodes to the exclusions list, wait for confirmation, and then shut them -down simultaneously. - -NOTE: Voting exclusions are only required when removing at least half of the -master-eligible nodes from a cluster in a short time period. They are not -required when removing master-ineligible nodes, nor are they required when -removing fewer than half of the master-eligible nodes. - -Adding an exclusion for a node creates an entry for that node in the voting -configuration exclusions list, which has the system automatically try to -reconfigure the voting configuration to remove that node and prevents it from -returning to the voting configuration once it has removed. The current list of -exclusions is stored in the cluster state and can be inspected as follows: - -[source,js] --------------------------------------------------- -GET /_cluster/state?filter_path=metadata.cluster_coordination.voting_config_exclusions --------------------------------------------------- -// CONSOLE - -This list is limited in size by the following setting: - -`cluster.max_voting_config_exclusions`:: - - Sets a limits on the number of voting configuration exclusions at any one - time. Defaults to `10`. - -Since voting configuration exclusions are persistent and limited in number, they -must be cleaned up. Normally an exclusion is added when performing some -maintenance on the cluster, and the exclusions should be cleaned up when the -maintenance is complete. Clusters should have no voting configuration exclusions -in normal operation. - -If a node is excluded from the voting configuration because it is to be shut -down permanently then its exclusion can be removed once it has shut down and -been removed from the cluster. 
Exclusions can also be cleared if they were -created in error or were only required temporarily: - -[source,js] --------------------------------------------------- -# Wait for all the nodes with voting configuration exclusions to be removed from -# the cluster and then remove all the exclusions, allowing any node to return to -# the voting configuration in the future. -DELETE /_cluster/voting_config_exclusions - -# Immediately remove all the voting configuration exclusions, allowing any node -# to return to the voting configuration in the future. -DELETE /_cluster/voting_config_exclusions?wait_for_removal=false --------------------------------------------------- -// CONSOLE - -[float] -[[cluster-state-publishing]] -=== Cluster state publishing - -The master node is the only node in a cluster that can make changes to the -cluster state. The master node processes one batch of cluster state updates at -a time, computing the required changes and publishing the updated cluster state -to all the other nodes in the cluster. Each publication starts with the master -broadcasting the updated cluster state to all nodes in the cluster, to which -each node responds with an acknowledgement but does not yet apply the -newly-received state. Once the master has collected acknowledgements from -enough master-eligible nodes the new cluster state is said to be _committed_, -and the master broadcasts another message instructing nodes to apply the -now-committed state. Each node receives this message, applies the updated -state, and then sends a second acknowledgement back to the master. - -The master allows a limited amount of time for each cluster state update to be -completely published to all nodes, defined by `cluster.publish.timeout`, which -defaults to `30s`, measured from the time the publication started. If this time -is reached before the new cluster state is committed then the cluster state -change is rejected, the master considers itself to have failed, stands down, -and starts trying to elect a new master. - -However, if the new cluster state is committed before `cluster.publish.timeout` -has elapsed, but before all acknowledgements have been received, then the -master node considers the change to have succeeded and starts processing and -publishing the next cluster state update, even though some nodes have not yet -confirmed that they have applied the current one. These nodes are said to be -_lagging_ since their cluster states have fallen behind the master's latest -state. The master waits for the lagging nodes to catch up for a further time, -`cluster.follower_lag.timeout`, which defaults to `90s`, and if a node has -still not successfully applied the cluster state update within this time then -it is considered to have failed and is removed from the cluster. - -NOTE: Elasticsearch is a peer to peer based system, in which nodes communicate -with one another directly. The high-throughput APIs (index, delete, search) do -not normally interact with the master node. The responsibility of the master -node is to maintain the global cluster state, and act if nodes join or leave -the cluster by reassigning shards. Each time the cluster state is changed, the -new state is published to all nodes in the cluster as described above. - -[float] -[[no-master-block]] -=== No master block - -For the cluster to be fully operational, it must have an active master. The -`discovery.zen.no_master_block` settings controls what operations should be -rejected when there is no active master. 
- -The `discovery.zen.no_master_block` setting has two valid values: - -[horizontal] -`all`:: All operations on the node--i.e. both read & writes--will be rejected. -This also applies for api cluster state read or write operations, like the get -index settings, put mapping and cluster state api. -`write`:: (default) Write operations will be rejected. Read operations will -succeed, based on the last known cluster configuration. This may result in -partial reads of stale data as this node may be isolated from the rest of the -cluster. - -The `discovery.zen.no_master_block` setting doesn't apply to nodes-based APIs -(for example cluster stats, node info, and node stats APIs). Requests to these -APIs will not be blocked and can run on any available node. - -[float] -[[master-election]] -=== Master Election - -Elasticsearch uses an election process to agree on an elected master node, both -at startup and if the existing elected master fails. Any master-eligible node -can start an election, and normally the first election that takes place will -succeed. Elections only usually fail when two nodes both happen to start their -elections at about the same time, so elections are scheduled randomly on each -node to avoid this happening. Nodes will retry elections until a master is -elected, backing off on failure, so that eventually an election will succeed -(with arbitrarily high probability). The following settings control the -scheduling of elections. - -`cluster.election.initial_timeout`:: - - Sets the upper bound on how long a node will wait initially, or after the - elected master fails, before attempting its first election. This defaults - to `100ms`. - -`cluster.election.back_off_time`:: - - Sets the amount to increase the upper bound on the wait before an election - on each election failure. Note that this is _linear_ backoff. This defaults - to `100ms` - -`cluster.election.max_timeout`:: - - Sets the maximum upper bound on how long a node will wait before attempting - an first election, so that an network partition that lasts for a long time - does not result in excessively sparse elections. This defaults to `10s` - -`cluster.election.duration`:: - - Sets how long each election is allowed to take before a node considers it to - have failed and schedules a retry. This defaults to `500ms`. - -[float] -==== Joining an elected master - -During master election, or when joining an existing formed cluster, a node will -send a join request to the master in order to be officially added to the -cluster. This join process can be configured with the following settings. - -`cluster.join.timeout`:: - - Sets how long a node will wait after sending a request to join a cluster - before it considers the request to have failed and retries. Defaults to - `60s`. - -[float] -[[fault-detection]] -=== Fault Detection - -An elected master periodically checks each of the nodes in the cluster in order -to ensure that they are still connected and healthy, and in turn each node in -the cluster periodically checks the health of the elected master. These checks -are known respectively as _follower checks_ and _leader checks_. - -Elasticsearch allows for these checks occasionally to fail or timeout without -taking any action, and will only consider a node to be truly faulty after a -number of consecutive checks have failed. The following settings control the -behaviour of fault detection. 
- -`cluster.fault_detection.follower_check.interval`:: - - Sets how long the elected master waits between follower checks to each - other node in the cluster. Defaults to `1s`. - -`cluster.fault_detection.follower_check.timeout`:: - - Sets how long the elected master waits for a response to a follower check - before considering it to have failed. Defaults to `30s`. - -`cluster.fault_detection.follower_check.retry_count`:: - - Sets how many consecutive follower check failures must occur to each node - before the elected master considers that node to be faulty and removes it - from the cluster. Defaults to `3`. - -`cluster.fault_detection.leader_check.interval`:: - - Sets how long each node waits between checks of the elected master. - Defaults to `1s`. - -`cluster.fault_detection.leader_check.timeout`:: - - Sets how long each node waits for a response to a leader check from the - elected master before considering it to have failed. Defaults to `30s`. - -`cluster.fault_detection.leader_check.retry_count`:: - - Sets how many consecutive leader check failures must occur before a node - considers the elected master to be faulty and attempts to find or elect a - new master. Defaults to `3`. - -If the elected master detects that a node has disconnected then this is treated -as an immediate failure, bypassing the timeouts and retries listed above, and -the master attempts to remove the node from the cluster. Similarly, if a node -detects that the elected master has disconnected then this is treated as an -immediate failure, bypassing the timeouts and retries listed above, and the -follower restarts its discovery phase to try and find or elect a new master. - -[float] -[[modules-discovery-quorums]] -=== Quorum-based decision making - -Electing a master node and changing the cluster state are the two fundamental -tasks that master-eligible nodes must work together to perform. It is important -that these activities work robustly even if some nodes have failed, and -Elasticsearch achieves this robustness by only considering each action to have -succeeded on receipt of responses from a _quorum_, a subset of the -master-eligible nodes in the cluster. The advantage of requiring only a subset -of the nodes to respond is that it allows for some of the nodes to fail without -preventing the cluster from making progress, and the quorums are carefully -chosen so as not to allow the cluster to "split brain", i.e. to be partitioned -into two pieces each of which may make decisions that are inconsistent with -those of the other piece. - -Elasticsearch allows you to add and remove master-eligible nodes to a running -cluster. In many cases you can do this simply by starting or stopping the nodes -as required, as described in more detail in the -<>. - -As nodes are added or removed Elasticsearch maintains an optimal level of fault -tolerance by updating the cluster's _voting configuration_, which is the set of -master-eligible nodes whose responses are counted when making decisions such as -electing a new master or committing a new cluster state. A decision is only made -once more than half of the nodes in the voting configuration have responded. -Usually the voting configuration is the same as the set of all the -master-eligible nodes that are currently in the cluster, but there are some -situations in which they may be different. - -To be sure that the cluster remains available you **must not stop half or more -of the nodes in the voting configuration at the same time**. 
As long as more -than half of the voting nodes are available the cluster can still work normally. -This means that if there are three or four master-eligible nodes then the -cluster can tolerate one of them being unavailable; if there are two or fewer -master-eligible nodes then they must all remain available. - -After a node has joined or left the cluster the elected master must issue a -cluster-state update that adjusts the voting configuration to match, and this -can take a short time to complete. It is important to wait for this adjustment -to complete before removing more nodes from the cluster. - -[float] -==== Setting the initial quorum - -When a brand-new cluster starts up for the first time, one of the tasks it must -perform is to elect its first master node, for which it needs to know the set -of master-eligible nodes whose votes should count in this first election. This -initial voting configuration is known as the _bootstrap configuration_ and is -set in the <>. - -It is important that the bootstrap configuration identifies exactly which nodes -should vote in the first election, and it is not sufficient to configure each -node with an expectation of how many nodes there should be in the cluster. It -is also important to note that the bootstrap configuration must come from -outside the cluster: there is no safe way for the cluster to determine the -bootstrap configuration correctly on its own. - -If the bootstrap configuration is not set correctly then there is a risk when -starting up a brand-new cluster is that you accidentally form two separate -clusters instead of one. This could lead to data loss: you might start using -both clusters before noticing that anything had gone wrong, and it will then be -impossible to merge them together later. - -NOTE: To illustrate the problem with configuring each node to expect a certain -cluster size, imagine starting up a three-node cluster in which each node knows -that it is going to be part of a three-node cluster. A majority of three nodes -is two, so normally the first two nodes to discover each other will form a -cluster and the third node will join them a short time later. However, imagine -that four nodes were erroneously started instead of three: in this case there -are enough nodes to form two separate clusters. Of course if each node is -started manually then it's unlikely that too many nodes are started, but it's -certainly possible to get into this situation if using a more automated -orchestrator, particularly if the orchestrator is not resilient to failures -such as network partitions. - -The initial quorum is only required the very first time a whole cluster starts -up: new nodes joining an established cluster can safely obtain all the -information they need from the elected master, and nodes that have previously -been part of a cluster will have stored to disk all the information required -when restarting. - -[float] -==== Cluster maintenance, rolling restarts and migrations - -Many cluster maintenance tasks involve temporarily shutting down one or more -nodes and then starting them back up again. By default Elasticsearch can remain -available if one of its master-eligible nodes is taken offline, such as during a -<>. Furthermore, if multiple nodes are stopped -and then started again then it will automatically recover, such as during a -<>. There is no need to take any further -action with the APIs described here in these cases, because the set of master -nodes is not changing permanently. 
- -[float] -==== Auto-reconfiguration - -Nodes may join or leave the cluster, and Elasticsearch reacts by making -corresponding changes to the voting configuration in order to ensure that the -cluster is as resilient as possible. The default auto-reconfiguration behaviour -is expected to give the best results in most situation. The current voting -configuration is stored in the cluster state so you can inspect its current -contents as follows: - -[source,js] --------------------------------------------------- -GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config --------------------------------------------------- -// CONSOLE - -NOTE: The current voting configuration is not necessarily the same as the set of -all available master-eligible nodes in the cluster. Altering the voting -configuration itself involves taking a vote, so it takes some time to adjust the -configuration as nodes join or leave the cluster. Also, there are situations -where the most resilient configuration includes unavailable nodes, or does not -include some available nodes, and in these situations the voting configuration -will differ from the set of available master-eligible nodes in the cluster. - -Larger voting configurations are usually more resilient, so Elasticsearch will -normally prefer to add master-eligible nodes to the voting configuration once -they have joined the cluster. Similarly, if a node in the voting configuration -leaves the cluster and there is another master-eligible node in the cluster that -is not in the voting configuration then it is preferable to swap these two nodes -over, leaving the size of the voting configuration unchanged but increasing its -resilience. - -It is not so straightforward to automatically remove nodes from the voting -configuration after they have left the cluster, and different strategies have -different benefits and drawbacks, so the right choice depends on how the cluster -will be used and is controlled by the following setting. +include::discovery/adding-removing-nodes.asciidoc[] -`cluster.auto_shrink_voting_configuration`:: +include::discovery/bootstrapping.asciidoc[] - Defaults to `true`, meaning that the voting configuration will automatically - shrink, shedding departed nodes, as long as it still contains at least 3 - nodes. If set to `false`, the voting configuration never automatically - shrinks; departed nodes must be removed manually using the - <>. +include::discovery/discovery.asciidoc[] -NOTE: If `cluster.auto_shrink_voting_configuration` is set to `true`, the -recommended and default setting, and there are at least three master-eligible -nodes in the cluster, then Elasticsearch remains capable of processing -cluster-state updates as long as all but one of its master-eligible nodes are -healthy. +include::discovery/fault-detection.asciidoc[] -There are situations in which Elasticsearch might tolerate the loss of multiple -nodes, but this is not guaranteed under all sequences of failures. If this -setting is set to `false` then departed nodes must be removed from the voting -configuration manually, using the -<>, to achieve -the desired level of resilience. +include::discovery/master-election.asciidoc[] -Note that Elasticsearch will not suffer from a "split-brain" inconsistency -however it is configured. This setting only affects its availability in the -event of the failure of some of its nodes, and the administrative tasks that -must be performed as nodes join and leave the cluster. 
+include::discovery/no-master-block.asciidoc[] -[float] -==== Even numbers of master-eligible nodes +include::discovery/publishing.asciidoc[] -There should normally be an odd number of master-eligible nodes in a cluster. -If there is an even number then Elasticsearch will leave one of them out of the -voting configuration to ensure that it has an odd size. This does not decrease -the failure-tolerance of the cluster, and in fact improves it slightly: if the -cluster is partitioned into two even halves then one of the halves will contain -a majority of the voting configuration and will be able to keep operating, -whereas if all of the master-eligible nodes' votes were counted then neither -side could make any progress in this situation. +include::discovery/quorums.asciidoc[] -For instance if there are four master-eligible nodes in the cluster and the -voting configuration contained all of them then any quorum-based decision would -require votes from at least three of them, which means that the cluster can only -tolerate the loss of a single master-eligible node. If this cluster were split -into two equal halves then neither half would contain three master-eligible -nodes so would not be able to make any progress. However if the voting -configuration contains only three of the four master-eligible nodes then the -cluster is still only fully tolerant to the loss of one node, but quorum-based -decisions require votes from two of the three voting nodes. In the event of an -even split, one half will contain two of the three voting nodes so will remain -available. diff --git a/docs/reference/modules/discovery/adding-removing-nodes.asciidoc b/docs/reference/modules/discovery/adding-removing-nodes.asciidoc new file mode 100644 index 0000000000000..0e759acb4972a --- /dev/null +++ b/docs/reference/modules/discovery/adding-removing-nodes.asciidoc @@ -0,0 +1,121 @@ +[[modules-discovery-adding-removing-nodes]] +=== Adding and removing nodes + +As nodes are added or removed Elasticsearch maintains an optimal level of fault +tolerance by automatically updating the cluster's _voting configuration_, which +is the set of master-eligible nodes whose responses are counted when making +decisions such as electing a new master or committing a new cluster state. + +It is recommended to have a small and fixed number of master-eligible nodes in a +cluster, and to scale the cluster up and down by adding and removing +master-ineligible nodes only. However there are situations in which it may be +desirable to add or remove some master-eligible nodes to or from a cluster. + +If you wish to add some master-eligible nodes to your cluster, simply configure +the new nodes to find the existing cluster and start them up. Elasticsearch will +add the new nodes to the voting configuration if it is appropriate to do so. + +When removing master-eligible nodes, it is important not to remove too many all +at the same time. For instance, if there are currently seven master-eligible +nodes and you wish to reduce this to three, it is not possible simply to stop +four of the nodes at once: to do so would leave only three nodes remaining, +which is less than half of the voting configuration, which means the cluster +cannot take any further actions. + +As long as there are at least three master-eligible nodes in the cluster, as a +general rule it is best to remove nodes one-at-a-time, allowing enough time for +the cluster to <> the voting +configuration and adapt the fault tolerance level to the new set of nodes. 
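+
+If you want to confirm that the cluster has adjusted before stopping the next
+node, one option (a convenience rather than a requirement) is to inspect the
+committed voting configuration in the cluster state, using the same filter
+that appears elsewhere in these docs:
+
+[source,js]
+--------------------------------------------------
+GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config
+--------------------------------------------------
+// CONSOLE
+
+Once the reported configuration has settled to reflect the remaining nodes it
+is safe to proceed with the next removal.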
+
+If there are only two master-eligible nodes remaining then neither node can be
+safely removed since both are required to reliably make progress, so you must
+first inform Elasticsearch that one of the nodes should not be part of the
+voting configuration, and that the voting power should instead be given to
+other nodes, allowing the excluded node to be taken offline without preventing
+the other node from making progress. A node which is added to a voting
+configuration exclusion list still works normally, but Elasticsearch will try
+to remove it from the voting configuration so that its vote is no longer
+required. Importantly, Elasticsearch will never automatically move a node on
+the voting exclusions list back into the voting configuration. Once an excluded
+node has been successfully auto-reconfigured out of the voting configuration,
+it is safe to shut it down without affecting the cluster's master-level
+availability. A node can be added to the voting configuration exclusion list
+using the following API:
+
+[source,js]
+--------------------------------------------------
+# Add node to voting configuration exclusions list and wait for the system to
+# auto-reconfigure the node out of the voting configuration up to the default
+# timeout of 30 seconds
+POST /_cluster/voting_config_exclusions/node_name
+
+# Add node to voting configuration exclusions list and wait for
+# auto-reconfiguration up to one minute
+POST /_cluster/voting_config_exclusions/node_name?timeout=1m
+--------------------------------------------------
+// CONSOLE
+// TEST[skip:this would break the test cluster if executed]
+
+The node that should be added to the exclusions list is specified using
+<> in place of `node_name` here. If a call to the
+voting configuration exclusions API fails then the call can safely be retried.
+Only a successful response guarantees that the node has actually been removed
+from the voting configuration and will not be reinstated.
+
+Although the voting configuration exclusions API is most useful for down-scaling
+a two-node to a one-node cluster, it is also possible to use it to remove
+multiple master-eligible nodes all at the same time. Adding multiple nodes to
+the exclusions list causes the system to try to auto-reconfigure all of these
+nodes out of the voting configuration, allowing them to be safely shut down
+while keeping the cluster available. In the example described above, where a
+seven-master-node cluster is shrunk down to only three master nodes, you could
+add four nodes to the exclusions list, wait for confirmation, and then shut
+them down simultaneously.
+
+NOTE: Voting exclusions are only required when removing at least half of the
+master-eligible nodes from a cluster in a short time period. They are not
+required when removing master-ineligible nodes, nor are they required when
+removing fewer than half of the master-eligible nodes.
+
+Adding an exclusion for a node creates an entry for that node in the voting
+configuration exclusions list, which causes the system to automatically try to
+reconfigure the voting configuration to remove that node, and prevents it from
+returning to the voting configuration once it has been removed.
The current list of +exclusions is stored in the cluster state and can be inspected as follows: + +[source,js] +-------------------------------------------------- +GET /_cluster/state?filter_path=metadata.cluster_coordination.voting_config_exclusions +-------------------------------------------------- +// CONSOLE + +This list is limited in size by the following setting: + +`cluster.max_voting_config_exclusions`:: + + Sets a limits on the number of voting configuration exclusions at any one + time. Defaults to `10`. + +Since voting configuration exclusions are persistent and limited in number, they +must be cleaned up. Normally an exclusion is added when performing some +maintenance on the cluster, and the exclusions should be cleaned up when the +maintenance is complete. Clusters should have no voting configuration exclusions +in normal operation. + +If a node is excluded from the voting configuration because it is to be shut +down permanently then its exclusion can be removed once it has shut down and +been removed from the cluster. Exclusions can also be cleared if they were +created in error or were only required temporarily: + +[source,js] +-------------------------------------------------- +# Wait for all the nodes with voting configuration exclusions to be removed from +# the cluster and then remove all the exclusions, allowing any node to return to +# the voting configuration in the future. +DELETE /_cluster/voting_config_exclusions + +# Immediately remove all the voting configuration exclusions, allowing any node +# to return to the voting configuration in the future. +DELETE /_cluster/voting_config_exclusions?wait_for_removal=false +-------------------------------------------------- +// CONSOLE diff --git a/docs/reference/modules/discovery/bootstrapping.asciidoc b/docs/reference/modules/discovery/bootstrapping.asciidoc new file mode 100644 index 0000000000000..ffcab074e6724 --- /dev/null +++ b/docs/reference/modules/discovery/bootstrapping.asciidoc @@ -0,0 +1,110 @@ +[[modules-discovery-bootstrap-cluster]] +=== Bootstrapping a cluster + +Starting an Elasticsearch cluster for the very first time requires the initial +set of master-eligible nodes to be explicitly set on one or more of the +master-eligible nodes in the cluster. This is known as _cluster bootstrapping_. +This is only required the very first time the cluster starts up: nodes that +have already joined a cluster will store this information in their data folder, +and freshly-started nodes that are intended to join an existing cluster will +obtain this information from the cluster's elected master. This information is +given using this setting: + +`cluster.initial_master_nodes`:: + + Sets a list of the <> or transport addresses of the + initial set of master-eligible nodes in a brand-new cluster. By default + this list is empty, meaning that this node expects to join a cluster that + has already been bootstrapped. + +This setting can be given on the command line when starting up each +master-eligible node, or added to the `elasticsearch.yml` configuration file on +those nodes. Once the cluster has formed this setting is no longer required and +is ignored. It need not be set on master-ineligible nodes, nor on +master-eligible nodes that are started to join an existing cluster. Note that +master-eligible nodes should use storage that persists across restarts. 
If they +do not, and `cluster.initial_master_nodes` is set, and a full cluster restart +occurs, then another brand-new cluster will form and this may result in data +loss. + +For a cluster with 3 master-eligible nodes (with <> +`master-a`, `master-b` and `master-c`) the configuration will look as follows: + +[source,yaml] +-------------------------------------------------- +cluster.initial_master_nodes: + - master-a + - master-b + - master-c +-------------------------------------------------- + +Alternatively the IP addresses or hostnames (<>) can be used. If there is more than one Elasticsearch node +with the same IP address or hostname then the transport ports must also be +given to specify exactly which node is meant: + +[source,yaml] +-------------------------------------------------- +cluster.initial_master_nodes: + - 10.0.10.101 + - 10.0.10.102:9300 + - 10.0.10.102:9301 + - master-node-hostname +-------------------------------------------------- + +Like all node settings, it is also possible to specify the initial set of +master nodes on the command-line that is used to start Elasticsearch: + +[source,bash] +-------------------------------------------------- +$ bin/elasticsearch -Ecluster.initial_master_nodes=master-a,master-b,master-c +-------------------------------------------------- + +It is technically sufficient to set this on a single master-eligible node in +the cluster, and only to mention that single node in the setting, but this +provides no fault tolerance before the cluster has fully formed. It +is therefore better to bootstrap using at least three master-eligible nodes. +In any case, when specifying the list of initial master nodes, **it is vitally +important** to configure each node with exactly the same list of nodes, to +prevent two independent clusters from forming. Typically you will set this on +the nodes that are mentioned in the list of initial master nodes. + +NOTE: In alpha releases, all listed master-eligible nodes are required to be + discovered before bootstrapping can take place. This requirement will be + relaxed in production-ready releases. + +WARNING: You must put exactly the same set of initial master nodes in each + configuration file (or leave the configuration empty) in order to be sure + that only a single cluster forms during bootstrapping and therefore to + avoid the risk of data loss. + +[float] +==== Choosing a cluster name + +The `cluster.name` allows you to create multiple clusters which are separated +from each other. Nodes verify that they agree on their cluster name when they +first connect to each other, and if two nodes have different cluster names then +they will not communicate meaningfully and will not belong to the same cluster. +The default value for the cluster name is `elasticsearch`, but it is +recommended to change this to reflect the logical name of the cluster. + +[float] +==== Auto-bootstrapping in development mode + +If the cluster is running with a completely default configuration then it will +automatically bootstrap a cluster based on the nodes that could be discovered +to be running on the same host within a short time after startup. This means +that by default it is possible to start up several nodes on a single machine +and have them automatically form a cluster which is very useful for development +environments and experimentation. However, since nodes may not always +successfully discover each other quickly enough this automatic bootstrapping +cannot be relied upon and cannot be used in production deployments. 
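+
+In production, by contrast, these settings are given explicitly. As an
+illustration only, reusing the node names from the example above and treating
+the addresses as placeholders for your own seed hosts, a master-eligible node
+in a brand-new production cluster might combine its seed list and its bootstrap
+configuration like this:
+
+[source,yaml]
+--------------------------------------------------
+discovery.zen.ping.unicast.hosts:
+   - 192.168.1.10:9300
+   - 192.168.1.11
+cluster.initial_master_nodes:
+   - master-a
+   - master-b
+   - master-c
+--------------------------------------------------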
+ +If any of the following settings are configured then auto-bootstrapping will +not take place, and you must configure `cluster.initial_master_nodes` as +described in the <>: + +* `discovery.zen.hosts_provider` +* `discovery.zen.ping.unicast.hosts` +* `cluster.initial_master_nodes` diff --git a/docs/reference/modules/discovery/discovery.asciidoc b/docs/reference/modules/discovery/discovery.asciidoc new file mode 100644 index 0000000000000..5467dd2f376f6 --- /dev/null +++ b/docs/reference/modules/discovery/discovery.asciidoc @@ -0,0 +1,187 @@ +[[modules-discovery-hosts-providers]] +=== Discovery + +The cluster formation module uses a list of _seed_ nodes in order to start off +the discovery process. When you start an Elasticsearch node, or when a node +believes the master node to have failed, that node tries to connect to each +seed node in its list, and once connected the two nodes will repeatedly share +information about the other known master-eligible nodes in the cluster in order +to build a complete picture of the cluster. By default the cluster formation +module offers two hosts providers to configure the list of seed nodes: a +_settings_-based and a _file_-based hosts provider. It can be extended to +support cloud environments and other forms of hosts providers via +{plugins}/discovery.html[discovery plugins]. Hosts providers are configured +using the `discovery.zen.hosts_provider` setting, which defaults to the +_settings_-based hosts provider. Multiple hosts providers can be specified as a +list. + +[float] +[[settings-based-hosts-provider]] +===== Settings-based hosts provider + +The settings-based hosts provider use a node setting to configure a static list +of hosts to use as seed nodes. These hosts can be specified as hostnames or IP +addresses; hosts specified as hostnames are resolved to IP addresses during each +round of discovery. Note that if you are in an environment where DNS resolutions +vary with time, you might need to adjust your <>. + +The list of hosts is set using the `discovery.zen.ping.unicast.hosts` static +setting. This is either an array of hosts or a comma-delimited string. Each +value should be in the form of `host:port` or `host` (where `port` defaults to +the setting `transport.profiles.default.port` falling back to +`transport.tcp.port` if not set). Note that IPv6 hosts must be bracketed. The +default for this setting is `127.0.0.1, [::1]` + +[source,yaml] +-------------------------------------------------- +discovery.zen.ping.unicast.hosts: + - 192.168.1.10:9300 + - 192.168.1.11 <1> + - seeds.mydomain.com <2> +-------------------------------------------------- +<1> The port will default to `transport.profiles.default.port` and fallback to + `transport.tcp.port` if not specified. +<2> A hostname that resolves to multiple IP addresses will try all resolved + addresses. + +Additionally, the `discovery.zen.ping.unicast.hosts.resolve_timeout` configures +the amount of time to wait for DNS lookups on each round of discovery. This is +specified as a <> and defaults to 5s. + +Unicast discovery uses the <> module to perform the +discovery. + +[float] +[[file-based-hosts-provider]] +===== File-based hosts provider + +The file-based hosts provider configures a list of hosts via an external file. +Elasticsearch reloads this file when it changes, so that the list of seed nodes +can change dynamically without needing to restart each node. 
For example, this +gives a convenient mechanism for an Elasticsearch instance that is run in a +Docker container to be dynamically supplied with a list of IP addresses to +connect to when those IP addresses may not be known at node startup. + +To enable file-based discovery, configure the `file` hosts provider as follows: + +[source,txt] +---------------------------------------------------------------- +discovery.zen.hosts_provider: file +---------------------------------------------------------------- + +Then create a file at `$ES_PATH_CONF/unicast_hosts.txt` in the format described +below. Any time a change is made to the `unicast_hosts.txt` file, the change +will be picked up by Elasticsearch and the new hosts list will be used. + +Note that the file-based discovery plugin augments the unicast hosts list in +`elasticsearch.yml`: if there are valid unicast host entries in +`discovery.zen.ping.unicast.hosts` then they will be used in addition to those +supplied in `unicast_hosts.txt`. + +The `discovery.zen.ping.unicast.hosts.resolve_timeout` setting also applies to +DNS lookups for nodes specified by address via file-based discovery. This is +specified as a <> and defaults to 5s. + +The format of the file is to specify one node entry per line. Each node entry +consists of the host (host name or IP address) and an optional transport port +number. If the port number is specified, it must come immediately after the +host (on the same line) separated by a `:`. If the port number is not +specified, a default value of 9300 is used. + +For example, here is the content of `unicast_hosts.txt` for a cluster with four +nodes that participate in unicast discovery, some of which are not running on +the default port: + +[source,txt] +---------------------------------------------------------------- +10.10.10.5 +10.10.10.6:9305 +10.10.10.5:10005 +# an IPv6 address +[2001:0db8:85a3:0000:0000:8a2e:0370:7334]:9301 +---------------------------------------------------------------- + +Host names are allowed instead of IP addresses (similar to +`discovery.zen.ping.unicast.hosts`), and IPv6 addresses must be specified in +brackets with the port coming after the brackets. + +It is also possible to add comments to this file. All comments must appear on +their own lines, starting with `#` (i.e. comments cannot start in the middle of +a line). + +[float] +[[ec2-hosts-provider]] +===== EC2 hosts provider + +The {plugins}/discovery-ec2.html[EC2 discovery plugin] adds a hosts provider +that uses the https://github.com/aws/aws-sdk-java[AWS API] to find a list of +seed nodes. + +[float] +[[azure-classic-hosts-provider]] +===== Azure Classic hosts provider + +The {plugins}/discovery-azure-classic.html[Azure Classic discovery plugin] adds +a hosts provider that uses the Azure Classic API to find a list of seed nodes. + +[float] +[[gce-hosts-provider]] +===== Google Compute Engine hosts provider + +The {plugins}/discovery-gce.html[GCE discovery plugin] adds a hosts provider +that uses the GCE API to find a list of seed nodes. + +[float] +==== Discovery settings + +Discovery operates in two phases: First, each node probes the addresses of all +known master-eligible nodes by connecting to each address and attempting to +identify the node to which it is connected. Second, it shares with the remote +node a list of all of its known master-eligible peers and the remote node +responds with _its_ peers in turn. The node then probes all the new nodes that +it just discovered, requests their peers, and so on. 
+ +If the node is not master-eligible then it continues this discovery process +until it has discovered an elected master node. If no elected master is +discovered then the node will retry after `discovery.find_peers_interval` which +defaults to `1s`. + +If the node is master-eligible then it continues this discovery process until it +has either discovered an elected master node or else it has discovered enough +masterless master-eligible nodes to complete an election. If neither of these +occur quickly enough then the node will retry after +`discovery.find_peers_interval` which defaults to `1s`. + +The discovery process is controlled by the following settings. + +`discovery.find_peers_interval`:: + + Sets how long a node will wait before attempting another discovery round. + Defaults to `1s`. + +`discovery.request_peers_timeout`:: + + Sets how long a node will wait after asking its peers again before + considering the request to have failed. Defaults to `3s`. + +`discovery.probe.connect_timeout`:: + + Sets how long to wait when attempting to connect to each address. Defaults + to `3s`. + +`discovery.probe.handshake_timeout`:: + + Sets how long to wait when attempting to identify the remote node via a + handshake. Defaults to `1s`. + +`discovery.cluster_formation_warning_timeout`:: + + Sets how long a node will try to form a cluster before logging a warning + that the cluster did not form. Defaults to `10s`. + +If a cluster has not formed after `discovery.cluster_formation_warning_timeout` +has elapsed then the node will log a warning message that starts with the phrase +`master not discovered` which describes the current state of the discovery +process. + diff --git a/docs/reference/modules/discovery/fault-detection.asciidoc b/docs/reference/modules/discovery/fault-detection.asciidoc new file mode 100644 index 0000000000000..6d9fa0587c513 --- /dev/null +++ b/docs/reference/modules/discovery/fault-detection.asciidoc @@ -0,0 +1,52 @@ +[[fault-detection]] +=== Fault Detection + +An elected master periodically checks each of the nodes in the cluster in order +to ensure that they are still connected and healthy, and in turn each node in +the cluster periodically checks the health of the elected master. These checks +are known respectively as _follower checks_ and _leader checks_. + +Elasticsearch allows for these checks occasionally to fail or timeout without +taking any action, and will only consider a node to be truly faulty after a +number of consecutive checks have failed. The following settings control the +behaviour of fault detection. + +`cluster.fault_detection.follower_check.interval`:: + + Sets how long the elected master waits between follower checks to each + other node in the cluster. Defaults to `1s`. + +`cluster.fault_detection.follower_check.timeout`:: + + Sets how long the elected master waits for a response to a follower check + before considering it to have failed. Defaults to `30s`. + +`cluster.fault_detection.follower_check.retry_count`:: + + Sets how many consecutive follower check failures must occur to each node + before the elected master considers that node to be faulty and removes it + from the cluster. Defaults to `3`. + +`cluster.fault_detection.leader_check.interval`:: + + Sets how long each node waits between checks of the elected master. + Defaults to `1s`. + +`cluster.fault_detection.leader_check.timeout`:: + + Sets how long each node waits for a response to a leader check from the + elected master before considering it to have failed. Defaults to `30s`. 
+ +`cluster.fault_detection.leader_check.retry_count`:: + + Sets how many consecutive leader check failures must occur before a node + considers the elected master to be faulty and attempts to find or elect a + new master. Defaults to `3`. + +If the elected master detects that a node has disconnected then this is treated +as an immediate failure, bypassing the timeouts and retries listed above, and +the master attempts to remove the node from the cluster. Similarly, if a node +detects that the elected master has disconnected then this is treated as an +immediate failure, bypassing the timeouts and retries listed above, and the +follower restarts its discovery phase to try to find or elect a new master. + diff --git a/docs/reference/modules/discovery/master-election.asciidoc b/docs/reference/modules/discovery/master-election.asciidoc new file mode 100644 index 0000000000000..6d3f5b9e34b5f --- /dev/null +++ b/docs/reference/modules/discovery/master-election.asciidoc @@ -0,0 +1,48 @@ +[[master-election]] +=== Master Election + +Elasticsearch uses an election process to agree on an elected master node, both +at startup and if the existing elected master fails. Any master-eligible node +can start an election, and normally the first election that takes place will +succeed. Elections only usually fail when two nodes both happen to start their +elections at about the same time, so elections are scheduled randomly on each +node to avoid this happening. Nodes will retry elections until a master is +elected, backing off on failure, so that eventually an election will succeed +(with arbitrarily high probability). The following settings control the +scheduling of elections. + +`cluster.election.initial_timeout`:: + + Sets the upper bound on how long a node will wait initially, or after the + elected master fails, before attempting its first election. This defaults + to `100ms`. + +`cluster.election.back_off_time`:: + + Sets the amount by which to increase the upper bound on the wait before an + election on each election failure. Note that this is _linear_ backoff. This + defaults to `100ms`. + +`cluster.election.max_timeout`:: + + Sets the maximum upper bound on how long a node will wait before attempting + a first election, so that a network partition that lasts for a long time + does not result in excessively sparse elections. This defaults to `10s`. + +`cluster.election.duration`:: + + Sets how long each election is allowed to take before a node considers it to + have failed and schedules a retry. This defaults to `500ms`. + +[float] +==== Joining an elected master + +During master election, or when joining an existing formed cluster, a node will +send a join request to the master in order to be officially added to the +cluster. This join process can be configured with the following settings. + +`cluster.join.timeout`:: + + Sets how long a node will wait after sending a request to join a cluster + before it considers the request to have failed and retries. Defaults to + `60s`. diff --git a/docs/reference/modules/discovery/no-master-block.asciidoc b/docs/reference/modules/discovery/no-master-block.asciidoc new file mode 100644 index 0000000000000..dc87b6745285f --- /dev/null +++ b/docs/reference/modules/discovery/no-master-block.asciidoc @@ -0,0 +1,22 @@ +[[no-master-block]] +=== No master block + +For the cluster to be fully operational, it must have an active master. The +`discovery.zen.no_master_block` settings controls what operations should be +rejected when there is no active master. 
+ +The `discovery.zen.no_master_block` setting has two valid values: + +[horizontal] +`all`:: All operations on the node--i.e. both reads and writes--will be rejected. +This also applies to API calls that read or write the cluster state, such as +the get index settings, put mapping and cluster state APIs. +`write`:: (default) Write operations will be rejected. Read operations will +succeed, based on the last known cluster configuration. This may result in +partial reads of stale data as this node may be isolated from the rest of the +cluster. + +The `discovery.zen.no_master_block` setting doesn't apply to nodes-based APIs +(for example cluster stats, node info, and node stats APIs). Requests to these +APIs will not be blocked and can run on any available node. + diff --git a/docs/reference/modules/discovery/publishing.asciidoc b/docs/reference/modules/discovery/publishing.asciidoc new file mode 100644 index 0000000000000..1baa9c2dfbc31 --- /dev/null +++ b/docs/reference/modules/discovery/publishing.asciidoc @@ -0,0 +1,39 @@ +[[cluster-state-publishing]] +=== Cluster state publishing + +The master node is the only node in a cluster that can make changes to the +cluster state. The master node processes one batch of cluster state updates at +a time, computing the required changes and publishing the updated cluster state +to all the other nodes in the cluster. Each publication starts with the master +broadcasting the updated cluster state to all nodes in the cluster, to which +each node responds with an acknowledgement but does not yet apply the +newly-received state. Once the master has collected acknowledgements from +enough master-eligible nodes the new cluster state is said to be _committed_, +and the master broadcasts another message instructing nodes to apply the +now-committed state. Each node receives this message, applies the updated +state, and then sends a second acknowledgement back to the master. + +The master allows a limited amount of time for each cluster state update to be +completely published to all nodes, defined by `cluster.publish.timeout`, which +defaults to `30s`, measured from the time the publication started. If this time +is reached before the new cluster state is committed then the cluster state +change is rejected, the master considers itself to have failed, stands down, +and starts trying to elect a new master. + +However, if the new cluster state is committed before `cluster.publish.timeout` +has elapsed, but before all acknowledgements have been received, then the +master node considers the change to have succeeded and starts processing and +publishing the next cluster state update, even though some nodes have not yet +confirmed that they have applied the current one. These nodes are said to be +_lagging_ since their cluster states have fallen behind the master's latest +state. The master waits for the lagging nodes to catch up for a further time, +`cluster.follower_lag.timeout`, which defaults to `90s`, and if a node has +still not successfully applied the cluster state update within this time then +it is considered to have failed and is removed from the cluster. + +NOTE: Elasticsearch is a peer to peer based system, in which nodes communicate +with one another directly. The high-throughput APIs (index, delete, search) do +not normally interact with the master node. The responsibility of the master +node is to maintain the global cluster state, and act if nodes join or leave +the cluster by reassigning shards. 
Each time the cluster state is changed, the +new state is published to all nodes in the cluster as described above. diff --git a/docs/reference/modules/discovery/quorums.asciidoc b/docs/reference/modules/discovery/quorums.asciidoc new file mode 100644 index 0000000000000..ff159c0535475 --- /dev/null +++ b/docs/reference/modules/discovery/quorums.asciidoc @@ -0,0 +1,181 @@ +[[modules-discovery-quorums]] +=== Quorum-based decision making + +Electing a master node and changing the cluster state are the two fundamental +tasks that master-eligible nodes must work together to perform. It is important +that these activities work robustly even if some nodes have failed, and +Elasticsearch achieves this robustness by only considering each action to have +succeeded on receipt of responses from a _quorum_, a subset of the +master-eligible nodes in the cluster. The advantage of requiring only a subset +of the nodes to respond is that it allows for some of the nodes to fail without +preventing the cluster from making progress, and the quorums are carefully +chosen so as not to allow the cluster to "split brain", i.e. to be partitioned +into two pieces each of which may make decisions that are inconsistent with +those of the other piece. + +Elasticsearch allows you to add and remove master-eligible nodes to a running +cluster. In many cases you can do this simply by starting or stopping the nodes +as required, as described in more detail in the +<>. + +As nodes are added or removed Elasticsearch maintains an optimal level of fault +tolerance by updating the cluster's _voting configuration_, which is the set of +master-eligible nodes whose responses are counted when making decisions such as +electing a new master or committing a new cluster state. A decision is only made +once more than half of the nodes in the voting configuration have responded. +Usually the voting configuration is the same as the set of all the +master-eligible nodes that are currently in the cluster, but there are some +situations in which they may be different. + +To be sure that the cluster remains available you **must not stop half or more +of the nodes in the voting configuration at the same time**. As long as more +than half of the voting nodes are available the cluster can still work normally. +This means that if there are three or four master-eligible nodes then the +cluster can tolerate one of them being unavailable; if there are two or fewer +master-eligible nodes then they must all remain available. + +After a node has joined or left the cluster the elected master must issue a +cluster-state update that adjusts the voting configuration to match, and this +can take a short time to complete. It is important to wait for this adjustment +to complete before removing more nodes from the cluster. + +[float] +==== Setting the initial quorum + +When a brand-new cluster starts up for the first time, one of the tasks it must +perform is to elect its first master node, for which it needs to know the set +of master-eligible nodes whose votes should count in this first election. This +initial voting configuration is known as the _bootstrap configuration_ and is +set in the <>. + +It is important that the bootstrap configuration identifies exactly which nodes +should vote in the first election, and it is not sufficient to configure each +node with an expectation of how many nodes there should be in the cluster. 
It +is also important to note that the bootstrap configuration must come from +outside the cluster: there is no safe way for the cluster to determine the +bootstrap configuration correctly on its own. + +If the bootstrap configuration is not set correctly then there is a risk when +starting up a brand-new cluster is that you accidentally form two separate +clusters instead of one. This could lead to data loss: you might start using +both clusters before noticing that anything had gone wrong, and it will then be +impossible to merge them together later. + +NOTE: To illustrate the problem with configuring each node to expect a certain +cluster size, imagine starting up a three-node cluster in which each node knows +that it is going to be part of a three-node cluster. A majority of three nodes +is two, so normally the first two nodes to discover each other will form a +cluster and the third node will join them a short time later. However, imagine +that four nodes were erroneously started instead of three: in this case there +are enough nodes to form two separate clusters. Of course if each node is +started manually then it's unlikely that too many nodes are started, but it's +certainly possible to get into this situation if using a more automated +orchestrator, particularly if the orchestrator is not resilient to failures +such as network partitions. + +The initial quorum is only required the very first time a whole cluster starts +up: new nodes joining an established cluster can safely obtain all the +information they need from the elected master, and nodes that have previously +been part of a cluster will have stored to disk all the information required +when restarting. + +[float] +==== Cluster maintenance, rolling restarts and migrations + +Many cluster maintenance tasks involve temporarily shutting down one or more +nodes and then starting them back up again. By default Elasticsearch can remain +available if one of its master-eligible nodes is taken offline, such as during a +<>. Furthermore, if multiple nodes are stopped +and then started again then it will automatically recover, such as during a +<>. There is no need to take any further +action with the APIs described here in these cases, because the set of master +nodes is not changing permanently. + +[float] +==== Auto-reconfiguration + +Nodes may join or leave the cluster, and Elasticsearch reacts by making +corresponding changes to the voting configuration in order to ensure that the +cluster is as resilient as possible. The default auto-reconfiguration behaviour +is expected to give the best results in most situation. The current voting +configuration is stored in the cluster state so you can inspect its current +contents as follows: + +[source,js] +-------------------------------------------------- +GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config +-------------------------------------------------- +// CONSOLE + +NOTE: The current voting configuration is not necessarily the same as the set of +all available master-eligible nodes in the cluster. Altering the voting +configuration itself involves taking a vote, so it takes some time to adjust the +configuration as nodes join or leave the cluster. Also, there are situations +where the most resilient configuration includes unavailable nodes, or does not +include some available nodes, and in these situations the voting configuration +will differ from the set of available master-eligible nodes in the cluster. 
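+
+The response to the request shown above contains only the filtered part of the
+cluster state. As a purely illustrative sketch (the node IDs here are made up;
+real IDs are opaque strings generated by Elasticsearch), it might look
+something like this:
+
+[source,js]
+--------------------------------------------------
+{
+  "metadata": {
+    "cluster_coordination": {
+      "last_committed_config": [
+        "aBcDeFgHiJkLmNoPqRsTuV",
+        "WxYzAbCdEfGhIjKlMnOpQr",
+        "StUvWxYzAbCdEfGhIjKlMn"
+      ]
+    }
+  }
+}
+--------------------------------------------------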
+ +Larger voting configurations are usually more resilient, so Elasticsearch will +normally prefer to add master-eligible nodes to the voting configuration once +they have joined the cluster. Similarly, if a node in the voting configuration +leaves the cluster and there is another master-eligible node in the cluster that +is not in the voting configuration then it is preferable to swap these two nodes +over, leaving the size of the voting configuration unchanged but increasing its +resilience. + +It is not so straightforward to automatically remove nodes from the voting +configuration after they have left the cluster, and different strategies have +different benefits and drawbacks, so the right choice depends on how the cluster +will be used and is controlled by the following setting. + +`cluster.auto_shrink_voting_configuration`:: + + Defaults to `true`, meaning that the voting configuration will automatically + shrink, shedding departed nodes, as long as it still contains at least 3 + nodes. If set to `false`, the voting configuration never automatically + shrinks; departed nodes must be removed manually using the + <>. + +NOTE: If `cluster.auto_shrink_voting_configuration` is set to `true`, the +recommended and default setting, and there are at least three master-eligible +nodes in the cluster, then Elasticsearch remains capable of processing +cluster-state updates as long as all but one of its master-eligible nodes are +healthy. + +There are situations in which Elasticsearch might tolerate the loss of multiple +nodes, but this is not guaranteed under all sequences of failures. If this +setting is set to `false` then departed nodes must be removed from the voting +configuration manually, using the +<>, to achieve +the desired level of resilience. + +Note that Elasticsearch will not suffer from a "split-brain" inconsistency +however it is configured. This setting only affects its availability in the +event of the failure of some of its nodes, and the administrative tasks that +must be performed as nodes join and leave the cluster. + +[float] +==== Even numbers of master-eligible nodes + +There should normally be an odd number of master-eligible nodes in a cluster. +If there is an even number then Elasticsearch will leave one of them out of the +voting configuration to ensure that it has an odd size. This does not decrease +the failure-tolerance of the cluster, and in fact improves it slightly: if the +cluster is partitioned into two even halves then one of the halves will contain +a majority of the voting configuration and will be able to keep operating, +whereas if all of the master-eligible nodes' votes were counted then neither +side could make any progress in this situation. + +For instance if there are four master-eligible nodes in the cluster and the +voting configuration contained all of them then any quorum-based decision would +require votes from at least three of them, which means that the cluster can only +tolerate the loss of a single master-eligible node. If this cluster were split +into two equal halves then neither half would contain three master-eligible +nodes so would not be able to make any progress. However if the voting +configuration contains only three of the four master-eligible nodes then the +cluster is still only fully tolerant to the loss of one node, but quorum-based +decisions require votes from two of the three voting nodes. In the event of an +even split, one half will contain two of the three voting nodes so will remain +available. 
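+
+As an illustration of the `cluster.auto_shrink_voting_configuration` setting
+described above, automatic shrinking could be switched off by adding the
+following to `elasticsearch.yml` on each master-eligible node. This is shown
+purely as a sketch; the default value of `true` is recommended for most
+clusters:
+
+[source,yaml]
+--------------------------------------------------
+cluster.auto_shrink_voting_configuration: false
+--------------------------------------------------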
From 0ea5488e39979d6257cd1e03c13f007c8aead115 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 18 Dec 2018 12:33:53 +0000 Subject: [PATCH 086/106] Rework the discovery module front page --- docs/reference/modules/discovery.asciidoc | 23 ++++++++++++----------- 1 file changed, 12 insertions(+), 11 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 0acceb04595ef..b3eb72dfec200 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -7,8 +7,6 @@ each time it changes. It is integrated with other modules. For example, all communication between nodes is done using the <> module. -It is separated into several sections, which are explained below: - * <> is the process where nodes find each other when the master is unknown, such as when a node has just started up or when the previous master has failed. @@ -19,27 +17,30 @@ It is separated into several sections, which are explained below: performed by the nodes themselves. As this auto-bootstrapping is <>, running a node in <> requires bootstrapping to be explicitly - configured via the `cluster.initial_master_nodes` setting. + configured via the + <>. * It is recommended to have a small and fixed number of master-eligible nodes in a cluster, and to scale the cluster up and down by adding and removing master-ineligible nodes only. However there are situations in which it may be desirable to add or remove some master-eligible nodes to or from a cluster. A section on <> describes this process as well as the extra steps that need - to be performed when removing more than half of the master-eligible nodes at - the same time. + removing master-eligible nodes>> describes this process as well as the extra + steps that need to be performed when removing more than half of the + master-eligible nodes at the same time. -* <> covers how a master - publishes cluster states to the other nodes in the cluster. +* <> is the process by + which the elected master node updates the cluster state on all the other + nodes in the cluster. * The <> is put in place when there is no known elected master, and can be configured to determine which operations should be rejected when it is in place. -* <> and <> - sections cover advanced settings to influence the election and fault - detection processes. +* The <> and <> sections cover advanced settings that influence the election and + fault detection processes. * <> explains the design behind the master election and auto-reconfiguration logic. From 00a0145a2366edf105724871e8d7d4323598c79e Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 18 Dec 2018 12:54:54 +0000 Subject: [PATCH 087/106] Better front page --- docs/reference/modules/discovery.asciidoc | 64 ++++++++++++++--------- 1 file changed, 38 insertions(+), 26 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index b3eb72dfec200..d9cc83008f5aa 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -5,45 +5,57 @@ The discovery and cluster formation module is responsible for discovering nodes, electing a master, forming a cluster, and publishing the cluster state each time it changes. It is integrated with other modules. For example, all communication between nodes is done using the <> -module. +module. 
This module is divided into the following sections: -* <> is the process where nodes - find each other when the master is unknown, such as when a node has just - started up or when the previous master has failed. +<>:: -* <> is required when an Elasticsearch - cluster starts up for the very first time. In <>, with no discovery settings configured, this is automatically - performed by the nodes themselves. As this auto-bootstrapping is + Discovery is the process where nodes find each other when the master is + unknown, such as when a node has just started up or when the previous + master has failed. + +<>:: + + Bootstrapping a cluster is required when an Elasticsearch cluster starts up + for the very first time. In <>, with no + discovery settings configured, this is automatically performed by the nodes + themselves. As this auto-bootstrapping is <>, running a node in - <> requires bootstrapping to be explicitly - configured via the + <> requires bootstrapping to be + explicitly configured via the <>. -* It is recommended to have a small and fixed number of master-eligible nodes +<>:: + + It is recommended to have a small and fixed number of master-eligible nodes in a cluster, and to scale the cluster up and down by adding and removing master-ineligible nodes only. However there are situations in which it may be desirable to add or remove some master-eligible nodes to or from a - cluster. A section on <> describes this process as well as the extra - steps that need to be performed when removing more than half of the - master-eligible nodes at the same time. + cluster. This section describes the process for adding or removing + master-eligible nodes, including the extra steps that need to be performed + when removing more than half of the master-eligible nodes at the same time. + +<>:: + + Cluster state publishing is the process by which the elected master node + updates the cluster state on all the other nodes in the cluster. + +<>:: + + The no-master block is put in place when there is no known elected master, + and can be configured to determine which operations should be rejected when + it is in place. -* <> is the process by - which the elected master node updates the cluster state on all the other - nodes in the cluster. +Advanced settings:: -* The <> is put in place when there is no - known elected master, and can be configured to determine which operations - should be rejected when it is in place. + There are settings that allow advanced users to influence the + <> and <> + processes. -* The <> and <> sections cover advanced settings that influence the election and - fault detection processes. +<>:: -* <> explains the - design behind the master election and auto-reconfiguration logic. + This section describes the detailed design behind the master election and + auto-reconfiguration logic. 
include::discovery/adding-removing-nodes.asciidoc[] From 9cdc18f2897b5403e70b8c67968c9dcccab06eb9 Mon Sep 17 00:00:00 2001 From: David Turner Date: Tue, 18 Dec 2018 13:14:50 +0000 Subject: [PATCH 088/106] Reorder sections --- docs/reference/modules/discovery.asciidoc | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index d9cc83008f5aa..225c6e28d8705 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -57,19 +57,19 @@ Advanced settings:: This section describes the detailed design behind the master election and auto-reconfiguration logic. -include::discovery/adding-removing-nodes.asciidoc[] +include::discovery/discovery.asciidoc[] include::discovery/bootstrapping.asciidoc[] -include::discovery/discovery.asciidoc[] - -include::discovery/fault-detection.asciidoc[] +include::discovery/adding-removing-nodes.asciidoc[] -include::discovery/master-election.asciidoc[] +include::discovery/publishing.asciidoc[] include::discovery/no-master-block.asciidoc[] -include::discovery/publishing.asciidoc[] +include::discovery/master-election.asciidoc[] + +include::discovery/fault-detection.asciidoc[] include::discovery/quorums.asciidoc[] From 4b55b1e625bcc90711d9721406ff2d8bef00482e Mon Sep 17 00:00:00 2001 From: Lisa Cawley Date: Thu, 20 Dec 2018 08:51:42 +0000 Subject: [PATCH 089/106] Suggested changes to adding & removing nodes Co-Authored-By: DaveCTurner --- .../discovery/adding-removing-nodes.asciidoc | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/docs/reference/modules/discovery/adding-removing-nodes.asciidoc b/docs/reference/modules/discovery/adding-removing-nodes.asciidoc index 0e759acb4972a..5a3431bf13ca0 100644 --- a/docs/reference/modules/discovery/adding-removing-nodes.asciidoc +++ b/docs/reference/modules/discovery/adding-removing-nodes.asciidoc @@ -3,7 +3,7 @@ As nodes are added or removed Elasticsearch maintains an optimal level of fault tolerance by automatically updating the cluster's _voting configuration_, which -is the set of master-eligible nodes whose responses are counted when making +is the set of <> whose responses are counted when making decisions such as electing a new master or committing a new cluster state. It is recommended to have a small and fixed number of master-eligible nodes in a @@ -28,13 +28,13 @@ the cluster to <> the voting configuration and adapt the fault tolerance level to the new set of nodes. If there are only two master-eligible nodes remaining then neither node can be -safely removed since both are required to reliably make progress, so you must +safely removed since both are required to reliably make progress. You must first inform Elasticsearch that one of the nodes should not be part of the voting configuration, and that the voting power should instead be given to -other nodes, allowing the excluded node to be taken offline without preventing +other nodes. You can then take the excluded node offline without preventing the other node from making progress. A node which is added to a voting -configuration exclusion list still works normally, but Elasticsearch will try -and remove it from the voting configuration so its vote is no longer required. +configuration exclusion list still works normally, but Elasticsearch +tries to remove it from the voting configuration so its vote is no longer required. 
Importantly, Elasticsearch will never automatically move a node on the voting exclusions list back into the voting configuration. Once an excluded node has been successfully auto-reconfigured out of the voting configuration, it is safe @@ -58,7 +58,7 @@ POST /_cluster/voting_config_exclusions/node_name?timeout=1m The node that should be added to the exclusions list is specified using <> in place of `node_name` here. If a call to the -voting configuration exclusions API fails then the call can safely be retried. +voting configuration exclusions API fails, you can safely retry it. Only a successful response guarantees that the node has actually been removed from the voting configuration and will not be reinstated. @@ -103,8 +103,8 @@ maintenance is complete. Clusters should have no voting configuration exclusions in normal operation. If a node is excluded from the voting configuration because it is to be shut -down permanently then its exclusion can be removed once it has shut down and -been removed from the cluster. Exclusions can also be cleared if they were +down permanently, its exclusion can be removed after it is shut down and +removed from the cluster. Exclusions can also be cleared if they were created in error or were only required temporarily: [source,js] From 6778c0a5176f380d78e49dffd80c7c084071733f Mon Sep 17 00:00:00 2001 From: Lisa Cawley Date: Thu, 20 Dec 2018 08:53:24 +0000 Subject: [PATCH 090/106] Suggested changes to bootstrapping doc Co-Authored-By: DaveCTurner --- .../reference/modules/discovery/bootstrapping.asciidoc | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/reference/modules/discovery/bootstrapping.asciidoc b/docs/reference/modules/discovery/bootstrapping.asciidoc index ffcab074e6724..f5b4cfddf9d8f 100644 --- a/docs/reference/modules/discovery/bootstrapping.asciidoc +++ b/docs/reference/modules/discovery/bootstrapping.asciidoc @@ -2,11 +2,11 @@ === Bootstrapping a cluster Starting an Elasticsearch cluster for the very first time requires the initial -set of master-eligible nodes to be explicitly set on one or more of the +set of <> to be explicitly defined on one or more of the master-eligible nodes in the cluster. This is known as _cluster bootstrapping_. This is only required the very first time the cluster starts up: nodes that -have already joined a cluster will store this information in their data folder, -and freshly-started nodes that are intended to join an existing cluster will +have already joined a cluster store this information in their data folder +and freshly-started nodes that are joining an existing cluster obtain this information from the cluster's elected master. This information is given using this setting: @@ -81,9 +81,9 @@ WARNING: You must put exactly the same set of initial master nodes in each [float] ==== Choosing a cluster name -The `cluster.name` allows you to create multiple clusters which are separated +The <> setting enables you to create multiple clusters which are separated from each other. Nodes verify that they agree on their cluster name when they -first connect to each other, and if two nodes have different cluster names then +first connect to each other. If two nodes have different cluster names, they will not communicate meaningfully and will not belong to the same cluster. The default value for the cluster name is `elasticsearch`, but it is recommended to change this to reflect the logical name of the cluster. 
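+
+As a brief illustration (the name shown is just an example), the cluster name
+could be set in `elasticsearch.yml` as follows:
+
+[source,yaml]
+--------------------------------------------------
+cluster.name: logging-prod
+--------------------------------------------------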
From b349806554ebc1031eaa3c1d945ef880399a3a79 Mon Sep 17 00:00:00 2001 From: Lisa Cawley Date: Thu, 20 Dec 2018 08:55:48 +0000 Subject: [PATCH 091/106] Suggested changes to discovery docs Co-Authored-By: DaveCTurner --- .../modules/discovery/discovery.asciidoc | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/docs/reference/modules/discovery/discovery.asciidoc b/docs/reference/modules/discovery/discovery.asciidoc index c41a3e0dadbb9..9ecc21c0581de 100644 --- a/docs/reference/modules/discovery/discovery.asciidoc +++ b/docs/reference/modules/discovery/discovery.asciidoc @@ -3,8 +3,8 @@ The cluster formation module uses a list of _seed_ nodes in order to start off the discovery process. When you start an Elasticsearch node, or when a node -believes the master node to have failed, that node tries to connect to each -seed node in its list, and once connected the two nodes will repeatedly share +believes the master node failed, that node tries to connect to each +seed node in its list. After a connection occurs, the two nodes repeatedly share information about the other known master-eligible nodes in the cluster in order to build a complete picture of the cluster. By default the cluster formation module offers two hosts providers to configure the list of seed nodes: a @@ -19,17 +19,15 @@ list. [[settings-based-hosts-provider]] ===== Settings-based hosts provider -The settings-based hosts provider use a node setting to configure a static list +The settings-based hosts provider uses a node setting to configure a static list of hosts to use as seed nodes. These hosts can be specified as hostnames or IP addresses; hosts specified as hostnames are resolved to IP addresses during each round of discovery. Note that if you are in an environment where DNS resolutions vary with time, you might need to adjust your <>. -The list of hosts is set using the `discovery.zen.ping.unicast.hosts` static -setting. This is either an array of hosts or a comma-delimited string. Each -value should be in the form of `host:port` or `host` (where `port` defaults to -the setting `transport.profiles.default.port` falling back to `transport.port` +The list of hosts is set using the <> static +setting. For example: if not set). Note that IPv6 hosts must be bracketed. The default for this setting is `127.0.0.1, [::1]` @@ -75,8 +73,8 @@ below. Any time a change is made to the `unicast_hosts.txt` file the new changes will be picked up by Elasticsearch and the new hosts list will be used. Note that the file-based discovery plugin augments the unicast hosts list in -`elasticsearch.yml`: if there are valid unicast host entries in -`discovery.zen.ping.unicast.hosts` then they will be used in addition to those +`elasticsearch.yml`. If there are valid unicast host entries in +`discovery.zen.ping.unicast.hosts`, they are used in addition to those supplied in `unicast_hosts.txt`. The `discovery.zen.ping.unicast.hosts.resolve_timeout` setting also applies to @@ -106,7 +104,7 @@ Host names are allowed instead of IP addresses (similar to `discovery.zen.ping.unicast.hosts`), and IPv6 addresses must be specified in brackets with the port coming after the brackets. -It is also possible to add comments to this file. All comments must appear on +You can also add comments to this file. All comments must appear on their lines starting with `#` (i.e. comments cannot start in the middle of a line). 
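+
+As a small illustration of the DNS resolution timeout mentioned above, the
+setting could be adjusted in `elasticsearch.yml` as follows (the value of `10s`
+is purely an example; the default is `5s`):
+
+[source,yaml]
+--------------------------------------------------
+discovery.zen.ping.unicast.hosts.resolve_timeout: 10s
+--------------------------------------------------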
From ccba8ba780f248024bd7dff3c1121e23ac91baf3 Mon Sep 17 00:00:00 2001 From: Lisa Cawley Date: Thu, 20 Dec 2018 08:56:32 +0000 Subject: [PATCH 092/106] Apply suggestions from code review Co-Authored-By: DaveCTurner --- docs/reference/modules/discovery/discovery.asciidoc | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/reference/modules/discovery/discovery.asciidoc b/docs/reference/modules/discovery/discovery.asciidoc index 9ecc21c0581de..9daab9723ac0e 100644 --- a/docs/reference/modules/discovery/discovery.asciidoc +++ b/docs/reference/modules/discovery/discovery.asciidoc @@ -28,8 +28,6 @@ security settings>>. The list of hosts is set using the <> static setting. For example: -if not set). Note that IPv6 hosts must be bracketed. The default for this -setting is `127.0.0.1, [::1]` [source,yaml] -------------------------------------------------- From 8835ef244a3013ed0c6fa49ec53c5805587d0bac Mon Sep 17 00:00:00 2001 From: Lisa Cawley Date: Thu, 20 Dec 2018 08:58:00 +0000 Subject: [PATCH 093/106] Suggested title changes Co-Authored-By: DaveCTurner --- docs/reference/modules/discovery/fault-detection.asciidoc | 2 +- docs/reference/modules/discovery/master-election.asciidoc | 2 +- docs/reference/modules/discovery/no-master-block.asciidoc | 2 +- docs/reference/modules/discovery/publishing.asciidoc | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/reference/modules/discovery/fault-detection.asciidoc b/docs/reference/modules/discovery/fault-detection.asciidoc index 6d9fa0587c513..5805fbce60477 100644 --- a/docs/reference/modules/discovery/fault-detection.asciidoc +++ b/docs/reference/modules/discovery/fault-detection.asciidoc @@ -1,5 +1,5 @@ [[fault-detection]] -=== Fault Detection +=== Cluster fault detection settings An elected master periodically checks each of the nodes in the cluster in order to ensure that they are still connected and healthy, and in turn each node in diff --git a/docs/reference/modules/discovery/master-election.asciidoc b/docs/reference/modules/discovery/master-election.asciidoc index 6d3f5b9e34b5f..bf4d5f0d28871 100644 --- a/docs/reference/modules/discovery/master-election.asciidoc +++ b/docs/reference/modules/discovery/master-election.asciidoc @@ -1,5 +1,5 @@ [[master-election]] -=== Master Election +=== Master election settings Elasticsearch uses an election process to agree on an elected master node, both at startup and if the existing elected master fails. Any master-eligible node diff --git a/docs/reference/modules/discovery/no-master-block.asciidoc b/docs/reference/modules/discovery/no-master-block.asciidoc index dc87b6745285f..3099aaf66796d 100644 --- a/docs/reference/modules/discovery/no-master-block.asciidoc +++ b/docs/reference/modules/discovery/no-master-block.asciidoc @@ -1,5 +1,5 @@ [[no-master-block]] -=== No master block +=== No master block settings For the cluster to be fully operational, it must have an active master. The `discovery.zen.no_master_block` settings controls what operations should be diff --git a/docs/reference/modules/discovery/publishing.asciidoc b/docs/reference/modules/discovery/publishing.asciidoc index 1baa9c2dfbc31..b88bd9e327197 100644 --- a/docs/reference/modules/discovery/publishing.asciidoc +++ b/docs/reference/modules/discovery/publishing.asciidoc @@ -1,5 +1,5 @@ [[cluster-state-publishing]] -=== Cluster state publishing +=== Publishing the cluster state The master node is the only node in a cluster that can make changes to the cluster state. 
The master node processes one batch of cluster state updates at From 06439292fe7922ad428b49d435973622f20b1264 Mon Sep 17 00:00:00 2001 From: Lisa Cawley Date: Thu, 20 Dec 2018 09:01:34 +0000 Subject: [PATCH 094/106] Suggested changes to publishing docs Co-Authored-By: DaveCTurner --- .../modules/discovery/publishing.asciidoc | 22 +++++++++---------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/docs/reference/modules/discovery/publishing.asciidoc b/docs/reference/modules/discovery/publishing.asciidoc index b88bd9e327197..2d25ad2b29c0f 100644 --- a/docs/reference/modules/discovery/publishing.asciidoc +++ b/docs/reference/modules/discovery/publishing.asciidoc @@ -5,35 +5,35 @@ The master node is the only node in a cluster that can make changes to the cluster state. The master node processes one batch of cluster state updates at a time, computing the required changes and publishing the updated cluster state to all the other nodes in the cluster. Each publication starts with the master -broadcasting the updated cluster state to all nodes in the cluster, to which -each node responds with an acknowledgement but does not yet apply the +broadcasting the updated cluster state to all nodes in the cluster. +Each node responds with an acknowledgement but does not yet apply the newly-received state. Once the master has collected acknowledgements from -enough master-eligible nodes the new cluster state is said to be _committed_, +enough master-eligible nodes, the new cluster state is said to be _committed_ and the master broadcasts another message instructing nodes to apply the now-committed state. Each node receives this message, applies the updated state, and then sends a second acknowledgement back to the master. The master allows a limited amount of time for each cluster state update to be -completely published to all nodes, defined by `cluster.publish.timeout`, which +completely published to all nodes. It is defined by the `cluster.publish.timeout` setting, which defaults to `30s`, measured from the time the publication started. If this time is reached before the new cluster state is committed then the cluster state -change is rejected, the master considers itself to have failed, stands down, +change is rejected and the master considers itself to have failed. It stands down and starts trying to elect a new master. -However, if the new cluster state is committed before `cluster.publish.timeout` +If the new cluster state is committed before `cluster.publish.timeout` has elapsed, but before all acknowledgements have been received, then the master node considers the change to have succeeded and starts processing and -publishing the next cluster state update, even though some nodes have not yet -confirmed that they have applied the current one. These nodes are said to be +publishing the next cluster state update. If some acknowledgements have not been received (i.e. some nodes have not yet +confirmed that they have applied the current update), these nodes are said to be _lagging_ since their cluster states have fallen behind the master's latest state. The master waits for the lagging nodes to catch up for a further time, -`cluster.follower_lag.timeout`, which defaults to `90s`, and if a node has +`cluster.follower_lag.timeout`, which defaults to `90s`. If a node has still not successfully applied the cluster state update within this time then it is considered to have failed and is removed from the cluster. 
NOTE: Elasticsearch is a peer to peer based system, in which nodes communicate with one another directly. The high-throughput APIs (index, delete, search) do not normally interact with the master node. The responsibility of the master -node is to maintain the global cluster state, and act if nodes join or leave -the cluster by reassigning shards. Each time the cluster state is changed, the +node is to maintain the global cluster state and reassign shards when nodes join or leave +the cluster. Each time the cluster state is changed, the new state is published to all nodes in the cluster as described above. From 442a7a76934601ab45656bed75e874a02bdfb003 Mon Sep 17 00:00:00 2001 From: David Turner Date: Thu, 20 Dec 2018 09:14:54 +0000 Subject: [PATCH 095/106] Further updates to publishing.asciidoc --- .../modules/discovery/publishing.asciidoc | 47 ++++++++++--------- 1 file changed, 25 insertions(+), 22 deletions(-) diff --git a/docs/reference/modules/discovery/publishing.asciidoc b/docs/reference/modules/discovery/publishing.asciidoc index 2d25ad2b29c0f..8c69290edc706 100644 --- a/docs/reference/modules/discovery/publishing.asciidoc +++ b/docs/reference/modules/discovery/publishing.asciidoc @@ -5,31 +5,34 @@ The master node is the only node in a cluster that can make changes to the cluster state. The master node processes one batch of cluster state updates at a time, computing the required changes and publishing the updated cluster state to all the other nodes in the cluster. Each publication starts with the master -broadcasting the updated cluster state to all nodes in the cluster. -Each node responds with an acknowledgement but does not yet apply the -newly-received state. Once the master has collected acknowledgements from -enough master-eligible nodes, the new cluster state is said to be _committed_ -and the master broadcasts another message instructing nodes to apply the -now-committed state. Each node receives this message, applies the updated -state, and then sends a second acknowledgement back to the master. +broadcasting the updated cluster state to all nodes in the cluster. Each node +responds with an acknowledgement but does not yet apply the newly-received +state. Once the master has collected acknowledgements from enough +master-eligible nodes, the new cluster state is said to be _committed_ and the +master broadcasts another message instructing nodes to apply the now-committed +state. Each node receives this message, applies the updated state, and then +sends a second acknowledgement back to the master. The master allows a limited amount of time for each cluster state update to be -completely published to all nodes. It is defined by the `cluster.publish.timeout` setting, which -defaults to `30s`, measured from the time the publication started. If this time -is reached before the new cluster state is committed then the cluster state -change is rejected and the master considers itself to have failed. It stands down -and starts trying to elect a new master. +completely published to all nodes. It is defined by the +`cluster.publish.timeout` setting, which defaults to `30s`, measured from the +time the publication started. If this time is reached before the new cluster +state is committed then the cluster state change is rejected and the master +considers itself to have failed. It stands down and starts trying to elect a +new master. 
-If the new cluster state is committed before `cluster.publish.timeout` -has elapsed, but before all acknowledgements have been received, then the -master node considers the change to have succeeded and starts processing and -publishing the next cluster state update. If some acknowledgements have not been received (i.e. some nodes have not yet -confirmed that they have applied the current update), these nodes are said to be -_lagging_ since their cluster states have fallen behind the master's latest -state. The master waits for the lagging nodes to catch up for a further time, -`cluster.follower_lag.timeout`, which defaults to `90s`. If a node has -still not successfully applied the cluster state update within this time then -it is considered to have failed and is removed from the cluster. +If the new cluster state is committed before `cluster.publish.timeout` has +elapsed, the master node considers the change to have succeeded. It waits until +the timeout has elapsed or until it has received acknowledgements that each +node in the cluster has applied the updated state, and then starts processing +and publishing the next cluster state update. If some acknowledgements have not +been received (i.e. some nodes have not yet confirmed that they have applied +the current update), these nodes are said to be _lagging_ since their cluster +states have fallen behind the master's latest state. The master waits for the +lagging nodes to catch up for a further time, `cluster.follower_lag.timeout`, +which defaults to `90s`. If a node has still not successfully applied the +cluster state update within this time then it is considered to have failed and +is removed from the cluster. NOTE: Elasticsearch is a peer to peer based system, in which nodes communicate with one another directly. The high-throughput APIs (index, delete, search) do From 14194c60815dfa82f6b97902cd9ee8e812f28642 Mon Sep 17 00:00:00 2001 From: Lisa Cawley Date: Thu, 20 Dec 2018 09:25:31 +0000 Subject: [PATCH 096/106] Suggested changes to quorums.asciidoc Co-Authored-By: DaveCTurner --- .../modules/discovery/quorums.asciidoc | 113 +++++++++--------- 1 file changed, 56 insertions(+), 57 deletions(-) diff --git a/docs/reference/modules/discovery/quorums.asciidoc b/docs/reference/modules/discovery/quorums.asciidoc index ff159c0535475..8cbf60545bc65 100644 --- a/docs/reference/modules/discovery/quorums.asciidoc +++ b/docs/reference/modules/discovery/quorums.asciidoc @@ -3,37 +3,36 @@ Electing a master node and changing the cluster state are the two fundamental tasks that master-eligible nodes must work together to perform. It is important -that these activities work robustly even if some nodes have failed, and -Elasticsearch achieves this robustness by only considering each action to have -succeeded on receipt of responses from a _quorum_, a subset of the +that these activities work robustly even if some nodes have failed. +Elasticsearch achieves this robustness by considering each action to have +succeeded on receipt of responses from a _quorum_, which is a subset of the master-eligible nodes in the cluster. The advantage of requiring only a subset -of the nodes to respond is that it allows for some of the nodes to fail without -preventing the cluster from making progress, and the quorums are carefully -chosen so as not to allow the cluster to "split brain", i.e. 
to be partitioned -into two pieces each of which may make decisions that are inconsistent with +of the nodes to respond is that it means some of the nodes can fail without +preventing the cluster from making progress. The quorums are carefully +chosen so the cluster does not have a "split brain" scenario where it's partitioned +into two pieces--each of which may make decisions that are inconsistent with those of the other piece. Elasticsearch allows you to add and remove master-eligible nodes to a running cluster. In many cases you can do this simply by starting or stopping the nodes -as required, as described in more detail in the -<>. +as required. See +<>. As nodes are added or removed Elasticsearch maintains an optimal level of fault tolerance by updating the cluster's _voting configuration_, which is the set of master-eligible nodes whose responses are counted when making decisions such as -electing a new master or committing a new cluster state. A decision is only made -once more than half of the nodes in the voting configuration have responded. +electing a new master or committing a new cluster state. A decision is made +only after more than half of the nodes in the voting configuration have responded. Usually the voting configuration is the same as the set of all the -master-eligible nodes that are currently in the cluster, but there are some +master-eligible nodes that are currently in the cluster. However, there are some situations in which they may be different. To be sure that the cluster remains available you **must not stop half or more of the nodes in the voting configuration at the same time**. As long as more than half of the voting nodes are available the cluster can still work normally. -This means that if there are three or four master-eligible nodes then the -cluster can tolerate one of them being unavailable; if there are two or fewer -master-eligible nodes then they must all remain available. +This means that if there are three or four master-eligible nodes, the +cluster can tolerate one of them being unavailable. If there are two or fewer +master-eligible nodes, they must all remain available. After a node has joined or left the cluster the elected master must issue a cluster-state update that adjusts the voting configuration to match, and this @@ -43,43 +42,43 @@ to complete before removing more nodes from the cluster. [float] ==== Setting the initial quorum -When a brand-new cluster starts up for the first time, one of the tasks it must -perform is to elect its first master node, for which it needs to know the set -of master-eligible nodes whose votes should count in this first election. This +When a brand-new cluster starts up for the first time, it must +elect its first master node. To do this election, it needs to know the set +of master-eligible nodes whose votes should count. This initial voting configuration is known as the _bootstrap configuration_ and is set in the <>. It is important that the bootstrap configuration identifies exactly which nodes -should vote in the first election, and it is not sufficient to configure each +should vote in the first election. It is not sufficient to configure each node with an expectation of how many nodes there should be in the cluster. It is also important to note that the bootstrap configuration must come from outside the cluster: there is no safe way for the cluster to determine the bootstrap configuration correctly on its own. 
-If the bootstrap configuration is not set correctly then there is a risk when -starting up a brand-new cluster is that you accidentally form two separate -clusters instead of one. This could lead to data loss: you might start using -both clusters before noticing that anything had gone wrong, and it will then be +If the bootstrap configuration is not set correctly, when +you start a brand-new cluster there is a risk that you will accidentally form two separate +clusters instead of one. This situation can lead to data loss: you might start using +both clusters before you notice that anything has gone wrong and it is impossible to merge them together later. NOTE: To illustrate the problem with configuring each node to expect a certain cluster size, imagine starting up a three-node cluster in which each node knows that it is going to be part of a three-node cluster. A majority of three nodes -is two, so normally the first two nodes to discover each other will form a -cluster and the third node will join them a short time later. However, imagine -that four nodes were erroneously started instead of three: in this case there +is two, so normally the first two nodes to discover each other form a +cluster and the third node joins them a short time later. However, imagine +that four nodes were erroneously started instead of three. In this case, there are enough nodes to form two separate clusters. Of course if each node is -started manually then it's unlikely that too many nodes are started, but it's -certainly possible to get into this situation if using a more automated -orchestrator, particularly if the orchestrator is not resilient to failures +started manually then it's unlikely that too many nodes are started. If you're using an automated orchestrator, however, it's +certainly possible to get into this situation-- +particularly if the orchestrator is not resilient to failures such as network partitions. The initial quorum is only required the very first time a whole cluster starts -up: new nodes joining an established cluster can safely obtain all the -information they need from the elected master, and nodes that have previously -been part of a cluster will have stored to disk all the information required -when restarting. +up. New nodes joining an established cluster can safely obtain all the +information they need from the elected master. Nodes that have previously +been part of a cluster will have stored to disk all the information that is required +when they restart. [float] ==== Cluster maintenance, rolling restarts and migrations @@ -99,7 +98,7 @@ nodes is not changing permanently. Nodes may join or leave the cluster, and Elasticsearch reacts by making corresponding changes to the voting configuration in order to ensure that the cluster is as resilient as possible. The default auto-reconfiguration behaviour -is expected to give the best results in most situation. The current voting +is expected to give the best results in most situations. The current voting configuration is stored in the cluster state so you can inspect its current contents as follows: @@ -111,24 +110,24 @@ GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_con NOTE: The current voting configuration is not necessarily the same as the set of all available master-eligible nodes in the cluster. 
Altering the voting -configuration itself involves taking a vote, so it takes some time to adjust the +configuration involves taking a vote, so it takes some time to adjust the configuration as nodes join or leave the cluster. Also, there are situations where the most resilient configuration includes unavailable nodes, or does not include some available nodes, and in these situations the voting configuration -will differ from the set of available master-eligible nodes in the cluster. +differs from the set of available master-eligible nodes in the cluster. -Larger voting configurations are usually more resilient, so Elasticsearch will -normally prefer to add master-eligible nodes to the voting configuration once -they have joined the cluster. Similarly, if a node in the voting configuration +Larger voting configurations are usually more resilient, so Elasticsearch +normally prefers to add master-eligible nodes to the voting configuration after +they join the cluster. Similarly, if a node in the voting configuration leaves the cluster and there is another master-eligible node in the cluster that is not in the voting configuration then it is preferable to swap these two nodes -over, leaving the size of the voting configuration unchanged but increasing its -resilience. +over. The size of the voting configuration is thus unchanged but its +resilience increases. It is not so straightforward to automatically remove nodes from the voting -configuration after they have left the cluster, and different strategies have +configuration after they have left the cluster. Different strategies have different benefits and drawbacks, so the right choice depends on how the cluster -will be used and is controlled by the following setting. +will be used. You can control whether the voting configuration automatically shrinks by using the following setting: `cluster.auto_shrink_voting_configuration`:: @@ -151,8 +150,8 @@ configuration manually, using the <>, to achieve the desired level of resilience. -Note that Elasticsearch will not suffer from a "split-brain" inconsistency -however it is configured. This setting only affects its availability in the +No matter how it is configured, Elasticsearch will not suffer from a "split-brain" inconsistency. +The `cluster.auto_shrink_voting_configuration` setting affects only its availability in the event of the failure of some of its nodes, and the administrative tasks that must be performed as nodes join and leave the cluster. @@ -160,21 +159,21 @@ must be performed as nodes join and leave the cluster. ==== Even numbers of master-eligible nodes There should normally be an odd number of master-eligible nodes in a cluster. -If there is an even number then Elasticsearch will leave one of them out of the -voting configuration to ensure that it has an odd size. This does not decrease -the failure-tolerance of the cluster, and in fact improves it slightly: if the +If there is an even number, Elasticsearch leaves one of them out of the +voting configuration to ensure that it has an odd size. This omission does not decrease +the failure-tolerance of the cluster. In fact, improves it slightly: if the cluster is partitioned into two even halves then one of the halves will contain -a majority of the voting configuration and will be able to keep operating, -whereas if all of the master-eligible nodes' votes were counted then neither +a majority of the voting configuration and will be able to keep operating. 
+If all of the master-eligible nodes' votes were counted, neither side could make any progress in this situation. For instance if there are four master-eligible nodes in the cluster and the -voting configuration contained all of them then any quorum-based decision would -require votes from at least three of them, which means that the cluster can only -tolerate the loss of a single master-eligible node. If this cluster were split -into two equal halves then neither half would contain three master-eligible -nodes so would not be able to make any progress. However if the voting -configuration contains only three of the four master-eligible nodes then the +voting configuration contained all of them, any quorum-based decision would +require votes from at least three of them. This situation means that the cluster can +tolerate the loss of only a single master-eligible node. If this cluster were split +into two equal halves, neither half would contain three master-eligible +nodes and the cluster would not be able to make any progress. If the voting +configuration contains only three of the four master-eligible nodes, however, the cluster is still only fully tolerant to the loss of one node, but quorum-based decisions require votes from two of the three voting nodes. In the event of an even split, one half will contain two of the three voting nodes so will remain From e466ed0b4a189179a8fb4aeac9fdbc669880d9b0 Mon Sep 17 00:00:00 2001 From: David Turner Date: Thu, 20 Dec 2018 10:59:14 +0000 Subject: [PATCH 097/106] Add headings --- .../discovery/adding-removing-nodes.asciidoc | 52 ++++++++++--------- 1 file changed, 28 insertions(+), 24 deletions(-) diff --git a/docs/reference/modules/discovery/adding-removing-nodes.asciidoc b/docs/reference/modules/discovery/adding-removing-nodes.asciidoc index 5a3431bf13ca0..d40e903fa88f1 100644 --- a/docs/reference/modules/discovery/adding-removing-nodes.asciidoc +++ b/docs/reference/modules/discovery/adding-removing-nodes.asciidoc @@ -3,18 +3,23 @@ As nodes are added or removed Elasticsearch maintains an optimal level of fault tolerance by automatically updating the cluster's _voting configuration_, which -is the set of <> whose responses are counted when making -decisions such as electing a new master or committing a new cluster state. +is the set of <> whose responses are counted +when making decisions such as electing a new master or committing a new cluster +state. It is recommended to have a small and fixed number of master-eligible nodes in a cluster, and to scale the cluster up and down by adding and removing master-ineligible nodes only. However there are situations in which it may be desirable to add or remove some master-eligible nodes to or from a cluster. +==== Adding master-eligible nodes + If you wish to add some master-eligible nodes to your cluster, simply configure the new nodes to find the existing cluster and start them up. Elasticsearch will add the new nodes to the voting configuration if it is appropriate to do so. +==== Removing master-eligible nodes + When removing master-eligible nodes, it is important not to remove too many all at the same time. For instance, if there are currently seven master-eligible nodes and you wish to reduce this to three, it is not possible simply to stop @@ -24,23 +29,22 @@ cannot take any further actions. 
As long as there are at least three master-eligible nodes in the cluster, as a general rule it is best to remove nodes one-at-a-time, allowing enough time for -the cluster to <> the voting +the cluster to <> the voting configuration and adapt the fault tolerance level to the new set of nodes. If there are only two master-eligible nodes remaining then neither node can be -safely removed since both are required to reliably make progress. You must -first inform Elasticsearch that one of the nodes should not be part of the -voting configuration, and that the voting power should instead be given to -other nodes. You can then take the excluded node offline without preventing -the other node from making progress. A node which is added to a voting -configuration exclusion list still works normally, but Elasticsearch -tries to remove it from the voting configuration so its vote is no longer required. -Importantly, Elasticsearch will never automatically move a node on the voting -exclusions list back into the voting configuration. Once an excluded node has -been successfully auto-reconfigured out of the voting configuration, it is safe -to shut it down without affecting the cluster's master-level availability. A -node can be added to the voting configuration exclusion list using the -following API: +safely removed since both are required to reliably make progress. You must first +inform Elasticsearch that one of the nodes should not be part of the voting +configuration, and that the voting power should instead be given to other nodes. +You can then take the excluded node offline without preventing the other node +from making progress. A node which is added to a voting configuration exclusion +list still works normally, but Elasticsearch tries to remove it from the voting +configuration so its vote is no longer required. Importantly, Elasticsearch +will never automatically move a node on the voting exclusions list back into the +voting configuration. Once an excluded node has been successfully +auto-reconfigured out of the voting configuration, it is safe to shut it down +without affecting the cluster's master-level availability. A node can be added +to the voting configuration exclusion list using the following API: [source,js] -------------------------------------------------- @@ -58,14 +62,14 @@ POST /_cluster/voting_config_exclusions/node_name?timeout=1m The node that should be added to the exclusions list is specified using <> in place of `node_name` here. If a call to the -voting configuration exclusions API fails, you can safely retry it. -Only a successful response guarantees that the node has actually been removed -from the voting configuration and will not be reinstated. +voting configuration exclusions API fails, you can safely retry it. Only a +successful response guarantees that the node has actually been removed from the +voting configuration and will not be reinstated. Although the voting configuration exclusions API is most useful for down-scaling a two-node to a one-node cluster, it is also possible to use it to remove -multiple master-eligible nodes all at the same time. Adding multiple nodes -to the exclusions list has the system try to auto-reconfigure all of these nodes +multiple master-eligible nodes all at the same time. Adding multiple nodes to +the exclusions list has the system try to auto-reconfigure all of these nodes out of the voting configuration, allowing them to be safely shut down while keeping the cluster available. 
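Before shutting down an excluded node it can be useful to confirm that the exclusion has taken effect. The cluster state should expose this under its `metadata.cluster_coordination` section; the exact filter path below is an assumption based on how the committed voting configuration is exposed elsewhere in these docs:

[source,js]
--------------------------------------------------
GET /_cluster/state?filter_path=metadata.cluster_coordination.voting_config_exclusions
--------------------------------------------------

If a node appears in this list and is no longer part of the committed voting configuration, it should be safe to stop it.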
In the example described above, shrinking a seven-master-node cluster down to only have three master nodes, you could add @@ -103,9 +107,9 @@ maintenance is complete. Clusters should have no voting configuration exclusions in normal operation. If a node is excluded from the voting configuration because it is to be shut -down permanently, its exclusion can be removed after it is shut down and -removed from the cluster. Exclusions can also be cleared if they were -created in error or were only required temporarily: +down permanently, its exclusion can be removed after it is shut down and removed +from the cluster. Exclusions can also be cleared if they were created in error +or were only required temporarily: [source,js] -------------------------------------------------- From 8e34a77b9d4d4643f5b6275ae1446f1c2d2583d7 Mon Sep 17 00:00:00 2001 From: David Turner Date: Thu, 20 Dec 2018 11:05:37 +0000 Subject: [PATCH 098/106] Move recommendation up in bootstrapping doc --- .../modules/discovery/bootstrapping.asciidoc | 95 +++++++++---------- 1 file changed, 45 insertions(+), 50 deletions(-) diff --git a/docs/reference/modules/discovery/bootstrapping.asciidoc b/docs/reference/modules/discovery/bootstrapping.asciidoc index f5b4cfddf9d8f..4b5aa532d4874 100644 --- a/docs/reference/modules/discovery/bootstrapping.asciidoc +++ b/docs/reference/modules/discovery/bootstrapping.asciidoc @@ -2,11 +2,11 @@ === Bootstrapping a cluster Starting an Elasticsearch cluster for the very first time requires the initial -set of <> to be explicitly defined on one or more of the -master-eligible nodes in the cluster. This is known as _cluster bootstrapping_. -This is only required the very first time the cluster starts up: nodes that -have already joined a cluster store this information in their data folder -and freshly-started nodes that are joining an existing cluster +set of <> to be explicitly defined on one or +more of the master-eligible nodes in the cluster. This is known as _cluster +bootstrapping_. This is only required the very first time the cluster starts +up: nodes that have already joined a cluster store this information in their +data folder and freshly-started nodes that are joining an existing cluster obtain this information from the cluster's elected master. This information is given using this setting: @@ -17,15 +17,29 @@ given using this setting: this list is empty, meaning that this node expects to join a cluster that has already been bootstrapped. -This setting can be given on the command line when starting up each -master-eligible node, or added to the `elasticsearch.yml` configuration file on -those nodes. Once the cluster has formed this setting is no longer required and -is ignored. It need not be set on master-ineligible nodes, nor on -master-eligible nodes that are started to join an existing cluster. Note that -master-eligible nodes should use storage that persists across restarts. If they -do not, and `cluster.initial_master_nodes` is set, and a full cluster restart -occurs, then another brand-new cluster will form and this may result in data -loss. +This setting can be given on the command line or in the `elasticsearch.yml` +configuration file when starting up a master-eligible node. Once the cluster +has formed this setting is no longer required and is ignored. It need not be set +on master-ineligible nodes, nor on master-eligible nodes that are started to +join an existing cluster. Note that master-eligible nodes should use storage +that persists across restarts. 
If they do not, and +`cluster.initial_master_nodes` is set, and a full cluster restart occurs, then +another brand-new cluster will form and this may result in data loss. + +It is technically sufficient to set `cluster.initial_master_nodes` on a single +master-eligible node in the cluster, and only to mention that single node in the +setting's value, but this provides no fault tolerance before the cluster has +fully formed. It is therefore better to bootstrap using at least three +master-eligible nodes, each with a `cluster.initial_master_nodes` setting +containing all three nodes. + +NOTE: In alpha releases, all listed master-eligible nodes are required to be +discovered before bootstrapping can take place. This requirement will be relaxed +in production-ready releases. + +WARNING: You must set `cluster.initial_master_nodes` to the same list of nodes +on each node on which it is set in order to be sure that only a single cluster +forms during bootstrapping and therefore to avoid the risk of data loss. For a cluster with 3 master-eligible nodes (with <> `master-a`, `master-b` and `master-c`) the configuration will look as follows: @@ -40,8 +54,8 @@ cluster.initial_master_nodes: Alternatively the IP addresses or hostnames (<>) can be used. If there is more than one Elasticsearch node -with the same IP address or hostname then the transport ports must also be -given to specify exactly which node is meant: +with the same IP address or hostname then the transport ports must also be given +to specify exactly which node is meant: [source,yaml] -------------------------------------------------- @@ -52,58 +66,39 @@ cluster.initial_master_nodes: - master-node-hostname -------------------------------------------------- -Like all node settings, it is also possible to specify the initial set of -master nodes on the command-line that is used to start Elasticsearch: +Like all node settings, it is also possible to specify the initial set of master +nodes on the command-line that is used to start Elasticsearch: [source,bash] -------------------------------------------------- $ bin/elasticsearch -Ecluster.initial_master_nodes=master-a,master-b,master-c -------------------------------------------------- -It is technically sufficient to set this on a single master-eligible node in -the cluster, and only to mention that single node in the setting, but this -provides no fault tolerance before the cluster has fully formed. It -is therefore better to bootstrap using at least three master-eligible nodes. -In any case, when specifying the list of initial master nodes, **it is vitally -important** to configure each node with exactly the same list of nodes, to -prevent two independent clusters from forming. Typically you will set this on -the nodes that are mentioned in the list of initial master nodes. - -NOTE: In alpha releases, all listed master-eligible nodes are required to be - discovered before bootstrapping can take place. This requirement will be - relaxed in production-ready releases. - -WARNING: You must put exactly the same set of initial master nodes in each - configuration file (or leave the configuration empty) in order to be sure - that only a single cluster forms during bootstrapping and therefore to - avoid the risk of data loss. - [float] ==== Choosing a cluster name -The <> setting enables you to create multiple clusters which are separated -from each other. Nodes verify that they agree on their cluster name when they -first connect to each other. 
If two nodes have different cluster names, -they will not communicate meaningfully and will not belong to the same cluster. -The default value for the cluster name is `elasticsearch`, but it is -recommended to change this to reflect the logical name of the cluster. +The <> setting enables you to create multiple +clusters which are separated from each other. Nodes verify that they agree on +their cluster name when they first connect to each other, and Elasticsearch +will only form a cluster from nodes that all have the same cluster name. The +default value for the cluster name is `elasticsearch`, but it is recommended to +change this to reflect the logical name of the cluster. [float] ==== Auto-bootstrapping in development mode If the cluster is running with a completely default configuration then it will -automatically bootstrap a cluster based on the nodes that could be discovered -to be running on the same host within a short time after startup. This means -that by default it is possible to start up several nodes on a single machine -and have them automatically form a cluster which is very useful for development +automatically bootstrap a cluster based on the nodes that could be discovered to +be running on the same host within a short time after startup. This means that +by default it is possible to start up several nodes on a single machine and have +them automatically form a cluster which is very useful for development environments and experimentation. However, since nodes may not always successfully discover each other quickly enough this automatic bootstrapping cannot be relied upon and cannot be used in production deployments. -If any of the following settings are configured then auto-bootstrapping will -not take place, and you must configure `cluster.initial_master_nodes` as -described in the <>: +If any of the following settings are configured then auto-bootstrapping will not +take place, and you must configure `cluster.initial_master_nodes` as described +in the <>: * `discovery.zen.hosts_provider` * `discovery.zen.ping.unicast.hosts` From ec4e7394e7113612b7f77458c200ead881f5f1ca Mon Sep 17 00:00:00 2001 From: David Turner Date: Thu, 20 Dec 2018 11:27:12 +0000 Subject: [PATCH 099/106] Combine discovery overviews --- .../modules/discovery/discovery.asciidoc | 74 +++++++++++-------- 1 file changed, 43 insertions(+), 31 deletions(-) diff --git a/docs/reference/modules/discovery/discovery.asciidoc b/docs/reference/modules/discovery/discovery.asciidoc index 9daab9723ac0e..dd2dc47a79dfb 100644 --- a/docs/reference/modules/discovery/discovery.asciidoc +++ b/docs/reference/modules/discovery/discovery.asciidoc @@ -1,19 +1,49 @@ [[modules-discovery-hosts-providers]] === Discovery -The cluster formation module uses a list of _seed_ nodes in order to start off -the discovery process. When you start an Elasticsearch node, or when a node -believes the master node failed, that node tries to connect to each -seed node in its list. After a connection occurs, the two nodes repeatedly share -information about the other known master-eligible nodes in the cluster in order -to build a complete picture of the cluster. By default the cluster formation -module offers two hosts providers to configure the list of seed nodes: a -_settings_-based and a _file_-based hosts provider. It can be extended to -support cloud environments and other forms of hosts providers via -{plugins}/discovery.html[discovery plugins]. 
Hosts providers are configured -using the `discovery.zen.hosts_provider` setting, which defaults to the -_settings_-based hosts provider. Multiple hosts providers can be specified as a -list. +Discovery is the process by which the cluster formation module finds other +nodes with which to form a cluster. This process runs when you start an +Elasticsearch node or when a node believes the master node failed and continues +until the master node is found or a new master node is elected. + +Discovery operates in two phases: First, each node probes the addresses of all +known master-eligible nodes by connecting to each address and attempting to +identify the node to which it is connected. Secondly it shares with the remote +node a list of all of its known master-eligible peers and the remote node +responds with _its_ peers in turn. The node then probes all the new nodes that +it just discovered, requests their peers, and so on. + +This process starts with a list of _seed_ addresses from one or more +<>, together with the addresses of +any master-eligible nodes that were in the last known cluster. The process +operates in two phases: First, each node probes the seed addresses by +connecting to each address and attempting to identify the node to which it is +connected. Secondly it shares with the remote node a list of all of its known +master-eligible peers and the remote node responds with _its_ peers in turn. +The node then probes all the new nodes that it just discovered, requests their +peers, and so on. + +If the node is not master-eligible then it continues this discovery process +until it has discovered an elected master node. If no elected master is +discovered then the node will retry after `discovery.find_peers_interval` which +defaults to `1s`. + +If the node is master-eligible then it continues this discovery process until it +has either discovered an elected master node or else it has discovered enough +masterless master-eligible nodes to complete an election. If neither of these +occur quickly enough then the node will retry after +`discovery.find_peers_interval` which defaults to `1s`. + +[[built-in-hosts-providers]] +==== Hosts providers + +By default the cluster formation module offers two hosts providers to configure +the list of seed nodes: a _settings_-based and a _file_-based hosts provider. +It can be extended to support cloud environments and other forms of hosts +providers via {plugins}/discovery.html[discovery plugins]. Hosts providers are +configured using the `discovery.zen.hosts_provider` setting, which defaults to +the _settings_-based hosts provider. Multiple hosts providers can be specified +as a list. [float] [[settings-based-hosts-provider]] @@ -131,24 +161,6 @@ that uses the GCE API find a list of seed nodes. [float] ==== Discovery settings -Discovery operates in two phases: First, each node probes the addresses of all -known master-eligible nodes by connecting to each address and attempting to -identify the node to which it is connected. Secondly it shares with the remote -node a list of all of its known master-eligible peers and the remote node -responds with _its_ peers in turn. The node then probes all the new nodes that -it just discovered, requests their peers, and so on. - -If the node is not master-eligible then it continues this discovery process -until it has discovered an elected master node. If no elected master is -discovered then the node will retry after `discovery.find_peers_interval` which -defaults to `1s`. 
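If the default probing cadence is unsuitable for a particular environment, for example one with slow DNS resolution, the retry interval can be adjusted in `elasticsearch.yml`. A sketch follows; the value shown is purely illustrative rather than a recommendation:

[source,yaml]
--------------------------------------------------
# Retry peer discovery every 5 seconds instead of the default 1s
discovery.find_peers_interval: 5s
--------------------------------------------------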
- -If the node is master-eligible then it continues this discovery process until it -has either discovered an elected master node or else it has discovered enough -masterless master-eligible nodes to complete an election. If neither of these -occur quickly enough then the node will retry after -`discovery.find_peers_interval` which defaults to `1s`. - The discovery process is controlled by the following settings. `discovery.find_peers_interval`:: From f4a41dbf9ad3a8e55cbd455523fab05173100da9 Mon Sep 17 00:00:00 2001 From: Lisa Cawley Date: Thu, 20 Dec 2018 11:35:33 +0000 Subject: [PATCH 100/106] Update docs/reference/setup/important-settings/discovery-settings.asciidoc Co-Authored-By: DaveCTurner --- .../setup/important-settings/discovery-settings.asciidoc | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/reference/setup/important-settings/discovery-settings.asciidoc b/docs/reference/setup/important-settings/discovery-settings.asciidoc index 42be8e9820d8a..9c62f2da1af25 100644 --- a/docs/reference/setup/important-settings/discovery-settings.asciidoc +++ b/docs/reference/setup/important-settings/discovery-settings.asciidoc @@ -19,7 +19,11 @@ use the `discovery.zen.ping.unicast.hosts` setting to provide a seed list of other nodes in the cluster that are master-eligible and likely to be live and contactable. This setting should normally contain the addresses of all the master-eligible nodes in the cluster. - +This setting contains either an array of hosts or a comma-delimited string. Each +value should be in the form of `host:port` or `host` (where `port` defaults to +the setting `transport.profiles.default.port` falling back to `transport.port` +if not set). Note that IPv6 hosts must be bracketed. The default for this +setting is `127.0.0.1, [::1]` [float] [[initial_master_nodes]] ==== `cluster.initial_master_nodes` From 6af37215ff95786737a9ceed6ee56f0c7be39fbf Mon Sep 17 00:00:00 2001 From: David Turner Date: Thu, 20 Dec 2018 11:39:29 +0000 Subject: [PATCH 101/106] Change title --- .../modules/discovery/quorums.asciidoc | 69 +++++++++---------- 1 file changed, 34 insertions(+), 35 deletions(-) diff --git a/docs/reference/modules/discovery/quorums.asciidoc b/docs/reference/modules/discovery/quorums.asciidoc index 8cbf60545bc65..98f02b07e3e8b 100644 --- a/docs/reference/modules/discovery/quorums.asciidoc +++ b/docs/reference/modules/discovery/quorums.asciidoc @@ -42,42 +42,41 @@ to complete before removing more nodes from the cluster. [float] ==== Setting the initial quorum -When a brand-new cluster starts up for the first time, it must -elect its first master node. To do this election, it needs to know the set -of master-eligible nodes whose votes should count. This -initial voting configuration is known as the _bootstrap configuration_ and is -set in the <>. +When a brand-new cluster starts up for the first time, it must elect its first +master node. To do this election, it needs to know the set of master-eligible +nodes whose votes should count. This initial voting configuration is known as +the _bootstrap configuration_ and is set in the +<>. It is important that the bootstrap configuration identifies exactly which nodes -should vote in the first election. It is not sufficient to configure each -node with an expectation of how many nodes there should be in the cluster. 
It -is also important to note that the bootstrap configuration must come from -outside the cluster: there is no safe way for the cluster to determine the -bootstrap configuration correctly on its own. - -If the bootstrap configuration is not set correctly, when -you start a brand-new cluster there is a risk that you will accidentally form two separate -clusters instead of one. This situation can lead to data loss: you might start using -both clusters before you notice that anything has gone wrong and it is -impossible to merge them together later. +should vote in the first election. It is not sufficient to configure each node +with an expectation of how many nodes there should be in the cluster. It is also +important to note that the bootstrap configuration must come from outside the +cluster: there is no safe way for the cluster to determine the bootstrap +configuration correctly on its own. + +If the bootstrap configuration is not set correctly, when you start a brand-new +cluster there is a risk that you will accidentally form two separate clusters +instead of one. This situation can lead to data loss: you might start using both +clusters before you notice that anything has gone wrong and it is impossible to +merge them together later. NOTE: To illustrate the problem with configuring each node to expect a certain cluster size, imagine starting up a three-node cluster in which each node knows that it is going to be part of a three-node cluster. A majority of three nodes -is two, so normally the first two nodes to discover each other form a -cluster and the third node joins them a short time later. However, imagine -that four nodes were erroneously started instead of three. In this case, there -are enough nodes to form two separate clusters. Of course if each node is -started manually then it's unlikely that too many nodes are started. If you're using an automated orchestrator, however, it's -certainly possible to get into this situation-- -particularly if the orchestrator is not resilient to failures -such as network partitions. +is two, so normally the first two nodes to discover each other form a cluster +and the third node joins them a short time later. However, imagine that four +nodes were erroneously started instead of three. In this case, there are enough +nodes to form two separate clusters. Of course if each node is started manually +then it's unlikely that too many nodes are started. If you're using an automated +orchestrator, however, it's certainly possible to get into this situation-- +particularly if the orchestrator is not resilient to failures such as network +partitions. The initial quorum is only required the very first time a whole cluster starts up. New nodes joining an established cluster can safely obtain all the -information they need from the elected master. Nodes that have previously -been part of a cluster will have stored to disk all the information that is required +information they need from the elected master. Nodes that have previously been +part of a cluster will have stored to disk all the information that is required when they restart. [float] @@ -93,14 +92,14 @@ action with the APIs described here in these cases, because the set of master nodes is not changing permanently. [float] -==== Auto-reconfiguration - -Nodes may join or leave the cluster, and Elasticsearch reacts by making -corresponding changes to the voting configuration in order to ensure that the -cluster is as resilient as possible. 
The default auto-reconfiguration behaviour -is expected to give the best results in most situations. The current voting -configuration is stored in the cluster state so you can inspect its current -contents as follows: +==== Automatic changes to the voting configuration + +Nodes may join or leave the cluster, and Elasticsearch reacts by automatically +making corresponding changes to the voting configuration in order to ensure that +the cluster is as resilient as possible. The default auto-reconfiguration +behaviour is expected to give the best results in most situations. The current +voting configuration is stored in the cluster state so you can inspect its +current contents as follows: [source,js] -------------------------------------------------- From 0b2b63ccd2608c8e4bcd5705b7834b924be388b1 Mon Sep 17 00:00:00 2001 From: David Turner Date: Thu, 20 Dec 2018 11:43:17 +0000 Subject: [PATCH 102/106] Clarify the difference between a split brain and an even network partition --- .../modules/discovery/quorums.asciidoc | 48 +++++++++---------- 1 file changed, 24 insertions(+), 24 deletions(-) diff --git a/docs/reference/modules/discovery/quorums.asciidoc b/docs/reference/modules/discovery/quorums.asciidoc index 98f02b07e3e8b..968a0440be72c 100644 --- a/docs/reference/modules/discovery/quorums.asciidoc +++ b/docs/reference/modules/discovery/quorums.asciidoc @@ -8,21 +8,20 @@ Elasticsearch achieves this robustness by considering each action to have succeeded on receipt of responses from a _quorum_, which is a subset of the master-eligible nodes in the cluster. The advantage of requiring only a subset of the nodes to respond is that it means some of the nodes can fail without -preventing the cluster from making progress. The quorums are carefully -chosen so the cluster does not have a "split brain" scenario where it's partitioned -into two pieces--each of which may make decisions that are inconsistent with +preventing the cluster from making progress. The quorums are carefully chosen so +the cluster does not have a "split brain" scenario where it's partitioned into +two pieces such that each piece may make decisions that are inconsistent with those of the other piece. Elasticsearch allows you to add and remove master-eligible nodes to a running cluster. In many cases you can do this simply by starting or stopping the nodes -as required. See -<>. +as required. See <>. As nodes are added or removed Elasticsearch maintains an optimal level of fault tolerance by updating the cluster's _voting configuration_, which is the set of master-eligible nodes whose responses are counted when making decisions such as -electing a new master or committing a new cluster state. A decision is made -only after more than half of the nodes in the voting configuration have responded. +electing a new master or committing a new cluster state. A decision is made only +after more than half of the nodes in the voting configuration have responded. Usually the voting configuration is the same as the set of all the master-eligible nodes that are currently in the cluster. However, there are some situations in which they may be different. @@ -30,8 +29,8 @@ situations in which they may be different. To be sure that the cluster remains available you **must not stop half or more of the nodes in the voting configuration at the same time**. As long as more than half of the voting nodes are available the cluster can still work normally. 
-This means that if there are three or four master-eligible nodes, the -cluster can tolerate one of them being unavailable. If there are two or fewer +This means that if there are three or four master-eligible nodes, the cluster +can tolerate one of them being unavailable. If there are two or fewer master-eligible nodes, they must all remain available. After a node has joined or left the cluster the elected master must issue a @@ -158,22 +157,23 @@ must be performed as nodes join and leave the cluster. ==== Even numbers of master-eligible nodes There should normally be an odd number of master-eligible nodes in a cluster. -If there is an even number, Elasticsearch leaves one of them out of the -voting configuration to ensure that it has an odd size. This omission does not decrease +If there is an even number, Elasticsearch leaves one of them out of the voting +configuration to ensure that it has an odd size. This omission does not decrease the failure-tolerance of the cluster. In fact, improves it slightly: if the -cluster is partitioned into two even halves then one of the halves will contain -a majority of the voting configuration and will be able to keep operating. -If all of the master-eligible nodes' votes were counted, neither -side could make any progress in this situation. +cluster suffers from a network partition that divides it into two equally-sized +halves then one of the halves will contain a majority of the voting +configuration and will be able to keep operating. If all of the master-eligible +nodes' votes were counted, neither side would contain a strict majority of the +nodes and so the cluster would not be able to make any progress. For instance if there are four master-eligible nodes in the cluster and the voting configuration contained all of them, any quorum-based decision would -require votes from at least three of them. This situation means that the cluster can -tolerate the loss of only a single master-eligible node. If this cluster were split -into two equal halves, neither half would contain three master-eligible -nodes and the cluster would not be able to make any progress. If the voting -configuration contains only three of the four master-eligible nodes, however, the -cluster is still only fully tolerant to the loss of one node, but quorum-based -decisions require votes from two of the three voting nodes. In the event of an -even split, one half will contain two of the three voting nodes so will remain -available. +require votes from at least three of them. This situation means that the cluster +can tolerate the loss of only a single master-eligible node. If this cluster +were split into two equal halves, neither half would contain three +master-eligible nodes and the cluster would not be able to make any progress. If +the voting configuration contains only three of the four master-eligible nodes, +however, the cluster is still only fully tolerant to the loss of one node, but +quorum-based decisions require votes from two of the three voting nodes. In the +event of an even split, one half will contain two of the three voting nodes so +will remain available. 
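To see this behaviour in practice you can inspect the committed voting configuration using the API shown earlier in this series:

[source,js]
--------------------------------------------------
GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config
--------------------------------------------------

In a healthy cluster with four master-eligible nodes you would expect the response to list only three node IDs, along the lines of the sketch below. The IDs themselves are hypothetical:

[source,js]
--------------------------------------------------
{
  "metadata": {
    "cluster_coordination": {
      "last_committed_config": ["nodeId-1", "nodeId-2", "nodeId-3"]
    }
  }
}
--------------------------------------------------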
From f24c1d9dc524202990ee9bc28e34e3ae770db68a Mon Sep 17 00:00:00 2001 From: David Turner Date: Thu, 20 Dec 2018 11:44:16 +0000 Subject: [PATCH 103/106] Add 'that half' --- docs/reference/modules/discovery/quorums.asciidoc | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/reference/modules/discovery/quorums.asciidoc b/docs/reference/modules/discovery/quorums.asciidoc index 968a0440be72c..1ad163b22de31 100644 --- a/docs/reference/modules/discovery/quorums.asciidoc +++ b/docs/reference/modules/discovery/quorums.asciidoc @@ -171,9 +171,9 @@ voting configuration contained all of them, any quorum-based decision would require votes from at least three of them. This situation means that the cluster can tolerate the loss of only a single master-eligible node. If this cluster were split into two equal halves, neither half would contain three -master-eligible nodes and the cluster would not be able to make any progress. If -the voting configuration contains only three of the four master-eligible nodes, -however, the cluster is still only fully tolerant to the loss of one node, but -quorum-based decisions require votes from two of the three voting nodes. In the -event of an even split, one half will contain two of the three voting nodes so -will remain available. +master-eligible nodes and the cluster would not be able to make any progress. +If the voting configuration contains only three of the four master-eligible +nodes, however, the cluster is still only fully tolerant to the loss of one +node, but quorum-based decisions require votes from two of the three voting +nodes. In the event of an even split, one half will contain two of the three +voting nodes so that half will remain available. From 64564c0fb2a6b605b29bee9c4eb73668e1b21151 Mon Sep 17 00:00:00 2001 From: David Turner Date: Thu, 20 Dec 2018 11:50:42 +0000 Subject: [PATCH 104/106] Move elections overview to quorums page --- .../modules/discovery/master-election.asciidoc | 12 ++---------- docs/reference/modules/discovery/quorums.asciidoc | 14 ++++++++++++++ 2 files changed, 16 insertions(+), 10 deletions(-) diff --git a/docs/reference/modules/discovery/master-election.asciidoc b/docs/reference/modules/discovery/master-election.asciidoc index bf4d5f0d28871..60d09e5545b40 100644 --- a/docs/reference/modules/discovery/master-election.asciidoc +++ b/docs/reference/modules/discovery/master-election.asciidoc @@ -1,15 +1,7 @@ -[[master-election]] +[[master-election-settings]] === Master election settings -Elasticsearch uses an election process to agree on an elected master node, both -at startup and if the existing elected master fails. Any master-eligible node -can start an election, and normally the first election that takes place will -succeed. Elections only usually fail when two nodes both happen to start their -elections at about the same time, so elections are scheduled randomly on each -node to avoid this happening. Nodes will retry elections until a master is -elected, backing off on failure, so that eventually an election will succeed -(with arbitrarily high probability). The following settings control the -scheduling of elections. +The following settings control the scheduling of elections. 
`cluster.election.initial_timeout`:: diff --git a/docs/reference/modules/discovery/quorums.asciidoc b/docs/reference/modules/discovery/quorums.asciidoc index 1ad163b22de31..5642083b63b0b 100644 --- a/docs/reference/modules/discovery/quorums.asciidoc +++ b/docs/reference/modules/discovery/quorums.asciidoc @@ -78,6 +78,20 @@ information they need from the elected master. Nodes that have previously been part of a cluster will have stored to disk all the information that is required when they restart. +[float] +==== Master elections + +Elasticsearch uses an election process to agree on an elected master node, both +at startup and if the existing elected master fails. Any master-eligible node +can start an election, and normally the first election that takes place will +succeed. Elections only usually fail when two nodes both happen to start their +elections at about the same time, so elections are scheduled randomly on each +node to reduce the probability of this happening. Nodes will retry elections +until a master is elected, backing off on failure, so that eventually an +election will succeed (with arbitrarily high probability). The scheduling of +master elections are controlled by the <>. + [float] ==== Cluster maintenance, rolling restarts and migrations From 852caed65b9ed7518a1f4728f0e0ab941be56508 Mon Sep 17 00:00:00 2001 From: David Turner Date: Thu, 20 Dec 2018 12:11:35 +0000 Subject: [PATCH 105/106] Fix up broken link --- docs/reference/modules/discovery.asciidoc | 4 +- .../discovery/fault-detection.asciidoc | 2 +- docs/reference/modules/discovery/zen.asciidoc | 226 ------------------ .../upgrade/cluster_restart.asciidoc | 12 +- 4 files changed, 11 insertions(+), 233 deletions(-) delete mode 100644 docs/reference/modules/discovery/zen.asciidoc diff --git a/docs/reference/modules/discovery.asciidoc b/docs/reference/modules/discovery.asciidoc index 225c6e28d8705..546c347fa3bb8 100644 --- a/docs/reference/modules/discovery.asciidoc +++ b/docs/reference/modules/discovery.asciidoc @@ -49,8 +49,8 @@ module. This module is divided into the following sections: Advanced settings:: There are settings that allow advanced users to influence the - <> and <> - processes. + <> and + <> processes. <>:: diff --git a/docs/reference/modules/discovery/fault-detection.asciidoc b/docs/reference/modules/discovery/fault-detection.asciidoc index 5805fbce60477..0a8ff5fa2081c 100644 --- a/docs/reference/modules/discovery/fault-detection.asciidoc +++ b/docs/reference/modules/discovery/fault-detection.asciidoc @@ -1,4 +1,4 @@ -[[fault-detection]] +[[fault-detection-settings]] === Cluster fault detection settings An elected master periodically checks each of the nodes in the cluster in order diff --git a/docs/reference/modules/discovery/zen.asciidoc b/docs/reference/modules/discovery/zen.asciidoc deleted file mode 100644 index 98967bf7ebaf4..0000000000000 --- a/docs/reference/modules/discovery/zen.asciidoc +++ /dev/null @@ -1,226 +0,0 @@ -[[modules-discovery-zen]] -=== Zen Discovery - -Zen discovery is the built-in, default, discovery module for Elasticsearch. It -provides unicast and file-based discovery, and can be extended to support cloud -environments and other forms of discovery via plugins. - -Zen discovery is integrated with other modules, for example, all communication -between nodes is done using the <> module. - -It is separated into several sub modules, which are explained below: - -[float] -[[ping]] -==== Ping - -This is the process where a node uses the discovery mechanisms to find other -nodes. 
- -[float] -[[discovery-seed-nodes]] -==== Seed nodes - -Zen discovery uses a list of _seed_ nodes in order to start off the discovery -process. At startup, or when electing a new master, Elasticsearch tries to -connect to each seed node in its list, and holds a gossip-like conversation with -them to find other nodes and to build a complete picture of the cluster. By -default there are two methods for configuring the list of seed nodes: _unicast_ -and _file-based_. It is recommended that the list of seed nodes comprises the -list of master-eligible nodes in the cluster. - -[float] -[[unicast]] -===== Unicast - -Unicast discovery configures a static list of hosts for use as seed nodes. -These hosts can be specified as hostnames or IP addresses; hosts specified as -hostnames are resolved to IP addresses during each round of pinging. Note that -if you are in an environment where DNS resolutions vary with time, you might -need to adjust your <>. - -The list of hosts is set using the `discovery.zen.ping.unicast.hosts` static -setting. This is either an array of hosts or a comma-delimited string. Each -value should be in the form of `host:port` or `host` (where `port` defaults to -the setting `transport.profiles.default.port` falling back to -`transport.port` if not set). Note that IPv6 hosts must be bracketed. The -default for this setting is `127.0.0.1, [::1]` - -Additionally, the `discovery.zen.ping.unicast.resolve_timeout` configures the -amount of time to wait for DNS lookups on each round of pinging. This is -specified as a <> and defaults to 5s. - -Unicast discovery uses the <> module to perform the -discovery. - -[float] -[[file-based-hosts-provider]] -===== File-based - -In addition to hosts provided by the static `discovery.zen.ping.unicast.hosts` -setting, it is possible to provide a list of hosts via an external file. -Elasticsearch reloads this file when it changes, so that the list of seed nodes -can change dynamically without needing to restart each node. For example, this -gives a convenient mechanism for an Elasticsearch instance that is run in a -Docker container to be dynamically supplied with a list of IP addresses to -connect to for Zen discovery when those IP addresses may not be known at node -startup. - -To enable file-based discovery, configure the `file` hosts provider as follows: - -[source,txt] ----------------------------------------------------------------- -discovery.zen.hosts_provider: file ----------------------------------------------------------------- - -Then create a file at `$ES_PATH_CONF/unicast_hosts.txt` in the format described -below. Any time a change is made to the `unicast_hosts.txt` file the new -changes will be picked up by Elasticsearch and the new hosts list will be used. - -Note that the file-based discovery plugin augments the unicast hosts list in -`elasticsearch.yml`: if there are valid unicast host entries in -`discovery.zen.ping.unicast.hosts` then they will be used in addition to those -supplied in `unicast_hosts.txt`. - -The `discovery.zen.ping.unicast.resolve_timeout` setting also applies to DNS -lookups for nodes specified by address via file-based discovery. This is -specified as a <> and defaults to 5s. - -The format of the file is to specify one node entry per line. Each node entry -consists of the host (host name or IP address) and an optional transport port -number. If the port number is specified, is must come immediately after the -host (on the same line) separated by a `:`. 
If the port number is not -specified, a default value of 9300 is used. - -For example, this is an example of `unicast_hosts.txt` for a cluster with four -nodes that participate in unicast discovery, some of which are not running on -the default port: - -[source,txt] ----------------------------------------------------------------- -10.10.10.5 -10.10.10.6:9305 -10.10.10.5:10005 -# an IPv6 address -[2001:0db8:85a3:0000:0000:8a2e:0370:7334]:9301 ----------------------------------------------------------------- - -Host names are allowed instead of IP addresses (similar to -`discovery.zen.ping.unicast.hosts`), and IPv6 addresses must be specified in -brackets with the port coming after the brackets. - -It is also possible to add comments to this file. All comments must appear on -their lines starting with `#` (i.e. comments cannot start in the middle of a -line). - -[float] -[[master-election]] -==== Master Election - -As part of the ping process a master of the cluster is either elected or joined -to. This is done automatically. The `discovery.zen.ping_timeout` (which defaults -to `3s`) determines how long the node will wait before deciding on starting an -election or joining an existing cluster. Three pings will be sent over this -timeout interval. In case where no decision can be reached after the timeout, -the pinging process restarts. In slow or congested networks, three seconds -might not be enough for a node to become aware of the other nodes in its -environment before making an election decision. Increasing the timeout should -be done with care in that case, as it will slow down the election process. Once -a node decides to join an existing formed cluster, it will send a join request -to the master (`discovery.zen.join_timeout`) with a timeout defaulting at 20 -times the ping timeout. - -When the master node stops or has encountered a problem, the cluster nodes start -pinging again and will elect a new master. This pinging round also serves as a -protection against (partial) network failures where a node may unjustly think -that the master has failed. In this case the node will simply hear from other -nodes about the currently active master. - -If `discovery.zen.master_election.ignore_non_master_pings` is `true`, pings from -nodes that are not master eligible (nodes where `node.master` is `false`) are -ignored during master election; the default value is `false`. - -Nodes can be excluded from becoming a master by setting `node.master` to -`false`. - -The `discovery.zen.minimum_master_nodes` sets the minimum number of master -eligible nodes that need to join a newly elected master in order for an election -to complete and for the elected node to accept its mastership. The same setting -controls the minimum number of active master eligible nodes that should be a -part of any active cluster. If this requirement is not met the active master -node will step down and a new master election will begin. - -This setting must be set to a <> of your master -eligible nodes. It is recommended to avoid having only two master eligible -nodes, since a quorum of two is two. Therefore, a loss of either master eligible -node will result in an inoperable cluster. - -[float] -[[fault-detection]] -==== Fault Detection - -There are two fault detection processes running. The first is by the master, to -ping all the other nodes in the cluster and verify that they are alive. And on -the other end, each node pings to master to verify if its still alive or an -election process needs to be initiated. 
- -The following settings control the fault detection process using the -`discovery.zen.fd` prefix: - -[cols="<,<",options="header",] -|======================================================================= -|Setting |Description -|`ping_interval` |How often a node gets pinged. Defaults to `1s`. - -|`ping_timeout` |How long to wait for a ping response, defaults to -`30s`. - -|`ping_retries` |How many ping failures / timeouts cause a node to be -considered failed. Defaults to `3`. -|======================================================================= - -[float] -==== Cluster state updates - -The master node is the only node in a cluster that can make changes to the -cluster state. The master node processes one cluster state update at a time, -applies the required changes and publishes the updated cluster state to all the -other nodes in the cluster. Each node receives the publish message, acknowledges -it, but does *not* yet apply it. If the master does not receive acknowledgement -from at least `discovery.zen.minimum_master_nodes` nodes within a certain time -(controlled by the `discovery.zen.commit_timeout` setting and defaults to 30 -seconds) the cluster state change is rejected. - -Once enough nodes have responded, the cluster state is committed and a message -will be sent to all the nodes. The nodes then proceed to apply the new cluster -state to their internal state. The master node waits for all nodes to respond, -up to a timeout, before going ahead processing the next updates in the queue. -The `discovery.zen.publish_timeout` is set by default to 30 seconds and is -measured from the moment the publishing started. Both timeout settings can be -changed dynamically through the <> - -[float] -[[no-master-block]] -==== No master block - -For the cluster to be fully operational, it must have an active master and the -number of running master eligible nodes must satisfy the -`discovery.zen.minimum_master_nodes` setting if set. The -`discovery.zen.no_master_block` settings controls what operations should be -rejected when there is no active master. - -The `discovery.zen.no_master_block` setting has two valid options: - -[horizontal] -`all`:: All operations on the node--i.e. both read & writes--will be rejected. -This also applies for api cluster state read or write operations, like the get -index settings, put mapping and cluster state api. -`write`:: (default) Write operations will be rejected. Read operations will -succeed, based on the last known cluster configuration. This may result in -partial reads of stale data as this node may be isolated from the rest of the -cluster. - -The `discovery.zen.no_master_block` setting doesn't apply to nodes-based apis -(for example cluster stats, node info and node stats apis). Requests to these -apis will not be blocked and can run on any available node. diff --git a/docs/reference/upgrade/cluster_restart.asciidoc b/docs/reference/upgrade/cluster_restart.asciidoc index 85b6fffdb2eb3..4c229e373f505 100644 --- a/docs/reference/upgrade/cluster_restart.asciidoc +++ b/docs/reference/upgrade/cluster_restart.asciidoc @@ -59,10 +59,14 @@ If you have dedicated master nodes, start them first and wait for them to form a cluster and elect a master before proceeding with your data nodes. You can check progress by looking at the logs. -As soon as the <> -have discovered each other, they form a cluster and elect a master. 
At -that point, you can use <> and -<> to monitor nodes joining the cluster: +If upgrading from a 6.x cluster, you must +<> by +setting the `cluster.initial_master_nodes` setting. + +As soon as enough master-eligible nodes have discovered each other, they form a +cluster and elect a master. At that point, you can use +<> and <> to monitor nodes +joining the cluster: [source,sh] -------------------------------------------------- From 94bc24bc6654c2d2035b31c5d36150ccfc844b52 Mon Sep 17 00:00:00 2001 From: David Turner Date: Thu, 20 Dec 2018 12:31:05 +0000 Subject: [PATCH 106/106] _hosts_ providers --- docs/plugins/discovery.asciidoc | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/docs/plugins/discovery.asciidoc b/docs/plugins/discovery.asciidoc index 17b223478eb51..926acead09ea1 100644 --- a/docs/plugins/discovery.asciidoc +++ b/docs/plugins/discovery.asciidoc @@ -1,8 +1,8 @@ [[discovery]] == Discovery Plugins -Discovery plugins extend Elasticsearch by adding new host providers that -can be used to extend the {ref}/modules-discovery.html[cluster formation module]. +Discovery plugins extend Elasticsearch by adding new hosts providers that can be +used to extend the {ref}/modules-discovery.html[cluster formation module]. [float] ==== Core discovery plugins @@ -11,15 +11,18 @@ The core discovery plugins are: <>:: -The EC2 discovery plugin uses the https://github.com/aws/aws-sdk-java[AWS API] for unicast discovery. +The EC2 discovery plugin uses the https://github.com/aws/aws-sdk-java[AWS API] +for unicast discovery. <>:: -The Azure Classic discovery plugin uses the Azure Classic API for unicast discovery. +The Azure Classic discovery plugin uses the Azure Classic API for unicast +discovery. <>:: -The Google Compute Engine discovery plugin uses the GCE API for unicast discovery. +The Google Compute Engine discovery plugin uses the GCE API for unicast +discovery. [float] ==== Community contributed discovery plugins