[DOCS] Document cluster behavior when a file system crashes but node remains operational #25591

Closed
MorrieAtElastic opened this issue Jul 7, 2017 · 9 comments
Labels
  • :Core/Infra/Core - Core issues without another label
  • :Distributed Coordination/Allocation - All issues relating to the decision making around placing a shard (both master logic & on the nodes)
  • :Distributed Coordination/Cluster Coordination - Cluster formation and cluster state publication, including cluster membership and fault detection.
  • >docs - General docs changes
  • resiliency
  • Team:Core/Infra - Meta label for core/infra team
  • Team:Distributed (Obsolete) - Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

@MorrieAtElastic
Contributor

MorrieAtElastic commented Jul 7, 2017

Describe the feature: Document cluster behavior when a file system crashes but node remains operational

Elasticsearch version: Generic

Plugins installed: [] n/a

JVM version (java -version): n/a

OS version (uname -a if on a Unix-like system): generic

Description of the problem including expected versus actual behavior:

The Elasticsearch documentation currently describes cluster behavior when a node fails. It does not describe what happens when a node's file system fails but the node itself remains operational. Such failures can and will happen, especially for customers using third-party high-performance disk systems (SSD, RAID, etc.) that are loosely coupled with the OS. It is also common for customers to mount their data directories on a high-performance disk system while keeping their log data on the system drive.

General issues that need to be addressed:

  • cluster actions when primary shards are lost due to disk failure (according to my testing, replica shards are promoted on other nodes)
  • cluster actions when replica shards are lost due to disk failure (new replica shards are created on surviving nodes)
  • parameters affecting shard management when a disk failure occurs
  • cluster response when disk failure is resolved and the disk system is brought back online (according to my testing, nothing happens until the entire cluster is restarted)
  • response of the node and the cluster to queries and CRUD requests addressed to the node with the failed file system
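
For anyone reproducing the tests behind the list above, one way to observe promotion and reallocation is to poll the cat shards API and summarize shard states before and after the disk failure. The sketch below is only an illustration: it assumes the rows come from `GET _cat/shards?format=json&h=index,shard,prirep,state,node`, the helper name `summarize_shards` is hypothetical, and the sample rows are made up rather than taken from a real cluster.

```python
from collections import Counter


def summarize_shards(rows):
    """Count shards by (prirep, state) and list the unassigned ones.

    `rows` is assumed to be the parsed JSON returned by
    GET _cat/shards?format=json&h=index,shard,prirep,state,node
    """
    counts = Counter((r["prirep"], r["state"]) for r in rows)
    unassigned = [
        (r["index"], r["shard"], r["prirep"])
        for r in rows
        if r["state"] == "UNASSIGNED"
    ]
    return counts, unassigned


# Hypothetical sample rows, invented for illustration only.
sample = [
    {"index": "logs", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-2"},
    {"index": "logs", "shard": "0", "prirep": "r", "state": "UNASSIGNED", "node": None},
    {"index": "logs", "shard": "1", "prirep": "p", "state": "STARTED", "node": "node-3"},
    {"index": "logs", "shard": "1", "prirep": "r", "state": "INITIALIZING", "node": "node-2"},
]

counts, unassigned = summarize_shards(sample)
```

Comparing the summaries taken before the failure, shortly after it, and after the disk is brought back online would show which of the behaviors listed above actually occurred on a given version.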

Relevant Discussions

"Expected behavior" during disk crashes has changed significantly between Elasticsearch versions, and there are several significant open issues that speak to this question, including:

#18417
#18467
#19789

Cluster response specifically to failed disk conditions should be documented for user system design and recovery planning.

@jimczi jimczi added the >docs General docs changes label Jul 7, 2017
@PhaedrusTheGreek
Contributor

Related Discussion: #18279

@colings86 colings86 added the :Core/Infra/Core Core issues without another label label Apr 24, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@pugnascotia
Contributor

@jrodewig
Contributor

jrodewig commented Nov 1, 2019

[docs issue triage]

@rjernst rjernst added Team:Core/Infra Meta label for core/infra team Team:Docs Meta label for docs team labels May 4, 2020
@rjernst rjernst added the needs:triage Requires assignment of a team area label label Dec 3, 2020
@jaymode jaymode removed the needs:triage Requires assignment of a team area label label Dec 14, 2020
@stefnestor
Contributor

@jrodewig @jaymode, I think this fell off the radar. Can you review? 🙏🏼

@jrodewig jrodewig added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. resiliency labels Nov 24, 2021
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Nov 24, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@jrodewig
Contributor

jrodewig commented Nov 24, 2021

Thanks for the ping @stefnestor. @jaymode is now part of another team, but I've added some labels to include the Distributed team.

Thanks to the work in #45286, we hopefully have a simpler story here. As this info is largely targeted at users doing recovery planning, we may want to add a page to Designing for Resilience.

I don't personally have the bandwidth to pick this up in the near term, but I can bring this to our next Docs sync to see if anyone else is available.

@DaveCTurner
Contributor

I wonder if it's worth making a distinction between "node failed" and "filesystem failed but node still running" any more. #45286 means that a node with a broken filesystem will remove itself from the cluster, just like any other failure mode.
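
To make the idea concrete, a file system health probe of the kind described here can be sketched as a small write test against each data path. This is a minimal illustration under stated assumptions, not the implementation from #45286: the function names `fs_is_writable` and `node_should_leave`, the probe file prefix, and the rule "any failing data path makes the node unhealthy" are all hypothetical choices made for the example.

```python
import os
import tempfile


def fs_is_writable(data_path):
    """Return True if a small file can be created, fsynced and deleted under data_path."""
    try:
        fd, tmp = tempfile.mkstemp(prefix=".fs_probe_", dir=data_path)
        try:
            os.write(fd, b"probe")
            os.fsync(fd)  # force the write through to the device
        finally:
            os.close(fd)
            os.remove(tmp)
        return True
    except OSError:
        # Covers a missing mount point, a read-only file system, I/O errors, and so on.
        return False


def node_should_leave(data_paths, probe=fs_is_writable):
    """Assumed policy for this sketch: unhealthy if any data path fails the probe."""
    return any(not probe(p) for p in data_paths)
```

A caller would run `node_should_leave` periodically over the node's data paths and treat a `True` result like any other node failure, which matches the point that the two failure modes no longer need separate handling.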

@debadair debadair removed the Team:Docs Meta label for docs team label Apr 27, 2022
@debadair debadair changed the title from "Elasticsearch: Document Cluster Behavior When A File System Crashes But Node Remains Operational" to "[DOCS] Document cluster behavior when a file system crashes but node remains operational" Apr 27, 2022
@idegtiarenko
Contributor

The node leaves the cluster as soon as the file system is no longer writable.
