Handling disk and file system permission issues on new index creation #19789
Comments
There is another issue at play here: the index-creation context is lost after the primary shard fails to initialize on node 2 (a variation of #15241, only fixed by allocation ids in v5.0.0). This means that after the first failed attempt to initialize the primary shard, it is treated as an existing shard copy to be recovered rather than as a new one (I've added some notes below on how to detect this situation). It also means that the primary shard allocator searches the nodes for an existing copy of the data to allocate the shard to, which triggers the async shard fetching described below.

How to detect that the index-creation context is lost in v1.x/v2.x: this can be seen by looking at the unassigned_info on the unassigned primary shard.
The index-creation context is correctly set if the unassigned_info object has the reason INDEX_CREATED. In v5.0.0, the decision whether a fresh shard copy is expected (as after index creation) or an existing copy is required (when recovering a previously started shard after a cluster restart) is no longer based on the unassigned_info but on allocation ids. I have already opened a PR that makes it easier to see in v5.0.0 whether a primary is going to recover as a fresh initial shard copy or requires an existing on-disk shard copy (#19516).
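To make that check concrete, here is a minimal sketch (not part of the original comment) of inspecting the unassigned_info of unassigned shards through the cluster state API; the host and port are assumptions for a default single-machine setup.

```python
# Minimal sketch (assumption: cluster reachable on localhost:9200, Python 3
# standard library only) of inspecting unassigned_info on unassigned shards.
# After a failed allocation attempt the reason changes away from INDEX_CREATED,
# which is how the lost index-creation context shows up in v1.x/v2.x.
import json
import urllib.request

STATE_URL = "http://localhost:9200/_cluster/state/routing_nodes"

with urllib.request.urlopen(STATE_URL) as resp:
    state = json.load(resp)

# routing_nodes.unassigned lists the shard copies the master has not allocated yet.
for shard in state["routing_nodes"]["unassigned"]:
    info = shard.get("unassigned_info", {})
    print(
        f"{shard['index']}[{shard['shard']}] primary={shard['primary']} "
        f"reason={info.get('reason')} details={info.get('details')}"
    )
```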
This also shows another subtle issue (even in current master) related to async_shard_fetch. Assume for simplicity an index with 1 primary shard and no replicas. The primary shard was successfully started at some point on data node X, but now we have done a full cluster restart, and data node X has disk permission issues for the shard directory after the restart. When the master tries to allocate the primary after the cluster restart, it first does an async_shard_fetch, which fails hard on node X because merely listing the files in the shard directory already throws an exception (see the clarification below).
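As a small illustration of that failure mode (plain Python rather than Elasticsearch code, and assuming a non-root user on a POSIX system): listing a directory whose permissions are broken throws an exception, analogous to listFiles on the shard directory failing hard instead of reporting "no shard copy here".

```python
# Illustration only (plain Python, not Elasticsearch code; assumes a non-root
# user on a POSIX system): a directory listing on a permission-broken directory
# throws, rather than returning an empty result.
import os
import stat
import tempfile

shard_dir = tempfile.mkdtemp(prefix="shard-0-")

os.chmod(shard_dir, 0)          # simulate a shard directory that cannot be read

try:
    os.listdir(shard_dir)       # analogous to listFiles on the shard directory
except PermissionError as exc:
    print(f"directory listing failed hard: {exc}")
finally:
    os.chmod(shard_dir, stat.S_IRWXU)   # restore permissions so cleanup works
    os.rmdir(shard_dir)
```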
To clarify, per discussion with @ywelsch (thx): the first issue, where the shard loses its index-creation context after the first failed allocation attempt, is solved in v5.0.0 based on allocation ids (#14739). The second issue is that async shard fetching can, in a certain situation, be triggered again and again when no existing shard copy can be found to allocate as primary. The situation where this occurs is when just doing a listFiles on the shard directory during shard fetching on a data node already throws an exception. We are keeping this issue open to track this particular PR.
It would be great to do something when a disk goes read-only. This seems to be the default behaviour in some Linux OSes when there are issues (such as corruption or problems with the mounted disk). Also, to avoid this, could we mention in the documentation that RAID 0 could be helpful?
@ywelsch This still seems to be an issue where failing to read a disk on a data node can lead to endless shard fetching. I'm inclined to open a dedicated issue for that. Do you agree?
This could also be seen as falling under the umbrella of #18417, even if the issue technically happens before the shard is even allocated to the broken FS / node. How about closing this one and adding a comment to the linked issue?
Works for me. Added a comment to #18417.
Elasticsearch version: 2.3.2
This is an attempt to simulate a bad disk that has turned read-only.
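The exact steps are not spelled out in this report; one rough way to approximate a read-only disk (a sketch with a placeholder data path, run as a non-root user on a POSIX system) is to strip write permission from node2's data directory before creating the index:

```python
# Rough sketch of simulating a read-only disk for node2 (the data path below is
# a placeholder, not taken from the report). With the data path left read-only,
# starting node2 and creating testindex with 5 primary shards and 0 replicas
# produces the shard listing shown below.
import os
import stat

NODE2_DATA_PATH = "/path/to/node2/data"   # hypothetical data path for node2

read_only = stat.S_IRUSR | stat.S_IXUSR   # r-x for the owner, no write
for root, _dirs, _files in os.walk(NODE2_DATA_PATH, topdown=False):
    os.chmod(root, read_only)
```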
testindex 3 p STARTED 0 130b 127.0.0.1 node1
testindex 4 p UNASSIGNED
testindex 2 p UNASSIGNED
testindex 1 p STARTED 0 130b 127.0.0.1 node1
testindex 0 p UNASSIGNED
However, the master node does not recover from this scenario well. It keeps trying to allocate the shards onto node2, pretty much perpetually, for as long as node2 is running with a read-only file system.
And these tasks keep getting added and re-added to the pending tasks queue, without end.
This keeps going forever until node2 is stopped and the underlying file system issue is addressed.
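The churn can be observed with the pending tasks API; a minimal sketch (assuming the master is reachable on localhost:9200) that samples the queue a few times:

```python
# Minimal sketch (assumption: master reachable on localhost:9200) of sampling
# the pending cluster tasks repeatedly to watch the same allocation-related
# tasks being re-enqueued while node2's file system stays read-only.
import json
import time
import urllib.request

PENDING_URL = "http://localhost:9200/_cluster/pending_tasks"

for _ in range(5):                        # a few samples are enough to see the churn
    with urllib.request.urlopen(PENDING_URL) as resp:
        tasks = json.load(resp)["tasks"]
    for task in tasks:
        print(f"{task['insert_order']:>8} {task['priority']:<9} {task['source']}")
    print("---")
    time.sleep(2)
```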
Once node2 is started back up with a writable data path, you still end up with a red index, because the allocation is not retried there:
testindex 3 p STARTED 0 130b 127.0.0.1 node1
testindex 4 p UNASSIGNED
testindex 2 p UNASSIGNED
testindex 1 p STARTED 0 130b 127.0.0.1 node1
testindex 0 p UNASSIGNED
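Given that the primaries stay unassigned even after node2 comes back writable, one possible manual workaround on 2.x (a sketch, not something suggested in the report) is to force-allocate them via the reroute API with allow_primary; the shards were freshly created and empty, so forcing empty primaries loses nothing here.

```python
# Sketch of a manual 2.x workaround (not from the report): force-allocate the
# unassigned primaries with the reroute API. allow_primary discards any existing
# data for the shard copy, which is acceptable here only because testindex was
# freshly created and empty. Assumes the cluster is reachable on localhost:9200.
import json
import urllib.request

REROUTE_URL = "http://localhost:9200/_cluster/reroute"

for shard in (0, 2, 4):                   # the unassigned primaries listed above
    body = json.dumps({
        "commands": [{
            "allocate": {
                "index": "testindex",
                "shard": shard,
                "node": "node2",
                "allow_primary": True,
            }
        }]
    }).encode("utf-8")
    req = urllib.request.Request(
        REROUTE_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status, f"requested allocation of testindex[{shard}] on node2")
```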
Seems like there is an opportunity to handle this better.