Copy checkpoint atomically when rolling generation #35407

Merged

DaveCTurner
Contributor

Today when rolling a translog generation we copy the checkpoint from
`translog.ckp` to `translog-nnnn.ckp` using a simple `Files.copy()` followed by
appropriate `fsync()` calls. The copy operation is not atomic, so if we crash
at the wrong moment we can leave an incomplete checkpoint file on disk. In
practice the checkpoint is so small that it's either empty or fully written.
However, we do not correctly handle the case where it's empty when the node
restarts.

In contrast, in `recoverFromFiles()` we _do_ copy the checkpoint atomically.
This commit extracts the atomic copy operation from `recoverFromFiles()` and
re-uses it in `rollGeneration()`.
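
For readers unfamiliar with the pattern, here is a minimal sketch of an atomic copy: write to a temporary file in the same directory, fsync it, atomically rename it into place, then fsync the directory. This is not the code from this commit. The class name `AtomicCheckpointCopy`, the method `copyAtomically`, and the temp-file naming are hypothetical, and the sketch assumes the target does not yet exist (as is the case for a freshly rolled generation).

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only; not the Elasticsearch implementation.
final class AtomicCheckpointCopy {

    private AtomicCheckpointCopy() {}

    /**
     * Copies source to target so that a crash leaves either no target file
     * or a complete one, never an empty or partially written one.
     */
    static void copyAtomically(Path source, Path target) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        // The temp file lives in the target's directory so the rename stays
        // on one filesystem. The prefix and suffix are arbitrary choices.
        Path temp = Files.createTempFile(dir, "checkpoint-copy-", ".tmp");
        try {
            Files.copy(source, temp, StandardCopyOption.REPLACE_EXISTING);
            fsync(temp, false); // make the contents durable before publishing
            Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
            fsync(dir, true);   // make the rename itself durable
        } finally {
            // No-op after a successful move; cleans up if an earlier step failed.
            Files.deleteIfExists(temp);
        }
    }

    private static void fsync(Path path, boolean isDirectory) throws IOException {
        StandardOpenOption mode = isDirectory ? StandardOpenOption.READ : StandardOpenOption.WRITE;
        try (FileChannel channel = FileChannel.open(path, mode)) {
            channel.force(true);
        } catch (IOException e) {
            if (!isDirectory) {
                throw e;
            }
            // Some platforms cannot open or fsync a directory; treat as best effort.
        }
    }
}
```

With this shape, a crash before the rename leaves only a stray temporary file, and a crash after it leaves a complete `translog-nnnn.ckp`, so a restarting node should never observe an empty checkpoint under the final name.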

@DaveCTurner DaveCTurner added >bug v7.0.0 :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. v6.6.0 labels Nov 9, 2018
@DaveCTurner DaveCTurner requested a review from s1monw November 9, 2018 09:29
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor Author

This situation occurred in https://discuss.elastic.co/t/failed-shard-after-ooming-corrupt-index/155612/3.

I recognise there are no tests for this change yet, because I don't know a good way to simulate this situation. Any ideas?

@s1monw
Contributor

s1monw commented Nov 9, 2018

God, I don’t know how often I looked at this code and I missed that?! Code LGTM. Regarding testing, I think we can add a randomly throwing FS impl that corrupts files if they are not fsynced. We do this in some Lucene testing directories, but that’s a bigger change. I am OK with getting this in as is and starting the conversation on a follow-up issue.

Contributor

@bleskes bleskes left a comment

LGTM. In terms of testing, we have various tests that inject errors in TranslogTests. Did you check if we already cover the case where files can be created, but we then run out of disk space and leave them empty?

@DaveCTurner DaveCTurner merged commit d01436d into elastic:master Nov 23, 2018
DaveCTurner added a commit that referenced this pull request Nov 23, 2018
original-brownbear pushed a commit that referenced this pull request Nov 23, 2018
@DaveCTurner DaveCTurner deleted the 2018-11-09-copy-checkpoint-atomically branch November 27, 2018 15:30
Labels
>bug :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. v6.6.0 v7.0.0-beta1