Strict repository lock handling #3569
Conversation
Force-pushed from 4becb95 to 7fb3289.
I've added an extra commit which lets restic fail if it is unable to read any of the lock files in the repository. The idea is to prevent overlooking concurrent restic processes if a lock file is unreadable for some reason.
Force-pushed from 1468ccf to 198e750.
Force-pushed from 198e750 to 2cfa0f2.
Force-pushed from 2cfa0f2 to 0c3fb79.
Force-pushed from 0c3fb79 to 8c494c9.
+1! This seems like a critical change.
In what regard? For servers, which don't hibernate or use standby, this PR is likely not too relevant.
Force-pushed from 8c494c9 to 5c9a88a.
I've changed the timer expiry checks, as apparently timers are stopped during standby: golang/go#35012
I should have said "virtual machine" rather than "server".

In one case (the reason I started using Restic, in fact), the cloud provider has started becoming unstable (i.e. the company is failing as a going concern and its support has become unreachable). VMs can restart just fine, and a few minutes of downtime here and there isn't critical for the functions they serve, but loss of data is. Because of the prepaid contracts we have with them, I am loath to just dump them and move to a more reliable provider such as Linode until closer to the paid expiration early next year.

But the potential for a backup to be interrupted in that context is very strong, unlike, say, on my home machine, where I also use it. If that fails for some reason, I will absolutely know about it and can take manual corrective action if necessary.
I haven't done an in-depth review of the new lock code, but looked at the design and I like it a lot!
Force-pushed from a37c7f9 to 42e0e7a.
The globalContext is now passed through cobra (credit goes to @fd0).
The gopts.ctx is cancelled when the main() method of restic exits.
Previously the global context was either accessed via gopts.ctx, stored in a local variable and then used within that function, or sometimes both. This made it very hard to follow which ctx, or a wrapped version of it, reaches which method. Thus, drop the context from the globalOptions struct and pass it explicitly to every command line handler method.
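For illustration, passing a single global context through cobra can look roughly like the sketch below; the command and function names are made up for this example, not restic's actual code:

```go
package main

import (
	"context"
	"os"
	"os/signal"

	"github.com/spf13/cobra"
)

// runBackup stands in for a command implementation; it receives the context
// explicitly instead of reading it from a global options struct.
func runBackup(ctx context.Context, args []string) error {
	_ = ctx // use ctx for all repository operations
	return nil
}

func main() {
	// The context is cancelled when main returns or on SIGINT.
	ctx, cancel := signal.NotifyContext(context.Background(), os.Interrupt)
	defer cancel()

	backupCmd := &cobra.Command{
		Use: "backup",
		RunE: func(cmd *cobra.Command, args []string) error {
			// cmd.Context() returns the context passed to ExecuteContext below.
			return runBackup(cmd.Context(), args)
		},
	}

	root := &cobra.Command{Use: "restic"}
	root.AddCommand(backupCmd)

	if err := root.ExecuteContext(ctx); err != nil {
		os.Exit(1)
	}
}
```

With `ExecuteContext`, every handler reads the context via `cmd.Context()`, so no context field on a global options struct is needed.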
Restic continued e.g. a backup task even when it failed to renew the lock or failed to do so in time. For example, if a backup client enters standby during the backup, this can allow other operations like `prune` to run in the meantime (after calling `unlock`). After leaving standby the backup client will continue its backup and upload indexes which refer to pack files that were removed in the meantime.

This commit introduces a goroutine that explicitly monitors for locks that are not refreshed in time. To simplify the implementation there is now a separate goroutine to refresh the lock and monitor for timeouts for each lock. The monitoring goroutine now causes the backup to fail, as the client has lost its lock in the meantime. The lock refresh goroutines are bound to the context used to lock the repository initially. The context returned by `lockRepo` is also cancelled when any of the goroutines exits. This ensures that the context is cancelled whenever, for any reason, the lock is no longer refreshed.
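A simplified sketch of this refresh/monitor pattern is shown below. One goroutine rewrites the lock file periodically, a second one cancels the returned context if no successful refresh happened before the stale deadline; names, parameters and structure are illustrative, not restic's actual implementation:

```go
package lock

import (
	"context"
	"time"
)

// startLockRefresh returns a context that is cancelled as soon as the lock
// can no longer be guaranteed to be fresh.
func startLockRefresh(ctx context.Context, interval, staleDeadline time.Duration,
	refresh func(context.Context) error) (context.Context, context.CancelFunc) {

	ctx, cancel := context.WithCancel(ctx)
	refreshed := make(chan struct{}, 1)

	// Goroutine 1: periodically rewrite the lock file.
	go func() {
		defer cancel() // if refreshing stops for any reason, stop the command too
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				if err := refresh(ctx); err != nil {
					return
				}
				select {
				case refreshed <- struct{}{}: // notify the monitoring goroutine
				default:
				}
			}
		}
	}()

	// Goroutine 2: fail if no refresh succeeded before the lock could go stale.
	go func() {
		defer cancel()
		timer := time.NewTimer(staleDeadline)
		defer timer.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-refreshed:
				if !timer.Stop() {
					select { // drain a pending expiry before resetting
					case <-timer.C:
					default:
					}
				}
				timer.Reset(staleDeadline)
			case <-timer.C:
				return // lock not refreshed in time: cancel the command's context
			}
		}
	}()

	return ctx, cancel
}
```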
The tests check that the wrapped context is properly cancelled whenever the repository is unlocked or when the lock refresh fails.
While searching for lock files from concurrently running restic instances, restic ignored unreadable lock files. These can either be in fact invalid or just temporarily unreadable. As it is not really possible to differentiate between the two cases, err on the side of caution and consider the repository as already locked. The code retries the search for other locks up to three times to smooth out temporarily unreadable lock files.
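The following sketch illustrates this "err on the side of caution" behavior: an unreadable lock file is reported as an existing lock, and the whole check is retried up to three times to smooth out transient read errors. The helper functions `listLocks` and `readLock` are assumptions for illustration only:

```go
package lock

import (
	"context"
	"fmt"
)

// checkForOtherLocks returns an error if any lock file exists that cannot be
// read, treating the repository as already locked in that case.
func checkForOtherLocks(ctx context.Context,
	listLocks func(context.Context) ([]string, error),
	readLock func(context.Context, string) error) error {

	const retries = 3
	var err error
	for i := 0; i < retries; i++ {
		err = func() error {
			ids, lerr := listLocks(ctx)
			if lerr != nil {
				return lerr
			}
			for _, id := range ids {
				if rerr := readLock(ctx, id); rerr != nil {
					// The file might belong to another running restic process,
					// so report the repository as locked instead of silently
					// ignoring the unreadable lock.
					return fmt.Errorf("lock %v is unreadable: %w", id, rerr)
				}
			}
			return nil
		}()
		if err == nil {
			return nil
		}
	}
	return err
}
```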
Monotonic timers are paused during standby, thus these timers won't fire after waking up. Fall back to periodic polling to detect overly large clock jumps. See golang/go#35012 for a discussion of Go timers during standby.
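A minimal sketch of such a polling fallback is shown below: a short-interval ticker keeps firing after wake-up, and each tick compares the wall clock against the deadline, so a jump across standby is detected even though the long monotonic timer never fired. The polling interval and function names are assumptions, not restic's code:

```go
package lock

import (
	"context"
	"time"
)

// pollDeadline calls onExpired once the wall clock has passed the deadline.
func pollDeadline(ctx context.Context, deadline time.Time, onExpired func()) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// Round(0) strips the monotonic reading so the comparison uses
			// the wall clock, which does advance during standby.
			if time.Now().Round(0).After(deadline.Round(0)) {
				onExpired()
				return
			}
		}
	}
}
```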
Force-pushed from 42e0e7a to 6d2d297.
It's not just virtual machines. I started seeing this on my physical servers a couple of days ago.
What does this PR change? What problem does it solve?
restic operations did not check that their lock is still valid. This can lead to a situation where, for example, a backup client is paused, then in the meantime `unlock` is called for the repository, which removes the now stale lock, and some time later the backup continues. If in the meantime `prune` was called, then this results in a broken snapshot.

This PR changes the locking behavior to actually be strict. If restic is unable to refresh locks in time, then the whole operation will be cancelled. This is done by tying the context used by restic's commands to the lock lifetime. If the monitoring goroutine for the lock detects that the lock file was not refreshed in time, then the context will be cancelled. Thereby, the command is forcibly terminated.
To keep the implementation simple there are now two goroutines per lock: one which periodically refreshes the lock file and one which monitors that the refresh happens in time. The time limit to refresh the lock file is a few minutes shorter than the duration after which a lock file becomes stale. This is intended to compensate for a small amount of clock drift between clients.
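For illustration, the relationship between the two durations could be expressed like this; the concrete values are assumptions for the sketch, not necessarily the ones restic uses:

```go
package lock

import "time"

const (
	// Other clients treat a lock file as stale after this duration.
	staleTimeout = 30 * time.Minute
	// Safety margin for a small amount of clock drift between clients.
	driftMargin = 5 * time.Minute
	// The lock must have been refreshed within this window, i.e. a few
	// minutes before anyone else would consider it stale.
	refreshDeadline = staleTimeout - driftMargin
)
```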
Most code changes revolve around cleaning up the usage of the global context previously available via `globalOptions.ctx`. Depending on the command, that context was either accessed via `gopts.ctx` or via a local copy in `ctx`. Alternatively, a few commands also introduced an additional context using `context.WithCancel`. The latter is unnecessary as the global context is cancelled when restic shuts down. In total, these different ways to use the context made it non-obvious which contexts have to be tied to the lock lifetime to ensure that the command properly terminates after a failed lock refresh.

This PR solves the "additional behavior changes" proposed in #2715. However, it does not check whether a lock file disappeared or not, as that could lead to race conditions with storage backends which do not provide strongly consistent directory listings.
Was the change previously discussed in an issue or on the forum?
Fixes #2715
Checklist
- There's a new file in `changelog/unreleased/` that describes the changes for our users (see template).
- I have run `gofmt` on the code in all commits.