Timers don't resume counting after system suspend/resume cycle #611
Comments
I tried reproducing your issue in a KVM VM that runs Ubuntu 22.04. No luck.
Results:
Let's figure out why I can't repro your problem:
|
I've just checked my laptop and it appears that zrepl made snapshots on most of the days I've used it recently, in spite of the suspend/resume cycles. I can't say for sure at this moment that there is an issue, but I'll keep an eye on it and if I see that any snapshots got missed I'll provide the additional details here. Thanks. |
So, the initial bug report was a false observation? Maybe only zrepl status UI didn't update? |
No, it wasn't just the 'zrepl status', the expected snapshots had not been made. However, there may have been other reasons that occurred, not just the suspend/resume cycles. I'll have a better idea of the situation in a few hours, as the next snapshots are supposed to be made in about four hours and I will intentionally leave the laptop suspended around that time and then resume it later in the evening (or tomorrow morning). |
Alright, I brought the laptop back out of suspend about fifteen minutes after the timer was to expire; it's been active for about 45 minutes since then, but 'zrepl status' still shows that the snap job is waiting for that original time to arrive, even though that time passed more than 60 minutes ago. However... this may be a clue: the laptop has undergone a timezone change. 'zrepl status' shows a date/time with -0500 (CDT), but the laptop is currently in -0400 (EDT). I suspect this also happened (in the other direction) the first time, since I started zrepl before leaving for the trip. The remaining details you asked for:
|
It's now been another 12 hours, with a few suspend/resume cycles, and zrepl is still waiting until '2022-07-04 18:29:26.325058544 -0500 -0500'. |
I'll need to learn more about how timezone changes propagate to already-running processes like daemons... And, was this a one-time timezone change or do you do this regularly? I'm asking to figure out why the problem didn't occur for a few days when you posted this comment 6 days ago. |
Note to self: It's a known issue that Go doesn't pick up timezone changes after first use of time: golang/go#28020 |
Note to self: according to this blog post, Go stdlib loads timezone lazily just for time.Time.String(), and actually it's supposed to be set explicitly by the app? https://www.dolthub.com/blog/2021-09-03-golang-time-bugs/#how-is-this-possible |
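As background for the time zone question: in Go, a `time.Time` identifies an instant, and the location only affects how that instant is printed. The minimal sketch below uses the wakeup timestamp quoted earlier in this thread to show that the same instant renders differently at -0500 and -0400 but still compares equal. The helper names (`scheduledInstant`, `renderIn`, `sameInstant`) are made up for illustration, and fixed offsets stand in for the real CDT/EDT zones.

```go
package main

import (
	"fmt"
	"time"
)

const layout = "2006-01-02 15:04:05 -0700"

// scheduledInstant returns the wakeup time reported in this thread,
// expressed at the offset the daemon was started with (-0500).
func scheduledInstant() time.Time {
	started := time.FixedZone("", -5*60*60)
	return time.Date(2022, time.July, 4, 18, 29, 26, 325058544, started)
}

// renderIn formats t as seen from a fixed UTC offset given in hours.
func renderIn(t time.Time, offsetHours int) string {
	return t.In(time.FixedZone("", offsetHours*60*60)).Format(layout)
}

// sameInstant reports whether t still denotes the same instant after
// being converted to another fixed offset.
func sameInstant(t time.Time, offsetHours int) bool {
	return t.Equal(t.In(time.FixedZone("", offsetHours*60*60)))
}

func main() {
	t := scheduledInstant()
	fmt.Println(renderIn(t, -5)) // 2022-07-04 18:29:26 -0500
	fmt.Println(renderIn(t, -4)) // 2022-07-04 19:29:26 -0400
	fmt.Println(sameInstant(t, -4))
}
```

If this holds, a stale time zone in the `zrepl status` output would be a display problem only and would not by itself explain a wakeup that never fires.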
@kpfleming if you're able to compile zrepl yourself, please try with this patch: 3cf4885 |
I'll give that a try this week. Time zone changes are not a regular thing for me, I just happened to be traveling (out and back) the week that I was installing and configuring zrepl. |
The first timezone change (when I noticed the problem) was from -0400 to -0500; the second was returning from -0500 to -0400. The timezone changes were made by the GNOME control panel when the system noticed I was in a different physical location; I'm not sure how it actually makes the change to the system though. |
I'll get the patched version built later today. It appears that this issue is not related to timezone changes, as it has happened again in the last two days and my laptop has not undergone any timezone changes in that time. |
Finally got a chance to build and install the iss-611 branch; I'll post some log snippets tomorrow, after there have been a few suspend/resume cycles (I'll also note when I suspended and resumed). |
OK, the first suspend/resume cycle has been completed. The total suspend time was 15:24, and the zrepl debug log shows that |
It appears we have some 'unusual' log entries now:
The |
Hm that
|
OK, I'll keep watching for failures to trigger at the proper time. Next scheduled run is at 16:38 -0400 today. |
I'm still working on this issue, but I got waaaay behind on converting from my old replication system to zrepl, and it was concerning me that my fleet of machines didn't have solid backups in place. That's fixed as of this morning, so I'll start paying attention to the suspend/resume cycles again. |
Finally caught it happening!
Laptop was put into suspend at 12:33 on Jul 24, with 20h12m58s left on the timer. Laptop was resumed at 06:03 on Jul 25 (17h30m later), but there was still 20h11m57s left on the timer. |
As expected, restarting zrepl dropped the time remaining to the expected amount. |
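To make the numbers in the report above concrete, here is a small sketch of the arithmetic. It only uses the three durations quoted in that comment; `mustParse` and `expectedRemaining` are hypothetical helpers for this example.

```go
package main

import (
	"fmt"
	"time"
)

// mustParse parses a duration literal and panics on malformed input.
// Helper for this example only.
func mustParse(s string) time.Duration {
	d, err := time.ParseDuration(s)
	if err != nil {
		panic(err)
	}
	return d
}

// expectedRemaining is what the timer should show after resume if the
// time spent suspended counted towards the wait.
func expectedRemaining(atSuspend, suspended time.Duration) time.Duration {
	return atSuspend - suspended
}

func main() {
	atSuspend := mustParse("20h12m58s") // remaining when the laptop was suspended
	suspended := mustParse("17h30m")    // 12:33 on Jul 24 to 06:03 on Jul 25
	observed := mustParse("20h11m57s")  // remaining right after resume

	fmt.Println("expected remaining:", expectedRemaining(atSuspend, suspended)) // 2h42m58s
	fmt.Println("timer advanced by: ", atSuspend-observed)                      // 1m1s
}
```

The timer moved forward by about a minute while 17h30m of wall-clock time passed, which is consistent with the suspended time not being counted at all.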
Hm, so, IIUC, this means the suspend/resume downtime wasn't accounted for in the Go runtimeClock. 🤔 Just found this Go issue here: Didn't have time to read through it in detail yet, but there are other issues referencing it that sound similar: |
In GoogleCloudPlatform/cloud-sql-proxy#1223 they remove the monotonic time reading before subtraction, so that the That would definitely fix it with the debug patch that you're currently running (3cf4885). But I don't want to merge that patch, since using the timer is the right thing to do here.
time.Until looks like this:
If we used @kpfleming does the above check out with your observation, before you used the debug patch? Specifically
If it does, then we have a plausible root cause. |
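The "remove the monotonic time reading before subtraction" idea mentioned above can be sketched with `Time.Round(0)`, which the `time` package documents as the way to strip the monotonic reading. This is an illustration under assumptions, not the zrepl patch: `hasMonotonic`, `wallClockUntil`, and `demo` are hypothetical names, and detecting the reading via the ` m=` suffix of `Time.String` relies on that method's debugging output.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// hasMonotonic reports whether t carries a monotonic clock reading.
// Time.String appends an " m=±<value>" field when it does.
func hasMonotonic(t time.Time) bool {
	return strings.Contains(t.String(), " m=")
}

// wallClockUntil returns the time from now to deadline using wall-clock
// readings only. Round(0) strips the monotonic reading from both values,
// so Sub cannot fall back to the monotonic clock.
func wallClockUntil(deadline, now time.Time) time.Duration {
	return deadline.Round(0).Sub(now.Round(0))
}

// demo builds a deadline one hour ahead and reports whether the deadline
// has a monotonic reading before and after stripping, plus the
// wall-clock distance to it.
func demo() (before, after bool, d time.Duration) {
	now := time.Now()
	deadline := now.Add(time.Hour)
	return hasMonotonic(deadline), hasMonotonic(deadline.Round(0)), wallClockUntil(deadline, now)
}

func main() {
	before, after, d := demo()
	fmt.Println("monotonic reading before Round(0):", before)
	fmt.Println("monotonic reading after Round(0): ", after)
	fmt.Println("wall-clock time until deadline:   ", d)
}
```

The trade-off is that a pure wall-clock comparison follows clock adjustments (NTP steps, manual changes), which is exactly what the monotonic reading is meant to shield against. For a snapshot schedule expressed in wall-clock terms that seems acceptable, but it is a design choice rather than a free win.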
I really don't know anything about how Go handles timestamps other than what you've posted in this thread, so I may not be able to provide much guidance here :-) Your summary does make sense to me, we really should only be paying attention to the wall-clock time. |
See explainer comment in periodic.go for details. fixes #611
@kpfleming I posted a patch (it should show up right above this message on GitHub). Also / alternatively, here is a small Go program that showcases the problem.
If you have time today we can sync up via Matrix or set up a call to debug this together. |
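For readers following along, here is a minimal sketch of one way to wait for a wall-clock deadline so that time spent suspended still counts. It is neither the demo program mentioned above nor the actual patch; `sleepUntilWallClock`, `demo`, and the poll interval are assumptions made for illustration. The idea is to wake up periodically and compare wall-clock readings instead of arming a single long timer.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// sleepUntilWallClock blocks until the wall clock reaches deadline or ctx
// is done. It polls every poll interval, so after a resume it notices a
// passed deadline within roughly one interval.
func sleepUntilWallClock(ctx context.Context, deadline time.Time, poll time.Duration) error {
	deadline = deadline.Round(0) // drop the monotonic reading
	ticker := time.NewTicker(poll)
	defer ticker.Stop()
	for {
		// Compare wall-clock readings only.
		if !time.Now().Round(0).Before(deadline) {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// demo waits for a deadline 50ms ahead and reports how long that took.
func demo() (time.Duration, error) {
	start := time.Now()
	err := sleepUntilWallClock(context.Background(), start.Add(50*time.Millisecond), 10*time.Millisecond)
	return time.Since(start), err
}

func main() {
	elapsed, err := demo()
	fmt.Println("woke up after", elapsed, "err:", err)
}
```

A shorter poll interval reduces the delay after resume at the cost of more wakeups; for snapshot intervals measured in hours, an interval of several seconds would probably be fine.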
Here's some log output from the test program, hopefully it is useful:
I'll start it up again later today when I can leave the system suspended for a longer period of time, and also overnight tonight in case the midnight rollover is some kind of factor. |
I refined the demo a little bit and made it more self-explanatory:
Now in
|
Confirmed, I get the same behavior as you indicated.
|
Well, bad news. I built the iss-611-fix branch, installed it, and started it. With the snapshot period set to 1hr, it made a snapshot at startup. I then waited about 10 minutes, and suspended the laptop. I left it suspended for about 30 minutes, then resumed it and watched 'zrepl status' until the target time arrived. It passed, and no snapshot occurred. Restarting zrepl caused a snapshot to be created immediately. I'll suspend again now after changing the snapshot interval to 4h, and will resume in the morning to see what happens. |
Resumed the laptop this morning, well after the 4h snapshot interval should have expired, but 'zrepl status' still shows it waiting to make the snapshot scheduled for almost 10 hours ago. I've double-checked that I built the binary from the iss-611-fix branch and I'm running that binary. |
@kpfleming I just force-pushed 88dfbed to the branch. I did on my machine, and with the fix, I can't repro the issue. Without the fix, I can repro it. |
I won't be able to leave this in place for long enough to test it unless I also upgrade the receiver since there's a new protocol version in the branch... and the receiver replicates to a third system so it would also have to be upgraded. Are there any other changes in this branch I should be concerned about? |
It’s master + the fix. Once this issue is resolved I intend to release it. So, you’d be running a release candidate. I have run it on my personal systems for weeks already |
Initial indications are that this branch does fix the suspend-related problems. I'll have a better answer in the morning. |
We have success! I started up zrepl built from the a187fa3 commit around 8:30PM last night, with the snap interval set to 12 hours. It made a snap on startup as expected. In between that time and now, the laptop was suspended/resumed multiple times, including ~8 hours overnight. At the appropriate time this morning a new snap was made and it is now waiting another 12 hours for another snap. Thanks for sticking with this :-) |
I've got zrepl configured this way:
I set it up for the first time on my laptop on 2022-06-25. After starting the service, it made a set of snapshots for the 'daily' job as expected. Running 'zrepl status' showed that the 'daily' job was waiting until the same time of day on 2022-06-26 to make the next set of snapshots.
On 2022-06-29, after having suspended/resumed the laptop a number of times, I ran 'zrepl status' to see what had been happening. The 'daily' job still showed that it was waiting until the chosen time on 2022-06-26, even though that time had passed three days prior.