Timers don't resume counting after system suspend/resume cycle #611

Closed
kpfleming opened this issue Jun 30, 2022 · 35 comments

@kpfleming

I've got zrepl configured this way:

global:
  logging:
    - type: syslog
      format: human
      level: warn

jobs:
  - name: daily
    type: snap
    filesystems:
      "bpool/boot<": true
      "rpool/root<": true
    snapshotting:
      type: periodic
      prefix: zrepl_
      interval: 24h
    pruning:
      keep:
        - type: regex
          regex: "^zrepl_.*"
          negate: true
        - type: grid
          regex: "^zrepl_.*"
          grid: 7x24h | 4x168h

I set it up for the first time on my laptop on 2022-06-25. After starting the service, it made a set of snapshots for the 'daily' job as expected. Running 'zrepl status' showed that the 'daily' job was waiting until the same time of day on 2022-06-26 to make the next set of snapshots.

On 2022-06-29, after having suspended/resumed the laptop a number of times, I ran 'zrepl status' to see what had been happening. The 'daily' job still showed that it was waiting until the chosen time on 2022-06-26, even though that time had passed three days prior.

@problame
Member

problame commented Jul 3, 2022

I tried reproducing your issue in a KVM VM that runs Ubuntu 22.04. No luck.
Steps:

  • Set up a snap job with a 10-second interval
  • Wait until it snaps
  • systemctl suspend
  • Wait 30 seconds
  • Hit a key on the VM console to make it wake up
  • See how many snaps are present, and what timestamps they have.
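
For reference, a rough shell sketch of that last check (hypothetical commands; the snapshot names depend on the configured prefix):

# before suspending: note which zrepl snapshots exist and when they were created
zfs list -t snapshot -o name,creation | grep zrepl_

systemctl suspend    # then wake the VM after ~30 seconds

# after resume: list again and compare names/timestamps
zfs list -t snapshot -o name,creation | grep zrepl_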

Results:

  • As soon as the VM wakes up, the snapjob creates a snapshot.
  • Exactly $interval after that, it takes the next snapshot.

Let's figure out why I can't repro your problem:

  1. What version of zrepl are you using?
  2. What Go version are you using to compile zrepl?
  3. What OS are you using (distro + exact kernel version please)
    3.1 If you are running a custom kernel, please post the kernel config as well.
  4. Did you suspend-to-RAM, hibernate-to-disk, or maybe suspend-then-hibernate?

@problame problame added this to the 0.6 milestone Jul 3, 2022
@kpfleming
Author

I've just checked my laptop, and it appears that zrepl made snapshots on most of the days I've used it recently, in spite of the suspend/resume cycles. I can't say for sure at this moment that there is an issue, but I'll keep an eye on it, and if I see that any snapshots got missed, I'll provide the additional details here. Thanks.

@problame
Member

problame commented Jul 4, 2022

So, the initial bug report was a false observation? Maybe only the zrepl status UI didn't update?

@kpfleming
Author

No, it wasn't just 'zrepl status'; the expected snapshots had not been made. However, there may have been other causes involved, not just the suspend/resume cycles.

I'll have a better idea of the situation in a few hours; the next snapshots are supposed to be made in about four hours, and I will intentionally leave the laptop suspended around that time and then resume it later in the evening (or tomorrow morning).

@kpfleming
Author

Alright, I brought the laptop back out of suspend about fifteen minutes after the timer was due to expire; it's been active for about 45 minutes since then, but 'zrepl status' still shows that the snap job is waiting for that original time to arrive, even though it passed more than 60 minutes ago.

However... this may be a clue: the laptop has undergone a timezone change. 'zrepl status' shows a date/time with -0500 (CDT), but the laptop is currently in -0400 (EDT). I suspect this also happened (in the other direction) the first time, since I started zrepl before leaving for the trip.

The remaining details you asked for:

  • zrepl version=v0.5.0 go=go1.17.6 GOOS=linux GOARCH=amd64 Compiler=gc
  • Debian Bookworm - 5.18.0-2-amd64 #1 SMP PREEMPT_DYNAMIC Debian 5.18.5-1 (2022-06-16) x86_64 GNU/Linux
  • suspend-to-RAM only (the laptop's LED flashes slowly while it is suspended); I do not have hibernation configured.

@kpfleming
Author

It's now been another 12 hours, with a few suspend/resume cycles, and zrepl is still waiting until '2022-07-04 18:29:26.325058544 -0500 -0500'.

@problame
Member

However... this may be a clue: the laptop has undergone a timezone change. 'zrepl status' shows a date/time with -0500 (CDT), but the laptop is currently in -0400 (EDT). I suspect this also happened (in the other direction) the first time, since I started zrepl before leaving for the trip.

  1. Hm, how did you apply the timezone change? systemd-timesyncd?
  2. Which timezone did you switch from / to? I presume -0500 to -0400?

I'll need to learn more about how timezone changes propagate to already-running processes like daemons...

And, was this a one-time timezone change or do you do this regularly? I'm asking to figure out why the problem didn't occur for a few days, as you noted in your comment 6 days ago.

@problame
Member

Note to self: It's a known issue that Go doesn't pick up timezone changes after first use of time: golang/go#28020

@problame
Member

Note to self: according to this blog post, the Go stdlib loads the timezone lazily, just for time.Time.String(), and it's actually supposed to be set explicitly by the app: https://www.dolthub.com/blog/2021-09-03-golang-time-bugs/#how-is-this-possible
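
For reference, the explicit-location approach the blog post suggests would look roughly like this (standalone sketch, not zrepl code; the zone name is made up, a real app would take it from configuration):

package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical zone name for illustration.
	loc, err := time.LoadLocation("America/New_York")
	if err != nil {
		panic(err)
	}
	// time.Local is the package-level location used for local-time formatting;
	// once set explicitly, later changes to /etc/localtime won't affect this process.
	time.Local = loc
	fmt.Println(time.Now())
}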

@problame
Member

@kpfleming if you're able to compile zrepl yourself, please try with this patch: 3cf4885
It adds debug logging that periodically dumps the remaining time until the snapshotter wakes up.
The interval is configurable via the env var ZREPL_SNAPPER_DEBUG_LOG_ISSUE_611 and defaults to once per minute.
It logs at error level, so you don't need to increase your log level.
That should make it easy to spot glitches / negative counts there.
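
For example, a hypothetical invocation that dumps the countdown every 10 seconds (assuming the variable takes a Go duration string):

ZREPL_SNAPPER_DEBUG_LOG_ISSUE_611=10s zrepl daemon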

@kpfleming
Author

I'll give that a try this week. Time zone changes are not a regular thing for me; I just happened to be traveling (out and back) the week I was installing and configuring zrepl.

@kpfleming
Author

The first timezone change (when I noticed the problem) was from -0400 to -0500; the second was returning from -0500 to -0400. The timezone changes were made by the GNOME control panel when the system noticed I was in a different physical location; I'm not sure how it actually applies the change to the system, though.

@kpfleming
Author

I'll get the patched version built later today. It appears that this issue is not related to timezone changes, as it has happened again in the last two days and my laptop has not undergone any timezone changes in that time.

@kpfleming
Author

Finally got a chance to build and install the iss-611 branch; I'll post some log snippets tomorrow, after there have been a few suspend/resume cycles (I'll also note when I suspended and resumed).

@kpfleming
Author

OK, the first suspend/resume cycle has been completed. The total suspend time was 15m24s, and the zrepl debug log shows that 'remaining' dropped by the same amount. More updates to follow after more cycles, and after 'remaining' is expected to pass zero.

@kpfleming
Author

It appears we have some 'unusual' log entries now:

Jul 16 15:40:03 balrog22 zrepl[1956]: [daily][snapshot][Uv38$+3Lx$MmvZ$MmvZ.EXe9]: DEBUG-LOG-ZREPL-ISSUE-611 sleep_until="2022-07-17 15:39:03.16319136 -0400 EDT m=+89301.791265751" remaining="23h58m59.692558612s"
Jul 16 18:46:19 balrog22 zrepl[1956]: [daily][snapshot][Uv38$+3Lx$MmvZ$MmvZ.EXe9]: DEBUG-LOG-ZREPL-ISSUE-611 sleep_until="2022-07-17 15:39:03.16319136 -0400 EDT m=+89301.791265751" remaining="23h57m59.419584376s"
Jul 16 18:47:20 balrog22 zrepl[1956]: [daily][snapshot][Uv38$+3Lx$MmvZ$MmvZ.EXe9]: DEBUG-LOG-ZREPL-ISSUE-611 remaining="23h56m59.146321317s" sleep_until="2022-07-17 15:39:03.16319136 -0400 EDT m=+89301.791265751"
Jul 16 18:48:20 balrog22 zrepl[1956]: [daily][snapshot][Uv38$+3Lx$MmvZ$MmvZ.EXe9]: DEBUG-LOG-ZREPL-ISSUE-611 sleep_until="2022-07-17 15:39:03.16319136 -0400 EDT m=+89301.791265751" remaining="23h55m58.873559271s"
Jul 16 18:49:20 balrog22 zrepl[1956]: [daily][snapshot][Uv38$+3Lx$MmvZ$MmvZ.EXe9]: DEBUG-LOG-ZREPL-ISSUE-611 sleep_until="2022-07-17 15:39:03.16319136 -0400 EDT m=+89301.791265751" remaining="23h54m58.599972213s"
Jul 16 18:50:21 balrog22 zrepl[1956]: [daily][snapshot][Uv38$+3Lx$MmvZ$MmvZ.EXe9]: DEBUG-LOG-ZREPL-ISSUE-611 remaining="23h53m58.326803996s" sleep_until="2022-07-17 15:39:03.16319136 -0400 EDT m=+89301.791265751"

The sleep_until values have an unusual suffix.

@problame
Member

Hm, that m= suffix is just the Go runtime's monotonic clock reading.

If the time has a monotonic clock reading, the returned string includes a final field "m=±<value>", where <value> is the monotonic clock reading formatted as a decimal number of seconds.
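
A quick way to see the two representations (a standalone snippet for illustration, not zrepl code):

package main

import (
	"fmt"
	"time"
)

func main() {
	now := time.Now()
	fmt.Println(now)          // prints "... m=+0.000123456": wall clock plus the monotonic reading
	fmt.Println(now.Round(0)) // Round(0) strips the monotonic reading, so no "m=" field is printed
}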

@kpfleming
Author

OK, I'll keep watching for failures to trigger at the proper time. Next scheduled run is at 16:38 -0400 today.

@kpfleming
Author

I'm still working on this issue, but I got waaaay behind on converting from my old replication system to zrepl, and it was concerning me that my fleet of machines didn't have solid backups in place. That's fixed as of this morning, so I'll start paying attention to the suspend/resume cycles again.

@kpfleming
Author

Finally caught it happening!

Jul 24 12:30:21 balrog22 zrepl[2564793]: [daily][snapshot][Uv38$lx+I$hJns$hJns.8+ie]: DEBUG-LOG-ZREPL-ISSUE-611 sleep_until="2022-07-25 06:56:55.071379812 -0400 EDT m=+94565.722551949" remaining="20h15m58.897634179s"
Jul 24 12:31:22 balrog22 zrepl[2564793]: [daily][snapshot][Uv38$lx+I$hJns$hJns.8+ie]: DEBUG-LOG-ZREPL-ISSUE-611 sleep_until="2022-07-25 06:56:55.071379812 -0400 EDT m=+94565.722551949" remaining="20h14m58.624734257s"
Jul 24 12:32:22 balrog22 zrepl[2564793]: [daily][snapshot][Uv38$lx+I$hJns$hJns.8+ie]: DEBUG-LOG-ZREPL-ISSUE-611 remaining="20h13m58.351775866s" sleep_until="2022-07-25 06:56:55.071379812 -0400 EDT m=+94565.722551949"
Jul 24 12:33:22 balrog22 zrepl[2564793]: [daily][snapshot][Uv38$lx+I$hJns$hJns.8+ie]: DEBUG-LOG-ZREPL-ISSUE-611 sleep_until="2022-07-25 06:56:55.071379812 -0400 EDT m=+94565.722551949" remaining="20h12m58.079112926s"
Jul 25 06:03:23 balrog22 zrepl[2564793]: [daily][snapshot][Uv38$lx+I$hJns$hJns.8+ie]: DEBUG-LOG-ZREPL-ISSUE-611 remaining="20h11m57.805964574s" sleep_until="2022-07-25 06:56:55.071379812 -0400 EDT m=+94565.722551949"
Jul 25 06:04:24 balrog22 zrepl[2564793]: [daily][snapshot][Uv38$lx+I$hJns$hJns.8+ie]: DEBUG-LOG-ZREPL-ISSUE-611 sleep_until="2022-07-25 06:56:55.071379812 -0400 EDT m=+94565.722551949" remaining="20h10m57.533599412s"
Jul 25 06:05:24 balrog22 zrepl[2564793]: [daily][snapshot][Uv38$lx+I$hJns$hJns.8+ie]: DEBUG-LOG-ZREPL-ISSUE-611 sleep_until="2022-07-25 06:56:55.071379812 -0400 EDT m=+94565.722551949" remaining="20h9m57.259629469s"
Jul 25 06:06:24 balrog22 zrepl[2564793]: [daily][snapshot][Uv38$lx+I$hJns$hJns.8+ie]: DEBUG-LOG-ZREPL-ISSUE-611 sleep_until="2022-07-25 06:56:55.071379812 -0400 EDT m=+94565.722551949" remaining="20h8m56.98692036s"
Jul 25 06:07:24 balrog22 zrepl[2564793]: [daily][snapshot][Uv38$lx+I$hJns$hJns.8+ie]: DEBUG-LOG-ZREPL-ISSUE-611 sleep_until="2022-07-25 06:56:55.071379812 -0400 EDT m=+94565.722551949" remaining="20h7m56.713989704s"
Jul 25 06:08:25 balrog22 zrepl[2564793]: [daily][snapshot][Uv38$lx+I$hJns$hJns.8+ie]: DEBUG-LOG-ZREPL-ISSUE-611 sleep_until="2022-07-25 06:56:55.071379812 -0400 EDT m=+94565.722551949" remaining="20h6m56.441223381s"

Laptop was put into suspend at 12:33 on Jul 24, with 20h12m58s left on the timer.

Laptop was resumed at 06:03 on Jul 25 (17h30m later), but there was still 20h11m57s left on the timer.

@kpfleming
Author

As expected, restarting zrepl dropped the time remaining to the expected amount.

@problame
Member

Hm, so, IIUC, this means the suspend/resume downtime wasn't accounted for in the Go runtime clock. 🤔

Just found this Go issue here:
golang/go#36141

Didn't have time to read through it in detail yet, but there are other issues referencing it that sound similar:

[screenshot of similar-sounding referencing issues omitted]

@problame
Member

In GoogleCloudPlatform/cloud-sql-proxy#1223 they remove the monotonic time reading before subtraction, so that the time package uses wall clock time instead of simply subtracting the monotonic time.

That would definitely fix it with the debug patch that you're currently running: 3cf4885

But I don't want to merge that patch, since using the timer is the right thing to do here.
So, I guess what happens is that after resume, it might or might not have taken one snapshot, but it then entered the wait state with a monotonic clock that was way off from the wall clock time, and armed the wait timer using

snapper.lastInvocation = time.Now()
...
snapper.sleepUntil = lastTick.Add(a.interval)
sleepUntil = snapper.sleepUntil
...
t := time.NewTimer(time.Until(sleepUntil))

time.Until looks like this:


// Until returns the duration until t.
// It is shorthand for t.Sub(time.Now()).
func Until(t Time) Duration {
	var now Time
	if t.wall&hasMonotonic != 0 {
		// Common case optimization: if t has monotonic time, then Sub will use only it.
		now = Time{hasMonotonic, runtimeNano() - startNano, nil}
	} else {
		now = Now()
	}
	return t.Sub(now)
}

t is sleepUntil here, and since it's derived from time.Now(), it has a monotonic clock reading.
So, time.Until takes the first branch and uses the runtime clock to compute the delta.

If we used time.Until(sleepUntil.Round(0)) to strip the monotonic clock reading, it would use t.Sub(now), which does a wall-clock comparison; that is the correct thing to do here.
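
A minimal standalone sketch of the difference (the one-hour interval is made up for illustration; this is not the actual snapper code):

package main

import (
	"fmt"
	"time"
)

func main() {
	interval := time.Hour // hypothetical snapshotting interval
	sleepUntil := time.Now().Add(interval)

	// sleepUntil carries a monotonic reading, so this subtracts monotonic
	// clocks; across a suspend, the monotonic clock may not have advanced.
	fmt.Println(time.Until(sleepUntil))

	// Round(0) strips the monotonic reading, forcing a wall-clock
	// subtraction that accounts for time spent suspended.
	fmt.Println(time.Until(sleepUntil.Round(0)))
}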

@kpfleming does the above check out with your observation, before you used the debug patch? Specifically

So, I guess what happens is that after resume, it might or might not have taken one snapshot, but it then entered the wait state with a monotonic clock that was way off from the wall clock time, and armed the wait timer using

If it does, then we have a plausible root cause.

@kpfleming
Author

I really don't know anything about how Go handles timestamps other than what you've posted in this thread, so I may not be able to provide much guidance here :-)

Your summary does make sense to me; we really should only be paying attention to the wall-clock time.

problame added a commit that referenced this issue Oct 9, 2022
See explainer comment in periodic.go for details.

fixes #611
@problame
Member

problame commented Oct 9, 2022

@kpfleming I posted a patch (it should show up right above this message on GitHub).
Please deploy it to your system and try it out.

Also / alternatively, here is a small Go program that showcases the problem.
Oddly enough, it doesn't showcase the problem on my machine, but maybe it will on yours...

package main

import (
	"bufio"
	"flag"
	"log"
	"os"
	"time"
)

func main() {

	flagDelta := flag.Duration("d", 0, "")
	flagRound := flag.Bool("r", false, "")
	flag.Parse()

	log.Printf("flagDelta=%v", *flagDelta)

	stdin := bufio.NewReader(os.Stdin)
	for i := 1; ; i++ {
		log.Printf("ITERATION %v", i)
		x := time.Now().Add(*flagDelta)
		if *flagRound {
			x = x.Round(0)
		}
		log.Printf("x: %s", x)
		log.Println("please suspend the system. Hit enter to resume this program")
		_, _, err := stdin.ReadLine()
		if err != nil {
			panic(err)
		}
		log.Printf("x: %s", x)
		log.Printf("time.Until(x): %v", time.Until(x).Seconds())
		log.Println("")
		log.Println("----")
	}

}

If you have time today we can sync up via Matrix or set up a call to debug this together.

@kpfleming
Author

Here's some log output from the test program, hopefully it is useful:

2022/10/09 08:45:58 flagDelta=0s
2022/10/09 08:45:58 ITERATION 1
2022/10/09 08:45:58 x: 2022-10-09 08:45:58.268989289 -0500 CDT m=+0.000145722
2022/10/09 08:45:58 please suspend the system. Hit enter to resume this program

2022/10/09 08:46:44 x: 2022-10-09 08:45:58.268989289 -0500 CDT m=+0.000145722
2022/10/09 08:46:44 time.Until(x): -27.246221551
2022/10/09 08:46:44 
2022/10/09 08:46:44 ----
2022/10/09 08:46:44 ITERATION 2
2022/10/09 08:46:44 x: 2022-10-09 08:46:44.168356104 -0500 CDT m=+27.246385848
2022/10/09 08:46:44 please suspend the system. Hit enter to resume this program

2022/10/09 08:48:23 x: 2022-10-09 08:46:44.168356104 -0500 CDT m=+27.246385848
2022/10/09 08:48:23 time.Until(x): -47.852487978
2022/10/09 08:48:23 
2022/10/09 08:48:23 ----
2022/10/09 08:48:23 ITERATION 3
2022/10/09 08:48:23 x: 2022-10-09 08:48:23.757363821 -0500 CDT m=+75.098891241
2022/10/09 08:48:23 please suspend the system. Hit enter to resume this program

2022/10/09 08:52:20 x: 2022-10-09 08:48:23.757363821 -0500 CDT m=+75.098891241
2022/10/09 08:52:20 time.Until(x): -54.301496765
2022/10/09 08:52:20 
2022/10/09 08:52:20 ----
2022/10/09 08:52:20 ITERATION 4
2022/10/09 08:52:20 x: 2022-10-09 08:52:20.259585096 -0500 CDT m=+129.400408753
2022/10/09 08:52:20 please suspend the system. Hit enter to resume this program

I'll start it up again later today when I can leave the system suspended for a longer period of time, and also overnight tonight in case the midnight rollover is some kind of factor.

@problame
Member

problame commented Oct 9, 2022

I refined the demo a little bit and made it more self-explanatory:

package main

import (
	"bufio"
	"flag"
	"log"
	"os"
	"time"
)

func main() {

	flagRound := flag.Bool("round", false, "")
	flag.Parse()

	stdin := bufio.NewReader(os.Stdin)
	log.Printf("round=%v", *flagRound)
	x := time.Now()
	if *flagRound {
		x = x.Round(0)
	}
	log.Printf("(1) x: %s", x)
	log.Println("Please suspend the system (systemctl suspend).")
	log.Println("Keep system suspended for some time (say, 40 seconds).")
	log.Println("Resume the system and hit ENTER in this terminal")
	_, _, err := stdin.ReadLine()
	if err != nil {
		panic(err)
	}
	log.Println("Post resume:")
	log.Printf("(2) x: %s", x)
	log.Printf("(3) time.Until(x): %v", time.Until(x).Seconds())
	log.Println("")
	log.Println("Look at the log timestamps of messages (1) and (2).")
	log.Println("The delta between (2) - (1) is the wall clock time delta.")
	log.Println("Now look at the time.Until result printed in (3):")
	log.Println("=> with `-round=false`, the value will be a lot smaller than the wall clock time delta.")
	log.Println("=> with `-round=true`, the value will be about the same as wall clock time delta.")

}

First, in round=false mode:

cs@cslap:[~/zrepl/zrepl/iss-611]: go run . -round=false

2022/10/09 18:48:19 round=false
2022/10/09 18:48:19 (1) x: 2022-10-09 18:48:19.070628235 +0200 CEST m=+0.000076543
2022/10/09 18:48:19 Please suspend the system (systemctl suspend).
2022/10/09 18:48:19 Keep system suspended for some time (say, 40 seconds).
2022/10/09 18:48:19 Resume the system and hit ENTER in this terminal

2022/10/09 18:49:09 Post resume:
2022/10/09 18:49:09 (2) x: 2022-10-09 18:48:19.070628235 +0200 CEST m=+0.000076543
2022/10/09 18:49:09 (3) time.Until(x): -15.551957229
2022/10/09 18:49:09 
2022/10/09 18:49:09 Look at the log timestamps of messages (1) and (2).
2022/10/09 18:49:09 The delta between (2) - (1) is the wall clock time delta.
2022/10/09 18:49:09 Now look at the time.Until result printed in (3):
2022/10/09 18:49:09 => with `-round=false`, the value will be a lot smaller than the wall clock time delta.
2022/10/09 18:49:09 => with `-round=true`, the value will be about the same as wall clock time delta.
  • Delta between the log timestamps of (2) - (1) = 18:49:09 - 18:48:19 = 50 seconds
  • But according to time.Until, only 15.55 seconds passed.

Now in round=true mode:

cs@cslap:[~/zrepl/zrepl/iss-611]: go run . -round=true
2022/10/09 18:49:36 round=true
2022/10/09 18:49:36 (1) x: 2022-10-09 18:49:36.96528956 +0200 CEST
2022/10/09 18:49:36 Please suspend the system (systemctl suspend).
2022/10/09 18:49:36 Keep system suspended for some time (say, 40 seconds).
2022/10/09 18:49:36 Resume the system and hit ENTER in this terminal

2022/10/09 18:50:46 Post resume:
2022/10/09 18:50:46 (2) x: 2022-10-09 18:49:36.96528956 +0200 CEST
2022/10/09 18:50:46 (3) time.Until(x): -70.011943975
2022/10/09 18:50:46 
2022/10/09 18:50:46 Look at the log timestamps of messages (1) and (2).
2022/10/09 18:50:46 The delta between (2) - (1) is the wall clock time delta.
2022/10/09 18:50:46 Now look at the time.Until result printed in (3):
2022/10/09 18:50:46 => with `-round=false`, the value will be a lot smaller than the wall clock time delta.
2022/10/09 18:50:46 => with `-round=true`, the value will be about the same as wall clock time delta.
  • Delta (2) - (1) = 18:50:46 - 18:49:36 = 70 seconds
  • And time.Until also reports that (70.011)

problame added a commit that referenced this issue Oct 9, 2022
See explainer comment in periodic.go for details.

fixes #611
@kpfleming
Author

Confirmed, I get the same behavior as you indicated: -round=true results in a value very close to the wall-clock time delta.

kpfleming@balrog22:~/Downloads$ /usr/local/go/bin/go run issue-611.go -round=false
2022/10/09 17:14:46 round=false
2022/10/09 17:14:46 (1) x: 2022-10-09 17:14:46.787277722 -0500 CDT m=+0.000152627
2022/10/09 17:14:46 Please suspend the system (systemctl suspend).
2022/10/09 17:14:46 Keep system suspended for some time (say, 40 seconds).
2022/10/09 17:14:46 Resume the system and hit ENTER in this terminal

2022/10/09 17:15:48 Post resume:
2022/10/09 17:15:48 (2) x: 2022-10-09 17:14:46.787277722 -0500 CDT m=+0.000152627
2022/10/09 17:15:48 (3) time.Until(x): -25.011443242
2022/10/09 17:15:48 
2022/10/09 17:15:48 Look at the log timestamps of messages (1) and (2).
2022/10/09 17:15:48 The delta between (2) - (1) is the wall clock time delta.
2022/10/09 17:15:48 Now look at the time.Until result printed in (3):
2022/10/09 17:15:48 => with `-round=false`, the value will be a lot smaller than the wall clock time delta.
2022/10/09 17:15:48 => with `-round=true`, the value will be about the same as wall clock time delta.
kpfleming@balrog22:~/Downloads$ /usr/local/go/bin/go run issue-611.go -round=true
2022/10/09 17:16:11 round=true
2022/10/09 17:16:11 (1) x: 2022-10-09 17:16:11.845913067 -0500 CDT
2022/10/09 17:16:11 Please suspend the system (systemctl suspend).
2022/10/09 17:16:11 Keep system suspended for some time (say, 40 seconds).
2022/10/09 17:16:11 Resume the system and hit ENTER in this terminal

2022/10/09 17:17:18 Post resume:
2022/10/09 17:17:18 (2) x: 2022-10-09 17:16:11.845913067 -0500 CDT
2022/10/09 17:17:18 (3) time.Until(x): -66.682650027
2022/10/09 17:17:18 
2022/10/09 17:17:18 Look at the log timestamps of messages (1) and (2).
2022/10/09 17:17:18 The delta between (2) - (1) is the wall clock time delta.
2022/10/09 17:17:18 Now look at the time.Until result printed in (3):
2022/10/09 17:17:18 => with `-round=false`, the value will be a lot smaller than the wall clock time delta.
2022/10/09 17:17:18 => with `-round=true`, the value will be about the same as wall clock time delta.

@kpfleming
Author

Well, bad news. I built the iss-611-fix branch, installed it, and started it. With the snapshot interval set to 1h, it made a snapshot at startup. I then waited about 10 minutes and suspended the laptop.

I left it suspended for about 30 minutes, then resumed it and watched 'zrepl status' until the target time arrived. It passed, and no snapshot occurred. Restarting zrepl caused a snapshot to be created immediately.

I'll suspend again now after changing the snapshot interval to 4h, and will resume in the morning to see what happens.

@kpfleming
Author

Resumed the laptop this morning, well after the 4h snapshot interval should have expired, but 'zrepl status' still shows it waiting to make the snapshot scheduled for almost 10 hours ago. I've double-checked that I built the binary from the iss-611-fix branch and that I'm running that binary.

problame added a commit that referenced this issue Oct 22, 2022
See explainer comment in periodic.go for details.

fixes #611
problame added a commit that referenced this issue Oct 22, 2022
See explainer comment in periodic.go for details.

fixes #611
@problame
Member

@kpfleming I just force-pushed 88dfbed to the branch.
Can you please repeat the experiment from two weeks ago?

I did on my machine, and with the fix, I can't repro the issue. Without the fix, I can repro it.

@kpfleming
Author

I won't be able to leave this in place long enough to test it unless I also upgrade the receiver, since there's a new protocol version in the branch... and the receiver replicates to a third system, so it would also have to be upgraded. Are there any other changes in this branch I should be concerned about?

@problame
Member

It's master + the fix. Once this issue is resolved I intend to release it, so you'd be running a release candidate. I have run it on my personal systems for weeks already.

problame added a commit that referenced this issue Oct 24, 2022
See explainer comment in periodic.go for details.

fixes #611
@kpfleming
Author

Initial indications are that this branch does fix the suspend-related problems. I'll have a better answer in the morning.

@kpfleming
Author

We have success! I started zrepl built from commit a187fa3 around 8:30 PM last night, with the snap interval set to 12 hours. It made a snap on startup as expected.

Between then and now, the laptop was suspended/resumed multiple times, including ~8 hours overnight. At the appropriate time this morning a new snap was made, and it is now waiting another 12 hours for the next one.

Thanks for sticking with this :-)

problame added a commit that referenced this issue Oct 26, 2022
See explainer comment in periodic.go for details.

fixes #611