Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Last Heartbeat and Status are lost after app restart #53

Closed
przemeklal opened this issue Oct 24, 2023 · 7 comments · Fixed by #66
Closed

Last Heartbeat and Status are lost after app restart #53

przemeklal opened this issue Oct 24, 2023 · 7 comments · Fixed by #66

Comments

@przemeklal
Copy link
Member

Bug Description

A restart of cos-alerter results in the loss of this information as it is stored only in memory.

The last heartbeat timestamp is useful to store for tracking purposes, e.g. when waiting for firewall change to unblock connectivity.

Status persistence is even more important, in this case in particular:

  1. In the config:
wait_for_first_connection: true
  1. Client 1 (alertmanager) is up, then goes down, and alerts are being sent every repeat_interval.
  2. cos-alerter is restarted (e.g. to add new client).
  3. Client 1 status changes to Unknown. No alerts are sent, information about the last heartbeat is lost.

To Reproduce

Please see the example use case above.

Environment

Snap:

cos-alerter  0.4.0          1      latest/stable  0x12b       -

Relevant log output

No relevant logs.

Additional context

No response

@dstathis
Copy link
Contributor

I am not so sure this is a good idea. There are two issues to consider.

  1. The behavior of this is unclear. Let's say we have a down_interval of 5m. If COS alerter is stopped for 10 minutes, what should happen when it is restarted? Should every client be assumed to be down? This would cause a lot of false positives.
  2. COS Alerter is intended to be lightweight. In order to maintain good performance while writing everything to disk, we would likely need to add a dependency such as Redis. While not a major dependency, I would still prefer to avoid it.

@przemeklal
Copy link
Member Author

Thanks. Yes, I understand it's a tricky one and these are excellent arguments against implementing this feature.

We would have to stick to wait_for_first_connection: false and be prepared to lose status/last heartbeat any time there's e.g. cos-alerter snap auto-refresh. Without persistence, wait_for_first_connection: true is a no-go because there's a chance that things that kept alerting will stop after the app restart.

  1. It could be worked around by alerting only if the client hasn't sent any alert in the past 5m and cos-alerter has been running for more than 5m.
  2. Agreed, going with redis is going to be a challenge from an operator's point of view as well.

Still, the last heartbeat would be nice to have for troubleshooting e.g. when network issues started.

Let's get an opinion from @vsedelnik as well.

@przemeklal
Copy link
Member Author

@dstathis As suggested by @vsedelnik in a private channel, maybe dumping the state during a graceful shutdown and trying to load it up on the next startup in the best effort would be a sensible compromise.

@dstathis
Copy link
Contributor

We discussed this a bit further and came up with what we believe to be a good solution. We can persist the wait_for_first_connection info only. The benefit of this over persisting all the data is it changes infrequently so we can always write it to disk immediately. This would mean we would not be dependent on graceful shutdown to run correctly.

@dstathis
Copy link
Contributor

dstathis commented Feb 5, 2024

A side affect of this would be that if cos-alerter is down for down-interval (default 5m), it will alert for every client. @przemeklal Do you find this to be an acceptable tradeoff?

@simskij
Copy link
Member

simskij commented Feb 5, 2024

dumping the state during a graceful shutdown and trying to load it up on the next startup

this is a reasonable thing we should likely also do, @dstathis

@dstathis dstathis mentioned this issue Feb 21, 2024
@dstathis
Copy link
Contributor

Okay so I opened a PR which keeps the state on graceful shutdown. Does this solve your main concerns, or do we need to save some state on unexpected shutdowns as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants