Last Heartbeat and Status are lost after app restart #53

przemeklal · 2023-10-24T08:05:57Z

Bug Description

A restart of cos-alerter results in the loss of this information as it is stored only in memory.

The last heartbeat timestamp is useful to keep for tracking purposes, e.g. when waiting for a firewall change to unblock connectivity.

Status persistence is even more important, in this case in particular:

In the config:

wait_for_first_connection: true

Client 1 (alertmanager) is up, then goes down, and alerts are being sent every repeat_interval.
cos-alerter is restarted (e.g. to add a new client).
Client 1 status changes to Unknown. No alerts are sent, information about the last heartbeat is lost.

To Reproduce

Please see the example use case above.

Environment

Snap:

cos-alerter  0.4.0          1      latest/stable  0x12b       -

Relevant log output

No relevant logs.

Additional context

No response


dstathis · 2023-10-25T13:15:58Z

I am not so sure this is a good idea. There are two issues to consider.

The behavior of this is unclear. Let's say we have a down_interval of 5m. If COS alerter is stopped for 10 minutes, what should happen when it is restarted? Should every client be assumed to be down? This would cause a lot of false positives.
COS Alerter is intended to be lightweight. In order to maintain good performance while writing everything to disk, we would likely need to add a dependency such as Redis. While not a major dependency, I would still prefer to avoid it.

przemeklal · 2023-10-25T13:45:22Z

Thanks. Yes, I understand it's a tricky one and these are excellent arguments against implementing this feature.

We would have to stick to wait_for_first_connection: false and be prepared to lose status/last heartbeat any time there's e.g. cos-alerter snap auto-refresh. Without persistence, wait_for_first_connection: true is a no-go because there's a chance that things that kept alerting will stop after the app restart.

It could be worked around by alerting only if the client hasn't sent any alert in the past 5m and cos-alerter has been running for more than 5m.
Agreed, going with redis is going to be a challenge from an operator's point of view as well.
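The workaround described above (alert only when the client has been silent for 5m and cos-alerter itself has been up for more than 5m) could be sketched roughly as follows. This is only an illustration of the idea, not cos-alerter code: `should_alert`, `process_start`, and `DOWN_INTERVAL` are hypothetical names, and the sketch deliberately ignores `wait_for_first_connection`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed value, matching the 5m used in the discussion.
DOWN_INTERVAL = timedelta(minutes=5)


def should_alert(
    now: datetime,
    process_start: datetime,
    last_heartbeat: Optional[datetime],
    down_interval: timedelta = DOWN_INTERVAL,
) -> bool:
    """Decide whether a client should be reported as down.

    Hypothetical helper: suppresses alerts during a grace period after
    startup, so that state lost in a restart cannot cause false positives.
    """
    # Grace period: the process has not been up long enough to judge.
    if now - process_start <= down_interval:
        return False
    # No heartbeat since startup, and the grace period is over.
    if last_heartbeat is None:
        return True
    # Normal case: the client has been silent for at least down_interval.
    return now - last_heartbeat >= down_interval
```

With this rule, a client that was already down before the restart starts alerting again once the grace period has passed, without any state having to be stored on disk.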

Still, the last heartbeat would be nice to have for troubleshooting e.g. when network issues started.

Let's get an opinion from @vsedelnik as well.

przemeklal · 2023-10-25T14:06:56Z

@dstathis As suggested by @vsedelnik in a private channel, maybe dumping the state during a graceful shutdown and trying to load it on the next startup, on a best-effort basis, would be a sensible compromise.
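A minimal sketch of that compromise, assuming the state fits in a JSON-serializable dict. The file name, the function names, and the state layout are all hypothetical and are not taken from cos-alerter.

```python
import json
import os
import signal
import sys
import tempfile
from pathlib import Path

# Hypothetical location; a real deployment would pick a proper data directory.
STATE_FILE = Path("cos-alerter-state.json")


def dump_state(state: dict, path: Path = STATE_FILE) -> None:
    """Write the state to a temp file, then rename it over the target."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # a crash mid-write leaves the old file intact


def load_state(path: Path = STATE_FILE) -> dict:
    """Best-effort load: any missing or unreadable file means 'start fresh'."""
    try:
        with open(path) as f:
            data = json.load(f)
    except (OSError, ValueError):
        return {}
    return data if isinstance(data, dict) else {}


def install_shutdown_hook(state: dict, path: Path = STATE_FILE) -> None:
    """Dump the state when the process receives SIGTERM or SIGINT."""

    def handler(signum, frame):
        dump_state(state, path)
        sys.exit(0)

    # signal.signal must be called from the main thread.
    signal.signal(signal.SIGTERM, handler)
    signal.signal(signal.SIGINT, handler)
```

Nothing is written on a crash or `SIGKILL`, so after an unclean stop the loader finds no file, or a file left by an earlier graceful shutdown. A caller might compare timestamps in the loaded state against the current time before trusting it.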

dstathis · 2023-10-30T12:51:42Z

We discussed this a bit further and came up with what we believe to be a good solution. We can persist the wait_for_first_connection info only. The benefit of this over persisting all the data is it changes infrequently so we can always write it to disk immediately. This would mean we would not be dependent on graceful shutdown to run correctly.
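As a rough illustration of why this variant does not need a graceful shutdown: the set of clients that have connected at least once changes only on a client's first heartbeat, so it can be flushed to disk at that moment. The class and file layout below are hypothetical, not the actual implementation.

```python
import json
import os
from pathlib import Path


class FirstConnectionStore:
    """Persists which clients have connected at least once (hypothetical)."""

    def __init__(self, path: Path):
        self.path = Path(path)
        self.seen = self._load()

    def _load(self) -> set:
        # Best effort: a missing or corrupt file means no client is known yet.
        try:
            return set(json.loads(self.path.read_text()))
        except (OSError, ValueError, TypeError):
            return set()

    def mark_connected(self, client_id: str) -> bool:
        """Record a first connection; returns True only if a write happened."""
        if client_id in self.seen:
            return False  # already recorded, no disk I/O on later heartbeats
        self.seen.add(client_id)
        tmp = self.path.with_name(self.path.name + ".tmp")
        with open(tmp, "w") as f:
            json.dump(sorted(self.seen), f)
            f.flush()
            os.fsync(f.fileno())  # make sure the data is on disk before renaming
        os.replace(tmp, self.path)
        return True
```

Only the first heartbeat from each client triggers a write, so the steady-state cost is an in-memory set lookup.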

dstathis · 2024-02-05T11:00:11Z

A side effect of this would be that if cos-alerter is down for longer than down_interval (default 5m), it will alert for every client on restart. @przemeklal Do you find this to be an acceptable tradeoff?

simskij · 2024-02-05T14:47:14Z

dumping the state during a graceful shutdown and trying to load it up on the next startup

this is a reasonable thing we should likely also do, @dstathis

dstathis · 2024-02-21T12:33:30Z

Okay, so I opened a PR which saves the state on graceful shutdown. Does this solve your main concerns, or do we need to save some state on unexpected shutdowns as well?

przemeklal added Status: Triage Type: Bug labels Oct 24, 2023

dstathis self-assigned this Oct 25, 2023

dstathis mentioned this issue Feb 21, 2024

Retain state #66

Merged

dstathis closed this as completed in #66 Feb 26, 2024
