Notification behaviour after Downtime ends #5919

edpstiffel · 2017-12-29T09:18:05Z

Expected Behavior

I schedule a fixed downtime for a service.
The service goes CRITICAL within the downtime.
The service is still CRITICAL when the downtime ends.
I expect to be notified right when the downtime ends.

Current Behavior

When the downtime ends, the notification for one contact fires right away, but the notification for a second contact is delayed.

Possible Solution

Experiments show that the interval setting of the notification is the key: the contact who is notified right after the downtime ends has a notification interval of 0 (zero).
The other contact has an interval of 600 seconds (10m) and is notified 10 minutes after the hard state change, which happened during the downtime.

Steps to Reproduce (for bugs)

Please take a look at the attached screenshot which shows the history of such a behaviour:

Schedule downtime of 5 minutes for a (passive) service
Trigger hard state change to CRITICAL after a couple of seconds in the downtime
After downtime has ended, pstiffel is notified right away (the attached notification object has interval of 0)
5 minutes after the downtime has ended (and 10 minutes after the hard state change), jmueller is notified (the attached notification object has interval of 10m)

Context

IMHO, the notification should happen immediately after the downtime has ended, no matter which interval was set.
I think I have seen the same behaviour with timeperiods that are not 24x7, e.g. with a notification period of 9to17 and an outage that happens before that time. The contact attached to the notification object with interval 0 is notified right when the notification period starts, while the contact attached to a notification object with a non-zero interval is delayed until the next regular interval after the outage.

The background: Our contact with interval 0 is a ticket system which should only receive one notification, while our staff should be re-informed every hour.

Your Environment

icinga2-Version r2.8.0-1
Clustered master setup with two nodes
Debian 8

dnsmichi · 2018-01-08T12:45:29Z

Contacts/Users don't have a notification interval, that's to be defined inside the notification object.

Can you share a sample configuration in order to reproduce the issue?

Cheers,
Michael

edpstiffel · 2018-01-12T08:50:20Z

Ok, here we go:

I applied the following changes to a vanilla icinga 2.8 installation:

conf.d/services.conf

apply Service "dummy" {
  import "generic-service"
  check_command = "dummy"
  max_check_attempts = 1
  assign where host.name == NodeName
}

conf.d/users.conf

object User "UserA" {
  import "generic-user"
  display_name = "nur eine Benachrichtigung"
  email = "root@localhost"
}

object User "UserB" {
  import "generic-user"
  display_name = "Benachrichtigung jede Stunde"
  email = "root@localhost"
}

conf.d/notifications.conf

apply Notification "einmalige-Mail" to Service {
  import "mail-service-notification"
  users = [ "UserA" ]
  interval = 0
  assign where match(service.name, "dummy")
}

apply Notification "stuendliche-Mail" to Service {
  import "mail-service-notification"
  users = [ "UserB" ]
  interval = 1h
  assign where match(service.name, "dummy")
}

conf.d/templates.conf

template User "generic-user" {
        states = [ Up, Down, OK, Warning, Critical, Unknown ]
        types = [ Problem, Acknowledgement, Recovery ]
}

Here's how to reproduce the problem:

disable active checks on the dummy service
create fixed downtime of 5min on the dummy service
submit CRITICAL state to dummy service
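
The three steps above could also be scripted against the Icinga 2 REST API. The following is only a sketch that builds the request URLs and JSON bodies without sending them. The base URL, the host name `icinga-master` and the comment texts are assumptions made for illustration and are not taken from the setup described here.

```python
API_BASE = "https://localhost:5665/v1"  # assumption: API on the local node, default port
SERVICE = "icinga-master!dummy"         # hypothetical "<host>!<service>" object name


def _service_filter(service):
    """Build an API filter expression matching exactly one service."""
    host, name = service.split("!", 1)
    return f'host.name=="{host}" && service.name=="{name}"'


def disable_active_checks(service):
    """Step 1: modify the service object so that only passive results arrive."""
    url = f"{API_BASE}/objects/services/{service}"
    return url, {"attrs": {"enable_active_checks": False}}


def schedule_fixed_downtime(service, start, minutes=5):
    """Step 2: fixed downtime starting at `start` (Unix timestamp)."""
    body = {
        "type": "Service",
        "filter": _service_filter(service),
        "author": "icingaadmin",
        "comment": "reproduce #5919",  # illustrative comment text
        "start_time": start,
        "end_time": start + minutes * 60,
        "fixed": True,
    }
    return f"{API_BASE}/actions/schedule-downtime", body


def submit_critical(service):
    """Step 3: passive check result with exit status 2 (CRITICAL)."""
    body = {
        "type": "Service",
        "filter": _service_filter(service),
        "exit_status": 2,
        "plugin_output": "CRITICAL: test for #5919",  # illustrative output text
    }
    return f"{API_BASE}/actions/process-check-result", body
```

Each pair would be sent as a POST request with an `Accept: application/json` header and the credentials of an API user that has the matching permissions.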

Result:

after the downtime has ended, UserA is notified about the CRITICAL service immediately
UserB is getting the notification after a serious amount of delay (in the attached screenshot, it is 53 minutes later)
icingaadmin gets notified via default notification object after 23 minutes

Conclusion:
IMHO, all users should be notified immediately after the downtime has ended.
In our production environment, I guess the same problem occurs when you use notification timeperiods other than 24x7 and the outage happens outside of the timeperiod. There, the notification object with interval=0 fires immediately when the notification period starts, and the other notification objects with interval != 0 fire later. I will reproduce that in the test environment.

dnsmichi · 2018-01-18T17:45:55Z

Ok, understood. The main request is to ignore the notification interval if a downtime has ended. Right now the calculated next notification time is

notification -> suppressed by downtime
+10m for next_notification

downtime ends after 5m

5m later, the next notification is sent for the problem
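
The timeline above can be sketched as a small simulation. This is not Icinga 2 code, only a model of the arithmetic described in this thread. The function names are made up, and the loop that keeps pushing `next_notification` forward while the downtime is active is an assumption for downtimes longer than one interval.

```python
def first_delivery_current(problem_at, downtime_end, interval):
    """Time (in seconds) of the first delivered notification, as observed.

    problem_at:   time of the hard state change
    downtime_end: time at which the fixed downtime ends
    interval:     notification interval; 0 means "notify only once"
    """
    if interval == 0:
        # Observed in the thread: interval 0 fires right when the downtime ends.
        return max(problem_at, downtime_end)
    next_notification = problem_at
    # Every attempt inside the downtime is suppressed and the next attempt
    # is scheduled one interval later (assumption for repeated suppression).
    while next_notification < downtime_end:
        next_notification += interval
    return next_notification


def first_delivery_requested(problem_at, downtime_end, interval):
    """Requested behaviour: ignore the interval once the downtime has ended."""
    return max(problem_at, downtime_end)
```

With the numbers from the first report (problem at about 0s, downtime end at 300s, interval 600s), `first_delivery_current` returns 600, which is 5 minutes after the downtime ends, while `first_delivery_requested` returns 300.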

Changing this could break existing setups. I'd like to hear from others what they think. Or see a possible patch to adjust the behaviour and fully test it.

edpstiffel · 2018-01-19T12:45:47Z

Just a quick addendum: I saw the same behaviour when an outage happens outside of a notification period. When the notification period starts, UserA with interval=0 gets a notification immediately, and the user with interval=60m gets the notification later, apparently following the same formula that dnsmichi has shown above.

I cannot imagine why someone would not want to be informed of an outage immediately when a downtime ends or a notification period starts, so count my vote for a change of that behaviour.

dnsmichi · 2018-01-19T14:57:09Z

Sure, I hear you. I'm not sure how this can be implemented yet though.

Footur · 2018-07-09T08:53:16Z

I noticed the same behavior in my setup and I agree with @edpstiffel that a notification should be sent right after the downtime.

jonbulica99 · 2018-07-12T07:25:22Z

BUMP

Our intended setup relies heavily on what @edpstiffel is describing being the case. Consider the following scenario:
You monitor the software update state for ~500 hosts and Icinga notifications are sent directly to the ticket system. For this to work reliably, without spamming our ticket system every now and then, we have defined a downtime specific to the update checks, so that they only run once a week (a full day). With the current behaviour, if a host gets updates during said downtime, no notification will be sent when the downtime is over, since the check interval is 24h.
That being said, I understand people might be relying on the current behaviour for their setups, so maybe finding some middle ground (e.g. a setting to toggle this behaviour) would satisfy all of us.

winter1967 · 2018-07-27T10:18:08Z

The same issue, or feature request, here: our print servers are rebooted every night, and during that time they are in a downtime. When a service stops during the downtime (in this case because of the reboot) and does not come back up, we get no notification. Yes, it reached the critical state within the downtime, and yes, the state did not change when the downtime ended.
A possible workaround: we will reset the service to "OK" after the downtime via the API from our ticket system.

widhalmt · 2018-07-27T11:38:31Z

+1

Could this be solved by adding some sort of queue where all notifications that occurred during a downtime (or outside of a notification timeperiod) are collected? After the downtime ends, they get deduplicated and checked to see whether they still apply. If yes, the notifications get sent immediately.
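
A minimal sketch of such a queue, to make the idea concrete. It is not based on Icinga 2 internals. All class and method names are hypothetical, and "still applies" is interpreted here as "the checkable is still in the state that caused the suppressed notification".

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Pending:
    checkable: str     # e.g. "host!service"
    notification: str  # name of the notification object
    state: str         # state that triggered the suppressed notification


@dataclass
class SuppressedQueue:
    # Keyed by (checkable, notification), so repeated suppressions of the
    # same notification are deduplicated and only the latest one is kept.
    _items: dict = field(default_factory=dict)

    def suppress(self, item):
        """Record a notification that was suppressed by a downtime or timeperiod."""
        self._items[(item.checkable, item.notification)] = item

    def flush(self, checkable, current_state):
        """Call when the downtime ends; returns the notifications to send now."""
        keys = [key for key in self._items if key[0] == checkable]
        due = []
        for key in keys:
            item = self._items.pop(key)
            # Drop entries whose problem no longer exists in that form.
            if item.state == current_state:
                due.append(item)
        return due
```

In this model, a service that went CRITICAL during the downtime and is still CRITICAL at the end yields one entry per notification object, regardless of the configured interval. A service that recovered in the meantime yields nothing.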

refs #5919