Notification behaviour after Downtime ends #5919

Closed
edpstiffel opened this issue Dec 29, 2017 · 10 comments · Fixed by #7270
Labels: area/notifications, enhancement, needs-sponsoring

Comments

@edpstiffel

Expected Behavior

I schedule a fixed downtime for a service.
The service goes CRITICAL within the downtime.
The service is still CRITICAL when the downtime ends.
I expect to be notified right when the downtime ends.

Current Behavior

When the downtime ends, the notification for one contact fires right away, but the notification for a second contact is delayed.

Possible Solution

Experiments show that the interval setting of the notification object is the key: the contact that gets the notification right after the downtime ends has a notification interval of 0 (zero).
The other contact has an interval of 600 seconds (10m) and gets the notification 10 minutes after the hard state change, which happened during the downtime.

Steps to Reproduce (for bugs)

Please take a look at the attached screenshot which shows the history of such a behaviour:

  • Schedule downtime of 5 minutes for a (passive) service
  • Trigger hard state change to CRITICAL after a couple of seconds in the downtime
  • After downtime has ended, pstiffel is notified right away (the attached notification object has interval of 0)
  • 5 minutes after the downtime has ended (and 10 minutes after the hard state change), jmueller is notified (the attached notification object has interval of 10m)
    [Screenshot: 2017-12-29_10h12_28]

Context

IMHO, the notification should happen immediately after the downtime has ended, no matter which interval was set.
I believe I have seen the same behaviour with timeperiods other than 24x7, e.g. with a notification period 9to17 and an outage that starts before that period. The contact attached to the notification object with interval 0 is notified right when the notification period starts, while the contact attached to a notification object with a non-zero interval is not notified until the next regular interval after the outage.

The background: Our contact with interval 0 is a ticket system which should only receive one notification, while our staff should be re-informed every hour.

Your Environment

  • icinga2-Version r2.8.0-1
  • Clustered master setup with two nodes
  • Debian 8
@dnsmichi
Contributor

dnsmichi commented Jan 8, 2018

Contacts/Users don't have a notification interval, that's to be defined inside the notification object.

Can you share a sample configuration in order to reproduce the issue?

Cheers,
Michael

dnsmichi added the area/notifications and needs feedback labels Jan 8, 2018
@edpstiffel
Author

edpstiffel commented Jan 12, 2018

Ok, here we go:

I applied the following changes to a vanilla icinga 2.8 installation:

conf.d/services.conf

apply Service "dummy" {
  import "generic-service"
  check_command = "dummy"
  max_check_attempts = 1
  assign where host.name == NodeName
}

conf.d/users.conf

object User "UserA" {
  import "generic-user"
  display_name = "nur eine Benachrichtigung"
  email = "root@localhost"
}

object User "UserB" {
  import "generic-user"
  display_name = "Benachrichtigung jede Stunde"
  email = "root@localhost"
}

conf.d/notifications.conf

apply Notification "einmalige-Mail" to Service {
  import "mail-service-notification"
  users = [ "UserA" ]
  interval = 0
  assign where match(service.name, "dummy")
}

apply Notification "stuendliche-Mail" to Service {
  import "mail-service-notification"
  users = [ "UserB" ]
  interval = 1h
  assign where match(service.name, "dummy")
}

conf.d/templates.conf

template User "generic-user" {
        states = [ Up, Down, OK, Warning, Critical, Unknown ]
        types = [ Problem, Acknowledgement, Recovery ]
}

Here's how to reproduce the problem:

  • disable active checks on the dummy service
  • create fixed downtime of 5min on the dummy service
  • submit CRITICAL state to dummy service
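
The three steps above could also be scripted against the Icinga 2 REST API. The following is only a sketch that builds the request bodies without sending them. It assumes the API endpoints `/v1/objects/services`, `/v1/actions/schedule-downtime` and `/v1/actions/process-check-result` behave as documented for your version; the host name, author and comment strings are hypothetical placeholders, and `repro_requests` is a helper name made up for this illustration.

```python
import time


def repro_requests(host, service="dummy", downtime_seconds=300, now=None):
    """Build (method, path, body) tuples for the three reproduction steps.

    Nothing is sent; the caller would POST each body as JSON to the
    Icinga 2 API with an 'Accept: application/json' header.
    """
    now = int(time.time()) if now is None else int(now)
    # Filter selecting exactly one service on one host (names are assumptions).
    flt = f'host.name=="{host}" && service.name=="{service}"'
    return [
        # Step 1: disable active checks on the service.
        ("POST", f"/v1/objects/services/{host}!{service}",
         {"attrs": {"enable_active_checks": False}}),
        # Step 2: schedule a fixed downtime starting now.
        ("POST", "/v1/actions/schedule-downtime",
         {"type": "Service", "filter": flt,
          "author": "icingaadmin", "comment": "repro for #5919",
          "start_time": now, "end_time": now + downtime_seconds,
          "fixed": True}),
        # Step 3: submit a passive CRITICAL result (exit status 2).
        ("POST", "/v1/actions/process-check-result",
         {"type": "Service", "filter": flt,
          "exit_status": 2, "plugin_output": "CRITICAL: submitted for repro"}),
    ]


if __name__ == "__main__":
    for method, path, body in repro_requests("icinga-test"):
        print(method, path, body)
```

Because `max_check_attempts = 1` is set on the dummy service, a single submitted CRITICAL result is enough to produce a hard state change.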

Result:

  • after the downtime has ended, UserA is notified about the CRITICAL service immediately
  • UserB gets the notification after a considerable delay (in the attached screenshot, it is 53 minutes later)
  • icingaadmin gets notified via default notification object after 23 minutes

Conclusion:
IMHO, all users should be notified immediately after the downtime has ended.
In our production environment, I guess the same problem occurs when you use notification timeperiods other than 24x7 and the outage happens outside of the timeperiod. Here, the notification object with interval=0 fires immediately when the notification period has started, and the other notification objects with interval != 0 fire later. I will reproduce that in the test environment.

[Screenshot: 2018-01-12_09h19_52]

@dnsmichi
Contributor

Ok, understood. The main request is to ignore the notification interval if a downtime has ended. Right now the calculated next notification time is

notification -> suppressed by downtime
+10m for next_notification

downtime ends after 5m

5m later, the next notification is sent for the problem
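
The arithmetic in this example can be written down as a small model. This is a simplified sketch of the behaviour described in this thread, not Icinga's actual scheduling code; the function name `first_delivery` and the use of plain seconds are assumptions made for illustration. It treats interval 0 as "send as soon as the suppression ends", which matches what the reporter observed.

```python
import math


def first_delivery(problem_at, suppressed_until, interval):
    """Return the time (in seconds) at which the first notification gets out.

    problem_at       -- time of the hard state change
    suppressed_until -- end of the downtime that suppresses notifications
    interval         -- notification interval in seconds (0 = no re-notification)
    """
    if suppressed_until <= problem_at:
        # The problem started after the downtime: nothing is suppressed.
        return problem_at
    if interval <= 0:
        # Observed behaviour for interval 0: sent right when the downtime ends.
        return suppressed_until
    # Otherwise attempts happen at problem_at + k * interval; the first one
    # that falls at or after the end of the downtime is delivered.
    elapsed = suppressed_until - problem_at
    ticks = math.ceil(elapsed / interval)
    return problem_at + ticks * interval


if __name__ == "__main__":
    # Downtime from t=0 to t=300, problem 10 seconds into the downtime.
    print(first_delivery(10, 300, 0))    # interval 0
    print(first_delivery(10, 300, 600))  # interval 10m
```

With a 5 minute downtime and a problem a few seconds after its start, the 10m notification is delivered roughly 5 minutes after the downtime has ended, as in the example above.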

Changing this could break existing setups. I'd like to hear from others what they think. Or see a possible patch to adjust the behaviour and fully test it.

dnsmichi added the enhancement label Jan 18, 2018
@edpstiffel
Author

Just a quick addendum: I saw the same behaviour when an outage happens outside of a notification period: when the notification period starts, UserA with interval=0 gets a notification immediately, and the user with interval=60m gets the notification later, apparently following the same formula that dnsmichi has shown before.

I cannot imagine why anyone would not want to be informed of an outage immediately when a downtime ends or a notification period starts, so count my vote for a change of that behaviour.
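
To make the difference between the observed and the requested behaviour concrete for the timeperiod case, here is a minimal sketch. It is a model of what is described in this thread, not Icinga's implementation; both function names and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def current_first_notification(outage, period_start, interval):
    """Observed behaviour: attempts are made every `interval` after the outage;
    the first attempt inside the notification period is the one delivered."""
    if outage >= period_start or interval <= timedelta(0):
        # Inside the period already, or interval 0: no extra delay.
        return max(outage, period_start)
    attempt = outage
    while attempt < period_start:
        attempt += interval
    return attempt


def requested_first_notification(outage, period_start):
    """Requested behaviour: notify as soon as the period starts,
    regardless of the configured interval."""
    return max(outage, period_start)


if __name__ == "__main__":
    outage = datetime(2018, 1, 12, 7, 30)       # hypothetical outage time
    period_start = datetime(2018, 1, 12, 9, 0)  # start of the 9to17 period
    hourly = timedelta(hours=1)
    print(current_first_notification(outage, period_start, hourly))
    print(requested_first_notification(outage, period_start))
```

In this made-up example the hourly notification would currently go out at 09:30, while the requested behaviour would send it at 09:00 when the period starts.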

@dnsmichi
Contributor

Sure, I hear you. I'm not sure how this can be implemented yet though.

dnsmichi removed the needs feedback label May 9, 2018
@Footur

Footur commented Jul 9, 2018

I noticed the same behavior in my setup, and I agree with @edpstiffel that a notification should be sent right after the downtime ends.

@jonbulica99

BUMP

Our intended setup relies heavily on what @edpstiffel is describing being the case. Consider the following scenario:
You monitor the software update state for ~500 hosts and Icinga notifications are sent directly to the ticket system. For this to work reliably, without spamming our ticket system every now and then, we have defined a downtime specific to the update checks, so that they only run once a week (a full day). With the current behaviour, if a host gets updates during said downtime, no notification will be sent when the downtime is over, since the check interval is 24h.
That being said, I understand people might be relying on the current behaviour for their setups, so maybe finding some middle ground (e.g. a setting to toggle this behaviour) would satisfy all of us.

@winter1967

Same issue, or rather feature request, here: our print servers are rebooted every night, and they are in a downtime during that window. If a service stops during the downtime (in this case because of the reboot) and does not come back up, we get no notification. Yes, it reached the critical state during the downtime, and yes, the state did not change when the downtime ended.
A possible workaround: we will reset the service to "ok" after the downtime via the API from our ticket system.

dnsmichi added the wishlist and needs-sponsoring labels Jul 27, 2018
@widhalmt
Member

+1

Could this be solved by adding some sort of queue where all notifications that occurred during a downtime (or outside of a notification timeperiod) are collected? After the downtime ends, they get deduplicated and checked for whether they still apply. If so, the notifications are sent immediately.
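
A minimal sketch of such a queue could look like the following. It only illustrates the idea from this comment and is not how the eventual fix was implemented; the class names, the string-based states and the "host!service" keys are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Pending:
    checkable: str     # e.g. "host!service"
    notification: str  # name of the notification object
    state: str         # state that triggered the notification


@dataclass
class SuppressedQueue:
    _items: list = field(default_factory=list)

    def collect(self, checkable, notification, state):
        """Remember a notification that was suppressed by a downtime or timeperiod."""
        self._items.append(Pending(checkable, notification, state))

    def flush(self, current_states):
        """Called when the suppression ends; returns the notifications to send now.

        current_states maps a checkable name to its current state.
        """
        # Deduplicate: keep only the latest entry per (checkable, notification).
        latest = {}
        for item in self._items:
            latest[(item.checkable, item.notification)] = item
        self._items.clear()
        # Keep entries that still apply: same state as now, and still a problem.
        return [
            item for item in latest.values()
            if item.state != "OK" and current_states.get(item.checkable) == item.state
        ]


if __name__ == "__main__":
    queue = SuppressedQueue()
    queue.collect("icinga-test!dummy", "einmalige-Mail", "CRITICAL")
    queue.collect("icinga-test!dummy", "stuendliche-Mail", "CRITICAL")
    queue.collect("icinga-test!dummy", "stuendliche-Mail", "CRITICAL")  # duplicate
    for item in queue.flush({"icinga-test!dummy": "CRITICAL"}):
        print("send now:", item.notification)
```

A service that recovered before the suppression ended would drop out in `flush`, so only problems that still exist lead to an immediate notification.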

dnsmichi removed the wishlist label May 9, 2019
Al2Klimov self-assigned this Jul 1, 2019
Al2Klimov added a commit that referenced this issue Jul 2, 2019
Al2Klimov added a commit that referenced this issue Jul 2, 2019
Al2Klimov added a commit that referenced this issue Jul 3, 2019
Al2Klimov added a commit that referenced this issue Jul 3, 2019
Al2Klimov added a commit that referenced this issue Jul 3, 2019
Al2Klimov added a commit that referenced this issue Jul 4, 2019
Al2Klimov added a commit that referenced this issue Jul 4, 2019
Al2Klimov added a commit that referenced this issue Jul 4, 2019
Al2Klimov added a commit that referenced this issue Jul 4, 2019
Al2Klimov added a commit that referenced this issue Jul 4, 2019
@dnsmichi
Contributor

dnsmichi commented Jul 9, 2019

This is a sponsored feature request, thanks for granting us the time to implement it.

ref/IP/14729

Al2Klimov added a commit that referenced this issue Jul 10, 2019
Al2Klimov added a commit that referenced this issue Jul 10, 2019
dnsmichi pushed a commit that referenced this issue Jul 10, 2019
dnsmichi pushed a commit that referenced this issue Jul 10, 2019

7 participants