Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Notification not sent for object leaving downtime and already being in a hard not-ok state #6787

Closed
matthenning opened this issue Nov 20, 2018 · 11 comments
Labels
area/notifications Notification events enhancement New feature or request
Milestone

Comments

@matthenning
Copy link

It seems there is a problem in Icinga 2 or somewhere in my configuration that leads to some notifications not being sent if a downtime ends and the affected object, in this case a service, is already in a hard not-ok state.

Expected Behavior

All contacts receive a notification if the object is in a not-ok hard state and the downtime ends.

Current Behavior

Only one of multiple contacts receives the notification.

Steps to Reproduce (for bugs)

Notification configuration ("Operator 24x7 Service Notification" is sent, "Automic 24x7 Service Notification" is not). Outside this scenario both notifications work as expected.

Object 'redacted.domain!App - Automic Generating!Operator 24x7 Service Notifications' of type 'Notification':
  % declared in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 97:1-97:70
  * __name = "redacted.domain!App - Automic Generating!Operator 24x7 Service Notifications"
  * command = "notification"
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 5:5-5:28
  * command_endpoint = ""
  * host_name = "redacted.domain"
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 97:1-97:70
  * interval = 0
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 6:5-6:17
  * name = "Operator 24x7 Service Notifications"
  * package = "director"
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 97:1-97:70
  * period = "24x7"
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 100:5-100:19
  * service_name = "App - Automic Generating"
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 97:1-97:70
  * source_location
    * first_column = 1
    * first_line = 97
    * last_column = 70
    * last_line = 97
    * path = "/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf"
  * states = [ "Critical", "OK", "Warning" ]
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 17:5-17:38
  * templates = [ "Operator 24x7 Service Notifications", "Service Notification", "Notification" ]
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 97:1-97:70
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 14:1-14:44
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 1:0-1:35
  * times
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 2:5-4:5
    * begin = 0
  * type = "Notification"
  * types = [ "Problem", "Recovery" ]
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 18:5-18:33
  * user_groups = null
  * users = [ "Operator" ]
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 102:5-102:29
  * vars
    * jira_api_host = "redacted"
      % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 7:5-7:57
    * jira_api_password = "redacted"
      % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 8:5-8:61
    * jira_api_user = "redacted"
      % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 9:5-9:41
    * jira_project_key = "redacted"
      % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 10:5-10:36
    * notification_channel = "jira,email"
      % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 103:5-103:44
    * notification_object_type = "service"
      % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 19:5-19:45
    * notification_sender_mailbox_address = "redacted"
      % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 11:5-11:73
  * zone = "satellite"
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 97:1-97:70
Object 'redacted.domain!App - Automic Generating!Automic 24x7 Service Notification' of type 'Notification':
  % declared in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 293:1-293:65
  * __name = "redacted.domain!App - Automic Generating!Automic 24x7 Service Notification"
  * command = "notification"
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 5:5-5:28
  * command_endpoint = ""
  * host_name = "redacted.domain"
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 293:1-293:65
  * interval = 3600
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 6:5-6:17
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 296:5-296:17
  * name = "Automic 24x7 Service Notification"
  * package = "director"
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 293:1-293:65
  * period = "24x7"
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 297:5-297:19
  * service_name = "App - Automic Generating"
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 293:1-293:65
  * source_location
    * first_column = 1
    * first_line = 293
    * last_column = 65
    * last_line = 293
    * path = "/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf"
  * states = [ "Critical", "OK", "Warning" ]
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 17:5-17:38
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 299:5-299:38
  * templates = [ "Automic 24x7 Service Notification", "Service Notification", "Notification" ]
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 293:1-293:65
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 14:1-14:44
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 1:0-1:35
  * times
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 2:5-4:5
    * begin = 0
  * type = "Notification"
  * types = [ "Custom", "Problem", "Recovery" ]
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 18:5-18:33
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 300:5-300:41
  * user_groups = [ "Automic" ]
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 301:5-301:36
  * users = null
  * vars
    * jira_api_host = "redacted"
      % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 7:5-7:57
    * jira_api_password = "redacted"
      % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 8:5-8:61
    * jira_api_user = "redacted"
      % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 9:5-9:41
    * jira_project_key = "redacted"
      % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 10:5-10:36
    * notification_channel = "sms_gateway"
      % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 302:5-302:45
    * notification_object_type = "service"
      % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 19:5-19:45
      % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 303:5-303:45
    * notification_sender_mailbox_address = "redacted"
      % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_templates.conf', lines 11:5-11:73
  * zone = "satellite"
    % = modified in '/var/lib/icinga2/api/packages/director/3d65d259-ad04-4061-90fe-6a3c8f34ee7a/zones.d/master/notification_apply.conf', lines 293:1-293:65

Relevant parts of the log file:

# cat /var/log/icinga2/icinga2.log.1 | grep Generating -B 2 -A 2

--
[2018-11-20 00:00:11 +0100] information/HttpServerConnection: HTTP client disconnected (from [10.76.160.234]:54738)
[2018-11-20 00:00:17 +0100] information/RemoteCheckQueue: items: 0, rate: 0/s (12/min 60/5min 180/15min);
[2018-11-20 00:00:17 +0100] information/ConfigObjectUtility: Created and activated object 'redacted.domain!App - Automic Generating!ea1b5399-0dab-4a15-9dc0-de787f7fdf10' of type 'Downtime'.
[2018-11-20 00:00:17 +0100] warning/ApiListener: Ignoring config update from 'master2' for object 'redacted.domain!App - Automic Generating!ea1b5399-0dab-4a15-9dc0-de787f7fdf10' of type 'Downtime'. 'api' does not accept config.
[2018-11-20 00:00:17 +0100] information/ConfigObjectUtility: Created and activated object 'kavpas286-shd1.united.domain!App - Icinga 2 Cluster Zone!ea3ee6da-1cd7-4007-8089-43bd59d83346' of type 'Downtime'.
[2018-11-20 00:00:17 +0100] warning/ApiListener: Ignoring config update from 'master2' for object 'kavpas286-shd1.united.domain!App - Icinga 2 Cluster Zone!ea3ee6da-1cd7-4007-8089-43bd59d83346' of type 'Downtime'. 'api' does not accept config.
--
[2018-11-20 01:14:51 +0100] information/HttpServerConnection: HTTP client disconnected (from [10.76.160.234]:55912)
[2018-11-20 01:14:57 +0100] information/RemoteCheckQueue: items: 0, rate: 0/s (12/min 60/5min 180/15min);
[2018-11-20 01:15:01 +0100] information/Notification: Sending reminder 'Problem' notification 'redacted.domain!App - Automic Generating!Operator 24x7 Service Notifications' for user 'Operator'
[2018-11-20 01:15:01 +0100] information/Notification: Completed sending 'Problem' notification 'redacted.domain!App - Automic Generating!Operator 24x7 Service Notifications' for checkable 'redacted.domain!App - Automic Generating' and user 'Operator'.
[2018-11-20 01:15:01 +0100] information/ApiListener: New client connection from [10.76.160.234]:55930 (no client certificate)
[2018-11-20 01:15:01 +0100] information/ApiListener: New client connection from [10.76.160.234]:55932 (no client certificate)
--
[2018-11-20 01:15:11 +0100] information/HttpServerConnection: HTTP client disconnected (from [10.76.160.234]:56014)
[2018-11-20 01:15:17 +0100] information/RemoteCheckQueue: items: 0, rate: 0/s (12/min 60/5min 180/15min);
[2018-11-20 01:15:17 +0100] information/Checkable: Checking for configured notifications for object 'redacted.domain!App - Automic Generating'
[2018-11-20 01:15:17 +0100] warning/ApiListener: Ignoring config update. 'api' does not accept config.
[2018-11-20 01:15:21 +0100] information/ApiListener: New client connection from [10.76.160.234]:56072 (no client certificate)
--
[2018-11-20 01:49:31 +0100] information/HttpServerConnection: HTTP client disconnected (from [10.76.160.234]:41170)
[2018-11-20 01:49:31 +0100] information/HttpServerConnection: HTTP client disconnected (from [10.76.160.234]:41172)
[2018-11-20 01:49:34 +0100] information/Checkable: Checking for configured notifications for object 'redacted.domain!App - Automic Generating'
[2018-11-20 01:49:34 +0100] information/Notification: Sending 'Recovery' notification 'redacted.domain!App - Automic Generating!Operator 24x7 Service Notifications' for user 'Operator'
[2018-11-20 01:49:34 +0100] information/Notification: Completed sending 'Recovery' notification 'redacted.domain!App - Automic Generating!Operator 24x7 Service Notifications' for checkable 'redacted.domain!App - Automic Generating' and user 'Operator'.

Your Environment

  • Version used (icinga2 --version): r2.10.1-1
  • Operating System and version: Debian 9.5
  • Enabled features (icinga2 feature list): api checker ido-mysql influxdb mainlog notification
  • Config validation (icinga2 daemon -C):
[2018-11-20 09:52:19 +0100] information/cli: Icinga application loader (version: r2.10.1-1)
[2018-11-20 09:52:19 +0100] information/cli: Loading configuration file(s).
[2018-11-20 09:52:19 +0100] information/ConfigItem: Committing config item(s).
[2018-11-20 09:52:19 +0100] information/ApiListener: My API identity: master1
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 23 ScheduledDowntimes.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 1302 Services.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 70 Hosts.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 1 FileLogger.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 3054 Dependencies.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 2 NotificationCommands.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 1865 Notifications.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 18 HostGroups.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 1 ApiListener.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 32 Downtimes.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 35 Comments.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 58 Zones.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 57 Endpoints.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 3 ApiUsers.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 15 Users.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 252 CheckCommands.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 6 UserGroups.
[2018-11-20 09:52:19 +0100] information/ConfigItem: Instantiated 2 TimePeriods.
[2018-11-20 09:52:19 +0100] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-11-20 09:52:19 +0100] information/cli: Finished validating the configuration file(s).
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes.
object Endpoint "master1" {
  host = "master1"
}

object Endpoint "master2" {
  host = "master2"
}

object Zone "master" {
  endpoints = [ "master1", "master2" ]
}



object Zone "global-templates" {
  global = true
}

object Zone "director-global" {
  global = true
}
@dnsmichi
Copy link
Contributor

Where exactly is the downtime event and then the missing notification? I don't know where to look at in the provided logs. Especially interesting would be a test case with debug log enabled to see more details.

@dnsmichi dnsmichi added area/notifications Notification events needs feedback We'll only proceed once we hear from you again labels Nov 20, 2018
@matthenning
Copy link
Author

I'll extract the requested lines from the log above:

[2018-11-20 00:00:17 +0100] information/ConfigObjectUtility: Created and activated object 'redacted.domain!App - Automic Generating!ea1b5399-0dab-4a15-9dc0-de787f7fdf10' of type 'Downtime'.
[2018-11-20 01:15:01 +0100] information/Notification: Completed sending 'Problem' notification 'redacted.domain!App - Automic Generating!Operator 24x7 Service Notifications' for checkable 'redacted.domain!App - Automic Generating' and user 'Operator'.
[2018-11-20 01:49:34 +0100] information/Notification: Sending 'Recovery' notification 'redacted.domain!App - Automic Generating!Operator 24x7 Service Notifications' for user 'Operator'

The notification that was sent results from redacted.domain!App - Automic Generating!Operator 24x7 Service Notifications. The notification which should also be there results from redacted.domain!App - Automic Generating!Automic 24x7 Service Notification.

In the following screenshots you can see the notifications on Monday where no downtime was involved. Two different notifications are sent. Operator (blue) and Automic (red).
On Tuesday, where the downtime was involved, you only see the notification to Operator.

grafik

I'll see if I can set up a test case on the dev cluster and supplement this with debug logs.

@matthenning
Copy link
Author

matthenning commented Nov 21, 2018

Getting debug logs from prod turns out to be tricky but I set up a fresh environment and tried to reproduce the behaviour.

Surprisingly the notifications behaved differently.
If the service enters the critical state during a downtime no notifications are sent at all after the downtime ends.

Here's the Icingaweb 2 history:

grafik

The configuration:

object Host "dummy01" {
    display_name = "Dummy 01"
    address = "127.0.0.1"
    check_command = "hostalive"
}


object Service "Dummy Service 01" {
    host_name = "dummy01"
    check_command = "dummy"
    max_check_attempts = "2"
    check_interval = 2m
    retry_interval = 2m
    check_timeout = 30s
}


apply Notification "E-Mail Notification" to Service {
    command = "dummynotification"
    assign where host.name == "dummy01"
    states = [ Critical, OK, Unknown, Warning ]
    types = [ Problem, Recovery ]
    users = [ "Dummy E-Mail User" ]
}


apply Notification "SMS Notification" to Service {
    command = "dummynotification"
    assign where host.name == "dummy01"
    states = [ Critical, OK, Unknown, Warning ]
    types = [ Problem, Recovery ]
    users = [ "Dummy SMS User" ]
}


object User "Dummy E-Mail User" {
    enable_notifications = true
    states = [ Critical, OK, Warning ]
    types = [ Problem, Recovery ]
}


object User "Dummy SMS User" {
    enable_notifications = true
    states = [ Critical, OK, Warning ]
    types = [ Problem, Recovery ]
}
icinga2 daemon -C
[2018-11-21 13:18:32 +0100] information/cli: Icinga application loader (version: r2.10.2-1)
[2018-11-21 13:18:32 +0100] information/cli: Loading configuration file(s).
[2018-11-21 13:18:32 +0100] information/ConfigItem: Committing config item(s).
[2018-11-21 13:18:32 +0100] information/ApiListener: My API identity: redacted
[2018-11-21 13:18:33 +0100] information/ConfigItem: Instantiated 1 Service.
[2018-11-21 13:18:33 +0100] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2018-11-21 13:18:33 +0100] information/ConfigItem: Instantiated 1 IcingaApplication.
[2018-11-21 13:18:33 +0100] information/ConfigItem: Instantiated 1 Host.
[2018-11-21 13:18:33 +0100] information/ConfigItem: Instantiated 2 FileLoggers.
[2018-11-21 13:18:33 +0100] information/ConfigItem: Instantiated 1 NotificationCommand.
[2018-11-21 13:18:33 +0100] information/ConfigItem: Instantiated 2 Notifications.
[2018-11-21 13:18:33 +0100] information/ConfigItem: Instantiated 1 NotificationComponent.
[2018-11-21 13:18:33 +0100] information/ConfigItem: Instantiated 1 ApiListener.
[2018-11-21 13:18:33 +0100] information/ConfigItem: Instantiated 1 CheckerComponent.
[2018-11-21 13:18:33 +0100] information/ConfigItem: Instantiated 3 Zones.
[2018-11-21 13:18:33 +0100] information/ConfigItem: Instantiated 1 Endpoint.
[2018-11-21 13:18:33 +0100] information/ConfigItem: Instantiated 2 ApiUsers.
[2018-11-21 13:18:33 +0100] information/ConfigItem: Instantiated 2 Users.
[2018-11-21 13:18:33 +0100] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2018-11-21 13:18:33 +0100] information/ConfigItem: Instantiated 215 CheckCommands.
[2018-11-21 13:18:33 +0100] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2018-11-21 13:18:33 +0100] information/cli: Finished validating the configuration file(s).

The complete debug log after I removed all IdoMysqlConnection, InfluxdbWriter and ApiListener events:
https://gist.github.com/matthenning/06f1e78e00ceada55152b010de442212

Edit: It's weird that two events with "This notification was not sent out to any contact" were fired as if Icinga thought there were problem notifications and now needs to send the recovery but can't find a matching contact.

@dnsmichi
Copy link
Contributor

Thanks, that works better with a shorter debug log. The flow is like this ...

Here's the downtime being added, when the service is in an OK state.

[2018-11-21 13:02:53 +0100] information/HttpServerConnection: Request: POST /v1/actions/schedule-downtime (from [10.88.30.63]:48282), user: icingaweb2)

[2018-11-21 13:02:53 +0100] debug/HttpUtility: Request body: '{"author":"admin","comment":"My Downtime","start_time":1542801810,"end_time":1542801870,"duration":null,"fixed":true,"trigger_name":null,"service":"dummy01!Dummy Service 01"}'

[2018-11-21 13:02:53 +0100] notice/Downtime: Added downtime 'dummy01!Dummy Service 01!a0b2b6fc-affb-4321-9e6b-4abc47863c10' between '2018-11-21 13:03:30' and '2018-11-21 13:04:30'.

Then an external check result changes state to SOFT critical.

[2018-11-21 13:03:42 +0100] information/HttpServerConnection: Request: POST /v1/actions/process-check-result (from [10.88.30.63]:48284), user: icingaweb2)

[2018-11-21 13:03:42 +0100] debug/HttpUtility: Request body: '{"exit_status":2,"plugin_output":"Inside Downtime","performance_data":null,"service":"dummy01!Dummy Service 01"}'

[2018-11-21 13:03:42 +0100] notice/Checkable: State Change: Checkable 'dummy01!Dummy Service 01' soft state change from OK to CRITICAL detected.

Then there's another one which moves this to HARD critical.

[2018-11-21 13:03:47 +0100] information/HttpServerConnection: Request: POST /v1/actions/process-check-result (from [10.88.30.63]:48286), user: icingaweb2)

[2018-11-21 13:03:47 +0100] debug/HttpUtility: Request body: '{"exit_status":2,"plugin_output":"Inside Downtime","performance_data":null,"service":"dummy01!Dummy Service 01"}'

[2018-11-21 13:03:47 +0100] notice/Checkable: State Change: Checkable 'dummy01!Dummy Service 01' hard state change from CRITICAL to CRITICAL detected.

At this timestamp 2018-11-21 13:03:47 the first notification would be triggered, but it isn't since the object is in an active downtime.

Note: The notification interval is set to 0 or 3600.

At some point, the downtime is removed.

[2018-11-21 13:04:53 +0100] notice/Downtime: Removed downtime 'dummy01!Dummy Service 01!a0b2b6fc-affb-4321-9e6b-4abc47863c10' from object 'dummy01!Dummy Service 01'.

Then a recovery is forced via HARD OK change. This happens at a later point, but since no-one was notified before, those recoveries won't hit any users.

[2018-11-21 13:05:38 +0100] information/HttpServerConnection: Request: POST /v1/actions/process-check-result (from [10.88.30.63]:48292), user: icingaweb2)

[2018-11-21 13:07:38 +0100] notice/Checkable: State Change: Checkable 'dummy01!Dummy Service 01' hard state change from CRITICAL to OK detected.

Final conclusion:

  • If interval is 0 (just one-shot notifications) and a downtime is set during the event time, this notification won't be sent
  • If interval is 1h, the next notification happens 3600s after the first event (which was cut away by the downtime). Typically you should then see that next_notification is set to 2018-11-21 13:03:47 + 3600 via REST API query against the notification object.

From the current implementation and design that downtimes cast a "don't let notifications being sent" window for a specified period of time, this works as expected. I do understand the requirement that next_notification should be delayed up until the downtime ends, still, this is a feature request then.

@matthenning
Copy link
Author

What you've concluded makes sense to me and explains why the Automic notification is missing as it has an interval of 3600. If the service would stay critical for an additional 45 minutes to hit the next 3600 interval it would be sent.

It also explains why in the test setup no notifications are sent at all.

The Operator notification in my original post on the other hand is sent after the downtime ends but has an interval of 0. That shouldn't happen right?
It happened multiple nights in a row so I don't think this is an isolated incident.
I'll try to grab debug logs next night and strip them down as much as possible.

@matthenning
Copy link
Author

I wasn't able to fetch those debug logs yet but I'm working on it.

In the meantime I found another situation which looks like it might be related.
A few days ago a service went critical. Then a downtime was started and the service recovered while the downtime was active.
Today the service went into unknown (there are no notifications configured for unknown) and after recovering recovery notifications were sent.
Those shouldn't have been sent as no problem notifications were sent for unknown.
What I guess happened is that those recovery notifications are related to the critical notifications which were sent days before which never got to send their recovery notifications as the service recovered during a downtime.

grafik

@ekeih
Copy link
Contributor

ekeih commented Dec 5, 2018

From the current implementation and design that downtimes cast a "don't let notifications being sent" window for a specified period of time, this works as expected. I do understand the requirement that next_notification should be delayed up until the downtime ends, still, this is a feature request then.

This gets even worse when you use times.begin to implement escalations.

We have several levels of escalations like the following 1st and 2nd level:

apply Notification "escalation-1st-level -> Service" to Service {
  import "notification-template"

  users = [ "icinga-oncall-1st-level", ]
  command = "icinga2-notification"
  interval = 15m
  assign where host.name
}

apply Notification "escalation-2nd-level -> Service" to Service {
  import "notification-template"

  users = [ "icinga-oncall-2nd-level", ]
  times = {
    begin = 30m
  }
  command = "icinga2-notification"
  interval = 15m
  assign where host.name
}

Depending on the timing of the downtime we end up with a notification to our 2nd level before our 1st level.

screen shot 2018-12-05 at 12 16 51

After reading the explanation of @dnsmichi I understand why this happening but I think it really should not happen this way.

@Footur
Copy link

Footur commented Jan 8, 2019

@dnsmichi Do you still need feedback?

@dnsmichi dnsmichi added enhancement New feature or request and removed needs feedback We'll only proceed once we hear from you again labels Jun 19, 2019
@Al2Klimov
Copy link
Member

This one is a dup of #5919, isn't it?

@dnsmichi
Copy link
Contributor

dnsmichi commented Jul 8, 2019

@Al2Klimov Sort of, @lippserd & myself decided to leave this open as it provides quite a few more details.

@dnsmichi dnsmichi added this to the 2.11.0 milestone Jul 9, 2019
@dnsmichi
Copy link
Contributor

Implemented as part of #7270 and now merged to git master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/notifications Notification events enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants