
Notifications lost during restart #7086

Closed
Crunsher opened this issue Apr 5, 2019 · 6 comments · Fixed by #7297
Labels: area/notifications, bug, ref/NC

Comments

@Crunsher
Contributor

Crunsher commented Apr 5, 2019

Steps to reproduce:

  1. Have an Icinga with a lot of config
  2. Restart Icinga
  3. During the Restart have a Checkable recover
  4. Recovery Notification will not be sent

Why?
A Notification is not sent while it is paused (in an HA setup, paused means another instance is responsible for it). With a large number of objects, the HA state of the Notification in question has not yet been computed during startup, so it still reports its default value: paused. So even in a single-instance setup, Icinga assumes someone else in the cluster is taking care of it, and the Notification is simply discarded.
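
To make the failure mode concrete, here is a minimal Python model of it. This is not Icinga code: `Notification`, `Notifier` and `update_authority` are hypothetical names that only mirror the behaviour described above (paused by default until the authority run has happened).

```python
from dataclasses import dataclass, field


@dataclass
class Notification:
    name: str
    # Assumption mirroring the report: until HA authority has been
    # computed, the object reports itself as paused.
    paused: bool = True


@dataclass
class Notifier:
    sent: list = field(default_factory=list)
    dropped: list = field(default_factory=list)

    def send(self, notification: Notification, kind: str) -> bool:
        if notification.paused:
            # "Another instance is responsible": discard silently.
            self.dropped.append((notification.name, kind))
            return False
        self.sent.append((notification.name, kind))
        return True


def update_authority(notifications) -> None:
    """Single-instance case: this node is responsible for everything."""
    for notification in notifications:
        notification.paused = False


notifier = Notifier()
notification = Notification("example.com!noop")

# The recovery arrives during startup, before the authority run.
notifier.send(notification, "Recovery")

# Only now is the HA state computed; later events go through.
update_authority([notification])
notifier.send(notification, "Problem")
```

In this model the recovery ends up in `dropped` and is never retried, which matches the symptom: nothing is queued, so the event is gone once the authority run finally happens.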

Possible solutions:
A lot of the ideas I collected have a high chance of breaking the cluster. Disregarding those, I came to the conclusion that reworking the NotificationComponent to deal with this case is the most stable way to go about this.
The NotificationComponent could keep a multi-index work queue for all the Notifications to be sent. Execution can then easily be tied to the HA state by waiting for the first Object Authority run. This would also require changes to the ApiListener/ObjectAuthority, and while we are at it, it could make sense to split UpdateObjectAuthority by type.
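
A rough Python sketch of the queueing idea, under the assumption that the component gets some callback once the first Object Authority run has completed. The class and method names (`DeferredNotificationQueue`, `on_first_authority_run`) are made up for illustration, and a plain FIFO stands in for the multi-index container.

```python
from collections import deque


class DeferredNotificationQueue:
    """Hold notifications until the first authority run, then dispatch."""

    def __init__(self, deliver):
        self._deliver = deliver          # callable(name, kind)
        self._pending = deque()          # events seen before authority is known
        self._authority_known = False
        self._responsible = set()        # objects this node is authoritative for

    def submit(self, name, kind):
        if not self._authority_known:
            # HA state not computed yet: keep the event instead of dropping it.
            self._pending.append((name, kind))
            return
        self._dispatch(name, kind)

    def on_first_authority_run(self, responsible_names):
        # Hypothetical hook, called once the HA state has been computed.
        self._responsible = set(responsible_names)
        self._authority_known = True
        while self._pending:
            self._dispatch(*self._pending.popleft())

    def _dispatch(self, name, kind):
        # Only deliver what this node is actually responsible for.
        if name in self._responsible:
            self._deliver(name, kind)


delivered = []
queue = DeferredNotificationQueue(lambda name, kind: delivered.append((name, kind)))
queue.submit("a", "Recovery")   # queued, authority unknown
queue.submit("b", "Problem")    # queued, authority unknown
queue.on_first_authority_run({"a"})
```

After the authority run only the event for `a` is delivered; the one for `b` is discarded because, in this model, another node owns it. The point of the design is that the decision is postponed rather than made against a default value.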

refs #5521
ref/NC/601223

@Crunsher Crunsher added bug Something isn't working area/notifications Notification events labels Apr 5, 2019
@Crunsher Crunsher self-assigned this Apr 5, 2019
@dnsmichi
Contributor

dnsmichi commented Apr 8, 2019

Also, please add the logs and configs required to reproduce the problem.

@Al2Klimov
Member

@Crunsher Please also write down the ideas you've collected and why they'd break the cluster.

@Al2Klimov
Member

@dnsmichi What's your opinion about behaving as if there's a split-brain on startup until the first connection (object authority)? We could send two notifications sometimes, but IMO it's better than zero.
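
As a sketch of what that trade-off looks like (Python, purely illustrative; `should_send` and `count_sends` are hypothetical helpers, not Icinga functions): before the first authority run every node acts as if it were alone, afterwards the paused flag decides.

```python
def should_send(paused: bool, authority_computed: bool) -> bool:
    """Decide whether this node sends a notification."""
    if not authority_computed:
        # Startup: behave like a split-brain node and send anyway.
        return True
    return not paused


def count_sends(nodes) -> int:
    """nodes: iterable of (paused, authority_computed) pairs, one per node."""
    return sum(should_send(paused, computed) for paused, computed in nodes)


# Two HA nodes that both restarted and have not computed authority yet.
both_starting = count_sends([(True, False), (True, False)])

# Normal operation: exactly one node is unpaused.
steady_state = count_sends([(False, True), (True, True)])
```

With both nodes still starting, the notification goes out twice; in steady state it goes out once. A single-node setup would always send, which is the case this issue is about.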

@dnsmichi
Contributor

dnsmichi commented Jul 4, 2019

Imho notifications should be suppressed up until everything is running again. This involves two things:

  • Not processing check results/cluster messages at this point, possibly queuing them for later
  • Delaying reminder notification events whenever the application is not fully started yet

@Al2Klimov Al2Klimov added the needs feedback We'll only proceed once we hear from you again label Jul 4, 2019
@Al2Klimov Al2Klimov removed their assignment Jul 4, 2019
@dnsmichi dnsmichi assigned Al2Klimov and unassigned dnsmichi and Crunsher Jul 8, 2019
@Al2Klimov Al2Klimov removed the needs feedback We'll only proceed once we hear from you again label Jul 8, 2019
@Al2Klimov
Member

How to reproduce

  • Fresh single-node Icinga 2 with the API enabled
  • Nothing in /etc/icinga2/conf.d except config.conf (see below)

  1. Start Icinga
  2. Run three times: curl -LkisSu root:123456 -H 'Accept: application/json' -X POST https://10.37.129.59:5665/v1/actions/process-check-result -d '{"type": "Host", "filter": "host.name == \"example.com\"", "exit_status": 1, "plugin_output": " "}'
  3. Stop Icinga
  4. Comment out the second enable_active_checks = false
  5. Start Icinga
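
For anyone who prefers scripting step 2, here is a hedged Python equivalent of the curl call using only the standard library. The URL, credentials and payload are taken from the command above; the function names are made up, and certificate verification is switched off to match curl's `-k`.

```python
import base64
import json
import ssl
import urllib.request

API_URL = "https://10.37.129.59:5665/v1/actions/process-check-result"


def build_request(host_name, exit_status, user="root", password="123456"):
    """Build the same POST request the curl command sends."""
    payload = {
        "type": "Host",
        "filter": f'host.name == "{host_name}"',
        "exit_status": exit_status,
        "plugin_output": " ",
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Accept": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def submit(request):
    """Send the request without verifying the certificate (like curl -k)."""
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context) as response:
        return json.load(response)


def push_down_state(times=3):
    """Submit the DOWN result several times, as in step 2."""
    return [submit(build_request("example.com", 1)) for _ in range(times)]
```

Calling `push_down_state()` against the test instance should have the same effect as running the curl command three times; nothing is sent until that function is called.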

Icinga will send (and log) a problem notification, but not a recovery one, unless if (1) { is changed to if (0) { (that is, unless the 100,000 extra hosts are left out).

/etc/icinga2/conf.d/config.conf

object IcingaApplication "app" { }

object ApiUser "root" {
	password = "123456"
	permissions = [ "*" ]
}

object NotificationCommand "noop" {
	command = [ "/bin/true" ]
}

if (1) {
	for (var i in range(100000)) {
		object Host i {
			enable_active_checks = false
			check_command = "dummy"
		}
	}
}

object Host "example.com" {
	enable_active_checks = false
	check_command = "dummy"
	check_interval = 1s
	retry_interval = 1s

	vars.dummy_state = 0
	vars.dummy_text = " "
}

object User "icingaadmin" {
}

object Notification "noop" {
	host_name = "example.com"
	users = [ "icingaadmin" ]
	command = "noop"
	//interval = 30m
	types = [ Problem, Recovery, Custom, DowntimeStart, DowntimeEnd ]
	states = [ Up, Down ]
}

@Al2Klimov
Member

The ideas @Crunsher has collected seem to be just the one in @dnsmichi's last comment here and #7236.
