conditional-reboot orchestrates reboots across server fleets to keep your infrastructure secure and stable. Its primary purpose is to reboot machines so that pending security updates take effect, but it can also help machines thousands of kilometres away recover from sudden network loss caused by faulty drivers and / or hardware.
🚨 Automatic reboot on pending security updates
🐞 Automatic reboot on buggy drivers / network adapters after network loss
⏲ Allows setting time-based schedules when it's safe to reboot
🚦 Allows defining fine-grained conditions to combine multiple input sources
📫 Keeps an audit log of reboots
☔️ Safety net that prevents reboot loops
Head over to the prebuilt binaries and download the correct binary for your system. Alternatively, if you have the Go toolchain installed, you can install it from source by running:
go install github.com/soerenschneider/conditional-reboot@latest
Use the example systemd service file to run it at boot.
You can use the example configuration. Download the file and place it at /etc/conditional-reboot.json, the default location for the config file.
{
  "groups": [
    {
      "name": "lost connectivity",
      "state_evaluator_name": "and",
      "state_evaluator_args": {
        "reboot": "30m"
      },
      "agents": [
        {
          "checker_name": "tcp",
          "checker_args": {
            "host": "8.8.8.8",
            "port": "53"
          },
          "check_interval": "1m",
          "streak_until_ok": 1,
          "streak_until_reboot": 3
        },
        {
          "checker_name": "tcp",
          "checker_args": {
            "host": "1.1.1.1",
            "port": "53"
          },
          "check_interval": "1m",
          "streak_until_ok": 1,
          "streak_until_reboot": 3
        }
      ]
    }
  ]
}
This configuration defines a group of two checkers that check whether Google's and Cloudflare's DNS servers, respectively, can be reached via TCP.
After 3 consecutive failures, a checker transitions into the state reboot, signalling that it wants the system to be rebooted. The state_evaluator is configured to actually perform a reboot only if both checkers remain in state reboot for at least 30 minutes.
Although it is highly unlikely that both Google's and Cloudflare's public DNS servers are offline at the same time for more than 30 minutes, this configuration is only an oversimplified example. In practice, it probably makes more sense to (also) check servers inside your local network, such as your router.
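To make the structure of the example configuration more tangible, the following sketch decodes such a file in Go. The struct and function names (`Config`, `Group`, `Agent`, `parseConfig`) are hypothetical and only mirror the JSON keys shown above; they are not taken from the conditional-reboot source, and the validation of `check_interval` is an assumption about what a sensible loader might do.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Agent mirrors one entry of the "agents" array (hypothetical type).
type Agent struct {
	CheckerName       string         `json:"checker_name"`
	CheckerArgs       map[string]any `json:"checker_args"`
	CheckInterval     string         `json:"check_interval"`
	StreakUntilOk     int            `json:"streak_until_ok"`
	StreakUntilReboot int            `json:"streak_until_reboot"`
}

// Group mirrors one entry of the "groups" array (hypothetical type).
type Group struct {
	Name               string         `json:"name"`
	StateEvaluatorName string         `json:"state_evaluator_name"`
	StateEvaluatorArgs map[string]any `json:"state_evaluator_args"`
	Agents             []Agent        `json:"agents"`
}

// Config is the top-level document (hypothetical type).
type Config struct {
	Groups []Group `json:"groups"`
}

// parseConfig decodes the JSON document and verifies that every
// check_interval is a valid Go duration string such as "1m".
func parseConfig(data []byte) (*Config, error) {
	var conf Config
	if err := json.Unmarshal(data, &conf); err != nil {
		return nil, fmt.Errorf("decoding config: %w", err)
	}
	for _, g := range conf.Groups {
		for _, a := range g.Agents {
			if _, err := time.ParseDuration(a.CheckInterval); err != nil {
				return nil, fmt.Errorf("group %q: invalid check_interval: %w", g.Name, err)
			}
		}
	}
	return &conf, nil
}

// example is a shortened version of the configuration shown above.
const example = `{
  "groups": [{
    "name": "lost connectivity",
    "state_evaluator_name": "and",
    "state_evaluator_args": {"reboot": "30m"},
    "agents": [{
      "checker_name": "tcp",
      "checker_args": {"host": "8.8.8.8", "port": "53"},
      "check_interval": "1m",
      "streak_until_ok": 1,
      "streak_until_reboot": 3
    }]
  }]
}`

func main() {
	conf, err := parseConfig([]byte(example))
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	for _, g := range conf.Groups {
		fmt.Printf("group %q uses evaluator %q with %d agent(s)\n",
			g.Name, g.StateEvaluatorName, len(g.Agents))
	}
}
```

Keeping `checker_args` and `state_evaluator_args` as generic maps reflects that each checker and evaluator takes different arguments; a real loader would likely validate them per checker.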
Scenario | Description |
---|---|
Example a | Check whether pending kernel/microcode/service updates need to be applied via needrestart, but only between 02:00 and 03:00 each night. Reboot immediately if all configured local DNS servers fail to respond. |
Checkers provide conditional-reboot with the information it needs to determine whether a reboot is required.
The following checkers are available:
Name | Description |
---|---|
DNS | Checks if a specified DNS server returns a reply to a query |
File | Checks for the existence or absence of a given file |
ICMP | Checks for a reply of an ICMP echo request (ping) |
Kafka | Checks for incoming requests on a Kafka topic |
Needrestart | Checks the output of needrestart to determine whether there are pending kernel/service/microcode updates |
Prometheus | Queries Prometheus API to check whether a reboot should be performed |
TCP | Checks whether a TCP connection to a given server can be established |
Preconditions make it possible to run a checker only when a given condition is met. Currently, two preconditions are defined:
Name | Description |
---|---|
always | Invoke the checker at each tick |
time_window | Only invoke the checker during a given time window |
Agents combine a single checker with a precondition. Multiple agents form a group. Optionally, streaks can be defined.
With streaks defined, a state does not change from ok to reboot (or vice versa) immediately, but only after n consecutive identical checker results.
Name | Description |
---|---|
streak_until_ok | A checker must return at least n consecutive results indicating no reboot is needed to recover and transition to state ok |
streak_until_reboot | A checker must return at least n consecutive results indicating a reboot is needed to transition to state reboot |
Groups consist of one or more agents and a single state evaluator.
A state evaluator inspects all agents within a group and emits a single state based on the agents' states.
Name | Description |
---|---|
and | All agents within a group need to require a reboot |
or | A single agent within a group is enough to request a reboot |
Name | Description |
---|---|
initial | No state known yet |
ok | Checker indicates that no reboot is needed |
reboot | Checker indicates that a reboot is needed |
error | Error occurred while running checker |
uncertain | Streaks are configured and the checker has not yet returned the n consecutive identical results required to transition to 'ok' or 'reboot' |
All metrics are prefixed with conditional_reboot.
Name | Type | Labels |
---|---|---|
start_timestamp_seconds | Gauge | |
heartbeat_timestamp_seconds | Gauge | |
version | GaugeVec | version |
checker_last_check_timestamp_seconds | GaugeVec | checker |
checker_status | GaugeVec | checker, status |
agent_state | GaugeVec | state, checker |
agent_state_change_timestamp_seconds | GaugeVec | state, checker |
invocation_errors_total | Counter | |
Check the full changelog