Use poweroff + polling instead of qvm-kill for forced shutdowns #534
Changes from 5 commits (b79250f, 47c6a3c, 3de355f, 111ccb0, dc372ce, b0154ba)
@@ -11,9 +11,19 @@
 import logging
 import os
 import subprocess
+import time
 from datetime import datetime, timedelta
 from enum import Enum
 
+# Used for VM checks when run in Qubes dom0
+try:
+    import qubesadmin
+
+    qubes = qubesadmin.Qubes()  # pragma: no cover
+except ImportError:
+    qubes = None
+
+
 DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
 DEFAULT_HOME = ".securedrop_launcher"
 FLAG_FILE_STATUS_SD_APP = "/home/user/.securedrop_client/sdw-update-status"
@@ -403,8 +413,8 @@ def shutdown_and_start_vms():
     on, for example.
 
     All system AppVMs (sys-net, sys-firewall and sys-usb) need to be restarted.
-    We use qvm-kill for sys-firewall and sys-net, because a shutdown may fail
-    if they are currently in use as NetVMs by any of the user's other VMs.
+    We use a forced shutdown for sys-net and sys-firewall because a regular
+    shutdown command will return an error if they are in use as NetVMs.
     """
 
     sdw_vms_in_order = [
@@ -433,17 +443,10 @@ def shutdown_and_start_vms():
         sdlog.info("Safely shutting down system VM: {}".format(vm))
         _safely_shutdown_vm(vm)
 
-    # TODO: Use of qvm-kill should be considered unsafe and may have unexpected
-    # side effects. We should aim for a more graceful shutdown strategy.
     unsafe_sys_vms_in_order = ["sys-firewall", "sys-net"]
     for vm in unsafe_sys_vms_in_order:
-        sdlog.info("Killing system VM: {}".format(vm))
-        try:
-            subprocess.check_output(["qvm-kill", vm], stderr=subprocess.PIPE)
-        except subprocess.CalledProcessError as e:
-            sdlog.error("Error while killing system VM: {}".format(vm))
-            sdlog.error(str(e))
-            sdlog.error(str(e.stderr))
+        sdlog.info("Forcing shutdown of system VM: {}".format(vm))
+        _force_shutdown_vm(vm)
 
     all_sys_vms_in_order = safe_sys_vms_in_order + unsafe_sys_vms_in_order
     sdlog.info("Starting fedora-based system VMs after updates")
@@ -465,6 +468,77 @@ def _safely_shutdown_vm(vm):
         return UpdateStatus.UPDATES_FAILED
 
 
+def _force_shutdown_vm(vm):
+    """
+    Qubes does not yet provide an option to force shutdown for VMs that act as
+    NetVMs. Issuing a poweroff command and polling for the VM to stop running
+    is the recommended workaround until then.
+
+    Return value:
+    - True - if VM was successfully shut down or is not running
+    - False - if there was a problem shutting down the VM (logged as error)
+    """
+    if vm not in qubes.domains:
+        sdlog.error(
+            "Error shutting down VM '{}'. No VM with this name found.".format(vm)
+        )
+        return False
+
+    if qubes.domains[vm].is_running() is False:
+        sdlog.info("VM '{}' is not running. No shutdown necessary.".format(vm))
+        return True
+
+    if qubes.domains[vm].is_paused():
+        sdlog.info("VM '{}' is paused. Unpausing before shutdown.".format(vm))
+        qubes.domains[vm].unpause()
+
+    try:
+        qubes.domains[vm].run("poweroff", user="root")
+    except subprocess.CalledProcessError as e:
+        # Exit codes 1 and 143 may occur with successful shutdown; log others
+        if e.returncode != 1 and e.returncode != 143:
[Review thread on the exit code check]
Reviewer: How did you choose these return codes? Should we not log in all cases, in case the error code is unexpected and non-zero?
Author: In my testing I was getting return code 1 almost all the time, and 143 some of the time. I'll confirm if those results hold, but if so, I'm not sure how useful it is to log return code 1.
Author: CC'ing @marmarek in case he can shed light on the exit code behavior.
@marmarek: I think the poweroff command doesn't return until the system is really off. SIGTERM is probably systemd terminating all the user processes during shutdown. And exit code 1 is the qrexec connection being terminated (because of domain shutdown) without sending any exit code first.
Author: That explanation makes sense to me, thanks for jumping in. :)
+            sdlog.error(
+                "Error shutting down VM '{}'. "
+                "poweroff command returned unexpected exit code {}.".format(
+                    vm, e.returncode
+                )
+            )
+            return False
+
+    return _wait_for(vm, (lambda vm: qubes.domains[vm].is_running() is False))
+
+
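The review thread above settles why exit codes 1 and 143 can accompany a successful poweroff: 143 is 128 + SIGTERM (systemd terminating processes during shutdown) and 1 is the qrexec connection dropping without an exit status. A standalone sketch of that classification, simulating the remote call with a local shell command; the names `EXPECTED_SHUTDOWN_CODES` and `shutdown_code_ok` are illustrative, not from the PR:

```python
import subprocess

# Exit codes observed on successful shutdown: 1 (connection dropped without an
# exit status) and 143 (128 + SIGTERM). Anything else is treated as an error.
EXPECTED_SHUTDOWN_CODES = (1, 143)


def shutdown_code_ok(returncode):
    """Treat 0 and the known shutdown-race exit codes as success."""
    return returncode == 0 or returncode in EXPECTED_SHUTDOWN_CODES


# Simulate a poweroff command whose connection dies mid-shutdown.
try:
    subprocess.run(["sh", "-c", "exit 143"], check=True)
except subprocess.CalledProcessError as e:
    print(shutdown_code_ok(e.returncode))  # True
```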
+def _wait_for(vm, condition, timeout=60, interval=0.2):
+    """
+    Poll for the VM to enter a state using the function condition(vm),
+    which must return True once that state is reached.
+
+    Return value:
+    - True if the VM reached the expected state
+    - False if it did not
+    """
+    start_time = time.time()
+    stop_time = start_time + timeout
[Review thread on the polling interval]
Reviewer: Have you tested several values for interval here?
Author: No, happy to tweak / test a lower value. The motivation for an interval is to avoid causing undue load, especially during a 10-20 second wait.
+    while time.time() < stop_time:
+        # Evaluate condition before time measurement to include its runtime
+        condition_reached = condition(vm)
+        elapsed = time.time() - start_time
+        if condition_reached:
+            sdlog.info(
+                "VM '{}' entered expected state after {:.2f} seconds of "
+                "polling.".format(vm, elapsed)
+            )
+            return True
+        time.sleep(interval)
+
+    sdlog.error(
+        "VM '{}' did not enter expected state in the provided timeout of "
+        "{} seconds.".format(vm, timeout)
+    )
+    return False

[Review thread on shutdown timing]
Reviewer: For 5 runs, I get the following measures: […] Do you have any insight into the variance for sys-firewall? Have you need a value of […]
Author: If the API […]
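The polling loop in `_wait_for` can be exercised without Qubes by making the condition argument-free. A condition-agnostic sketch, where the lambda stands in for `qubes.domains[vm].is_running()` becoming False:

```python
import time


def wait_for(condition, timeout=60, interval=0.2):
    """Poll condition() until it returns True or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False


# Example: poll until a short timer expires (stand-in for a VM stopping).
target = time.time() + 0.3
print(wait_for(lambda: time.time() >= target, timeout=2, interval=0.05))  # True
print(wait_for(lambda: False, timeout=0.2, interval=0.05))  # False
```

The interval trades responsiveness against load: a 0.2 s sleep caps the loop at five condition checks per second during a 10-20 second shutdown wait.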
+
+
 def _safely_start_vm(vm):
     try:
         running_vms = subprocess.check_output(
[Review thread on the qubesadmin import fallback]
Reviewer: We should log something when qubes=None, and probably test these lines.
Author: If you have ideas for the error handling here, I'd appreciate guidance/collaboration on the branch. The issues I see: platform.linux_distribution is deprecated; platform.uname seems our best bet, though it really only tells us that we're on a host called dom0.
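The last thread touches on detecting whether the updater is actually running in dom0. A hedged sketch of the fallback the author mentions: `platform.linux_distribution()` was deprecated and removed in Python 3.8, while `platform.uname().node` only reports the hostname, so this check is a heuristic rather than a guarantee. The function name is hypothetical:

```python
import platform


def probably_dom0():
    """Heuristic only: Qubes dom0 hosts are conventionally named 'dom0'."""
    return platform.uname().node == "dom0"


# On a non-Qubes machine this prints False; in dom0 it should print True.
print(probably_dom0())
```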