Use poweroff + polling instead of qvm-kill for forced shutdowns #534

Closed
wants to merge 6 commits into from
Changes from 5 commits
96 changes: 85 additions & 11 deletions launcher/sdw_updater_gui/Updater.py
@@ -11,9 +11,19 @@
import logging
import os
import subprocess
import time
from datetime import datetime, timedelta
from enum import Enum

# Used for VM checks when run in Qubes dom0
try:
    import qubesadmin

    qubes = qubesadmin.Qubes()  # pragma: no cover
except ImportError:
    qubes = None
Contributor:

We should log something when qubes=None, and probably test these lines.

Member Author:
If you have ideas for the error handling here, I'd appreciate guidance/collaboration on the branch. The issues I see:

  • handling the error, but only if we're on Qubes (platform.linux_distribution is deprecated; platform.uname seems our best bet though really only tells us that we're on a host called dom0)
  • ensuring we don't log to disk when the file is imported as part of the test suite;
  • cleanly unit testing any added code without overcomplicating things.



DATE_FORMAT = "%Y-%m-%d %H:%M:%S"
DEFAULT_HOME = ".securedrop_launcher"
FLAG_FILE_STATUS_SD_APP = "/home/user/.securedrop_client/sdw-update-status"
@@ -403,8 +413,8 @@ def shutdown_and_start_vms():
    on, for example.

    All system AppVMs (sys-net, sys-firewall and sys-usb) need to be restarted.
-    We use qvm-kill for sys-firewall and sys-net, because a shutdown may fail
-    if they are currently in use as NetVMs by any of the user's other VMs.
+    We use a forced shutdown for sys-net and sys-firewall because a regular
+    shutdown command will return an error if they are in use as NetVMs.
    """

    sdw_vms_in_order = [
@@ -433,17 +443,10 @@ def shutdown_and_start_vms():
        sdlog.info("Safely shutting down system VM: {}".format(vm))
        _safely_shutdown_vm(vm)

-    # TODO: Use of qvm-kill should be considered unsafe and may have unexpected
-    # side effects. We should aim for a more graceful shutdown strategy.
    unsafe_sys_vms_in_order = ["sys-firewall", "sys-net"]
    for vm in unsafe_sys_vms_in_order:
-        sdlog.info("Killing system VM: {}".format(vm))
-        try:
-            subprocess.check_output(["qvm-kill", vm], stderr=subprocess.PIPE)
-        except subprocess.CalledProcessError as e:
-            sdlog.error("Error while killing system VM: {}".format(vm))
-            sdlog.error(str(e))
-            sdlog.error(str(e.stderr))
+        sdlog.info("Forcing shutdown of system VM: {}".format(vm))
+        _force_shutdown_vm(vm)

    all_sys_vms_in_order = safe_sys_vms_in_order + unsafe_sys_vms_in_order
    sdlog.info("Starting fedora-based system VMs after updates")
@@ -465,6 +468,77 @@ def _safely_shutdown_vm(vm):
        return UpdateStatus.UPDATES_FAILED


def _force_shutdown_vm(vm):
    """
    Qubes does not yet provide an option to force shutdown for VMs that act as
    NetVMs. Issuing a poweroff command and polling for the VM to stop running
    is the recommended workaround until then.

    Return value:
    - True - if VM was successfully shut down or is not running
    - False - if there was a problem shutting down the VM (logged as error)
    """
    if vm not in qubes.domains:
        sdlog.error(
            "Error shutting down VM '{}'. No VM with this name found.".format(vm)
        )
        return False

    if qubes.domains[vm].is_running() is False:
        sdlog.info("VM '{}' is not running. No shutdown necessary.".format(vm))
        return True

    if qubes.domains[vm].is_paused():
        sdlog.info("VM '{}' is paused. Unpausing before shutdown.".format(vm))
        qubes.domains[vm].unpause()

    try:
        qubes.domains[vm].run("poweroff", user="root")
    except subprocess.CalledProcessError as e:
        # Exit codes 1 and 143 may occur with successful shutdown; log others
        if e.returncode != 1 and e.returncode != 143:
Contributor:
How did you choose these return codes? Should we not log in all cases, in case the error code is unexpected and non-zero?

Member Author:
In my testing I was getting return code 1 almost all the time, and 143 some of the time. I'll confirm if those results hold, but if so, I'm not sure how useful it is to log return code 1.

Member Author:
Testing qvm-run -u root sys-net poweroff a few times, I mostly get exit code 143, and got exit code 1 a few times, without any clear conditional differences; nevertheless the command completed in all cases. poweroff itself triggers SIGTERM, which is what exit code 143 signifies; I'm guessing that its parent shell process may sometimes prematurely terminate. I do not know why the command sometimes returns exit code 1, but I never get exit code 0.

CC'ing @marmarek in case he can shed light on the exit code behavior.

Reply:

I think poweroff command doesn't return until the system is really off. SIGTERM is probably systemd terminating all the user processes during shutdown. And exit code 1 is qrexec connection being terminated (because of domain shutdown) without sending any exit code first.

Member Author:
That explanation makes sense to me, thanks for jumping in. :)
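The exit-code explanation above matches the usual shell convention for signal-terminated processes: the reported exit code is 128 plus the signal number.

```python
import signal

# SIGTERM is signal 15, so a process terminated by SIGTERM reports
# exit code 128 + 15 = 143 -- the code observed during poweroff.
assert 128 + signal.SIGTERM == 143
```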

            sdlog.error(
                "Error shutting down VM '{}'. "
                "poweroff command returned unexpected exit code {}.".format(
                    vm, e.returncode
                )
            )
            return False

    return _wait_for(vm, (lambda vm: qubes.domains[vm].is_running() is False))
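The return-code filter can be exercised in isolation; the helper name below is hypothetical, not part of the PR.

```python
import subprocess


def is_benign_poweroff_exit(exc):
    # Mirrors the check above: 143 means SIGTERM during shutdown,
    # 1 means the qrexec connection was torn down before an exit code was sent.
    return exc.returncode in (1, 143)


assert is_benign_poweroff_exit(subprocess.CalledProcessError(143, "poweroff"))
assert is_benign_poweroff_exit(subprocess.CalledProcessError(1, "poweroff"))
assert not is_benign_poweroff_exit(subprocess.CalledProcessError(666, "poweroff"))
```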


def _wait_for(vm, condition, timeout=60, interval=0.2):
    """
    Poll until the VM enters the expected state, as signaled by the function
    condition(vm) returning True.

    Return value:
    - True if the VM reached the expected state
    - False if it did not
    """
    start_time = time.time()
    stop_time = start_time + timeout
Contributor:
Have you tested several values for interval here?

Member Author:
No, happy to tweak / test a lower value. The motivation for an interval is to avoid causing undue load, especially during a 10-20 second wait.

    while time.time() < stop_time:
        # Evaluate condition before time measurement to include its runtime
        condition_reached = condition(vm)
        elapsed = time.time() - start_time
        if condition_reached:
            sdlog.info(
Contributor:
For 5 runs, I get the following measures:

  • sys-firewall: [17.94, 27.42, 17.94, 27.45, 17.50]
  • sys-net: [3.38, 3.25, 3.22, 0.00, 0.00]

Do you have any insight into the variance for sys-firewall? Have you seen a value of 0.00 for sys-net before?

Member Author (@eloquence, Apr 15, 2020):
If the API run command completes after the is_running state has already reached False, no polling will be required, and the elapsed time will be 0.00 (which is the best case). I've seen fairly inconsistent behavior when using qvm-run; sometimes it exits immediately, sometimes it aligns nicely with the shutdown sequence. I think we're at the mercy of the scheduler here. Fortunately, all poweroff invocations I've tried have been successful.

                "VM '{}' entered expected state after {:.2f} seconds of "
                "polling.".format(vm, elapsed)
            )

            return True
        time.sleep(interval)

    sdlog.error(
        "VM '{}' did not enter expected state in the provided timeout of "
        "{} seconds.".format(vm, timeout)
    )
    return False
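The polling pattern used by _wait_for can be sketched standalone; the names below are illustrative, not from the PR.

```python
import time


def wait_for(condition, timeout=1.0, interval=0.05):
    """Poll condition() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False


# A condition that succeeds on the third check returns True quickly:
checks = iter([False, False, True])
assert wait_for(lambda: next(checks)) is True
# A condition that never succeeds exhausts the timeout:
assert wait_for(lambda: False, timeout=0.2) is False
```

Using time.monotonic here avoids surprises if the wall clock is adjusted mid-poll, which can matter for multi-second shutdown waits.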


def _safely_start_vm(vm):
    try:
        running_vms = subprocess.check_output(
208 changes: 148 additions & 60 deletions launcher/tests/test_updater.py
@@ -2,6 +2,7 @@
import os
import pytest
import subprocess
import re
from importlib.machinery import SourceFileLoader
from datetime import datetime, timedelta
from tempfile import TemporaryDirectory
@@ -56,6 +57,17 @@
    "sd-viewer": UpdateStatus.UPDATES_REQUIRED,
}

VM_SHUTDOWN_FAILURE_REGEX = (
    r"Error shutting down VM '.*'. poweroff command returned unexpected exit code (.*)."
)

VM_POLLING_SUCCESS_REGEX = (
    r"VM '.*' entered expected state after (.*) seconds of polling."
)
VM_POLLING_FAILURE_REGEX = (
    r"VM '.*' did not enter expected state in the provided timeout of (.*) seconds."
)
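These constants drive assertions like the following; the sample log message is hypothetical but follows the format strings in Updater.py.

```python
import re

VM_POLLING_SUCCESS_REGEX = (
    r"VM '.*' entered expected state after (.*) seconds of polling."
)

message = "VM 'sys-net' entered expected state after 3.25 seconds of polling."
match = re.search(VM_POLLING_SUCCESS_REGEX, message)
assert match is not None
# The captured group holds the elapsed time, which the tests convert to float.
assert float(match.group(1)) == 3.25
```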


def test_updater_vms_present():
    assert len(updater.current_templates) == 9
@@ -504,6 +516,133 @@ def test_apply_updates_dom0_failure(mocked_info, mocked_error, mocked_call):
    mocked_error.assert_has_calls(error_log)


@mock.patch("Updater._wait_for")
@mock.patch("Updater.qubes")
@mock.patch("Updater.sdlog.error")
@mock.patch("Updater.sdlog.info")
def test_force_shutdown_vm_not_found(
    mocked_info, mocked_error, mocked_qubes, mocked_wait
):
    mocked_qubes.domains = {}
    assert updater._force_shutdown_vm("sys-net") is False
    assert mocked_error.called
    assert not mocked_wait.called


@mock.patch("Updater._wait_for")
@mock.patch("Updater.qubes")
@mock.patch("Updater.sdlog.error")
@mock.patch("Updater.sdlog.info")
def test_force_shutdown_vm_not_running(
    mocked_info, mocked_error, mocked_qubes, mocked_wait
):
    mocked_qubes.domains = {"sys-net": mock.MagicMock()}
    mocked_qubes.domains["sys-net"].is_running = mock.MagicMock(return_value=False)
    assert updater._force_shutdown_vm("sys-net") is True
    assert not mocked_error.called
    assert mocked_info.called
    assert not mocked_wait.called


@mock.patch("Updater._wait_for", return_value=True)
@mock.patch("Updater.qubes")
@mock.patch("Updater.sdlog.error")
@mock.patch("Updater.sdlog.info")
def test_force_shutdown_unpauses_and_shuts_down(
    mocked_info, mocked_error, mocked_qubes, mocked_wait
):
    mocked_qubes.domains = {"sys-net": mock.MagicMock()}
    mocked_qubes.domains["sys-net"].is_running = mock.MagicMock(return_value=True)
    mocked_qubes.domains["sys-net"].is_paused = mock.MagicMock(return_value=True)

    assert updater._force_shutdown_vm("sys-net") is True
    assert mocked_qubes.domains["sys-net"].unpause.called
    assert mocked_qubes.domains["sys-net"].run.called

    assert not mocked_error.called
    assert mocked_info.called
    assert mocked_wait.called


@pytest.mark.parametrize("good_returncode", (1, 143))
@mock.patch("Updater._wait_for", return_value=True)
@mock.patch("Updater.qubes")
@mock.patch("Updater.sdlog.error")
@mock.patch("Updater.sdlog.info")
def test_force_shutdown_expected_return_code_is_handled(
    mocked_info, mocked_error, mocked_qubes, mocked_wait, good_returncode
):
    mocked_qubes.domains = {"sys-net": mock.MagicMock()}
    mocked_qubes.domains["sys-net"].is_running = mock.MagicMock(return_value=True)
    mocked_qubes.domains["sys-net"].run = mock.MagicMock(
        side_effect=subprocess.CalledProcessError(returncode=good_returncode, cmd="")
    )
    assert updater._force_shutdown_vm("sys-net") is True
    assert mocked_qubes.domains["sys-net"].run.called
    assert mocked_wait.called


@mock.patch("Updater._wait_for")
@mock.patch("Updater.qubes")
@mock.patch("Updater.sdlog.error")
@mock.patch("Updater.sdlog.info")
def test_force_shutdown_unexpected_return_code_is_reported(
    mocked_info, mocked_error, mocked_qubes, mocked_wait
):
    mocked_qubes.domains = {"sys-net": mock.MagicMock()}
    mocked_qubes.domains["sys-net"].is_running = mock.MagicMock(return_value=True)
    mocked_qubes.domains["sys-net"].run = mock.MagicMock(
        side_effect=subprocess.CalledProcessError(returncode=666, cmd="")
    )
    assert updater._force_shutdown_vm("sys-net") is False
    assert mocked_qubes.domains["sys-net"].run.called
    assert mocked_error.called
    error_string = mocked_error.call_args[0][0]
    match = re.search(VM_SHUTDOWN_FAILURE_REGEX, error_string)
    assert match is not None
    found_code = int(match.group(1))
    assert found_code == 666
    assert not mocked_wait.called


@mock.patch("Updater.sdlog.error")
@mock.patch("Updater.sdlog.info")
def test_force_shutdown_failure(mocked_info, mocked_error):
    assert True


@mock.patch("Updater.sdlog.error")
@mock.patch("Updater.sdlog.info")
def test_vm_polling_success(mocked_info, mocked_error):
    poll_results = mock.MagicMock(side_effect=(False, False, True))
    assert updater._wait_for("sys-net", poll_results, interval=0.1, timeout=1) is True
    assert mocked_info.called
    info_string = mocked_info.call_args[0][0]
    match = re.search(VM_POLLING_SUCCESS_REGEX, info_string)
    assert match is not None
    elapsed = float(match.group(1))
    # With a sleep interval of 0.1, at least 0.2 seconds should pass before we
    # get the expected result.
    assert elapsed >= 0.20
    assert not mocked_error.called


@mock.patch("Updater.sdlog.error")
@mock.patch("Updater.sdlog.info")
def test_vm_polling_failure(mocked_info, mocked_error):
    poll_results = mock.MagicMock(side_effect=(False, False, False))
    assert (
        updater._wait_for("sys-net", poll_results, interval=0.1, timeout=0.3) is False
    )
    assert not mocked_info.called
    assert mocked_error.called
    error_string = mocked_error.call_args[0][0]
    match = re.search(VM_POLLING_FAILURE_REGEX, error_string)
    assert match is not None
    timeout = float(match.group(1))
    assert timeout == 0.30


@pytest.mark.parametrize("vm", current_templates.keys())
@mock.patch("subprocess.check_call", side_effect="0")
@mock.patch("Updater.sdlog.error")
@@ -668,15 +807,18 @@ def test_safely_shutdown_fails(mocked_info, mocked_error, mocked_call, vm):
@mock.patch("subprocess.check_output")
@mock.patch("Updater._safely_start_vm")
@mock.patch("Updater._safely_shutdown_vm")
@mock.patch("Updater._force_shutdown_vm")
@mock.patch("Updater.sdlog.error")
@mock.patch("Updater.sdlog.info")
def test_shutdown_and_start_vms(
-    mocked_info, mocked_error, mocked_shutdown, mocked_start, mocked_output
+    mocked_info,
+    mocked_error,
+    mocked_force_shutdown,
+    mocked_shutdown,
+    mocked_start,
+    mocked_output,
):
-    sys_vm_kill_calls = [
-        call(["qvm-kill", "sys-firewall"], stderr=-1),
-        call(["qvm-kill", "sys-net"], stderr=-1),
-    ]
+    force_shutdown_calls = [call("sys-firewall"), call("sys-net")]
    sys_vm_shutdown_calls = [
        call("sys-usb"),
        call("sys-whonix"),
@@ -705,7 +847,7 @@ def test_shutdown_and_start_vms(
        call("sd-log"),
    ]
    updater.shutdown_and_start_vms()
-    mocked_output.assert_has_calls(sys_vm_kill_calls)
+    mocked_force_shutdown.assert_has_calls(force_shutdown_calls)
    mocked_shutdown.assert_has_calls(
        template_vm_calls + app_vm_calls + sys_vm_shutdown_calls
    )
@@ -714,60 +856,6 @@
    assert not mocked_error.called


@mock.patch(
    "subprocess.check_output",
    side_effect=subprocess.CalledProcessError(1, "check_output"),
)
@mock.patch("Updater._safely_start_vm")
@mock.patch("Updater._safely_shutdown_vm")
@mock.patch("Updater.sdlog.error")
@mock.patch("Updater.sdlog.info")
def test_shutdown_and_start_vms_sysvm_fail(
    mocked_info, mocked_error, mocked_shutdown, mocked_start, mocked_output
):
    sys_vm_kill_calls = [
        call(["qvm-kill", "sys-firewall"], stderr=-1),
        call(["qvm-kill", "sys-net"], stderr=-1),
    ]
    sys_vm_start_calls = [
        call("sys-net"),
        call("sys-firewall"),
        call("sys-whonix"),
        call("sys-usb"),
    ]
    app_vm_calls = [
        call("sd-app"),
        call("sd-proxy"),
        call("sd-whonix"),
        call("sd-gpg"),
        call("sd-log"),
    ]
    template_vm_calls = [
        call("fedora-30"),
        call("sd-viewer-buster-template"),
        call("sd-app-buster-template"),
        call("sd-log-buster-template"),
        call("sd-devices-buster-template"),
        call("sd-proxy-buster-template"),
        call("whonix-gw-15"),
        call("securedrop-workstation-buster"),
    ]
    error_calls = [
        call("Error while killing system VM: sys-firewall"),
        call("Command 'check_output' returned non-zero exit status 1."),
        call("None"),
        call("Error while killing system VM: sys-net"),
        call("Command 'check_output' returned non-zero exit status 1."),
        call("None"),
    ]
    updater.shutdown_and_start_vms()
    mocked_output.assert_has_calls(sys_vm_kill_calls)
    mocked_shutdown.assert_has_calls(template_vm_calls + app_vm_calls)
    app_vm_calls_reversed = list(reversed(app_vm_calls))
    mocked_start.assert_has_calls(sys_vm_start_calls + app_vm_calls_reversed)
    mocked_error.assert_has_calls(error_calls)


@pytest.mark.parametrize("status", UpdateStatus)
@mock.patch("subprocess.check_call")
@mock.patch("os.path.expanduser", return_value=temp_dir)