
Document how to run staging env in Qubes #3624

Merged: 22 commits into develop from docs-qubes-staging-environment, Jul 12, 2018
Conversation

conorsch (Contributor) commented Jul 9, 2018

Status

Ready for review.

Description of Changes

Fixes freedomofpress/securedrop-workstation#105.

Implements a staging environment for use in Qubes, so that devs working on the workstation code do not need to run VMs on a separate laptop in order to contact an SD server for development. The work is the result of collaboration with @joshuathayer. Submitting the PR as a "docs-only" PR because it doesn't change any files used for managing production SecureDrop instances:

$ git diff develop --name-status
A       docs/development/qubes_staging.rst
M       docs/index.rst
M       docs/servers.rst
A       molecule/qubes-staging/create.yml
A       molecule/qubes-staging/destroy.yml
A       molecule/qubes-staging/molecule.yml
A       molecule/qubes-staging/playbook.yml
A       molecule/qubes-staging/qubes-vars.yml
A       molecule/qubes-staging/ssh_config.j2

There are a few unimplemented features of the staging environment:

Config tests are not wired up

We already have the full config test suite running in CI on every (non-docs) PR, which is sufficient to prevent regressions. I'm happy to add config tests for the staging environment to the work presented here, but that should wait for the Vagrant -> Molecule conversion (#3208), in which we can ideally axe the testinfra/test.py wrapper script.

Machines need a nudge to boot after a reboot has triggered

When Ansible requests the VM reboot, the VM powers off, but Qubes won't restart it automatically. You must run qvm-start sd-app sd-mon in order to boot the VMs again, and you must run that command within the "wait_for" window, which is five minutes. Technically it's possible for us to detect that we're in Qubes and make the qvm-start calls dynamically (the Qubes Admin API makes this possible), but doing so would affect Ansible logic that runs against prod instances, so I'd prefer to follow up in a discrete PR targeting that improvement at a later date, if warranted.
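For illustration, the manual nudge could be wrapped in a small dom0 helper. This is only a sketch, not part of the PR: the helper name and polling loop are assumptions, while qvm-check and qvm-start are the standard Qubes dom0 tools.

```shell
#!/bin/sh
# Hypothetical dom0 helper: wait for each staging VM to power off after
# Ansible triggers the reboot, then start it again so the playbook can
# reconnect within the five-minute "wait_for" window.
nudge_vms() {
  for vm in sd-app sd-mon; do
    # Poll until the VM is no longer running...
    while qvm-check --quiet --running "$vm" 2>/dev/null; do
      sleep 5
    done
    # ...then boot it again so Ansible's wait_for can reconnect.
    qvm-start "$vm"
  done
}
# Usage (in dom0, while the playbook is mid-reboot): nudge_vms
```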

Halting is not a first-class feature

Molecule assumes we want to destroy a VM after testing it; by default, there's no Molecule action corresponding to vagrant halt <vm>. In the future, we can certainly implement one with a bit of additional scripting, but I'd prefer to gather developer feedback before attempting a convenient solution. Developers working in Qubes will need to run qvm-shutdown sd-app sd-mon in order to preserve the VMs for later use. The Molecule logic is only required the first time (or when server-side code changes are required); thereafter, the qvm- actions will be sufficient to get the Source and Journalist Interfaces up and running.

No automatic integration of HidServAuth tokens

After provisioning the staging environment, the sd-dev VM will have the HidServAuth tokens in the usual place: install_files/ansible-base/*-aths. These will need to be copied manually as config.json for use with development on the SecureDrop Workstation. Given that the VM provisioning and token porting tasks are each one-time, I consider that an acceptable trade-off, especially because we've reduced the number of required laptops from two to one with these changes.
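As an illustration of the manual porting step, the onion address and auth cookie could be pulled out of an aths file like this. The parse_aths helper is hypothetical, and the parsing assumes the standard Tor HidServAuth line format ("HidServAuth <address>.onion <cookie> ..."); the config.json schema itself is left to the Workstation docs.

```shell
#!/bin/sh
# Hypothetical helper: extract the onion address and auth cookie from an
# *-aths file so they can be carried over into config.json by hand.
parse_aths() {
  # $1: path to an *-aths file; prints "<address>.onion <cookie>"
  awk '/^HidServAuth/ { print $2, $3 }' "$1"
}
# Usage: parse_aths install_files/ansible-base/app-journalist-aths
```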

Testing

A Qubes machine is required for testing.

  • Follow the documentation to create a "staging" VM setup in Qubes.
  • Run molecule test -s qubes-staging.
  • When the machines go down for a reboot, run qvm-start sd-app sd-mon in dom0.
  • Confirm the playbook finishes without error.
  • Access the Source Interface over Tor and confirm it loads.
  • Run qvm-prefs sys-whonix and confirm it fails with "Service call error: request refused".

If anything is unclear in the docs, get loud. If anything doesn't work with the staging provisioning process, please post error messaging to aid in debugging.

Worthy of special attention are the Qubes RPC grants, which enable VM management from within sd-dev, rather than from dom0. Using the Qubes Admin API makes for a much slicker developer workflow, but we should scrutinize any grants we provide, here or hereafter. To the best of my ability, I've restricted the permissions to grant management rights over only the staging instances.

Deployment

None, dev env only.

Checklist

If you made changes to the server application code:

  • Linting (make ci-lint) and tests (make -C securedrop test) pass in the development container

If you made changes to securedrop-admin:

  • Linting and tests (make -C admin test) pass in the admin development container

If you made changes to the system configuration:

If you made non-trivial code changes:

  • I have written a test plan and validated it for this PR

If you made changes to documentation:

  • Doc linting (make docs-lint) passed locally

conorsch (Contributor, Author) commented Jul 9, 2018

On a few recent runs of this setup, I noticed a provisioning error related to a missing AppArmor profile for the tor abstractions. Nothing in the diff here indicates that it's caused by these changes, so I'm inviting review immediately, to see if the problem occurs for anyone else.

codecov-io commented Jul 9, 2018

Codecov Report

Merging #3624 into develop will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff            @@
##           develop    #3624   +/-   ##
========================================
  Coverage    85.04%   85.04%           
========================================
  Files           37       37           
  Lines         2367     2367           
  Branches       260      260           
========================================
  Hits          2013     2013           
  Misses         290      290           
  Partials        64       64

Last update 997db39...0d02344.

emkll (Contributor) left a comment
I went through these (very clear) docs and successfully built a SecureDrop staging environment in Qubes. The Qubes API policy works as intended.
Because the playbook updates all packages, and given the incremental effort of using template-based VMs, I think standalone VMs are the right choice in this situation. I've encountered two errors on playbook runs during the course of my testing, both occurring when I use the Molecule scenarios:

  • An issue with reboot_if_first_install (more comments inline, but it does not impact the run, just fails at the end)
  • An error at the OSSEC step, where the public key is copied over to the mon server as part of the Add the OSSEC GPG public key to the OSSEC manager keyring task:

Failed to set permissions on the temporary files Ansible needs to create when becoming an unprivileged user (rc: 1, err: chown: changing ownership of '/tmp/ansible-tmp-15325777.69-4139702993444/': Operation not permitted)

The workaround (or fix) is to install the acl package on the mon server and run the converge scenario once again.

The development VM, ``sd-dev``, will be based on Debian 9. All the other VMs
will be based on Ubuntu Trusty.

Create ``sd-dev``
Comment (Contributor):

Perhaps we refer to the (now merged) Qubes dev env docs introduced in #3623 to avoid duplication

Reply (Contributor Author):

refer to the (now merged) Qubes dev env docs

Absolutely, will rebase and cross-link.

- For DNS, use Qubes's DNS servers: ``10.139.1.1`` and ``10.139.1.2``.
- Hostname: ``sd-trusty-base``
- Domain name should be left blank

Comment (Contributor):
While it is the default option, it may be useful to specify which disk to install the operating system on, perhaps by stating: configure LVM and use Virtual disk 1 (xvda, 20.0 GB Xen Virtual Block device).

Reply (Contributor Author):

Can do; @joshuathayer had originally documented the target volume explicitly, and the declaration got lost in one of the follow-up concision edits.

molecule test -s qubes-staging

.. note::
The reboot actions run against the VMs during provisioning will only shutdown
Comment (Contributor):

There also appears to be an issue where the install playbook errors out at the very end with the following error:

Unable to retrieve file contents from /home/user/securedrop/molecule/qubes-staging/tasks/reboot_if_first_install.yml

Despite the error, the install is still completely functional.

Reply (Contributor Author):

Thanks; @msheiny wrote a different reboot handler for the AWS scenario. I'll take a closer look and see if we can modify the original to work everywhere; otherwise I'll implement a scenario-specific handler.

.. code:: sh

molecule create -s qubes-staging
molecule prepare -s qubes-staging
Comment (Contributor):

Is the prepare step necessary? molecule create already calls the prepare action after creating the VMs, and warns that no prepare playbook was found (molecule/qubes-staging/prepare.yml). Running molecule prepare explicitly returns a "Skipping, instances already prepared" message.

Reply (Contributor Author):

Good catch, @emkll, the prepare step is indeed not necessary. I had the prepare action in there as a stopgap to do some of the config that we later moved to the base VM setup docs. Will revise!

joshuathayer and others added 20 commits July 11, 2018 12:09
Using the "delegated" driver for Molecule, meaning that create
and destroy actions must be handled explicitly via the playbooks
within the scenario (rather than leaning on a convenient API).

There are a few shortcomings with the current implementation:

  * become pass is hardcoded (due to template VMs; should be changed)
  * VMs only halt, not start, when rebooted
  * the create/destroy logic is task-based, rather than module-based

We should consider writing a "qubes_vm" Ansible module, so the VMs
can be managed via e.g. state=started, state=absent, etc. The catch,
however, is that the `qubesadmin` Python library is only installable
via apt/yum, meaning the libs won't be available inside the
virtualenv without a sys.path hack.
The changes here are *very* permissive, and not appropriate for a production Qubes machine.
We can refine them after more testing, to get the minimum set of perms required.
Did not link out to the Qubes Admin API docs, but we should definitely do so.
Using qvm-prefs to force the static IP on the base VMs.
Few edits throughout, focusing on docs clarity.
During the first pass on documentation for running the staging
environment in Qubes, we used a Markdown file to jot notes quickly. This
commit preserves the content of the docs, while converting the format
to ReST, so it's visible alongside all the other project documentation.
Trying to slot all the related commands into one codeblock, so it's easy
for devs to follow along.
We already have copious documentation on how to install Trusty from an
ISO, so let's link out to that to be sure. Tacked on a reference in the
Trusty docs so that we can link directly to the install part.

Removes the recommendation to edit /etc/hosts on sd-dev, since we can
trust the provisioning scripts to manage access to the VMs for us.
Removes some extraneous notes from the WIP draft of the docs, and
includes some links out to Qubes official docs for more info.
Clarifying that this is a user account for a human administrator, rather
than a machine account used for running the SecureDrop application.
We've used "sdadmin" in other contexts, as well, such as the default
vars for configuring an instance via a Tails Workstation.

It doesn't really matter what this value is, but it will be hardcoded in
other aspects of the Qubes staging environment, and so any changes down
the road will require substantial developer intervention in order to
accommodate. So "sdadmin" is the most conservative approach, and
hopefully the most future-proof, as well.
These are still more permissive than I'd like, but better than the first
draft. More docs research required on the RPC policy syntax. The
policies here *do* prohibit management of other VMs on the Qubes machine
(e.g. `sd-whonix`, which is managed via the SecureDrop Workstation
 repo), which is better than what we had before.
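A dom0 RPC policy of that shape might look roughly like the excerpt below. This is a hedged sketch only: the file path, Admin API call name, and exact rules are assumptions about the Qubes 4.0 policy format, not the PR's actual diff.

```
# /etc/qubes-rpc/policy/admin.vm.Start (dom0) -- hypothetical excerpt
# Allow sd-dev to start only the staging VMs; deny everything else.
sd-dev sd-app allow
sd-dev sd-mon allow
$anyvm $anyvm deny
```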
Also added ReST-friendly codeblocks for the new apt install tasks for
the Qubes Admin API packages.
Keep the docs DRY! Linking out to the Trusty ISO download and verify
docs, via a new section reference. Also snuck in an unrelated typo fix in
the Trusty docs.
For practicality, reusing an existing SSH key in the sd-dev VM if found.
Spruced up the language a bit in places to make it more candid.
Pointed out by @emkll during review. We recently merged simple dev env
setup docs targeting Qubes, so we don't need the duplicate content in
the staging docs any longer.
Originally included by @joshuathayer, but aggressive concision edits on
my part removed it. During review, @emkll requested the additional
clarity, which sounds well warranted, so adding back.
conorsch (Contributor, Author) commented:

Workaround (or fix) is to install the acl package on the mon server and running the converge scenario once again.

Great catch on this one. We don't need that package in prod or in other VMs, so I'm loath to install it via the prod config logic simply to satisfy the Qubes VMs. After a bit of digging, that error relates to privilege escalation via Ansible, and one of the recommended resolutions is:

Use pipelining. When pipelining is enabled, Ansible doesn’t save the module to a temporary file on the client. Instead it pipes the module to the remote python interpreter’s stdin. Pipelining does not work for python modules involving file transfer (for example: copy, fetch, template), or for non-python modules.

That's precisely what we do in our Ansible config, in the ansible.cfg file, but notably the Qubes scenario wasn't including that config file, so the pipelining option was off, triggering the error. I'll add a reference to that same config file in the scenario, so that we're testing with the same Ansible connection settings we use elsewhere. That should obviate the need to install acl as well (although that's also a recommended resolution in the docs).
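As a sketch of that fix, one can point Ansible at the repo's ansible.cfg and sanity-check that pipelining is really on. The pipelining_enabled helper and the checkout path are assumptions; pipelining is a standard ansible.cfg setting under [ssh_connection].

```shell
#!/bin/sh
# Hypothetical sanity check: succeed only if "pipelining = True" is set in
# the given ansible.cfg, i.e. the config Ansible will actually load.
pipelining_enabled() {
  # $1: path to an ansible.cfg
  grep -Eq '^[[:space:]]*pipelining[[:space:]]*=[[:space:]]*True' "$1"
}
# Usage (path is an assumption about the checkout location):
#   export ANSIBLE_CONFIG="$HOME/securedrop/install_files/ansible-base/ansible.cfg"
#   pipelining_enabled "$ANSIBLE_CONFIG" && molecule converge -s qubes-staging
```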

We want the pipelining option in particular, but in general it's sound
practice to reuse the config so we're testing what's running in prod in
as many different contexts as possible. The dynamic inventory statement
in the ansible.cfg is not used, due to precedence of the Molecule
inventory file.
By default, Molecule sets the "create" action to call the "prepare"
action once during an instance's lifecycle [0]. We don't need the
prepare action, so let's override the create sequence to avoid
deprecation warnings displaying in the console during normal, fully
working provisioning runs of the Qubes staging environment.

Also snipped out the corresponding "prepare" action in the Qubes staging
documentation, pointed out by @emkll during review.

[0] https://molecule.readthedocs.io/en/latest/configuration.html
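The override described in that commit message could be sketched as a molecule.yml fragment. This is a hedged guess at the exact keys in Molecule's scenario sequence settings; the real change lives in molecule/qubes-staging/molecule.yml.

```yaml
# Hypothetical molecule.yml excerpt for the qubes-staging scenario:
# override the create sequence so the unused "prepare" action never runs,
# silencing the deprecation warning during normal provisioning.
scenario:
  name: qubes-staging
  create_sequence:
    - create
```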
@conorsch force-pushed the docs-qubes-staging-environment branch from 7bc46c9 to 0d02344 on July 12, 2018 00:26
conorsch (Contributor, Author) commented:

@emkll Nearly all comments addressed, please re-review. The sole comment unaddressed is:

An issue with reboot_if_first_install (more comments inline, but does not impact the run, just fails at the end)

Took a look at this, and it turns out we encountered this problem in the Molecule scenario for AWS/CI as well, where we added a var to detect whether we're on AWS, then overrode the reboot logic with a custom implementation specific to that scenario. I'd prefer not to do the same for the Qubes staging setup.

The next best bet is to convert the included task to a role, preserving the conditional logic. This will resolve the failure to find the file that you reported. I've tested this locally and it works reasonably well: the additional reboot still requires developer intervention in the form of another qvm-start sd-app sd-mon, but as mentioned above, that's an acceptable shortcoming for now. However, that logic change will affect the playbooks, meaning it's inappropriate for a docs-only PR, as I've submitted.

I'm willing to PR into this PR, if that's a reasonable workflow for you during review. Otherwise, I suggest merging as-is (with the documentation fixes already in place), and I can follow up with a separate, tiny PR that will include a staging CI run to validate that the tweak to the logic doesn't break the non-Qubes config. I slightly prefer the latter option, since it would also allow me to clean up the duplicated reboot logic in the molecule/aws/ scenario, but that's a stretch goal, so I'm willing to forgo it to keep the wheels turning.

For clarity, see proposed patch on the task to role conversion described above in this gist: https://gist.github.com/conorsch/49830560f8fce0835ab4b477e8387f56

emkll (Contributor) left a comment

Thanks for the quick fixes @conorsch! I confirm that changing the ANSIBLE_CONFIG env variable fixes the issue with the key copy.
As for the reboot error, I agree we shouldn't block merge on this, given the scope of this PR (and the scenario actually works). I have opened a ticket to track the outstanding issue (#3629), to better scope QA to those changes (which will likely require careful QA for both the Qubes and the other scenarios).

@emkll emkll merged commit 68aa4ff into develop Jul 12, 2018
@emkll emkll deleted the docs-qubes-staging-environment branch July 12, 2018 16:19