Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Failed to return clean data" in sd-log stage #519

Closed
eloquence opened this issue Mar 31, 2020 · 11 comments
Closed

"Failed to return clean data" in sd-log stage #519

eloquence opened this issue Mar 31, 2020 · 11 comments
Labels

Comments

@eloquence
Copy link
Member

Seen previously during QA in the securedrop-admin --apply stage, and now also once during a prod install, the install sometimes fails during the sd-log stage with a "Failed to return clean data" error:

sd-log:
      ----------
      _error:
          Failed to return clean data
      retcode:
          None
      stderr:
      stdout:
          deploy
sd-gpg:
@eloquence eloquence added the bug label Mar 31, 2020
@conorsch
Copy link
Contributor

Currently there are no open upstream issues that match https://github.com/QubesOS/qubes-issues/issues?q=is%3Aissue+%22failed+to+return+clean+data%22 Given how frequently I see this message myself, we're obligated to report upstream, so let's aim to collect more data here. I'll make sure to post the next time I see the error; let's also look for opportunities to make our logging more informative (#521).

@eloquence
Copy link
Member Author

Since we have encountered this during a production install, @conorsch has committed to at least an initial investigation, time-boxed to 8 hours, during the 4/2-4/15 sprint.

@eloquence
Copy link
Member Author

eloquence commented Apr 8, 2020

Observed this today during an update of the sd-app-buster-template VM (running current prod release, with the changes in https://github.com/freedomofpress/securedrop-workstation/tree/478-more-automatic-updates applied locally - which don't alter the update invocation). Given the frequency of updates, that's further increasing the potential severity of this issue. From /var/log/qubes/mgmt-sd-app-buster-template.log:

2020-04-08 12:03:48,928 calling 'state.sls update.qubes-vm'...
2020-04-08 12:04:35,283 output: sd-app-buster-template:
2020-04-08 12:04:35,284 output:     ----------
2020-04-08 12:04:35,284 output:     _error:
2020-04-08 12:04:35,284 output:         Failed to return clean data
2020-04-08 12:04:35,284 output:     retcode:
2020-04-08 12:04:35,284 output:         None
2020-04-08 12:04:35,284 output:     stderr:
2020-04-08 12:04:35,284 output:     stdout:
2020-04-08 12:04:35,284 output:         deploy
2020-04-08 12:04:35,284 exit code: 20

@eloquence
Copy link
Member Author

@conorsch has agreed to continue to spend up to ~4 hours on further investigating this issue during the 4/22-5/6 sprint.

@kushaldas
Copy link
Contributor

I did a timeboxed investigation. Though I saw this error many time before, I did not see it today.

This comes when the salt mgmt can not execute/read from the vm/remote/minion machine. In normal salt, salt-ssh is where to look. But, in our case it would be the corresponding qrexec calls, which may not be returning data back (maybe a timeout), or other errors in the vm. Maybe the qrexec service is not ready at all.

@conorsch
Copy link
Contributor

conorsch commented May 5, 2020

Salt connection failures

The Qubes Salt logic occasionally fails with a message "Failed to return clean data". There are two categories of this error that we've observed to date:

  1. "Request refused" - [prod] qrexec error during install, "Domain sd-log has failed to start" #495
  2. "stdout: deploy" - "Failed to return clean data" in sd-log stage #519

The following shows most detail on the "request refused" option, but timebox expired before the "stdout: deploy" option was deeply investigated.

Request refused

Example of error, from Salt logs /var/log/qubes/mgmt-sd-log-buster-template.log:

2020-04-29 08:48:42,287 calling 'state.sls update.qubes-vm'...
2020-04-29 08:52:05,744 output: sd-log-buster-template:
2020-04-29 08:52:05,746 output:     ----------
2020-04-29 08:52:05,746 output:     _error:
2020-04-29 08:52:05,746 output:         Failed to return clean data
2020-04-29 08:52:05,747 output:     retcode:
2020-04-29 08:52:05,747 output:         126
2020-04-29 08:52:05,747 output:     stderr:
2020-04-29 08:52:05,747 output:         Request refused
2020-04-29 08:52:05,747 output:     stdout:
2020-04-29 08:52:05,747 exit code: 20

Inspecting the syslog via journalctl -a -b | grep sd-log-buster-template, one sees:

Apr 29 08:50:30 dom0 qubesd[2468]: Starting sd-log-buster-template
Apr 29 08:50:56 dom0 qubesd[2468]: Activating the sd-log-buster-template VM
Apr 29 08:51:56 dom0 qubesd[2468]: Start failed: Cannot connect to qrexec agent for 60 seconds, see /var/log/xen/console/guest-sd-log-buster-template.log for details

Then, inside the console logs /var/log/xen/console/guest-sd-log-buster-template.log:

Console logs for failed template boot
[    6.892610] piix4_smbus 0000:00:01.3: SMBus Host Controller not enabled!
Setting up swapspace version 1, size = 1073737728 bytes
UUID=7b756995-929f-4d44-9521-b72e45ad8a4e
/dev/xvda3: clean, 136214/643376 files, 977812/2569467 blocks
[    9.162257] grsec: time set by /usr/lib/systemd/systemd[systemd:1] uid/euid:0/0 gid/egid:0/0, parent /[swapper/0:0] uid/euid:0/0 gid/egid:0/0
[.[0;1;31mFAILED.[0m] Failed to start .[0;1;39mLoad Kernel Modules.[0m.
See 'systemctl status systemd-modules-load.service' for details.
         Starting .[0;1;39mQubes DB agent.[0m...
         Starting .[0;1;39mApply Kernel Variables.[0m...
[.[0;32m  OK  .[0m] Started .[0;1;39mCreate Static Device Nodes in /dev.[0m.
         Starting .[0;1;39mudev Kernel Device Manager.[0m...
[.[0;32m  OK  .[0m] Reached target .[0;1;39mLocal File Systems (Pre).[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mudev Coldplug all Devices.[0m.
         Starting .[0;1;39mHelper to synchronize boot up for ifupdown.[0m...
[.[0;32m  OK  .[0m] Started .[0;1;39mJournal Service.[0m.
         Starting .[0;1;39mFlush Journal to Persistent Storage.[0m...
[.[0;32m  OK  .[0m] Started .[0;1;39mQubes DB agent.[0m.
         Starting .[0;1;39mInit Qubes Services settings.[0m...
         Starting .[0;1;39mLoad/Save Random Seed.[0m...
[.[0;32m  OK  .[0m] Started .[0;1;39mApply Kernel Variables.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mHelper to synchronize boot up for ifupdown.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mFlush Journal to Persistent Storage.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mLoad/Save Random Seed.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mudev Kernel Device Manager.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mInit Qubes Services settings.[0m.
         Starting .[0;1;39mInitialize and mount /rw and /home.[0m...
         Starting .[0;1;39mAdjust root filesystem size.[0m...
[.[0;32m  OK  .[0m] Started .[0;1;39mAdjust root filesystem size.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mInitialize and mount /rw and /home.[0m.
[.[0;32m  OK  .[0m] Reached target .[0;1;39mLocal File Systems.[0m.
         Starting .[0;1;39mLoad AppArmor profiles.[0m...
         Starting .[0;1;39mCreate Volatile Files and Directories.[0m...
         Starting .[0;1;39mEarly Qubes VM settings.[0m...
[.[0;32m  OK  .[0m] Started .[0;1;39mCreate Volatile Files and Directories.[0m.
[.[0;32m  OK  .[0m] Reached target .[0;1;39mSystem Time Synchronized.[0m.
         Starting .[0;1;39mUpdate UTMP about System Boot/Shutdown.[0m...
[.[0;32m  OK  .[0m] Started .[0;1;39mUpdate UTMP about System Boot/Shutdown.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mLoad AppArmor profiles.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mEntropy daemon using the HAVEGE algorithm.[0m.
         Starting .[0;1;39mRaise network interfaces.[0m...
[.[0;32m  OK  .[0m] Started .[0;1;39mEarly Qubes VM settings.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mRaise network interfaces.[0m.
[.[0;32m  OK  .[0m] Found device .[0;1;39m/dev/hvc0.[0m.
[.[0;32m  OK  .[0m] Found device .[0;1;39m/dev/xvdc1.[0m.
         Activating swap .[0;1;39m/dev/xvdc1.[0m...
[.[0;32m  OK  .[0m] Activated swap .[0;1;39m/dev/xvdc1.[0m.
[.[0;32m  OK  .[0m] Reached target .[0;1;39mSwap.[0m.
[.[0;32m  OK  .[0m] Reached target .[0;1;39mSystem Initialization.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mPeriodically check for updates.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mUpdate system time each 6h.[0m.
[.[0;32m  OK  .[0m] Listening on .[0;1;39mD-Bus System Message Bus Socket.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mDaily man-db regeneration.[0m.
[   10.235268] Error: Driver 'pcspkr' is already registered, aborting...
[.[0;32m  OK  .[0m] Listening on .[0;1;39mForward conne… updates proxy over Qubes RPC.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mCUPS Scheduler.[0m.
[.[0;32m  OK  .[0m] Reached target .[0;1;39mPaths.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mDaily rotation of log files.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mDaily Cleanup of Temporary Directories.[0m.
[.[0;32m  OK  .[0m] Reached target .[0;1;39mTimers.[0m.
[.[0;32m  OK  .[0m] Listening on .[0;1;39mCUPS Scheduler.[0m.
[.[0;32m  OK  .[0m] Reached target .[0;1;39mSockets.[0m.
[.[0;32m  OK  .[0m] Reached target .[0;1;39mBasic System.[0m.
         Starting .[0;1;39mRestore /etc/reso… the ppp link was shut down.[0m...
         Starting .[0;1;39mQubes base firewall settings.[0m...
         Starting .[0;1;39mLogin Service.[0m...
         Starting .[0;1;39mQubes memory information reporter.[0m...
         Starting .[0;1;39mXen driver domain device daemon.[0m...
         Starting .[0;1;39mSystem Logging Service.[0m...
         Starting .[0;1;39mCUPS Scheduler.[0m...
[.[0;32m  OK  .[0m] Started .[0;1;39mD-Bus System Message Bus.[0m.
         Starting .[0;1;39mWPA supplicant.[0m...
[.[0;32m  OK  .[0m] Started .[0;1;39mSystem Logging Service.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mRestore /etc/resol…re the ppp link was shut down.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mQubes memory information reporter.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mXen driver domain device daemon.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mCUPS Scheduler.[0m.
         Starting .[0;1;39mRotate log files.[0m...
         Starting .[0;1;39mDaily man-db regeneration.[0m...
         Starting .[0;1;39mQubes remote exec agent.[0m...
         Starting .[0;1;39mQubes GUI Agent.[0m...
[.[0;32m  OK  .[0m] Started .[0;1;39mLogin Service.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mWPA supplicant.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mQubes remote exec agent.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mUpdate time from ClockVM.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mQubes GUI Agent.[0m.
         Stopping .[0;1;39mCUPS Scheduler.[0m...
[.[0;32m  OK  .[0m] Stopped .[0;1;39mCUPS Scheduler.[0m.
         Starting .[0;1;39mCUPS Scheduler.[0m...
[.[0;32m  OK  .[0m] Started .[0;1;39mCUPS Scheduler.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mRotate log files.[0m.
[   10.699249] grsec: time set by /usr/bin/date[date:532] uid/euid:0/0 gid/egid:0/0, parent /usr/lib/qubes/qubes-sync-clock[qubes-sync-cloc:507] uid/euid:0/0 gid/egid:0/0
[.[0;32m  OK  .[0m] Started .[0;1;39mDaily man-db regeneration.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mQubes base firewall settings.[0m.
         Starting .[0;1;39mQubes misc post-boot actions.[0m...
[.[0;32m  OK  .[0m] Reached target .[0;1;39mNetwork.[0m.
         Starting .[0;1;39m/etc/rc.local Compatibility.[0m...
         Starting .[0;1;39mPermit User Sessions.[0m...
         Starting .[0;1;39mAdvanced key-value store.[0m...
[.[0;32m  OK  .[0m] Started .[0;1;39mPermit User Sessions.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39m/etc/rc.local Compatibility.[0m.
[.[0;32m  OK  .[0m] Created slice .[0;1;39mUser Slice of UID 1000.[0m.
         Starting .[0;1;39mUser Runtime Directory /run/user/1000.[0m...
[.[0;32m  OK  .[0m] Started .[0;1;39mSerial Getty on hvc0.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mGetty on tty1.[0m.
[.[0;32m  OK  .[0m] Reached target .[0;1;39mLogin Prompts.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mUser Runtime Directory /run/user/1000.[0m.
         Starting .[0;1;39mUser Manager for UID 1000.[0m...
[.[0;32m  OK  .[0m] Started .[0;1;39mQubes misc post-boot actions.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mAdvanced key-value store.[0m.
[.[0;32m  OK  .[0m] Reached target .[0;1;39mMulti-User System.[0m.
         Starting .[0;1;39mUpdate UTMP about System Runlevel Changes.[0m...
[.[0;32m  OK  .[0m] Started .[0;1;39mUpdate UTMP about System Runlevel Changes.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mUser Manager for UID 1000.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mSession c3 of user user.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mSession c1 of user user.[0m.
[.[0;32m  OK  .[0m] Started .[0;1;39mSession c2 of user user.[0m.

Debian GNU/Linux 10 sd-log-buster-template hvc0

sd-log-buster-template login: [   18.391892] reboot: Power down

Nothing of particular note. The "failed to start kernel modules" line is intriguing at first glance, but it occurs on every boot of that VM, which you can prove by comparing counts of that line to "reached target multi-user system":

[root@dom0 ~]# grep -ci 'failed to start.*load kernel modules' /var/log/xen/console/guest-sd-log-buster-template.log
6
[root@dom0 ~]# grep -ci 'reached target.*multi-user system' /var/log/xen/console/guest-sd-log-buster-template.log
6

The fact that the qrexec connection never succeeded means that Qubes deemed the VM boot as a failure, even though we can see from the logs the machine did start successfully. "Failed to return clean data" shows few hits on the Qubes issue tracker, but we see more for "cannot connect to qrexec agent".

We've previously noted that HVMs take significantly longer to boot. To illustrate (see test script):

[user@dom0 scripts]$ ./test-hvm-start-time.sh
Run time for test-hvm-1: 40s
Run time for test-pvh-1: 17s
Run time for test-hvm-2: 40s
Run time for test-pvh-2: 17s
Run time for test-hvm-3: 43s
Run time for test-pvh-3: 19s

The HVMs clearly take ~20s longer than PVH to boot. That suggests we may want to increase the qrexec_timeout value on the SDW VMs, since they mostly use HVM for virt_mode. The
problem of qrexec delays is worse if memory is constrained; for example, restricting maxmem=400 (to simulate qubes.qmemman service failure) shows much longer run times on both HVM & PVH:

[user@dom0 scripts]$ ./test-hvm-start-time.sh
Run time for test-hvm-1: 59s
Run time for test-pvh-3: 74s
Run time for test-hvm-2: 79s
Run time for test-pvh-1: 79s
Run time for test-hvm-3: 86s
Run time for test-pvh-2: 87s

Since the default timeout is 60s, delays longer than that will cause a VM to be marked as failed and then killed. In fact, updating the test script above to use sudo qubesctl --show-output --skip-dom0 --target $vm_name state.sls update.qubes-vm shows the error message in question:

test-hvm-2-request-refused

With the matching error in the mgmt logs:

2020-05-04 18:52:12,530 calling 'state.sls update.qubes-vm'...
2020-05-04 18:55:52,519 output: test-hvm-2:
2020-05-04 18:55:52,521 output:     ----------
2020-05-04 18:55:52,521 output:     _error:
2020-05-04 18:55:52,521 output:         Failed to return clean data
2020-05-04 18:55:52,521 output:     retcode:
2020-05-04 18:55:52,521 output:         126
2020-05-04 18:55:52,522 output:     stderr:
2020-05-04 18:55:52,522 output:         Request refused
2020-05-04 18:55:52,522 output:     stdout:
2020-05-04 18:55:52,522 exit code: 20

stdout: deploy

Example failure:

2020-04-15 11:58:47,546 calling 'state.highstate'...
2020-04-15 11:59:32,975 output: sd-log-buster-template:
2020-04-15 11:59:32,976 output:     ----------
2020-04-15 11:59:32,976 output:     _error:
2020-04-15 11:59:32,976 output:         Failed to return clean data
2020-04-15 11:59:32,976 output:     retcode:
2020-04-15 11:59:32,976 output:         None
2020-04-15 11:59:32,976 output:     stderr:
2020-04-15 11:59:32,977 output:     stdout:
2020-04-15 11:59:32,977 output:         deploy
2020-04-15 11:59:32,977 exit code: 20

The failure here has been quite difficult to reproduce reliably. In order to track down occurrences locally, run in dom0:

sudo grep -rPI -B10 'output:\s+deploy' /var/log 

Further investigation required to establish a reliable reproduction of the issue.

@eloquence
Copy link
Member Author

This is very useful, @conorsch, thanks for the super detailed recap. For the "Request refused" errors, should we wait for the qmemman fix to land and then re-test to see if the issue still occurs? Or do you think there's a case for adjusting timeout values regardless?

@eloquence
Copy link
Member Author

Per above we've agreed to defer additional investigation on this for now, until at least the qmemman fixes land.

@eloquence
Copy link
Member Author

eloquence commented May 27, 2020

Seeing this during an update of fedora-31 (with stdout deploy) today, here's the full output from the management VM log:
https://gist.github.com/eloquence/58a937fb4c59607c88f2e17bbbc1972b

(This is with 0.3.0-rc2, staging; without qmemman fix manually applied)

@eloquence
Copy link
Member Author

No reports of this issue since qmemman fixes landed upstream, proposing to close.

@eloquence
Copy link
Member Author

Closing per above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants