"Failed to return clean data" in sd-log stage #519
Comments
Currently there are no open upstream issues that match https://github.com/QubesOS/qubes-issues/issues?q=is%3Aissue+%22failed+to+return+clean+data%22. Given how frequently I see this message myself, we're obligated to report upstream, so let's aim to collect more data here. I'll make sure to post the next time I see the error; let's also look for opportunities to make our logging more informative (#521).
Since we have encountered this during a production install, @conorsch has committed to at least an initial investigation, time-boxed to 8 hours, during the 4/2-4/15 sprint.
Observed this today during an update of the
@conorsch has agreed to continue to spend up to ~4 hours on further investigating this issue during the 4/22-5/6 sprint.
I did a timeboxed investigation. Though I have seen this error many times before, I did not see it today. It occurs when the Salt mgmt layer cannot execute commands on, or read results from, the remote VM (the minion). In normal Salt,
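As a rough aid, here is a minimal sketch (not from the original comment) for checking from dom0 whether a target qube is reachable over qrexec at all, since the Salt mgmt layer depends on that same channel; the VM name `sd-log` is only an example:

```bash
# Minimal qrexec reachability check from dom0 (illustrative VM name).
# If this hangs or fails, Salt management calls to the same qube will likely fail too.
qvm-run --pass-io sd-log 'echo qrexec ok'
```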
**Salt connection failures**

The Qubes Salt logic occasionally fails with the message "Failed to return clean data". There are two categories of this error that we've observed to date:
The following shows the most detail on the "request refused" category; the timebox expired before the "stdout: deploy" category was deeply investigated.

**Request refused**

Example of error, from Salt logs:
Inspecting the syslog via
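An illustrative dom0 invocation for this kind of inspection (the VM name and time window are assumptions, not taken from the original comment) might look like:

```bash
# Illustrative only: pull recent dom0 journal entries mentioning the affected qube,
# qrexec, or libvirt around the time of the failed start.
sudo journalctl --since "1 hour ago" | grep -iE "sd-log|qrexec|libvirt"
```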
Then, inside the console logs for the failed template boot:
Nothing of particular note. The "failed to start kernel modules" line is intriguing at first glance, but it occurs on every boot of that VM, which you can prove by comparing counts of that line to "reached target multi-user system":
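One hedged way to do that comparison in dom0 (the console log path and VM name are assumptions) is to count both markers in the guest console log:

```bash
# Count occurrences of the suspicious line vs. successful-boot markers in the guest
# console log; roughly equal counts suggest the "kernel modules" failure is routine
# rather than the cause. Path and VM name are illustrative.
grep -ci "Failed to start Load Kernel Modules" /var/log/xen/console/guest-sd-log.log
grep -ci "Reached target Multi-User System" /var/log/xen/console/guest-sd-log.log
```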
The fact that the qrexec connection never succeeded means that Qubes deemed the VM boot as a failure, even though we can see from the logs the machine did start successfully. "Failed to return clean data" shows few hits on the Qubes issue tracker, but we see more for "cannot connect to qrexec agent". We've previously noted that HVMs take significantly longer to boot. To illustrate (see test script):
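A minimal sketch of that kind of measurement (not the actual test script referenced above; the VM names are placeholders for one PVH-based and one HVM-based qube):

```bash
# Rough boot-time comparison from dom0. VM names are placeholders.
for vm in sd-log sd-proxy; do
    qvm-shutdown --wait "$vm" 2>/dev/null || true  # ensure a cold start
    echo "Starting $vm"
    time qvm-start "$vm"
done
```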
The HVMs clearly take ~20s longer than PVH to boot. That suggests we may want to increase the
Since the default timeout is 60s, delays longer than that will cause a VM to be marked as failed and then killed. In fact, updating the test script above to use

With the matching error in the mgmt logs:
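Assuming the 60s timeout in question is the per-qube `qrexec_timeout` property, raising it for the affected qube from dom0 would look like this:

```bash
# Assumes the relevant timeout is the qrexec_timeout qube property (default 60s).
qvm-prefs sd-log qrexec_timeout 120
# Confirm the new value:
qvm-prefs sd-log qrexec_timeout
```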
**stdout: deploy**

Example failure:
The failure here has been quite difficult to reproduce reliably. In order to track down occurrences locally, run in dom0:
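One possible dom0 search, offered as a sketch rather than the exact command referenced above (the log path is assumed), is to grep the per-VM Salt management logs:

```bash
# Sketch: find which per-VM Salt mgmt logs in dom0 contain the error string.
sudo grep -l "Failed to return clean data" /var/log/qubes/mgmt-*.log
```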
Further investigation required to establish a reliable reproduction of the issue.
This is very useful, @conorsch, thanks for the super detailed recap. For the "Request refused" errors, should we wait for the
Per above we've agreed to defer additional investigation on this for now, until at least the
Seeing this during an update of fedora-31 (with the stdout variant). (This is with 0.3.0-rc2, staging; without the qmemman fix manually applied.)
No reports of this issue since
Closing per above.
Seen previously during QA in the `securedrop-admin --apply` stage, and now also once during a prod install, the install sometimes fails during the `sd-log` stage with a "Failed to return clean data" error:
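A plausible way to re-run just the failing stage by hand from dom0 and watch for the error, sketched here with assumed flags rather than quoted from the installer, is:

```bash
# Sketch: re-apply the sd-log states directly and watch for "Failed to return clean data".
# Flags assume the standard qubesctl wrapper shipped with Qubes' Salt management.
sudo qubesctl --show-output --skip-dom0 --targets sd-log state.highstate
```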