Low entropy causes VM agent to hang indefinitely during start #325
For more background on how I encountered/debugged this:
I looked into whether that
The key line there being |
I tried adding I then tried starting haveged during VM boot, which worked as expected, unblocking agent on @nmeyerhans Do you have any thoughts on all of this? Is there any standard way of dealing with this type of issue in debian? |
Have you tried passing haveged is a reasonable approach. It's actually embedded in the Debian installer to ensure that systems have usable entropy during the installation process. See the discussion of this at https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=923675 There's an in-kernel jitter-entropy collector available in very new kernels (linux 5.4, IIRC), but I don't think we're quite ready for that. https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=50ee7529ec45
I didn't try that because, based on the thread, we'd need at least kernel 4.19 (which is plausible, of course, but would be nice to avoid as a minimum dependency if possible), and I wasn't sure whether all our users would actually want to trust the CPU RNG (though I'm happy to be told that's not a meaningful concern). It's worth a shot, though, since it seems pretty easy to compile a new kernel using their guide. I'll update with the results.
On a separate note, I think I managed to track down why this started suddenly happening to me out of seemingly nowhere (it's a doozy...) Here's a stack trace from gdb running on agent that caught the
So it looks like there is an As for why it happened during the changes I was making, while I was not changing the agent code at all as part of what I was working on I happened to migrate this code from the While it's good there is at least something we can do to avoid this particular situation (I can rearrange imports to avoid that dependency ending up in agent), this is still an extremely delicate spot to be in. It would be much preferable to not be at the mercy of any transitive dependency deciding to do a random read in an init. |
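To illustrate the hazard described above, here is a minimal Go sketch of the general pattern, not the actual dependency involved. It defers a read from the random source until first use instead of doing it in a package-level init, so process start does not depend on the kernel's entropy state. The lazySeed type, newLazySeed, and Get are hypothetical names introduced only for this example.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
	"io"
	"sync"
)

// lazySeed (hypothetical) reads 8 random bytes the first time Get is called,
// rather than in an init function that runs before main.
type lazySeed struct {
	once   sync.Once
	source io.Reader
	seed   uint64
	err    error
}

// newLazySeed only stores the reader; it performs no read itself.
func newLazySeed(source io.Reader) *lazySeed {
	return &lazySeed{source: source}
}

// Get performs the read exactly once and returns the cached result afterwards.
func (l *lazySeed) Get() (uint64, error) {
	l.once.Do(func() {
		var buf [8]byte
		if _, err := io.ReadFull(l.source, buf[:]); err != nil {
			l.err = err
			return
		}
		l.seed = binary.LittleEndian.Uint64(buf[:])
	})
	return l.seed, l.err
}

// Constructing the value at package level is safe: nothing is read yet.
var seed = newLazySeed(rand.Reader)

func main() {
	fmt.Println("started without touching the random source")
	s, err := seed.Get() // any wait for entropy would happen here, not at init
	if err != nil {
		fmt.Println("could not read random bytes:", err)
		return
	}
	fmt.Printf("seed: %#x\n", s)
}
```

This only helps for code we control. A transitive dependency that reads random bytes in its own init would still need to be kept out of the agent binary.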
Using a 4.19 kernel @nmeyerhans sent plus including So in terms of what we can do for the image-builder rootfs, the top options I'm seeing right now are:
@nmeyerhans any thoughts on the above? I think my preference at the moment would be 2. and to document in bold that the image-builder rootfs uses |
RDRAND alone probably shouldn't be trusted. Mixing data from RDRAND with other sources of entropy would be reasonable, but we don't really have those. Consider issues like this one and these, which suggest that RDRAND shouldn't be relied on all by itself. I'm inclined to agree that option 2 is reasonable, at least for now. Supporting build-time configuration options, such as customization of the kernel command line, might be worth doing. We should also see about getting Firecracker support for virtio-rng. It may not be appropriate in all environments, and it still has some of the same trust issues as RDRAND, but I think every option involves some level of tradeoff. At least by supporting multiple options we give users the opportunity to deploy the one that makes sense for their use case.
Fixed in our image-builder's rootfs by adding |
It appears that, for unknown reasons, a Firecracker VM may not have sufficient entropy during boot, which can cause our VM agent to experience long pauses very early during process start (possibly while the Go runtime itself is still starting, before any of our code is actually executing). This results in timeouts when the runtime shim attempts to connect to the agent over vsock, producing errors like
Because the pause happens so early when agent is starting, it never writes any output to stdout/stderr, so the debug logs show nothing from agent.
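Since the agent itself logs nothing in this state, one way to check the guest's entropy from a separate process is to read the kernel's own estimate. The following is a minimal sketch, assuming a Linux guest that exposes /proc/sys/kernel/random/entropy_avail. The function names parseEntropyAvail and readEntropyAvail are hypothetical and not part of the agent.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// entropyAvailPath is the Linux procfs file holding the kernel's entropy
// estimate, in bits, as a decimal integer followed by a newline.
const entropyAvailPath = "/proc/sys/kernel/random/entropy_avail"

// parseEntropyAvail (hypothetical helper) converts the file contents to an int.
func parseEntropyAvail(raw string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(raw))
}

// readEntropyAvail (hypothetical helper) reads and parses the estimate at path.
func readEntropyAvail(path string) (int, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return parseEntropyAvail(string(data))
}

func main() {
	bits, err := readEntropyAvail(entropyAvailPath)
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not read entropy estimate:", err)
		return
	}
	fmt.Printf("entropy_avail: %d bits\n", bits)
}
```

Running this inside the VM shortly after boot, before and after starting an entropy daemon, would show whether the estimate changes. How to interpret the number depends on the guest kernel version.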
Running low on entropy during boot is an open issue with Firecracker. There are some suggested fixes there (which I have not yet tried), but the thread doesn't appear to have reached a strong conclusion at this time. I just saw the rngd suggestion in the thread; I will give that a try and update this issue with the results.
I don't know why this suddenly started happening to me. It's on the dev machine I normally use (an i3.metal with a very similar setup to the CI machines). It may just be entirely random when it starts occurring.
We need to follow that Firecracker thread on solutions to this issue and decide how we can address this for firecracker-containerd users.