Sporadic failures in E2E CI #347
From the containerd logs from this CI run, which generated these log artifacts:
There are 3 … This means that we are getting stuck somewhere inside of the …
This might be the root cause of #322.
Is there anything I can help you with?
@utam0k, I am currently adding debugging print statements in a fork of youki, trying to find where we are stalling. If you have any pointers to good places to look, that would be great.
It looks like we are getting stuck in here: https://github.com/jprendes/youki/blob/runwasi-ci/crates/libcontainer/src/process/container_main_process.rs#L174
Also, it looks like we never get to the executor's …
I am seeing it get stuck on this line https://github.com/jprendes/youki/blob/runwasi-ci/crates/libcontainer/src/process/container_init_process.rs#L457
which seems really odd to me!
It never makes it past https://github.com/jprendes/youki/blob/runwasi-ci/crates/libcontainer/src/syscall/linux.rs#L494
where …
How can I reproduce this error?
I haven't been able to reproduce locally, but it happens rather often in CI. My current approach is to use … I'll see how much I can do in CI with the youki debugging guide.
From what I've read, setgroups uses a thread synchronisation mechanism, as it needs to apply the change to all threads, one by one. I'm currently leaning towards something blocking there.
Here you can see the locking in the musl implementation (although in CI it's the glibc implementation we are using): … On the plus side, when the …
I am still trying to get some logs using bpftrace.
This is interesting.
@utam0k, this run has an example of both a failure on k3s and a failure on kind. k3s logs: https://github.com/containerd/runwasi/suites/16953613832/artifacts/968592411 They include the bpftrace results (they have a lot of added prints starting with …). My approach is counting the number of matches, e.g., in the …
That means that one of the processes didn't make it past the …
I switched my patched youki to use the raw syscalls instead of the libc wrappers, and I haven't seen it fail yet. I will try removing the bpftracing, as that did reduce the frequency of failures, and see if that has any impact. Using the raw syscalls is safe because the init process has only one thread when the calls are made.
I wonder if this could be related: …
I think that's the root cause, as youki is using a raw syscall to perform …
See if the investigation in this issue helps: youki-dev/youki#2144 A little more context: we are in a gray area with musl and the container way of forking processes. Officially, the libc authors do not support mixing in raw syscalls. The official stance is that after issuing a raw syscall, making any libc calls is categorized as undefined behavior and is not supported. This is also the reason why …
Thanks @yihuaf
The underlying issue is that calling a raw syscall and then calling libc syscalls becomes undefined behavior. While the authors of both glibc and musl maintain this stance, in the field, …
I think the request has been there since the 2020s. It will take some time :)
Yeah, sorry, I don't mean to wait for glibc to provide a wrapper; it was just a reference to the behaviour we observe also being reported on glibc (not just musl). I will give the fallback path a go. If that works, could the clone3 path be gated behind a feature flag?
See logs here: https://github.com/containerd/runwasi/suites/17048867407/artifacts/973767783 In particular, … You can see … As before, it is getting stuck in …
@jprendes can you describe a way to run this locally, so we can iterate and figure out the issue more quickly? Also, when you say that it is getting stuck, is the CI getting canceled because of a timeout? As far as I can tell, hanging in (essentially) a syscall is strange, as it should return an error right away if there was any setup/config issue with the call. Also, for this are we using the main branch or the released v0.2.0 crates? If we are using the released version, and the errors have only started recently, it could be due to some underlying env config changes too. In the containerd logs, is the …
@YJDoc2 I haven't been able to reproduce locally, although I will give that a more thorough try. That job in CI has a 5 min timeout, and it's getting killed after that timeout. I believe it is not hanging on the syscall itself, but on the glibc implementation, which does contain a thread synchronization mechanism (apparently the setgroups kernel syscall only affects the calling thread, and glibc takes care of calling it from every thread using signaling). Initially I thought that since youki was using a raw syscall for … The logs are using a forked version adding debugging prints, based off my PR to disable … I am not sure where the …
Hey, I saw the previous conversation regarding this, that the issue is only seen in CI; I am just asking: can you share the instructions to run the particular test/CI things locally, so we can also try to reproduce it?
Yeah, I'd also think so. I have seen this error message (in a proper crio/runc k8s) when something underlying k8s fails, such as image pulling, and potentially the runtime hanging up.
You can try running:
…
in a loop until it happens; that's what I was going to try :-)
The simplified reproduction is here: https://github.com/jprendes/clone-setgroups-reproduction
Sorry I haven't had much time to debug it lately 🙇 Also, with bpftrace or whatever, I want to make sure first that the setgroups system call is actually being called.
@utam0k calling setgroups / setresgid / setresuid directly would work around the issue.
I managed to reproduce this using @jprendes's repro repo, even when the total thread count is 2, i.e. the loop is set for 2 iterations (although it takes more runs to hang). My backtrace is: …
Which roughly leads to line 105 here: https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/nptl_setxid.c
CI is sporadically failing on E2E tests, across all of them: smoke tests, kind, and k3s tests
For an example, see this run:
https://github.com/containerd/runwasi/actions/runs/6391393082/job/17346723004