Sporadic failures in E2E CI #347

Closed
jprendes opened this issue Oct 3, 2023 · 34 comments · Fixed by #354

Comments

jprendes (Collaborator) commented Oct 3, 2023

CI is sporadically failing on E2E tests, across all of them: smoke tests, kind, and k3s tests
For an example, see this run:
https://github.com/containerd/runwasi/actions/runs/6391393082/job/17346723004

jprendes (Collaborator, Author) commented Oct 3, 2023

From the containerd logs of this CI run, which generated this logs artifact:

[INFO] starting instance: 665173b26c87393303cfdd55a68427afc5025f083f2e6a56916d11005c02c5a1
[INFO] about to build container instance: 665173b26c87393303cfdd55a68427afc5025f083f2e6a56916d11005c02c5a1
[INFO] about to REALLY build the container instance: 665173b26c87393303cfdd55a68427afc5025f083f2e6a56916d11005c02c5a1
[INFO] cgroup manager V2 will be used
[WARN] Controller rdma is not yet implemented.
[WARN] Controller misc is not yet implemented.
time="2023-10-03T21:33:53.554067718Z" level=info msg="StartContainer for \"33e8e760bae0ce1ed34b80d16e8e0cf187b51a641a82421fc1379a8a59526365\" returns successfully"
[WARN] Controller rdma is not yet implemented.
[WARN] Controller misc is not yet implemented.
[INFO] starting instance: 25633cc2ec2915d7ab31407f5312d86a271498e784f25cd9268b8d300bf76f56
[INFO] about to build container instance: 25633cc2ec2915d7ab31407f5312d86a271498e784f25cd9268b8d300bf76f56
[INFO] about to REALLY build the container instance: 25633cc2ec2915d7ab31407f5312d86a271498e784f25cd9268b8d300bf76f56
[INFO] cgroup manager V2 will be used
[WARN] Controller rdma is not yet implemented.
[WARN] Controller misc is not yet implemented.
[WARN] Controller rdma is not yet implemented.
[WARN] Controller misc is not yet implemented.
[INFO] starting instance: c2f10adebbddfc541ac0eefbeb7382b428f9bd8c87af4fd675cbf49e7cd3bc16
[INFO] about to build container instance: c2f10adebbddfc541ac0eefbeb7382b428f9bd8c87af4fd675cbf49e7cd3bc16
[INFO] about to REALLY build the container instance: c2f10adebbddfc541ac0eefbeb7382b428f9bd8c87af4fd675cbf49e7cd3bc16
[INFO] cgroup manager V2 will be used
[WARN] Controller rdma is not yet implemented.
[WARN] Controller misc is not yet implemented.
[WARN] Controller rdma is not yet implemented.
[WARN] Controller misc is not yet implemented.
[INFO] XXXXXXXX validating container
[INFO] XXXXXXXX container executor get_or_init
[INFO] XXXXXXXX a linux container!
[WARN] intermediate process already reaped
[INFO] successfully built container instance: c2f10adebbddfc541ac0eefbeb7382b428f9bd8c87af4fd675cbf49e7cd3bc16
[INFO] container instance: c2f10adebbddfc541ac0eefbeb7382b428f9bd8c87af4fd675cbf49e7cd3bc16 has pid 5773
[INFO] XXXXXXXX executing container
[INFO] executing linux container
time="2023-10-03T21:33:53.864393966Z" level=info msg="StartContainer for \"c2f10adebbddfc541ac0eefbeb7382b428f9bd8c87af4fd675cbf49e7cd3bc16\" returns successfully"
[INFO] XXXXXXXX validating container
[INFO] XXXXXXXX container executor get_or_init
[INFO] XXXXXXXX a linux container!
[WARN] intermediate process already reaped
[INFO] successfully built container instance: 25633cc2ec2915d7ab31407f5312d86a271498e784f25cd9268b8d300bf76f56
[INFO] container instance: 25633cc2ec2915d7ab31407f5312d86a271498e784f25cd9268b8d300bf76f56 has pid 5758
[INFO] XXXXXXXX executing container
[INFO] executing linux container

There are 3 starting instance: ... lines, corresponding to the 3 nginx linux containers.
There are 3 [INFO] about to REALLY build the container instance: ... lines, which are printed here just before calling build() on libcontainer's ContainerBuilder.
The call to build() should call the validate(...) method in the executor here; however, there are only 2 lines with XXXXXXXX validating container.

This means that we are getting stuck somewhere inside of the build() call.
@utam0k @yihuaf , do you have any clue on where in libcontainer we could be getting stuck?
Thanks!

jsturtevant (Contributor) commented

this might be the root cause of #322

utam0k (Member) commented Oct 4, 2023

Is there anything I can help you with?

jprendes (Collaborator, Author) commented Oct 4, 2023

@utam0k , I am currently adding debugging print statements in a fork of youki, trying to find where we are stalling. If you have any pointers to good places to look, that would be great.

jprendes (Collaborator, Author) commented Oct 4, 2023

It looks like we are getting stuck in here: https://github.com/jprendes/youki/blob/runwasi-ci/crates/libcontainer/src/process/container_main_process.rs#L174

    main_receiver.wait_for_init_ready().map_err(|err| {
        tracing::error!("failed to wait for init ready: {}", err);
        err
    })?;

Also, it looks like we never get to the executor's validate(...), which IIUC should happen in the same process where the init_ready should have been signaled. Will dig further.

jprendes (Collaborator, Author) commented Oct 4, 2023

I am seeing it get stuck on this line https://github.com/jprendes/youki/blob/runwasi-ci/crates/libcontainer/src/process/container_init_process.rs#L457

    set_supplementary_gids(proc.user(), &args.user_ns_config, syscall.as_ref()).map_err(|err| {
        tracing::error!(?err, "failed to set supplementary gids");
        err
    })?;

which seems really odd to me!

jprendes (Collaborator, Author) commented Oct 4, 2023

It never makes it past https://github.com/jprendes/youki/blob/runwasi-ci/crates/libcontainer/src/syscall/linux.rs#L494

if let Err(err) = setgroups(groups) {

where setgroups is nix::unistd::setgroups.
I wonder if this is related to libseccomp.

utam0k (Member) commented Oct 5, 2023

How can I reproduce this error?

jprendes (Collaborator, Author) commented Oct 5, 2023

I haven't been able to reproduce locally but it happens rather often in CI.

My current approach is to use log::info! to print debugging statements. Somehow log::...! ends up in the containerd logs regardless of which process is producing it (main, intermediate, init).

I'll see how much I can do in CI with the youki debugging guide.

jprendes (Collaborator, Author) commented Oct 5, 2023

From what I've read, setgroups uses a thread synchronisation mechanism, as it needs to apply the change to all threads, one by one. I'm currently leaning towards something blocking there.

jprendes (Collaborator, Author) commented Oct 5, 2023

Here you can see the locking in the musl implementation (although in CI it's the glibc implementation we are using):
https://github.com/bminor/musl/blob/79bdacff83a6bd5b70ff5ae5eb8b6de82c2f7c30/src/linux/setgroups.c#L29
https://github.com/bminor/musl/blob/79bdacff83a6bd5b70ff5ae5eb8b6de82c2f7c30/src/thread/synccall.c#L44

On the plus side, when the setgroups call is made in youki, the process only has one thread, so in theory we could just replace the libc implementation with a direct syscall, avoiding any blocking mechanism.

jprendes (Collaborator, Author) commented Oct 5, 2023

I am still trying to get some logs using bpftrace.
It seems that just by tracing it the failure rate gets reduced. Yay for Heisenbugs!

jprendes (Collaborator, Author) commented Oct 6, 2023

This is interesting.
When it fails in k3s, as far as I can see it's in the setgroups call.
When it fails in kind, as far as I can see it's in the setresuid call.
Both of these libc functions use the same thread synchronization strategy.

jprendes (Collaborator, Author) commented Oct 6, 2023

@utam0k , this run has an example of both a failure on k3s and a failure on kind.
https://github.com/containerd/runwasi/actions/runs/6431903574?pr=348

k3s logs: https://github.com/containerd/runwasi/suites/16953613832/artifacts/968592411
kind logs: https://github.com/containerd/runwasi/suites/16953613832/artifacts/968592413

They include the bpftrace result (they contain a lot of extra prints starting with YYYYYY that I've added for debugging).
If you want to match the extra logging to locations in the code please see here: https://github.com/jprendes/youki/blob/runwasi-ci/crates/libcontainer/src/process/container_init_process.rs

My approach is counting the number of matches, e.g., in the k3s logs, in bpf_debug.txt there are:

  • 9 matches for YYYYYY libcontainer::syscall::set_groups 1
  • 8 matches for YYYYYY libcontainer::syscall::set_groups 4

That means that one of the processes didn't make it past the setgroups call.
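The counting itself is trivially scriptable; a toy sketch of the idea (count_marker is a made-up helper, and the log lines below are fabricated to mimic the CI artifacts):

```rust
// Toy sketch of the counting approach: tally how many log lines contain a
// given debug marker, to spot which step a stuck process never reached.
fn count_marker(log: &str, marker: &str) -> usize {
    log.lines().filter(|line| line.contains(marker)).count()
}

fn main() {
    // Fabricated example log: three processes reach step 1, but only two
    // reach step 4, so one process got stuck between the two markers.
    let log = "\
YYYYYY libcontainer::syscall::set_groups 1
YYYYYY libcontainer::syscall::set_groups 4
YYYYYY libcontainer::syscall::set_groups 1
YYYYYY libcontainer::syscall::set_groups 4
YYYYYY libcontainer::syscall::set_groups 1";
    assert_eq!(count_marker(log, "set_groups 1"), 3);
    assert_eq!(count_marker(log, "set_groups 4"), 2);
    println!("one process never made it past setgroups");
}
```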

jprendes (Collaborator, Author) commented Oct 6, 2023

I switched my patched youki to use the raw syscall instead of the libc wrappers and I haven't seen it fail yet. I will try removing the bpftracing, as that did reduce the frequency of failures and see if that has any impact.

Using the raw syscalls is safe because the init process has only one thread when the calls are made.
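That single-thread precondition can even be checked at runtime on Linux by parsing the Threads: field of /proc/self/status (a sketch; thread_count is a hypothetical guard, not something runwasi or youki actually does):

```rust
use std::fs;

// Hypothetical guard: count the threads of the current process by reading
// the "Threads:" field from /proc/self/status (Linux-specific).
fn thread_count() -> usize {
    fs::read_to_string("/proc/self/status")
        .expect("failed to read /proc/self/status")
        .lines()
        .find_map(|line| line.strip_prefix("Threads:"))
        .and_then(|v| v.trim().parse().ok())
        .expect("no Threads: field in /proc/self/status")
}

fn main() {
    // A plain binary starts out single-threaded...
    println!("threads before spawn: {}", thread_count());

    // ...and spawning a thread raises the count while it is alive.
    let handle = std::thread::spawn(|| {
        std::thread::sleep(std::time::Duration::from_millis(200));
    });
    println!("threads after spawn: {}", thread_count());
    handle.join().unwrap();
}
```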

jprendes (Collaborator, Author) commented Oct 6, 2023

I wonder if this could be related
https://sourceware.org/bugzilla/show_bug.cgi?id=26371

jprendes (Collaborator, Author) commented Oct 6, 2023

I think that's the root cause, as youki is using a raw syscall to perform clone3.
https://github.com/containers/youki/blob/fee46419300114bfe92aee71945929c6aaba0a53/crates/libcontainer/src/process/fork.rs#L117

yihuaf (Contributor) commented Oct 6, 2023

See if the investigation in this issue helps: youki-dev/youki#2144

A little more context: we are in a gray area with musl and the way containers fork processes. Officially, the libc authors do not support calling raw syscalls. The official stance is that after calling a raw syscall, making any libc call is undefined behavior and is not supported.

This is also the reason why youki implemented a clone fallback path since clone3 doesn't have a libc wrapper.

jprendes (Collaborator, Author) commented Oct 6, 2023

Thanks @yihuaf
Please note that this is happening with glibc in these tests.
There's a glibc feature request from systemd, related to clone3 with explicit mentions of setgroups and friends.
Also, runwasi spawns multiple threads by the time youki calls clone3.

yihuaf (Contributor) commented Oct 6, 2023

The underlying issue is that calling a raw syscall and then calling a libc syscall becomes undefined behavior. While the authors of both glibc and musl maintain this stance, in practice glibc works most of the time, whereas with musl the problem is more pronounced. If youki using the clone3 syscall directly is the issue, maybe try the fallback path with clone and see if the issue persists.

There's a glibc feature request from systemd, related to clone3 with explicit mentions of setgroups and friends.

I think the request has been there since 2020. It will take some time :)

jprendes (Collaborator, Author) commented Oct 7, 2023

Yeah, sorry, I didn't mean we should wait for glibc to provide a wrapper; it was just a reference showing the behaviour we observe being reported on glibc (not just musl).

I will give the fallback path a go. If that works, could the clone3 path be gated by a feature flag?

jprendes (Collaborator, Author) commented Oct 9, 2023

@yihuaf @YJDoc2, even with the change from clone3 to clone I'm still seeing failures. I'll investigate further.

jprendes (Collaborator, Author) commented Oct 9, 2023

See logs here: https://github.com/containerd/runwasi/suites/17048867407/artifacts/973767783

In particular, see containerd.log
Debug printing corresponds to branch https://github.com/jprendes/youki/blob/runwasi-ci/crates/libcontainer/src/process/container_init_process.rs#L456

You can see [INFO] YYYYYY libcontainer::fork >> about to do normal clone, which means it is not using clone3, but libc::clone (debug printing from here: https://github.com/jprendes/youki/blob/runwasi-ci/crates/libcontainer/src/process/fork.rs#L60)

As before, it is getting stuck in setgroups.

YJDoc2 commented Oct 10, 2023

@jprendes can you describe a way to run this locally, so we can iterate and figure out the issue more quickly? Also, when you say that it is getting stuck, is the CI getting cancelled because of a timeout? As far as I can tell, hanging in (essentially) a syscall is strange, as it should return an error right away if there was any setup/config issue in the call. Also, are we using the main branch or the released v0.2.0 of the crates for this? If we are using the released version, and the errors have started only recently, it could be due to some underlying env config changes too.

In the containerd logs, is the context deadline exceeded error due to youki not starting the container/hanging?

jprendes (Collaborator, Author) commented

@YJDoc2 I haven't been able to reproduce locally, although I will give that a more thorough try.

That job in CI has a 5 min timeout, and it's getting killed after that timeout.

I believe it is not hanging on the syscall itself, but on the glibc implementation, which does contain a thread synchronization mechanism (apparently the setgroups kernel syscall only affects the calling thread, and glibc takes care of calling it from every thread using signaling).

Initially I thought that since youki was using a raw syscall for clone3, glibc didn't have a chance to do housekeeping and update the list of running threads in the new process, and that switching to libc::clone would do the trick. But after the switch it still hangs in the same place.

The logs are using a forked version adding debugging prints, based off my PR to disable clone3, which is based off youki's main branch. But it also happens with v0.2.0 (which is why I started investigating).

I am not sure where the context deadline exceeded message is coming from, but I strongly believe it's caused by the hang in setgroups.

YJDoc2 commented Oct 10, 2023

I haven't been able to reproduce locally, although I will give that a more thorough try.

Hey, I saw the previous conversation regarding this, that the issue is only seen on CI; I am just asking if you can share the instructions to run the particular test/CI things locally, so we can also try to reproduce it.

That job in CI has a 5 min tomeout, and it's getting killed after that timeout.

👍

The logs are using a forked version adding debugging prints, based off my PR to disable clone3, which is based off youki's main branch. But it also happens with v0.2.0 (which is why I started investigating).

👍

I am not sure where the context deadline exceeded message is coming from, but I strongly believe it's caused by the hang in setgroups.

Yeah, I'd also think so. I have seen this error message (in a proper crio/runc k8s) when something underlying k8s fails, such as image pulling, or potentially the runtime hanging.

jprendes (Collaborator, Author) commented

You can try running:

make test/k3s-wasmedge
make test/k3s/clean

in a loop until it happens, that's what I was going to try :-)

jprendes (Collaborator, Author) commented

I made a simplified reproduction. Debugging with gdb on the halted process I get:
[image: gdb backtrace of the halted process]

jprendes (Collaborator, Author) commented

The simplified reproduction is here: https://github.com/jprendes/clone-setgroups-reproduction

jprendes (Collaborator, Author) commented Oct 10, 2023

Well, it looks like clone also messes with printing to the console, and that's what we see in the previous backtrace.
Here's the backtrace without any printing, halted purely in setgroups:
[image: gdb backtrace halted in setgroups]

utam0k (Member) commented Oct 10, 2023

Sorry I haven't had much time to debug it lately 🙇 🙇
How about calling setgroups directly, as the Go language does?
It is strange that runc has not faced this problem.
https://cs.opensource.google/search?q=setgroups&sq=&ss=go%2Fx%2Fsys

Also, with bpftrace or whatever, I want to first make sure that the setgroups system call is actually being made.

jprendes (Collaborator, Author) commented

@utam0k calling setgroups / setresgid / setresuid directly would work around the issue.
I am slightly concerned because in normal youki / runc, we soon call execve, which starts a clean process.
In the case of runwasi, we don't, we keep running on the same tainted process.
From nix::clone and nix::fork, even memory allocation might be unsafe after the clone call (I assume it's platform / variant dependent).

YJDoc2 commented Oct 10, 2023

I managed to reproduce this using @jprendes's repro repo, even when the total thread count is 2, i.e. the loop is set for 2 iterations (although it takes more runs to hang). My backtrace is:

#0  futex_wait (private=0, expected=4294967294, futex_word=0x7f2dc61fec5c) at ../sysdeps/nptl/futex-internal.h:146
#1  futex_wait_simple (private=0, expected=4294967294, futex_word=0x7f2dc61fec5c) at ../sysdeps/nptl/futex-internal.h:177
#2  setxid_mark_thread (t=t@entry=0x7f2dc61fe640, cmdp=0x7f2dc6789ea0) at ./nptl/nptl_setxid.c:105
#3  0x00007f2dc6491a35 in __nptl_setxid (cmdp=cmdp@entry=0x7f2dc6789ea0) at ./nptl/nptl_setxid.c:195
#4  0x00007f2dc64e7712 in __GI_setgroups (groups=<optimized out>, n=<optimized out>) at ../sysdeps/unix/sysv/linux/setgroups.c:33
#5  __GI_setgroups (n=<optimized out>, groups=<optimized out>) at ../sysdeps/unix/sysv/linux/setgroups.c:28
#6  0x00005614e1abbd7b in nix::unistd::setgroups (groups=...) at src/unistd.rs:1668
#7  0x00005614e1ab6fc0 in clone_repro::main::{closure#0}::{closure#0} () at src/main.rs:61
#8  0x00005614e1ab43fe in core::ops::function::FnOnce::call_once<clone_repro::main::{closure#0}::{closure_env#0}, ()> ()
    at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/ops/function.rs:250
#9  0x00005614e1ab42c8 in alloc::boxed::{impl#47}::call_once<(), dyn core::ops::function::FnOnce<(), Output=i32>, alloc::alloc::Global> (self=..., args=())
    at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/boxed.rs:1993
#10 0x00005614e1ab422f in alloc::boxed::{impl#47}::call_once<(), alloc::boxed::Box<dyn core::ops::function::FnOnce<(), Output=i32>, alloc::alloc::Global>, alloc::alloc::Global> (
    self=0x7f2dc0000ca0, args=()) at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/boxed.rs:1993
#11 0x00005614e1ab6122 in clone_repro::clone::main (data=0x7f2dc0000ca0) at src/main.rs:35
#12 0x00007f2dc6525bf4 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

Which roughly leads to line 105 here : https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/nptl_setxid.c
It seems that for some reason the set* call is stuck in a livelock/deadlock? Also found this, but I have no idea if it's related or not: https://ewontfix.com/17/
