Sporadic failures in E2E CI #347

Closed
jprendes opened this issue Oct 3, 2023 · 34 comments · Fixed by #354

Comments

jprendes (Collaborator) commented Oct 3, 2023

CI is sporadically failing on E2E tests, across all of them: smoke tests, kind, and k3s tests
For an example, see this run:
https://github.com/containerd/runwasi/actions/runs/6391393082/job/17346723004

jprendes (Collaborator, Author) commented Oct 3, 2023

From the containerd logs of this CI run, which generated this logs artifact:

[INFO] starting instance: 665173b26c87393303cfdd55a68427afc5025f083f2e6a56916d11005c02c5a1
[INFO] about to build container instance: 665173b26c87393303cfdd55a68427afc5025f083f2e6a56916d11005c02c5a1
[INFO] about to REALLY build the container instance: 665173b26c87393303cfdd55a68427afc5025f083f2e6a56916d11005c02c5a1
[INFO] cgroup manager V2 will be used
[WARN] Controller rdma is not yet implemented.
[WARN] Controller misc is not yet implemented.
time="2023-10-03T21:33:53.554067718Z" level=info msg="StartContainer for \"33e8e760bae0ce1ed34b80d16e8e0cf187b51a641a82421fc1379a8a59526365\" returns successfully"
[WARN] Controller rdma is not yet implemented.
[WARN] Controller misc is not yet implemented.
[INFO] starting instance: 25633cc2ec2915d7ab31407f5312d86a271498e784f25cd9268b8d300bf76f56
[INFO] about to build container instance: 25633cc2ec2915d7ab31407f5312d86a271498e784f25cd9268b8d300bf76f56
[INFO] about to REALLY build the container instance: 25633cc2ec2915d7ab31407f5312d86a271498e784f25cd9268b8d300bf76f56
[INFO] cgroup manager V2 will be used
[WARN] Controller rdma is not yet implemented.
[WARN] Controller misc is not yet implemented.
[WARN] Controller rdma is not yet implemented.
[WARN] Controller misc is not yet implemented.
[INFO] starting instance: c2f10adebbddfc541ac0eefbeb7382b428f9bd8c87af4fd675cbf49e7cd3bc16
[INFO] about to build container instance: c2f10adebbddfc541ac0eefbeb7382b428f9bd8c87af4fd675cbf49e7cd3bc16
[INFO] about to REALLY build the container instance: c2f10adebbddfc541ac0eefbeb7382b428f9bd8c87af4fd675cbf49e7cd3bc16
[INFO] cgroup manager V2 will be used
[WARN] Controller rdma is not yet implemented.
[WARN] Controller misc is not yet implemented.
[WARN] Controller rdma is not yet implemented.
[WARN] Controller misc is not yet implemented.
[INFO] XXXXXXXX validating container
[INFO] XXXXXXXX container executor get_or_init
[INFO] XXXXXXXX a linux container!
[WARN] intermediate process already reaped
[INFO] successfully built container instance: c2f10adebbddfc541ac0eefbeb7382b428f9bd8c87af4fd675cbf49e7cd3bc16
[INFO] container instance: c2f10adebbddfc541ac0eefbeb7382b428f9bd8c87af4fd675cbf49e7cd3bc16 has pid 5773
[INFO] XXXXXXXX executing container
[INFO] executing linux container
time="2023-10-03T21:33:53.864393966Z" level=info msg="StartContainer for \"c2f10adebbddfc541ac0eefbeb7382b428f9bd8c87af4fd675cbf49e7cd3bc16\" returns successfully"
[INFO] XXXXXXXX validating container
[INFO] XXXXXXXX container executor get_or_init
[INFO] XXXXXXXX a linux container!
[WARN] intermediate process already reaped
[INFO] successfully built container instance: 25633cc2ec2915d7ab31407f5312d86a271498e784f25cd9268b8d300bf76f56
[INFO] container instance: 25633cc2ec2915d7ab31407f5312d86a271498e784f25cd9268b8d300bf76f56 has pid 5758
[INFO] XXXXXXXX executing container
[INFO] executing linux container

There are 3 starting instance: ... lines, corresponding to the 3 nginx linux containers.
There are 3 [INFO] about to REALLY build the container instance: ... lines, which are printed here just before calling build() on libcontainer's ContainerBuilder.
The call to build() should call the validate(...) method in the executor here; however, there are only 2 lines with XXXXXXXX validating container.

This means that we are getting stuck somewhere inside of the build() call.
@utam0k @yihuaf , do you have any clue on where in libcontainer we could be getting stuck?
Thanks!

jsturtevant (Contributor) commented

this might be the root cause of #322

utam0k (Member) commented Oct 4, 2023

Is there anything I can help you with?

jprendes (Collaborator, Author) commented Oct 4, 2023

@utam0k , I am currently adding debugging print statements in a fork of youki, trying to find where we are stalling. If you have any pointers to good places to look, that would be great.

jprendes (Collaborator, Author) commented Oct 4, 2023

It looks like we are getting stuck in here: https://github.com/jprendes/youki/blob/runwasi-ci/crates/libcontainer/src/process/container_main_process.rs#L174

    main_receiver.wait_for_init_ready().map_err(|err| {
        tracing::error!("failed to wait for init ready: {}", err);
        err
    })?;

Also, it looks like we never get to the executor's validate(...), which IIUC should happen in the same process where the init_ready should have been signaled. Will dig further.

jprendes (Collaborator, Author) commented Oct 4, 2023

I am seeing it get stuck on this line https://github.com/jprendes/youki/blob/runwasi-ci/crates/libcontainer/src/process/container_init_process.rs#L457

    set_supplementary_gids(proc.user(), &args.user_ns_config, syscall.as_ref()).map_err(|err| {
        tracing::error!(?err, "failed to set supplementary gids");
        err
    })?;

which seems really odd to me!

jprendes (Collaborator, Author) commented Oct 4, 2023

It never makes it past https://github.com/jprendes/youki/blob/runwasi-ci/crates/libcontainer/src/syscall/linux.rs#L494

if let Err(err) = setgroups(groups) {

where setgroups is nix::unistd::setgroups.
I wonder if this is related to libseccomp.

utam0k (Member) commented Oct 5, 2023

How can I reproduce this error?

jprendes (Collaborator, Author) commented Oct 5, 2023

I haven't been able to reproduce locally but it happens rather often in CI.

My current approach is to use log::info! to print debugging statements. Somehow log::...! ends up in the containerd logs regardless of which process is producing it (main, intermediate, init).

I'll see how much I can do in CI with the youki debugging guide.

jprendes (Collaborator, Author) commented Oct 5, 2023

From what I've read, setgroups uses a thread synchronisation mechanism, as it needs to apply the change to all threads, one by one. I'm currently leaning towards something blocking there.

jprendes (Collaborator, Author) commented Oct 5, 2023

Here you can see the locking in the musl implementation (although in CI it's the glibc implementation we are using):
https://github.com/bminor/musl/blob/79bdacff83a6bd5b70ff5ae5eb8b6de82c2f7c30/src/linux/setgroups.c#L29
https://github.com/bminor/musl/blob/79bdacff83a6bd5b70ff5ae5eb8b6de82c2f7c30/src/thread/synccall.c#L44

On the plus side, when the setgroups call is made in youki, the process only has one thread, so in theory we could just replace the libc implementation with a direct syscall, avoiding any blocking mechanism.

jprendes (Collaborator, Author) commented Oct 5, 2023

I am still trying to get some logs using bpftrace.
It seems that just by tracing it the failure rate gets reduced. Yay for Heisenbugs!

jprendes (Collaborator, Author) commented Oct 6, 2023

This is interesting.
When it fails in k3s, as far as I can see it's in the setgroups call.
When it fails in kind, as far as I can see it's in the setresuid call.
Both of these libc functions use the same thread synchronization strategy.

jprendes (Collaborator, Author) commented Oct 6, 2023

@utam0k , this run has an example of both a failure on k3s and a failure on kind.
https://github.com/containerd/runwasi/actions/runs/6431903574?pr=348

k3s logs: https://github.com/containerd/runwasi/suites/16953613832/artifacts/968592411
kind logs: https://github.com/containerd/runwasi/suites/16953613832/artifacts/968592413

They include the bpftrace result (they contain a lot of extra prints starting with YYYYYY that I've added for debugging).
If you want to match the extra logging to locations in the code please see here: https://github.com/jprendes/youki/blob/runwasi-ci/crates/libcontainer/src/process/container_init_process.rs

My approach is counting the number of matches, e.g., in the k3s logs, in bpf_debug.txt there are:

  • 9 matches for YYYYYY libcontainer::syscall::set_groups 1
  • 8 matches for YYYYYY libcontainer::syscall::set_groups 4

That means that one of the processes didn't make it past the setgroups call.
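The counting itself is trivially scriptable; a toy sketch of the idea (count_marker is a made-up helper, and the log lines below are fabricated to mimic the CI artifacts):

```rust
// Toy sketch of the counting approach: tally how many log lines contain a
// given debug marker, to spot which step a stuck process never reached.
fn count_marker(log: &str, marker: &str) -> usize {
    log.lines().filter(|line| line.contains(marker)).count()
}

fn main() {
    // Fabricated example log: three processes reach step 1, but only two
    // reach step 4, so one process got stuck between the two markers.
    let log = "\
YYYYYY libcontainer::syscall::set_groups 1
YYYYYY libcontainer::syscall::set_groups 4
YYYYYY libcontainer::syscall::set_groups 1
YYYYYY libcontainer::syscall::set_groups 4
YYYYYY libcontainer::syscall::set_groups 1";
    assert_eq!(count_marker(log, "set_groups 1"), 3);
    assert_eq!(count_marker(log, "set_groups 4"), 2);
    println!("one process never made it past setgroups");
}
```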

jprendes (Collaborator, Author) commented Oct 6, 2023

I switched my patched youki to use the raw syscall instead of the libc wrappers and I haven't seen it fail yet. I will try removing the bpftracing, as that did reduce the frequency of failures and see if that has any impact.

Using the raw syscalls is safe because the init process has only one thread when the calls are made.
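That single-thread precondition can even be checked at runtime on Linux by parsing the Threads: field of /proc/self/status (a sketch; thread_count is a hypothetical guard, not something runwasi or youki actually does):

```rust
use std::fs;

// Hypothetical guard: count the threads of the current process by reading
// the "Threads:" field from /proc/self/status (Linux-specific).
fn thread_count() -> usize {
    fs::read_to_string("/proc/self/status")
        .expect("failed to read /proc/self/status")
        .lines()
        .find_map(|line| line.strip_prefix("Threads:"))
        .and_then(|v| v.trim().parse().ok())
        .expect("no Threads: field in /proc/self/status")
}

fn main() {
    // A plain binary starts out single-threaded...
    println!("threads before spawn: {}", thread_count());

    // ...and spawning a thread raises the count while it is alive.
    let handle = std::thread::spawn(|| {
        std::thread::sleep(std::time::Duration::from_millis(200));
    });
    println!("threads after spawn: {}", thread_count());
    handle.join().unwrap();
}
```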

jprendes (Collaborator, Author) commented Oct 6, 2023

I wonder if this could be related
https://sourceware.org/bugzilla/show_bug.cgi?id=26371

jprendes (Collaborator, Author) commented Oct 6, 2023

I think that's the root cause, as youki is using a raw syscall to perform clone3.
https://github.com/containers/youki/blob/fee46419300114bfe92aee71945929c6aaba0a53/crates/libcontainer/src/process/fork.rs#L117

yihuaf (Contributor) commented Oct 6, 2023

See if the investigation in this issue helps: youki-dev/youki#2144

A little more context: we are in a gray area with musl and the way containers fork processes. Officially, the libc authors do not support calling raw syscalls. The official stance is that after calling a raw syscall, making any libc call is undefined behavior and is not supported.

This is also the reason why youki implemented a clone fallback path since clone3 doesn't have a libc wrapper.

jprendes (Collaborator, Author) commented Oct 6, 2023

Thanks @yihuaf
Please note that this is happening with glibc in these tests.
There's a glibc feature request from systemd, related to clone3 with explicit mentions of setgroups and friends.
Also, runwasi spawns multiple threads by the time youki calls clone3.

yihuaf (Contributor) commented Oct 6, 2023

The underlying issue is that calling a raw syscall and then calling a libc syscall becomes undefined behavior. While the authors of both glibc and musl maintain this stance, in practice glibc works most of the time, whereas with musl the problem is more pronounced. If youki using the clone3 syscall directly is the issue, maybe try the fallback path with clone and see if the issue persists.

There's a glibc feature request from systemd, related to clone3 with explicit mentions of setgroups and friends.

I think the request has been there since 2020. It will take some time :)

jprendes (Collaborator, Author) commented Oct 7, 2023

Yeah, sorry, I didn't mean we should wait for glibc to provide a wrapper; it was just a reference showing the behaviour we observe being reported on glibc (not just musl).

I will give the fallback path a go. If that works, could the clone3 path be gated by a feature flag?

jprendes (Collaborator, Author) commented Oct 9, 2023

@yihuaf @YJDoc2, even with the change from clone3 to clone I'm still seeing failures. I'll investigate further.

jprendes (Collaborator, Author) commented Oct 9, 2023

See logs here: https://github.com/containerd/runwasi/suites/17048867407/artifacts/973767783

In particular, see containerd.log
Debug printing corresponds to branch https://github.com/jprendes/youki/blob/runwasi-ci/crates/libcontainer/src/process/container_init_process.rs#L456

You can see [INFO] YYYYYY libcontainer::fork >> about to do normal clone, which means it is not using clone3, but libc::clone (debug printing from here: https://github.com/jprendes/youki/blob/runwasi-ci/crates/libcontainer/src/process/fork.rs#L60)

As before, it is getting stuck in setgroups.

YJDoc2 commented Oct 10, 2023

@jprendes can you describe a way to run this locally, so we can iterate and figure out the issue more quickly? Also, when you say that it is getting stuck, is the CI getting cancelled because of a timeout? As far as I can tell, hanging in (essentially) a syscall is strange, as it should return an error right away if there was any setup/config issue in the call. Also, are we using the main branch or the released v0.2.0 of the crates for this? If we are using the released version, and the errors have started only recently, it could be due to some underlying env config changes too.

In the containerd logs, is the context deadline exceeded error due to youki not starting the container/hanging?

jprendes (Collaborator, Author) commented

@YJDoc2 I haven't been able to reproduce locally, although I will give that a more thorough try.

That job in CI has a 5 min timeout, and it's getting killed after that timeout.

I believe it is not hanging on the syscall itself, but on the glibc implementation, which does contain a thread synchronization mechanism (apparently the setgroups kernel syscall only affects the calling thread, and glibc takes care of calling it from every thread using signaling).

Initially I thought that since youki was using a raw syscall for clone3, glibc didn't have a chance to do housekeeping and update the list of running threads in the new process, and that switching to libc::clone would do the trick. But after the switch it still hangs in the same place.

The logs are using a forked version adding debugging prints, based off my PR to disable clone3, which is based off youki's main branch. But it also happens with v0.2.0 (which is why I started investigating).

I am not sure where the context deadline exceeded message is coming from, but I strongly believe it's caused by the hang in setgroups.

YJDoc2 commented Oct 10, 2023

I haven't been able to reproduce locally, although I will give that a more thorough try.

Hey, I saw the previous conversation regarding this, that the issue is only seen on CI; I am just asking if you can share the instructions to run the particular test/CI things locally, so we can also try to reproduce it.

That job in CI has a 5 min tomeout, and it's getting killed after that timeout.

👍

The logs are using a forked version adding debugging prints, based off my PR to disable clone3, which is based off youki's main branch. But it also happens with v0.2.0 (which is why I started investigating).

👍

I am not sure where the context deadline exceeded message is coming from, but I strongly believe it's caused by the hang in setgroups.

Yeah, I'd also think so. I have seen this error message (in a proper crio/runc k8s) when something underlying k8s fails, such as image pulling, or potentially the runtime hanging.

jprendes (Collaborator, Author) commented

You can try running:

make test/k3s-wasmedge
make test/k3s/clean

in a loop until it happens, that's what I was going to try :-)

jprendes (Collaborator, Author) commented

I made a simplified reproduction. Debugging with gdb on the halted process I get:
[image: gdb backtrace of the halted process]

jprendes (Collaborator, Author) commented

The simplified reproduction is here: https://github.com/jprendes/clone-setgroups-reproduction

jprendes (Collaborator, Author) commented Oct 10, 2023

Well, it looks like clone also messes with printing to the console, and that's what we see in the previous backtrace.
Here's the backtrace without any printing, halted purely in setgroups:
[image: gdb backtrace halted in setgroups]

utam0k (Member) commented Oct 10, 2023

Sorry I haven't had much time to debug it lately 🙇 🙇
How about calling setgroups directly, as the Go language does?
It is strange that runc has not faced this problem.
https://cs.opensource.google/search?q=setgroups&sq=&ss=go%2Fx%2Fsys

Also, with bpftrace or whatever, I want to first make sure that the setgroups system call is actually being made.

jprendes (Collaborator, Author) commented

@utam0k calling setgroups / setresgid / setresuid directly would work around the issue.
I am slightly concerned because in normal youki / runc, we soon call execve, which starts a clean process.
In the case of runwasi, we don't, we keep running on the same tainted process.
From nix::clone and nix::fork, even memory allocation might be unsafe after the clone call (I assume it's platform / variant dependent).

YJDoc2 commented Oct 10, 2023

I managed to reproduce this using @jprendes's repro repo, even when the total thread count is 2, i.e. the loop is set for 2 iterations (although it takes more runs to hang). My backtrace is:

#0  futex_wait (private=0, expected=4294967294, futex_word=0x7f2dc61fec5c) at ../sysdeps/nptl/futex-internal.h:146
#1  futex_wait_simple (private=0, expected=4294967294, futex_word=0x7f2dc61fec5c) at ../sysdeps/nptl/futex-internal.h:177
#2  setxid_mark_thread (t=t@entry=0x7f2dc61fe640, cmdp=0x7f2dc6789ea0) at ./nptl/nptl_setxid.c:105
#3  0x00007f2dc6491a35 in __nptl_setxid (cmdp=cmdp@entry=0x7f2dc6789ea0) at ./nptl/nptl_setxid.c:195
#4  0x00007f2dc64e7712 in __GI_setgroups (groups=<optimized out>, n=<optimized out>) at ../sysdeps/unix/sysv/linux/setgroups.c:33
#5  __GI_setgroups (n=<optimized out>, groups=<optimized out>) at ../sysdeps/unix/sysv/linux/setgroups.c:28
#6  0x00005614e1abbd7b in nix::unistd::setgroups (groups=...) at src/unistd.rs:1668
#7  0x00005614e1ab6fc0 in clone_repro::main::{closure#0}::{closure#0} () at src/main.rs:61
#8  0x00005614e1ab43fe in core::ops::function::FnOnce::call_once<clone_repro::main::{closure#0}::{closure_env#0}, ()> ()
    at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/ops/function.rs:250
#9  0x00005614e1ab42c8 in alloc::boxed::{impl#47}::call_once<(), dyn core::ops::function::FnOnce<(), Output=i32>, alloc::alloc::Global> (self=..., args=())
    at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/boxed.rs:1993
#10 0x00005614e1ab422f in alloc::boxed::{impl#47}::call_once<(), alloc::boxed::Box<dyn core::ops::function::FnOnce<(), Output=i32>, alloc::alloc::Global>, alloc::alloc::Global> (
    self=0x7f2dc0000ca0, args=()) at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/boxed.rs:1993
#11 0x00005614e1ab6122 in clone_repro::clone::main (data=0x7f2dc0000ca0) at src/main.rs:35
#12 0x00007f2dc6525bf4 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

Which roughly leads to line 105 here : https://elixir.bootlin.com/glibc/glibc-2.35/source/nptl/nptl_setxid.c
It seems that for some reason the set* call is stuck in a livelock/deadlock? Also found this, but I have no idea if it's related or not: https://ewontfix.com/17/
