Try fixing random hangs in CI due to races #2615

YJDoc2 · 2024-01-04T10:04:14Z

Ref : #2144 , specifically #2144 (comment)

As mentioned in the comment, I have added a different implementation of the "main" function that leaks the closure and lets the OS reclaim the memory at the very end instead. For safety, I have gated that function behind cfg(test), so the release code still gets the "proper" implementation. As mentioned in the comment, the races should only ever be present in tests, due to the multi-threaded nature of the test runner; actual real-world usage should not have this hang issue.

One downside is that the code that is tested is now slightly different from the code that is executed, and I'm not entirely comfortable with that; but given that our CI is showing many more timeout flakes due to these hangs, I'm willing to try this solution.

I ran the test CI ~5 times on my repo, and the tests did not hang in any run: https://github.com/YJDoc2/youki/actions?query=branch%3Atests%2Ffix-for-2144++

However, one run did fail on one test: https://github.com/YJDoc2/youki/actions/runs/7407712843/job/20154439112 . Since it still didn't hang, and the test itself might have flaked, I am not treating it as significant.

codecov-commenter · 2024-01-04T10:08:33Z

Codecov Report

Merging #2615 (4e99152) into main (b60889d) will increase coverage by 0.01%.
Report is 9 commits behind head on main.
The diff coverage is 66.66%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2615      +/-   ##
==========================================
+ Coverage   65.88%   65.89%   +0.01%     
==========================================
  Files         133      133              
  Lines       16819    16834      +15     
==========================================
+ Hits        11081    11093      +12     
- Misses       5738     5741       +3

Signed-off-by: Yashodhan Joshi <yjdoc2@gmail.com>

utam0k · 2024-01-05T04:37:28Z

crates/libcontainer/src/process/fork.rs

+    #[cfg(not(test))]
    extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
        unsafe { Box::from_raw(data as *mut CloneCb)() }
    }
+    #[cfg(test)]
+    extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
+        let mut func = unsafe { Box::from_raw(data as *mut CloneCb) };
+        let ret = func();
+        Box::into_raw(func);
+        ret
+    }


Suggested change

-    #[cfg(not(test))]
-    extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
-        unsafe { Box::from_raw(data as *mut CloneCb)() }
-    }
-    #[cfg(test)]
-    extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
-        let mut func = unsafe { Box::from_raw(data as *mut CloneCb) };
-        let ret = func();
-        Box::into_raw(func);
-        ret
-    }
+    extern "C" fn main_impl(data: *mut libc::c_void) -> libc::c_int {
+        unsafe { Box::from_raw(data as *mut CloneCb)() }
+    }
+    #[cfg(not(test))]
+    extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
+        main_impl(data)
+    }
+    #[cfg(test)]
+    extern "C" fn main(data: *mut libc::c_void) -> libc::c_int {
+        let mut func = main_impl(data);
+        let ret = func();
+        Box::into_raw(func);
+        ret
+    }

We should also use #[cfg(any(target_env = "musl"))]

Hey,

We can extract the main logic into a main_impl function, but it will need to either leak the box or return it so that the caller can leak it. If we do it as you have suggested, i.e. calling the boxed function and returning the result, we still call free when main_impl returns and run into the same potential race condition.

As per the last comment on that issue, runwasi was running into the same hang when running with libc. As far as I understand, the root cause is doing malloc/free after the fork, which is supposed to be undefined behavior. As this only has an effect in tests, I added cfg(test) and kept the normal implementation for the rest of the targets. As per yihuaf's comment, which I have linked in the description, the hang should only be observed in tests, not in actual usage, so I feel that modifying only the test target with cfg is better than modifying the whole musl compilation.

wdyt?
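A minimal sketch of the "return the box so the caller decides" variant described above. The CloneCb alias here is a stand-in (the real one lives in fork.rs and may differ), std::os::raw types replace the libc ones so the snippet has no dependencies, and the names main_drop and main_leak are hypothetical, used only to show both behaviors side by side:

```rust
use std::os::raw::{c_int, c_void};

/// Stand-in for the `CloneCb` alias from the diff (assumed shape).
type CloneCb = Box<dyn FnMut() -> c_int>;

/// Rebuilds the boxed callback from the raw pointer, runs it, and hands the
/// box back so the caller chooses whether it is freed or leaked.
///
/// Safety: `data` must point to a live `CloneCb` that was created with
/// `Box::into_raw(Box::new(cb))` and has not been freed.
unsafe fn main_impl(data: *mut c_void) -> (c_int, Box<CloneCb>) {
    let mut func = unsafe { Box::from_raw(data as *mut CloneCb) };
    let ret = func();
    (ret, func)
}

/// Hypothetical "normal" variant: dropping the box runs the destructor,
/// which ends in a call to the allocator's free.
extern "C" fn main_drop(data: *mut c_void) -> c_int {
    let (ret, func) = unsafe { main_impl(data) };
    drop(func);
    ret
}

/// Hypothetical "leaking" variant: the box is turned back into a raw pointer
/// and never freed, so no allocator call happens after the callback returns.
extern "C" fn main_leak(data: *mut c_void) -> c_int {
    let (ret, func) = unsafe { main_impl(data) };
    let _ = Box::into_raw(func);
    ret
}
```

With this shape, the cfg(test) and cfg(not(test)) versions of main would differ only in which of the two wrappers they correspond to, while the callback invocation itself stays shared.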

jprendes · 2024-01-08T16:01:18Z

Does the hang still happen after #2425?
Would you have links to hanging runs?

YJDoc2 · 2024-01-08T16:10:28Z

Does the hang still happen after #2425?
Would you have links to hanging runs?

Hey, yes it still occasionally hangs, and in fact it also hangs on glibc:

Example runs:

https://github.com/containers/youki/actions/runs/7407675307/job/20154351955 (glibc)
https://github.com/containers/youki/actions/runs/7380997666/job/20079079193 (musl)
CI for one old commit in this PR https://github.com/containers/youki/actions/runs/7408099752

You can check https://github.com/containers/youki/actions/workflows/main.yml?query=is%3Afailure where runtime is >= 20 mins.

I think the issue is that even though that PR fixes some syscall issues, the free call on the box, as investigated by yihuaf and mentioned in the comment linked in the PR description, still exists.

jprendes · 2024-01-08T17:48:31Z

Thanks!

One downside is that because of this the code that is tested v/s the one that is executed is slightly different, and I'm not entirely fine with it

Why can't we use the test behaviour in release code as well?

I would be happy with leaving the cleanup to the OS in release mode as well.

YJDoc2 · 2024-01-09T05:45:18Z

Why can't we use the test behaviour in release code as well?
I would be happy with leaving the cleanup to the OS in release mode as well.

It is mostly my preference not to leak things unless needed. There is no code-related or technical reason why both versions cannot be the same; at worst we are leaking a single box there, so I think <100 bytes of memory. I separated the code only because, as per yihuaf's investigation, the issue was only occurring in tests and should not be present in normal usage. I am fine with using the same behavior in both cases, so if that feels better, let me know and I'll update accordingly.

jprendes · 2024-01-09T09:27:03Z

Another option could be setting --test-threads=1 at the expense of slightly increased test runtime.

cargo test -- --test-threads=1

Although I'm not 100% sure that this guarantees that the process will have only one thread.

utam0k · 2024-01-09T12:33:19Z

Another option could be setting --test-threads=1 at the expense of slightly increased test runtime.
cargo test -- --test-threads=1
Although I'm not 100% sure that this guarantees that the process will have only one thread.

Thanks for your suggestion, but I'd rather not use this workaround, as I think it hides the problem too much and affects other unit tests.

crates/libcontainer/src/process/fork.rs

utam0k · 2024-01-09T12:43:06Z

This is caused by multiple threads, so it never happens in production because Youki itself always expects one thread.

Co-authored-by: Toru Komatsu <k0ma@utam0k.jp>

utam0k · 2024-01-11T11:53:16Z

@YJDoc2 Can I ask you if this failure is related to this PR?
https://github.com/containers/youki/actions/runs/7484741181/job/20372045989?pr=2615

YJDoc2 · 2024-01-12T09:01:37Z

Hmm it appears that even after the changes here, some tests can hang

test process::fork::test::test_container_err_fork has been running for over 60 seconds
test process::fork::test::test_container_fork has been running for over 60 seconds
test seccomp::tests::test_basic has been running for over 60 seconds
test seccomp::tests::test_moby has been running for over 60 seconds

This is why that CI run failed. I'm not sure what more can be done right now 😓

utam0k · 2024-01-12T11:58:11Z

hmm... @yihuaf May I ask you to look into this PR and CI?

utam0k · 2024-01-13T11:53:51Z

Another option could be setting --test-threads=1 at the expense of slightly increased test runtime.
cargo test -- --test-threads=1
Although I'm not 100% sure that this guarantees that the process will have only one thread.

Good option.

@YJDoc2
How about separating the test into two, not using clone and using clone by cfg feature flags?

YJDoc2 · 2024-01-21T15:29:58Z

How about separating the test into two, not using clone and using clone by cfg feature flags?

Hey @utam0k, I'm not sure what you mean by this. Do you mean splitting the tests into two parts, one that uses the clone path (such as fork, etc.) and one that doesn't? Then we could run the non-fork ones in parallel and the fork ones serially. However, I feel there would still be a chance that, because of the way tests are executed, the runner would still have multiple threads in total, even if we run with --test-threads=1. Still, it might be worth a try. Can you confirm if this is what you meant? Thanks!
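One way to check the open question about how many threads the test process actually has would be to read the kernel's own count. The following is only a sketch, assuming Linux and the `Threads:` field of /proc/self/status; the helper names parse_thread_count and current_thread_count are hypothetical and not part of youki:

```rust
use std::fs;

/// Extracts the value of the `Threads:` field from the text of a
/// `/proc/<pid>/status` file. Returns `None` if the field is missing
/// or is not a number.
fn parse_thread_count(status: &str) -> Option<usize> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("Threads:"))
        .and_then(|value| value.trim().parse().ok())
}

/// Linux-only: number of threads in the current process as reported by
/// the kernel. Returns `None` if /proc is unavailable or unparsable.
fn current_thread_count() -> Option<usize> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    parse_thread_count(&status)
}
```

A fork test could log or assert on current_thread_count() right before taking the clone path, which would show whether a given runner configuration really leaves the process single-threaded.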

utam0k · 2024-01-23T11:47:42Z

@YJDoc2 Yes, that's what I meant. I believe features could separate the unit tests into two parts: those using clone(2) and those not using it.

YJDoc2 · 2024-02-16T12:20:02Z

Closing this as done in #2685

YJDoc2 added the kind/test label Jan 4, 2024

YJDoc2 force-pushed the tests/fix-for-2144 branch from a8fd37e to 27e94a2 Compare January 4, 2024 10:05

Try implementing solution pointed by @yihuaf in youki-dev#2144 (comment)

4e99152

Signed-off-by: Yashodhan Joshi <yjdoc2@gmail.com>

YJDoc2 force-pushed the tests/fix-for-2144 branch from 27e94a2 to 4e99152 Compare January 4, 2024 10:09

YJDoc2 requested a review from a team January 4, 2024 10:16

utam0k reviewed Jan 5, 2024

utam0k reviewed Jan 9, 2024

utam0k approved these changes Jan 9, 2024

Update comment as per suggestion

284a4d5

Co-authored-by: Toru Komatsu <k0ma@utam0k.jp>

utam0k self-requested a review January 18, 2024 12:36

lengrongfu mentioned this pull request Jan 19, 2024

add schedule entity #2495

Merged

YJDoc2 mentioned this pull request Feb 15, 2024

Set '--test-threads' option to 1 in unit tests #2685

Merged

YJDoc2 closed this Feb 16, 2024

YJDoc2 deleted the tests/fix-for-2144 branch February 16, 2024 12:20
