-
Notifications
You must be signed in to change notification settings - Fork 346
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Try fixing random hangs in CI due to races #2615
Conversation
a8fd37e
to
27e94a2
Compare
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #2615 +/- ##
==========================================
+ Coverage 65.88% 65.89% +0.01%
==========================================
Files 133 133
Lines 16819 16834 +15
==========================================
+ Hits 11081 11093 +12
- Misses 5738 5741 +3 |
27e94a2
to
4e99152
Compare
#[cfg(not(test))] | ||
extern "C" fn main(data: *mut libc::c_void) -> libc::c_int { | ||
unsafe { Box::from_raw(data as *mut CloneCb)() } | ||
} | ||
#[cfg(test)] | ||
extern "C" fn main(data: *mut libc::c_void) -> libc::c_int { | ||
let mut func = unsafe { Box::from_raw(data as *mut CloneCb) }; | ||
let ret = func(); | ||
Box::into_raw(func); | ||
ret | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#[cfg(not(test))] | |
extern "C" fn main(data: *mut libc::c_void) -> libc::c_int { | |
unsafe { Box::from_raw(data as *mut CloneCb)() } | |
} | |
#[cfg(test)] | |
extern "C" fn main(data: *mut libc::c_void) -> libc::c_int { | |
let mut func = unsafe { Box::from_raw(data as *mut CloneCb) }; | |
let ret = func(); | |
Box::into_raw(func); | |
ret | |
} | |
extern "C" fn main_impl(data: *mut libc::c_void) -> libc::c_int { | |
unsafe { Box::from_raw(data as *mut CloneCb)() } | |
} | |
#[cfg(not(test))] | |
extern "C" fn main(data: *mut libc::c_void) -> libc::c_int { | |
main_impl(data) | |
} | |
#[cfg(test)] | |
extern "C" fn main(data: *mut libc::c_void) -> libc::c_int { | |
let mut func = main_impl(data); | |
let ret = func(); | |
Box::into_raw(func); | |
ret | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also use #[cfg(any(target_env = "musl"))]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey,
- We can extract the main logic in a
main_impl
function, but it will need to leak the box or return the box which then the caller will have to leak it. If we do it as you have suggested, i.e. calling the boxed function and returning the result, we still do the free call whenmain_impl
returns and run into same potential race condition. - As per the last comment on that issues, runwasi was running into same hanging-up issue when running with libc. As far as I understand, the root cause if doing malloc/free after the fork, which is supposed to be undefined behavior. As this only affect in tests, I had added cfg test, and kept rest of the targets to the normal implementation. As per yihuaf 's comment, which I have linked in the desc, the hand issue should only be observed in tests, not in actual usage, so I feel only modifying the test target with cfg is better than modifying the whole musl compilation.
wdyt?
Does the hang still happen after #2425? |
Hey, yes it still occasionally hangs, and in fact also hangs on glib : eg Runs :
You can check https://github.com/containers/youki/actions/workflows/main.yml?query=is%3Afailure where runtime is >= 20 mins. I think the issue is that even though that PR fixes some syscall issues, the box free syscall as investigated by yihuaf and mentioned in the comment in PR desc still exist. |
Thanks!
Why can't we use the test behaviour in release code as well? I would be happy with leaving the cleanup to the OS in release mode as well. |
It is mostly my preference to not leak things unless needed. There are no code related or technical reason why both versions cannot be same, at worse we are leaking a single box there, so I think <100 bytes of memory. I just separated the code, because as per yihuaf 's investigation the issue was only occurring in test, and should not be present in normal usage. I am fine with using the same behavior in both cases, so if that feels better, let me know, I'll update accordingly. |
Another option could be setting cargo test -- --test-threads=1 Although I'm not 100% sure that this guarantees that the process will have only one thread. |
Thanks for your suggestion but I don't prefer this workaround as I think it hides the problem too much and affects other unit tests. |
This is caused by multiple threads, so it never happens in production because Youki itself always expects one thread. |
Co-authored-by: Toru Komatsu <k0ma@utam0k.jp>
@YJDoc2 Can I ask you if this failure is related to this PR? |
Hmm it appears that even after the changes here, some tests can hang
Which is why the one CI has failed. Not sure what can be done further right now 😓 |
hmm... @yihuaf May I ask you to look into this PR and CI? |
Good option. @YJDoc2 |
Hey @utam0k , I'm not sure what do you mean by this. Do you mean to split tests into two parts, one which use the clone path (such as fork, etc) and others which don't? Maybe then we can run the non-fork ones in parallel, and run the fork ones in serial. However, I feel there would still be a chance that because the way tests are executed, the runner would still have multiple threads (total), even if we run with |
@YJDoc2 Yes, That's what I want to say. I believe |
Closing this as done in #2685 |
Ref : #2144 , specifically #2144 (comment)
As per mentioned in the comment, I have added a different impl for the "main" function, that will leak the closure, and allow OS to collect memory at very end instead. For the sake of safety, I have added that function via
cfg(test)
, so the release code still gets the "proper" implementation. As mentioned in the comment, the races should every only be present in tests due to multi-threaded nature of test running, and actual real-world should not have this hang-up issue.One downside is that because of this the code that is tested v/s the one that is executed is slightly different, and I'm not entirely fine with it ; but given that our CI is showing way more time-out flakes due to these hangs, I'm willing to try this solution.
I ran the test CI ~ 5 times on my repo, and tests did not hang any time : https://github.com/YJDoc2/youki/actions?query=branch%3Atests%2Ffix-for-2144++
However, one run did fail on one test - https://github.com/YJDoc2/youki/actions/runs/7407712843/job/20154439112 ;but as it still didn't hang, and the test itself might have flaked, so not considering it in particular.