[BUG] executors shutdown intermittently during integrations test parallel run #5979
Comments
There were no useful logs on the executor side. From the worker log:
Executor app-20220710122231-0000/3 finished with state EXITED message Command exited with code 134 exitStatus 134
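For reference, exit code 134 in the worker log follows the usual shell convention of 128 plus the signal number, so it corresponds to signal 6 (SIGABRT on Linux). A minimal sketch of decoding such a status; the helper name `describe_exit_status` is made up for illustration and is not part of Spark or spark-rapids:

```python
import signal


def describe_exit_status(status: int) -> str:
    """Translate a shell-style exit status into a readable description.

    Assumes the common convention that a status above 128 means the
    process was terminated by signal number (status - 128).
    """
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number is not defined on this platform.
            name = f"unknown signal {signum}"
        return f"killed by {name} (signal {signum})"
    return f"exited with code {status}"


if __name__ == "__main__":
    # 134 is the status reported for the failed executor.
    print(describe_exit_status(134))
```

A JVM that hits a native crash normally writes an hs_err_pid file and then aborts, so SIGABRT here is consistent with a crash in native code rather than an ordinary Java exception.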
As it fails intermittently, I guess there could be a memory leak. We also saw this kind of failure in the ub16 test pipeline and the jdk11 test pipelines; I am not sure whether all of these failures are related.
Coredump logs were collected from the UCX nightly test (jdk8), the ubuntu16 nightly test (jdk8), and the jdk11 nightly test.
Since last Friday, we have found more pipelines failing for the same reason when pytest runs in parallel mode (xdist).
It seems this is not related to commit #5955.
Anyway, triggered a build after reverting this commit on
@rwlee could this be related in any way to rapidsai/cudf#11153?
The hs_err_pid files are quite consistent, always showing a segfault in libcuda.so.1 after
Most of the time it's
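To check how consistent a batch of hs_err_pid files is, one option is to tally the library named in each report's "Problematic frame" section. This is only a sketch: it assumes the standard HotSpot report layout (a `# Problematic frame:` line followed by the frame on the next line) and the default `hs_err_pid<pid>.log` file naming; the function names and the directory passed in are hypothetical.

```python
import re
from collections import Counter
from pathlib import Path

# Matches the line that follows "# Problematic frame:" in a HotSpot crash report.
_FRAME_RE = re.compile(r"^# Problematic frame:\s*\n#\s*(.+)$", re.MULTILINE)
# Matches a native frame such as "[libcuda.so.1+0x1e3a5c]" and captures the library.
_LIB_RE = re.compile(r"\[([^\]+]+)\+0x[0-9a-fA-F]+\]")


def problematic_library(report_text):
    """Return the library (or raw frame text) blamed by one crash report.

    Returns None when the report has no "Problematic frame" section.
    """
    match = _FRAME_RE.search(report_text)
    if match is None:
        return None
    frame = match.group(1).strip()
    lib = _LIB_RE.search(frame)
    # Fall back to the whole frame line for non-native (e.g. compiled Java) frames.
    return lib.group(1) if lib else frame


def tally_crash_sites(directory):
    """Count crash sites across all hs_err_pid*.log files in a directory."""
    counts = Counter()
    for path in sorted(Path(directory).glob("hs_err_pid*.log")):
        lib = problematic_library(path.read_text(errors="replace"))
        counts[lib or "<no problematic frame found>"] += 1
    return counts
```

Calling `tally_crash_sites` on a directory of collected reports gives a quick view of whether every crash points at the same library, which is the pattern described above for libcuda.so.1.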
Hmm. All of these failures are on map lookup. I wonder if there's a problem in
Edit: These might be unrelated to the crash. The code under test isn't actually looking up the contents of the map column.
None of the integration tests fail on my machine, even after multiple runs. They do however fail for @revans2, and he was able to localize the failure to a single integration test,
We were able to generate a core file from one of the crashes. This appears to be a bug in libcudf that has been there a long time, but I cannot readily explain why it has only started failing recently. See rapidsai/cudf#11248.
Deployed a new spark-rapids-jni with the fix rapidsai/cudf#11254. Most CI tests should pass as expected now; I will keep monitoring all other pipelines for a few days.
Describe the bug
The above cases have been failing intermittently since last Friday in multiple pipelines.
Executors got SIGABRT during pytest runs. Detailed pytest logs,