[BUG] test_single_sort_in_part failed in nightly UCX and AQE (no UCX) integration #2477
Comments
I'll attempt to repro/triage this one.
Marking P1 as we don't know anything about it.
I haven't been able to reproduce locally yet, with several executors running and UCX enabled. I verified I have the same cudf jar as in the failed job. I am trying to see how repeatable it is in the EGX environment.
@sameerz I still haven't been able to reproduce, and the UCX job hasn't failed since this was filed. Will keep looking at it.
Still no repro success, my summary so far:
@revans2 can you think of a dimension I haven't looked at here?
I have absolutely no idea. The only thing I can think of is that it must be a race condition somewhere and it is very very rare to lose that race.
Closing this as we have not been able to reproduce.
Opening again, since we saw another occurrence in a different CI job that runs with AQE, but not with UCX.
Made some progress on this today. First, I can reproduce a similar error by killing an executor process during the test (which may or may not be the root cause). Second, in the test that failed last (the AQE log) I clearly see this behavior:
I am pretty convinced this is the reason. When an executor dies and likely comes back between the CPU and the GPU session, we are executing each sub-test against different clusters. In terms of the function we are testing One could "fix" this by setting
I confirmed that killing an executor changes the number of partitions in the tests, and in this case causes the sort to not match. I am adding a The task isn't done, but we should monitor executors failing via #2698, and focus on the tests that actually cause the failure to begin with.
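To illustrate why a change in partition count breaks the comparison, here is a minimal plain-Python sketch. It is not the actual test code: `split_round_robin` and `sort_in_part` are hypothetical stand-ins for Spark's partitioning and a per-partition sort, and the partition counts 4 and 3 are made up to mimic losing one executor between the CPU and GPU sessions.

```python
# Illustrative only: a plain-Python stand-in for a per-partition sort.
# The helper names and the round-robin partitioner are assumptions,
# not part of the real integration tests.

def split_round_robin(rows, num_partitions):
    """Distribute rows across partitions round-robin."""
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts


def sort_in_part(rows, num_partitions):
    """Sort each partition independently, then concatenate in partition order."""
    out = []
    for part in split_round_robin(rows, num_partitions):
        out.extend(sorted(part))
    return out


rows = [5, 3, 8, 1, 9, 2, 7, 4]

cpu_result = sort_in_part(rows, 4)  # e.g. all executors alive
gpu_result = sort_in_part(rows, 3)  # e.g. one executor lost in between

# Same rows overall, but the ordered output differs, so an
# order-sensitive CPU vs. GPU comparison fails.
print(cpu_result)  # [5, 9, 2, 3, 7, 8, 1, 4]
print(gpu_result)  # [1, 5, 7, 3, 4, 9, 2, 8]
```

Both runs are "correct" for their own partitioning; the mismatch comes only from comparing results produced with different partition counts.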
Linking the Spark issue that we see as the culprit in Spark 3.0.2: apache/spark#31540
Closing this, as I think we handled this test. There could be other failures in CI that task retries mask, but we've disabled task retry for Spark 3.1.1+, so we should be able to open a new issue if we see this again.
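The comment above does not say which setting was used to disable task retry. As a hedged sketch, one standard Spark property that governs retries is `spark.task.maxFailures`: the number of allowed retries is that value minus one, so a value of 1 means a failed task is not retried. The helper `to_submit_args` below is hypothetical and only shows how such a property could be turned into `spark-submit` style `--conf` arguments.

```python
# Sketch under assumptions: spark.task.maxFailures is a real Spark property,
# but whether the CI jobs used exactly this setting is not stated in the thread.

def to_submit_args(conf):
    """Turn a dict of Spark properties into spark-submit style --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# A value of 1 allows zero retries, so the first task failure fails the job
# instead of being masked by a successful retry.
no_retry_conf = {"spark.task.maxFailures": "1"}

submit_args = to_submit_args(no_retry_conf)
print(submit_args)  # ['--conf', 'spark.task.maxFailures=1']
```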
The test_single_sort_in_part test failed with a comparison failure in a recent UCX EGX standalone integration test: