Add parallel support to nightly spark standalone tests #3264
Conversation
Signed-off-by: Peixin Li <pxli@nyu.edu>
jenkins/spark-tests.sh
# integration tests
if [[ $PARALLEL_TEST == "true" ]] && [ -x "$(command -v parallel)" ]; then
    # put most time-consuming tests at the head of queue
    time_consuming_tests="join_test.py generate_expr_test.py"
I thought we were tagging the source to indicate they were slow as in #3241. Curious why this isn't leveraging that, e.g.: time_consuming_tests=$(grep -rl pytest.mark.slow_test "$SCRIPT_PATH"/src/main/python)
The slow_tag was there for the pre-merge parallel split (covering both time-consuming and high-memory-usage tests), mostly to balance the test time of the two parallel pre-merge stages. The tag naming does look ambiguous; I talked to Alex and he will help rename it.
For the nightly case I want to pick the time-consuming tests only, and per my measurements join_test (~3000s) and generate_expr_test (~1800s) were the only two tests that consistently ran over 15 minutes, so I listed them manually to avoid them landing randomly in the middle or tail of the task queue.
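To illustrate the idea, here is a minimal sketch (not the exact script in this PR) of feeding the long-running files to GNU parallel first; the test directory, job count, and pytest invocation below are assumptions for illustration:

#!/bin/bash
# Sketch only: run each integration test file in its own pytest process via
# GNU parallel, with the known long-running files placed first in the queue.
time_consuming_tests="join_test.py generate_expr_test.py"
# Assumed test location; the real script derives this from $SCRIPT_PATH.
test_dir="./integration_tests/src/main/python"
other_tests=$(cd "$test_dir" && ls -- *_test.py | grep -vE 'join_test|generate_expr_test')

# GNU parallel consumes its input in order, so the slow files grab job slots
# first and the shorter files backfill behind them; --halt stops all jobs on
# the first failure.
echo $time_consuming_tests $other_tests | tr ' ' '\n' | \
    parallel -j4 --halt now,fail=1 "python -m pytest $test_dir/{}"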
Should we split the test cases in join_test and generate_expr_test so they spread more evenly even without using the parallel hack?
Yes, splitting at the test-case level would be better. I tested a few rough case-level splits but did not get much benefit from them, and they could make it harder for developers to manage the test scenarios. I would like to come back to this if the current setup does not meet our efficiency requirement.
build
Looks OK to me other than we may want a followup to split some of these expensive test cases to make it easier for pytest to make better decisions about running them in parallel on its own.
Thanks! I filed issue #3279 to track the followup.
Signed-off-by: Peixin Li <pxli@nyu.edu>
fix #2802
To speed up our nightly Spark standalone integration tests:
spark 3.0.x total time: ~3h 40m down to ~1h 15m
spark 3.1.x total time: ~4h down to ~1h 35m (includes the extra ParquetCachedBatchSerializer cache_test)
I am still doing more verification on other scenarios; submitting this first to collect feedback, thanks!
This needs to be enabled separately in the nightly pipelines' Jenkinsfile.
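As a hedged example (not the actual Jenkinsfile change), the parallel path in jenkins/spark-tests.sh is gated on PARALLEL_TEST as shown in the diff above, so enabling it in a pipeline would look roughly like:

# Assumed invocation: export the flag before calling the test script so the
# GNU parallel branch is taken; the real pipeline sets this in its Jenkinsfile.
export PARALLEL_TEST=true
./jenkins/spark-tests.sh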