
Add parallel support to nightly spark standalone tests #3264

Merged · 3 commits · Aug 24, 2021

Conversation

@pxLi (Collaborator) commented Aug 20, 2021

Signed-off-by: Peixin Li pxli@nyu.edu

Fixes #2802

To speed up our nightly Spark standalone integration tests:
Spark 3.0.x total time: ~3h 40m → ~1h 15m
Spark 3.1.x total time: ~4h → ~1h 35m (includes the extra ParquetCachedBatchSerializer cache_test)

I am still verifying other scenarios; submitting this first to collect feedback, thanks!

This needs to be enabled separately in the nightly pipelines' Jenkinsfile.

Signed-off-by: Peixin Li <pxli@nyu.edu>
@pxLi pxLi added the test Only impacts tests label Aug 20, 2021
@pxLi pxLi requested a review from GaryShen2008 as a code owner August 20, 2021 09:54
@pxLi pxLi marked this pull request as draft August 20, 2021 09:54
# integration tests
if [[ $PARALLEL_TEST == "true" ]] && [ -x "$(command -v parallel)" ]; then
# put the most time-consuming tests at the head of the queue
time_consuming_tests="join_test.py generate_expr_test.py"
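For context, here is a minimal sketch of how a task queue could be ordered so the slow files are dispatched first; join_test.py and generate_expr_test.py come from the PR, but the other file names and the queue-building loop are hypothetical stand-ins, not the PR's actual script body:

```shell
# Sketch only: move the known-slow test files to the head of the task queue.
# join_test.py and generate_expr_test.py are from the PR; the rest are stand-ins.
time_consuming_tests="join_test.py generate_expr_test.py"
all_tests="map_test.py join_test.py csv_test.py generate_expr_test.py"

queue="$time_consuming_tests"
for t in $all_tests; do
    case " $time_consuming_tests " in
        *" $t "*) ;;              # already at the head of the queue
        *) queue="$queue $t" ;;   # everything else keeps its original order
    esac
done
echo "$queue"
```

A queue ordered this way can then be handed to GNU parallel, so the longest-running files start early instead of landing near the end of the run and stretching the total wall-clock time.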
@jlowe (Member) commented Aug 20, 2021

I thought we were tagging the source to indicate which tests are slow, as in #3241. Curious why this isn't leveraging that, e.g.: time_consuming_tests=$(grep -rl pytest.mark.slow_test "$SCRIPT_PATH"/src/main/python)
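The tag-based discovery in that one-liner can be exercised in isolation; this is an illustrative demo with a throwaway temporary directory and made-up file contents, not code from the repository:

```shell
# Demo of discovering marker-tagged test files with grep -rl.
# The directory and file contents below are hypothetical.
tmpdir=$(mktemp -d)
printf '@pytest.mark.slow_test\ndef test_big_join(): pass\n' > "$tmpdir/join_test.py"
printf 'def test_quick(): pass\n' > "$tmpdir/map_test.py"

# -r: recurse into the directory, -l: print only names of matching files
slow=$(grep -rl pytest.mark.slow_test "$tmpdir")
echo "$slow"
```

Only files containing the marker are listed, so the result can be fed straight into a variable like time_consuming_tests.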

@pxLi (Collaborator, Author) commented Aug 23, 2021

The slow_test tag there was for the pre-merge parallel split (covering tests that are either time-consuming or high in memory usage), mostly to balance the test time of the two parallel test stages. The tag naming looks ambiguous; I talked to Alex, and he will help rename it.
For the nightly cases I want to pick the time-consuming tests only. Per my measurements, join_test (~3000s) and generate_expr_test (~1800s) were the only two files that consistently took over 15 minutes, so I listed them manually to avoid them landing randomly in the middle or at the tail of the task queue.

@jlowe (Member) commented

Should we split the test cases in join_test and generate_expr_test so they spread more evenly even without using the parallel hack?

@pxLi (Collaborator, Author) commented

Yes, splitting at the test-case level would be better. I tested a few rough case-level splits but did not get much benefit from them, and this could make it harder for developers to manage the test scenarios. I would like to revisit this if the current setup does not meet our efficiency requirements.

@pxLi pxLi marked this pull request as ready for review August 23, 2021 06:29
@pxLi (Collaborator, Author) commented Aug 23, 2021

build

@pxLi pxLi changed the title [REVIEW] Add parallel support to nightly spark standalone tests Add parallel support to nightly spark standalone tests Aug 23, 2021
@jlowe (Member) left a comment

Looks OK to me other than we may want a followup to split some of these expensive test cases to make it easier for pytest to make better decisions about running them in parallel on its own.

@pxLi (Collaborator, Author) commented Aug 24, 2021

> Looks OK to me other than we may want a followup to split some of these expensive test cases to make it easier for pytest to make better decisions about running them in parallel on its own.

Thanks! I filed an issue #3279 to track the followup

Labels: test (Only impacts tests)
Development

Successfully merging this pull request may close these issues.

Update jenkins/start-tests.sh to use integration_tests/run_pyspark_from_build.sh