
Extend TagForReplaceMode to adapt Databricks runtime #3368

Merged: 7 commits, Sep 10, 2021

Conversation

@sperlingxx (Collaborator) commented Sep 2, 2021

Signed-off-by: sperlingxx <lovedreamf@gmail.com>

Fixes #3339

This PR fixes the failure of test_hash_groupby_collect_partial_replace_fallback on the Databricks runtime. The root cause is that Spark and the Databricks runtime use different planning strategies for aggregation. To support both, we need a more expressive configuration for hashAggReplaceMode. In addition, the extended TagForReplaceMode method lets us build tests on partial GPU replacement for Aggregate in a more precise way.

P.S. It has been manually tested on DB_301.
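
For context, a minimal sketch of what the more expressive configuration enables. This is illustrative only: the test name is made up, the config key and the 'partial|final' value syntax are assumptions based on this description, and gen_df / f / assert_cpu_and_gpu_are_equal_collect_with_capture are the helpers already used by the integration tests.

    # Hypothetical sketch: pin which aggregate stages the plugin replaces on the
    # GPU, then assert the captured plan contains the expected exec class.
    _partial_replace_conf = {'spark.rapids.sql.hashAgg.replaceMode': 'partial|final'}

    @ignore_order
    @pytest.mark.parametrize('data_gen', _init_list_no_nans, ids=idfn)
    def test_partial_replace_sketch(data_gen):
        assert_cpu_and_gpu_are_equal_collect_with_capture(
            lambda spark: gen_df(spark, data_gen, length=100)
                .groupby('a')
                .agg(f.sort_array(f.collect_list('b'))),
            exist_classes='GpuHashAggregateExec',
            conf=_partial_replace_conf)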

@sperlingxx (Collaborator Author): build

@sperlingxx (Collaborator Author): build

@sperlingxx (Collaborator Author): build

@sperlingxx sperlingxx requested a review from tgravescs September 3, 2021 09:24
@tgravescs (Collaborator): @revans2 could you review if you have time?

revans2 previously approved these changes Sep 3, 2021

@revans2 (Collaborator) left a comment:

Just a nit

@@ -615,10 +661,9 @@ def test_hash_multiple_mode_query_avg_distincts(data_gen, conf):
 @approximate_float
 @ignore_order
 @incompat
-@pytest.mark.parametrize('data_gen', _init_list_no_nans, ids=idfn)
+@pytest.mark.parametrize('data_gen', _init_list_no_nans[1:2], ids=idfn)

Collaborator:

Using a slice for the data gen is really confusing here. At a minimum we need a comment explaining why we are slicing it. Preferably it is a separate value along with the explanation.
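
If a narrowed list were ever intentional, the suggestion might look like this (a hypothetical sketch; the name is made up):

    # Narrow to a single data gen while chasing the Databricks planning
    # difference; see #3339. (Hypothetical name for illustration.)
    _single_gen_for_debug = _init_list_no_nans[1:2]

    @pytest.mark.parametrize('data_gen', _single_gen_for_debug, ids=idfn)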

Collaborator Author:

Sorry, it's my mistake. I added the slice to facilitate debugging and forgot to remove it before submission.

@sameerz added the "task" label (Work required that improves the product but is not user facing) Sep 3, 2021

@sperlingxx (Collaborator Author): build

revans2 previously approved these changes Sep 7, 2021

@revans2 (Collaborator) left a comment:

Looks good to me, but I would like to have a few more eyes look at it.

@@ -617,8 +663,7 @@ def test_hash_multiple_mode_query_avg_distincts(data_gen, conf):
 @incompat
 @pytest.mark.parametrize('data_gen', _init_list_no_nans, ids=idfn)
 @pytest.mark.parametrize('conf', get_params(_confs, params_markers_for_confs), ids=idfn)
-@pytest.mark.parametrize('parameterless', ['true', pytest.param('false', marks=pytest.mark.xfail(
-    condition=not is_before_spark_311(), reason="parameterless count not supported by default in Spark 3.1+"))])

Collaborator:

Why was this removed?

Collaborator Author:

I am sorry. This option shouldn't have been removed. I added it back.

abellina previously approved these changes Sep 8, 2021

@abellina (Collaborator) left a comment:

It seems like a good approach to be able to express the varying nature of these aggregate plans; the prior method was limiting and wrong in some cases.

I'd like to see these patterns tied to a Spark version/flavor (somehow), but that can come in a follow-up. @sperlingxx what do you think? The main reason I bring this up is so that it's clearer from the pattern which Spark we are targeting.

     # test with single Distinct
     assert_cpu_and_gpu_are_equal_collect_with_capture(
         lambda spark: gen_df(spark, data_gen, length=100)
             .groupby('a')
             .agg(f.sort_array(f.collect_list('b')),
                  f.sort_array(f.collect_set('b')),
-                 f.countDistinct('c'),
-                 f.count('c')),
+                 f.countDistinct('c')),

Collaborator:

Why did f.count get removed here?

Collaborator:

@sperlingxx I did have this issue; I am not sure why this test content had to change. That said, I know we are trying to get some tests to pass, as long as we address this either in this PR or a subsequent one linked to it.

@sperlingxx (Collaborator Author) commented Sep 10, 2021:

@abellina f.count was removed because it looks redundant: we already have the non-distinct aggregations collect_list and collect_set. The test case runs successfully with f.count removed.

Collaborator:

Ok, that's not consistent. There are other tests with that combination and with collect*. I would personally rather have more aggregates computed than fewer; it gives more chances to catch issues with the aggregation buffer/ordinal logic.

This test also has two assertions, and I distinctly recall having to remove one of them to debug the other. This is different from my previous comment, but it should really be two tests.

This is blocking other things, so I am OK doing this as a follow-up.
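
For reference, a sketch of the suggested split (hypothetical test names; _partial_replace_conf is the illustrative conf dict from the earlier sketch, and the aggregations mirror the hunk under discussion):

    # Hypothetical split: each assertion becomes its own test so either can be
    # debugged in isolation, keeping the full set of aggregates in the first.
    def test_collect_partial_replace_with_distinct(data_gen):
        assert_cpu_and_gpu_are_equal_collect_with_capture(
            lambda spark: gen_df(spark, data_gen, length=100)
                .groupby('a')
                .agg(f.sort_array(f.collect_list('b')),
                     f.sort_array(f.collect_set('b')),
                     f.countDistinct('c'),
                     f.count('c')),
            exist_classes='ObjectHashAggregateExec',
            conf=_partial_replace_conf)

    def test_collect_partial_replace_no_distinct(data_gen):
        assert_cpu_and_gpu_are_equal_collect_with_capture(
            lambda spark: gen_df(spark, data_gen, length=100)
                .groupby('a')
                .agg(f.sort_array(f.collect_list('b')),
                     f.sort_array(f.collect_set('b'))),
            exist_classes='ObjectHashAggregateExec',
            conf=_partial_replace_conf)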

@sperlingxx sperlingxx dismissed stale reviews from abellina and revans2 via 54ab703 September 10, 2021 03:09

@sperlingxx (Collaborator Author): build

@sperlingxx (Collaborator Author) commented Sep 10, 2021:

> I'd like to see these patterns tied to a Spark version/flavor (somehow), but that can come in a follow-up. @sperlingxx what do you think? The main reason I bring this up is so that it's clearer from the pattern which Spark we are targeting.

Hi @abellina, I fully agree with the idea of listing the common aggregation patterns of Spark. Perhaps we can label them as constants? I filed a follow-up issue #3437 for your idea.
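
For illustration, one way such constants might look. Names and values are hypothetical; in particular the Databricks pattern is a guess based on the planning difference described in this PR.

    # Hypothetical named patterns, one per runtime flavor, so each test states
    # which aggregate planning strategy it targets.
    _REPLACE_MODE_SPARK = 'partial|final'    # vanilla Spark: two-stage aggregate plan
    _REPLACE_MODE_DATABRICKS = 'complete'    # Databricks: may plan a single complete stage

    _db_conf = {'spark.rapids.sql.hashAgg.replaceMode': _REPLACE_MODE_DATABRICKS}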

@sperlingxx sperlingxx requested a review from abellina September 10, 2021 10:33
@tgravescs tgravescs merged commit 0f30671 into NVIDIA:branch-21.10 Sep 10, 2021
@sperlingxx sperlingxx deleted the rework_replaceMode branch December 2, 2021 02:41

Linked issue: [BUG] Databricks test fails test_hash_groupby_collect_partial_replace_fallback