Extend TagForReplaceMode to adapt Databricks runtime #3368
Conversation
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
@revans2 could you review if you have time?
Just a nit
```diff
@@ -615,10 +661,9 @@ def test_hash_multiple_mode_query_avg_distincts(data_gen, conf):
 @approximate_float
 @ignore_order
 @incompat
-@pytest.mark.parametrize('data_gen', _init_list_no_nans, ids=idfn)
+@pytest.mark.parametrize('data_gen', _init_list_no_nans[1:2], ids=idfn)
```
Using a slice for the data gen is really confusing here. At a minimum we need a comment explaining why we are slicing it. Preferably it is a separate value along with the explanation.
Sorry, it's my mistake. I added the slice to facilitate debugging and forgot to remove it before submitting.
Looks good to me, but I would like to have a few more eyes look at it.
```diff
@@ -617,8 +663,7 @@ def test_hash_multiple_mode_query_avg_distincts(data_gen, conf):
 @incompat
 @pytest.mark.parametrize('data_gen', _init_list_no_nans, ids=idfn)
 @pytest.mark.parametrize('conf', get_params(_confs, params_markers_for_confs), ids=idfn)
-@pytest.mark.parametrize('parameterless', ['true', pytest.param('false', marks=pytest.mark.xfail(
-    condition=not is_before_spark_311(), reason="parameterless count not supported by default in Spark 3.1+"))])
```
why was this removed?
I am sorry, this option shouldn't have been removed. I added it back.
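For reference, here is a minimal self-contained sketch of the restored pattern: a parametrize list in which one value is conditionally marked `xfail`. The version check below is a stand-in for the plugin's real helper, not the actual implementation:

```python
import pytest

def is_before_spark_311():
    # Stand-in for the plugin's real Spark-version check; assume Spark 3.1+.
    return False

# 'true' always runs; 'false' is expected to fail on Spark 3.1+ because
# parameterless count is not supported there by default.
PARAMETERLESS_PARAMS = [
    'true',
    pytest.param('false', marks=pytest.mark.xfail(
        condition=not is_before_spark_311(),
        reason="parameterless count not supported by default in Spark 3.1+")),
]
```

A test decorated with `@pytest.mark.parametrize('parameterless', PARAMETERLESS_PARAMS)` would then run the `'false'` case but report it as an expected failure on newer Spark versions.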
It seems like a good approach to be able to express the varying nature of these aggregate plans and the prior method was limiting and wrong in some cases.
I'd like to see these patterns tied to a spark version/flavor (somehow), but that can come in a follow up. @sperlingxx what do you think? Main reason I bring this up is so that it's clearer from the pattern what Spark we are targeting.
```diff
 # test with single Distinct
 assert_cpu_and_gpu_are_equal_collect_with_capture(
     lambda spark: gen_df(spark, data_gen, length=100)
         .groupby('a')
         .agg(f.sort_array(f.collect_list('b')),
              f.sort_array(f.collect_set('b')),
-             f.countDistinct('c'),
-             f.count('c')),
+             f.countDistinct('c')),
```
why did `f.count` get removed here?
@sperlingxx I noticed this too, and I am not sure why this test content had to change. That said, I know we are trying to get some tests to pass; this is fine as long as we address it either in this PR or in a follow-up linked to this PR.
@abellina `f.count` got removed because it looked redundant: we already had the non-distinct aggregations `collect_list` and `collect_set`. The test case runs successfully with `f.count` removed.
Ok, that's not consistent. There are other tests with that combination and with collect*. I would personally rather have more aggregates computed than fewer, since it gives more chances to catch issues with the aggregation buffer/ordinal logic.
This test also has two assertions, and I distinctly recall having to remove one of them to debug the other. This is separate from my previous comment, but it should really be two tests.
This is blocking other things, I am ok doing this as a follow up.
Hi @abellina, I fully agree with the idea of listing common aggregation patterns of Spark. Perhaps we can label them as constants? I filed follow-up issue #3437 for your idea.
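As a rough sketch of what such named patterns might look like — purely an assumption about the follow-up, with constant names and mode sequences that are illustrative rather than taken from #3437:

```python
# Hypothetical constants for aggregate-mode names (illustrative only).
PARTIAL = 'partial'
PARTIAL_MERGE = 'partialMerge'
FINAL = 'final'

# A replace-mode "pattern" could then be a named sequence of modes instead of
# an ad-hoc string, making it clearer from the pattern which Spark
# flavor/version a test targets. The exact shapes below are assumptions.
SPARK_NON_DISTINCT_AGG = (PARTIAL, FINAL)
SPARK_DISTINCT_AGG = (PARTIAL, PARTIAL_MERGE, PARTIAL, FINAL)
```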
Signed-off-by: sperlingxx lovedreamf@gmail.com
Fixes #3339
This PR fixes the failure of test_hash_groupby_collect_partial_replace_fallback on the Databricks runtime. The root cause of the failure is a difference in planning strategies for aggregation between Spark and the Databricks runtime. To support both Spark and the Databricks runtime, we need a more expressive configuration for `hashAggReplaceMode`. In addition, with the extended `TagForReplaceMode` method, we are able to build tests on partial GPU replacement for Aggregate in a more precise way.

P.S. It has been manually tested on DB_301.