Read the complete batch before returning when selectedAttributes is empty #2935

Merged: 6 commits into NVIDIA:branch-21.08 on Jul 21, 2021

Conversation

@razajafri (Collaborator) commented on Jul 15, 2021:

When selectedAttributes was empty, an optimization returned an empty batch that claimed the same number of rows. That didn't play well with how we do a row count: on the GPU the count always came back as 0, while on the CPU it came back as a different number corresponding to the number of partitions.
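A minimal PySpark repro sketch of the discrepancy (hypothetical; see #2891 for the actual report):

# Hypothetical repro: count() projects no columns, so it exercises the
# empty-selectedAttributes path described above.
df = spark.range(1000).cache()
df.count()  # GPU returned 0; CPU returned a number tied to the partition count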

There was also a bug in converting CachedBatches to ColumnarBatch/InternalRow: the code read only the first batch in a partition before moving on to the next partition.
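To illustrate the second fix, a hedged Python sketch of the iterator pattern involved (the real change is in the plugin's Scala iterator; to_rows here is a hypothetical stand-in for the batch-to-row conversion):

# Drain every cached batch in the partition, not just the first one.
def rows_from_partition(cached_batches, to_rows):
    for batch in cached_batches:   # the buggy version stopped after the first batch
        yield from to_rows(batch)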

fixes #2891

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri (Collaborator Author): build

@razajafri razajafri requested review from revans2 and jlowe July 15, 2021 04:24
integration_tests/src/main/python/cache_test.py (outdated review thread, resolved)
Comment on lines +312 to +314
# When running on the GPU with the DefaultCachedBatchSerializer, to project the results Spark adds a ColumnarToRowExec
# to be able to show the results which will cause this test to throw an exception as it's not on the GPU so we have to
# add that case to the `allowed` list. As of now there is no way for us to limit the scope of allow_non_gpu based on a
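For reference, the `allowed` list mentioned in this quoted comment is applied with the integration tests' allow_non_gpu marker; a minimal sketch of its use (test name and body hypothetical):

from marks import allow_non_gpu

# Permit ColumnarToRowExec to stay on the CPU without failing the test.
@allow_non_gpu('ColumnarToRowExec')
def test_cache_count(data_gen):
    ...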
Member:

Can you elaborate a bit on what exactly is causing the column-to-row conversion? Normally a collect alone doesn't trigger this (although a show often does, due to the string casts it applies). What type(s) are triggering this?

Collaborator Author:

This is how I understand it: when ParquetCachedBatchSerializer isn't being used, the InMemoryTableScanExec is not on the GPU. If we then set spark.sql.inMemoryColumnarStorage.enableVectorizedReader to true, the plan uses the columnar path of the serializer, which means the output of InMemoryTableScanExec is columnar and therefore needs to be converted to rows before we can collect it.
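A hedged sketch of the setting being described (what the enable_vectorized_conf used by these tests plausibly contains, based on the config name above):

# With the vectorized reader enabled, InMemoryTableScanExec emits columnar
# output, so Spark inserts a ColumnarToRowExec before rows can be collected.
enable_vectorized_conf = {
    "spark.sql.inMemoryColumnarStorage.enableVectorizedReader": "true"}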

Member:

I'm still confused here because just about every test uses collect() to grab the results, but I don't see most tests needing to exclude ColumnarToRowExec in order to perform the collect without a failure. Do we really understand this, or was it just a workaround to get the test to pass? Why don't other tests need this when they collect?

Collaborator Author:

Our tests don't cache; this is strictly for InMemoryTableScanExec. That is how I understand it, but I would love to know what @revans2 thinks about my explanation.


conf = enable_vectorized_conf.copy()
conf.update(allow_negative_scale_of_decimal_conf)
conf.update({"spark.rapids.sql.test.batchsize": "100"})
Member:

I would expect this setting to be passed in by the multi-batch test rather than forced on every test type; otherwise we're not testing the single-batch scenario.
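A hedged sketch of the suggestion (test name and parametrization hypothetical): only the multi-batch test forces a tiny batch size, leaving other tests on the default single-batch path:

@pytest.mark.parametrize('data_gen', all_gen, ids=idfn)
def test_cache_multi_batch(data_gen, enable_vectorized_conf):
    conf = enable_vectorized_conf.copy()
    conf.update({"spark.rapids.sql.test.batchsize": "100"})  # force multiple batches
    ...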

@jlowe (Member) commented Jul 15, 2021:

This PR makes multiple changes to the way we were handling the case when selectedAttributes was empty.

Can you elaborate a bit more in the PR description what the changes are? This essentially says, "stuff changed when selectedAttributes is empty" which doesn't tell me anything I didn't already know from the PR headline.

@sameerz sameerz added the bug Something isn't working label Jul 15, 2021
@sameerz sameerz added this to the July 5 - July 16 milestone Jul 15, 2021
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri (Collaborator Author): build

…bleIterator

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri (Collaborator Author): build

@razajafri razajafri marked this pull request as draft July 16, 2021 19:13
@razajafri (Collaborator Author): I have found a problem that will cause the nightly build to fail. Let me fix that before this is merged.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri (Collaborator Author): build

@razajafri (Collaborator Author): Not sure what's going on with the CI, but other PRs are also failing it. @pxLi, are you aware of this?

@pxLi (Collaborator) commented Jul 19, 2021: build

@razajafri razajafri marked this pull request as ready for review July 19, 2021 18:34
@razajafri razajafri requested a review from jlowe July 19, 2021 23:08
@@ -56,7 +56,7 @@ def test_passing_gpuExpr_as_Expr(enable_vectorized_conf):
pytest.param(DoubleGen(special_cases=double_special_cases), marks=[incompat]),
BooleanGen(), DateGen(), TimestampGen()] + decimal_gens

@pytest.mark.parametrize('data_gen', all_gen, ids=idfn)
@pytest.mark.parametrize('data_gen', [BooleanGen()], ids=idfn)
Member:

Why would we stop testing most of the data types for this test?

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri (Collaborator Author): Build

@razajafri razajafri merged commit 84a82f9 into NVIDIA:branch-21.08 Jul 21, 2021
Labels: bug (Something isn't working)

Closes: [BUG] Discrepancy in getting count before and after caching (#2891)