Read the complete batch before returning when selectedAttributes is empty #2935

Merged: 6 commits into NVIDIA:branch-21.08 on Jul 21, 2021

Conversation

@razajafri (Collaborator) commented on Jul 15, 2021:

When selectedAttributes was empty, an optimization returned an empty batch that claimed the same number of rows. That didn't play well with how we do a row count: on the GPU the count always came back as 0, while on the CPU it came back as a different number corresponding to the number of partitions.
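A minimal PySpark repro sketch of the discrepancy (hypothetical; see #2891 for the actual report):

# Hypothetical repro: count() projects no columns, so it exercises the
# empty-selectedAttributes path described above.
df = spark.range(1000).cache()
df.count()  # GPU returned 0; CPU returned a number tied to the partition count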

There was also a bug in converting CachedBatches to ColumnarBatch/InternalRow: the code read only the first batch in a partition before moving on to the next partition.
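To illustrate the second fix, a hedged Python sketch of the iterator pattern involved (the real change is in the plugin's Scala iterator; to_rows here is a hypothetical stand-in for the batch-to-row conversion):

# Drain every cached batch in the partition, not just the first one.
def rows_from_partition(cached_batches, to_rows):
    for batch in cached_batches:   # the buggy version stopped after the first batch
        yield from to_rows(batch)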

fixes #2891

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri (Collaborator Author): build

@razajafri razajafri requested review from revans2 and jlowe July 15, 2021 04:24
integration_tests/src/main/python/cache_test.py (outdated review thread, resolved)
Comment on lines +312 to +314
# When running on the GPU with the DefaultCachedBatchSerializer, to project the results Spark adds a ColumnarToRowExec
# to be able to show the results which will cause this test to throw an exception as it's not on the GPU so we have to
# add that case to the `allowed` list. As of now there is no way for us to limit the scope of allow_non_gpu based on a
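For reference, the `allowed` list mentioned in this quoted comment is applied with the integration tests' allow_non_gpu marker; a minimal sketch of its use (test name and body hypothetical):

from marks import allow_non_gpu

# Permit ColumnarToRowExec to stay on the CPU without failing the test.
@allow_non_gpu('ColumnarToRowExec')
def test_cache_count(data_gen):
    ...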
Member:

Can you elaborate a bit on what exactly is causing the column-to-row conversion? Normally a collect alone doesn't trigger this (although a show often does, due to the string casts it applies). What type(s) are triggering this?

Collaborator Author:

This is how I understand it: when ParquetCachedBatchSerializer isn't being used, the InMemoryTableScanExec is not on the GPU. If we then set spark.sql.inMemoryColumnarStorage.enableVectorizedReader to true, the plan uses the columnar path of the serializer, which means the output of InMemoryTableScanExec is columnar and therefore needs to be converted to rows before we can collect it.
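A hedged sketch of the setting being described (what the enable_vectorized_conf used by these tests plausibly contains, based on the config name above):

# With the vectorized reader enabled, InMemoryTableScanExec emits columnar
# output, so Spark inserts a ColumnarToRowExec before rows can be collected.
enable_vectorized_conf = {
    "spark.sql.inMemoryColumnarStorage.enableVectorizedReader": "true"}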

Member:

I'm still confused here because just about every test uses collect() to grab the results, but I don't see most tests needing to exclude ColumnarToRowExec in order to perform the collect without a failure. Do we really understand this, or was it just a workaround to get the test to pass? Why don't other tests need this when they collect?

Collaborator Author:

Our tests don't cache; this is strictly for InMemoryTableScanExec. That is how I understand it, but I would love to know what @revans2 thinks about my explanation.


conf = enable_vectorized_conf.copy()
conf.update(allow_negative_scale_of_decimal_conf)
conf.update({"spark.rapids.sql.test.batchsize": "100"})
Member:

I would expect this setting to be passed in by the multi-batch test rather than forced on every test type; otherwise we're not testing the single-batch scenario.
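A hedged sketch of the suggestion (test name and parametrization hypothetical): only the multi-batch test forces a tiny batch size, leaving other tests on the default single-batch path:

@pytest.mark.parametrize('data_gen', all_gen, ids=idfn)
def test_cache_multi_batch(data_gen, enable_vectorized_conf):
    conf = enable_vectorized_conf.copy()
    conf.update({"spark.rapids.sql.test.batchsize": "100"})  # force multiple batches
    ...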

@jlowe (Member) commented Jul 15, 2021:

This PR makes multiple changes to the way we were handling the case when selectedAttributes was empty.

Can you elaborate a bit more in the PR description what the changes are? This essentially says, "stuff changed when selectedAttributes is empty" which doesn't tell me anything I didn't already know from the PR headline.

@sameerz sameerz added the bug Something isn't working label Jul 15, 2021
@sameerz sameerz added this to the July 5 - July 16 milestone Jul 15, 2021
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri (Collaborator Author): build

…bleIterator

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri (Collaborator Author): build

@razajafri razajafri marked this pull request as draft July 16, 2021 19:13
@razajafri (Collaborator Author): I have found a problem that will cause the nightly build to fail. Let me fix that before this is merged.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri (Collaborator Author): build

@razajafri (Collaborator Author): Not sure what's going on with the CI, but other PRs are also failing it. @pxLi, are you aware of this?

@pxLi (Collaborator) commented Jul 19, 2021: build

@razajafri razajafri marked this pull request as ready for review July 19, 2021 18:34
@razajafri razajafri requested a review from jlowe July 19, 2021 23:08
@@ -56,7 +56,7 @@ def test_passing_gpuExpr_as_Expr(enable_vectorized_conf):
pytest.param(DoubleGen(special_cases=double_special_cases), marks=[incompat]),
BooleanGen(), DateGen(), TimestampGen()] + decimal_gens

@pytest.mark.parametrize('data_gen', all_gen, ids=idfn)
@pytest.mark.parametrize('data_gen', [BooleanGen()], ids=idfn)
Member:

Why would we stop testing most of the data types for this test?

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri (Collaborator Author): Build

@razajafri razajafri merged commit 84a82f9 into NVIDIA:branch-21.08 Jul 21, 2021
Labels: bug (Something isn't working)

Closes: [BUG] Discrepancy in getting count before and after caching (#2891)