Read the complete batch before returning when selectedAttributes is empty #2935
Conversation
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
build
```python
# When running on the GPU with the DefaultCachedBatchSerializer, to project the results Spark adds a ColumnarToRowExec
# to be able to show the results, which will cause this test to throw an exception as it's not on the GPU, so we have to
# add that case to the `allowed` list. As of now there is no way for us to limit the scope of allow_non_gpu based on a
```
Can you elaborate a bit on what exactly is causing the column-to-row conversion? Normally a `collect` alone doesn't trigger this (although a `show` often does, due to the string casts it applies). What type(s) are triggering this?
So this is how I understand it: when `ParquetCachedBatchSerializer` isn't being used, the `InMemoryTableScanExec` is not on the GPU. If we set `spark.sql.inMemoryColumnarStorage.enableVectorizedReader` to `true`, the plan uses the columnar option of the serializer, which means the output from the `InMemoryTableScanExec` is columnar and therefore needs to be converted to rows before we can collect it.
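To make the scenario concrete, here is a minimal sketch of the test pattern being discussed, assuming the integration-test helpers used in this repo (`allow_non_gpu`, `unary_op_df`, `assert_gpu_and_cpu_are_equal_collect`); the test name, body, and exact import paths are illustrative, not the PR's actual code:

```python
from asserts import assert_gpu_and_cpu_are_equal_collect  # assumed helper module
from data_gen import BooleanGen, unary_op_df              # assumed helper module
from marks import allow_non_gpu                           # assumed helper module

# With the vectorized in-memory reader enabled, a CPU InMemoryTableScanExec
# produces columnar output, so Spark inserts a ColumnarToRowExec before the
# collect; that CPU-only node has to be allowed explicitly.
enable_vectorized_conf = {
    "spark.sql.inMemoryColumnarStorage.enableVectorizedReader": "true"}

@allow_non_gpu("ColumnarToRowExec")
def test_cache_collect_vectorized():
    def cached_df(spark):
        df = unary_op_df(spark, BooleanGen()).cache()
        df.count()  # materialize the cache
        return df
    assert_gpu_and_cpu_are_equal_collect(cached_df, conf=enable_vectorized_conf)
```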
I'm still confused here because just about every test uses `collect()` to grab the results, but I don't see most tests needing to exclude `ColumnarToRowExec` in order to perform the collect without a failure. Do we really understand this, or was it just a workaround to get the test to pass? Why don't other tests need this when they collect?
Our tests don't cache; this is strictly for `InMemoryTableScanExec`. That is how I understand it. I would love to know what @revans2 thinks about my explanation.
```python
conf = enable_vectorized_conf.copy()
conf.update(allow_negative_scale_of_decimal_conf)
conf.update({"spark.rapids.sql.test.batchsize": "100"})
```
I would expect this setting to be passed in by the multi-batch test rather than forced on every test type; otherwise we're not testing the single-batch scenario.
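A sketch of that suggestion, building on the confs shown in the diff above (`enable_vectorized_conf` and `allow_negative_scale_of_decimal_conf` come from the test file; the split into two confs is the hypothetical part):

```python
# Shared conf: leave the default batch size alone so the ordinary tests
# still exercise the single-batch scenario.
base_conf = enable_vectorized_conf.copy()
base_conf.update(allow_negative_scale_of_decimal_conf)

# Multi-batch conf: only the test that targets the multi-batch path forces
# tiny batches, so the cached data is guaranteed to span more than one batch.
multi_batch_conf = base_conf.copy()
multi_batch_conf.update({"spark.rapids.sql.test.batchsize": "100"})
```

That keeps the tiny batch size scoped to the test that actually needs it.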
Can you elaborate a bit more in the PR description on what the changes are? It essentially says "stuff changed when `selectedAttributes` is empty," which doesn't tell me anything I didn't already know from the PR headline.
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
build
…bleIterator Signed-off-by: Raza Jafri <rjafri@nvidia.com>
build
I have found a problem that will cause the nightly build to fail. Let me fix that before this is merged.
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
build
Not sure what's going on with the CI, but other PRs are also failing it. @pxLi, are you aware of this?
build
```diff
@@ -56,7 +56,7 @@ def test_passing_gpuExpr_as_Expr(enable_vectorized_conf):
     pytest.param(DoubleGen(special_cases=double_special_cases), marks=[incompat]),
     BooleanGen(), DateGen(), TimestampGen()] + decimal_gens

-@pytest.mark.parametrize('data_gen', all_gen, ids=idfn)
+@pytest.mark.parametrize('data_gen', [BooleanGen()], ids=idfn)
```
Why would we stop testing most of the data types for this test?
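If the narrowing was a debugging leftover, restoring the original coverage would mean putting back the `all_gen` parametrization removed in the diff above; a hedged sketch (the test name and body are hypothetical, and the import paths are assumptions):

```python
import pytest
from asserts import assert_gpu_and_cpu_are_equal_collect  # assumed helper module
from data_gen import idfn, unary_op_df                    # assumed helper module
# all_gen is the list defined earlier in this test file (see the diff above)

@pytest.mark.parametrize('data_gen', all_gen, ids=idfn)  # restore full type coverage
def test_cached_collect(data_gen, enable_vectorized_conf):
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark: unary_op_df(spark, data_gen).cache().selectExpr("a"),
        conf=enable_vectorized_conf)
```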
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Build
When `selectedAttributes` was empty, there was an optimization that would return an empty batch with the same number of rows, but that wasn't playing well with how we do the row count: in the GPU case the count was always returned as 0, while in the CPU case it returned a different number corresponding to the number of partitions.

There was also a bug in converting `CachedBatch`es to Columnar/InternalRow: the code was only reading the first batch before moving on to the next partition.

fixes #2891
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
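For intuition, here is a small self-contained Python sketch of the second bug and its fix. The real code is the Scala `ParquetCachedBatchSerializer`, so this only models the shape of the problem (partitions as lists of batches, batches as lists of rows):

```python
# Each partition caches its data as a sequence of batches.
parts = [[["r1", "r2"], ["r3"]], [["r4"]]]

def rows_buggy(partitions):
    for batches in partitions:
        yield from batches[0]  # bug: only the first batch of each partition is read

def rows_fixed(partitions):
    for batches in partitions:
        for batch in batches:  # fix: drain every batch before moving on
            yield from batch

assert list(rows_buggy(parts)) == ["r1", "r2", "r4"]        # "r3" silently dropped
assert list(rows_fixed(parts)) == ["r1", "r2", "r3", "r4"]  # complete results
```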