Added ParquetCachedBatchSerializer support for Databricks #2880

Merged
10 commits merged into NVIDIA:branch-21.08 from db_cache on Jul 12, 2021

Conversation

@razajafri (Collaborator) commented Jul 7, 2021

This PR adds ParquetCachedBatchSerializer support for Databricks and adds nightly tests.

fixes #2856
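
For context on how this is used: the cache serializer is enabled through Spark's spark.sql.cache.serializer static conf, pointing at the shim class from the cache-serializer docs. Below is a minimal, spark-shell style sketch rather than the exact setup; on Databricks the conf would normally go into the cluster's Spark config since it is a static conf, and the path is only an example.

import org.apache.spark.sql.SparkSession

// Sketch only: spark.sql.cache.serializer must be set before the first
// SparkSession starts (e.g. via --conf or the cluster's Spark config).
val spark = SparkSession.builder()
  .appName("pcbs-example")
  .config("spark.sql.cache.serializer",
    "com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer")
  .getOrCreate()

// With the serializer enabled, cached data is materialized through PCBS and
// read back by (Gpu)InMemoryTableScan.
spark.read.parquet("/tmp/testparquet").cache().count()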

razajafri added 2 commits July 6, 2021 16:52
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri (Collaborator Author)

@NvTimLiu can you take a particular look at run-tests.py?

tgravescs previously approved these changes Jul 7, 2021
Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri (Collaborator Author)

build

@viadea (Collaborator) commented Jul 7, 2021

@razajafri As per my tests with the fix jar, GpuInMemoryTableScan is in place.
However, when comparing the Databricks plan with the Standalone plan, there is an extra GpuColumnarToRow right before the InMemoryRelation.

For example,

  1. Databricks plan:
== Physical Plan ==
GpuColumnarToRow false
+- GpuHashAggregate(keys=[], functions=[gpucount(distinct _gen_alias_217#217)]), filters=List(None))
   +- GpuShuffleCoalesce 2147483647
      +- GpuColumnarExchange gpusinglepartitioning$(), ENSURE_REQUIREMENTS, [id=#963]
         +- GpuHashAggregate(keys=[], functions=[partial_gpucount(distinct _gen_alias_217#217)]), filters=List(None))
            +- GpuHashAggregate(keys=[_gen_alias_217#217], functions=[]), filters=List())
               +- GpuShuffleCoalesce 2147483647
                  +- GpuColumnarExchange gpuhashpartitioning(_gen_alias_217#217, 200), ENSURE_REQUIREMENTS, [id=#952]
                     +- GpuHashAggregate(keys=[_gen_alias_217#217], functions=[]), filters=List())
                        +- GpuProject [col#152.name.firstname AS _gen_alias_217#217]
                           +- GpuInMemoryTableScan [col#152]
                                 +- InMemoryRelation [col#152], StorageLevel(disk, memory, deserialized, 1 replicas)
                                       +- GpuColumnarToRow false
                                          +- GpuProject [named_struct(name, name#57, newname, named_struct(firstname, name#57.firstname, lastname, name#57.lastname)) AS col#152]
                                             +- GpuFileGpuScan parquet [name#57] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/testparquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<name:struct<firstname:string,middlename:string,lastname:string>
  2. Standalone cluster's plan:
== Physical Plan ==
GpuColumnarToRowTransition false
+- GpuHashAggregate(keys=[], functions=[gpucount(distinct _gen_alias_117#117)]), filters=List(None))
   +- GpuShuffleCoalesce 2147483647
      +- GpuColumnarExchange gpusinglepartitioning$(), ENSURE_REQUIREMENTS, [id=#266]
         +- GpuHashAggregate(keys=[], functions=[partial_gpucount(distinct _gen_alias_117#117)]), filters=List(None))
            +- GpuHashAggregate(keys=[_gen_alias_117#117], functions=[]), filters=List())
               +- GpuShuffleCoalesce 2147483647
                  +- GpuColumnarExchange gpuhashpartitioning(_gen_alias_117#117, 200), ENSURE_REQUIREMENTS, [id=#255]
                     +- GpuHashAggregate(keys=[_gen_alias_117#117], functions=[]), filters=List())
                        +- GpuProject [col#62.name.firstname AS _gen_alias_117#117]
                           +- GpuInMemoryTableScan [col#62]
                                 +- InMemoryRelation [col#62], StorageLevel(disk, memory, deserialized, 1 replicas)
                                       +- GpuProject [named_struct(name, name#16, newname, named_struct(firstname, name#16.firstname, lastname, name#16.lastname)) AS col#62]
                                          +- GpuFileGpuScan parquet [name#16] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/testparquet], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<name:struct<firstname:string,middlename:string,lastname:string>>

Should we worry about that extra GpuColumnarToRow?

@razajafri (Collaborator Author)

@razajafri As per my tests with the fix jar, GpuInMemoryTableScan is in place. However, when comparing the Databricks plan with the Standalone plan, there is an extra GpuColumnarToRow right before the InMemoryRelation. [...] Should we worry about that extra GpuColumnarToRow?

@viadea This is definitely something that I will look into. Can we file an issue for it?

There is a mismatch in the number of params the Databricks test script takes
when it is run from the Databricks nightly test job versus the nightly build job.

The test job takes 1 to 3 params, while the build job requires only 1 to 2 params.
This makes it hard to match up the params between the test job and the build job.

To fix the issue, we explicitly export vars for the test script instead of passing shell params.
This approach also makes it easier to extend the vars/params for the test script.

Signed-off-by: Tim Liu <timl@nvidia.com>
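
The change described in the commit above boils down to reading named environment variables instead of positional shell params. A rough sketch of the idea follows, written in Scala to match the other snippets here even though the real script is Python; the variable names are hypothetical, not the job's actual ones.

object TestJobEnv {
  // Read what the test script needs by name, with defaults, rather than relying
  // on $1/$2/$3 positions that differ between the test and build jobs.
  private def envOr(name: String, default: String): String =
    sys.env.getOrElse(name, default)

  val sparkVersion: String  = envOr("BASE_SPARK_VERSION", "3.1.1")  // hypothetical name
  val extraProfiles: String = envOr("EXTRA_PROFILES", "")           // hypothetical name
}
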
@razajafri (Collaborator Author) commented Jul 8, 2021

LGTM. I can't approve because I am the author; @tgravescs will have to do the honors if he thinks this is good.

@tgravescs are we intentionally not calling super.getExecs at the start of getExecs in this class? If we call super.getExecs we will get all the execs from 311 and the InMemoryTableScanExec change will not be necessary.

@tgravescs (Collaborator) commented Jul 8, 2021

Yes, it was intentional. I'd have to go back and verify, but I don't think there was any overlap; I think Databricks overrides everything with its own version, so it didn't do any good.

@jlowe (Member) commented Jul 8, 2021

I think Databricks overrides everything with its own version, so it didn't do any good.

Even if Databricks overrode everything, wouldn't we still want to try to pick up the base version, in case a new shim for an exec appears that we can just reuse in the future?

@tgravescs (Collaborator)

Sure, then I think we can remove the spark311db version of ParquetCachedBatchSerializer altogether.

@razajafri (Collaborator Author)

Sure, then I think we can remove the spark311db version of ParquetCachedBatchSerializer altogether.

OK, I will push another update to this PR where we get everything from the parent, just as in all the other shims.
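
Roughly, the pattern being adopted looks like the sketch below. The parent class name and the exact rule-map types are assumptions about the shim API, not the precise code:

import com.nvidia.spark.rapids.ExecRule                      // assumed plugin type
import com.nvidia.spark.rapids.shims.spark311.Spark311Shims  // assumed parent shim class
import org.apache.spark.sql.execution.SparkPlan

class Spark311dbShims extends Spark311Shims {
  // Inherit every exec rule from the common 3.1.1 shim and only layer the
  // Databricks-specific overrides on top, instead of re-listing them all.
  override def getExecs: Map[Class[_ <: SparkPlan], ExecRule[_ <: SparkPlan]] =
    super.getExecs ++ Map(
      // Databricks-only exec rules would go here
    )
}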

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri (Collaborator Author)

build

@razajafri requested a review from tgravescs July 9, 2021 16:07
jlowe previously approved these changes Jul 9, 2021

@jlowe (Member) left a comment

Looks OK to me, assuming this was tested on Databricks.

@sameerz added the feature request label Jul 9, 2021

import com.nvidia.spark.rapids.shims

class ParquetCachedBatchSerializer extends shims.spark311.ParquetCachedBatchSerializer {
Collaborator

This file shouldn't be needed now, right? The 311 shim version of this works.
Or is the intention to keep the db one so that the user specifies this one? https://nvidia.github.io/spark-rapids/docs/additional-functionality/cache-serializer.html. If that is the case, we need to update the docs.

Originally the intention was that, since the user has to specify it, it should match the version of Spark they are using so that it's hopefully least confusing. I don't know how much it matters whether they specify spark311 vs spark311db.

Collaborator

If it's possible, it would be nice to have one class that just loads the proper shim version, but that is a separate issue.

Collaborator Author

This file shouldn't be needed now, right? The 311 shim version of this works.
Or is the intention to keep the db one so that the user specifies this one? https://nvidia.github.io/spark-rapids/docs/additional-functionality/cache-serializer.html. If that is the case, we need to update the docs.

Originally the intention was that, since the user has to specify it, it should match the version of Spark they are using so that it's hopefully least confusing. I don't know how much it matters whether they specify spark311 vs spark311db.

If we add spark311db to the documentation, we will then also have to add spark311cdh. I almost feel that since they are all Spark 311 and we aren't doing anything specific in the extended versions of their PCBS, we should just get rid of them.

If it's possible, it would be nice to have one class that just loads the proper shim version, but that is a separate issue.

This is a good idea; I can look into it as a follow-on.

Collaborator Author

@jlowe thoughts?

Member

If it's possible, it would be nice to have one class that just loads the proper shim version, but that is a separate issue.

☝️ This. We should not have Spark-specific versions of user-visible classes unless they are truly required (e.g.: as in the shuffle case, unfortunately). If we know one class will work going forward, as is the case with the main executor plugin, then we should strive to use a common class name without a Spark version in it. If this is indeed possible, we should deprecate the old 311 version and eventually remove it.

So it really all comes down to that question. If we can have a common version, my vote is to use the one class. We can change the package name and deprecate the existing spark311 package version in a new PR if it's too tricky to do in this one.
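
A hypothetical illustration of that direction; every name below is invented for the sketch rather than taken from the plugin. The idea is a single stable, user-visible serializer entry point that resolves and instantiates the shim-specific implementation for the running Spark version.

import org.apache.spark.sql.columnar.CachedBatchSerializer

// Hypothetical sketch, not the plugin's real code: resolve the shim package from
// the running Spark version and instantiate its serializer reflectively, so users
// only ever configure one class name.
object ShimSerializerLoader {
  private def shimPackage(sparkVersion: String): String = {
    // Illustrative mapping only; a real resolver would also detect vendor builds
    // such as Databricks.
    if (sparkVersion.startsWith("3.1.2")) "spark312" else "spark311"
  }

  def load(sparkVersion: String): CachedBatchSerializer = {
    val className =
      s"com.nvidia.spark.rapids.shims.${shimPackage(sparkVersion)}.ParquetCachedBatchSerializer"
    Class.forName(className)
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[CachedBatchSerializer]
  }
}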

Collaborator Author

OK. I can do this as a follow-on. In the interim, do we just update the doc, or remove the spark311db and spark311cdh versions of the serializer? I feel removing the db and cdh versions of the serializer is the way to go, as we will do more work as part of the follow-on.

Member

If we're planning on removing these classes in the near future then we should not document them only to rip them out immediately afterward. Let's keep the number of classes to deprecate to a minimum.

jlowe previously approved these changes Jul 9, 2021

@jlowe (Member) left a comment

Looks OK to me, assuming the other Spark-specific cache serializers (e.g., spark312) will be removed in a follow-up.

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri (Collaborator Author)

build

Signed-off-by: Raza Jafri <rjafri@nvidia.com>
@razajafri (Collaborator Author)

build

@razajafri requested a review from tgravescs July 12, 2021 17:35
tgravescs previously approved these changes Jul 12, 2021

@jlowe (Member) left a comment

Looks like some more unneeded imports in the Databricks shim.

…spark311db/Spark311dbShims.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
…spark311db/Spark311dbShims.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
@razajafri (Collaborator Author)

build

@razajafri merged commit 434b911 into NVIDIA:branch-21.08 Jul 12, 2021
@razajafri deleted the db_cache branch July 15, 2021 01:17
@pxLi (Collaborator) commented Jul 24, 2021

#3016, and made a fix at #3018

Labels: feature request

Successfully merging this pull request may close these issues:
[BUG] cache of struct does not work on databricks 8.2ML

7 participants