Describe the bug
Caching a struct column does not work on Databricks 8.2 ML.
When setting spark.sql.cache.serializer to com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer, the cache falls back to the CPU.
When setting spark.sql.cache.serializer to com.nvidia.spark.rapids.shims.spark311db.ParquetCachedBatchSerializer, it fails with: ClassNotFoundException: com.nvidia.spark.rapids.shims.spark311db.ParquetCachedBatchSerializer
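The missing class can also be confirmed independently of the cache path. A minimal check from a notebook cell, assuming the RAPIDS plugin jar is on the driver classpath:

// Loads, consistent with the spark311 serializer only causing a CPU fallback
Class.forName("com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer")
// Throws java.lang.ClassNotFoundException, matching the error above
Class.forName("com.nvidia.spark.rapids.shims.spark311db.ParquetCachedBatchSerializer")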
I also checked the shim layers and could not find ParquetCachedBatchSerializer in the spark311db shim:
$ grep -r ParquetCachedBatchSerializer *
spark311/src/main/scala/org/apache/spark/sql/rapids/shims/spark311/GpuInMemoryTableScanExec.scala:import com.nvidia.spark.rapids.shims.spark311.ParquetCachedBatchSerializer
spark311/src/main/scala/org/apache/spark/sql/rapids/shims/spark311/GpuInMemoryTableScanExec.scala: relation.cacheBuilder.serializer.asInstanceOf[ParquetCachedBatchSerializer]
spark311/src/main/scala/com/nvidia/spark/rapids/shims/spark311/Spark311Shims.scala: if (!scan.relation.cacheBuilder.serializer.isInstanceOf[ParquetCachedBatchSerializer]) {
spark311/src/main/scala/com/nvidia/spark/rapids/shims/spark311/Spark311Shims.scala: willNotWorkOnGpu("ParquetCachedBatchSerializer is not being used")
spark311/src/main/scala/com/nvidia/spark/rapids/shims/spark311/Spark311Shims.scala: if (serClass == classOf[ParquetCachedBatchSerializer]) {
spark311/src/main/scala/com/nvidia/spark/rapids/shims/spark311/ParquetCachedBatchSerializer.scala:class ParquetCachedBatchSerializer extends CachedBatchSerializer with Arm {
spark311cdh/src/main/scala/com/nvidia/spark/rapids/shims/spark311cdh/Spark311CDHShims.scala: if (!scan.relation.cacheBuilder.serializer.isInstanceOf[ParquetCachedBatchSerializer]) {
spark311cdh/src/main/scala/com/nvidia/spark/rapids/shims/spark311cdh/Spark311CDHShims.scala: willNotWorkOnGpu("ParquetCachedBatchSerializer is not being used")
spark311cdh/src/main/scala/com/nvidia/spark/rapids/shims/spark311cdh/Spark311CDHShims.scala: if (serClass == classOf[ParquetCachedBatchSerializer]) {
spark311cdh/src/main/scala/com/nvidia/spark/rapids/shims/spark311cdh/ParquetCachedBatchSerializer.scala:class ParquetCachedBatchSerializer extends CachedBatchSerializer with Arm {
spark312/src/main/scala/com/nvidia/spark/rapids/shims/spark312/ParquetCachedBatchSerializer.scala:class ParquetCachedBatchSerializer extends shims.spark311.ParquetCachedBatchSerializer {
spark313/src/main/scala/com/nvidia/spark/rapids/shims/spark313/ParquetCachedBatchSerializer.scala:class ParquetCachedBatchSerializer extends shims.spark312.ParquetCachedBatchSerializer {
spark320/src/main/scala/com/nvidia/spark/rapids/shims/spark320/ParquetCachedBatchSerializer.scala:class ParquetCachedBatchSerializer extends shims.spark311.ParquetCachedBatchSerializer {
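The same check can be run against the shipped artifact rather than the source tree (a sketch, assuming the standard rapids-4-spark dist jar name):

$ jar tf rapids-4-spark_2.12-*.jar | grep ParquetCachedBatchSerializer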
Steps/Code to reproduce bug
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Build a DataFrame with a nested struct column and round-trip it through Parquet.
val data = Seq(
  Row(Row("Adam ", "", "Green"), "1", "M", 1000),
  Row(Row("Bob ", "Middle", "Green"), "2", "M", 2000),
  Row(Row("Cathy ", "", "Green"), "3", "F", 3000)
)
val schema = (new StructType()
  .add("name", new StructType()
    .add("firstname", StringType)
    .add("middlename", StringType)
    .add("lastname", StringType))
  .add("id", StringType)
  .add("gender", StringType)
  .add("salary", IntegerType))
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
df.write.format("parquet").mode("overwrite").save("/tmp/testparquet")
val df2 = spark.read.parquet("/tmp/testparquet")
df2.createOrReplaceTempView("df2")
// Caching this struct-of-struct projection is what triggers the problem.
val df3 = spark.sql("select struct(name, struct(name.firstname, name.lastname) as newname) as col from df2").cache
df3.createOrReplaceTempView("df3")
spark.sql("select count(distinct col.name.firstname) from df3").show
spark.sql("select count(distinct col.name.firstname) from df3").explain
The following plan is shown:
Expected behavior
The correct plan should be:
Environment details
Databricks 8.2 ML GPU with Spark 3.1.1