[BUG] legacy struct cast to string crashes on a two field struct #2315

gerashegalov · 2021-04-30T17:58:55Z

Describe the bug
In Spark 3.0.x, or in Spark 3.1+ with spark.sql.legacy.castComplexTypesToString.enabled=true, queries from an RDD source may crash with:

java.lang.AssertionError:  value at 15 is null
        at ai.rapids.cudf.HostColumnVectorCore.assertsForGet(HostColumnVectorCore.java:228)
        at ai.rapids.cudf.HostColumnVectorCore.getUTF8(HostColumnVectorCore.java:355)
        at com.nvidia.spark.rapids.RapidsHostColumnVectorCore.getUTF8String(RapidsHostColumnVectorCore.java:177)
        at org.apache.spark.sql.vectorized.ColumnarBatchRow.getUTF8String(ColumnarBatch.java:220)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
        at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:346)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)

Steps/Code to reproduce bug
Minimum repro:

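import sys
# Make the integration-test helpers (data_gen) importable from the REPL.
sys.path.append(r'/path/to/localgit/spark-rapids/integration_tests/src/main/python')

import pyspark.sql.functions as F
import pytest
from pyspark.sql.types import StringType
from data_gen import *

# Column 'a' is a non-nullable struct with two integer fields; column 'b' is an integer.
key_data_gen = StructGen([
        ('a', IntegerGen(min_val=0, max_val=4)),
        ('b', IntegerGen(min_val=5, max_val=9)),
    ], nullable=False)
val_data_gen = IntegerGen()
# `spark` is the session provided by the REPL.
df = two_col_df(spark, key_data_gen, val_data_gen)

# For Spark 3.1+ only: select the legacy cast format (Spark 3.0.x always uses it).
spark.conf.set('spark.sql.legacy.castComplexTypesToString.enabled', True)
# Cast the struct column to string on top of the RDD-backed scan; this crashes on the GPU.
df.select(df.a.cast(StringType())).filter(df.b > 1).collect()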

GPU plan exhibiting the crash:

21/04/30 17:37:16 WARN GpuOverrides: 
*Exec <ProjectExec> will run on GPU
  *Expression <Alias> cast(a#0 as string) AS a#34 will run on GPU
    *Expression <Cast> cast(a#0 as string) will run on GPU
  *Exec <FilterExec> will run on GPU
    *Expression <And> (isnotnull(b#1) AND (b#1 > 1)) will run on GPU
      *Expression <IsNotNull> isnotnull(b#1) will run on GPU
      *Expression <GreaterThan> (b#1 > 1) will run on GPU
    !NOT_FOUND <RDDScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.RDDScanExec could be found
      @Expression <AttributeReference> a#0 could run on GPU
      @Expression <AttributeReference> b#1 could run on GPU

Expected behavior
The cast should work the same as it does on the CPU.
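One way to check this expectation from the same REPL is to run the query once with the plugin disabled and once with it enabled, then compare the collected rows. The sketch below assumes the RAPIDS Accelerator's `spark.rapids.sql.enabled` setting can be toggled at runtime between queries; `collect_cpu_and_gpu` and `build_query` are hypothetical names used for illustration, not part of the integration tests.

```python
def collect_cpu_and_gpu(spark, build_query):
    """Collect the same query with the plugin off (CPU) and on (GPU).

    `build_query` is a hypothetical zero-argument callable that returns a
    DataFrame. It is called once per run so that each plan is built under
    the setting in effect at that time.
    """
    results = {}
    for label, enabled in (("cpu", "false"), ("gpu", "true")):
        # Assumption: spark.rapids.sql.enabled is honoured per query.
        spark.conf.set("spark.rapids.sql.enabled", enabled)
        results[label] = build_query().collect()
    # The plugin is left enabled after the second run.
    return results["cpu"], results["gpu"]
```

With the repro script, `cpu, gpu = collect_cpu_and_gpu(spark, lambda: df.select(df.a.cast('string')).filter(df.b > 1))` followed by `assert sorted(cpu) == sorted(gpu)` would express the expected behavior. Sorting is used because row order is not guaranteed to match between the two runs.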

Environment details (please complete the following information)
A local REPL is sufficient to reproduce.

Additional context
Bug found while working on #2274.

Interestingly, saving the synthetic df to Parquet and reading it back yields the correct result without a crash. The plan then uses FileSourceScanExec:

*Exec <ProjectExec> will run on GPU
  *Expression <Alias> cast(a#38 as string) AS a#42 will run on GPU
    *Expression <Cast> cast(a#38 as string) will run on GPU
  *Exec <FilterExec> will run on GPU
    *Expression <And> (isnotnull(b#39) AND (b#39 > 1)) will run on GPU
      *Expression <IsNotNull> isnotnull(b#39) will run on GPU
      *Expression <GreaterThan> (b#39 > 1) will run on GPU
    *Exec <FileSourceScanExec> will run on GPU
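The Parquet round trip described above can be sketched as follows. `parquet_round_trip` is a hypothetical helper and the output path is a placeholder; only `DataFrame.write.mode(...).parquet(...)` and `spark.read.parquet(...)` are standard PySpark calls.

```python
def parquet_round_trip(spark, df, path):
    """Write `df` to Parquet at `path` and read it back.

    Reading the data back replaces the RDD-based scan with a file scan,
    which is the variant reported to work on the GPU.
    """
    # Overwrite so the helper can be re-run from a REPL session.
    df.write.mode("overwrite").parquet(path)
    return spark.read.parquet(path)
```

For example, `df2 = parquet_round_trip(spark, df, '/tmp/issue_2315_df')` (a made-up path) and then `df2.select(df2.a.cast('string')).filter(df2.b > 1).collect()` runs the same query against the file-backed DataFrame.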

Refactors struct cast to string such that there is no need for a dedicated method handling the legacy mode cast. Fixes #2309 and #2315. Signed-off-by: Gera Shegalov <gera@apache.org>
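To illustrate the idea of a single code path, here is a plain-Python sketch in which one function covers both modes and a flag selects the formatting. It is not the plugin's implementation, which operates on GPU columns; `struct_to_string` is a hypothetical name. The formatting rules are an assumption based on Spark's documented behavior: legacy mode wraps fields in `[]` and renders null fields as empty strings, while Spark 3.1+ wraps them in `{}` and prints `null`.

```python
def struct_to_string(fields, legacy=False):
    """Render one struct row as a string in legacy or Spark 3.1+ style.

    `fields` is a sequence of already-evaluated field values, with None
    standing in for SQL NULL. Values are rendered with str(), so this
    sketch is only meaningful for simple types such as integers.
    """
    left, right = ("[", "]") if legacy else ("{", "}")
    parts = []
    for i, value in enumerate(fields):
        is_null = value is None
        if is_null:
            # Legacy mode emits nothing for a null field; 3.1+ emits "null".
            text = "" if legacy else "null"
        else:
            text = str(value)
        # Every field after the first is preceded by a space,
        # except a null field in legacy mode.
        if i > 0 and not (is_null and legacy):
            text = " " + text
        parts.append(text)
    return left + ",".join(parts) + right
```

Here `struct_to_string((1, 6), legacy=True)` returns `[1, 6]` and `struct_to_string((1, None))` returns `{1, null}`. Keeping the mode as a parameter means the null handling and the bracket choice are the only places where the two behaviors diverge.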

gerashegalov added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and P1 (Nice to have for release) labels on Apr 30, 2021

sameerz added the P0 (Must have for release) label and removed the ? - Needs Triage (Need team to review and classify) and P1 (Nice to have for release) labels on May 4, 2021

sameerz assigned gerashegalov May 4, 2021

sameerz added this to the May 10 - May 21 milestone May 4, 2021

gerashegalov added a commit to gerashegalov/spark-rapids that referenced this issue May 11, 2021

Unify legacy and current struct cast logic

d6f0911

Fixes NVIDIA#2309 and NVIDIA#2315 Signed-off-by: Gera Shegalov <gera@apache.org>

gerashegalov mentioned this issue May 11, 2021

Unify legacy and 3.1.x struct cast implementations #2395

Merged

gerashegalov linked a pull request May 11, 2021 that will close this issue

Unify legacy and 3.1.x struct cast implementations #2395

Merged

gerashegalov closed this as completed in #2395 May 14, 2021
