[BUG] legacy struct cast to string crashes on a two field struct #2315

Closed
gerashegalov opened this issue Apr 30, 2021 · 0 comments · Fixed by #2395
Labels: bug (Something isn't working), P0 (Must have for release)

Describe the bug
In Spark 3.0.x, or in Spark 3.1+ with spark.sql.legacy.castComplexTypesToString.enabled=true, queries over an RDD source may crash with:

java.lang.AssertionError:  value at 15 is null
        at ai.rapids.cudf.HostColumnVectorCore.assertsForGet(HostColumnVectorCore.java:228)
        at ai.rapids.cudf.HostColumnVectorCore.getUTF8(HostColumnVectorCore.java:355)
        at com.nvidia.spark.rapids.RapidsHostColumnVectorCore.getUTF8String(RapidsHostColumnVectorCore.java:177)
        at org.apache.spark.sql.vectorized.ColumnarBatchRow.getUTF8String(ColumnarBatch.java:220)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
        at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:346)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)

Steps/Code to reproduce bug
Minimum repro:

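import sys
# Make the spark-rapids integration test helpers (data_gen) importable
sys.path.append(r'/path/to/localgit/spark-rapids/integration_tests/src/main/python')

import pyspark.sql.functions as F
import pytest
from pyspark.sql.types import StringType
from data_gen import *

# Column 'a': a non-nullable struct with two integer fields
key_data_gen = StructGen([
        ('a', IntegerGen(min_val=0, max_val=4)),
        ('b', IntegerGen(min_val=5, max_val=9)),
    ], nullable=False)
# Column 'b': a plain integer
val_data_gen = IntegerGen()
# 'spark' is the session provided by the REPL
df = two_col_df(spark, key_data_gen, val_data_gen)

# For Spark 3.1+
spark.conf.set('spark.sql.legacy.castComplexTypesToString.enabled', True)
# Casting the struct column to string after an RDD scan triggers the crash
df.select(df.a.cast(StringType())).filter(df.b > 1).collect()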

GPU plan exhibiting the crash:

21/04/30 17:37:16 WARN GpuOverrides: 
*Exec <ProjectExec> will run on GPU
  *Expression <Alias> cast(a#0 as string) AS a#34 will run on GPU
    *Expression <Cast> cast(a#0 as string) will run on GPU
  *Exec <FilterExec> will run on GPU
    *Expression <And> (isnotnull(b#1) AND (b#1 > 1)) will run on GPU
      *Expression <IsNotNull> isnotnull(b#1) will run on GPU
      *Expression <GreaterThan> (b#1 > 1) will run on GPU
    !NOT_FOUND <RDDScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.RDDScanExec could be found
      @Expression <AttributeReference> a#0 could run on GPU
      @Expression <AttributeReference> b#1 could run on GPU
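
When reading a longer GpuOverrides log, it can help to list the operators that stay on the CPU. The following is a minimal sketch that relies only on the phrase "cannot run on GPU" as it appears in the log above; the helper name operators_off_gpu is hypothetical and not part of spark-rapids.

```python
import re
from typing import List

# Plan text copied from the GpuOverrides warning above.
PLAN = """\
*Exec <ProjectExec> will run on GPU
  *Expression <Alias> cast(a#0 as string) AS a#34 will run on GPU
    *Expression <Cast> cast(a#0 as string) will run on GPU
  *Exec <FilterExec> will run on GPU
    *Expression <And> (isnotnull(b#1) AND (b#1 > 1)) will run on GPU
      *Expression <IsNotNull> isnotnull(b#1) will run on GPU
      *Expression <GreaterThan> (b#1 > 1) will run on GPU
    !NOT_FOUND <RDDScanExec> cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.RDDScanExec could be found
      @Expression <AttributeReference> a#0 could run on GPU
      @Expression <AttributeReference> b#1 could run on GPU
"""


def operators_off_gpu(plan_text: str) -> List[str]:
    """Return the <Name> of every plan line that says it cannot run on GPU."""
    found = []
    for line in plan_text.splitlines():
        if "cannot run on GPU" in line:
            # The operator or expression class is the first <...> token on the line.
            match = re.search(r"<(\w+)>", line)
            if match:
                found.append(match.group(1))
    return found


print(operators_off_gpu(PLAN))
```

For this plan the only operator reported as staying on the CPU is RDDScanExec; the project, filter, and cast all run on the GPU.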

Expected behavior
The cast should produce the same result on the GPU as on the CPU, without crashing.
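
For reference, here is a rough pure-Python sketch of what the CPU output is expected to look like for one row, based on my understanding of the Spark SQL migration notes: legacy mode wraps structs in [] and renders null fields as empty strings, while the Spark 3.1+ default wraps them in {} and renders null fields as "null". The function struct_to_string is hypothetical and only handles values whose Python str() matches Spark's rendering, such as integers. Treat the exact formatting as an assumption to verify against your Spark version.

```python
from typing import Optional, Sequence


def struct_to_string(fields: Sequence[Optional[object]], legacy: bool) -> str:
    """Approximate Spark's CPU struct-to-string cast for one row (illustrative only)."""
    left, right = ("[", "]") if legacy else ("{", "}")
    parts = []
    for i, value in enumerate(fields):
        if value is None:
            # Assumption: legacy mode renders a null field as an empty string,
            # the non-legacy mode renders it as the literal "null".
            text = "" if legacy else "null"
        else:
            text = str(value)
        # Fields after the first are preceded by a space unless they are empty.
        if i > 0 and text:
            text = " " + text
        parts.append(text)
    return left + ",".join(parts) + right


print(struct_to_string([3, 7], legacy=True))      # [3, 7]
print(struct_to_string([3, 7], legacy=False))     # {3, 7}
print(struct_to_string([3, None], legacy=True))   # [3,]
print(struct_to_string([3, None], legacy=False))  # {3, null}
```

The two-field case with a null in the second field is the interesting one for this issue, since the stack trace above fails on a null value.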

Environment details (please complete the following information)
A local REPL is sufficient to reproduce.

Additional context
Bug found while working on #2274.

Interestingly, saving the synthetic df to Parquet and reading it back yields the correct result without a crash. The plan then uses FileSourceScanExec:

*Exec <ProjectExec> will run on GPU
  *Expression <Alias> cast(a#38 as string) AS a#42 will run on GPU
    *Expression <Cast> cast(a#38 as string) will run on GPU
  *Exec <FilterExec> will run on GPU
    *Expression <And> (isnotnull(b#39) AND (b#39 > 1)) will run on GPU
      *Expression <IsNotNull> isnotnull(b#39) will run on GPU
      *Expression <GreaterThan> (b#39 > 1) will run on GPU
    *Exec <FileSourceScanExec> will run on GPU
gerashegalov added the bug (Something isn't working), "? - Needs Triage" (Need team to review and classify), and P1 (Nice to have for release) labels on Apr 30, 2021
sameerz added the P0 (Must have for release) label and removed the "? - Needs Triage" and P1 labels on May 4, 2021
sameerz added this to the May 10 - May 21 milestone on May 4, 2021
gerashegalov added a commit to gerashegalov/spark-rapids that referenced this issue May 11, 2021
Fixes NVIDIA#2309 and NVIDIA#2315

Signed-off-by: Gera Shegalov <gera@apache.org>
gerashegalov linked a pull request on May 11, 2021 that will close this issue
gerashegalov added a commit that referenced this issue May 14, 2021
Refactors struct cast to string such that there is no need for a dedicated method handling the legacy mode cast. Fixes #2309 and #2315

Signed-off-by: Gera Shegalov gera@apache.org
nartal1 pushed a commit to nartal1/spark-rapids that referenced this issue Jun 9, 2021
Refactors struct cast to string such that there is no need for a dedicated method handling the legacy mode cast. Fixes NVIDIA#2309 and NVIDIA#2315

Signed-off-by: Gera Shegalov gera@apache.org