
[BUG] data mess up reading from ORC #3007

Closed

wbo4958 opened this issue Jul 23, 2021 · 6 comments

Labels: bug (Something isn't working), P0 (Must have for release)

@wbo4958
Collaborator

wbo4958 commented Jul 23, 2021

Assume an ORC file has the following schema:

Type: struct<name:string,number:int,english:float,math:int,history:float>

The reading code below will mess up the data:

      val schema = StructType(Array(
        StructField("number", IntegerType),
        StructField("name", StringType)))
      val df = spark.read.schema(schema).orc("xxxxxx")
      df.show()

while the code below works fine:

      val schema = StructType(Array(
        StructField("name", StringType),
        StructField("number", IntegerType)))
      val df = spark.read.schema(schema).orc("xxxxxx")
      df.show()

That's because we mix up the ORC data with the read schema when re-constructing the ORC file.

We read the column data in the order it appears in the ORC file, but we write the read schema into the footer. So whenever the read schema does not follow the column order of the ORC file schema, this issue will happen.
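
To illustrate the direction of the fix, here is a minimal, hypothetical sketch (not the plugin's actual code): re-sort the requested fields into file-schema order, so the footer we write matches the order in which the column data is actually copied.

import org.apache.spark.sql.types.StructType

// Hypothetical helper: keep only the requested fields, but in the order they
// appear in the file schema, so the footer order matches the data order.
def alignToFileOrder(fileSchema: StructType, readSchema: StructType): StructType = {
  val requested = readSchema.fields.map(f => f.name -> f).toMap
  StructType(fileSchema.fields.collect {
    case f if requested.contains(f.name) => requested(f.name)
  })
}

For the example above, alignToFileOrder would turn the requested (number, name) back into (name, number), which is the order the column data is actually read in.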

This bug can be reproduced on the latest 21.08-SNAPSHOT and on the 21.06 release. I didn't test earlier releases; it seems this bug has always existed.

wbo4958 added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and P0 (Must have for release) labels on Jul 23, 2021
@tgravescs
Collaborator

Full repro case:

// create a file like schema above
case class Testing(name: String, number: Int, english: Float, math: Int, history: Float)

val x = sc.parallelize(Array(Testing("three", 23, 3, 3, 4)))
val df = spark.createDataFrame(x)
df.write.orc("testing.orc")
val dfread = spark.read.orc("testing.orc")
dfread.printSchema()
root
 |-- name: string (nullable = true)
 |-- number: integer (nullable = true)
 |-- english: float (nullable = true)
 |-- math: integer (nullable = true)
 |-- history: float (nullable = true)

// now read it

import org.apache.spark.sql.types._
val schema = StructType(Array(
  StructField("number", IntegerType),
  StructField("name", StringType)))
val df = spark.read.schema(schema).orc("testing.orc")
df.show()
df.printSchema

Spark seems to just flip the columns: it doesn't force the read schema onto the columns positionally, the read schema just dictates the output order.

spark:

scala> df.show()
+------+-----+
|number| name|
+------+-----+
|    23|three|
+------+-----+

scala> df.printSchema
root
 |-- number: integer (nullable = true)
 |-- name: string (nullable = true)
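
For comparison, a minimal sketch of that by-name matching on the CPU, assuming the testing.orc file from this repro: reading with a reordered schema behaves like a by-name selection, not a positional reinterpretation.

// Reads the full file and selects the columns by name in the requested order;
// on the CPU this gives the same result as the schema-based read above.
val byName = spark.read.orc("testing.orc").select("number", "name")
byName.show()
// +------+-----+
// |number| name|
// +------+-----+
// |    23|three|
// +------+-----+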

With the rapids plugin enabled we get:

scala> df.printSchema()
root
 |-- number: integer (nullable = true)
 |-- name: string (nullable = true)
scala> df.show()
+------+----+
|number|name|
+------+----+
|     0|    |
+------+----+

@tgravescs
Collaborator

It looks like we are writing the wrong schema to the in-memory ORC file before sending it to cuDF: we write the requested schema instead of the actual stripe schema. In this case the file's stripe schema is (name: String, number: Int), but we write the footer with (number: Int, name: String), and things get mangled.
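
A quick, hypothetical check of that mismatch, reusing the schema and testing.orc from the repro above (illustration only, not plugin code):

// Fields in the order we currently write to the footer (the requested schema).
println(schema.fieldNames.toList)                     // List(number, name)
// The same fields in the order they actually appear in the file (and its stripes).
println(spark.read.orc("testing.orc").schema.fieldNames
  .filter(schema.fieldNames.contains).toList)         // List(name, number)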

@firestarman
Collaborator

firestarman commented Jul 26, 2021

When fixing this bug, I would prefer to change the file schema to align with the read schema, for issue #463. @wbo4958 has such a fix.

Currently nested schema pruning seems to be basically supported by buildOutputStripe, but it has the same issue with differing column orders. cuDF does not support pruning nested columns and will read the whole struct column intact. So if we fix this bug that way, nested schema pruning will be fully supported without any changes in cuDF.
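
As a rough illustration of that direction (a hypothetical sketch, not the plugin's actual buildOutputStripe), pruning by name while preserving the file's column order at every nesting level could look like this:

import org.apache.spark.sql.types.{DataType, StructField, StructType}

// Keep only the requested (possibly nested) struct fields, matched by name,
// in the order the file schema declares them; leaf types come from the file.
// Arrays and maps of structs are left out to keep the sketch short.
def pruneByName(fileType: DataType, readType: DataType): DataType =
  (fileType, readType) match {
    case (f: StructType, r: StructType) =>
      val requested = r.fields.map(x => x.name -> x).toMap
      StructType(f.fields.collect {
        case ff if requested.contains(ff.name) =>
          ff.copy(dataType = pruneByName(ff.dataType, requested(ff.name).dataType))
      })
    case (f, _) => f
  }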

Salonijain27 removed the ? - Needs Triage (Need team to review and classify) label on Jul 27, 2021
@wbo4958
Collaborator Author

wbo4958 commented Jul 28, 2021

Closing this issue.

@wbo4958
Collaborator Author

wbo4958 commented Jul 28, 2021

Liangcai has found the same issue for a schema which can't be pruned. I just reproduced it with the code below:

      // Assumes something like: case class Testing(_col1: Int, _col2: String, _col3: Long),
      // a target path `resource1`, and spark.implicits._ in scope for toDF.
      val df = Seq(Testing(1, "hello", 2021)).toDF
      df.printSchema()
      // root
      // |-- _col1: integer (nullable = false)
      // |-- _col2: string (nullable = true)
      // |-- _col3: long (nullable = false)
      df.show()
      // +-----+-----+-----+
      //|_col1|_col2|_col3|
      //+-----+-----+-----+
      //|    1|hello| 2021|
      //+-----+-----+-----+
      df.write.mode("overwrite").orc(resource1)

      val schema = StructType(Seq(
        StructField("_col2", StringType),
        StructField("_col3", LongType),
        StructField("_col1", IntegerType)))
      val dfRead = spark.read.schema(schema).orc(resource1)
      dfRead.show()

The GPU output is

+-----+-----+-----+
|_col2|_col3|_col1|
+-----+-----+-----+
|     |    1|    5|
+-----+-----+-----+

while the CPU output is

+-----+-----+-----+
|_col2|_col3|_col1|
+-----+-----+-----+
|    1| null| 2021|
+-----+-----+-----+

Looks like there is an issue with the CPU reading ORC as well.

@wbo4958
Collaborator Author

wbo4958 commented Jul 28, 2021

Filed the follow-up issue #3060; closing this issue.

wbo4958 closed this as completed on Jul 28, 2021