
[BUG] data mess up reading from ORC #3007

Closed

wbo4958 opened this issue Jul 23, 2021 · 6 comments

Labels: bug (Something isn't working), P0 (Must have for release)

@wbo4958
Collaborator

wbo4958 commented Jul 23, 2021

Assume an ORC file has the following schema:

Type: struct<name:string,number:int,english:float,math:int,history:float>

The reading code below will mess up the data:

      val schema = StructType(Array(
        StructField("number", IntegerType),
        StructField("name", StringType)))
      val df = spark.read.schema(schema).orc("xxxxxx")
      df.show()

while the code below works fine:

      val schema = StructType(Array(
        StructField("name", StringType),
        StructField("number", IntegerType)))
      val df = spark.read.schema(schema).orc("xxxxxx")
      df.show()

That's because we mix up the ORC data with the read schema when re-constructing the ORC file.

We read the column data in the order it appears in the ORC file, but we write the read schema into the footer. So whenever the read schema does not follow the column order of the ORC file schema, this issue will happen.
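
To illustrate the direction of the fix, here is a minimal, hypothetical sketch (not the plugin's actual code): re-sort the requested fields into file-schema order, so the footer we write matches the order in which the column data is actually copied.

import org.apache.spark.sql.types.StructType

// Hypothetical helper: keep only the requested fields, but in the order they
// appear in the file schema, so the footer order matches the data order.
def alignToFileOrder(fileSchema: StructType, readSchema: StructType): StructType = {
  val requested = readSchema.fields.map(f => f.name -> f).toMap
  StructType(fileSchema.fields.collect {
    case f if requested.contains(f.name) => requested(f.name)
  })
}

For the example above, alignToFileOrder would turn the requested (number, name) back into (name, number), which is the order the column data is actually read in.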

This bug can be reproduced on the latest 21.08-SNAPSHOT and on the 21.06 release. I didn't test earlier releases; it seems this bug has always existed.

wbo4958 added the bug (Something isn't working), ? - Needs Triage (Need team to review and classify), and P0 (Must have for release) labels on Jul 23, 2021
@tgravescs
Collaborator

Full repro case:

// create a file like schema above
case class Testing(name: String, number: Int, english: Float, math: Int, history: Float)

val x = sc.parallelize(Array(Testing("three", 23, 3, 3, 4)))
val df = spark.createDataFrame(x)
df.write.orc("testing.orc")
val dfread = spark.read.orc("testing.orc")
dfread.printSchema()
root
 |-- name: string (nullable = true)
 |-- number: integer (nullable = true)
 |-- english: float (nullable = true)
 |-- math: integer (nullable = true)
 |-- history: float (nullable = true)

// now read it

import org.apache.spark.sql.types._
val schema = StructType(Array(
  StructField("number", IntegerType),
  StructField("name", StringType)))
val df = spark.read.schema(schema).orc("testing.orc")
df.show()
df.printSchema

Spark seems to just flip the columns: it doesn't force the read schema onto the columns positionally, the read schema just dictates the output order.

spark:

scala> df.show()
+------+-----+
|number| name|
+------+-----+
|    23|three|
+------+-----+

scala> df.printSchema
root
 |-- number: integer (nullable = true)
 |-- name: string (nullable = true)
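
For comparison, a minimal sketch of that by-name matching on the CPU, assuming the testing.orc file from this repro: reading with a reordered schema behaves like a by-name selection, not a positional reinterpretation.

// Reads the full file and selects the columns by name in the requested order;
// on the CPU this gives the same result as the schema-based read above.
val byName = spark.read.orc("testing.orc").select("number", "name")
byName.show()
// +------+-----+
// |number| name|
// +------+-----+
// |    23|three|
// +------+-----+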

With the rapids plugin enabled we get:

scala> df.printSchema()
root
 |-- number: integer (nullable = true)
 |-- name: string (nullable = true)
scala> df.show()
+------+----+
|number|name|
+------+----+
|     0|    |
+------+----+

@tgravescs
Collaborator

It looks like we are writing the wrong schema to the in-memory ORC file before sending it to cuDF: we write the requested schema instead of the actual stripe schema. In this case the file's stripe schema is (name: String, number: Int), but we write the footer with (number: Int, name: String), and things get mangled.
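
A quick, hypothetical check of that mismatch, reusing the schema and testing.orc from the repro above (illustration only, not plugin code):

// Fields in the order we currently write to the footer (the requested schema).
println(schema.fieldNames.toList)                     // List(number, name)
// The same fields in the order they actually appear in the file (and its stripes).
println(spark.read.orc("testing.orc").schema.fieldNames
  .filter(schema.fieldNames.contains).toList)         // List(name, number)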

@firestarman
Collaborator

firestarman commented Jul 26, 2021

When fixing this bug, I would prefer to change the file schema to align with the read schema, for issue #463. @wbo4958 has such a fix.

Currently nested schema pruning seems to be basically supported by buildOutputStripe, but it has the same issue with differing column orders. cuDF does not support pruning nested columns and will read the whole struct column intact. So if we fix this bug that way, nested schema pruning will be fully supported without any changes in cuDF.
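
As a rough illustration of that direction (a hypothetical sketch, not the plugin's actual buildOutputStripe), pruning by name while preserving the file's column order at every nesting level could look like this:

import org.apache.spark.sql.types.{DataType, StructField, StructType}

// Keep only the requested (possibly nested) struct fields, matched by name,
// in the order the file schema declares them; leaf types come from the file.
// Arrays and maps of structs are left out to keep the sketch short.
def pruneByName(fileType: DataType, readType: DataType): DataType =
  (fileType, readType) match {
    case (f: StructType, r: StructType) =>
      val requested = r.fields.map(x => x.name -> x).toMap
      StructType(f.fields.collect {
        case ff if requested.contains(ff.name) =>
          ff.copy(dataType = pruneByName(ff.dataType, requested(ff.name).dataType))
      })
    case (f, _) => f
  }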

Salonijain27 removed the ? - Needs Triage (Need team to review and classify) label on Jul 27, 2021
@wbo4958
Collaborator Author

wbo4958 commented Jul 28, 2021

Closing this issue.

@wbo4958
Collaborator Author

wbo4958 commented Jul 28, 2021

Liangcai has found the same issue for a schema which can't be pruned. I just reproduced it with the code below:

      // Assumes something like: case class Testing(_col1: Int, _col2: String, _col3: Long),
      // a target path `resource1`, and spark.implicits._ in scope for toDF.
      val df = Seq(Testing(1, "hello", 2021)).toDF
      df.printSchema()
      // root
      // |-- _col1: integer (nullable = false)
      // |-- _col2: string (nullable = true)
      // |-- _col3: long (nullable = false)
      df.show()
      // +-----+-----+-----+
      //|_col1|_col2|_col3|
      //+-----+-----+-----+
      //|    1|hello| 2021|
      //+-----+-----+-----+
      df.write.mode("overwrite").orc(resource1)

      val schema = StructType(Seq(
        StructField("_col2", StringType),
        StructField("_col3", LongType),
        StructField("_col1", IntegerType)))
      val dfRead = spark.read.schema(schema).orc(resource1)
      dfRead.show()

The GPU output is

+-----+-----+-----+
|_col2|_col3|_col1|
+-----+-----+-----+
|     |    1|    5|
+-----+-----+-----+

while the CPU output is

+-----+-----+-----+
|_col2|_col3|_col1|
+-----+-----+-----+
|    1| null| 2021|
+-----+-----+-----+

Looks like there is an issue with the CPU reading ORC as well.

@wbo4958
Collaborator Author

wbo4958 commented Jul 28, 2021

Filed the follow-up issue #3060; closing this issue.

wbo4958 closed this as completed on Jul 28, 2021