[BUG] data mess up reading from ORC #3007
full repro case:
Spark seems to simply reorder the columns: it does not force the read schema onto the column data, it only dictates the output order. Spark output:
With the rapids plugin enabled we get:
It looks like we are writing the wrong schema to the in-memory ORC file before sending it to cuDF. We are writing the requested schema instead of the actual stripe schema. So in this case the file stripe schema is (name: String, number: Int), but we are writing the footer with (number: Int, name: String), and things get mangled.
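The mix-up above can be modeled in a few lines of plain Scala (hypothetical names, not the plugin's actual code): the file stores columns in its own order, and the buggy path labels that positional data with the requested read schema instead of resolving columns by name.

```scala
// Minimal model of the bug: columnar data stored in file order gets
// labeled with the requested read schema instead of being matched by name.
object SchemaOrderDemo {
  // File schema: (name: String, number: Int), stored column by column.
  val fileSchema: Seq[String] = Seq("name", "number")
  val fileColumns: Map[String, Seq[Any]] =
    Map("name" -> Seq("a", "b"), "number" -> Seq(1, 2))

  // Requested read schema: (number, name).
  val readSchema: Seq[String] = Seq("number", "name")

  // Buggy: take columns in file order and label them with the read
  // schema, so "number" ends up holding the string data.
  def buggyRead(): Map[String, Seq[Any]] =
    readSchema.zip(fileSchema.map(fileColumns)).toMap

  // Correct: resolve each requested column by name in the file.
  def correctRead(): Map[String, Seq[Any]] =
    readSchema.map(c => c -> fileColumns(c)).toMap
}
```

With the buggy mapping, asking for `number` returns the string column; resolving by name returns the right data regardless of the requested order.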
When fixing this bug, I would prefer to change the file schema to align with the read schema, for issue #463. Currently nested schema pruning is basically supported by buildOutputStripe, but it has the same issue with differing column orders. cuDF does not support pruning nested columns and will read the whole struct column intact. So if we fix this bug in that way, nested schema pruning will be fully supported without changes in cuDF.
Close this issue.
Liangcai has found the same issue for schemas which can't be pruned. I just reproduced it with the code below:

```scala
val df = Seq(Testing(1, "hello", 2021)).toDF
df.printSchema()
// root
//  |-- _col1: integer (nullable = false)
//  |-- _col2: string (nullable = true)
//  |-- _col3: long (nullable = false)
df.show()
// +-----+-----+-----+
// |_col1|_col2|_col3|
// +-----+-----+-----+
// |    1|hello| 2021|
// +-----+-----+-----+
df.write.mode("overwrite").orc(resource1)
val schema = StructType(
  Seq(
    StructField("_col2", StringType),
    StructField("_col3", LongType),
    StructField("_col1", IntegerType),
  ))
val dfRead = spark.read.schema(schema).orc(resource1)
dfRead.show()
```

The GPU output is

```
+-----+-----+-----+
|_col2|_col3|_col1|
+-----+-----+-----+
|     |    1|    5|
+-----+-----+-----+
```

while the CPU output is

```
+-----+-----+-----+
|_col2|_col3|_col1|
+-----+-----+-----+
|    1| null| 2021|
+-----+-----+-----+
```

Looks like there is an issue for the CPU reading ORC as well.
Filed the following issue #3060; closing this issue.
Assume the file schema of an ORC file is
The reading code below will result in data mess-up,
while the code below works fine.
That's because we are mangling the ORC data against the read schema when re-constructing the ORC files:
we read data according to the columns' order in the ORC file, while we write the read schema into the footer.
So, as long as the read schema does not follow the order of the ORC file schema, the issue will happen.
This bug can be reproduced in the latest 21.08-SNAPSHOT and in the 21.06 release. I didn't test earlier releases; it seems this bug has always existed.
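One fix direction, sketched below in plain Scala with illustrative names (not the plugin's actual API): when re-constructing the in-memory ORC file, write the footer with the requested columns in *file* order rather than read-schema order, and let the reordering to the read schema happen afterwards by name.

```scala
// Hypothetical sketch: build the footer schema by pruning the file schema
// to the requested fields while preserving the file's column order.
object FooterSchemaFix {
  final case class Field(name: String, dataType: String)

  // Keep only the requested fields, in the order they appear in the file.
  def footerSchema(fileSchema: Seq[Field], readSchema: Seq[Field]): Seq[Field] = {
    val wanted = readSchema.map(_.name).toSet
    fileSchema.filter(f => wanted.contains(f.name))
  }
}
```

With a file schema of (name, number) and a read schema of (number, name), this yields a footer of (name, number), matching how the stripe data is actually laid out.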