Set OrcConf.INCLUDE_COLUMNS for ORC reading #4933
Conversation
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
build
LGTM. BTW, does our current rapids plugin read all ORC columns?
I don't think so. There is a specialized helper function:
/**
* Compute an array of booleans, one for each column in the ORC file, indicating whether the
* corresponding ORC column ID should be included in the file to be loaded by the GPU.
*
* @param evolution ORC schema evolution instance
* @return per-column inclusion flags
*/
private def calcOrcFileIncluded(evolution: SchemaEvolution): Array[Boolean] = {
if (requestedMapping.isDefined) {
// ORC schema has no column names, so need to filter based on index
val orcSchema = orcReader.getSchema
val topFields = orcSchema.getChildren
val numFlattenedCols = orcSchema.getMaximumId
val included = new Array[Boolean](numFlattenedCols + 1)
util.Arrays.fill(included, false)
// first column is the top-level schema struct, always add it
included(0) = true
// find each top-level column requested by top-level index and add it and all child columns
requestedMapping.get.foreach { colIdx =>
val field = topFields.get(colIdx)
(field.getId to field.getMaximumId).foreach { i =>
included(i) = true
}
}
included
} else {
evolution.getFileIncluded
}
}
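For intuition on the flattened column ids the helper walks (an illustrative sketch, not code from this PR): ORC assigns every type in the schema a pre-order id, with id 0 for the root struct, and a field together with its children occupies the contiguous range from getId to getMaximumId.

import org.apache.orc.TypeDescription

// id 0: root struct; id 1: a; id 2: b; ids 3-5: c and its two children
val schema = TypeDescription.fromString("struct<a:int,b:string,c:struct<x:int,y:int>>")
val c = schema.getChildren.get(2)
assert(c.getId == 3 && c.getMaximumId == 5)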
The change under review sets the new configuration from the requested column ids:
// any difference. Just in case it might be important for the ORC methods called by us,
// either today or in the future.
val includeColumns = requestedColIds.filter(_ != -1).sorted.mkString(",")
conf.set(OrcConf.INCLUDE_COLUMNS.getAttribute, includeColumns)
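For a concrete sense of the value being set (hypothetical ids, not taken from the PR): requestedColIds uses -1 for columns missing from the file, so those entries are dropped before building the comma-separated list.

// Hypothetical input: ids resolved for three requested columns,
// where -1 marks a column that does not exist in the file.
val requestedColIds = Array(2, -1, 0)
val includeColumns = requestedColIds.filter(_ != -1).sorted.mkString(",")
assert(includeColumns == "0,2")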
Curious why this is conditional on !canPruneCols? The equivalent Spark code setting this is not similarly conditional. We're calling buildOrcReader in either case, which will examine the INCLUDE_COLUMNS setting, so it seems prudent to set it whether or not we can prune, minimally to help keep this code in sync with the Spark version.
I also simplified the requestedColumnIds method.
> I also simplified the requestedColumnIds method.
Please revert this change. The code was intentionally trying to be similar to the Apache Spark version. The same is true for the code calling this function.
I found that we should add this conf on the opposite conditional branch, where canPruneCols is TRUE, because canPruneCols represents whether we trust ORC to prune columns or do it ourselves.
IMO we need to do what Spark is doing with respect to when and how it sets up INCLUDE_COLUMNS. Looking at the Spark code, it always sets INCLUDE_COLUMNS as long as requestedColPruneInfo is non-empty. We should do the same.
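For reference, this is roughly the shape of the Spark side after SPARK-35783 (a paraphrased sketch; the names follow Spark's OrcFileFormat, but check the Spark source for the exact code):

// Paraphrased from Apache Spark's OrcFileFormat: INCLUDE_COLUMNS is set
// whenever column ids were resolved, independent of canPruneCols.
if (resultedColPruneInfo.nonEmpty) {
  val (requestedColIds, canPruneCols) = resultedColPruneInfo.get
  // canPruneCols only influences the result schema handed to ORC,
  // not whether INCLUDE_COLUMNS gets set.
  val includeColumns = requestedColIds.filter(_ != -1).sorted.mkString(",")
  conf.set(OrcConf.INCLUDE_COLUMNS.getAttribute, includeColumns)
}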
Hi @jlowe, I reverted it. But I found a bigger problem. It seems that the setting of OrcConf.INCLUDE_COLUMNS is NOT compatible with the schema pruning applied in checkSchemaCompatibility, which leads to a failure when constructing SchemaEvolution: https://github.com/NVIDIA/spark-rapids/blob/branch-22.04/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScanBase.scala#L951.
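To illustrate the mismatch (a hypothetical example, not the plugin's code): the include flags are indexed by the file schema's flattened column ids, while checkSchemaCompatibility prunes the read schema down to the columns that actually exist in the file, so the flags and the pruned schema can disagree about which columns are in play.

import org.apache.orc.TypeDescription

// Hypothetical schemas: the read schema asks for a column `d` that the
// file does not contain, so compatibility checking prunes it from the
// read schema, while INCLUDE_COLUMNS was derived from the unpruned request.
val fileSchema = TypeDescription.fromString("struct<a:int,b:string,c:double>")
val readSchema = TypeDescription.fromString("struct<a:int,c:double,d:bigint>")
val prunedReadSchema = TypeDescription.fromString("struct<a:int,c:double>")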
checkSchemaCompatibility is also derived from Apache Spark, so does its version have the same issue with INCLUDE_COLUMNS when pruning is applied?
@jlowe I failed to find checkSchemaCompatibility or anything similar in Apache Spark. Instead, I made a change to checkSchemaCompatibility: it now prunes the include-status array by the field IDs of the fields pruned from the read schema, at the same time as the read schema itself is pruned.
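A minimal sketch of that idea (a hypothetical helper, not the actual patch): when a field is dropped from the read schema, clear the include flags for its flattened ORC id range, so the array handed to SchemaEvolution stays consistent with the pruned schema.

import org.apache.orc.TypeDescription

// Hypothetical helper: clear the include flags for a pruned field and all
// of its children, using ORC's contiguous flattened id ranges.
def pruneIncluded(included: Array[Boolean], prunedField: TypeDescription): Unit = {
  (prunedField.getId to prunedField.getMaximumId).foreach { id =>
    included(id) = false
  }
}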
build
This reverts commit 315c99c.
build
build
Signed-off-by: sperlingxx <lovedreamf@gmail.com>
Closes #3026
Following SPARK-35783, set OrcConf.INCLUDE_COLUMNS, in case it might be important for the ORC methods we call, either today or in the future.