Set OrcConf.INCLUDE_COLUMNS for ORC reading #4933

sperlingxx · 2022-03-11T02:57:29Z

Following SPARK-35783, set OrcConf.INCLUDE_COLUMNS when reading ORC files, in case the ORC methods we call depend on it, either today or in the future.

Signed-off-by: sperlingxx <lovedreamf@gmail.com>

sperlingxx · 2022-03-11T02:58:05Z

build

wbo4958 · 2022-03-11T06:30:35Z

LGTM

BTW, does our current rapids plugin read all ORC columns?

sperlingxx · 2022-03-11T07:35:02Z

> LGTM
>
> BTW, does our current rapids plugin read all ORC columns?

I don't think so. There is a specialized helper function, calcOrcFileIncluded, that excludes unnecessary columns at the file level:

    /**
     * Compute an array of booleans, one for each column in the ORC file, indicating whether the
     * corresponding ORC column ID should be included in the file to be loaded by the GPU.
     *
     * @param evolution ORC schema evolution instance
     * @return per-column inclusion flags
     */
    private def calcOrcFileIncluded(evolution: SchemaEvolution): Array[Boolean] = {
      if (requestedMapping.isDefined) {
        // ORC schema has no column names, so need to filter based on index
        val orcSchema = orcReader.getSchema
        val topFields = orcSchema.getChildren
        val numFlattenedCols = orcSchema.getMaximumId
        val included = new Array[Boolean](numFlattenedCols + 1)
        util.Arrays.fill(included, false)
        // first column is the top-level schema struct, always add it
        included(0) = true
        // find each top-level column requested by top-level index and add it and all child columns
        requestedMapping.get.foreach { colIdx =>
          val field = topFields.get(colIdx)
          (field.getId to field.getMaximumId).foreach { i =>
            included(i) = true
          }
        }
        included
      } else {
        evolution.getFileIncluded
      }
    }
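To illustrate the indexing in the function above: ORC numbers schema nodes in pre-order, so a top-level field and all of its nested children occupy the contiguous ID range from getId to getMaximumId. The sketch below reproduces that numbering and the flag computation on a toy tree. `Node`, `Numbered`, and `IncludedSketch` are hypothetical stand-ins invented for this illustration, not ORC or plugin classes.

```scala
// Hypothetical stand-in for an ORC schema node: a name plus child fields.
final case class Node(name: String, children: List[Node] = Nil)

// A node annotated with its own pre-order ID and the largest ID in its subtree.
final case class Numbered(name: String, id: Int, maxId: Int, children: List[Numbered])

object IncludedSketch {
  // Assign pre-order IDs: the node takes `start`, its children take the IDs after it.
  def number(node: Node, start: Int = 0): Numbered = {
    var next = start + 1
    val kids = node.children.map { child =>
      val numbered = number(child, next)
      next = numbered.maxId + 1
      numbered
    }
    Numbered(node.name, start, next - 1, kids)
  }

  // Same idea as the function above: always flag the root (ID 0), then flag
  // every ID in the subtree of each requested top-level column index.
  def included(root: Numbered, requestedTopLevel: Seq[Int]): Array[Boolean] = {
    val flags = new Array[Boolean](root.maxId + 1) // defaults to false
    flags(0) = true
    requestedTopLevel.foreach { idx =>
      val field = root.children(idx)
      (field.id to field.maxId).foreach(i => flags(i) = true)
    }
    flags
  }
}
```

For a schema shaped like struct<a:int,b:struct<x:int,y:string>,c:int>, the IDs are root=0, a=1, b=2, x=3, y=4, c=5, so requesting top-level index 1 (b) flags IDs 0, 2, 3, and 4.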

jlowe · 2022-03-14T13:37:22Z

sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScanBase.scala

+          // any difference. Just in case it might be important for the ORC methods called by us,
+          // either today or in the future.
+          val includeColumns = requestedColIds.filter(_ != -1).sorted.mkString(",")
+          conf.set(OrcConf.INCLUDE_COLUMNS.getAttribute, includeColumns)


Curious why this is conditional on !canPruneCols? The equivalent Spark code setting this is not similarly conditional. We're calling buildOrcReader in either case which will examine the INCLUDE_COLUMNS setting, so it seems prudent to set it whether or not we can prune, minimally to help keep this code in sync with the Spark version.
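As a rough illustration of what the added lines compute, here is a minimal sketch. It assumes that -1 marks a requested column that is absent from the file, and that "orc.include.columns" is the attribute name behind OrcConf.INCLUDE_COLUMNS; a plain Map stands in for the Hadoop Configuration, and `IncludeColumnsSketch` is a hypothetical name.

```scala
object IncludeColumnsSketch {
  // Assumption: -1 marks a requested column that does not exist in the file.
  // Drop those, sort the remaining IDs, and join them with commas.
  def includeColumnsValue(requestedColIds: Seq[Int]): String =
    requestedColIds.filter(_ != -1).sorted.mkString(",")

  // Hypothetical stand-in for Configuration.set: returns a new map with the
  // include-columns entry added. The key name is an assumption.
  def withIncludeColumns(
      conf: Map[String, String],
      requestedColIds: Seq[Int]): Map[String, String] =
    conf + ("orc.include.columns" -> includeColumnsValue(requestedColIds))
}
```

With requested IDs 2, -1, and 0, the value stored under the key would be "0,2".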

Hi @jlowe, following your question, I found that we should set this conf on the opposite conditional branch, where canPruneCols is TRUE, because canPruneCols indicates whether we trust ORC to prune columns or do the pruning ourselves.

@wbo4958 please correct me if I am wrong. Thank you!

I also simplified the requestedColumnIds method.

> I also simplified the requestedColumnIds method.

Please revert this change. The code was intentionally trying to be similar to the Apache Spark version. The same is true for the code calling this function.

> I found that we should set this conf on the opposite conditional branch, where canPruneCols is TRUE, because canPruneCols indicates whether we trust ORC to prune columns or do the pruning ourselves.

IMO we need to do what Spark is doing with respect to when and how it sets up INCLUDE_COLUMNS. Looking at the Spark code, it is always setting INCLUDE_COLUMNS as long as requestedColPruneInfo is non-empty. We should do the same.

Hi @jlowe, I reverted it, but I found a bigger problem. Setting OrcConf.INCLUDE_COLUMNS appears to be incompatible with the schema pruning applied in checkSchemaCompatibility, which causes a failure when constructing SchemaEvolution: https://github.com/NVIDIA/spark-rapids/blob/branch-22.04/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScanBase.scala#L951.

checkSchemaCompatibility is also derived from Apache Spark, so does its version also have issues with INCLUDE_COLUMNS when pruning is applied?

@jlowe I could not find checkSchemaCompatibility or anything similar in Apache Spark. Instead, I changed checkSchemaCompatibility so that it also prunes the include-status array, using the field IDs of the fields pruned from the read schema. This happens at the same time as the read schema itself is pruned.
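One possible reading of that approach, as a minimal sketch rather than the actual change in this PR: while dropping fields from the read schema, clear the include flag for every column ID in each dropped field's ID range, so the flags and the pruned schema stay consistent. `FileField` and `PruneIncludeSketch` are hypothetical names, and matching fields by name is a simplifying assumption.

```scala
// Hypothetical, simplified model of a top-level file field: its name and the
// pre-order ID range [id, maxId] covering it and all nested children.
final case class FileField(name: String, id: Int, maxId: Int)

object PruneIncludeSketch {
  // Keep only the fields named in readFields and, in the same step, clear the
  // include flag of every column ID that belongs to a dropped field.
  def prune(
      fileFields: Seq[FileField],
      include: Array[Boolean],
      readFields: Set[String]): (Seq[FileField], Array[Boolean]) = {
    val (kept, dropped) = fileFields.partition(f => readFields.contains(f.name))
    val pruned = include.clone() // leave the caller's array untouched
    for (f <- dropped; i <- f.id to f.maxId) pruned(i) = false
    (kept, pruned)
  }
}
```

For fields a (ID 1), b (IDs 2 to 4), and c (ID 5) with every flag initially true, reading only a and c leaves the flags for IDs 0, 1, and 5 set and clears IDs 2 through 4.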

sperlingxx · 2022-03-16T10:09:07Z

build

sperlingxx · 2022-03-17T02:49:35Z

build

sperlingxx · 2022-03-18T09:03:34Z

build

set_INCLUDE_COLUMNS_for_ORC_reading

8ff19fd

Signed-off-by: sperlingxx <lovedreamf@gmail.com>

sperlingxx requested review from jlowe and wbo4958 and removed request for jlowe March 11, 2022 02:57

sameerz added the audit_3.2.0 label Mar 11, 2022

sameerz added this to the Feb 28 - Mar 18 milestone Mar 11, 2022

wbo4958 previously approved these changes Mar 14, 2022

jlowe reviewed Mar 14, 2022

Merge remote-tracking branch 'origin/branch-22.04' into add_orc_conf_…

61a4f4a

…include_column

sperlingxx dismissed wbo4958’s stale review via 315c99c March 16, 2022 10:07

update

315c99c

sperlingxx requested review from wbo4958 and jlowe March 16, 2022 10:14

sperlingxx added 2 commits March 17, 2022 10:15

Revert "update"

f819fcf

This reverts commit 315c99c.

revert

2ec24c4

Signed-off-by: sperlingxx <lovedreamf@gmail.com>

update

22ae09d

jlowe approved these changes Mar 18, 2022

sperlingxx merged commit ae8f21b into NVIDIA:branch-22.04 Mar 19, 2022

sperlingxx deleted the add_orc_conf_include_column branch March 19, 2022 00:53
