Improve over-estimating for ORC coalescing reading #3275

Merged: 1 commit, Aug 24, 2021

29 changes: 25 additions & 4 deletions sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOrcScan.scala
@@ -1602,6 +1602,19 @@ class MultiFileOrcPartitionReader(
   implicit def toOrcExtraInfo(in: ExtraInfo): OrcExtraInfo =
     in.asInstanceOf[OrcExtraInfo]

+  // Estimate the size of StripeInformation with the worst case.
+  // The serialized size may be different because of the different values.
+  // Here set most of values to "Long.MaxValue" to get the worst case.
+  lazy val sizeOfStripeInformation = {
+    OrcProto.StripeInformation.newBuilder()
+      .setOffset(Long.MaxValue)
+      .setIndexLength(0) // Index stream is pruned
+      .setDataLength(Long.MaxValue)
+      .setFooterLength(Int.MaxValue) // StripeFooter size should be small
+      .setNumberOfRows(Long.MaxValue)
+      .build().getSerializedSize
+  }
+
   // The runner to copy stripes to the offset of HostMemoryBuffer and update
   // the StripeInformation to construct the file Footer
   class OrcCopyStripesRunner(
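
Review note on the hunk above: setting fields to `Long.MaxValue` yields a worst case because protobuf encodes integers as varints, so larger values take more bytes. The following is a minimal sketch of that arithmetic without the protobuf dependency. It assumes that the five `StripeInformation` fields set in the hunk are `uint64` fields numbered 1 to 5 in ORC's protobuf schema, and that an explicitly set zero is still serialized; the names `StripeInfoSizeSketch`, `varintSize`, `tagSize` and `fieldSize` are hypothetical and not part of the plugin.

```scala
object StripeInfoSizeSketch {
  // Bytes needed to encode a value as a protobuf varint (7 payload bits per byte).
  def varintSize(value: Long): Int = {
    var v = value
    var n = 1
    while ((v & ~0x7FL) != 0L) {
      n += 1
      v >>>= 7
    }
    n
  }

  // A field tag is (fieldNumber << 3) | wireType, itself a varint; wire type 0 means varint.
  def tagSize(fieldNumber: Int): Int = varintSize(fieldNumber.toLong << 3)

  // Size of one varint field: tag bytes plus value bytes.
  def fieldSize(fieldNumber: Int, value: Long): Int =
    tagSize(fieldNumber) + varintSize(value)

  // Assumed field numbers: offset=1, indexLength=2, dataLength=3,
  // footerLength=4, numberOfRows=5 (an assumption, see the note above).
  val worstCase: Int =
    fieldSize(1, Long.MaxValue) +          // offset
      fieldSize(2, 0L) +                   // indexLength, index stream is pruned
      fieldSize(3, Long.MaxValue) +        // dataLength
      fieldSize(4, Int.MaxValue.toLong) +  // footerLength
      fieldSize(5, Long.MaxValue)          // numberOfRows

  def main(args: Array[String]): Unit =
    println(s"worst-case StripeInformation size: $worstCase bytes")
}
```

Under these assumptions the per-stripe overhead is a few dozen bytes, which is small next to the stripe data itself, so charging the worst case for every stripe costs little.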
@@ -1712,12 +1725,20 @@ class MultiFileOrcPartitionReader(
       stripes.foreach { stripeMeta =>
         // account for the size of every stripe including index + data + stripe footer
         size += stripeMeta.getBlockSize
+
+        // add StripeInformation size in advance which should be calculated in Footer
+        size += sizeOfStripeInformation
       }
-      // the ctx is the same for all stripes
-      val ctx = stripes(0).ctx
-      // the original file's footer should be worst-case
-      size += ctx.fileTail.getPostscript.getFooterLength
     }
+
+    val blockIter = filesAndBlocks.valuesIterator
+    if (blockIter.hasNext) {
+      val blocks = blockIter.next()
+
+      // add the first orc file's footer length to cover ORC schema and other information
+      size += blocks(0).ctx.fileTail.getPostscript.getFooterLength
+    }
+
     // Per ORC v1 spec, the size of Postscript must be less than 256 bytes.
     size += 256
     // finally the single-byte postscript length at the end of the file
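
Review note on the second hunk: the estimate it builds can be sketched as a pure function over simplified metadata. This is only an illustration of the terms visible in the diff, not the plugin's implementation. `StripeMeta`, `OrcFileMeta`, `estimate` and `CoalescedOrcSizeSketch` are hypothetical stand-ins, and anything accumulated in `size` before the hunk starts is left out.

```scala
object CoalescedOrcSizeSketch {
  // Hypothetical stand-ins for the plugin's stripe and file metadata.
  final case class StripeMeta(blockSize: Long)
  final case class OrcFileMeta(footerLength: Long, stripes: Seq[StripeMeta])

  // Upper bound on the postscript size used in the diff (ORC v1: below 256 bytes).
  val MaxPostscriptSize = 256L
  // The single byte at the end of the file that stores the postscript length.
  val PostscriptLengthByte = 1L

  def estimate(files: Seq[OrcFileMeta], sizeOfStripeInformation: Long): Long = {
    // Every stripe contributes its block size plus one worst-case StripeInformation entry.
    val stripesSize =
      files.flatMap(_.stripes).map(_.blockSize + sizeOfStripeInformation).sum
    // Only the first file's footer length is counted, covering the schema and other metadata.
    val footerSize = files.headOption.map(_.footerLength).getOrElse(0L)
    stripesSize + footerSize + MaxPostscriptSize + PostscriptLengthByte
  }
}
```

Because the per-stripe `StripeInformation` entries are now charged explicitly, a single footer length is enough to cover the remaining footer content, which is where the reduction in over-estimation comes from when many files are coalesced.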