
feat: Add CometRowToColumnar operator #206

Merged · 5 commits into apache:main · Apr 10, 2024

Conversation

@advancedxy (Contributor, Author) commented Mar 13, 2024

Which issue does this PR close?

This closes #119 and partially resolves #137

Rationale for this change

In the short term, this eases testing with the RangeExec operator.
In the long term, this PR introduces a general way to enable Comet with row-based source exec nodes.

What changes are included in this PR?

  1. Introduce CometRowToColumnarExec to transform Spark's InternalRow into ColumnarBatch (see the sketch after this list)
  2. CometArrowConverters and its corresponding ArrowWriters
  3. Glue code to apply the RowToColumnar transition
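To make item 1 concrete, here is a minimal sketch of the general shape of a row-to-columnar conversion: rows are appended into Arrow vectors and each filled batch is exposed as a ColumnarBatch. This is only an illustration of the idea, not this PR's implementation; the RowWriter trait stands in for an ArrowWriter-style helper, and maxRecordsPerBatch is an assumed parameter.

```scala
import scala.collection.JavaConverters._

import org.apache.arrow.vector.VectorSchemaRoot
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.vectorized.{ArrowColumnVector, ColumnarBatch, ColumnVector}

object RowToColumnarSketch {
  // Illustrative stand-in for an ArrowWriter-style helper (write/finish/reset),
  // similar in spirit to Spark's org.apache.spark.sql.execution.arrow.ArrowWriter.
  trait RowWriter {
    def write(row: InternalRow): Unit // append one row into the Arrow vectors
    def finish(): Unit                // set value counts once a batch is complete
    def reset(): Unit                 // clear the vectors before the next batch
  }

  // Convert an iterator of InternalRow into an iterator of ColumnarBatch,
  // batching up to maxRecordsPerBatch rows per Arrow VectorSchemaRoot.
  def toColumnarBatches(
      rowIter: Iterator[InternalRow],
      root: VectorSchemaRoot,
      writer: RowWriter,
      maxRecordsPerBatch: Int): Iterator[ColumnarBatch] =
    new Iterator[ColumnarBatch] {
      override def hasNext: Boolean = rowIter.hasNext

      override def next(): ColumnarBatch = {
        writer.reset()
        var count = 0
        while (rowIter.hasNext && count < maxRecordsPerBatch) {
          writer.write(rowIter.next())
          count += 1
        }
        writer.finish()
        val columns: Array[ColumnVector] =
          root.getFieldVectors.asScala.map(v => new ArrowColumnVector(v): ColumnVector).toArray
        new ColumnarBatch(columns, root.getRowCount)
      }
    }
}
```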

How are these changes tested?

  1. Added tests alongside the existing test coverage
  2. Verified the plan transition in the Spark UI

@codecov-commenter commented Mar 13, 2024

Codecov Report

Attention: Patch coverage is 15.70513%, with 263 lines in your changes missing coverage. Please review.

Project coverage is 33.36%. Comparing base (aa6ddc5) to head (fb49a88).
Report is 1 commit behind head on main.

Files Patch % Lines
...spark/sql/comet/execution/arrow/ArrowWriters.scala 0.00% 198 Missing ⚠️
...l/comet/execution/arrow/CometArrowConverters.scala 0.00% 42 Missing ⚠️
.../scala/org/apache/spark/sql/comet/util/Utils.scala 0.00% 8 Missing ⚠️
...ain/scala/org/apache/comet/vector/NativeUtil.scala 0.00% 6 Missing ⚠️
...org/apache/comet/CometSparkSessionExtensions.scala 78.57% 0 Missing and 3 partials ⚠️
...pache/spark/sql/comet/CometRowToColumnarExec.scala 90.32% 1 Missing and 2 partials ⚠️
...n/scala/org/apache/spark/sql/comet/operators.scala 33.33% 1 Missing and 1 partial ⚠️
...n/scala/org/apache/comet/vector/StreamReader.scala 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #206      +/-   ##
============================================
- Coverage     33.48%   33.36%   -0.13%     
- Complexity      776      791      +15     
============================================
  Files           108      111       +3     
  Lines         37178    37479     +301     
  Branches       8146     8192      +46     
============================================
+ Hits          12448    12503      +55     
- Misses        22107    22351     +244     
- Partials       2623     2625       +2     


@@ -78,6 +83,9 @@ object Utils {
case _: ArrowType.FixedSizeBinary => BinaryType
case d: ArrowType.Decimal => DecimalType(d.getPrecision, d.getScale)
case date: ArrowType.Date if date.getUnit == DateUnit.DAY => DateType
case ts: ArrowType.Timestamp
@advancedxy (Contributor, Author):

Looks like the TimestampNTZType case is missed.

@advancedxy marked this pull request as ready for review on Mar 15, 2024
@advancedxy changed the title from "[WIP] feat: Add CometRowToColumnar operator" to "feat: Add CometRowToColumnar operator" on Mar 15, 2024
import org.apache.spark.sql.errors.QueryExecutionErrors
import org.apache.spark.sql.types._

private[arrow] object ArrowWriter {
@advancedxy (Contributor, Author):

This is mostly copied from Spark's side. Since we are shading Arrow in Comet, we cannot use Spark's code directly.

Member:

It'd be useful to add a comment noting which Spark class this comes from, to help maintain it in the future.

@advancedxy (Contributor, Author):

Of course, let me add some comments.

@advancedxy (Contributor, Author) commented:
Gently ping @sunchao and @viirya

@sunchao (Member) commented Mar 22, 2024:

Sorry for the delay @advancedxy . I'll try to take a look soon.

@sunchao (Member) left a comment:

Overall this looks good, I just have some minor comments so far.


Review comment on common/src/main/scala/org/apache/comet/CometConf.scala (outdated, resolved)
case (BinaryType, vector: LargeVarBinaryVector) => new LargeBinaryWriter(vector)
case (DateType, vector: DateDayVector) => new DateWriter(vector)
case (TimestampType, vector: TimeStampMicroTZVector) => new TimestampWriter(vector)
case (TimestampNTZType, vector: TimeStampMicroVector) => new TimestampNTZWriter(vector)
Member:

I think this is not compatible with Spark 3.2?

@advancedxy (Contributor, Author):

Hmmm, although TimestampNTZType support was removed in Spark 3.2 by this PR: https://github.com/apache/spark/pull/33444, it's still possible to reference the TimestampNTZType type, which makes the code a lot cleaner for later versions of Spark: we don't have to add version-specific directories for Spark 3.2/3.3/3.4, etc.

Per my understanding, since Spark 3.2 will not produce a schema with TimestampNTZType, this pattern match case is effectively a no-op there, while it does take effect for Spark 3.3 and Spark 3.4.

try {
if (!closed) {
if (currentBatch != null) {
arrowWriter.reset()
Member:

Do we need to close arrowWriter too? For example, to close all the ValueVectors in the writer.

@advancedxy (Contributor, Author):

There's no close method on arrowWriter. Closing the ColumnarBatch closes all the ValueVectors, which is already achieved by root.close.

}

override def hasNext: Boolean = rowIter.hasNext || {
close(false)
Member:

I wonder why we need to call close here and whether just calling close in the TaskCompletionListener is sufficient. Will this iterator be used again once it drains all the rows from the input iterator?

@advancedxy (Contributor, Author):

> I wonder why we need to call close here and whether just calling close in the TaskCompletionListener is sufficient.

It might not be sufficient to close in the TaskCompletionListener, as the task might live much longer than the iterator. For example, if a task contains Range -> CometRowToColumnar -> Sort -> ShuffleWrite, the CometRowToColumnar will be drained much earlier than Sort or ShuffleWrite. So we need to close the iterator earlier to make sure there is no memory buffering/leaking.

However, due to the comments in https://github.com/apache/arrow-datafusion-comet/pull/206/files#diff-04037044481f9a656275a63ebb6a3a63badf866f19700d4a6909d2e17c8d7b72R37-R46, we cannot close the allocator right after the iterator is consumed: the data is already exported to the native side and might be dropped later than the iterator consumption. So I added allocator.close in the task completion callback.
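A minimal sketch of the lifecycle described above (illustrative names and structure, not this PR's exact code): per-batch resources are released as soon as the input iterator drains, while the Arrow allocator is closed only in the task completion listener because its buffers may still be referenced by the native side.

```scala
import org.apache.arrow.memory.BufferAllocator
import org.apache.spark.TaskContext
import org.apache.spark.sql.vectorized.ColumnarBatch

// Illustrative lifecycle sketch: free batch-level resources when the rows are
// drained, but close the allocator only when the task completes.
abstract class RowToColumnarIterator(
    rowIter: Iterator[_],
    allocator: BufferAllocator) extends Iterator[ColumnarBatch] {

  private var closed = false

  // Register allocator cleanup with the task, independent of iterator draining.
  Option(TaskContext.get()).foreach { tc =>
    tc.addTaskCompletionListener[Unit](_ => close(closeAllocator = true))
  }

  override def hasNext: Boolean = rowIter.hasNext || {
    // Input drained: release batch-level resources early instead of waiting for
    // the (potentially much longer-lived) task to finish.
    close(closeAllocator = false)
    false
  }

  // next() omitted: it would build the next ColumnarBatch from rowIter.

  protected def closeCurrentBatch(): Unit // e.g. root.close() in the real code

  protected def close(closeAllocator: Boolean): Unit = {
    if (!closed) {
      closeCurrentBatch()
      closed = true
    }
    if (closeAllocator) {
      allocator.close()
    }
  }
}
```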

Member:

I see, makes sense.

}
new StructWriter(vector, children.toArray)
case (NullType, vector: NullVector) => new NullWriter(vector)
case (_: YearMonthIntervalType, vector: IntervalYearVector) =>
Member:

I think these are not supported yet for dictionary vectors; see the minor type check in CometDictionary.

@advancedxy (Contributor, Author):

Hmm, I believe this ArrowWriter doesn't produce dictionary-encoded ColumnVectors.
But I do believe we should match how org.apache.spark.sql.comet.util.Utils#toArrowType maps Spark types to Arrow types. Let me comment out the following pattern match cases.

@advancedxy (Contributor, Author) commented:
@sunchao would you mind taking a look at this again? I should have addressed most of your comments; please let me know if you have any others.

And sorry for the late update, I wasn't feeling well last week.

@sunchao (Member) left a comment:

Sorry for being late on the review, @advancedxy. Overall LGTM. Please rebase the PR.


"spark.comet.rowToColumnar.enabled")
.internal()
.doc("Whether to enable row to columnar conversion in Comet. When this is turned on, " +
"Comet will convert row-based operators in spark.comet.rowToColumnar.sourceNodeList into " +
@viirya (Member) commented Apr 8, 2024:

Suggested change
- "Comet will convert row-based operators in spark.comet.rowToColumnar.sourceNodeList into " +
+ "Comet will convert row-based data scan operators in spark.comet.rowToColumnar.sourceNodeList into " +

@advancedxy (Contributor, Author) commented Apr 8, 2024:

Hmm, actually CometRowToColumnar is general enough that it can be used to convert any row-based operator to a columnar one.
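For reference, a hedged usage example with the two config keys from this PR; the sourceNodeList value shown is illustrative, not necessarily a supported default:

```scala
// Hypothetical usage in a spark-shell with Comet enabled. The config keys are
// the ones introduced by this PR; the sourceNodeList value is illustrative.
spark.conf.set("spark.comet.rowToColumnar.enabled", "true")
spark.conf.set("spark.comet.rowToColumnar.sourceNodeList", "Range")

// A Range-based query whose source node would then be wrapped in CometRowToColumnarExec.
spark.range(0, 1000000).selectExpr("id % 10 AS k").groupBy("k").count().show()
```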

@@ -238,6 +239,11 @@ class CometSparkSessionExtensions
val nativeOp = QueryPlanSerde.operator2Proto(op).get
CometScanWrapper(nativeOp, op)

case op if shouldApplyRowToColumnar(conf, op) =>
val cometOp = CometRowToColumnarExec(op)
val nativeOp = QueryPlanSerde.operator2Proto(cometOp).get
Member:

Hmm, is the source always able to be converted to Comet scan? If there are unsupported types, it will return None.

@advancedxy (Contributor, Author):

shouldApplyRowToColumnar already checks the output type of op. Only supported types are allowed.
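A rough, hypothetical sketch of what such a gating check could look like (helper names and the exact supported-type list are assumptions; the real logic is shouldApplyRowToColumnar in CometSparkSessionExtensions):

```scala
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.types._

// Hypothetical sketch of the gating logic referenced above: the operator must
// be in the configured source-node list and all of its output attributes must
// have types that Comet supports.
object RowToColumnarGatingSketch {
  def isSupportedType(dt: DataType): Boolean = dt match {
    case BooleanType | ByteType | ShortType | IntegerType | LongType |
        FloatType | DoubleType | StringType | BinaryType | DateType |
        TimestampType | _: DecimalType => true
    case _ => false
  }

  def shouldApply(sourceNodeList: Seq[String], op: SparkPlan): Boolean =
    sourceNodeList.contains(op.nodeName) &&
      op.output.forall(attr => isSupportedType(attr.dataType))
}
```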

Comment on lines 587 to 607
// 2. Consecutive operators of CometRowToColumnarExec and ColumnarToRowExec, which might be
// possible for Comet to add a `CometRowToColumnarExec` for row-based operators first, then
// Spark only requests row-based output.
@viirya (Member) commented Apr 8, 2024:

Do you actually mean:

> Comet adds `CometRowToColumnarExec` on top of row-based data scan operators, but the downstream operator is a Spark operator which takes row-based input. So Spark adds another `ColumnarToRowExec` after `CometRowToColumnarExec`. In this case, we remove the pair of `CometRowToColumnarExec` and `ColumnarToRowExec`.

@advancedxy (Contributor, Author):

> but the downstream operator is a Spark operator which takes row-based input

Hmm, this is another possibility; let me update the comment to include it.
The case I described above is that Spark requests row-based output at the end of the plan; the row-based requirement might be passed down to the CometRowToColumnarExec, and then we have a pair of CometRowToColumnarExec and ColumnarToRowExec.

@advancedxy (Contributor, Author):

I refined the comments; hopefully that clarifies things. Please let me know if you have any other comments.
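A minimal sketch of the pair removal being discussed (a hypothetical rule shape; the PR's actual rule also covers the related transitions described in the refined comments):

```scala
import org.apache.spark.sql.comet.CometRowToColumnarExec
import org.apache.spark.sql.execution.{ColumnarToRowExec, SparkPlan}

// Illustrative only: collapse a ColumnarToRowExec sitting directly on top of a
// CometRowToColumnarExec, since the row -> columnar -> row round trip is a no-op.
object RemoveRedundantTransitionsSketch {
  def apply(plan: SparkPlan): SparkPlan = plan.transformUp {
    case ColumnarToRowExec(CometRowToColumnarExec(child)) => child
  }
}
```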

@sunchao merged commit 60fe431 into apache:main on Apr 10, 2024
28 checks passed
@sunchao (Member) commented Apr 10, 2024:

Merged, thanks @advancedxy ! If any comments from @viirya are not addressed, we can do it in a separate PR. This PR has been open for too long :)

@advancedxy (Contributor, Author) commented:

> If any comments from @viirya are not addressed, we can do it in a separate PR.

Of course.

Thanks to @sunchao and @viirya for the review, really appreciate it.

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024
Successfully merging this pull request may close these issues: "Add support for InMemoryRelation" and "Add CometRowToColumnar operator".
4 participants