
feat: Support Broadcast HashJoin #211

Merged: viirya merged 16 commits into apache:main on Mar 26, 2024
Conversation

@viirya (Member) commented Mar 16, 2024

Which issue does this PR close?

Closes #202.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

@viirya (Member, Author) left a comment:

This is based on #194.

@viirya force-pushed the broadcast_hash_join branch 3 times, most recently from c3ed3ee to 1b50512, on March 16, 2024 at 18:54
Comment on lines +39 to +45
// Release the previous batch.
// If it is not released, the Arrow library will complain about a memory leak
// when the reader is closed.
if (currentBatch != null) {
  currentBatch.close()
}

@viirya (Member, Author):

We need to release the batch before loading the next one, because ArrowStreamReader internally loads data into the same vectors of the root. After loading the next batch, close would release the just-loaded batch instead of the previous one.
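
For illustration, a minimal sketch of the pattern described above (not the actual ArrowReaderIterator code; inputStream and the toColumnarBatch helper are hypothetical stand-ins): ArrowStreamReader keeps refilling the vectors of a single VectorSchemaRoot, so the wrapper around the previous batch must be closed before loadNextBatch() overwrites them.

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.ipc.ArrowStreamReader
import org.apache.spark.sql.vectorized.ColumnarBatch

val allocator = new RootAllocator(Long.MaxValue)
val reader = new ArrowStreamReader(inputStream, allocator) // inputStream: hypothetical source
var currentBatch: ColumnarBatch = null

while ({
  // Close the wrapper over the shared vectors *before* they are refilled;
  // closing after loadNextBatch() would release the newly loaded data instead.
  if (currentBatch != null) currentBatch.close()
  reader.loadNextBatch()
}) {
  // The reader always returns the same, now refilled, VectorSchemaRoot.
  val root = reader.getVectorSchemaRoot
  currentBatch = toColumnarBatch(root) // hypothetical helper wrapping the root's vectors
  // ... consume currentBatch ...
}
reader.close()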

Member:

Is this related to the memory leak we saw?

Contributor:

> ArrowStreamReader internally loads data into the same vectors of the root. After loading the next batch, close would release the just-loaded batch instead of the previous one.

This sounds like a data corruption problem. If the just-loaded batch is closed/released, wouldn't the just-loaded ColumnarBatch be corrupted? But it seems that the CI passed without any issue previously.

While working on #206, I also found that Arrow Java's memory API can be inconvenient to use. It requires extra caution to allocate and release ArrowBuf correctly.
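
As a rough illustration of that caution (a generic sketch, not code from #206 or this PR): every ArrowBuf taken from an allocator must be released explicitly, otherwise the allocator complains about outstanding allocations when it is closed.

import org.apache.arrow.memory.{ArrowBuf, BufferAllocator, RootAllocator}

val allocator: BufferAllocator = new RootAllocator(Long.MaxValue)
val buf: ArrowBuf = allocator.buffer(1024) // the caller now owns one reference
try {
  buf.setLong(0, 42L)
} finally {
  // Reference counting is manual: forgetting the release makes the
  // allocator report the buffer as leaked when it is closed.
  buf.getReferenceManager.release()
  allocator.close()
}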

@viirya (Member, Author):

> Is this related to the memory leak we saw?

It's not, although I suspected that before too. For shuffle, a channel only contains one batch, so ArrowReaderIterator doesn't hit this issue.

@viirya (Member, Author):

> This sounds like a data corruption problem. If the just-loaded batch is closed/released, wouldn't the just-loaded ColumnarBatch be corrupted? But it seems that the CI passed without any issue previously.
>
> While working on #206, I also found that Arrow Java's memory API can be inconvenient to use. It requires extra caution to allocate and release ArrowBuf correctly.

Due to #211 (comment), this issue was not exposed before.

I feel that the Arrow Java API is hard to use and somewhat counter-intuitive, especially compared with arrow-rs.

Member:

Yes, I feel the same pain when using Java Arrow. I think in the long term we'd better switch away from it. It should be relatively easy, except for the Java Arrow Flight feature.

@codecov-commenter commented Mar 16, 2024

Codecov Report

Attention: Patch coverage is 49.33333%, with 38 lines in your changes missing coverage. Please review.

Project coverage is 33.38%. Comparing base (ce63ff8) to head (de73821).

Files Patch % Lines
...n/scala/org/apache/spark/sql/comet/operators.scala 60.71% 2 Missing and 9 partials ⚠️
...org/apache/comet/CometSparkSessionExtensions.scala 61.53% 4 Missing and 6 partials ⚠️
.../java/org/apache/comet/CometArrowStreamWriter.java 0.00% 6 Missing ⚠️
...ain/scala/org/apache/comet/vector/NativeUtil.scala 0.00% 5 Missing ⚠️
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 42.85% 1 Missing and 3 partials ⚠️
.../comet/execution/shuffle/ArrowReaderIterator.scala 0.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #211      +/-   ##
============================================
+ Coverage     33.32%   33.38%   +0.06%     
- Complexity      769      776       +7     
============================================
  Files           107      108       +1     
  Lines         37037    37099      +62     
  Branches       8106     8129      +23     
============================================
+ Hits          12342    12386      +44     
- Misses        22098    22099       +1     
- Partials       2597     2614      +17     


@viirya (Member, Author) commented Mar 21, 2024

cc @sunchao

Comment on a second diff hunk:

var rowCount = 0

batches.foreach { batch =>
  val (fieldVectors, batchProviderOpt) = getBatchFieldVectors(batch)
  val root = schemaRoot.getOrElse(new VectorSchemaRoot(fieldVectors.asJava))
  val root = new VectorSchemaRoot(fieldVectors.asJava)
  val provider = batchProviderOpt.getOrElse(dictionaryProvider)
@advancedxy (Contributor) commented Mar 22, 2024:

One related question: what if incoming batches have different dictionary providers?

@viirya (Member, Author):

If the batch has its own provider, it should be returned in batchProviderOpt, shouldn't it?

Contributor:

But the writer is reused. Once the writer is created, a new dictionary provider from later batches (if different from the previous one) is never used or written?

@viirya (Member, Author):

Oh, I see. I assume the dictionary provider is the same across batches. That seems to be the reason a dictionary provider exists in the first place, i.e. to store the dictionary values for arrays/batches.

Contributor:

Hmm. It seems getBatchFieldVectors only checks that the dictionary provider is the same across arrays, but not across batches. Maybe we should add that too? Anyway, it's somewhat out of this PR's scope; maybe a separate issue to track it.
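
For reference, a hedged sketch of what such a cross-batch check could look like, reusing the batches and getBatchFieldVectors names from the snippet above (illustrative only, not code from this PR):

import org.apache.arrow.vector.dictionary.DictionaryProvider

// Remember the provider handed back by the first batch and require every
// later batch to return the same instance before it is written.
var seenProvider: Option[DictionaryProvider] = None

batches.foreach { batch =>
  val (_, batchProviderOpt) = getBatchFieldVectors(batch)
  batchProviderOpt.foreach { provider =>
    seenProvider match {
      case Some(existing) if existing ne provider =>
        throw new IllegalStateException(
          "All batches are expected to share the same dictionary provider")
      case None => seenProvider = Some(provider)
      case _ => // same provider as before, nothing to do
    }
  }
  // ... build the VectorSchemaRoot and write the batch as before ...
}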

@advancedxy (Contributor) left a comment:

LGTM


@viirya (Member, Author) commented Mar 26, 2024

@sunchao any more comments?

@sunchao (Member) left a comment:

LGTM. Can we merge #210 first and trigger another CI run?

@viirya (Member, Author) commented Mar 26, 2024

Let me take a look at #210 tomorrow (it's late today; I might not be able to finish it).

I'll hold this until #210 is merged.

@viirya merged commit 0826772 into apache:main on Mar 26, 2024 (28 checks passed).
@viirya (Member, Author) commented Mar 26, 2024

Merged. Thanks.

wangyum pushed a commit to wangyum/datafusion-comet that referenced this pull request Mar 28, 2024
* feat: Support HashJoin

* Add comment

* Clean up test

* Fix join filter

* Fix clippy

* Use consistent function with sort merge join

* Add note about left semi and left anti joins

* feat: Support BroadcastHashJoin

* Move tests

* Remove unused import

* Add function to parse join parameters

* Remove duplicate code

* For review
snmvaughan pushed a commit to snmvaughan/arrow-datafusion-comet that referenced this pull request Apr 4, 2024
Successfully merging this pull request may close these issues.

Support BroadcastHashJoinExec