Add support for Spilling of Distinct Aggregations #7791

Open: wants to merge 1 commit into base: main

aditi-pandit
Collaborator

@aditi-pandit aditi-pandit commented Nov 29, 2023

HashAggregation supports spilling in general. However, it did not support spilling for DISTINCT aggregations, e.g. a SQL query like:

SELECT c1, count(DISTINCT c0) FROM tmp GROUP BY c1

Distinct aggregations work by capturing the column parameter values in a SetAccumulator to track only the DISTINCT values. The outer aggregation is computed over the contents of these SetAccumulators.
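
As a rough illustration of the idea above, here is a minimal sketch of a set accumulator for a fixed-width column. It is a simplified stand-in, not Velox's SetAccumulator: the names `SimpleSetAccumulator`, `addValue`, and `countDistinct` are hypothetical, and the layout (value to first-seen index, plus an optional null index) is an assumption based on this description.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-in for a fixed-width set accumulator.
// Not Velox's SetAccumulator; names and layout are illustrative only.
struct SimpleSetAccumulator {
  // Maps each distinct non-null value to the position at which it was
  // first seen among the distinct entries.
  std::unordered_map<int64_t, int32_t> uniqueValues;
  // Position of the first null, if any null was added.
  std::optional<int32_t> nullIndex;

  int32_t size() const {
    return static_cast<int32_t>(uniqueValues.size()) +
        (nullIndex.has_value() ? 1 : 0);
  }

  void addValue(std::optional<int64_t> value) {
    const int32_t nextIndex = size();
    if (!value.has_value()) {
      if (!nullIndex.has_value()) {
        nullIndex = nextIndex;
      }
      return;
    }
    // emplace leaves the map unchanged when the value is already present.
    uniqueValues.emplace(*value, nextIndex);
  }
};

// The outer aggregation for count(DISTINCT c0) within one group.
// SQL count ignores nulls, so only non-null distinct values are counted.
int64_t countDistinct(const std::vector<std::optional<int64_t>>& values) {
  SimpleSetAccumulator accumulator;
  for (const auto& value : values) {
    accumulator.addValue(value);
  }
  assert(accumulator.size() <= static_cast<int32_t>(values.size()) + 1);
  return static_cast<int64_t>(accumulator.uniqueValues.size());
}
```

The outer aggregate only ever sees the deduplicated contents, which is why spilling has to persist the whole set rather than a single partial result.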

To support spilling in this context, multiple changes are needed:

  • Enhance the RowContainer/GroupingSet code that extracts spill data and rebuilds aggregation contents from the spill so that it also handles DistinctAggregations.
  • Add capabilities to the SetAccumulators to extract spill contents and to rebuild accumulators from a previous spill.
  • SetAccumulators come in 3 types: i) fixed-width types, ii) string types, iii) complex types. Each of these is enhanced to extract a spill and to add from spill contents. Spill extraction serializes the set accumulator contents to a blob, in the order of the indices captured from the input streams. String SetAccumulators serialize (length + contents) for each entry, and ComplexType SetAccumulators do the same.
  • Serializing ComplexTypes uses an AddressableNonNullValueList structure.
    -- This class needed new methods to copy what appears externally as a contiguous set of bytes into a ByteStream, so that a spill can be extracted from it.
    -- A new method was also added to append a stream of previously serialized contents from the spill.
    -- Since the length of each ComplexType entry in the SetAccumulator must be serialized, the append method was enhanced to return the number of bytes it serialized.
  • Exhaustive tests for the SetAccumulator spill extraction and reconstruction are added.
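
The (length + contents) layout for string entries described above could look roughly like the following round trip. This is a sketch under assumptions, not the code in this PR: `StringSet`, `serializeStrings`, and `deserializeStrings` are hypothetical names, the length is assumed to be a 4-byte integer, indices are assumed to be dense (0 to n-1), and null handling is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical container: distinct string -> index in arrival order.
using StringSet = std::unordered_map<std::string, int32_t>;

// Writes each entry as [int32 length][bytes], ordered by index.
// Assumes indices are dense, i.e. exactly 0 .. set.size() - 1.
std::string serializeStrings(const StringSet& set) {
  std::vector<const std::string*> ordered(set.size(), nullptr);
  size_t totalBytes = 0;
  for (const auto& [value, index] : set) {
    assert(index >= 0 && static_cast<size_t>(index) < set.size());
    ordered[index] = &value;
    totalBytes += sizeof(int32_t) + value.size();
  }

  std::string blob(totalBytes, '\0');
  size_t offset = 0;
  for (const std::string* value : ordered) {
    const int32_t length = static_cast<int32_t>(value->size());
    std::memcpy(blob.data() + offset, &length, sizeof(length));
    offset += sizeof(length);
    std::memcpy(blob.data() + offset, value->data(), value->size());
    offset += value->size();
  }
  return blob;
}

// Rebuilds the set from a blob produced by serializeStrings. Indices are
// reassigned in read order, which matches the original order. The blob is
// trusted here; a production reader would validate lengths.
StringSet deserializeStrings(const std::string& blob) {
  StringSet set;
  size_t offset = 0;
  while (offset < blob.size()) {
    int32_t length = 0;
    std::memcpy(&length, blob.data() + offset, sizeof(length));
    offset += sizeof(length);
    const int32_t index = static_cast<int32_t>(set.size());
    set.emplace(std::string(blob.data() + offset, length), index);
    offset += length;
  }
  return set;
}
```

Writing entries in index order means the reader can reassign indices simply by counting, so the indices themselves do not need to be stored.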


netlify bot commented Nov 29, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 37751f9
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/671bdf123c99940009b58156

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Nov 29, 2023
@aditi-pandit aditi-pandit marked this pull request as draft November 29, 2023 16:45
@aditi-pandit aditi-pandit force-pushed the spill_distinct branch 5 times, most recently from 8941cdc to ef8c5f6 Compare November 30, 2023 07:26
@aditi-pandit aditi-pandit force-pushed the spill_distinct branch 7 times, most recently from 7822210 to 641ff8c Compare January 8, 2024 18:02
@aditi-pandit aditi-pandit marked this pull request as ready for review January 8, 2024 18:13
@mbasmanova
Contributor

@aditi-pandit Aditi, I see CI failures. Is this PR ready for review? How is it different from #8287 ?

@aditi-pandit
Collaborator Author

> @aditi-pandit Aditi, I see CI failures. Is this PR ready for review? How is it different from #8287 ?

@mbasmanova : This PR is ready for review. Looking at the test failure.

The code in this PR seemed simpler in both the DistinctAggregations and serialization logic for maintaining lengths. Also has more tests for ComplexType serialization. My apologies to @supermem613, this had been in the works for some time and I should've linked it in the issue.

@aditi-pandit aditi-pandit force-pushed the spill_distinct branch 5 times, most recently from a50bf39 to 75a6721 Compare January 9, 2024 06:20
@mbasmanova
Contributor

@aditi-pandit

> The code in this PR seemed simpler in both the DistinctAggregations and serialization logic for maintaining lengths.

Would you clarify a bit more how this PR works and how it is different / better than the other PR? The PR description doesn't have any details, and it will take a significant amount of time and effort to reverse engineer the design from the code itself.

@aditi-pandit aditi-pandit force-pushed the spill_distinct branch 2 times, most recently from 60a3b98 to 3b99863 Compare January 10, 2024 18:31
@aditi-pandit
Collaborator Author

> @aditi-pandit
>
> > The code in this PR seemed simpler in both the DistinctAggregations and serialization logic for maintaining lengths.
>
> Would you clarify a bit more how this PR works and how it is different / better than the other PR? The PR description doesn't have any details, and it will take a significant amount of time and effort to reverse engineer the design from the code itself.

@mbasmanova : Have added a detailed description of the changes I made.

@aditi-pandit
Collaborator Author

@mbasmanova : Have addressed your review comments. PTAL.

Contributor

@mbasmanova mbasmanova left a comment

@aditi-pandit Aditi, thank you for adding a description. This is very helpful. It sounds like it would be helpful to extract changes to SetAccumulator and AddressableNonNullValueList into 2 separate PRs.

void addNullIndex(const char* buffer) {
VELOX_CHECK(!nullIndex.has_value());
vector_size_t serializedNullIndex;
memcpy(&serializedNullIndex, buffer, kVectorSizeT);
Contributor

There are InputByteStream and OutputByteStream classes in velox/common/base/IOUtils.h which can be used here to make the code easier to read.

Collaborator Author

I was able to use InputByteStream for the deserialization functions. But since serialization (the write path) uses random access, I didn't use OutputByteStream.
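
To make the read/write asymmetry concrete, here is a minimal sketch. It does not use the Velox classes: `SequentialReader` is a hypothetical stand-in for a sequential byte-stream reader, and `serializeByIndex` reflects one reading of "random access during serialization", namely that the set is iterated in hash order while each value is written to the slot given by its index.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

// Hypothetical sequential reader; a simplified stand-in, not Velox's
// InputByteStream. Each read advances an internal offset.
class SequentialReader {
 public:
  explicit SequentialReader(const char* data) : data_(data) {}

  template <typename T>
  T read() {
    T value;
    std::memcpy(&value, data_ + offset_, sizeof(T));
    offset_ += sizeof(T);
    return value;
  }

 private:
  const char* data_;
  size_t offset_{0};
};

// Write path: the map is iterated in arbitrary (hash) order, and each
// value is placed at the slot for its index, so the writes are not
// sequential. Assumes indices are dense, 0 .. set.size() - 1.
std::vector<char> serializeByIndex(
    const std::unordered_map<int64_t, int32_t>& set) {
  std::vector<char> buffer(set.size() * sizeof(int64_t));
  for (const auto& [value, index] : set) {
    assert(index >= 0 && static_cast<size_t>(index) < set.size());
    std::memcpy(
        buffer.data() + index * sizeof(int64_t), &value, sizeof(value));
  }
  return buffer;
}

// Read path: values come back strictly in order, so a sequential
// reader is sufficient.
std::vector<int64_t> readAll(const std::vector<char>& buffer) {
  SequentialReader reader(buffer.data());
  std::vector<int64_t> values(buffer.size() / sizeof(int64_t));
  for (auto& value : values) {
    value = reader.read<int64_t>();
  }
  return values;
}
```

A sequential writer would fit the write path only if the entries were first sorted by index, which costs an extra pass or extra memory.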

velox/exec/SetAccumulator.h (outdated, resolved)
HashStringAllocator* /*allocator*/) {
VELOX_CHECK(!vector.isNullAt(i));

// The serialized value is the nullOffset (kNoNullIndex if no null is
Contributor

It would be nice to document the serialization format in one place and avoid repeating it in comments.

// The serialized value is the nullOffset (kNoNullIndex if no null is
// present) followed by the unique values ordered by index.
auto serialized = vector.valueAt(i);
auto size = serialized.size();
Contributor

Perhaps include the number of entries in the serialized data for extra safety.
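
One way to read this suggestion is sketched below: the blob carries the null index, then an explicit entry count, then the values, and the reader checks the count against the blob size before trusting it. Everything here is an assumption for illustration. `DecodedSet`, `encodeSet`, `decodeSet`, the sentinel value -1, and the 4-byte header fields are hypothetical and do not come from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical sentinel meaning "no null present"; the actual value of
// kNoNullIndex in the PR is not shown in this conversation.
constexpr int32_t kNoNullIndexSketch = -1;

struct DecodedSet {
  std::optional<int32_t> nullIndex;
  std::vector<int64_t> values; // unique values ordered by index
};

// Layout: [int32 nullIndex][int32 numValues][int64 values...].
std::string encodeSet(const DecodedSet& set) {
  const int32_t nullIndex = set.nullIndex.value_or(kNoNullIndexSketch);
  const int32_t numValues = static_cast<int32_t>(set.values.size());
  const size_t valueBytes = set.values.size() * sizeof(int64_t);

  std::string blob(2 * sizeof(int32_t) + valueBytes, '\0');
  char* out = blob.data();
  std::memcpy(out, &nullIndex, sizeof(nullIndex));
  std::memcpy(out + sizeof(int32_t), &numValues, sizeof(numValues));
  if (valueBytes > 0) {
    std::memcpy(out + 2 * sizeof(int32_t), set.values.data(), valueBytes);
  }
  return blob;
}

// Rejects blobs whose size disagrees with the recorded entry count.
DecodedSet decodeSet(const std::string& blob) {
  constexpr size_t kHeaderBytes = 2 * sizeof(int32_t);
  if (blob.size() < kHeaderBytes) {
    throw std::runtime_error("blob too short for header");
  }
  int32_t nullIndex = 0;
  int32_t numValues = 0;
  std::memcpy(&nullIndex, blob.data(), sizeof(nullIndex));
  std::memcpy(&numValues, blob.data() + sizeof(int32_t), sizeof(numValues));
  if (numValues < 0 ||
      blob.size() !=
          kHeaderBytes + static_cast<size_t>(numValues) * sizeof(int64_t)) {
    throw std::runtime_error("entry count does not match blob size");
  }

  DecodedSet set;
  if (nullIndex != kNoNullIndexSketch) {
    set.nullIndex = nullIndex;
  }
  set.values.resize(numValues);
  if (numValues > 0) {
    std::memcpy(
        set.values.data(),
        blob.data() + kHeaderBytes,
        static_cast<size_t>(numValues) * sizeof(int64_t));
  }
  return set;
}
```

For fixed-width values the count is redundant with the blob size, which is exactly what makes it usable as a consistency check; for variable-width entries it would instead bound the read loop.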

velox/exec/SetAccumulator.h (outdated, resolved)
Contributor

@Yuhta Yuhta left a comment

Oops, stamped the wrong PR.

facebook-github-bot pushed a commit that referenced this pull request Feb 7, 2024
…f bytes (#8653)

Summary:
This is the first in a set of PRs to add support for spilling distinct aggregations (see full version in #7791).

Spilling distinct aggregations needs support for spilling the SetAccumulators in which input values are accumulated. ComplexTypeSetAccumulators use AddressableNonNullValueList to serialize complex types. This PR adds new APIs to AddressableNonNullValueList so that it can copy/append a stream of bytes corresponding to a ComplexType value (array, map, struct).

Pull Request resolved: #8653

Reviewed By: Yuhta

Differential Revision: D53497000

Pulled By: mbasmanova

fbshipit-source-id: 66d44d02a2c3bd5775725c8b8559feaed17c0813
FelixYBW pushed a commit to FelixYBW/velox that referenced this pull request Feb 10, 2024
…f bytes (facebookincubator#8653)

FelixYBW pushed a commit to FelixYBW/velox that referenced this pull request Feb 12, 2024
…f bytes (facebookincubator#8653)


stale bot commented May 28, 2024

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@aditi-pandit
Collaborator Author

@xiaoxmeng, @Yuhta : This is the PR for Spilling of Distinct Aggregations. Let's continue the review here.


4 participants