Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add spilling support for aggregations with distinct. #7454

Open
spershin opened this issue Nov 7, 2023 · 5 comments
Open

Add spilling support for aggregations with distinct. #7454

spershin opened this issue Nov 7, 2023 · 5 comments
Assignees
Labels
aggregates enhancement New feature or request

Comments

@spershin
Copy link
Contributor

spershin commented Nov 7, 2023

Description

Currently we don't support spilling if aggregation nodes has aggregations with 'distinct' like this:

SELECT count(distinct c0) FROM tmp GROUP BY c1;

Need to add the support if we see queries with this are breaching memory limits.

@spershin spershin added enhancement New feature or request aggregates labels Nov 7, 2023
@aditi-pandit aditi-pandit self-assigned this Nov 10, 2023
@zhztheplayer
Copy link
Contributor

zhztheplayer commented Nov 10, 2023

@spershin @xiaoxmeng @aditi-pandit Partial aggregation doesn't yet support spilling either. Do we have a plan on fixing that?

@zhztheplayer
Copy link
Contributor

@spershin @xiaoxmeng @aditi-pandit Partial aggregation doesn't yet support spilling either. Do we have a plan on fixing that?

Discussion link #7511 (comment)

@mbasmanova
Copy link
Contributor

DistinctAggregations uses SetAccumulator to accumulate distinct inputs. This state can be spilled as regular vector or as an array of serialized buffers (ARRAY(VARBINARY)) similar to SortedAggregations.

To spill serialized data, we would need to extend SetAccumulator to add extractSerialized and addSerialized APIs.

@mbasmanova
Copy link
Contributor

The necessary changes would be similar to #7526

supermem613 added a commit to supermem613/velox that referenced this issue Jan 5, 2024
Queries with distinct aggregations cannot spill. With this change, they are now capable of doing so.

SetAccumulator accumulates the distinct inputs and will now expose an API showing what the maximum spill data is (maxSpillSize), which then can be used with extractForSpill to extract the serialized data. The clear method allows is then to drop the memory usage. Once it is time to bring the data back from spill, the addFromSpill API is used.

The AddressableNonNullValueList subordinate structure, used for complex types (e.g.: ROW, MAP) also is extended with serialization / deserialization capabilities via the new getSerializedSize, copySerializedTo and appendSerialized methods, which follow the same pattern of getting a size, then getting the data and being able to put it back. Its free method is made to leave the structure in a re-usable state as well.

Tests are added to AggregationsTest for the overall spill with distinct aggregation, both when it is on or off and including the VARCHAR case so get coverage beyond a scalar case.

Additionally, a new test suite is added with SetAccumulatorTest to cover all the serialization / deserialization logic added for SetAccumulator in all its implementation cases, including VARCHAR and ComplexType.

Finally, also added a test for serialization / deserialization for AddressableNonNullValueList.

Fixes facebookincubator#7454
@aditi-pandit
Copy link
Collaborator

aditi-pandit commented Jan 8, 2024

@supermem613: I had a slighty different variation of this code as well #7791.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
aggregates enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants