Add spilling support for aggregations with distinct. #7454

spershin · 2023-11-07T20:16:32Z

Description

Currently we don't support spilling when an aggregation node has aggregations with 'distinct', like this:

SELECT count(distinct c0) FROM tmp GROUP BY c1;

We need to add this support if we see such queries breaching memory limits.

zhztheplayer · 2023-11-10T05:28:51Z

@spershin @xiaoxmeng @aditi-pandit Partial aggregation doesn't yet support spilling either. Do we have a plan on fixing that?

zhztheplayer · 2023-11-14T01:20:29Z

Discussion link #7511 (comment)

mbasmanova · 2023-11-14T17:35:41Z

DistinctAggregations uses SetAccumulator to accumulate distinct inputs. This state can be spilled as a regular vector or as an array of serialized buffers (ARRAY(VARBINARY)), similar to SortedAggregations.

To spill serialized data, we would need to extend SetAccumulator to add extractSerialized and addSerialized APIs.
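To make the proposal concrete, here is a minimal sketch of what an extractSerialized / addSerialized pair could look like. ToySetAccumulator is a hypothetical stand-in written for this thread, not Velox's SetAccumulator, and the signatures are assumptions. Each distinct value becomes one byte buffer, playing the role of one ARRAY(VARBINARY) element.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical toy accumulator of distinct int64 values for one group.
// Names mirror the APIs proposed above; this is not Velox code.
class ToySetAccumulator {
 public:
  void addValue(int64_t value) { values_.insert(value); }

  size_t size() const { return values_.size(); }

  bool contains(int64_t value) const { return values_.count(value) > 0; }

  // Serializes every distinct value into its own byte buffer, i.e. one
  // VARBINARY element per value.
  std::vector<std::string> extractSerialized() const {
    std::vector<std::string> buffers;
    buffers.reserve(values_.size());
    for (int64_t value : values_) {
      std::string buffer(sizeof(value), '\0');
      std::memcpy(&buffer[0], &value, sizeof(value));
      buffers.push_back(std::move(buffer));
    }
    return buffers;
  }

  // Merges previously serialized buffers back in. Values that are already
  // present collapse, so the set stays distinct.
  void addSerialized(const std::vector<std::string>& buffers) {
    for (const auto& buffer : buffers) {
      assert(buffer.size() == sizeof(int64_t));
      int64_t value = 0;
      std::memcpy(&value, buffer.data(), sizeof(value));
      values_.insert(value);
    }
  }

  // Drops the in-memory state after it has been written out.
  void clear() { values_.clear(); }

 private:
  std::unordered_set<int64_t> values_;
};
```

In this sketch a spill would call extractSerialized, write the buffers out, then clear; a restore would call addSerialized on the buffers read back, possibly into an accumulator that already holds newer values.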

mbasmanova · 2023-11-14T17:36:27Z

The necessary changes would be similar to #7526

Queries with distinct aggregations previously could not spill. With this change, they can.

SetAccumulator accumulates the distinct inputs and now exposes an API reporting the maximum size of the spill data (maxSpillSize), which can then be used with extractForSpill to extract the serialized data. The clear method then allows the memory to be released. Once it is time to bring the data back from spill, the addFromSpill API is used.

The AddressableNonNullValueList subordinate structure, used for complex types (e.g., ROW, MAP), is also extended with serialization / deserialization capabilities via the new getSerializedSize, copySerializedTo and appendSerialized methods, which follow the same pattern: get a size, then get the data, then be able to put it back. Its free method is changed to leave the structure in a reusable state as well.

Tests are added to AggregationsTest for the overall spill with distinct aggregation, both with spilling on and off, and including the VARCHAR case to get coverage beyond the scalar case. Additionally, a new test suite, SetAccumulatorTest, covers all the serialization / deserialization logic added for SetAccumulator in all its implementation cases, including VARCHAR and ComplexType. Finally, a test is added for serialization / deserialization of AddressableNonNullValueList.

Fixes facebookincubator#7454
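The size-then-extract-then-restore cycle described above can be sketched roughly as follows. ToyStringSet is a hypothetical illustration: the method names are borrowed from the description, but the signatures and the length-prefixed byte layout are assumptions and do not reflect the actual Velox implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical toy set of distinct strings with a spill round trip.
// Layout (an assumption): a 4-byte length prefix followed by the payload.
class ToyStringSet {
 public:
  void addValue(const std::string& value) { values_.insert(value); }

  size_t size() const { return values_.size(); }

  bool contains(const std::string& value) const {
    return values_.count(value) > 0;
  }

  // Step 1: report how many bytes the caller must reserve.
  size_t maxSpillSize() const {
    size_t total = 0;
    for (const auto& value : values_) {
      total += sizeof(uint32_t) + value.size();
    }
    return total;
  }

  // Step 2: copy all values into the caller's buffer; returns bytes written.
  size_t extractForSpill(char* buffer, size_t capacity) const {
    size_t offset = 0;
    for (const auto& value : values_) {
      const uint32_t length = static_cast<uint32_t>(value.size());
      assert(offset + sizeof(length) + length <= capacity);
      std::memcpy(buffer + offset, &length, sizeof(length));
      offset += sizeof(length);
      std::memcpy(buffer + offset, value.data(), length);
      offset += length;
    }
    (void)capacity;  // Only checked when assertions are enabled.
    return offset;
  }

  // Step 3: release the in-memory state once the data is spilled.
  void clear() { values_.clear(); }

  // Step 4: read the serialized values back into the set.
  void addFromSpill(const char* buffer, size_t size) {
    size_t offset = 0;
    while (offset < size) {
      uint32_t length = 0;
      std::memcpy(&length, buffer + offset, sizeof(length));
      offset += sizeof(length);
      values_.emplace(buffer + offset, length);
      offset += length;
    }
  }

 private:
  std::unordered_set<std::string> values_;
};
```

Asking for the size first lets the caller allocate one buffer up front instead of growing it while values are copied out, which is the same shape as the getSerializedSize / copySerializedTo / appendSerialized trio mentioned for AddressableNonNullValueList.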

aditi-pandit · 2024-01-08T01:55:11Z

@supermem613: I had a slightly different variation of this code as well: #7791.

spershin added the enhancement (New feature or request) and aggregates labels Nov 7, 2023

aditi-pandit self-assigned this Nov 10, 2023

supermem613 mentioned this issue Jan 7, 2024

Support Spill for Distinct Aggregation #8287

Closed
