-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add spilling support for aggregations with distinct. #7454
Comments
@spershin @xiaoxmeng @aditi-pandit Partial aggregation doesn't yet support spilling either. Do we have a plan on fixing that? |
Discussion link #7511 (comment) |
DistinctAggregations uses SetAccumulator to accumulate distinct inputs. This state can be spilled as regular vector or as an array of serialized buffers (ARRAY(VARBINARY)) similar to SortedAggregations. To spill serialized data, we would need to extend SetAccumulator to add extractSerialized and addSerialized APIs. |
The necessary changes would be similar to #7526 |
Queries with distinct aggregations cannot spill. With this change, they are now capable of doing so. SetAccumulator accumulates the distinct inputs and will now expose an API showing what the maximum spill data is (maxSpillSize), which then can be used with extractForSpill to extract the serialized data. The clear method allows is then to drop the memory usage. Once it is time to bring the data back from spill, the addFromSpill API is used. The AddressableNonNullValueList subordinate structure, used for complex types (e.g.: ROW, MAP) also is extended with serialization / deserialization capabilities via the new getSerializedSize, copySerializedTo and appendSerialized methods, which follow the same pattern of getting a size, then getting the data and being able to put it back. Its free method is made to leave the structure in a re-usable state as well. Tests are added to AggregationsTest for the overall spill with distinct aggregation, both when it is on or off and including the VARCHAR case so get coverage beyond a scalar case. Additionally, a new test suite is added with SetAccumulatorTest to cover all the serialization / deserialization logic added for SetAccumulator in all its implementation cases, including VARCHAR and ComplexType. Finally, also added a test for serialization / deserialization for AddressableNonNullValueList. Fixes facebookincubator#7454
@supermem613: I had a slighty different variation of this code as well #7791. |
Description
Currently we don't support spilling if aggregation nodes has aggregations with 'distinct' like this:
SELECT count(distinct c0) FROM tmp GROUP BY c1;
Need to add the support if we see queries with this are breaching memory limits.
The text was updated successfully, but these errors were encountered: