Consolidate GroupByHash implementations row_hash.rs and hash.rs (remove duplication) #2723
Comments
I went through the codebase and some related GitHub issues. It seems that we need to:
Is there anything I am missing? I think I need to do some research on how to implement
Thanks @ming535 ! That list looks good to me
I am not sure DataFusion supports aggregating on
I agree we need this. List and Struct would not be too complicated to implement in the row format:
You could refer to the Spark repo to lend some ideas from its implementation. On the other hand, I suggest we evaluate the rewrite approach first. This optimizer rule rewrites an aggregate query with distinct aggregations into an expanded double aggregation, in which the regular aggregation expressions and every distinct clause are aggregated in separate groups. The results are then combined in a second aggregate. Check this Spark optimizer rule.
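For the single-distinct-clause case, the double aggregation described above can be sketched in plain Rust. This is a hypothetical illustration (not Spark or DataFusion code, and all names are made up): `SELECT key, SUM(x), COUNT(DISTINCT y) ... GROUP BY key` is computed by first grouping on `(key, y)` while partially summing `x`, then re-aggregating by `key`, where the number of phase-1 groups per key is exactly `COUNT(DISTINCT y)`.

```rust
use std::collections::BTreeMap;

// Hypothetical sketch of the distinct-aggregate rewrite for a single
// distinct clause. Input rows are (key, x, y); output maps
// key -> (SUM(x), COUNT(DISTINCT y)).
fn double_aggregate(rows: &[(i64, i64, i64)]) -> BTreeMap<i64, (i64, usize)> {
    // Phase 1: group by (key, y); accumulate a partial SUM(x) per pair.
    let mut phase1: BTreeMap<(i64, i64), i64> = BTreeMap::new();
    for &(key, x, y) in rows {
        *phase1.entry((key, y)).or_insert(0) += x;
    }
    // Phase 2: re-aggregate by key. Each phase-1 entry represents exactly
    // one distinct y, so counting entries yields COUNT(DISTINCT y),
    // while summing the partial sums yields the overall SUM(x).
    let mut phase2: BTreeMap<i64, (i64, usize)> = BTreeMap::new();
    for (&(key, _y), &partial_sum) in &phase1 {
        let entry = phase2.entry(key).or_insert((0, 0));
        entry.0 += partial_sum;
        entry.1 += 1;
    }
    phase2
}

fn main() {
    // rows of (key, x, y): key 1 has y values {7, 8}; key 2 has {9}
    let rows = [(1, 10, 7), (1, 20, 7), (1, 30, 8), (2, 5, 9)];
    let out = double_aggregate(&rows);
    assert_eq!(out[&1], (60, 2)); // SUM(x) = 60, COUNT(DISTINCT y) = 2
    assert_eq!(out[&2], (5, 1));
    println!("{:?}", out);
}
```

Spark's actual rule additionally handles multiple distinct clauses via an `Expand` step; this sketch only shows the core idea for one clause.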
Note: we may consider an "out of place" store for these varlenas.
I had a prototype working for recursive
@yjshen Hi, do you know why …? My understanding is that … The logical plan is …
I think Distinct basically serializes partial results as
Does this mean that the design was meant to support something like:
I am not totally sure to be honest |
Grand Plan

So @tustvold and I came up with a plan for this. Let me illustrate.

Prior Art

Let's first quickly illustrate what the current state is and what the problem is. We have two group-by implementations: V1 (
In general, other than the fact that this proposal sounds like a lot of work, I think it sounds wonderful 🏆 I did have a question about the proposed trait implementation:

```rust
trait Aggregator: MemoryConsumer {
    /// Update aggregator state for given groups.
    fn update_batch(&mut self, keys: Rows, batch: &RecordBatch) -> Result<()>;
    ...
}
```

Is the idea that each aggregator would contain a HashMap or some other way to map keys --> intermediates (rather than the hash map being in the operator)? This seems like it would result in a fair amount of duplication.

I would have expected something like the following (basically push the hash map out of the aggregator):

```rust
trait Aggregator: MemoryConsumer {
    /// Update aggregator state for given rows
    fn update_batch(&mut self, indices: Vec<usize>, batch: &RecordBatch) -> Result<()>;
    ...
}
```

The big danger of the plan initially seems to be that it wouldn't be finished and then we are in an even worse state (3 implementations!), but I think your idea to incrementally rewrite V2 to V3 and then remove V1 sounds like a good mitigation strategy.
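The alternative trait shape in this comment, where the operator owns the single key-to-group mapping and aggregators only receive group indices, can be sketched in plain Rust. This is a hypothetical stand-in (not DataFusion's API): `Aggregator`, `SumAggregator`, and the `i64` value column are all simplifications, and the real design would operate on Arrow arrays.

```rust
use std::collections::HashMap;

// Hypothetical sketch: the operator maps group keys to dense group
// indices once per batch, and aggregators only see (index, value) pairs,
// so each aggregator does not need its own hash map.
trait Aggregator {
    /// `group_indices[i]` is the group that row `i` belongs to.
    fn update_batch(&mut self, group_indices: &[usize], values: &[i64]);
    fn finalize(&self) -> Vec<i64>;
}

struct SumAggregator {
    sums: Vec<i64>, // one accumulator per group index
}

impl Aggregator for SumAggregator {
    fn update_batch(&mut self, group_indices: &[usize], values: &[i64]) {
        for (&g, &v) in group_indices.iter().zip(values) {
            if g >= self.sums.len() {
                self.sums.resize(g + 1, 0); // grow when new groups appear
            }
            self.sums[g] += v;
        }
    }
    fn finalize(&self) -> Vec<i64> {
        self.sums.clone()
    }
}

fn main() {
    // The operator owns the single key -> group-index map ...
    let mut group_ids: HashMap<&str, usize> = HashMap::new();
    let keys = ["a", "b", "a", "b", "a"];
    let values = [1_i64, 2, 3, 4, 5];
    let indices: Vec<usize> = keys
        .iter()
        .map(|&k| {
            let next = group_ids.len();
            *group_ids.entry(k).or_insert(next)
        })
        .collect();
    // ... and shares the computed indices with every aggregator.
    let mut agg = SumAggregator { sums: vec![] };
    agg.update_batch(&indices, &values);
    assert_eq!(agg.finalize(), vec![9, 6]); // "a" -> 1+3+5, "b" -> 2+4
}
```

The key consequence is that the hashing work is done once per batch regardless of how many aggregate expressions the query has.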
Sure, but it reduces dyn dispatch by a lot (once per batch instead of once per group), removes the
The aggregator CAN perform a
There will never be a 3rd implementation. We morph V2 into V3 and then delete V1.
I understand your point. I would probably have to see a prototype to really understand how complicated it would be in practice. It doesn't feel right to me. Another thing to consider is other potential aggregation algorithms:
Ideally we would follow the approach I've increasingly been applying to arrow-rs, and recently applied to InList. Namely, use concrete types, and only use dyn-dispatch to type-erase when dispatching the batches. It would be fairly trivial to implement a
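The "concrete types, dyn-dispatch only when dispatching batches" idea might look roughly like the following plain-Rust sketch. All names here (`BatchAggregator`, `MinAggregator`) are hypothetical, and a plain `&[f64]` stands in for an Arrow array; the point is only where the virtual call happens.

```rust
// Hypothetical sketch: aggregators are concrete, monomorphized types;
// Box<dyn ...> type erasure happens once per batch, not once per row.
trait BatchAggregator {
    fn update_batch(&mut self, values: &[f64]);
    fn result(&self) -> f64;
}

// A concrete aggregator: the inner loop is monomorphized, so the
// compiler can inline and potentially auto-vectorize it, because no
// dyn dispatch happens per element.
#[derive(Default)]
struct MinAggregator {
    min: Option<f64>,
}

impl BatchAggregator for MinAggregator {
    fn update_batch(&mut self, values: &[f64]) {
        for &v in values {
            self.min = Some(match self.min {
                Some(m) if m <= v => m,
                _ => v,
            });
        }
    }
    fn result(&self) -> f64 {
        self.min.unwrap_or(f64::NAN)
    }
}

fn main() {
    // Type erasure at the batch boundary: one virtual call per batch,
    // then a tight concrete loop over the values inside.
    let mut agg: Box<dyn BatchAggregator> = Box::new(MinAggregator::default());
    for batch in [&[3.0, 1.5, 2.0][..], &[0.5, 4.0][..]] {
        agg.update_batch(batch);
    }
    assert_eq!(agg.result(), 0.5);
}
```

This keeps the flexibility of trait objects for wiring up the operator while paying dispatch cost only once per batch.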
To be clear, I think the outcome of implementing the plan described by @crepererum in #2723 (comment) will be:
For apache#2723. This has two effects:

- **wider feature support:** We now use the V2 aggregator for all group-column types. The arrow row format support is sufficient for that. V1 will only be used if the aggregator itself doesn't support V2 (and these are quite a few at the moment). We'll improve on that front in follow-up PRs.
- **more speed:** Turns out the arrow row format is also faster (see below).

Perf results (mind the noise in the benchmarks that are actually not even touched by this code change):

```text
❯ cargo bench -p datafusion --bench aggregate_query_sql -- --baseline issue2723a-pre
...
     Running benches/aggregate_query_sql.rs (target/release/deps/aggregate_query_sql-fdbe5671f9c3019b)
aggregate_query_no_group_by 15 12
                        time:   [779.28 µs 782.77 µs 786.28 µs]
                        change: [+2.1375% +2.7672% +3.4171%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

aggregate_query_no_group_by_min_max_f64
                        time:   [712.96 µs 715.90 µs 719.14 µs]
                        change: [+0.8379% +1.7648% +2.6345%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  6 (6.00%) high mild
  1 (1.00%) high severe

Benchmarking aggregate_query_no_group_by_count_distinct_wide: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.7s, enable flat sampling, or reduce sample count to 50.
aggregate_query_no_group_by_count_distinct_wide
                        time:   [1.7297 ms 1.7399 ms 1.7503 ms]
                        change: [-34.647% -33.908% -33.165%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

Benchmarking aggregate_query_no_group_by_count_distinct_narrow: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.7s, enable flat sampling, or reduce sample count to 60.
aggregate_query_no_group_by_count_distinct_narrow
                        time:   [1.0984 ms 1.1045 ms 1.1115 ms]
                        change: [-38.581% -38.076% -37.569%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild

Benchmarking aggregate_query_group_by: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.1s, enable flat sampling, or reduce sample count to 50.
aggregate_query_group_by
                        time:   [1.7810 ms 1.7925 ms 1.8057 ms]
                        change: [-25.252% -24.127% -22.737%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe

Benchmarking aggregate_query_group_by_with_filter: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60.
aggregate_query_group_by_with_filter
                        time:   [1.2068 ms 1.2119 ms 1.2176 ms]
                        change: [+2.2847% +3.0857% +3.8789%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  7 (7.00%) high mild
  2 (2.00%) high severe

Benchmarking aggregate_query_group_by_u64 15 12: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.7s, enable flat sampling, or reduce sample count to 50.
aggregate_query_group_by_u64 15 12
                        time:   [1.6762 ms 1.6848 ms 1.6942 ms]
                        change: [-29.598% -28.603% -27.400%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  6 (6.00%) high severe

Benchmarking aggregate_query_group_by_with_filter_u64 15 12: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.1s, enable flat sampling, or reduce sample count to 60.
aggregate_query_group_by_with_filter_u64 15 12
                        time:   [1.1969 ms 1.2008 ms 1.2049 ms]
                        change: [+1.3015% +2.1513% +3.0016%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  2 (2.00%) high mild
  3 (3.00%) high severe

aggregate_query_group_by_u64_multiple_keys
                        time:   [14.797 ms 15.112 ms 15.427 ms]
                        change: [-12.072% -8.7274% -5.3392%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

aggregate_query_approx_percentile_cont_on_u64
                        time:   [4.1278 ms 4.1687 ms 4.2098 ms]
                        change: [+0.4851% +1.9525% +3.3676%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild

aggregate_query_approx_percentile_cont_on_f32
                        time:   [3.4694 ms 3.4967 ms 3.5245 ms]
                        change: [-1.5467% -0.4432% +0.6609%] (p = 0.43 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
```
So I think this ticket is now a bit complicated: it was originally about avoiding two GroupBy operations, which #4924 from @mustafasrepo does, but it has since grown to include consolidating / improving the aggregate implementation as well. So what I plan to do is break out @crepererum's plan in #2723 (comment) into a new ticket, and close this ticket when #4924 is merged.
I filed #4973 to track consolidating the aggregators |
It is actually quite cool to see that @crepererum 's plan from #2723 (comment) is moving right along |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As part of the transition to the faster "row format" (#1861 ) , @yjshen implemented a Row based Hash Aggregate implementation in #2375 ❤️
However, the implementation currently supports only a subset of the data types that DataFusion supports. This made the code significantly faster for some cases, but has some downsides: there are now two hash aggregate implementations, row_hash.rs and hash.rs.
You can already see the potential challenge in PRs like #2716 where test coverage may miss one of the hash aggregate implementations by accident
Describe the solution you'd like
I would like to consolidate the hash aggregate implementations -- success is to delete hash.rs by adding the remaining type support to row_hash.rs.
I think this would be a nice project for someone new to DataFusion to work on as the pattern is already defined, the outcome will be better performance, and they will get good experience with the code.
It will also increase the type support for the row format and make it easier to roll out through the rest of the codebase.
Describe alternatives you've considered
N/A
Additional context
More context about the ongoing row format conversion is in #1861.