Remove GroupByScalar and use ScalarValue in preparation for supporting null values in GroupBy #786
Conversation
/// Extract the value in `col[row]` as a GroupByScalar
fn create_group_by_value(col: &ArrayRef, row: usize) -> Result<GroupByScalar> {
    match col.data_type() {
Note that none of this code that created `GroupByScalar` checks `col.is_valid(row)`, and thus it uses whatever value happens to be in the array slots when they are NULL.
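To make the problem concrete, here is a minimal std-only sketch of the validity check the current code skips. `NullableCol` and `GroupByValue` are hypothetical stand-ins for arrow's `ArrayRef` (values plus a validity bitmap) and a scalar type with a NULL variant; this is not DataFusion's actual code.

```rust
/// Hypothetical stand-in for a scalar with an explicit NULL variant,
/// as ScalarValue allows (unlike GroupByScalar).
#[derive(Debug, PartialEq)]
enum GroupByValue {
    Null,
    Int64(i64),
}

/// Hypothetical stand-in for an arrow array: values plus validity.
/// Arrow stores validity as a bitmap; a Vec<bool> keeps the sketch simple.
struct NullableCol {
    values: Vec<i64>,
    valid: Vec<bool>,
}

fn create_group_by_value(col: &NullableCol, row: usize) -> GroupByValue {
    // The fix discussed above: consult validity before reading the slot,
    // instead of returning whatever value happens to sit in a NULL slot.
    if !col.valid[row] {
        GroupByValue::Null
    } else {
        GroupByValue::Int64(col.values[row])
    }
}

fn main() {
    let col = NullableCol {
        values: vec![7, 0],
        valid: vec![true, false],
    };
    assert_eq!(create_group_by_value(&col, 0), GroupByValue::Int64(7));
    // Row 1 is NULL: the garbage slot value (0) is ignored.
    assert_eq!(create_group_by_value(&col, 1), GroupByValue::Null);
    println!("ok");
}
```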
I think in the longer run, we can keep the keys in a contiguous (mutable) array instead and keep offsets/pointers to the values in this array (null values can be stored in a bitmap, so only 1 bit per value). This will only need roughly 8 bytes for the pointer plus the key value in Arrow format. It will also enable other optimizations. The worst case is now something like
@Dandandan this is a great idea - I will write up in more detail what I think you are proposing to make sure we are on the same page
I played around a bit and I can shrink the key size from 64 bytes --> 32 bytes with #788. I also think there is a possibility to shrink it even further, but I owe a writeup of how that will work
FYI @tustvold
Some more details to get you started; this is the rough idea, which is similar to how hash join currently works (there it's a bit easier as we can concatenate the data upfront):
The structure above will reduce the memory needed for the state (only about 32 bytes total per value plus the data itself, and most of it will be stack allocated / allocated in large chunks). Currently total usage per value is at least a few times higher. It should also reduce the time to create / (re)hash the keys and will reduce the amount of intermediate vec / small allocations that are needed in the hash aggregate code. There are maybe slightly different ways to encode the data in the hashmap / check collisions, and the structure above makes some further optimizations / vectorization possible.
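As a rough illustration of the contiguous-key idea, here is a hypothetical std-only sketch: all group key bytes live in one growable buffer, the map stores only a group index keyed by hash, and each group needs just an `(offset, len)` pair (~8 bytes). The names and the collision handling are illustrative, not DataFusion's actual design; a real implementation would need proper collision resolution and a null bitmap alongside.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical sketch: group keys stored contiguously, referenced by offset.
#[derive(Default)]
struct GroupKeys {
    buffer: Vec<u8>,          // all key bytes, end to end, in one allocation
    offsets: Vec<(u32, u32)>, // (start, len) per group: ~8 bytes per key
    map: HashMap<u64, usize>, // key hash -> group index
}

impl GroupKeys {
    /// Return the group index for `key`, appending it to the buffer if new.
    fn intern(&mut self, key: &[u8]) -> usize {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        let hash = h.finish();
        if let Some(&idx) = self.map.get(&hash) {
            let (start, len) = self.offsets[idx];
            // Compare against the stored bytes to guard against collisions.
            // (A real implementation must handle a true collision here; this
            // sketch just falls through and overwrites the map entry.)
            if &self.buffer[start as usize..(start + len) as usize] == key {
                return idx;
            }
        }
        let idx = self.offsets.len();
        self.offsets.push((self.buffer.len() as u32, key.len() as u32));
        self.buffer.extend_from_slice(key);
        self.map.insert(hash, idx);
        idx
    }
}

fn main() {
    let mut keys = GroupKeys::default();
    let a = keys.intern(b"alpha");
    let b = keys.intern(b"beta");
    // Re-interning an existing key reuses the stored bytes: no new allocation.
    assert_eq!(keys.intern(b"alpha"), a);
    assert_ne!(a, b);
    assert_eq!(keys.buffer.len(), "alpha".len() + "beta".len());
    println!("ok");
}
```

The point of the layout is that per-key overhead is a small fixed-size entry into shared, chunk-allocated storage, rather than a separately allocated `Vec`/scalar per group.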
Given that this PR is a potential memory usage regression and we are considering some different approaches (e.g. what @Dandandan has described in #786 (comment)), I am marking this PR as a draft so it is clear I am not sure it should be merged.
Btw, I think it is totally fine for now to have a small regression in order to support null values and clean up the code base a bit. |
👍 I want to ponder / try out some of the ideas you have explained too before coming to a conclusion |
@Dandandan here is my proposal: #790 for reworking group by hash, along the lines you proposed (I think). I plan to spend some time prototyping it later today, and any comments / concerns / corrections are most welcome
Force-pushed from 4a54818 to 3a0b5d9
This one seems like an incremental change towards #790, so I am going to merge this in and rebase #790 so that the changes in that PR are smaller.
Which issue does this PR close?
Fixes #376
Rationale for this change
This proposal is both a code cleanup and a step towards computing the correct answers when group by has nulls (#781 and #782).
Since `GroupByScalar` does not have any notion of a NULL or None now, but `ScalarValue` does, rather than trying to teach `GroupByScalar` about NULLs it seemed like a good opportunity to remove it entirely. In parallel I am working on a proposal for the rest of #781.
What changes are included in this PR?
- `Eq` and `Hash` implementations for `ScalarValue`, based on `OrderedFloat` (which was used by `GroupByScalar`)
- `impl From<Option<..>>` for `ScalarValue` to make it easier to use

Are there any user-facing changes?
Not functionally (though more memory will be used when grouping)
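The `From<Option<..>>` conversion listed above can be sketched with a minimal stand-in: the real `ScalarValue` has many more variants (and the `Eq`/`Hash` impls use `OrderedFloat` for the float cases), but the two-variant mock below shows the shape of the convenience impl this PR adds.

```rust
/// Minimal stand-in for ScalarValue; the real enum has many more variants.
#[derive(Debug, PartialEq)]
enum ScalarValue {
    Int64(Option<i64>),
    Utf8(Option<String>),
}

// The convenience conversion added in this PR: an Option maps directly
// onto a variant, with None representing SQL NULL.
impl From<Option<i64>> for ScalarValue {
    fn from(v: Option<i64>) -> Self {
        ScalarValue::Int64(v)
    }
}

fn main() {
    let s: ScalarValue = Some(5i64).into();
    assert_eq!(s, ScalarValue::Int64(Some(5)));
    // NULL is representable, which GroupByScalar could not express.
    let n: ScalarValue = None::<i64>.into();
    assert_eq!(n, ScalarValue::Int64(None));
    println!("ok");
}
```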
Benchmark results
TL;DR: I don't think there is any measurable performance difference with this change in the microbenchmarks.
I used the following command (with the baseline name changed):
Here are the benchmark results of master and this branch using https://github.com/BurntSushi/critcmp to compare
To see how consistent the measurements were, I gathered numbers twice on master and twice on this branch. My conclusion is that all changes are within the margin of error of my measurement setup
Concerns
This change increases the size of each potential group key in HashAggregates by a factor of 4: to 64 bytes, the `size_of(ScalarValue)`, up from 16 bytes (the size of `GroupByScalar`). UPDATE: I can shrink the overhead to an extra 32 bytes in this PR: #788.

This optimization was added by @Dandandan in 1ecdf3f / apache/arrow#8765 and I would be interested in hearing his thoughts on the matter here. In any event we will likely need more than the existing 16 bytes per group key to handle null values, but we probably don't need an extra 48 bytes per key.
Some options I can think of:
- `Box`/`Arc` the non-primitive values in `ScalarValue` (which might have some small overhead)

Perhaps @jorgecarleitao has some additional thoughts.
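The `Box` option rests on how Rust sizes enums: every variant pays for the largest payload, so boxing a wide variant shrinks the whole enum. A hypothetical illustration (the variant payloads are made up; only the 64-byte figure for the real `ScalarValue` comes from this PR):

```rust
use std::mem::size_of;

// Illustrative enums, not the real ScalarValue: one wide inline payload
// inflates every key, even for the common small variants.
#[allow(dead_code)]
enum Inline {
    Int64(Option<i64>),
    Large(Option<[u8; 48]>), // wide payload stored inline
}

#[allow(dead_code)]
enum Boxed {
    Int64(Option<i64>),
    Large(Option<Box<[u8; 48]>>), // boxed: only a pointer inline
}

fn main() {
    // Boxing the wide payload shrinks the enum, at the cost of a heap
    // allocation and an indirection when that variant is used.
    assert!(size_of::<Inline>() > size_of::<Boxed>());
    println!(
        "Inline: {} bytes, Boxed: {} bytes",
        size_of::<Inline>(),
        size_of::<Boxed>()
    );
}
```

The trade-off noted above is the "small overhead": boxed variants cost an allocation and a pointer chase, which matters on hot group-by paths.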