-
Notifications
You must be signed in to change notification settings - Fork 11
WIP: Restore old grouping algorithm and improve it #76
Conversation
Follow the strategy used by Pandas. The new implementation is more efficient since it avoids creating a NullableCategoricalArray: the integer codes are combined on the fly with those computed from previous columns. Hashing only happens once by giving arbitrary codes to levels in the first pass; after that, only integer codes are used. Move the per-column operations to separate functions which can be specialized by the compiler for each column type. This also allows using a more efficient method for CategoricalArray. Fix ordering of CategoricalArray levels when levels have been reordered, and sort null values last for consistency with other nullable arrays. Enable sorting by default since its cost is relatively small compared with the rest. Avoid some allocations by using in place operations, use Base.unique!().
@alyst I've just added a mechanism to compress group indices when they would overflow. It was actually very simple: we just need to take the computed indices as if they were a grouping column. So every time overflow would happen, we have to pay a cost equivalent to grouping on an integer column. The code now also always compresses the group indices at the end when there are multiple grouping columns, since we allocate a vector of a length equal to the number of groups. Without this, a memory error can happen in the indexer even if we avoided overflow in the step before. This seems to be what Pandas does (though I'm not completely sure), but in some cases it's clearly not useful, i.e. when there are few empty combinations. So I wonder whether a kind of threshold wouldn't be in order (e.g. compress only if there are more than half as many groups as rows). I also realized that by moving to a |
Thanks for working on this Milan! I read through the Discourse discussion and the comments on #17 and the reasons for switching back to the old approach make sense to me. I've started working on a benchmark suite for DataTables again. Given how often people ask about performance, and that we missed the regressions on |
It would really be great to have a series benchmarks. Without it it's hard to make systematic performance improvements. You can probably steal examples from Pandas and data.tables, they show benchmarks on their websites. |
https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping also has some nice benchmarks. |
I have found an alternative approach at #79 which seems to give similar or even better performance. Unfortunately, the issues with |
Closing in favor of #79. |
Follow the strategy used by Pandas. The new implementation is more efficient
since it avoids creating a NullableCategoricalArray: the integer codes are
combined on the fly with those computed from previous columns. Hashing only
happens once by giving arbitrary codes to levels in the first pass; after that,
only integer codes are used.
Move the per-column operations to separate functions which can be specialized
by the compiler for each column type. This also allows using a more efficient
method for CategoricalArray.
Fix ordering of CategoricalArray levels when levels have been reordered,
and sort null values last for consistency with other nullable arrays. Enable
sorting by default since its cost is relatively small compared with the rest.
Avoid some allocations by using in place operations, use Base.unique!().
This is WIP in particular because we need to implement the code to compress the codes when the cartesian product is going to overflow (see this code in Pandas). Also, tests for sorted and unsorted grouping should be adapted.
This branch cuts down the time to run this example from 10.7 seconds (similar to DataFrames) to 5.7 seconds.