WIP: Restore old grouping algorithm and improve it #76

nalimilan · 2017-06-18T14:34:37Z

Follow the strategy used by Pandas. The new implementation is more efficient
since it avoids creating a NullableCategoricalArray: the integer codes are
combined on the fly with those computed from previous columns. Hashing only
happens once by giving arbitrary codes to levels in the first pass; after that,
only integer codes are used.

Move the per-column operations to separate functions which can be specialized
by the compiler for each column type. This also allows using a more efficient
method for CategoricalArray.

Fix ordering of CategoricalArray levels when levels have been reordered,
and sort null values last for consistency with other nullable arrays. Enable
sorting by default since its cost is relatively small compared with the rest.

Avoid some allocations by using in place operations, use Base.unique!().

This is WIP in particular because we need to implement the code to compress the codes when the cartesian product is going to overflow (see this code in Pandas). Also, tests for sorted and unsorted grouping should be adapted.

This branch cuts down the time to run this example from 10.7 seconds (similar to DataFrames) to 5.7 seconds.

using DataTables
N = 100_000_000
A = rand(N);
B = rand(1:100, N);
dt = DataTable(A = A, B = B);
@time by(dt, :B, d -> mean(d[:A]));

Follow the strategy used by Pandas. The new implementation is more efficient since it avoids creating a NullableCategoricalArray: the integer codes are combined on the fly with those computed from previous columns. Hashing only happens once by giving arbitrary codes to levels in the first pass; after that, only integer codes are used. Move the per-column operations to separate functions which can be specialized by the compiler for each column type. This also allows using a more efficient method for CategoricalArray. Fix ordering of CategoricalArray levels when levels have been reordered, and sort null values last for consistency with other nullable arrays. Enable sorting by default since its cost is relatively small compared with the rest. Avoid some allocations by using in place operations, use Base.unique!().

nalimilan · 2017-06-22T16:24:51Z

@alyst I've just added a mechanism to compress group indices when they would overflow. It was actually very simple: we just need to take the computed indices as if they were a grouping column. So every time overflow would happen, we have to pay a cost equivalent to grouping on an integer column.

The code now also always compresses the group indices at the end when there are multiple grouping columns, since we allocate a vector of a length equal to the number of groups. Without this, a memory error can happen in the indexer even if we avoided overflow in the step before. This seems to be what Pandas does (though I'm not completely sure), but in some cases it's clearly not useful, i.e. when there are few empty combinations. So I wonder whether a kind of threshold wouldn't be in order (e.g. compress only if there are more than half as many groups as rows).

I also realized that by moving to a UInt index rather than UInt32, it's quite hard to have enough groups to trigger an overflow. What was the use case which triggered it?

cjprybol · 2017-06-23T16:40:32Z

Thanks for working on this Milan! I read through the Discourse discussion and the comments on #17 and the reasons for switching back to the old approach make sense to me.

I've started working on a benchmark suite for DataTables again. Given how often people ask about performance, and that we missed the regressions on join in #17 (too excited about the improvements in groupby!), I think the sooner we can get some good, library-wide performance coverage the better. I'll try and get the tests for split-apply-combine related functionality done ASAP so we can compare these changes to the current approach from #17, and also so I can check #65.

nalimilan · 2017-06-23T16:47:39Z

It would really be great to have a series benchmarks. Without it it's hard to make systematic performance improvements. You can probably steal examples from Pandas and data.tables, they show benchmarks on their websites.

davidanthoff · 2017-06-23T18:31:21Z

https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping also has some nice benchmarks.

nalimilan · 2017-07-11T18:25:06Z

I have found an alternative approach at #79 which seems to give similar or even better performance. Unfortunately, the issues with join are quite different so we need to find something else.

nalimilan · 2017-08-07T12:51:08Z

Closing in favor of #79.

nalimilan mentioned this pull request Jun 18, 2017

Enhance joining and grouping #17

Merged

nalimilan added 2 commits June 22, 2017 18:16

Compress group indices when needed, improve performance

a9ded4d

nalimilan force-pushed the nl/group branch from d6104d7 to a9ded4d Compare June 22, 2017 16:16

nalimilan mentioned this pull request Jul 11, 2017

Specialize row_group_slots() and findrow() on column types to improve performance #79

Merged

nalimilan closed this Aug 7, 2017

ararslan deleted the nl/group branch August 7, 2017 19:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Restore old grouping algorithm and improve it #76

WIP: Restore old grouping algorithm and improve it #76

nalimilan commented Jun 18, 2017

nalimilan commented Jun 22, 2017

cjprybol commented Jun 23, 2017

nalimilan commented Jun 23, 2017

davidanthoff commented Jun 23, 2017

nalimilan commented Jul 11, 2017

nalimilan commented Aug 7, 2017

WIP: Restore old grouping algorithm and improve it #76

WIP: Restore old grouping algorithm and improve it #76

Conversation

nalimilan commented Jun 18, 2017

nalimilan commented Jun 22, 2017

cjprybol commented Jun 23, 2017

nalimilan commented Jun 23, 2017

davidanthoff commented Jun 23, 2017

nalimilan commented Jul 11, 2017

nalimilan commented Aug 7, 2017