Faster grouping and aggregation #894

Closed · tshort opened this issue Nov 29, 2015 · 33 comments
@tshort (Contributor) commented Nov 29, 2015

Our grouping is pretty slow now. I've started a WIP in this branch of DataFramesMeta. Grouping columns that are integers (including PooledDataVectors) is sped up with a counting sort. Aggregation speed is improved by preallocating the result and assuming that functions all reduce.
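As a rough illustration of the counting-sort idea (a sketch with hypothetical names, not the actual branch code): for integer keys in 1:ngroups, one pass builds a histogram of group sizes and a second pass scatters row indices into a grouped permutation, with no hashing or comparison sort.

```julia
# Hypothetical sketch of counting-sort grouping; not the branch code.
# Assumes keys are integers in 1:ngroups (e.g. refs of a pooled column).
function groupsort(keys::Vector{Int}, ngroups::Int)
    counts = zeros(Int, ngroups + 1)
    for k in keys
        counts[k + 1] += 1              # histogram: counts[k+1] = size of group k
    end
    offsets = cumsum(counts)            # offsets[k] = rows before group k
    perm = Vector{Int}(undef, length(keys))
    for (i, k) in enumerate(keys)
        offsets[k] += 1
        perm[offsets[k]] = i            # scatter row i into its group's slot
    end
    return perm, counts[2:end]          # grouped permutation + group sizes
end
```

keys[perm] is then ordered by group, and the per-group counts delimit each group, so grouping costs two linear passes.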

Here are some timing comparisons from this file with N=1e7:

                               new grouping   old grouping   R's data.table
Large groups, one key                  0.21           1.93             0.17
Large groups, two keys                 0.50           3.52             0.34
Large groups, multiple ops             0.33           1.68             0.73
Large groups, multiple ops             0.34           1.75             0.25
Small groups, multiple ops             1.02           5.52             0.47

This gets us in the ballpark of R's data.table, but note that this is with integer columns. R's data.table can work with characters and other types. It is very fast. I don't know how Matthew does it! I know he uses an MSB radix sort, but I tried implementing something, and it was quite slow.

I did the development in DataFramesMeta, but most of the code is meant for DataFrames. This was just a convenient place for me to try things out while keeping DataFrames working.

This affects other code, particularly joins. @alyst, you've looked at the joining code most recently. Do you think this grouping code could be integrated into your PR #850? Could we get faster joining by replacing the use of groupsort_indexer?

@alyst (Contributor) commented Nov 30, 2015

@tshort Nice work, thank you! It would also be interesting to see how the allocation stats from @time compare between your code and DataFrames master.

I've only looked into your code very briefly, so I might have missed something. What I haven't figured out is how exactly the counting sort outperforms groupsort_indexer (as far as the sorting algorithm is concerned, #850 is essentially the same as master). Also, what is your vision for the generic case (multicolumn/non-integral groups)? Would you create a transient index column as in master? (In #850 it's a DataFrame-optimized hash.)

Joins in #850 use the _RowGroupDict type to define the row groups, so it would be quite easy to replace the methods that generate _RowGroupDict with faster alternatives.

@tshort (Contributor, Author) commented Dec 2, 2015

I don't think groupsort_indexer is the problem in the existing implementation. The slowness comes mainly from PooledDataArray construction and from overly general handling of results.

For the general non-integer case, we can convert to a PooledDataArray like we do now. I have a PooledDataArray constructor that's 2X faster.

https://github.com/JuliaStats/DataFramesMeta.jl/blob/ts/grouping/src/df-replacements.jl#L150

Another option is to try for a faster Radix sort like data.table does.

For multiple columns, I don't know if it's better to pool/sort them individually or to try to hash them first as in #850. For the integer case, I merged at the bit level. That works well for pooled data.

I've looked a bit at the grouping code, and my code might help speed up parts. At least the faster pooling should help, but I don't know if that's better than the custom hashing in #850.

@nalimilan (Member)

I've not read the details, but let me comment on one point: I don't think creating a PooledDataArray for non-integer vectors is a good idea, as it requires allocating a lot of memory. If people need it, they can create it beforehand. Otherwise, better to use a solution that doesn't imply allocating a full vector copy, even if it's slower.

@tshort (Contributor, Author) commented Dec 2, 2015

Right now, we create PooledDataArrays for every column (even for integers) as part of grouping and joining. That's the main reason we're slow.

Another option to speed up grouping of string columns is to pool the strings themselves. R has a global pool of strings, which cuts down on memory usage. I've experimented with using Symbols as a global string pool, with moderate improvement. We could extend the idea to a global pool (or pools) for strings that also maps each string to an integer. Then we could use a counting sort to group these columns. See also the discussion of categorical variables at JuliaStats/DataArrays.jl#73. One issue with a global pool is garbage collection: I'm not sure how we could free unused strings.
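A minimal sketch of such a global pool (the names STRING_POOL, POOL_STRINGS, and poolid are hypothetical, not from the branch): each distinct string is interned once and mapped to a stable integer id, so grouping a string column reduces to grouping integers.

```julia
# Hypothetical global string pool: intern each distinct string once and
# hand out a stable integer id. Freeing unused strings (GC) is the open
# problem mentioned above.
const STRING_POOL = Dict{String,UInt32}()
const POOL_STRINGS = String[]            # id -> string, for decoding

function poolid(s::String)::UInt32
    get!(STRING_POOL, s) do
        push!(POOL_STRINGS, s)
        UInt32(length(POOL_STRINGS))     # next free id
    end
end
```

With every key column encoded as ids like this, two tables can be joined on the ids directly, with no per-join "re-pooling".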

@nalimilan (Member)

Should we care? I tend to think that variables which benefit from pooling should be PooledDataArrays, and that it's OK to be slow in other cases. At least, better to get the PDA case working well before trying hard to optimize non-standard cases.

@tshort (Contributor, Author) commented Dec 2, 2015

Pooling at the element level rather than the array level has some advantages. For one, you can start pooling right while reading from a file, which cuts down on memory. When joining two DataFrames on a PDA key column, we have to "re-pool" each column; with a global pool that's unnecessary, since we can compare the columns directly.

@alyst (Contributor) commented Dec 2, 2015

@tshort There could be an equivalent of R's stringsAsFactors= in the table-reading functions that would enable pooling directly during data import.
But are we confident this is the most critical performance bottleneck of the current implementation?

There are timing comparisons for joining/grouping, but I guess profiling/memory-allocation analysis needs to be done to identify the real hot spots.

@tshort (Contributor, Author) commented Dec 2, 2015

For the grouping code, I've done enough profiling to identify that creating PDAs is a key bottleneck. That's easy to see below, where groupby costs almost the same as creating a PDA:

julia> using DataFrames, DataFramesMeta
julia> N=10_000_000;
julia> d = DataFrame(a = P(rand(1:10, N)));

julia> @time groupby(d, :a);
  1.021399 seconds (10.00 M allocations: 421.997 MB, 5.92% gc time)

julia> @time PooledDataArray(d[:a]);
  0.939465 seconds (10.00 M allocations: 269.413 MB, 8.49% gc time)

For joining, I suspect that's also the case, but I haven't done any profiling.

@alyst (Contributor) commented Dec 2, 2015

What concerns me is where these 10^7 allocations and ~270 MB come from. Shouldn't there be far fewer allocations in theory?

@tshort (Contributor, Author) commented Dec 2, 2015

That is a problem. Here are results from my faster constructor for PooledDataArrays:

julia> Pkg.checkout("DataFramesMeta", "ts/grouping")

julia> reload("DataFramesMeta")

julia> @time DataFramesMeta.pooled(d[:a]);
  0.240253 seconds (78 allocations: 76.297 MB, 4.20% gc time)

And in that, I did trade off some extra allocation for less hashing.

@alyst (Contributor) commented Dec 2, 2015

@tshort Thanks, these numbers look much more like what we should expect. So it indicates there's probably a type-stability problem in the current PooledDataArray constructor, because a Dict doesn't need that many allocations:

function test_dict(n::Int)
  d = Dict{Int,Int}()
  sizehint!(d, n)
  for i in 1:n
    d[rand(1:10000)] = rand(1:10000)
  end
end

#julia> @time test_dict(10^6)
#  0.118631 seconds (15 allocations: 17.001 MB, 0.45% gc time)

That's actually my main point: before doing any DataFrames optimizations, we have to be sure the upstream (DataArrays/NullableArrays) is correct.

@tshort (Contributor, Author) commented Dec 2, 2015

I think we can look at both...

Also, I think your timing includes a lot of rand(). Here's something that I think is a better comparison to what the PDA constructor needs to do:

julia> function f(d, a)
           for i in 1:length(a)
               d[a[i]] = i
           end
           d
       end

julia> @time r=f(Dict{Int,Int}(), d[:a]);
  0.215547 seconds (11 allocations: 768 bytes)

@tshort (Contributor, Author) commented Dec 2, 2015

Another point on the timing above is that I think my new PDA constructor is about as good as we are going to get using Julia's Dict. Unless we look at different hashing approaches, I don't think we can make that part any faster. That's why looking at other alternatives like the counting sort and better ways of handling strings is helpful.

Here are timings for PDA construction for strings:

julia> s = rand([string("idx", i) for i in 1:10], N)

julia> @time PooledDataArray(s);
  2.913278 seconds (20.00 M allocations: 421.988 MB, 4.28% gc time)

julia> @time DataFramesMeta.pooled(s);
  0.870145 seconds (67 allocations: 76.297 MB, 2.51% gc time)

julia> @time r=f(Dict{ASCIIString,Int}(), s); # note- may have type instability
  1.022405 seconds (10.08 M allocations: 155.544 MB, 6.99% gc time)

Those are the times I think we can improve on a lot with global pooling of strings.

@nalimilan (Member)

With small pools, I've observed that a simple linear search is faster than a dictionary. Unfortunately, the question remains how to decide which algorithm to use...
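A sketch of the linear-search lookup for small pools (findpool is a hypothetical helper, not DataFrames API): below some pool size, a plain scan beats computing a hash at all.

```julia
# Hypothetical lookup-or-insert for a small pool: linear scan, no hashing.
function findpool(pool::Vector{T}, x::T) where {T}
    for i in eachindex(pool)
        pool[i] == x && return i     # hit: return the existing ref code
    end
    push!(pool, x)                   # miss: grow the pool
    return length(pool)
end
```

Deciding the crossover point between this and a Dict (e.g. a fixed threshold on pool length) is exactly the open question above.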

@alyst (Contributor) commented Dec 2, 2015

I've just checked, and if you explicitly specify the element type, the string timings improve considerably:

@time s = rand(ASCIIString[string("idx", i) for i in 1:10], 10^7)
 # 0.356052 seconds (75 allocations: 76.298 MB)
@time f(Dict{ASCIIString,Int}(), s)
#  0.562493 seconds (11 allocations: 768 bytes)

@tshort (Contributor, Author) commented Dec 2, 2015

Agreed on the linear search.

You beat me to the punch on my type problem above. Here are my timings:

julia> s = rand(ASCIIString[string("idx", i) for i in 1:10], N);

julia> @time PooledDataArray(s);
  1.500306 seconds (10.00 M allocations: 269.408 MB, 5.80% gc time)

julia> @time DataFramesMeta.pooled(s);
  0.484201 seconds (67 allocations: 76.297 MB, 1.30% gc time)

julia> @time r=f(Dict{ASCIIString,Int}(), s);
  0.588719 seconds (11 allocations: 768 bytes)

julia> @time r=f(Dict{ASCIIString,Int}(), s);
  0.585572 seconds (11 allocations: 768 bytes)

@alyst (Contributor) commented Dec 2, 2015

And it's not only the PooledDataArray constructor that affects performance; setindex!/getindex for DataArray also need to be checked. I've identified one (obscure) case: JuliaStats/DataArrays.jl#163, but there could be more.

@tshort (Contributor, Author) commented Dec 3, 2015

I coded up a minimal PooledString type using some code from @johnmyleswhite's CategoricalData.jl. See pooledstrings.jl. To make a PooledString, you can just do pstring("hello world") to use the global string pool.

Here are the timing comparisons redone with this version:

                                       new grouping   old grouping   R's data.table
Large groups, one key, a PDA                   0.18           1.55             0.16
Large groups, one key, strings                 1.31           1.84             0.16
Large groups, one key, PooledStrings           0.23                            0.16
Large groups, two keys                         0.37           2.93             0.32
Large groups, multiple ops                     0.27           1.25             0.73
Large groups, multiple ops                     0.31           1.35             0.25
Small groups, multiple ops                     0.86           4.66             0.46

So, PooledStrings speed up grouping quite a bit relative to using plain strings.

@tshort (Contributor, Author) commented Dec 8, 2015

The joining code in DataFrames is slow for two reasons:

  • Indexing and memory copying: d[idx] eats up time and memory.
  • Slow grouping.

I wrote some faster join code here. The memory issues are reduced by returning views instead of copying, and the faster-grouping techniques outlined above are also used.
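The copy-versus-view distinction can be illustrated with plain arrays (in current Julia, view does this directly; the 2015 branch used its own wrapper types):

```julia
# d[idx]-style indexing copies; a view shares the parent's memory.
a = collect(1:10)
idx = [5, 3, 1]

copied = a[idx]          # allocates a new 3-element vector
viewed = view(a, idx)    # lazy: reads go through `idx`, no copy

a[5] = 99                # mutate the parent
# `viewed[1]` now sees 99, while `copied[1]` is still 5.
```

For join results with millions of rows, skipping that allocation is where most of the memory savings come from.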

Here are timings in seconds comparing the existing join times with the new join code (inner join only for now). Two different tests are included, each with different types of keys. These tests are based on the timing code here.

                                              new joins   old joins
Overlapping keys, PDA key                          0.14        2.01
Overlapping keys, string key                       0.36        2.13
Overlapping keys, PooledString key                 0.22        2.05
Overlapping keys, PooledStringArray key            0.20        2.96
1 large/1 small DF, PDA key                        1.06        8.52
1 large/1 small DF, string key                     1.82        4.66
1 large/1 small DF, PooledString key               0.61        3.84
1 large/1 small DF, PooledStringArray key          0.49        4.87

These cases all result in a DF with about 10,000,000 rows. For reference, R's data.table takes about 3.5 secs for these joins, and dplyr averages about 4.8 secs.

@garborg (Contributor) commented Dec 8, 2015

@tshort I can't review or contribute meaningfully these days, but I wanted to hop in anyway and say this sounds like a great overhaul.

tshort mentioned this issue Dec 9, 2015
@tshort (Contributor, Author) commented Dec 31, 2015

This effort is on hold for now. On the Julia-users mailing list, @Viral wrote that some of the Moore Foundation grant will be used towards dataframes and that a roadmap or plan will be coming out soon. I'm waiting to make sure that this effort fits in with those plans.

@datnamer commented Jan 1, 2016

@tshort I think you meant to tag @ViralBShah

@ViralBShah (Contributor)

Cc @simonbyrne

@ViralBShah (Contributor)

I should just make clear that we do not have a specific plan for this yet, but dataframes is certainly one of the first priorities to put effort into. The hope is that some dedicated effort spent trying out the various things discussed so far will give us a big push forward.

@datnamer commented Jan 1, 2016

Sounds interesting. What are some of the potential ideas and avenues for exploration? I.e., expression-based or macros? What kind of API and interface with the backend? Formula interface? Database and out-of-core?

@ViralBShah (Contributor)

Most avenues are discussed in #744.

@ViralBShah (Contributor)

There are also some good ideas in @quinnj's DataStreams.jl.

@nalimilan (Member)

@alyst Do you think there are ideas to borrow for DataTables after JuliaData/DataTables.jl#17?

@alyst (Contributor) commented Mar 6, 2017

The ts/grouping branch is gone. Does the code still reside somewhere?

cc @tshort

@nalimilan (Member)

At least there's a link in this comment.

@alyst (Contributor) commented Mar 7, 2017

AFAICT there is nothing that could be [easily] cherry-picked to improve performance. The proposed changes include:

  • improving and generalizing the pooling of strings (Pool{T}) and how two pools are combined for frame joins. But since DataTables.jl has migrated to CategoricalArrays, with tight control over the categories, these changes either no longer apply or are already addressed;
  • for indexing rows in a multicolumn join, dedicating specific bits of the UInt32 index to each column. I think this approach has scalability issues and should not be faster than hashing;
  • changes to the [_]join_idx() method that introduce a counting sort, as an alternative to the hash-based RowGroupDict. I don't know whether there would be performance benefits, but incorporating these changes would mean reverting many changes introduced by "Enhance joining and grouping" DataTables.jl#17. Also, this join_idx() version doesn't preserve the original row ordering of the tables, which was one of the #17 features.
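The bit-level index packing described in the second bullet can be sketched as follows (packkey and the bit widths are illustrative, not from the branch): each key column's ref codes occupy a fixed bit range of a single UInt32 key, which makes the combined key cheap to compute but caps the total at 32 bits across all key columns.

```julia
# Illustrative packing of two per-column ref codes into one UInt32 key.
# With k key columns the key needs sum_i ceil(log2(nlevels_i)) bits in
# total, which is the scalability concern relative to hashing.
packkey(r1::Integer, r2::Integer, bits2::Integer) =
    (UInt32(r1) << bits2) | UInt32(r2)   # r2 occupies the low `bits2` bits
```

For example, two pooled columns with up to 16 levels each fit in 8 bits total, but a handful of high-cardinality columns quickly exhaust the 32-bit budget.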

@nalimilan (Member)

OK, thanks for checking. Pooling strings could still be done via a global pool as an optimization that wouldn't have major user-visible consequences; that would be different from CategoricalArray, which is made not only to improve performance, but also to provide features like ordering and comparison of levels. But that's completely orthogonal to what JuliaData/DataTables.jl#17 implemented.

@tshort (Contributor, Author) commented Mar 7, 2017

I'm good with closing this. Thanks for looking, @alyst.

quinnj closed this as completed Sep 7, 2017