Faster grouping and aggregation #894

Closed · tshort opened this issue Nov 29, 2015 · 33 comments
@tshort (Contributor) commented Nov 29, 2015

Our grouping is pretty slow now. I've started a WIP in this branch of DataFramesMeta. Grouping columns that are integers (including PooledDataVectors) is sped up with a counting sort. Aggregation speed is improved by preallocating the result and assuming that functions all reduce.
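As a rough illustration of the counting-sort idea (a sketch with hypothetical names, not the actual branch code): for integer keys in 1:ngroups, one pass builds a histogram of group sizes and a second pass scatters row indices into a grouped permutation, with no hashing or comparison sort.

```julia
# Hypothetical sketch of counting-sort grouping; not the branch code.
# Assumes keys are integers in 1:ngroups (e.g. refs of a pooled column).
function groupsort(keys::Vector{Int}, ngroups::Int)
    counts = zeros(Int, ngroups + 1)
    for k in keys
        counts[k + 1] += 1              # histogram: counts[k+1] = size of group k
    end
    offsets = cumsum(counts)            # offsets[k] = rows before group k
    perm = Vector{Int}(undef, length(keys))
    for (i, k) in enumerate(keys)
        offsets[k] += 1
        perm[offsets[k]] = i            # scatter row i into its group's slot
    end
    return perm, counts[2:end]          # grouped permutation + group sizes
end
```

keys[perm] is then ordered by group, and the per-group counts delimit each group, so grouping costs two linear passes.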

Here are some timing comparisons from this file with N=1e7:

                               new grouping   old grouping   R's data.table
Large groups, one key                  0.21           1.93             0.17
Large groups, two keys                 0.50           3.52             0.34
Large groups, multiple ops             0.33           1.68             0.73
Large groups, multiple ops             0.34           1.75             0.25
Small groups, multiple ops             1.02           5.52             0.47

This gets us in the ballpark of R's data.table, but note that this is with integer columns. R's data.table can work with characters and other types. It is very fast. I don't know how Matthew does it! I know he uses an MSB radix sort, but I tried implementing something, and it was quite slow.

I did the development in DataFramesMeta, but most of the code is meant for DataFrames. This was just a convenient place for me to try things out while keeping DataFrames working.

This affects other code, particularly joins. @alyst, you've looked at the joining code most recently. Do you think this grouping code could be integrated into your PR #850? Could we get faster joining by replacing the use of groupsort_indexer?

@alyst (Contributor) commented Nov 30, 2015

@tshort Nice work, thank you! It would also be interesting to see how the allocation stats from @time compare between your code and DataFrames master.

I've only looked into your code very briefly, so I might have missed something. What I haven't figured out is how exactly the counting sort outperforms groupsort_indexer (as far as the sorting algorithm is concerned, #850 is essentially the same as master). Also, what is your vision for the generic case (multicolumn/non-integral groups)? Would you create a transient index column as in master? (In #850 it's a DataFrame-optimized hash.)

Joins in #850 use the _RowGroupDict type to define the row groups, so it would be quite easy to replace the methods that generate _RowGroupDict with faster alternatives.

@tshort (Contributor, Author) commented Dec 2, 2015

I don't think groupsort_indexer is the problem in the existing implementation. The slowness comes mainly from PooledDataArray construction and from overly general handling of results.

For the general non-integer case, we can convert to a PooledDataArray like we do now. I have a PooledDataArray constructor that's 2X faster.

https://github.com/JuliaStats/DataFramesMeta.jl/blob/ts/grouping/src/df-replacements.jl#L150

Another option is to try for a faster Radix sort like data.table does.

For multiple columns, I don't know if it's better to pool/sort them individually or to try to hash them first as in #850. For the integer case, I merged at the bit level. That works well for pooled data.

I've looked a bit at the grouping code, and my code might help speed up parts. At least the faster pooling should help, but I don't know if that's better than the custom hashing in #850.

@nalimilan (Member)

I've not read the details, but let me comment on one point: I don't think creating a PooledDataArray for non-integer vectors is a good idea, as it requires allocating a lot of memory. If people need it, they can create it beforehand. Otherwise, better to use a solution that doesn't imply allocating a full vector copy, even if it's slower.

@tshort (Contributor, Author) commented Dec 2, 2015

Right now, we create PooledDataArrays for every column (even for integers) as part of grouping and joining. That's the main reason we're slow.

Another option to speed up grouping of string columns is to pool the strings themselves. R has a global pool of strings, which cuts down on memory usage. I've experimented with using Symbols as a global string pool, with moderate improvement. We could extend the idea to a global pool (or pools) for strings that also maps each string to an integer. Then we could use a counting sort to group these columns. See also the discussion of categorical variables at JuliaStats/DataArrays.jl#73. One issue with a global pool is garbage collection: I'm not sure how we could free unused strings.
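A minimal sketch of such a global pool (the names STRING_POOL, POOL_STRINGS, and poolid are hypothetical, not from the branch): each distinct string is interned once and mapped to a stable integer id, so grouping a string column reduces to grouping integers.

```julia
# Hypothetical global string pool: intern each distinct string once and
# hand out a stable integer id. Freeing unused strings (GC) is the open
# problem mentioned above.
const STRING_POOL = Dict{String,UInt32}()
const POOL_STRINGS = String[]            # id -> string, for decoding

function poolid(s::String)::UInt32
    get!(STRING_POOL, s) do
        push!(POOL_STRINGS, s)
        UInt32(length(POOL_STRINGS))     # next free id
    end
end
```

With every key column encoded as ids like this, two tables can be joined on the ids directly, with no per-join "re-pooling".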

@nalimilan (Member)

Should we care? I tend to think that variables which benefit from pooling should be PooledDataArrays, and that it's OK to be slow in other cases. At least, better to get the PDA case working well before trying hard to optimize non-standard cases.

@tshort (Contributor, Author) commented Dec 2, 2015

Pooling at the element level rather than the array level has some advantages. For one, you can start pooling right while reading from a file, which cuts down on memory. When joining two DataFrames on a PDA key column, we have to "re-pool" each column; with a global pool that's unnecessary, since we can compare the columns directly.

@alyst (Contributor) commented Dec 2, 2015

@tshort There could be an equivalent of R's stringsAsFactors= in the table-reading functions that would enable pooling directly during data import.
But are we confident this is the most critical performance bottleneck of the current implementation?

There are timing comparisons for joining/grouping, but I guess profiling/memory-allocation analysis needs to be done to identify the real hot spots.

@tshort (Contributor, Author) commented Dec 2, 2015

For the grouping code, I've done enough profiling to identify that creating PDAs is a key bottleneck. That's easy to see below, where groupby costs almost the same as creating a PDA:

julia> using DataFrames, DataFramesMeta
julia> N=10_000_000;
julia> d = DataFrame(a = P(rand(1:10, N)));

julia> @time groupby(d, :a);
  1.021399 seconds (10.00 M allocations: 421.997 MB, 5.92% gc time)

julia> @time PooledDataArray(d[:a]);
  0.939465 seconds (10.00 M allocations: 269.413 MB, 8.49% gc time)

For joining, I suspect that's also the case, but I haven't done any profiling.

@alyst (Contributor) commented Dec 2, 2015

What concerns me is where these 10^7 allocations and ~270 MB come from. Shouldn't there be far fewer allocations in theory?

@tshort (Contributor, Author) commented Dec 2, 2015

That is a problem. Here are results from my faster constructor for PooledDataArrays:

julia> Pkg.checkout("DataFramesMeta", "ts/grouping")

julia> reload("DataFramesMeta")

julia> @time DataFramesMeta.pooled(d[:a]);
  0.240253 seconds (78 allocations: 76.297 MB, 4.20% gc time)

And in that, I did trade off some extra allocation for less hashing.

@alyst (Contributor) commented Dec 2, 2015

@tshort Thanks, these numbers look much more like what we should expect. So it indicates there's probably a type-stability problem in the current PooledDataArray constructor, because a Dict doesn't need that many allocations:

function test_dict(n::Int)
  d = Dict{Int,Int}()
  sizehint!(d, n)
  for i in 1:n
    d[rand(1:10000)] = rand(1:10000)
  end
end

#julia> @time test_dict(10^6)
#  0.118631 seconds (15 allocations: 17.001 MB, 0.45% gc time)

That's actually my main point: before doing any DataFrames optimizations, we have to be sure the upstream (DataArrays/NullableArrays) is correct.

@tshort (Contributor, Author) commented Dec 2, 2015

I think we can look at both...

Also, I think your timing includes a lot of rand(). Here's something that I think is a better comparison to what the PDA constructor needs to do:

julia> function f(d, a)
           for i in 1:length(a)
               d[a[i]] = i
           end
           d
       end

julia> @time r=f(Dict{Int,Int}(), d[:a]);
  0.215547 seconds (11 allocations: 768 bytes)

@tshort (Contributor, Author) commented Dec 2, 2015

Another point on the timing above is that I think my new PDA constructor is about as good as we are going to get using Julia's Dict. Unless we look at different hashing approaches, I don't think we can make that part any faster. That's why looking at other alternatives like the counting sort and better ways of handling strings is helpful.

Here are timings for PDA construction for strings:

julia> s = rand([string("idx", i) for i in 1:10], N)

julia> @time PooledDataArray(s);
  2.913278 seconds (20.00 M allocations: 421.988 MB, 4.28% gc time)

julia> @time DataFramesMeta.pooled(s);
  0.870145 seconds (67 allocations: 76.297 MB, 2.51% gc time)

julia> @time r=f(Dict{ASCIIString,Int}(), s); # note- may have type instability
  1.022405 seconds (10.08 M allocations: 155.544 MB, 6.99% gc time)

Those are the times I think we can improve on a lot with global pooling of strings.

@nalimilan (Member)

With small pools, I've observed that a simple linear search is faster than a dictionary. Unfortunately, the question remains how to decide which algorithm to use...
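A sketch of the linear-search lookup for small pools (findpool is a hypothetical helper, not DataFrames API): below some pool size, a plain scan beats computing a hash at all.

```julia
# Hypothetical lookup-or-insert for a small pool: linear scan, no hashing.
function findpool(pool::Vector{T}, x::T) where {T}
    for i in eachindex(pool)
        pool[i] == x && return i     # hit: return the existing ref code
    end
    push!(pool, x)                   # miss: grow the pool
    return length(pool)
end
```

Deciding the crossover point between this and a Dict (e.g. a fixed threshold on pool length) is exactly the open question above.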

@alyst (Contributor) commented Dec 2, 2015

I've just checked, and if you explicitly specify the element type, the string timings improve considerably:

@time s = rand(ASCIIString[string("idx", i) for i in 1:10], 10^7)
 # 0.356052 seconds (75 allocations: 76.298 MB)
@time f(Dict{ASCIIString,Int}(), s)
#  0.562493 seconds (11 allocations: 768 bytes)

@tshort (Contributor, Author) commented Dec 2, 2015

Agreed on the linear search.

You beat me to the punch on my type problem above. Here are my timings:

julia> s = rand(ASCIIString[string("idx", i) for i in 1:10], N);

julia> @time PooledDataArray(s);
  1.500306 seconds (10.00 M allocations: 269.408 MB, 5.80% gc time)

julia> @time DataFramesMeta.pooled(s);
  0.484201 seconds (67 allocations: 76.297 MB, 1.30% gc time)

julia> @time r=f(Dict{ASCIIString,Int}(), s);
  0.588719 seconds (11 allocations: 768 bytes)

julia> @time r=f(Dict{ASCIIString,Int}(), s);
  0.585572 seconds (11 allocations: 768 bytes)

@alyst (Contributor) commented Dec 2, 2015

And it's not only the PooledDataArray constructor that affects performance; setindex!/getindex for DataArray also need to be checked. I've identified one (obscure) case: JuliaStats/DataArrays.jl#163, but there could be more.

@tshort (Contributor, Author) commented Dec 3, 2015

I coded up a minimal PooledString type using some code from @johnmyleswhite's CategoricalData.jl. See pooledstrings.jl. To make a PooledString, you can just do pstring("hello world") to use the global string pool.

Here are the timing comparisons redone with this version:

                                       new grouping   old grouping   R's data.table
Large groups, one key, a PDA                   0.18           1.55             0.16
Large groups, one key, strings                 1.31           1.84             0.16
Large groups, one key, PooledStrings           0.23                            0.16
Large groups, two keys                         0.37           2.93             0.32
Large groups, multiple ops                     0.27           1.25             0.73
Large groups, multiple ops                     0.31           1.35             0.25
Small groups, multiple ops                     0.86           4.66             0.46

So, PooledStrings speed up grouping quite a bit relative to using plain strings.

@tshort (Contributor, Author) commented Dec 8, 2015

The joining code in DataFrames is slow for two reasons:

  • Indexing and memory copying: d[idx] eats up time and memory.
  • Slow grouping.

I wrote some faster join code here. The memory issues are reduced by returning views instead of copying, and the faster-grouping techniques outlined above are also used.
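The copy-versus-view distinction can be illustrated with plain arrays (in current Julia, view does this directly; the 2015 branch used its own wrapper types):

```julia
# d[idx]-style indexing copies; a view shares the parent's memory.
a = collect(1:10)
idx = [5, 3, 1]

copied = a[idx]          # allocates a new 3-element vector
viewed = view(a, idx)    # lazy: reads go through `idx`, no copy

a[5] = 99                # mutate the parent
# `viewed[1]` now sees 99, while `copied[1]` is still 5.
```

For join results with millions of rows, skipping that allocation is where most of the memory savings come from.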

Here are timings in seconds comparing the existing join times with the new join code (inner join only for now). Two different tests are included, each with different types of keys. These tests are based on the timing code here.

                                              new joins   old joins
Overlapping keys, PDA key                          0.14        2.01
Overlapping keys, string key                       0.36        2.13
Overlapping keys, PooledString key                 0.22        2.05
Overlapping keys, PooledStringArray key            0.20        2.96
1 large/1 small DF, PDA key                        1.06        8.52
1 large/1 small DF, string key                     1.82        4.66
1 large/1 small DF, PooledString key               0.61        3.84
1 large/1 small DF, PooledStringArray key          0.49        4.87

These cases all result in a DF with about 10,000,000 rows. For reference, R's data.table takes about 3.5 secs for these joins, and dplyr averages about 4.8 secs.

@garborg (Contributor) commented Dec 8, 2015

@tshort I can't review or contribute meaningfully these days, but I wanted to hop in anyway and say this sounds like a great overhaul.

tshort mentioned this issue Dec 9, 2015
@tshort (Contributor, Author) commented Dec 31, 2015

This effort is on hold for now. On the Julia-users mailing list, @Viral wrote that some of the Moore Foundation grant will be used towards dataframes and that a roadmap or plan will be coming out soon. I'm waiting to make sure that this effort fits in with those plans.

@datnamer commented Jan 1, 2016

@tshort I think you meant to tag @ViralBShah

@ViralBShah (Contributor)

Cc @simonbyrne

@ViralBShah (Contributor)

I should just make clear that we do not have a specific plan for this yet, but dataframes is certainly one of the first priorities to put effort into. The hope is that some dedicated effort spent trying out the various things discussed so far will give us a big push forward.

@datnamer commented Jan 1, 2016

Sounds interesting. What are some of the potential ideas and avenues for exploration? I.e., expression-based or macros? What kind of API and interface with the backend? Formula interface? Database and out-of-core?

@ViralBShah (Contributor)

Most avenues are discussed in #744.

@ViralBShah (Contributor)

There are also some good ideas in @quinnj's DataStreams.jl.

@nalimilan (Member)

@alyst Do you think there are ideas to borrow for DataTables after JuliaData/DataTables.jl#17?

@alyst (Contributor) commented Mar 6, 2017

The ts/grouping branch is gone. Does the code still reside somewhere?

cc @tshort

@nalimilan (Member)

At least there's a link in this comment.

@alyst (Contributor) commented Mar 7, 2017

AFAICT there is nothing that could be [easily] cherry-picked to improve performance. The proposed changes include:

  • improving and generalizing the pooling of strings (Pool{T}) and how two pools are combined for frame joins. But since DataTables.jl has migrated to CategoricalArrays, with tight control over the categories, these changes either no longer apply or are already addressed;
  • for indexing rows in a multicolumn join, dedicating specific bits of the UInt32 index to each column. I think this approach has scalability issues and should not be faster than hashing;
  • changes to the [_]join_idx() method that introduce a counting sort, as an alternative to the hash-based RowGroupDict. I don't know whether there would be performance benefits, but incorporating these changes would mean reverting many changes introduced by "Enhance joining and grouping" DataTables.jl#17. Also, this join_idx() version doesn't preserve the original row ordering of the tables, which was one of the #17 features.
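The bit-level index packing described in the second bullet can be sketched as follows (packkey and the bit widths are illustrative, not from the branch): each key column's ref codes occupy a fixed bit range of a single UInt32 key, which makes the combined key cheap to compute but caps the total at 32 bits across all key columns.

```julia
# Illustrative packing of two per-column ref codes into one UInt32 key.
# With k key columns the key needs sum_i ceil(log2(nlevels_i)) bits in
# total, which is the scalability concern relative to hashing.
packkey(r1::Integer, r2::Integer, bits2::Integer) =
    (UInt32(r1) << bits2) | UInt32(r2)   # r2 occupies the low `bits2` bits
```

For example, two pooled columns with up to 16 levels each fit in 8 bits total, but a handful of high-cardinality columns quickly exhaust the 32-bit budget.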

@nalimilan (Member)

OK, thanks for checking. Pooling strings could still be done via a global pool as an optimization that wouldn't have major user-visible consequences; that would be different from CategoricalArray, which is made not only to improve performance, but also to provide features like ordering and comparison of levels. But that's completely orthogonal to what JuliaData/DataTables.jl#17 implemented.

@tshort (Contributor, Author) commented Mar 7, 2017

I'm good with closing this. Thanks for looking, @alyst.

quinnj closed this as completed Sep 7, 2017