Faster grouping and aggregation #894
@tshort Nice work, thank you! It would also be interesting to look at how the allocation stats compare. I've looked into your code very briefly, so I might miss something. What I hadn't figured out is how exactly the counting sort wins in performance over the approach the joins from #850 use.
I don't think … For the general non-integer case, we can convert to a PooledDataArray: https://github.com/JuliaStats/DataFramesMeta.jl/blob/ts/grouping/src/df-replacements.jl#L150. Another option is to try for a faster radix sort like data.table does. For multiple columns, I don't know if it's better to pool/sort them individually or to try to hash them first as in #850. For the integer case, I merged the key columns at the bit level; that works well for pooled data. I've looked a bit at the grouping code, and my code might help speed up parts. At least the faster pooling should help, but I don't know if that's better than the custom hashing in #850.
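As a concrete illustration of the bit-level merge, here is my sketch (not the branch's code); it assumes both key columns hold positive Ints and that the packed key fits in 64 bits:

```julia
# Pack two positive integer keys into one Int so a single counting/radix
# sort pass can group on both columns at once.
function merge_keys(r1::Vector{Int}, r2::Vector{Int}, max2::Int)
    shift = 8 * sizeof(Int) - leading_zeros(max2)  # bits needed to hold key 2
    combined = similar(r1)
    for i in 1:length(r1)
        combined[i] = (r1[i] << shift) | r2[i]     # high bits: key 1; low bits: key 2
    end
    combined
end
```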
I've not read the details, but let me comment on one point: I don't think creating a PooledDataArray just for grouping is a good idea.
Right now, we create one as part of grouping. Another option to speed up grouping of string columns is to pool the strings themselves: R has a global pool of strings, which cuts down on memory usage. I've experimented with using a global pool.
Should we care? I tend to consider that variables that should be pooled should be stored as pooled arrays in the first place.
Pooling at the element level rather than the array level has some advantages. For one, you can start pooling right when you're reading from a file, which cuts down on memory. Also, when joining two DataFrames on a PDA key column, we currently have to "re-pool" each column; with a global pool that's unnecessary, since we can compare the columns directly.
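For illustration, element-level pooling could look like the sketch below. All names here are hypothetical, and a production version would also need thread safety and a way to reclaim unused entries:

```julia
# Hypothetical global string pool: interned strings are just integer ids.
const POOL_IDS  = Dict{String,UInt32}()   # string -> id
const POOL_STRS = String[]                # id -> string

function intern(s::String)
    get!(POOL_IDS, s) do                  # hash the bytes only on first sight
        push!(POOL_STRS, s)
        UInt32(length(POOL_STRS))
    end
end

# intern("abc") == intern("abc")          # true, via integer comparison only
```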
@tshort There could be … There are timing comparisons for joining/grouping, but I guess profiling/memory allocation analysis needs to be done to identify the real hot spots.
For the grouping code, I've done enough profiling to identify that creating PDAs is a key bottleneck. That's easy to see in the following, where constructing the PooledDataArray accounts for most of the groupby time:

```julia
julia> using DataFrames, DataFramesMeta

julia> P = PooledDataArray;   # assumed alias; the original transcript uses a bare `P`

julia> N = 10_000_000;

julia> d = DataFrame(a = P(rand(1:10, N)));

julia> @time groupby(d, :a);
  1.021399 seconds (10.00 M allocations: 421.997 MB, 5.92% gc time)

julia> @time PooledDataArray(d[:a]);
  0.939465 seconds (10.00 M allocations: 269.413 MB, 8.49% gc time)
```

For joining, I suspect that's also the case, but I haven't done any profiling.
What I'm concerned about is where these 10^7 allocations and ~270 MB come from. Shouldn't there be far fewer allocations in theory?
That is a problem. Here are results from my faster constructor for PooledDataArrays:

```julia
julia> Pkg.checkout("DataFramesMeta", "ts/grouping")

julia> reload("DataFramesMeta")

julia> @time DataFramesMeta.pooled(d[:a]);
  0.240253 seconds (78 allocations: 76.297 MB, 4.20% gc time)
```

And in that, I did trade off some extra allocation for less hashing.
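I'd guess the speedup comes from doing a single pass with one hash lookup per element. A sketch under that assumption (not the actual DataFramesMeta code, and written in current Julia syntax):

```julia
# One-pass pooling: build refs and pool together, hashing each value once.
function pooled_sketch(x::AbstractVector)
    T = eltype(x)
    invpool = Dict{T,UInt32}()
    pool = T[]
    refs = Vector{UInt32}(undef, length(x))
    for i in eachindex(x)
        refs[i] = get!(invpool, x[i]) do   # single lookup: find or insert
            push!(pool, x[i])
            UInt32(length(pool))
        end
    end
    refs, pool                             # the raw material for a PooledDataArray
end
```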
@tshort Thanks, these numbers look much more like what we should expect. So it indicates there's probably a type-stability problem in the current PooledDataArray constructor. Compare with simply filling a Dict:

```julia
function test_dict(n::Int)
    d = Dict{Int,Int}()
    sizehint!(d, n)
    for i in 1:n
        d[rand(1:10000)] = rand(1:10000)
    end
end

# julia> @time test_dict(10^6)
#   0.118631 seconds (15 allocations: 17.001 MB, 0.45% gc time)
```

That's actually my main point: before doing any algorithmic changes, it's worth making sure the existing code is free of this kind of low-level problem.
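One generic way to confirm a suspected instability (my suggestion, not something from the thread) is to inspect the hot call with @code_warntype and look for Any-typed variables:

```julia
using DataFrames   # re-exported PooledDataArray in this era of the ecosystem

# Non-concrete (red/`Any`) annotations in the output point at the instability.
@code_warntype PooledDataArray(rand(1:10, 100))
```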
I think we can look at both... Also, I think your timing includes a lot of time spent in rand itself. Here's a version that just fills a Dict from an existing array:

```julia
julia> function f(d, a)
           for i in 1:length(a)
               d[a[i]] = i
           end
           d
       end

julia> @time r = f(Dict{Int,Int}(), d[:a]);
  0.215547 seconds (11 allocations: 768 bytes)
```
Another point on the timing above is that I think my new PDA constructor is about as good as we are going to get using Julia's Dict. Unless we look at different hashing approaches, I don't think we can make that part any faster. That's why looking at other alternatives, like the counting sort and better ways of handling strings, is helpful. Here are timings for PDA construction for strings:

```julia
julia> s = rand([string("idx", i) for i in 1:10], N);

julia> @time PooledDataArray(s);
  2.913278 seconds (20.00 M allocations: 421.988 MB, 4.28% gc time)

julia> @time DataFramesMeta.pooled(s);
  0.870145 seconds (67 allocations: 76.297 MB, 2.51% gc time)

julia> @time r = f(Dict{ASCIIString,Int}(), s);   # note: may have a type instability
  1.022405 seconds (10.08 M allocations: 155.544 MB, 6.99% gc time)
```

Those are the times I think we can improve on a lot with global pooling of strings.
With small pools, I observed that a simple linear search is faster than a dictionary. Unfortunately, the question remains of how to decide which algorithm to use...
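The small-pool fast path could be as simple as the sketch below (the function name and the crossover threshold are made up):

```julia
# For tiny pools, a linear scan avoids hashing entirely and stays in cache.
function find_or_add!(pool::Vector, v)
    for j in 1:length(pool)
        pool[j] == v && return j          # hit: reuse the existing ref
    end
    push!(pool, v)
    length(pool)                          # miss: append and return the new ref
end

# One possible heuristic: scan linearly while length(pool) <= 32 or so,
# then fall back to a Dict-backed pool beyond that.
```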
I've just checked, and if you explicitly specify the element type, the string timings are much improved:

```julia
@time s = rand(ASCIIString[string("idx", i) for i in 1:10], 10^7);
#   0.356052 seconds (75 allocations: 76.298 MB)

@time f(Dict{ASCIIString,Int}(), s);
#   0.562493 seconds (11 allocations: 768 bytes)
```
Agreed on the linear search. You beat me to the punch on my type problem above. Here are my timings:

```julia
julia> s = rand(ASCIIString[string("idx", i) for i in 1:10], N);

julia> @time PooledDataArray(s);
  1.500306 seconds (10.00 M allocations: 269.408 MB, 5.80% gc time)

julia> @time DataFramesMeta.pooled(s);
  0.484201 seconds (67 allocations: 76.297 MB, 1.30% gc time)

julia> @time r = f(Dict{ASCIIString,Int}(), s);
  0.588719 seconds (11 allocations: 768 bytes)

julia> @time r = f(Dict{ASCIIString,Int}(), s);
  0.585572 seconds (11 allocations: 768 bytes)
```
And it's not only the …
I coded up a minimal PooledString type using some code from @johnmyleswhite's CategoricalData.jl; see pooledstrings.jl. To make a PooledString, you can just do … With this, I redid the timing comparisons: PooledStrings speed up grouping quite a bit relative to using plain strings.
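The heart of such a type can be tiny. A toy version in current Julia syntax (names hypothetical, reusing the POOL_STRS vector from the pooling sketch earlier in this thread; the real code is in pooledstrings.jl):

```julia
# Toy element-level pooled string: nothing but an index into a global pool.
struct PStr
    id::UInt32
end

Base.:(==)(a::PStr, b::PStr) = a.id == b.id    # O(1) equality, no byte comparison
Base.hash(p::PStr, h::UInt)  = hash(p.id, h)   # hashing reduces to integer hashing
Base.string(p::PStr) = POOL_STRS[p.id]         # recover the text only when needed
```

Since equality and hashing reduce to integer operations, grouping on such a column behaves like grouping on integers, which is the fast case above.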
The joining code that's in DataFrames is slow for two reasons: …

I wrote some faster join code here. The memory issues are reduced by returning views instead of copying, and the techniques outlined above for faster joining are also used. Here are timings in seconds comparing the existing join times with the new join code (inner join only for now). Two different tests are included, each with different types of keys. These tests are based on the timing code here.

These cases all result in a DataFrame with about 10,000,000 rows. For reference, R's data.table takes about 3.5 seconds for these joins, and dplyr averages about 4.8 seconds.
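To make the approach concrete, here is a rough single-key inner-join sketch in the same spirit: group the right table's keys, then emit index vectors that can back views rather than copies (illustrative only; the real code is in the linked branch):

```julia
# Build row-index vectors for an inner join on one key column.
function inner_join_indices(lkey::AbstractVector, rkey::AbstractVector)
    rrows = Dict{eltype(rkey),Vector{Int}}()
    for (j, k) in enumerate(rkey)
        push!(get!(rrows, k, Int[]), j)    # group right-table rows by key
    end
    li = Int[]; ri = Int[]
    for (i, k) in enumerate(lkey)
        rows = get(rrows, k, nothing)
        rows === nothing && continue
        for j in rows                      # emit every matching row pair
            push!(li, i); push!(ri, j)
        end
    end
    li, ri   # e.g. view(left, li, :) and view(right, ri, :) avoid copying
end
```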
@tshort I can't review or contribute meaningfully these days, but I wanted to hop in anyway and say this sounds like a great overhaul.

This effort is on hold for now. On the julia-users mailing list, @Viral wrote that some of the Moore Foundation grant will be used towards dataframes and that a roadmap or plan will be coming out soon. I'm waiting to make sure that this effort fits in with those plans.

@tshort I think you meant to tag @ViralBShah

Cc @simonbyrne

I should just make clear that we do not have a specific plan for this yet, but dataframes is certainly one of the first priorities to put effort into. The hope is that putting some dedicated effort into trying out the various things discussed so far should give us a big push forward.
Sounds interesting. What are some of the potential ideas and avenues for exploration? E.g., expression-based or macros? What kind of API and interface with the backend? Formula interface? Database and out-of-core?
Most avenues are discussed in #744.

Also some good ideas in @quinnj's DataStreams.jl.

@alyst Do you think there are ideas to borrow for DataTables after JuliaData/DataTables.jl#17?
The link to the branch seems to be gone. cc @tshort

At least there's a link in this comment.
AFAICT there are no things that could be [easily] cherry-picked to improve the performance. The proposed changes include: …
OK, thanks for checking. Pooling strings could still be done via a global pool as an optimization that wouldn't have major user-visible consequences; that would be different from pooling at the array level.
I'm good with closing this. Thanks for looking, @alyst.
Our grouping is pretty slow now. I've started a WIP in this branch of DataFramesMeta. Grouping columns that are integers (including PooledDataVectors) is sped up with a counting sort. Aggregation speed is improved by preallocating the result and assuming that functions all reduce.
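A minimal sketch of the counting-sort idea just mentioned, for keys known to lie in 1:ngroups (e.g. PooledDataArray refs); this is my illustration of the technique, not the branch's code:

```julia
# Counting-sort grouping: O(n + ngroups), no hashing, no comparison sort.
function group_perm(refs::Vector{Int}, ngroups::Int)
    counts = zeros(Int, ngroups)
    for r in refs
        counts[r] += 1                          # histogram of group sizes
    end
    starts = ones(Int, ngroups + 1)
    for g in 1:ngroups
        starts[g + 1] = starts[g] + counts[g]   # prefix sums: where each group begins
    end
    perm = zeros(Int, length(refs))
    offset = copy(starts)
    for (i, r) in enumerate(refs)
        perm[offset[r]] = i                     # scatter row i into its group's slot
        offset[r] += 1
    end
    perm, starts    # rows of group g are perm[starts[g]:starts[g+1]-1]
end
```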
Here are some timing comparisons from this file with N = 1e7: …
This gets us in the ballpark of R's data.table, but note that this is with integer columns. R's data.table can work with characters and other types. It is very fast. I don't know how Matthew does it! I know he uses an MSB radix sort, but I tried implementing something, and it was quite slow.
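For flavor, here is the simpler LSD cousin of that idea, sorting UInt32 keys one byte per pass; a real MSD version like data.table's instead recurses on the top byte and switches to insertion sort for small buckets (again, a sketch of the general technique, not data.table's actual algorithm):

```julia
# LSD radix sort returning a permutation; keys[perm] comes out sorted.
function radixsort_perm(keys::Vector{UInt32})
    n = length(keys)
    perm = collect(1:n)
    tmp  = similar(perm)
    for pass in 0:3                              # four byte-wide passes, LSB first
        shift = 8 * pass
        counts = zeros(Int, 257)
        for i in 1:n
            b = (keys[perm[i]] >> shift) & 0xff
            counts[b + 2] += 1                   # histogram, offset by one slot
        end
        counts[1] = 1
        for b in 2:257
            counts[b] += counts[b - 1]           # prefix sums: bucket start positions
        end
        for i in 1:n                             # stable scatter into buckets
            b = (keys[perm[i]] >> shift) & 0xff
            tmp[counts[b + 1]] = perm[i]
            counts[b + 1] += 1
        end
        perm, tmp = tmp, perm
    end
    perm
end
```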
I did the development in DataFramesMeta, but most of the code is meant for DataFrames. This was just a convenient place for me to try things out while keeping DataFrames working.
This affects other code, particularly joins. @alyst, you've looked at the joining code most recently. Do you think this grouping code could be integrated into your PR #850? Could we get faster joining by replacing the use of `groupsort_indexer`?