Enhance joining and grouping #850
Conversation
I've updated the PR. The newer version should be faster and use less memory. As requested by @matthieugomez, here's a small benchmark using RDatasets:

using DataFrames, RDatasets

diamonds = dataset("ggplot2", "diamonds");
# diamonds[:Cut] = convert(PooledDataArray{ASCIIString, UInt}, diamonds[:Cut])
diamonds[:Cut] = convert(DataArray{ASCIIString}, diamonds[:Cut]);
diamonds[:Clarity] = convert(DataArray{ASCIIString}, diamonds[:Clarity]);
f(n) = for i in 1:n groupby(diamonds, [:Clarity, :Carat]) end;
f(1)                        # warm up: compile before timing
Profile.clear_malloc_data() # reset allocation counters
@time f(100)

This PR: (timing output not preserved)
In this example the new implementation is both faster and more memory-efficient. But I don't think the reported numbers reflect the actual efficiency of the implementations. The enormous number of allocations is probably due to element indexing methods etc.; fixing just JuliaStats/DataArrays.jl#163 already reduced the number of allocations by 50%, and there should be some other hotspots. In reality, the difference should be around a few hundred allocations. Note also that on master, a call like groupby(diamonds, [:Cut, :Carat, :Clarity, :Depth, :Table, :Price, :Color]) (after converting the respective columns' ref types) could even crash some systems: during grouping, the estimated number of groups is the product of the distinct-value counts of the grouping columns, which can overflow or demand insane amounts of memory (see the sketch below).
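A sketch of the failure mode just described, with hypothetical per-column cardinalities (an illustration of the naive scheme, not the actual master code):

# The naive scheme indexes groups positionally, so the candidate group-index
# space is the product of the per-column distinct-value counts, even though
# only a tiny fraction of those combinations occurs in the data.
cards = [5, 273, 8, 184, 127, 11602, 7]  # hypothetical distinct counts per column
prod(cards)                              # ~2*10^13 candidate group slots for ~54k rows
# Per-group storage of this size cannot be allocated, and with narrower
# integer index types the group index itself overflows.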
For big frames, sorting the row groups can take quite some time. Also, IMHO it's more logical to preserve the original order of rows as much as possible by default. So I've added an option to skip sorting, which is now the default (see the usage sketch below).
Great. BTW, dplyr sorts and data.table does not.
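A minimal usage sketch of the option discussed above, assuming a sort keyword as in present-day DataFrames.jl (the exact keyword name in this PR may differ):

using DataFrames

df = DataFrame(x = [3, 1, 3, 2], y = 1:4)

groupby(df, :x)               # default: groups in order of first appearance (3, 1, 2)
groupby(df, :x, sort = true)  # opt in to sorting the groups by key (1, 2, 3)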
There's a lot to like about this PR. The problem is well written up. The added code has good comments and tests. Docstrings are updated. My main hesitation is that there's a lot of code churn, and the code looks more complicated. The basic grouping code needs work to improve performance (probably a rewrite). We are at least a factor of ten slower than R's data.table and dplyr packages. This PR doesn't significantly improve grouping speeds based on this code. My review/opinion here covers the grouping part of the code and not joins.
@tshort Thanks for the review!
- Add compatibility with pre-contrasts ModelFrame constructor (JuliaData#1042)
- …ise for speed improvement (JuliaData#1070)
- Completely remove support for DataArrays. This depends on PRs moving these into NullableArrays.jl. Also use isequal() instead of ==, as the latter is in Base and unlikely to change its semantics.
- groupby() did not follow the order of levels and wasn't robust to reordering of levels; add tests for corner cases.
- Use the fallbacks for now; they should be added back after JuliaData/CategoricalArrays.jl#12 is fixed.
- Hash rows on the fly so that no DataFrameRow object needs to be created.
- Add RowGroupDict, which implements memory-efficient hashing of data frame rows, and its methods (see the sketch after this list).
- Don't encode the indexing columns; use DataFrameRow hashes instead. Do only the parts of left-right row matching required for a particular join kind. Avoid vcat(), which is very slow for [Nullable]CategoricalVector. join() now respects left-frame order for all join kinds, so the tests/data.jl tests were updated.
- The sorting order is changed from nulls first to nulls last (matching the default data frame sorting).
- By default no sorting is applied, to preserve the original ordering (the initial order of the first rows is preserved) and to make things faster.
- Refactor unsafe_hashindex() for data frame elements into hash_colel(), which is marked with @propagate_inbounds.
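A minimal sketch (an assumption, not the PR's actual RowGroupDict code) of the column-oriented row hashing these commits describe: hashes are mixed column by column for cache-friendly access, so no DataFrameRow object is needed.

# accumulate each row's hash one column at a time
function hashrows_sketch(columns::Vector)
    n = length(columns[1])
    h = fill(zero(UInt), n)
    for col in columns          # outer loop over columns...
        for i in 1:n            # ...inner loop over the rows of one column
            h[i] = hash(col[i], h[i])  # mix this column's value into row i's hash
        end
    end
    return h
end

hashrows_sketch(Any[[1, 2, 1], ["a", "b", "a"]])  # rows 1 and 3 get equal hashes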
Did you intend to revive the PR?
@nalimilan It's not so dead, I've rebased it recently. :) I was just waiting for the Nullable transition to be finished.
AFAIK it's mostly done for a long time now; what's needed is the high-level API (StructuredQueries) to make it easier to use. Regarding support in NullableArrays, I will comment on the other PR.
@alyst The rebase has gone wrong; it's very hard to see your changes now. EDIT: of course, that's because master switched to the old DataArrays-based branch. I guess the best thing to do is to move the PR to DataTables now. Can you tell us more about the groupby algorithm you've adopted here? Cf. the discussion at JuliaData/DataTables.jl#3. Also, what changes do you need in NullableArrays exactly?
@nalimilan I must admit I was not following the recent developments in DataFrames.jl, the introduction of DataTables.jl, etc. IIUC, the master branch was replaced at some point; that's why the merge has conflicts. I can try rebasing the PR again, if that would have any value given JuliaData/DataTables.jl#3. The grouping algorithm (it's also used for joining and for finding duplicate rows) uses a dedicated hash table that is optimized for data frames: hashes are generated along the columns for more efficient memory access. The custom hash table solves the problem of generating unique indices for rows when doing multi-column groups/joins on real data; the master implementation was "naive" and often led to integer overflows or tried to allocate insane amounts of memory. Unfortunately, I don't recall now whether the changes in Nullable/Categorical were required to make the current tests pass or whether they addressed some bugs I discovered while using the PR on my data. Also, at the time I rebased this PR onto the NullableArrays-using master, there was some API polishing going on. Maybe all the problems are naturally resolved now.
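A toy illustration of how such row hashes can drive grouping (a sketch under the same assumptions as hashrows_sketch above, not the PR's implementation; a real version must also compare rows to resolve hash collisions):

function group_indices_sketch(rowhashes::Vector{UInt})
    groups = Dict{UInt, Int}()                  # row hash -> group index
    gix = Vector{Int}(undef, length(rowhashes))
    for (i, h) in enumerate(rowhashes)
        # the first time a hash is seen, it gets the next free group index
        gix[i] = get!(groups, h, length(groups) + 1)
    end
    return gix, length(groups)
end

group_indices_sketch(hashrows_sketch(Any[[1, 2, 1], ["a", "b", "a"]]))  # ([1, 2, 1], 2)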
What happened is that the master branch was moved to DataTables. AFAIK, the current code was inspired by Pandas, except that it didn't implement the code to avoid overflow. Do you think your code is as fast as Pandas, including when all input variables are categorical?
@nalimilan I'm not a Python/Pandas user, so I cannot comment on the performance comparison. It could be that for single-column joins the current code is faster (it's hard to beat, because it uses the column values directly for indexing), but I spent some time optimizing the code for more complex "real life" scenarios. There are specialized versions of column hashing for categorical and nullable arrays. However, to make joins between nullable/non-nullable and categorical/non-categorical columns work correctly, the hashes have to use the underlying values.
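A small check illustrating the point about hashing underlying values, assuming (as CategoricalArrays is expected to guarantee for isequal consistency) that a categorical value hashes like the raw value it wraps:

using CategoricalArrays

plain = ["a", "b", "a"]
cat = categorical(plain)    # same values, categorical storage

hash.(plain) == hash.(cat)  # expected true: hashes depend on values, not on ref codes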
The PR has now been merged in DataTables (JuliaData/DataTables.jl#17). |
JuliaData/DataTables.jl#17 contains many improvements over this PR, so if there is a need to introduce these changes to DataFrames, porting them from DataTables would be the better route.
The PR addresses four limitations of the current joining and grouping implementation, among them:
- for the :inner, :semi and :anti join kinds, full matching of the left and right frames is done, which is not necessary;
- slow handling of PooledDataVector columns with large pools, due to slow setindex!(PooledDataArray).

Here are simple benchmarking scripts to test the performance.
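The scripts themselves were not preserved here; below is a hedged sketch of that kind of benchmark, using the join API of that era (current DataFrames.jl spells it innerjoin):

using DataFrames

make_table(n, nkeys) = DataFrame(key = rand(1:nkeys, n), val = rand(n))

left  = make_table(1_000_000, 10_000)   # hypothetical sizes
right = make_table(100_000, 10_000)

# era API; on current DataFrames.jl use: innerjoin(left, right, on = :key)
@time join(left, right, on = :key, kind = :inner)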
For the simple cases the PR has almost the same join performance as the current implementation (the increased GC overhead is likely due to the allocation of additional arrays that store the row order in the resulting frame):
Current times: (timing output not preserved)
PR times: (timing output not preserved)
However, if in the previous random-table generation test some columns are converted into pooled vectors, the current implementation fails (an InexactError is a sign that the group index exceeded typemax(eltype(pooled_vector.refs))), while for the PR the times are: (timing output not preserved)
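Why the InexactError arises, in minimal form (a sketch of the mechanism, not the actual code path): the group index is stored into the pooled vector's refs array and overflows its element type.

refs_eltype = UInt8                 # e.g. a small pool stores refs as UInt8
group_index = 300                   # more groups than typemax(UInt8) == 255
convert(refs_eltype, group_index)   # throws InexactError, as reported above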
The PR wins if the frames contain, e.g., PooledDataVector columns with sufficiently large pools:
Current implementation: (timing output not preserved)
PR: (timing output not preserved)