
Enhance joining and grouping #850

Closed · wants to merge 110 commits

Conversation

@alyst (Contributor) commented Aug 7, 2015

This PR addresses four limitations of the current joining and grouping implementation:

  • the way row groups are indexed for multi-column joining/grouping is very sparse (i.e. many usable indices have no rows assigned), which can lead to integer overflows even for medium-sized frames
  • the row order in a left-joined or outer-joined frame doesn't match the row order in the left frame (which is usually desirable to maintain)
  • for any join kind, including :inner, :semi, and :anti, full matching of the left and right frames is performed, which is not necessary
  • poor performance for frames containing PooledDataVector columns with large pools, due to slow setindex!(PooledDataArray)
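To make the first limitation concrete, here's a sketch (illustrative numbers, not taken from the PR) of why a product-of-cardinalities index space overflows machine integers long before the number of observed groups does:

```julia
# Sketch of the sparse multi-column group indexing described above:
# the "naive" scheme draws group indices from a space whose size is the
# *product* of the per-column cardinalities, even though only a tiny
# fraction of those indices ever has rows assigned.
ncols = 8
cardinality = 10_000                  # distinct values per indexing column
index_space = big(cardinality)^ncols  # 10^32 possible group indices
index_space > big(typemax(Int64))     # true: the index space overflows
                                      # Int64 regardless of the row count
```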

Here's a simple benchmarking script to test the performance:

using Distributions, DataFrames

# Build a frame of nrow random rows; each column samples from the given
# values and keeps the pooled-ness of its source vector
function random_frame(nrow::Int, col_values::Dict{Symbol, Any})
  DataFrame(Any[isa(col_values[key], PooledDataArray) ?
                @pdata(sample(col_values[key], nrow)) :
                @data(sample(col_values[key], nrow)) for key in keys(col_values)],
            keys(col_values) |> collect)
end

# Generate left/right frames that share the value sets of the on-columns,
# then join them
function random_join(kind::Symbol, nrow_left::Int, nrow_right::Int,
                     on_col_values::Dict{Symbol, Any},
                     left_col_values::Dict{Symbol, Any},
                     right_col_values::Dict{Symbol, Any})
  dfl = random_frame(nrow_left, merge(on_col_values, left_col_values))
  dfr = random_frame(nrow_right, merge(on_col_values, right_col_values))
  join(dfl, dfr, on = keys(on_col_values) |> collect, kind = kind)
end

function f(n::Int)
  for i in 1:n
    r = random_join(:outer, 1000, 2000,
                Dict{Symbol,Any}(:A => 1:10, :B => @data([:A, :B, :C, :D]),
                                 :C => 1:10, :D => 1:10),
                Dict{Symbol,Any}(:E => 1:10, :F => @data([:A, :B, :C, :D])),
                Dict{Symbol,Any}(:G => 1:10, :H => @data([:A, :B, :C, :D])))
  end
end

f(1)

@time f(100)

For simple cases the PR has almost the same join performance as the current implementation (the increased GC overhead is likely due to allocating the additional arrays that store the row order of the resulting frame):

Current times:

  7.196765 seconds (60.15 M allocations: 1.920 GB, 3.94% gc time)
  7.134183 seconds (60.16 M allocations: 1.920 GB, 4.01% gc time)
  7.180465 seconds (60.15 M allocations: 1.920 GB, 3.95% gc time)

PR times:

  7.460440 seconds (139.77 M allocations: 3.300 GB, 5.42% gc time)
  7.377374 seconds (139.68 M allocations: 3.298 GB, 5.28% gc time)
  7.340252 seconds (139.88 M allocations: 3.301 GB, 5.60% gc time)

However, if some columns in the previous random-table test are converted into pooled vectors, the current implementation fails:

function g(n::Int)
  for i in 1:n
    r = random_join(:outer, 1000, 2000,
                Dict{Symbol,Any}(:A => 1:10, :B => @data([:A, :B, :C, :D]),
                                 :C => @pdata(1:10), :D => 1:10),
                Dict{Symbol,Any}(:E => 1:10, :F => @pdata([:A, :B, :C, :D])),
                Dict{Symbol,Any}(:G => 1:10, :H => @data([:A, :B, :C, :D])))
  end
end
julia> g(1)
ERROR: InexactError()
 in setindex! at ./array.jl:303

(The InexactError indicates that a group index exceeded typemax(eltype(pooled_vector.refs)).)
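This failure mode can be reproduced in miniature with a plain UInt8 vector standing in for pooled_vector.refs:

```julia
# Writing a group index above typemax of the ref element type fails on
# conversion, which is exactly the InexactError seen above.
refs = zeros(UInt8, 4)       # stand-in for pooled_vector.refs
caught = try
    refs[1] = 300            # 300 > typemax(UInt8) == 255
    nothing
catch err
    err
end
caught isa InexactError      # true
```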

The PR's times for the same test are:

  7.760247 seconds (139.69 M allocations: 3.310 GB, 5.17% gc time)
  7.768788 seconds (139.61 M allocations: 3.308 GB, 5.11% gc time)
  7.776039 seconds (139.76 M allocations: 3.310 GB, 5.18% gc time)

The PR wins if the frames contain e.g. PooledDataVector columns with sufficiently large pools:

function h(n::Int)
  for i in 1:n
    r = random_join(:outer, 10000, 20000,
                    Dict{Symbol,Any}(:A => @pdata(1:10000)),
                    Dict{Symbol,Any}(:B => @pdata(1:10000)),
                    Dict{Symbol,Any}(:C => @pdata(1:10000)))
  end
end

h(1)

@time h(100)

Current implementation:

 19.152983 seconds (63.97 M allocations: 2.064 GB, 1.13% gc time)
 19.126810 seconds (63.96 M allocations: 2.064 GB, 1.11% gc time)
 19.198262 seconds (63.97 M allocations: 2.064 GB, 1.02% gc time)

PR:

  5.145708 seconds (87.02 M allocations: 2.756 GB, 6.64% gc time)
  5.368688 seconds (87.02 M allocations: 2.755 GB, 6.75% gc time)
  5.236543 seconds (87.00 M allocations: 2.741 GB, 6.99% gc time)

@alyst force-pushed the enhance_join branch 4 times, most recently from 9d29327 to df8411e (August 7, 2015 21:00)
@alyst mentioned this pull request (Aug 22, 2015)
@alyst force-pushed the enhance_join branch 2 times, most recently from e697c4b to 4772d3b (August 26, 2015 15:57)
@alyst (Contributor, Author) commented Aug 26, 2015

I've updated the PR. The newer version should be faster and use less memory. Now _RowGroupDict implements its own memory-efficient hashing of rows instead of using Dict{DataFrameRow, Int}.

As requested by @matthieugomez, here's a small benchmark for groupby()

using RDatasets

diamonds = dataset("ggplot2", "diamonds");
#diamonds[:Cut] = convert(PooledDataArray{ASCIIString, UInt}, diamonds[:Cut])
diamonds[:Cut] = convert(DataArray{ASCIIString}, diamonds[:Cut]);
diamonds[:Clarity] = convert(DataArray{ASCIIString}, diamonds[:Clarity]);
f(n) = for i in 1:n groupby(diamonds, [:Clarity, :Carat]) end;
f(1)
Profile.clear_malloc_data()
@time f(100)

(Without the :Cut column conversion, groupby() fails on master.)

DataFrames.jl master:

  2.258086 seconds (37.01 M allocations: 783.709 MB, 2.24% gc time)

This PR:

  1.833439 seconds (26.86 M allocations: 591.702 MB, 2.00% gc time)

In this example the new implementation is both faster and more memory-efficient. But I don't think the reported numbers reflect the actual efficiency of the implementations. The enormous number of allocations is probably due to element indexing methods etc. Fixing just JuliaStats/DataArrays.jl#163 already reduced the number of allocations by 50%, and there should be other hotspots. In reality, the difference should be on the order of a few hundred allocations.

Note also that doing this on master (after converting the respective columns' ref types to UInt):

groupby(diamonds, [:Cut, :Carat, :Clarity, :Depth, :Table, :Price, :Color])

could even crash some systems. The problem is that during grouping the estimated number of groups (ngroups) grows very large. The current implementation calls DataArrays.groupsort_indexer() to throw away unused group indices, and it tries to allocate a vector of length ngroups. So we hit the limitations of the current implementation well before overflowing UInt (see also #862). One potential fix would be to call groupsort_indexer() after each new column is added to the index, with the obvious impact on performance.
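That per-column compaction fix could look roughly like this (a hypothetical helper, not the actual DataArrays.groupsort_indexer() code): after each indexing column is added, remap the group codes to a dense 1..n range so ngroups tracks observed combinations rather than the product of cardinalities:

```julia
# Remap arbitrary group codes to a dense 1..n range; returns the number
# of distinct groups actually observed.
function compact!(codes::Vector{Int})
    remap = Dict{Int,Int}()
    for (i, c) in enumerate(codes)
        codes[i] = get!(remap, c, length(remap) + 1)
    end
    return length(remap)
end

codes = [1_000_003, 7, 1_000_003, 42]   # sparse codes from a wide index space
ngroups = compact!(codes)               # codes becomes [1, 2, 1, 3]; ngroups == 3
```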

@alyst alyst force-pushed the enhance_join branch 5 times, most recently from 08aec90 to 8802e10 Compare August 30, 2015 21:30
@alyst (Contributor, Author) commented Aug 30, 2015

For big frames, sorting the row groups can take quite some time. Also, IMHO it's more logical to preserve the original row order as much as possible by default. So I've added a sort= option (disabled by default) to by() and groupby().
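The behavioral difference can be illustrated without DataFrames (hypothetical key column; first-occurrence order is the new default, sorted order is what sort = true yields):

```julia
# Group key order under the two modes, simulated on a plain vector.
keys_col = [3, 1, 3, 2]
firstseen = unique(keys_col)   # [3, 1, 2]: groups in order of first occurrence
sorted = sort(firstseen)       # [1, 2, 3]: groups sorted by key (sort = true)
```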

@matthieugomez (Contributor)

Great. BTW, dplyr sorts and data.table does not.

@tshort (Contributor) commented Nov 19, 2015

There's a lot to like about this PR. The problem is well written up. The added code has good comments and tests. Docstrings are updated.

My main hesitation is that there's a lot of code churn, and the code looks more complicated. The basic grouping code needs work to improve performance (probably a rewrite): we are at least a factor of ten slower than R's data.table and dplyr packages. This PR doesn't significantly improve grouping speeds based on this code. My review/opinion here is on the grouping part of the code, not the joins.

@alyst (Contributor, Author) commented Nov 20, 2015

@tshort Thanks for the review!
I think that to move forward we would also need benchmarks for basic functions like getindex()/setindex!() in DataArrays/NullableArrays (it would also make sense to compare the results with data.table/dplyr). The suspiciously high number of allocations in the @time output (see above), both for master and for this PR, suggests there could be type-stability problems upstream.

Gord Stephen and others added 8 commits September 14, 2016 10:13
…iaData#1042)

Add compatibility with pre-contrasts ModelFrame constructor
Completely remove support for DataArrays.
This depends on PRs moving these into NullableArrays.jl.
Also use isequal() instead of ==, as the latter is in Base and
unlikely to change its semantics.
groupby() did not follow the order of levels, and wasn't robust to reordering
levels. Add tests for corner cases.
Use the fallbacks for now, should be added back after
JuliaData/CategoricalArrays.jl#12 is fixed.
so that DataFrameRow object doesn't need to be created
RowGroupDict that implements memory-efficient hashing of
the data frame rows and its methods
- don't encode the indexing columns, use DataFrameRow hashes instead
- do only the parts of left-right rows matching that are required for a
  particular join kind
- avoid vcat() that is very slow for [Nullable]CategoricalVector
- now join respects left-frame order for all join kinds, so the
  tests/data.jl test were updated
sorting order is changed from null first to null last (it matches
the default data frame sorting)
by default no sorting is applied to preserve original ordering
(the initial order of the 1st rows is preserved) and make things faster
refactor unsafe_hashindex() for data frame elements into hash_colel()
that is marked with @propagate_inbounds
@nalimilan (Member)

Did you intend to revive the PR?

@alyst (Contributor, Author) commented Jan 22, 2017

@nalimilan It's not so dead; I've rebased it recently. :) I was just waiting for the NullableArrays work to settle down before thinking about merging this PR again. It relies on e.g. JuliaStats/NullableArrays.jl#158, which was not merged because of the anticipated support for lifting in Base.
I'm not up to date with the Nullable progress; is it mostly done?

@nalimilan (Member)

I'm not up-to-date with the Nullable progress, is it mostly done?

AFAIK it's been mostly done for a long time now; what's needed is the high-level API (StructuredQueries) to make it easier to use.

Regarding support in NullableArrays, I will comment on the other PR.

@nalimilan (Member) commented Feb 18, 2017

@alyst The rebase has gone wrong; it's very hard to see your changes now. EDIT: of course, that's because master switched to the old DataArrays-based branch. I guess the best thing to do is to move the PR to DataTables now.

Can you tell us more about the groupby algorithm you've adopted here? Cf. discussion at JuliaData/DataTables.jl#3.

Also, what changes do you need in NullableArrays exactly?

@alyst (Contributor, Author) commented Feb 18, 2017

@nalimilan I must admit I was not following the recent developments in DataFrames.jl, the introduction of DataTables.jl, etc. IIUC, the master branch was replaced at some point; that's why the merge has conflicts. I can try rebasing the PR again if it would have any value given JuliaData/DataTables.jl#3.

The grouping algorithm (also used for joining and for finding duplicate rows) uses a dedicated hash table optimized for data frames: hashes are generated along the columns for more efficient memory access. The custom hash table solves the problem of generating unique row indices when doing multi-column groups/joins on real data. The master implementation was "naive" and often led to integer overflows or tried to allocate insane amounts of memory.
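A much-simplified sketch of the column-wise hashing idea (hypothetical code, not the PR's RowGroupDict implementation):

```julia
# Accumulate row hashes one column at a time: the outer loop walks
# columns (contiguous memory), and the inner loop mixes each column's
# value into the running hash of its row.
function hashrows(cols)
    n = length(cols[1])
    h = fill(zero(UInt), n)
    for col in cols
        for i in 1:n
            h[i] = hash(col[i], h[i])
        end
    end
    return h
end

h = hashrows(([1, 1, 2], ["a", "b", "a"]))
# rows (1,"a"), (1,"b"), (2,"a") get distinct hashes
# (with overwhelming probability)
```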

Unfortunately, I no longer recall whether the changes in Nullable/Categorical were required to make the current tests pass, or whether they addressed bugs I discovered while using the PR on my own data. Also, at the time I rebased this PR onto the NullableArrays-based master, there was some API polishing going on. Maybe all the problems have resolved themselves by now.

@nalimilan (Member)

What happened is that the master branch was moved to DataTables. So if you do git format-patch and then git am in DataTables, it should work (maybe after replacing DataFrame with DataTable and `df` with `dt` in the resulting patch).

AFAIK, the current code was inspired by Pandas, except that it didn't implement the code to avoid overflow. Do you think your code is as fast as Pandas, including when all input variables are categorical?

@alyst (Contributor, Author) commented Feb 19, 2017

@nalimilan I'm not a Python/Pandas user, so I can't comment on the performance comparison. It could be that for single-column joins the current code is faster (it's hard to beat, because it directly uses the column values for indexing), but I spent some time optimizing the code for more complex "real life" scenarios. There are specialized versions of column hashing for categorical and nullable arrays. However, to make joins between nullable/non-nullable and categorical/non-categorical columns work correctly, the hashes have to use the underlying values.
As discussed above, there were some type-inference problems in DataArrays.jl, so subtle changes in how the values are accessed had a big impact on performance. That was also one of the things I took into account, though for NullableArrays it might be different.
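The point about hashing the underlying values can be sketched like this (hypothetical pooled representation, not the actual array types):

```julia
# A pooled column and a plain column holding the same data must produce
# identical row hashes, or cross-representation joins won't match rows.
pool = ["a", "b"]
refs = [1, 2, 1]                # pooled column encoding "a", "b", "a"
plain = ["a", "b", "a"]
pooled_hashes = [hash(pool[r]) for r in refs]  # hash the value, not the ref
plain_hashes = [hash(v) for v in plain]
pooled_hashes == plain_hashes   # true
```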

@nalimilan (Member)

The PR has now been merged in DataTables (JuliaData/DataTables.jl#17).

@alyst (Contributor, Author) commented Mar 6, 2017

JuliaData/DataTables.jl#17 contains many improvements over this PR, so if there is a need to introduce these changes into DataFrames.jl, JuliaData/DataTables.jl#17 should be used as the reference. Closing.
