-
Notifications
You must be signed in to change notification settings - Fork 11
Conversation
and sorry for hijacking your pull request from DataFrames @alyst. Feel free to fork this and resubmit the pull request under your account if you'd like! |
@cjprybol Great, thanks a lot for transferring the code to DataTables and for the benchmarks! I've looked at it briefly and it LGTM. Do you also plan to cherry pick the other uses of row grouping, i.e. joins (which was my original motivation) and |
Yes, thanks for mentioning those (joins & julia> dt1 = DataTable(a = shuffle!([1:10;]),
b = [:A,:B][rand(1:2, 10)],
v1 = randn(10))
10×3 DataTables.DataTable
│ Row │ a │ b │ v1 │
├─────┼────┼────┼────────────┤
│ 1 │ 9 │ :A │ -1.72976 │
│ 2 │ 5 │ :A │ 0.795949 │
│ 3 │ 1 │ :A │ 0.670062 │
│ 4 │ 7 │ :B │ 0.550852 │
│ 5 │ 4 │ :A │ -0.0633746 │
│ 6 │ 10 │ :B │ 1.33694 │
│ 7 │ 6 │ :B │ -0.0731486 │
│ 8 │ 2 │ :B │ -0.745464 │
│ 9 │ 8 │ :A │ -1.22006 │
│ 10 │ 3 │ :B │ -0.0531773 │
julia> dt2 = DataTable(a = shuffle!(reverse([1:5;])),
b2 = [:A,:B,:C][rand(1:3, 5)],
v2 = randn(5))
5×3 DataTables.DataTable
│ Row │ a │ b2 │ v2 │
├─────┼───┼────┼───────────┤
│ 1 │ 1 │ :A │ 0.0704676 │
│ 2 │ 5 │ :A │ 0.341794 │
│ 3 │ 2 │ :C │ 1.73517 │
│ 4 │ 3 │ :A │ 1.29992 │
│ 5 │ 4 │ :B │ 0.206364 │
julia> m1 = join(dt1, dt2, on = :a)
0×5 DataTables.DataTable I merged your join.jl as seen here. Any idea what I'm missing? |
@cjprybol Strange. Could you please give me the link to the Git branch of DataTables with joins merged so that I can try debugging it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you both, that sounds really cool. I have reviewed what I could, without really diving into the details of the algorithm.
The benchmarks are really good, I wonder how they compare to e.g. Pandas. There just one case where the number and amount of allocations increased, though, and it would be good to check that it's not because of an easy to fix mistake:
julia> @benchmark groupby(dt1, [:v1, :v2])
#master
memory estimate: 1.22 MiB
allocs estimate: 66536
median time: 3.547 ms (0.00% GC)
#3
memory estimate: 10.68 MiB
allocs estimate: 476970
median time: 19.039 ms (0.00% GC)
#this pr/DataFrames:850
memory estimate: 3.33 MiB
allocs estimate: 115865
median time: 2.119 ms (0.00% GC)
src/datatablerow/datatablerow.jl
Outdated
isnull(v, i) ? hash(NULL_MAGIC, h) : hash(get(v[i]), h) | ||
Base.@propagate_inbounds hash_colel{T}(v::AbstractCategoricalArray{T}, i, h::UInt = zero(UInt)) = | ||
hash(CategoricalArrays.index(v.pool)[v.refs[i]], h) | ||
Base.@propagate_inbounds function hash_colel{T}(v::AbstractCategoricalArray{T}, i, h::UInt = zero(UInt)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AbstractCategoricalArray
should be AbstractNullableCategoricalArray
AFAICT. That also means a test is missing to catch this.
src/datatablerow/datatablerow.jl
Outdated
Base.@propagate_inbounds hash_colel{T}(v::NullableArray{T}, i, h::UInt = zero(UInt)) = | ||
isnull(v, i) ? hash(NULL_MAGIC, h) : hash(get(v[i]), h) | ||
Base.@propagate_inbounds hash_colel{T}(v::AbstractCategoricalArray{T}, i, h::UInt = zero(UInt)) = | ||
hash(CategoricalArrays.index(v.pool)[v.refs[i]], h) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure this is really more efficient than the default hash
method for CategoricalValue
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it's not as efficient. But in these functions the constraint is to make hashes invariant to the hashed value representation: whether it's nullable or not and whether it's stored "as is" or in a categorical array. Otherwise joins would not work (we may require that joins only use the columns of identical types, but that would result in too much overhead on the user side). So we have to check if the default hash functions have this property (Nullable
AFAIR is not).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be surprising that it would be significantly slower, since the code is very similar. Though since we need the special method for NullableCategoricalArray
to avoid the cost of creating a Nullable
just to unwrap it, I guess it doesn't matter too much what we do here.
src/abstractdatatable/sort.jl
Outdated
Base.sortperm(dt::AbstractDataTable, a::Algorithm, o::Ordering) = sortperm(dt, a, DTPerm(o,dt)) | ||
|
||
# Extras to speed up sorting | ||
#Base.sortperm{V}(dt::AbstractDataTable, a::Algorithm, o::FastPerm{Sort.ForwardOrdering,V}) = sortperm(o.vec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this commented out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't recall now. Maybe something to do with methods ambiguity, but worth rechecking now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FastPerm
is part of DataArrays
so these (this and below) aren't applicable. I'll remove them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalimilan see here. I'll add these back
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, funny how you can forget about issues. I'll try to add these.
src/abstractdatatable/sort.jl
Outdated
#Base.sortperm{V}(dt::AbstractDataTable, a::Algorithm, o::FastPerm{Sort.ReverseOrdering,V}) = reverse(sortperm(o.vec)) | ||
|
||
# permute rows | ||
function Base.permute!(dt::AbstractDataTable, p::AbstractVector) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a docstring to explain that permutation is applied to rows?
src/datatablerow/datatablerow.jl
Outdated
Base.collect(r::DataTableRow) = Tuple{Symbol, Any}[x for x in r] | ||
|
||
# the equal elements of nullable and normal arrays would have the same hashes | ||
const NULL_MAGIC = 0xBADDEED # what to hash if the element is null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's already a hash for null values in base/nullable.jl
:
const nullablehash_seed = UInt === UInt64 ? 0x932e0143e51d0171 : 0xe51d0171
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I missed that. Then we should use nullablehash_seed
instead NULL_MAGIC
.
src/datatablerow/utils.jl
Outdated
(gd.dt === dt) && return row # same frame, return itself | ||
# different frames, content matching required | ||
rhash = rowhash(dt, row) | ||
szm1 = length(gd.gslots)-1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again move inside the loop.
src/datatablerow/utils.jl
Outdated
return 0 # not found | ||
end | ||
|
||
# Finds indices of rows in 'gd' that match given row by content. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finds -> Find
returns -> return
src/datatablerow/utils.jl
Outdated
# returns empty set if no row matches | ||
function Base.get(gd::RowGroupDict, dt::DataTable, row::Int) | ||
g_row = findrow(gd, dt, row) | ||
(g_row == 0) && return Compat.view(gd.rperm, 0:-1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for Compat
now (same below).
src/groupeddatatable/grouping.jl
Outdated
dt_groups = group_rows(sdt) | ||
# sort the groups | ||
if sort | ||
group_perm = sortperm(sub(sdt, dt_groups.rperm[dt_groups.starts])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sub -> view
src/groupeddatatable/grouping.jl
Outdated
if sort | ||
group_perm = sortperm(sub(sdt, dt_groups.rperm[dt_groups.starts])) | ||
permute!(dt_groups.starts, group_perm) | ||
permute!(dt_groups.stops, group_perm) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second call could be Base.permute!!
since the second argument isn't used after that?
I just pushed the EDIT: I added @alyst as a collaborator to my DataTables fork so I think everyone has the ability to write to this pull request |
src/datatablerow/datatablerow.jl
Outdated
isequal_colel(a::Any, b::Nullable) = isequal_colel(b, a) | ||
isequal_colel(a::Nullable, b::Nullable) = isnull(a)==isnull(b) && (isnull(a) || isequal(get(a), get(b))) | ||
isequal_colel(a::Nullable, b::Nullable) = isnull(a)==isnull(b) && (isnull(a) || isequal(a, b)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose here it could be just
isequal_colel(a::Nullable, b::Nullable) = isequal(a, b)
src/datatablerow/datatablerow.jl
Outdated
isnull(v, i) ? hash(Base.nullablehash_seed, h) : hash(v.values[i], h) | ||
Base.@propagate_inbounds hash_colel{T}(v::AbstractCategoricalArray{T}, i, h::UInt = zero(UInt)) = | ||
hash(CategoricalArrays.index(v.pool)[v.refs[i]], h) | ||
Base.@propagate_inbounds function hash_colel{T}(v::AbstractNullableCategoricalArray{T}, i, h::UInt = zero(UInt)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to add a test to distinguish this from above function per now out-of-date comment
AbstractCategoricalArray
should beAbstractNullableCategoricalArray
AFAICT. That also means a test is missing to catch this.
src/abstractdatatable/join.jl
Outdated
@assert length(left_ixs) == length(right_ixs) | ||
# compose left half of the result taking all left columns | ||
# FIXME is it still relevant? complicated way to do vcat that avoids expensive setindex!() for PooledDataVector | ||
all_orig_left_ixs = [left_ixs.orig; leftonly_ixs.orig] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to change this to vcat
or keep it and remove the # FIXME
? These might not be the best tests but I'm not seeing any noticable performance difference so maybe this consideration doesn't apply to the CategoricalArray
s that replaced PooledDataVector
?
x = rand(1000); y = rand(100)
julia> @benchmark vcat(x, y)
BenchmarkTools.Trial:
memory estimate: 8.75 KiB
allocs estimate: 1
--------------
minimum time: 587.663 ns (0.00% GC)
median time: 1.007 μs (0.00% GC)
mean time: 1.967 μs (33.84% GC)
maximum time: 17.909 μs (92.71% GC)
--------------
samples: 10000
evals/sample: 181
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark [x; y]
BenchmarkTools.Trial:
memory estimate: 8.75 KiB
allocs estimate: 1
--------------
minimum time: 599.674 ns (0.00% GC)
median time: 1.004 μs (0.00% GC)
mean time: 1.892 μs (33.37% GC)
maximum time: 23.995 μs (76.91% GC)
--------------
samples: 10000
evals/sample: 175
time tolerance: 5.00%
memory tolerance: 1.00%
x = CategoricalArray(rand(1000)); y = CategoricalArray(rand(100))
julia> @benchmark vcat(x, y)
BenchmarkTools.Trial:
memory estimate: 494.70 KiB
allocs estimate: 2834
--------------
minimum time: 340.367 μs (0.00% GC)
median time: 382.938 μs (0.00% GC)
mean time: 492.088 μs (13.61% GC)
maximum time: 5.615 ms (79.17% GC)
--------------
samples: 9981
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
julia> @benchmark [x; y]
BenchmarkTools.Trial:
memory estimate: 494.70 KiB
allocs estimate: 2834
--------------
minimum time: 341.812 μs (0.00% GC)
median time: 380.336 μs (0.00% GC)
mean time: 496.584 μs (14.43% GC)
maximum time: 5.783 ms (79.25% GC)
--------------
samples: 9897
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIR setindex(PooledDataVector)
was very slow, because it did linear search in the pool for the value being assigned. It should not be the case for CategoricalArray. The problem is that now I don't remember anymore how the simple version should look like. :) Need to get into the code once again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vcat
should be fast for CategoricalArray
(and for PooledDataArray
now too), it works directly with integer codes. (Though it still allocates copies of original codes before concatenating them, which could be improved for the simple 1-D case.)
src/abstractdatatable/join.jl
Outdated
dtr_noon = without(joiner.dtr, joiner.on_cols) | ||
# FIXME is it still relevant? complicated way to do vcat that avoids expensive setindex!() for PooledDataVector | ||
# permutation to swap rightonly and leftonly rows | ||
right_perm = [1:length(right_ixs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here re: # FIXME
I think I've worked everything out aside from interactions with julia> Pkg.test("DataTables")
INFO: Computing test dependencies for DataTables...
INFO: Installing Atom v0.5.9
INFO: Installing Blink v0.5.1
INFO: Installing CodeTools v0.4.3
INFO: Installing Codecs v0.2.0
INFO: Installing DataArrays v0.3.12
INFO: Installing HttpCommon v0.2.6
INFO: Installing HttpParser v0.2.0
INFO: Installing HttpServer v0.1.7
INFO: Installing LNR v0.0.2
INFO: Installing LaTeXStrings v0.2.0
INFO: Installing Lazy v0.11.5
INFO: Installing MbedTLS v0.4.3
INFO: Installing Mustache v0.1.3
INFO: Installing Mux v0.2.3
INFO: Installing RData v0.0.4
INFO: Installing RDatasets v0.2.0
INFO: Installing WebSockets v0.2.1
INFO: Building HttpParser
INFO: Building Homebrew
Already up-to-date.
INFO: Building MbedTLS
Using system libraries...
INFO: Testing DataTables
Running tests:
PASSED: utils.jl
PASSED: cat.jl
WARNING: using DataTables.sub in module TestData conflicts with an existing identifier.
FAILED: data.jl
LoadError: Unordered CategoricalValue objects cannot be tested for order; use the ordered! function on the parent array to change this
in macro expansion; at /Users/Cameron/.julia/v0.5/DataTables/test/runtests.jl:40 [inlined]
in anonymous at ./<missing>:?
in include_from_node1(::String) at ./loading.jl:488
in include_from_node1(::String) at /Users/Cameron/julia/usr/lib/julia/sys.dylib:?
in process_options(::Base.JLOptions) at ./client.jl:262
in _start() at ./client.jl:318
in _start() at /Users/Cameron/julia/usr/lib/julia/sys.dylib:?
while loading /Users/Cameron/.julia/v0.5/DataTables/test/data.jl, in expression starting on line 113
PASSED: index.jl
FAILED: datatable.jl
LoadError: Unordered CategoricalValue objects cannot be tested for order; use the ordered! function on the parent array to change this
in macro expansion; at /Users/Cameron/.julia/v0.5/DataTables/test/runtests.jl:40 [inlined]
in anonymous at ./<missing>:?
in include_from_node1(::String) at ./loading.jl:488
in include_from_node1(::String) at /Users/Cameron/julia/usr/lib/julia/sys.dylib:?
in process_options(::Base.JLOptions) at ./client.jl:262
in _start() at ./client.jl:318
in _start() at /Users/Cameron/julia/usr/lib/julia/sys.dylib:?
while loading /Users/Cameron/.julia/v0.5/DataTables/test/datatable.jl, in expression starting on line 320
PASSED: datatablerow.jl
PASSED: io.jl
PASSED: constructors.jl
PASSED: conversions.jl
PASSED: sort.jl
PASSED: grouping.jl
FAILED: join.jl
LoadError: MethodError: no method matching resize!(::CategoricalArrays.NullableCategoricalArray{Int64,1,UInt8}, ::Int64)
Closest candidates are:
resize!(::Array{T,1}, ::Integer) at array.jl:510
resize!(::BitArray{1}, ::Integer) at bitarray.jl:687
resize!{T}(::NullableArrays.NullableArray{T,1}, ::Int64) at /Users/Cameron/.julia/v0.5/NullableArrays/src/primitives.jl:104
in macro expansion; at /Users/Cameron/.julia/v0.5/DataTables/test/runtests.jl:40 [inlined]
in anonymous at ./<missing>:?
in include_from_node1(::String) at ./loading.jl:488
in include_from_node1(::String) at /Users/Cameron/julia/usr/lib/julia/sys.dylib:?
in process_options(::Base.JLOptions) at ./client.jl:262
in _start() at ./client.jl:318
in _start() at /Users/Cameron/julia/usr/lib/julia/sys.dylib:?
while loading /Users/Cameron/.julia/v0.5/DataTables/test/join.jl, in expression starting on line 100
PASSED: iteration.jl
PASSED: duplicates.jl
PASSED: show.jl
ERROR: LoadError: "Tests failed"
in include_from_node1(::String) at ./loading.jl:488
in include_from_node1(::String) at /Users/Cameron/julia/usr/lib/julia/sys.dylib:?
in process_options(::Base.JLOptions) at ./client.jl:262
in _start() at ./client.jl:318
in _start() at /Users/Cameron/julia/usr/lib/julia/sys.dylib:?
while loading /Users/Cameron/.julia/v0.5/DataTables/test/runtests.jl, in expression starting on line 47
===========================================================================================[ ERROR: DataTables ]============================================================================================
failed process: Process(`/Users/Cameron/julia/usr/bin/julia -Cnative -J/Users/Cameron/julia/usr/lib/julia/sys.dylib --compile=yes --depwarn=yes --check-bounds=yes --code-coverage=none --color=yes --compilecache=yes /Users/Cameron/.julia/v0.5/DataTables/test/runtests.jl`, ProcessExited(1)) [1]
============================================================================================================================================================================================================
INFO: Removing Atom v0.5.9
INFO: Removing Blink v0.5.1
INFO: Removing CodeTools v0.4.3
INFO: Removing Codecs v0.2.0
INFO: Removing DataArrays v0.3.12
INFO: Removing HttpCommon v0.2.6
INFO: Removing HttpParser v0.2.0
INFO: Removing HttpServer v0.1.7
INFO: Removing LNR v0.0.2
INFO: Removing LaTeXStrings v0.2.0
INFO: Removing Lazy v0.11.5
INFO: Removing MbedTLS v0.4.3
INFO: Removing Mustache v0.1.3
INFO: Removing Mux v0.2.3
INFO: Removing RData v0.0.4
INFO: Removing RDatasets v0.2.0
INFO: Removing WebSockets v0.2.1
ERROR: DataTables had test errors
in #test#61(::Bool, ::Function, ::Array{AbstractString,1}) at ./pkg/entry.jl:740
in (::Base.Pkg.Entry.#kw##test)(::Array{Any,1}, ::Base.Pkg.Entry.#test, ::Array{AbstractString,1}) at ./<missing>:0
in (::Base.Pkg.Dir.##2#3{Array{Any,1},Base.Pkg.Entry.#test,Tuple{Array{AbstractString,1}}})() at ./pkg/dir.jl:31
in cd(::Base.Pkg.Dir.##2#3{Array{Any,1},Base.Pkg.Entry.#test,Tuple{Array{AbstractString,1}}}, ::String) at ./file.jl:59
in #cd#1(::Array{Any,1}, ::Function, ::Function, ::Array{AbstractString,1}, ::Vararg{Array{AbstractString,1},N}) at ./pkg/dir.jl:31
in (::Base.Pkg.Dir.#kw##cd)(::Array{Any,1}, ::Base.Pkg.Dir.#cd, ::Function, ::Array{AbstractString,1}, ::Vararg{Array{AbstractString,1},N}) at ./<missing>:0
in #test#3(::Bool, ::Function, ::String, ::Vararg{String,N}) at ./pkg/pkg.jl:258
in test(::String, ::Vararg{String,N}) at ./pkg/pkg.jl:258 |
Thank you both for persisting through this terribly long review! The features and benchmarks are great -- except of course for the last one which regresses. Since the number of allocations is higher, it probably comes from a type instability. The memory profiler should tell that. Note that AFAICT Now, to the next step... |
I've adapted the benchmarks a bit to run on DataFrames in order to compare the new DataTables with it. I've also dropped the "ordered" and "unordered" tables, which were actually all unordered and did not really correspond to a plausible scenario since they only contained unique values. The new code is here: https://gist.github.com/nalimilan/aa9391f204967adf70fce3700eab5885 Overall, DataTables master is faster, but the difference is not so large as in the comparison with the old DataTables, which suffered from major type instabilities. In some cases DataTable allocates more (e.g. # DataTables
julia> @benchmark groupby(small, [:A, :B])
BenchmarkTools.Trial:
memory estimate: 362.80 KiB
allocs estimate: 11016
--------------
minimum time: 263.435 μs (0.00% GC)
median time: 284.065 μs (0.00% GC)
mean time: 366.683 μs (18.16% GC)
maximum time: 9.307 ms (94.58% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
# DataFrames
julia> @benchmark groupby(small, [:A, :B])
BenchmarkTools.Trial:
memory estimate: 192.45 KiB
allocs estimate: 7186
--------------
minimum time: 525.938 μs (0.00% GC)
median time: 550.305 μs (0.00% GC)
mean time: 600.612 μs (3.22% GC)
maximum time: 5.727 ms (79.89% GC)
--------------
samples: 8227
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
# DataTables
julia> @benchmark groupby(dt1, [:v1, :v2])
BenchmarkTools.Trial:
memory estimate: 2.13 MiB
allocs estimate: 76276
--------------
minimum time: 1.434 ms (0.00% GC)
median time: 1.614 ms (0.00% GC)
mean time: 2.183 ms (20.18% GC)
maximum time: 12.800 ms (78.78% GC)
--------------
samples: 2266
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
# DataFrames
julia> @benchmark groupby(dt1, [:v1, :v2])
BenchmarkTools.Trial:
memory estimate: 606.95 KiB
allocs estimate: 28487
--------------
minimum time: 1.468 ms (0.00% GC)
median time: 1.537 ms (0.00% GC)
mean time: 1.602 ms (3.19% GC)
maximum time: 4.247 ms (46.88% GC)
--------------
samples: 3106
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
# DataTables
julia> @benchmark groupby(dt2, [:v1, :v2])
BenchmarkTools.Trial:
memory estimate: 600.27 KiB
allocs estimate: 61
--------------
minimum time: 294.550 μs (0.00% GC)
median time: 324.514 μs (0.00% GC)
mean time: 360.833 μs (5.49% GC)
maximum time: 2.328 ms (69.33% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
# DataFrames
julia> @benchmark groupby(dt2, [:v1, :v2])
BenchmarkTools.Trial:
memory estimate: 1.34 MiB
allocs estimate: 47387
--------------
minimum time: 1.619 ms (0.00% GC)
median time: 1.720 ms (0.00% GC)
mean time: 1.861 ms (6.07% GC)
maximum time: 6.255 ms (58.91% GC)
--------------
samples: 2672
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
# DataTables
julia> @benchmark groupby(dt3, [:v1, :v2, :v1_1, :v2_1])
BenchmarkTools.Trial:
memory estimate: 600.44 KiB
allocs estimate: 65
--------------
minimum time: 388.856 μs (0.00% GC)
median time: 425.296 μs (0.00% GC)
mean time: 456.659 μs (4.25% GC)
maximum time: 2.979 ms (65.61% GC)
--------------
samples: 10000
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
# DataFrames
julia> @benchmark groupby(dt3, [:v1, :v2, :v1_1, :v2_1])
BenchmarkTools.Trial:
memory estimate: 19.13 MiB
allocs estimate: 161096
--------------
minimum time: 11.457 ms (0.00% GC)
median time: 13.299 ms (11.49% GC)
mean time: 13.629 ms (8.97% GC)
maximum time: 27.229 ms (13.16% GC)
--------------
samples: 366
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
# DataTables
julia> @benchmark f()
BenchmarkTools.Trial:
memory estimate: 2.29 MiB
allocs estimate: 63263
--------------
minimum time: 3.119 ms (0.00% GC)
median time: 3.654 ms (0.00% GC)
mean time: 4.303 ms (9.84% GC)
maximum time: 16.701 ms (52.10% GC)
--------------
samples: 1157
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
# DataFrames
julia> @benchmark f()
BenchmarkTools.Trial:
memory estimate: 1.19 MiB
allocs estimate: 24076
--------------
minimum time: 4.154 ms (0.00% GC)
median time: 4.392 ms (0.00% GC)
mean time: 4.628 ms (3.36% GC)
maximum time: 10.013 ms (47.09% GC)
--------------
samples: 1075
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
# DataTables
julia> @benchmark h()
BenchmarkTools.Trial:
memory estimate: 13.46 MiB
allocs estimate: 399702
--------------
minimum time: 21.948 ms (0.00% GC)
median time: 26.196 ms (11.28% GC)
mean time: 26.022 ms (7.88% GC)
maximum time: 38.248 ms (10.68% GC)
--------------
samples: 192
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
# DataFrames
julia> @benchmark h()
BenchmarkTools.Trial:
memory estimate: 8.00 MiB
allocs estimate: 214342
--------------
minimum time: 12.387 ms (0.00% GC)
median time: 13.102 ms (0.00% GC)
mean time: 13.922 ms (6.42% GC)
maximum time: 18.948 ms (0.00% GC)
--------------
samples: 359
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00% |
Thanks Milan! I started a branch to try and get the benchmarks setup using PkgBenchmark.jl as you had suggested earlier. I wonder if removing the column promotion to NullableArrays in #24 will have any affect on these benchmarks. I've started working on JuliaData/DataFrames.jl#485 too, which may have implications on |
I've just found this, might be relevant regarding possible optimizations: http://scattered-thoughts.net/blog/2016/10/11/a-practical-relational-query-compiler-in-500-lines/ |
Another possibly interesting reference: the Stata ftools package, whose algorithm is documented at https://github.com/sergiocorreia/ftools/blob/master/src/ftools.sthlp |
This Discourse post shows that for a simple example our For reference, Pandas uses a different strategy, which we also used before this PR: instead of hashing it assigns an integer code to each level of each grouping variables, and combines them. This has been explained here and here, though it's a bit old. (Our old implementation did not support compressing the integer codes when the number of possible combinations was too high, which Pandas does.) This also requires using a hash table to assign an integer code to each level (essentially what Overall, I'd be inclined to move back to the old strategy (which should only require changing limited parts of this PR). There must be a reason Pandas uses it. And among other advantages, it's really really fast for |
In the current implementation the rows hash should be updated column-wise, so it is type-stable and memory-efficient. |
Yes. The problem is that, while hashing is optimized, the equality check isn't. There's also a much bigger issue with joining, see the example at https://discourse.julialang.org/t/how-is-the-data-ecosystem-right-now-for-large-datasets/4281/19. |
One possible approach is that |
I'm not sure that's acceptable either. Grouping/joining on strings is quite common too, and even "reasonable ranges" can get out of control quickly if you combine multiple columns. Basically, the two approaches are similar in many respects, but they differ in the way the per-column hashes are combined:
So basically, the Pandas approach differs from the current one we use by the fact that a perfect hashing function is created. The cost of checking for collision is paid separately for each column, where an equality check is more efficient; then it cannot happen when grouping rows. AFAICT, the only drawback of that approach is that you need to compress the integer codes when the number of possible combinations of levels gets too high. But that's not the end of the world, we have code examples for that in Pandas and recoding an integer vector is quite fast. What are the advantages of the current approach in your opinion? It could be efficient if the column types were statically known, but even then I'm not sure it would beat the other approach. If Pandas uses it since 2012, it probably means they couldn't find a more efficient solution, and they have put much more resources than us on this. |
But, basically, "building a hash for each column separately" is an implicit conversion to a categorical. And then, IIUC, what I propose would match what Pandas is doing. One advantage (that motivated this PR) is that it doesn't rely on an efficient compression scheme, which was a big problem for multi-column joins of reasonably-sized tables at that time. |
What's the problem with "implicit conversion to categorical"? That's just a dict lookup. I've experimented a bit locally with a more efficient approach than the old one which actually used to build a complete
AFAIK it was a big problem because it wasn't implemented at all. Since Pandas uses it, it appears to work well in practice. We should run some benchmarks, but the cost of compressing an integer vector will likely be quite negligible compared with the cost of hashing and dict lookups. I haven't looked too deeply at join algorithms yet, unfortunately there's no blog post by Wes McKinney to ease the understanding of the code. |
That's true, I'm just saying that in certain situations not all the keys would be looked up. But probably it doesn't justify complicating the code with unique and non-unique hashes. As I see it now, to implement Pandas-like behaviour, one strategy could be to modify In the current implementation joins are just about enumerating all pairs of matching rows from the two tables, where the right table (or left for the right-join) is used to build |
I've worked on grouping (since it's simpler than joining), and improving the implementation of the old algorithm I could make it twice faster than both the old and the existing code: #76. That means we should more or less be about twice slower than Pandas (for this particular example), which sounds promising (especially given that half of the time is now spent in areas I haven't explored yet). Do you think you could have a look at joins? You're much more familiar than I am with that code. It would be great to summarize the differences between our implementation and Pandas' so that we can identify places where we could use a more clever algorithm. |
The implementation @alyst wrote for JuliaData/DataFrames.jl#850 is very well done. It's clearly more performant than the current master and what I wrote for #3. I merged JuliaData/DataFrames.jl#850 with DataTables but if I could get some keen eyes, particularly @alyst's to make sure I didn't make any meaningful errors while fixing merge conflicts that would be appreciated. I merged benchmarks from JuliaData/DataFrames.jl#850, #3, and #12 to generate this list of benchmarks.
benchmarks
results