From e0cd3b8808d93800395192df25298547d30a9940 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bogumi=C5=82=20Kami=C5=84ski?= Date: Sat, 24 Dec 2022 08:35:03 +0100 Subject: [PATCH] Add an option in joins to specify row order (#3233) --- NEWS.md | 11 +- docs/src/man/joins.md | 251 +++++++++++++++++++++++++++++++++++--- src/join/composer.jl | 273 ++++++++++++++++++++++++++++++++---------- test/join.jl | 200 +++++++++++++++++++++++++++++++ 4 files changed, 653 insertions(+), 82 deletions(-) diff --git a/NEWS.md b/NEWS.md index 7a81f0eef2..725c97475e 100644 --- a/NEWS.md +++ b/NEWS.md @@ -3,13 +3,22 @@ ## New functionalities * Add `Iterators.partition` support - ([#3212](https://github.com/JuliaData/DataFrames.jl/pull/3212)) + ([#3212](https://github.com/JuliaData/DataFrames.jl/pull/3212)) * Add `allunique` and allow transformations in `cols` argument of `describe` and `nonunique` when working with `SubDataFrame` ([3232](https://github.com/JuliaData/DataFrames.jl/pull/3232)) * Add support for `operator` keyword argument in `Cols` to take a set operation to apply to passed selectors (`union` by default) ([3224](https://github.com/JuliaData/DataFrames.jl/pull/3224)) +* Joining functions now support `order` keyword argument allowing the user + to specify the order of the rows in the produced table + ([#3233](https://github.com/JuliaData/DataFrames.jl/pull/3233)) + +## Bug fixes + +* passing very many data frames to `innerjoin` and `outerjoin` + does not lead to stack overflow + ([#3233](https://github.com/JuliaData/DataFrames.jl/pull/3233)) # DataFrames.jl v1.4.4 Patch Release Notes diff --git a/docs/src/man/joins.md b/docs/src/man/joins.md index 2e228c7698..0078a8e518 100644 --- a/docs/src/man/joins.md +++ b/docs/src/man/joins.md @@ -1,6 +1,10 @@ # Database-Style Joins -We often need to combine two or more data sets together to provide a complete picture of the topic we are studying. 
For example, suppose that we have the following two data sets: +## Introduction to joins + +We often need to combine two or more data sets together to provide a complete +picture of the topic we are studying. For example, suppose that we have the +following two data sets: ```jldoctest joins julia> using DataFrames @@ -22,7 +26,8 @@ julia> jobs = DataFrame(ID=[20, 40], Job=["Lawyer", "Doctor"]) 2 │ 40 Doctor ``` -We might want to work with a larger data set that contains both the names and jobs for each ID. We can do this using the `innerjoin` function: +We might want to work with a larger data set that contains both the names and +jobs for each ID. We can do this using the `innerjoin` function: ```jldoctest joins julia> innerjoin(people, jobs, on = :ID) @@ -34,21 +39,29 @@ julia> innerjoin(people, jobs, on = :ID) 2 │ 40 Jane Doe Doctor ``` -In relational database theory, this operation is generally referred to as a join. -The columns used to determine which rows should be combined during a join are called keys. +In relational database theory, this operation is generally referred to as a +join. The columns used to determine which rows should be combined during a join +are called keys. The following functions are provided to perform seven kinds of joins: -- `innerjoin`: the output contains rows for values of the key that exist in all passed data frames. -- `leftjoin`: the output contains rows for values of the key that exist in the first (left) argument, - whether or not that value exists in the second (right) argument. -- `rightjoin`: the output contains rows for values of the key that exist in the second (right) argument, - whether or not that value exists in the first (left) argument. -- `outerjoin`: the output contains rows for values of the key that exist in any of the passed data frames. -- `semijoin`: Like an inner join, but output is restricted to columns from the first (left) argument. 
-- `antijoin`: The output contains rows for values of the key that exist in the first (left) but not the second (right) argument. - As with `semijoin`, output is restricted to columns from the first (left) argument. -- `crossjoin`: The output is the cartesian product of rows from all passed data frames. +- `innerjoin`: the output contains rows for values of the key that exist in all + passed data frames. +- `leftjoin`: the output contains rows for values of the key that exist in the + first (left) argument, whether or not that value exists in the second (right) + argument. +- `rightjoin`: the output contains rows for values of the key that exist in the + second (right) argument, whether or not that value exists in the first (left) + argument. +- `outerjoin`: the output contains rows for values of the key that exist in any + of the passed data frames. +- `semijoin`: Like an inner join, but output is restricted to columns from the + first (left) argument. +- `antijoin`: The output contains rows for values of the key that exist in the + first (left) but not the second (right) argument. As with `semijoin`, output + is restricted to columns from the first (left) argument. +- `crossjoin`: The output is the cartesian product of rows from all passed data + frames. See [the Wikipedia page on SQL joins](https://en.wikipedia.org/wiki/Join_(SQL)) for more information. 
@@ -124,8 +137,10 @@ julia> crossjoin(people, jobs, makeunique = true) 4 │ 40 Jane Doe 60 Astronaut ``` -In order to join data frames on keys which have different names in the left and right tables, -you may pass `left => right` pairs as `on` argument: +## Joining on key columns with different names + +In order to join data frames on keys which have different names in the left and +right tables, you may pass `left => right` pairs as `on` argument: ```jldoctest joins julia> a = DataFrame(ID=[20, 40], Name=["John Doe", "Jane Doe"]) @@ -198,6 +213,8 @@ julia> innerjoin(a, b, on = [:City => :Location, :Job => :Work]) 9 │ New York Doctor 5 e ``` +## Handling of duplicate keys and tracking source data frame + Additionally, notice that in the last join rows 2 and 3 had the same values on `on` variables in both joined `DataFrame`s. In such a situation `innerjoin`, `outerjoin`, `leftjoin` and `rightjoin` will produce all combinations of @@ -248,3 +265,205 @@ julia> outerjoin(a, b, on=:ID, validate=(true, true), source=:source) Note that this time we also used the `validate` keyword argument and it did not produce errors as the keys defined in both source data frames were unique. + +## Renaming joined columns + +Often you want to keep track of the source data frame of a given column. +This feature is supported with the `renamecols` keyword argument: + +```jldoctest joins +julia> innerjoin(a, b, on=:ID, renamecols = "_left" => "_right") +1×3 DataFrame + Row │ ID Name_left Job_right + │ Int64 String String +─────┼───────────────────────────── + 1 │ 20 John Lawyer +``` + +In the above example we added the `"_left"` suffix to the non-key columns from +the left table and the `"_right"` suffix to the non-key columns from the right +table. 
+ +Alternatively it is allowed to pass a function transforming column names: +```jldoctest joins +julia> innerjoin(a, b, on=:ID, renamecols = lowercase => uppercase) +1×3 DataFrame + Row │ ID name JOB + │ Int64 String String +─────┼─────────────────────── + 1 │ 20 John Lawyer + +``` + +## Matching missing values in joins + +By default when you try to perform a join on a key that has `missing` values +you get an error: + +```jldoctest joins +julia> df1 = DataFrame(id=[1, missing, 3], a=1:3) +3×2 DataFrame + Row │ id a + │ Int64? Int64 +─────┼──────────────── + 1 │ 1 1 + 2 │ missing 2 + 3 │ 3 3 + +julia> df2 = DataFrame(id=[1, 2, missing], b=1:3) +3×2 DataFrame + Row │ id b + │ Int64? Int64 +─────┼──────────────── + 1 │ 1 1 + 2 │ 2 2 + 3 │ missing 3 + +julia> innerjoin(df1, df2, on=:id) +ERROR: ArgumentError: missing values in key columns are not allowed when matchmissing == :error +``` + +If you would prefer `missing` values to be treated as equal pass +the `matchmissing=:equal` keyword argument: + +```jldoctest joins +julia> innerjoin(df1, df2, on=:id, matchmissing=:equal) +2×3 DataFrame + Row │ id a b + │ Int64? Int64 Int64 +─────┼─────────────────────── + 1 │ 1 1 1 + 2 │ missing 2 3 +``` + +Alternatively you might want to drop all rows with `missing` values. In this +case pass `matchmissing=:notequal`: + +```jldoctest joins +julia> innerjoin(df1, df2, on=:id, matchmissing=:notequal) +1×3 DataFrame + Row │ id a b + │ Int64? 
Int64 Int64 +─────┼────────────────────── + 1 │ 1 1 1 +``` + +## Specifying row order in the join result + +By default the order of rows produced by the join operation is undefined: + +```jldoctest joins +julia> df_left = DataFrame(id=[1, 2, 4, 5], left=1:4) +4×2 DataFrame + Row │ id left + │ Int64 Int64 +─────┼────────────── + 1 │ 1 1 + 2 │ 2 2 + 3 │ 4 3 + 4 │ 5 4 + +julia> df_right = DataFrame(id=[2, 1, 3, 6, 7], right=1:5) +5×2 DataFrame + Row │ id right + │ Int64 Int64 +─────┼────────────── + 1 │ 2 1 + 2 │ 1 2 + 3 │ 3 3 + 4 │ 6 4 + 5 │ 7 5 + +julia> outerjoin(df_left, df_right, on=:id) +7×3 DataFrame + Row │ id left right + │ Int64 Int64? Int64? +─────┼───────────────────────── + 1 │ 2 2 1 + 2 │ 1 1 2 + 3 │ 4 3 missing + 4 │ 5 4 missing + 5 │ 3 missing 3 + 6 │ 6 missing 4 + 7 │ 7 missing 5 +``` + +If you would like the result to keep the row order of the left table pass +the `order=:left` keyword argument: + +```jldoctest joins +julia> outerjoin(df_left, df_right, on=:id, order=:left) +7×3 DataFrame + Row │ id left right + │ Int64 Int64? Int64? +─────┼───────────────────────── + 1 │ 1 1 2 + 2 │ 2 2 1 + 3 │ 4 3 missing + 4 │ 5 4 missing + 5 │ 3 missing 3 + 6 │ 6 missing 4 + 7 │ 7 missing 5 +``` + +Note that in this case keys missing from the left table are put after the keys +present in it. + +Similarly `order=:right` keeps the order of the right table (and puts keys +not present in it at the end): + +```jldoctest joins +julia> outerjoin(df_left, df_right, on=:id, order=:right) +7×3 DataFrame + Row │ id left right + │ Int64 Int64? Int64? +─────┼───────────────────────── + 1 │ 2 2 1 + 2 │ 1 1 2 + 3 │ 3 missing 3 + 4 │ 6 missing 4 + 5 │ 7 missing 5 + 6 │ 4 3 missing + 7 │ 5 4 missing +``` + +## In-place left join + +A common operation is adding data from a reference table to some main table. +It is possible to perform such an in-place update using the `leftjoin!` +function. In this case the left table is updated in place with matching rows from +the right table. 
+ +```jldoctest joins +julia> main = DataFrame(id=1:4, main=1:4) +4×2 DataFrame + Row │ id main + │ Int64 Int64 +─────┼────────────── + 1 │ 1 1 + 2 │ 2 2 + 3 │ 3 3 + 4 │ 4 4 + +julia> leftjoin!(main, DataFrame(id=[2, 4], info=["a", "b"]), on=:id); + +julia> main +4×3 DataFrame + Row │ id main info + │ Int64 Int64 String? +─────┼─────────────────────── + 1 │ 1 1 missing + 2 │ 2 2 a + 3 │ 3 3 missing + 4 │ 4 4 b +``` + +Note that in this case the order and number of rows in the left table is not +changed. Therefore, in particular, it is not allowed to have duplicate keys +in the right table: + +``` +julia> leftjoin!(main, DataFrame(id=[2, 2], info_bad=["a", "b"]), on=:id) +ERROR: ArgumentError: duplicate rows found in right table +``` + diff --git a/src/join/composer.jl b/src/join/composer.jl index 958c8ba925..53cb7057e4 100644 --- a/src/join/composer.jl +++ b/src/join/composer.jl @@ -182,11 +182,72 @@ function _propagate_join_metadata!(joiner::DataFrameJoiner, dfr_noon::AbstractDa return nothing end +# return a permutation vector that puts `input` into sorted order +# using counting sort algorithm +function _count_sortperm(input::Vector{Int}) + isempty(input) && return UInt32[] + vmin, vmax = extrema(input) + delta = vmin - 1 + Tc = vmax - delta < typemax(UInt32) ? UInt32 : Int + Tp = length(input) < typemax(UInt32) ? UInt32 : Int + return _count_sortperm!(input, zeros(Tc, vmax - delta + 1), + Vector{Tp}(undef, length(input)), delta) +end + +# put into `output` a permutation that puts `input` into sorted order; +# changes `count` vector. +# +# `delta` is by how much integers in `input` need to be shifted so that +# smallest of them has index 1 +# +# After the first loop `count` vector holds in location `i-delta` the number of +# times an integer `i` is present in `input`. +# After the second loop a cumulative sum of these values is stored. +# Third loop updates `count` to determine the locations where data should go. 
+# It is assumed that initially `count` vector holds only zeros. +# Length of `count` is by 2 greater than the difference between maximal and +# minimal element of `input` (i.e. number of unique values plus 1) +function _count_sortperm!(input::Vector{Int}, count::Vector, + output::Vector, delta::Int) + @assert firstindex(input) == 1 + # consider adding @inbounds to these loops in the future after the code + # has been used enough in production + for j in input + count[j - delta] += 1 + end + prev = count[1] + for i in 2:length(count) + prev = (count[i] += prev) + end + for i in length(input):-1:1 + j = input[i] - delta + v = (count[j] -= 1) + output[v + 1] = i + end + return output +end + function compose_inner_table(joiner::DataFrameJoiner, makeunique::Bool, left_rename::Union{Function, AbstractString, Symbol}, - right_rename::Union{Function, AbstractString, Symbol}) + right_rename::Union{Function, AbstractString, Symbol}, + order::Symbol) left_ixs, right_ixs = find_inner_rows(joiner) + @assert left_ixs isa Vector{Int} + @assert right_ixs isa Vector{Int} + @assert length(left_ixs) == length(right_ixs) + + if order == :left && !issorted(left_ixs) + csp_l = _count_sortperm(left_ixs) + left_ixs = left_ixs[csp_l] + right_ixs = right_ixs[csp_l] + end + + if order == :right && !issorted(right_ixs) + csp_r = _count_sortperm(right_ixs) + left_ixs = left_ixs[csp_r] + right_ixs = right_ixs[csp_r] + end if Threads.nthreads() > 1 && length(left_ixs) >= 1_000_000 dfl_task = Threads.@spawn joiner.dfl[left_ixs, :] @@ -227,9 +288,13 @@ end function compose_joined_table(joiner::DataFrameJoiner, kind::Symbol, makeunique::Bool, left_rename::Union{Function, AbstractString, Symbol}, right_rename::Union{Function, AbstractString, Symbol}, - indicator::Union{Nothing, Symbol, AbstractString}) + indicator::Union{Nothing, Symbol, AbstractString}, + order::Symbol) @assert kind == :left || kind == :right || kind == :outer left_ixs, right_ixs = find_inner_rows(joiner) + @assert left_ixs isa 
Vector{Int} + @assert right_ixs isa Vector{Int} + @assert length(left_ixs) == length(right_ixs) if kind == :left || kind == :outer leftonly_ixs = find_missing_idxs(left_ixs, nrow(joiner.dfl)) @@ -243,7 +308,8 @@ function compose_joined_table(joiner::DataFrameJoiner, kind::Symbol, makeunique: rightonly_ixs = 1:0 end return _compose_joined_table(joiner, kind, makeunique, left_rename, right_rename, - indicator, left_ixs, right_ixs, leftonly_ixs, rightonly_ixs) + indicator, left_ixs, right_ixs, + leftonly_ixs, rightonly_ixs, order) end function _compose_joined_table(joiner::DataFrameJoiner, kind::Symbol, makeunique::Bool, @@ -251,7 +317,9 @@ function _compose_joined_table(joiner::DataFrameJoiner, kind::Symbol, makeunique right_rename::Union{Function, AbstractString, Symbol}, indicator::Union{Nothing, Symbol, AbstractString}, left_ixs::AbstractVector, right_ixs::AbstractVector, - leftonly_ixs::AbstractVector, rightonly_ixs::AbstractVector) + leftonly_ixs::AbstractVector, + rightonly_ixs::AbstractVector, + order::Symbol) lil = length(left_ixs) ril = length(right_ixs) loil = length(leftonly_ixs) @@ -321,7 +389,8 @@ function _compose_joined_table(joiner::DataFrameJoiner, kind::Symbol, makeunique for col in eachcol(dfl_noon) cols_i = left_idxs[col_idx] Threads.@spawn _noon_compose_helper!(cols, _similar_left, cols_i, - col, target_nrow, left_ixs, lil + 1, leftonly_ixs, loil) + col, target_nrow, left_ixs, lil + 1, + leftonly_ixs, loil) col_idx += 1 end @assert col_idx == ncol(joiner.dfl) + 1 @@ -346,12 +415,31 @@ function _compose_joined_table(joiner::DataFrameJoiner, kind::Symbol, makeunique end end + new_order = nothing + if order == :left && !(issorted(left_ixs) && isempty(leftonly_ixs)) + left_cols_idxs = _sort_compose_helper(nrow(joiner.dfl) + 1, + 1:nrow(joiner.dfl), target_nrow, + left_ixs, lil + 1, leftonly_ixs, loil) + new_order = _count_sortperm(left_cols_idxs) + end + if order == :right && !(issorted(right_ixs) && isempty(rightonly_ixs)) + right_cols_idxs = 
_sort_compose_helper(nrow(joiner.dfr) + 1, + 1:nrow(joiner.dfr), target_nrow, + right_ixs, lil + loil + 1, rightonly_ixs, roil) + new_order = _count_sortperm(right_cols_idxs) + end + @assert col_idx == length(cols) + 1 new_names = vcat(_rename_cols(_names(joiner.dfl), left_rename, joiner.left_on), _rename_cols(_names(dfr_noon), right_rename)) res = DataFrame(cols, new_names, makeunique=makeunique, copycols=false) + if new_order !== nothing + isnothing(src_indicator) || permute!(src_indicator, new_order) + permute!(res, new_order) + end + _propagate_join_metadata!(joiner, dfr_noon, res, kind) return res, src_indicator @@ -363,8 +451,8 @@ function _noon_compose_helper!(cols::Vector{AbstractVector}, # target container col::AbstractVector, # source column target_nrow::Integer, # target number of rows in new column side_ixs::AbstractVector, # indices in col that were matched - offset::Integer, # offset to put non matched indices - sideonly_ixs::AbstractVector, # indices in col that were not + offset::Integer, # offset to put non-matching indices + sideonly_ixs::AbstractVector, # indices in col that were not matched tocopy::Integer) # number on non-matched rows to copy @assert tocopy == length(sideonly_ixs) cols[cols_i] = similar_col(col, target_nrow) @@ -372,16 +460,36 @@ function _noon_compose_helper!(cols::Vector{AbstractVector}, # target container copyto!(cols[cols_i], offset, view(col, sideonly_ixs), 1, tocopy) end +function _sort_compose_helper(fillval::Int, # value to use to fill unused indices + col::AbstractVector, # source column + target_nrow::Integer, # target number of rows in new column + side_ixs::AbstractVector, # indices in col that were matched + offset::Integer, # offset to put non-matching indices + sideonly_ixs::AbstractVector, # indices in col that were not matched + tocopy::Integer) # number on non-matched rows to copy + @assert tocopy == length(sideonly_ixs) + outcol = Vector{Int}(undef, target_nrow) + copyto!(outcol, view(col, side_ixs)) + 
fill!(view(outcol, length(side_ixs)+1:offset-1), fillval) + copyto!(outcol, offset, view(col, sideonly_ixs), 1, tocopy) + fill!(view(outcol, offset+tocopy:target_nrow), fillval) + return outcol +end + function _join(df1::AbstractDataFrame, df2::AbstractDataFrame; on::Union{<:OnType, AbstractVector}, kind::Symbol, makeunique::Bool, indicator::Union{Nothing, Symbol, AbstractString}, validate::Union{Pair{Bool, Bool}, Tuple{Bool, Bool}}, left_rename::Union{Function, AbstractString, Symbol}, right_rename::Union{Function, AbstractString, Symbol}, - matchmissing::Symbol) + matchmissing::Symbol, order::Symbol) _check_consistency(df1) _check_consistency(df2) + if !(order in (:undefined, :left, :right)) + throw(ArgumentError("order argument must be :undefined, :left, or :right.")) + end + if on == [] throw(ArgumentError("Missing join argument 'on'.")) end @@ -464,16 +572,16 @@ function _join(df1::AbstractDataFrame, df2::AbstractDataFrame; src_indicator = nothing if kind == :inner - joined = compose_inner_table(joiner, makeunique, left_rename, right_rename) + joined = compose_inner_table(joiner, makeunique, left_rename, right_rename, order) elseif kind == :left joined, src_indicator = - compose_joined_table(joiner, kind, makeunique, left_rename, right_rename, indicator) + compose_joined_table(joiner, kind, makeunique, left_rename, right_rename, indicator, order) elseif kind == :right joined, src_indicator = - compose_joined_table(joiner, kind, makeunique, left_rename, right_rename, indicator) + compose_joined_table(joiner, kind, makeunique, left_rename, right_rename, indicator, order) elseif kind == :outer joined, src_indicator = - compose_joined_table(joiner, kind, makeunique, left_rename, right_rename, indicator) + compose_joined_table(joiner, kind, makeunique, left_rename, right_rename, indicator, order) elseif kind == :semi joined = joiner.dfl[find_semi_rows(joiner), :] elseif kind == :anti @@ -514,16 +622,16 @@ end """ innerjoin(df1, df2; on, makeunique=false, 
validate=(false, false), - renamecols=(identity => identity), matchmissing=:error) + renamecols=(identity => identity), matchmissing=:error, + order=:undefined) innerjoin(df1, df2, dfs...; on, makeunique=false, - validate=(false, false), matchmissing=:error) + validate=(false, false), matchmissing=:error, + order=:undefined) Perform an inner join of two or more data frame objects and return a `DataFrame` containing the result. An inner join includes rows with keys that match in all passed data frames. -The order of rows in the result is undefined and may change in the future releases. - In the returned data frame the type of the columns on which the data frames are joined is determined by the type of these columns in `df1`. This behavior may change in future releases. @@ -559,6 +667,10 @@ change in future releases. in `on` columns; if equal to `:equal` then `missing` is allowed and missings are matched; if equal to `:notequal` then missings are dropped in `df1` and `df2` `on` columns; `isequal` is used for comparisons of rows for equality +- `order` : if `:undefined` (the default) the order of rows in the result is + undefined and may change in future releases. If `:left` then the order of + rows from the left data frame is retained. If `:right` then the order of rows + from the right data frame is retained. It is not allowed to join on columns that contain `NaN` or `-0.0` in real or imaginary part of the number. 
If you need to perform a join on such values use @@ -640,7 +752,8 @@ function innerjoin(df1::AbstractDataFrame, df2::AbstractDataFrame; makeunique::Bool=false, validate::Union{Pair{Bool, Bool}, Tuple{Bool, Bool}}=(false, false), renamecols::Pair=identity => identity, - matchmissing::Symbol=:error) + matchmissing::Symbol=:error, + order::Symbol=:undefined) if !all(x -> x isa Union{Function, AbstractString, Symbol}, renamecols) throw(ArgumentError("renamecols keyword argument must be a `Pair` " * "containing functions, strings, or `Symbol`s")) @@ -648,28 +761,36 @@ function innerjoin(df1::AbstractDataFrame, df2::AbstractDataFrame; return _join(df1, df2, on=on, kind=:inner, makeunique=makeunique, indicator=nothing, validate=validate, left_rename=first(renamecols), right_rename=last(renamecols), - matchmissing=matchmissing) + matchmissing=matchmissing, order=order) end -innerjoin(df1::AbstractDataFrame, df2::AbstractDataFrame, dfs::AbstractDataFrame...; - on::Union{<:OnType, AbstractVector} = Symbol[], - makeunique::Bool=false, - validate::Union{Pair{Bool, Bool}, Tuple{Bool, Bool}}=(false, false), - matchmissing::Symbol=:error) = - innerjoin(innerjoin(df1, df2, on=on, makeunique=makeunique, validate=validate, - matchmissing=matchmissing), - dfs..., on=on, makeunique=makeunique, validate=validate, - matchmissing=matchmissing) +function innerjoin(df1::AbstractDataFrame, df2::AbstractDataFrame, dfs::AbstractDataFrame...; + on::Union{<:OnType, AbstractVector} = Symbol[], + makeunique::Bool=false, + validate::Union{Pair{Bool, Bool}, Tuple{Bool, Bool}}=(false, false), + matchmissing::Symbol=:error, + order::Symbol=:undefined) + @assert !isempty(dfs) + res = innerjoin(df1, df2, on=on, makeunique=makeunique, validate=validate, + matchmissing=matchmissing, + order=order === :right ? :undefined : order) + for (i, dfn) in enumerate(dfs) + res = innerjoin(res, dfn, on=on, makeunique=makeunique, validate=validate, + matchmissing=matchmissing, + order= order === :right ? 
+ (i == length(dfs) ? :right : :undefined) : + order) + end + return res +end """ leftjoin(df1, df2; on, makeunique=false, source=nothing, validate=(false, false), - renamecols=(identity => identity), matchmissing=:error) + renamecols=(identity => identity), matchmissing=:error, order=:undefined) Perform a left join of two data frame objects and return a `DataFrame` containing the result. A left join includes all rows from `df1`. -The order of rows in the result is undefined and may change in the future releases. - In the returned data frame the type of the columns on which the data frames are joined is determined by the type of these columns in `df1`. This behavior may change in future releases. @@ -707,6 +828,10 @@ change in future releases. in `on` columns; if equal to `:equal` then `missing` is allowed and missings are matched; if equal to `:notequal` then missings are dropped in `df2` `on` columns; `isequal` is used for comparisons of rows for equality +- `order` : if `:undefined` (the default) the order of rows in the result is + undefined and may change in future releases. If `:left` then the order of + rows from the left data frame is retained. If `:right` then the order of rows + from the right data frame is retained (non-matching rows are put at the end). All columns of the returned data frame will support missing values. 
@@ -784,11 +909,12 @@ julia> leftjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase => ``` """ function leftjoin(df1::AbstractDataFrame, df2::AbstractDataFrame; - on::Union{<:OnType, AbstractVector} = Symbol[], makeunique::Bool=false, - source::Union{Nothing, Symbol, AbstractString}=nothing, - indicator::Union{Nothing, Symbol, AbstractString}=nothing, - validate::Union{Pair{Bool, Bool}, Tuple{Bool, Bool}}=(false, false), - renamecols::Pair=identity => identity, matchmissing::Symbol=:error) + on::Union{<:OnType, AbstractVector} = Symbol[], makeunique::Bool=false, + source::Union{Nothing, Symbol, AbstractString}=nothing, + indicator::Union{Nothing, Symbol, AbstractString}=nothing, + validate::Union{Pair{Bool, Bool}, Tuple{Bool, Bool}}=(false, false), + renamecols::Pair=identity => identity, matchmissing::Symbol=:error, + order::Symbol=:undefined) if !all(x -> x isa Union{Function, AbstractString, Symbol}, renamecols) throw(ArgumentError("renamecols keyword argument must be a `Pair` " * "containing functions, strings, or `Symbol`s")) @@ -808,18 +934,18 @@ function leftjoin(df1::AbstractDataFrame, df2::AbstractDataFrame; return _join(df1, df2, on=on, kind=:left, makeunique=makeunique, indicator=source, validate=validate, left_rename=first(renamecols), right_rename=last(renamecols), - matchmissing=matchmissing) + matchmissing=matchmissing, order=order) end """ rightjoin(df1, df2; on, makeunique=false, source=nothing, validate=(false, false), renamecols=(identity => identity), - matchmissing=:error) + matchmissing=:error, order=:undefined) Perform a right join on two data frame objects and return a `DataFrame` containing the result. A right join includes all rows from `df2`. -The order of rows in the result is undefined and may change in the future releases. +The order of rows in the result is undefined and may change in future releases. 
In the returned data frame the type of the columns on which the data frames are joined is determined by the type of these columns in `df2`. This behavior may @@ -858,6 +984,10 @@ change in future releases. in `on` columns; if equal to `:equal` then `missing` is allowed and missings are matched; if equal to `:notequal` then missings are dropped in `df1` `on` columns; `isequal` is used for comparisons of rows for equality +- `order` : if `:undefined` (the default) the order of rows in the result is + undefined and may change in future releases. If `:left` then the order of + rows from the left data frame is retained (non-matching rows are put at the end). + If `:right` then the order of rows from the right data frame is retained. All columns of the returned data frame will support missing values. @@ -935,11 +1065,12 @@ julia> rightjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase = ``` """ function rightjoin(df1::AbstractDataFrame, df2::AbstractDataFrame; - on::Union{<:OnType, AbstractVector} = Symbol[], makeunique::Bool=false, - source::Union{Nothing, Symbol, AbstractString}=nothing, - indicator::Union{Nothing, Symbol, AbstractString}=nothing, - validate::Union{Pair{Bool, Bool}, Tuple{Bool, Bool}}=(false, false), - renamecols::Pair=identity => identity, matchmissing::Symbol=:error) + on::Union{<:OnType, AbstractVector} = Symbol[], makeunique::Bool=false, + source::Union{Nothing, Symbol, AbstractString}=nothing, + indicator::Union{Nothing, Symbol, AbstractString}=nothing, + validate::Union{Pair{Bool, Bool}, Tuple{Bool, Bool}}=(false, false), + renamecols::Pair=identity => identity, matchmissing::Symbol=:error, + order::Symbol=:undefined) if !all(x -> x isa Union{Function, AbstractString, Symbol}, renamecols) throw(ArgumentError("renamecols keyword argument must be a `Pair` " * "containing functions, strings, or `Symbol`s")) @@ -959,20 +1090,20 @@ function rightjoin(df1::AbstractDataFrame, df2::AbstractDataFrame; return _join(df1, df2, on=on, 
kind=:right, makeunique=makeunique, indicator=source, validate=validate, left_rename=first(renamecols), right_rename=last(renamecols), - matchmissing=matchmissing) + matchmissing=matchmissing, order=order) end """ outerjoin(df1, df2; on, makeunique=false, source=nothing, validate=(false, false), - renamecols=(identity => identity), matchmissing=:error) + renamecols=(identity => identity), matchmissing=:error, order=:undefined) outerjoin(df1, df2, dfs...; on, makeunique = false, - validate = (false, false), matchmissing=:error) + validate = (false, false), matchmissing=:error, order=:undefined) Perform an outer join of two or more data frame objects and return a `DataFrame` containing the result. An outer join includes rows with keys that appear in any of the passed data frames. -The order of rows in the result is undefined and may change in the future releases. +The order of rows in the result is undefined and may change in future releases. In the returned data frame the type of the columns on which the data frames are joined is determined by the element type of these columns both `df1` and `df2`. @@ -1013,6 +1144,11 @@ This behavior may change in future releases. - `matchmissing` : if equal to `:error` throw an error if `missing` is present in `on` columns; if equal to `:equal` then `missing` is allowed and missings are matched; `isequal` is used for comparisons of rows for equality +- `order` : if `:undefined` (the default) the order of rows in the result is + undefined and may change in future releases. If `:left` then the order of + rows from the left data frame is retained (non-matching rows are put at the end). + If `:right` then the order of rows from the right data frame is retained + (non-matching rows are put at the end). All columns of the returned data frame will support missing values. 
@@ -1099,11 +1235,12 @@ julia> outerjoin(name, job2, on = [:ID => :identifier], renamecols = uppercase = ``` """ function outerjoin(df1::AbstractDataFrame, df2::AbstractDataFrame; - on::Union{<:OnType, AbstractVector} = Symbol[], makeunique::Bool=false, - source::Union{Nothing, Symbol, AbstractString}=nothing, - indicator::Union{Nothing, Symbol, AbstractString}=nothing, - validate::Union{Pair{Bool, Bool}, Tuple{Bool, Bool}}=(false, false), - renamecols::Pair=identity => identity, matchmissing::Symbol=:error) + on::Union{<:OnType, AbstractVector} = Symbol[], makeunique::Bool=false, + source::Union{Nothing, Symbol, AbstractString}=nothing, + indicator::Union{Nothing, Symbol, AbstractString}=nothing, + validate::Union{Pair{Bool, Bool}, Tuple{Bool, Bool}}=(false, false), + renamecols::Pair=identity => identity, matchmissing::Symbol=:error, + order::Symbol=:undefined) if !all(x -> x isa Union{Function, AbstractString, Symbol}, renamecols) throw(ArgumentError("renamecols keyword argument must be a `Pair` " * "containing functions, strings, or `Symbol`s")) @@ -1123,17 +1260,21 @@ function outerjoin(df1::AbstractDataFrame, df2::AbstractDataFrame; return _join(df1, df2, on=on, kind=:outer, makeunique=makeunique, indicator=source, validate=validate, left_rename=first(renamecols), right_rename=last(renamecols), - matchmissing=matchmissing) + matchmissing=matchmissing, order=order) end -outerjoin(df1::AbstractDataFrame, df2::AbstractDataFrame, dfs::AbstractDataFrame...; - on::Union{<:OnType, AbstractVector} = Symbol[], makeunique::Bool=false, - validate::Union{Pair{Bool, Bool}, Tuple{Bool, Bool}}=(false, false), - matchmissing::Symbol=:error) = - outerjoin(outerjoin(df1, df2, on=on, makeunique=makeunique, validate=validate, - matchmissing=matchmissing), - dfs..., on=on, makeunique=makeunique, validate=validate, - matchmissing=matchmissing) +function outerjoin(df1::AbstractDataFrame, df2::AbstractDataFrame, dfs::AbstractDataFrame...; + on::Union{<:OnType, AbstractVector} = 
Symbol[], makeunique::Bool=false, + validate::Union{Pair{Bool, Bool}, Tuple{Bool, Bool}}=(false, false), + matchmissing::Symbol=:error, order::Symbol=:undefined) + res = outerjoin(df1, df2, on=on, makeunique=makeunique, validate=validate, + matchmissing=matchmissing, order=order) + for dfn in dfs + res = outerjoin(res, dfn, on=on, makeunique=makeunique, validate=validate, + matchmissing=matchmissing, order=order) + end + return res +end """ semijoin(df1, df2; on, makeunique=false, validate=(false, false), matchmissing=:error) @@ -1142,7 +1283,7 @@ Perform a semi join of two data frame objects and return a `DataFrame` containing the result. A semi join returns the subset of rows of `df1` that match with the keys in `df2`. -The order of rows in the result is undefined and may change in the future releases. +The order of rows in the result is kept from `df1`. # Arguments - `df1`, `df2`: the `AbstractDataFrames` to be joined @@ -1243,7 +1384,8 @@ semijoin(df1::AbstractDataFrame, df2::AbstractDataFrame; matchmissing::Symbol=:error) = _join(df1, df2, on=on, kind=:semi, makeunique=makeunique, indicator=nothing, validate=validate, - left_rename=identity, right_rename=identity, matchmissing=matchmissing) + left_rename=identity, right_rename=identity, matchmissing=matchmissing, + order=:left) """ antijoin(df1, df2; on, makeunique=false, validate=(false, false), matchmissing=:error) @@ -1252,7 +1394,7 @@ Perform an anti join of two data frame objects and return a `DataFrame` containing the result. An anti join returns the subset of rows of `df1` that do not match with the keys in `df2`. -The order of rows in the result is undefined and may change in the future releases. +The order of rows in the result is kept from `df1`. 
# Arguments - `df1`, `df2`: the `AbstractDataFrames` to be joined @@ -1347,7 +1489,8 @@ antijoin(df1::AbstractDataFrame, df2::AbstractDataFrame; _join(df1, df2, on=on, kind=:anti, makeunique=makeunique, indicator=nothing, validate=validate, left_rename=identity, right_rename=identity, - matchmissing=matchmissing) + matchmissing=matchmissing, + order=:left) """ crossjoin(df1, df2, dfs...; makeunique = false) diff --git a/test/join.jl b/test/join.jl index ff8bb5f614..501f896dd8 100644 --- a/test/join.jl +++ b/test/join.jl @@ -2027,4 +2027,204 @@ end on=:a, matchmissing=:equal) ≅ DataFrame(a=missing, b=1, c=2, d=3) end +@testset "_count_sortperm" begin + Random.seed!(1234) + for i in 0:20, rep in 1:100 + x = rand(min(i, 1):i, i) + @test sortperm(x) == DataFrames._count_sortperm(x) + @test issorted(x[DataFrames._count_sortperm(x)]) + end + for i in 0:20, rep in 1:100 + x = randperm(i) + @test sortperm(x) == DataFrames._count_sortperm(x) + @test issorted(x[DataFrames._count_sortperm(x)]) + x = randperm(i) .+ i + @test sortperm(x) == DataFrames._count_sortperm(x) + @test issorted(x[DataFrames._count_sortperm(x)]) + end + for i in 1:20 + @test DataFrames._count_sortperm(ones(Int, i)) == 1:i + @test DataFrames._count_sortperm(zeros(Int, i)) == 1:i + @test DataFrames._count_sortperm([fill(1, i); fill(2, i)]) == 1:2*i + @test DataFrames._count_sortperm([fill(2, i); fill(1, i)]) == [i+1:2i; 1:i] + end +end + +@testset "basic join tests with order" begin + for fun in (innerjoin, leftjoin, rightjoin, outerjoin) + df1 = DataFrame(x=[0, 3, 1, 2, 4], id1=1:5) + df2 = DataFrame(x=[2, 5, 1, 3, 7, 6], id2=1:6) + ref = fun(df1, df2, on=:x) + res = fun(df1, df2, on=:x, order=:left) + @test issorted(res.id1) + @test sort(ref, :id1) ≅ res + res = fun(df1, df2, on=:x, order=:right) + @test issorted(res.id2) + @test sort(ref, :id2) ≅ res + df1.x = string.(df1.x) + df2.x = string.(df2.x) + ref = fun(df1, df2, on=:x) + res = fun(df1, df2, on=:x, order=:left) + @test issorted(res.id1) + @test 
sort(ref, :id1) ≅ res + res = fun(df1, df2, on=:x, order=:right) + @test issorted(res.id2) + @test sort(ref, :id2) ≅ res + end + + for fun in (innerjoin, leftjoin, rightjoin, outerjoin) + df1 = DataFrame(x=[0, 1, 2, 3, 4], id1=1:5) + df2 = DataFrame(x=[1, 2, 3, 5, 6, 7], id2=1:6) + ref = fun(df1, df2, on=:x) + res = fun(df1, df2, on=:x, order=:left) + @test issorted(res.id1) + @test sort(ref, :id1) ≅ res + res = fun(df1, df2, on=:x, order=:right) + @test issorted(res.id2) + @test sort(ref, :id2) ≅ res + df1.x = string.(df1.x) + df2.x = string.(df2.x) + ref = fun(df1, df2, on=:x) + res = fun(df1, df2, on=:x, order=:left) + @test issorted(res.id1) + @test sort(ref, :id1) ≅ res + res = fun(df1, df2, on=:x, order=:right) + @test issorted(res.id2) + @test sort(ref, :id2) ≅ res + end + + for fun in (leftjoin, rightjoin, outerjoin) + df1 = DataFrame(x=[0, 3, 1, 2, 4], id1=1:5) + df2 = DataFrame(x=[2, 5, 1, 3, 7, 6], id2=1:6) + ref = fun(df1, df2, on=:x, source=:src) + res = fun(df1, df2, on=:x, order=:left, source=:src) + @test issorted(res.id1) + @test sort(ref, :id1) ≅ res + res = fun(df1, df2, on=:x, order=:right, source=:src) + @test issorted(res.id2) + @test sort(ref, :id2) ≅ res + df1.x = string.(df1.x) + df2.x = string.(df2.x) + ref = fun(df1, df2, on=:x, source=:src) + res = fun(df1, df2, on=:x, order=:left, source=:src) + @test issorted(res.id1) + @test sort(ref, :id1) ≅ res + res = fun(df1, df2, on=:x, order=:right, source=:src) + @test issorted(res.id2) + @test sort(ref, :id2) ≅ res + end + + for fun in (leftjoin, rightjoin, outerjoin) + df1 = DataFrame(x=[0, 1, 2, 3, 4], id1=1:5) + df2 = DataFrame(x=[1, 2, 3, 5, 6, 7], id2=1:6) + ref = fun(df1, df2, on=:x, source=:src) + res = fun(df1, df2, on=:x, order=:left, source=:src) + @test issorted(res.id1) + @test sort(ref, :id1) ≅ res + res = fun(df1, df2, on=:x, order=:right, source=:src) + @test issorted(res.id2) + @test sort(ref, :id2) ≅ res + df1.x = string.(df1.x) + df2.x = string.(df2.x) + ref = fun(df1, df2, 
on=:x, source=:src) + res = fun(df1, df2, on=:x, order=:left, source=:src) + @test issorted(res.id1) + @test sort(ref, :id1) ≅ res + res = fun(df1, df2, on=:x, order=:right, source=:src) + @test issorted(res.id2) + @test sort(ref, :id2) ≅ res + end + + df1 = DataFrame(x=[0, 3, 1, 2, 4], id1=1:5) + df2 = DataFrame(x=[2, 5, 1, 3, 7, 6], id2=1:6) + @test issorted(semijoin(df1, df2, on=:x).id1) + @test issorted(semijoin(df2, df1, on=:x).id2) + @test issorted(antijoin(df1, df2, on=:x).id1) + @test issorted(antijoin(df2, df1, on=:x).id2) + df1.x = string.(df1.x) + df2.x = string.(df2.x) + @test issorted(semijoin(df1, df2, on=:x).id1) + @test issorted(semijoin(df2, df1, on=:x).id2) + @test issorted(antijoin(df1, df2, on=:x).id1) + @test issorted(antijoin(df2, df1, on=:x).id2) + df1 = DataFrame(x=[0, 1, 2, 3, 4], id1=1:5) + df2 = DataFrame(x=[1, 2, 3, 5, 6, 7], id2=1:6) + @test issorted(semijoin(df1, df2, on=:x).id1) + @test issorted(semijoin(df2, df1, on=:x).id2) + @test issorted(antijoin(df1, df2, on=:x).id1) + @test issorted(antijoin(df2, df1, on=:x).id2) + df1.x = string.(df1.x) + df2.x = string.(df2.x) + @test issorted(semijoin(df1, df2, on=:x).id1) + @test issorted(semijoin(df2, df1, on=:x).id2) + @test issorted(antijoin(df1, df2, on=:x).id1) + @test issorted(antijoin(df2, df1, on=:x).id2) + + @test_throws ArgumentError innerjoin(df1, df2, on=:x, order=:x) + @test_throws ArgumentError leftjoin(df1, df2, on=:x, order=:x) + @test_throws ArgumentError rightjoin(df1, df2, on=:x, order=:x) + @test_throws ArgumentError outerjoin(df1, df2, on=:x, order=:x) +end + +@time @testset "randomized join tests with sort" begin + Random.seed!(1234) + for lenl in 0:20, lenr in 0:20, rep in 1:10 + df1 = DataFrame(x=rand(0:lenl, lenl), id1=1:lenl) + df2 = DataFrame(x=rand(0:lenr, lenr), id2=1:lenr) + ref = innerjoin(df1, df2, on=:x) + res = innerjoin(df1, df2, on=:x, order=:left) + @test issorted(res, [:id1, :id2]) + @test sort(ref, :id1) ≅ res + res = innerjoin(df1, df2, on=:x, 
order=:right) + @test issorted(res, [:id2, :id1]) + @test sort(ref, :id2) ≅ res + for fun in (leftjoin, rightjoin, outerjoin) + ref = fun(df1, df2, on=:x, source=:src) + res = fun(df1, df2, on=:x, order=:left, source=:src) + @test issorted(res, [:id1, :id2]) + @test sort(ref, :id1) ≅ res + res = fun(df1, df2, on=:x, order=:right, source=:src) + @test issorted(res, [:id2, :id1]) + @test sort(ref, :id2) ≅ res + end + df1.x = string.(df1.x) + df2.x = string.(df2.x) + ref = innerjoin(df1, df2, on=:x) + res = innerjoin(df1, df2, on=:x, order=:left) + @test issorted(res, [:id1, :id2]) + @test sort(ref, :id1) ≅ res + res = innerjoin(df1, df2, on=:x, order=:right) + @test issorted(res, [:id2, :id1]) + @test sort(ref, :id2) ≅ res + for fun in (leftjoin, rightjoin, outerjoin) + ref = fun(df1, df2, on=:x, source=:src) + res = fun(df1, df2, on=:x, order=:left, source=:src) + @test issorted(res, [:id1, :id2]) + @test sort(ref, :id1) ≅ res + res = fun(df1, df2, on=:x, order=:right, source=:src) + @test issorted(res, [:id2, :id1]) + @test sort(ref, :id2) ≅ res + end + end +end + +@testset "wide joins" begin + Random.seed!(1234) + # we need many repetitions to make sure we cover all cases + @time for _ in 1:1000, k in 2:4 + dfs = [(n=rand(10:20); + DataFrame("id" => randperm(n), "x$i" => 1:n)) for i in 1:4] + @test issorted(innerjoin(dfs..., on="id", order=:left)[:, 2]) + @test issorted(outerjoin(dfs..., on="id", order=:left)[:, 2]) + @test issorted(innerjoin(dfs..., on="id", order=:right)[:, end]) + @test issorted(outerjoin(dfs..., on="id", order=:right)[:, end]) + end + + dfs = [DataFrame("id" => 0, "x$i" => i) for i in 1:10000] + res = innerjoin(dfs..., on="id") + @test res == DataFrame(["id" => 0; ["x$i" => i for i in 1:10000]]) + res = outerjoin(dfs..., on="id") + @test res == DataFrame(["id" => 0; ["x$i" => i for i in 1:10000]]) +end + end # module
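
A usage sketch for the `order` keyword introduced by this patch. It reuses the data frames from the "basic join tests with order" testset and assumes a DataFrames.jl release that already contains this change. The names `res_left`, `res_right`, and `res_outer` are illustrative and do not appear in the patch.

```julia
using DataFrames

# Same inputs as in the "basic join tests with order" testset.
df1 = DataFrame(x=[0, 3, 1, 2, 4], id1=1:5)
df2 = DataFrame(x=[2, 5, 1, 3, 7, 6], id2=1:6)

# order=:left keeps matching rows in the order in which they appear in df1.
res_left = innerjoin(df1, df2, on=:x, order=:left)
@assert issorted(res_left.id1)
@assert res_left.x == [3, 1, 2]

# order=:right keeps matching rows in the order in which they appear in df2.
res_right = innerjoin(df1, df2, on=:x, order=:right)
@assert issorted(res_right.id2)
@assert res_right.x == [2, 1, 3]

# For outerjoin, rows with no match in the left table are put at the end,
# so id1 is still sorted (missing sorts after every other value).
res_outer = outerjoin(df1, df2, on=:x, order=:left)
@assert issorted(res_outer.id1)
```

With the default `order=:undefined` none of these orderings is guaranteed, so code that depends on row order should pass `order` explicitly.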