Add an option in joins to specify row order (#3233)

JuliaData · Dec 24, 2022 · e0cd3b8
1 parent b240458
commit e0cd3b8

Showing 4 changed files with 653 additions and 82 deletions.
diff --git a/NEWS.md b/NEWS.md
@@ -3,13 +3,22 @@
 ## New functionalities
 
 * Add `Iterators.partition` support
-   ([#3212](https://github.com/JuliaData/DataFrames.jl/pull/3212))
+  ([#3212](https://github.com/JuliaData/DataFrames.jl/pull/3212))
 * Add `allunique` and allow transformations in `cols` argument of `describe`
   and `nonunique` when working with `SubDataFrame`
   ([3232](https://github.com/JuliaData/DataFrames.jl/pull/3232))
 * Add support for `operator` keyword argument in `Cols`
   to take a set operation to apply to passed selectors (`union` by default)
   ([3224](https://github.com/JuliaData/DataFrames.jl/pull/3224))
+* Joining functions now support `order` keyword argument allowing the user
+  to specify the order of the rows in the produced table
+  ([#3233](https://github.com/JuliaData/DataFrames.jl/pull/3233))
+
+## Bug fixes
+
+* Passing very many data frames to `innerjoin` and `outerjoin`
+  no longer leads to stack overflow
+  ([#3233](https://github.com/JuliaData/DataFrames.jl/pull/3233))
 
 # DataFrames.jl v1.4.4 Patch Release Notes
 

diff --git a/docs/src/man/joins.md b/docs/src/man/joins.md
@@ -1,6 +1,10 @@
 # Database-Style Joins
 
-We often need to combine two or more data sets together to provide a complete picture of the topic we are studying. For example, suppose that we have the following two data sets:
+## Introduction to joins
+
+We often need to combine two or more data sets together to provide a complete
+picture of the topic we are studying. For example, suppose that we have the
+following two data sets:
 
 ```jldoctest joins
 julia> using DataFrames
@@ -22,7 +26,8 @@ julia> jobs = DataFrame(ID=[20, 40], Job=["Lawyer", "Doctor"])
    2 │    40  Doctor
 ```
 
-We might want to work with a larger data set that contains both the names and jobs for each ID. We can do this using the `innerjoin` function:
+We might want to work with a larger data set that contains both the names and
+jobs for each ID. We can do this using the `innerjoin` function:
 
 ```jldoctest joins
 julia> innerjoin(people, jobs, on = :ID)
@@ -34,21 +39,29 @@ julia> innerjoin(people, jobs, on = :ID)
    2 │    40  Jane Doe  Doctor
 ```
 
-In relational database theory, this operation is generally referred to as a join.
-The columns used to determine which rows should be combined during a join are called keys.
+In relational database theory, this operation is generally referred to as a
+join. The columns used to determine which rows should be combined during a join
+are called keys.
 
 The following functions are provided to perform seven kinds of joins:
 
--   `innerjoin`: the output contains rows for values of the key that exist in all passed data frames.
--   `leftjoin`: the output contains rows for values of the key that exist in the first (left) argument,
-    whether or not that value exists in the second (right) argument.
--   `rightjoin`: the output contains rows for values of the key that exist in the second (right) argument,
-    whether or not that value exists in the first (left) argument.
--   `outerjoin`: the output contains rows for values of the key that exist in any of the passed data frames.
--   `semijoin`: Like an inner join, but output is restricted to columns from the first (left) argument.
--   `antijoin`: The output contains rows for values of the key that exist in the first (left) but not the second (right) argument.
-    As with `semijoin`, output is restricted to columns from the first (left) argument.
--   `crossjoin`: The output is the cartesian product of rows from all passed data frames.
+- `innerjoin`: the output contains rows for values of the key that exist in all
+  passed data frames.
+- `leftjoin`: the output contains rows for values of the key that exist in the
+  first (left) argument, whether or not that value exists in the second (right)
+  argument.
+- `rightjoin`: the output contains rows for values of the key that exist in the
+  second (right) argument, whether or not that value exists in the first (left)
+  argument.
+- `outerjoin`: the output contains rows for values of the key that exist in any
+  of the passed data frames.
+- `semijoin`: Like an inner join, but output is restricted to columns from the
+  first (left) argument.
+- `antijoin`: The output contains rows for values of the key that exist in the
+  first (left) but not the second (right) argument. As with `semijoin`, output
+  is restricted to columns from the first (left) argument.
+- `crossjoin`: The output is the cartesian product of rows from all passed data
+  frames.
 
 See [the Wikipedia page on SQL joins](https://en.wikipedia.org/wiki/Join_(SQL)) for more information.
 
@@ -124,8 +137,10 @@ julia> crossjoin(people, jobs, makeunique = true)
    4 │    40  Jane Doe     60  Astronaut
 ```
 
-In order to join data frames on keys which have different names in the left and right tables,
-you may pass `left => right` pairs as `on` argument:
+## Joining on key columns with different names
+
+In order to join data frames on keys which have different names in the left and
+right tables, you may pass `left => right` pairs as `on` argument:
 
 ```jldoctest joins
 julia> a = DataFrame(ID=[20, 40], Name=["John Doe", "Jane Doe"])
@@ -198,6 +213,8 @@ julia> innerjoin(a, b, on = [:City => :Location, :Job => :Work])
    9 │ New York   Doctor         5  e
 ```
 
+## Handling of duplicate keys and tracking source data frame
+
 Additionally, notice that in the last join rows 2 and 3 had the same values on
 `on` variables in both joined `DataFrame`s. In such a situation `innerjoin`,
 `outerjoin`, `leftjoin` and `rightjoin` will produce all combinations of
@@ -248,3 +265,205 @@ julia> outerjoin(a, b, on=:ID, validate=(true, true), source=:source)
 
 Note that this time we also used the `validate` keyword argument and it did not
 produce errors as the keys defined in both source data frames were unique.
+
+## Renaming joined columns
+
+Often you want to keep track of the source data frame of a given column.
+This feature is supported with the `renamecols` keyword argument:
+
+```jldoctest joins
+julia> innerjoin(a, b, on=:ID, renamecols = "_left" => "_right")
+1×3 DataFrame
+ Row │ ID     Name_left  Job_right 
+     │ Int64  String     String
+─────┼─────────────────────────────
+   1 │    20  John       Lawyer
+```
+
+In the above example we added the `"_left"` suffix to the non-key columns from
+the left table and the `"_right"` suffix to the non-key columns from the right
+table.
+
+Alternatively, you can pass a pair of functions that transform column names:
+```jldoctest joins
+julia> innerjoin(a, b, on=:ID, renamecols = lowercase => uppercase)
+1×3 DataFrame
+ Row │ ID     name    JOB    
+     │ Int64  String  String
+─────┼───────────────────────
+   1 │    20  John    Lawyer
+
+```
+
+## Matching missing values in joins
+
+By default, when you try to perform a join on a key that has `missing` values
+you get an error:
+
+```jldoctest joins
+julia> df1 = DataFrame(id=[1, missing, 3], a=1:3)
+3×2 DataFrame
+ Row │ id       a     
+     │ Int64?   Int64 
+─────┼────────────────
+   1 │       1      1
+   2 │ missing      2
+   3 │       3      3
+
+julia> df2 = DataFrame(id=[1, 2, missing], b=1:3)
+3×2 DataFrame
+ Row │ id       b     
+     │ Int64?   Int64 
+─────┼────────────────
+   1 │       1      1
+   2 │       2      2
+   3 │ missing      3
+
+julia> innerjoin(df1, df2, on=:id)
+ERROR: ArgumentError: missing values in key columns are not allowed when matchmissing == :error
+```
+
+If you would prefer `missing` values to be treated as equal, pass
+the `matchmissing=:equal` keyword argument:
+
+```jldoctest joins
+julia> innerjoin(df1, df2, on=:id, matchmissing=:equal)
+2×3 DataFrame
+ Row │ id       a      b     
+     │ Int64?   Int64  Int64 
+─────┼───────────────────────
+   1 │       1      1      1
+   2 │ missing      2      3
+```
+
+Alternatively you might want to drop all rows with `missing` values. In this
+case pass `matchmissing=:notequal`:
+
+```jldoctest joins
+julia> innerjoin(df1, df2, on=:id, matchmissing=:notequal)
+1×3 DataFrame
+ Row │ id      a      b     
+     │ Int64?  Int64  Int64
+─────┼──────────────────────
+   1 │      1      1      1
+```
+
+## Specifying row order in the join result
+
+By default the order of rows produced by the join operation is undefined:
+
+```jldoctest joins
+julia> df_left = DataFrame(id=[1, 2, 4, 5], left=1:4)
+4×2 DataFrame
+ Row │ id     left  
+     │ Int64  Int64 
+─────┼──────────────
+   1 │     1      1
+   2 │     2      2
+   3 │     4      3
+   4 │     5      4
+
+julia> df_right = DataFrame(id=[2, 1, 3, 6, 7], right=1:5)
+5×2 DataFrame
+ Row │ id     right 
+     │ Int64  Int64 
+─────┼──────────────
+   1 │     2      1
+   2 │     1      2
+   3 │     3      3
+   4 │     6      4
+   5 │     7      5
+
+julia> outerjoin(df_left, df_right, on=:id)
+7×3 DataFrame
+ Row │ id     left     right   
+     │ Int64  Int64?   Int64?
+─────┼─────────────────────────
+   1 │     2        2        1
+   2 │     1        1        2
+   3 │     4        3  missing
+   4 │     5        4  missing
+   5 │     3  missing        3
+   6 │     6  missing        4
+   7 │     7  missing        5
+```
+
+If you would like the result to keep the row order of the left table, pass
+the `order=:left` keyword argument:
+
+```jldoctest joins
+julia> outerjoin(df_left, df_right, on=:id, order=:left)
+7×3 DataFrame
+ Row │ id     left     right   
+     │ Int64  Int64?   Int64?
+─────┼─────────────────────────
+   1 │     1        1        2
+   2 │     2        2        1
+   3 │     4        3  missing
+   4 │     5        4  missing
+   5 │     3  missing        3
+   6 │     6  missing        4
+   7 │     7  missing        5
+```
+
+Note that in this case keys missing from the left table are put after the keys
+present in it.
+
+Similarly `order=:right` keeps the order of the right table (and puts keys
+not present in it at the end):
+
+```jldoctest joins
+julia> outerjoin(df_left, df_right, on=:id, order=:right)
+7×3 DataFrame
+ Row │ id     left     right   
+     │ Int64  Int64?   Int64?
+─────┼─────────────────────────
+   1 │     2        2        1
+   2 │     1        1        2
+   3 │     3  missing        3
+   4 │     6  missing        4
+   5 │     7  missing        5
+   6 │     4        3  missing
+   7 │     5        4  missing
+```
+
+## In-place left join
+
+A common operation is adding data from a reference table to some main table.
+It is possible to perform such an in-place update using the `leftjoin!`
+function. In this case the left table is updated in place with matching rows from
+the right table.
+
+```jldoctest joins
+julia> main = DataFrame(id=1:4, main=1:4)
+4×2 DataFrame
+ Row │ id     main  
+     │ Int64  Int64 
+─────┼──────────────
+   1 │     1      1
+   2 │     2      2
+   3 │     3      3
+   4 │     4      4
+
+julia> leftjoin!(main, DataFrame(id=[2, 4], info=["a", "b"]), on=:id);
+
+julia> main
+4×3 DataFrame
+ Row │ id     main   info    
+     │ Int64  Int64  String? 
+─────┼───────────────────────
+   1 │     1      1  missing 
+   2 │     2      2  a
+   3 │     3      3  missing 
+   4 │     4      4  b
+```
+
+Note that in this case the order and number of rows in the left table are not
+changed. Therefore, in particular, it is not allowed to have duplicate keys
+in the right table:
+
+```
+julia> leftjoin!(main, DataFrame(id=[2, 2], info_bad=["a", "b"]), on=:id)
+ERROR: ArgumentError: duplicate rows found in right table
+```
+