Add an option in joins to specify row order (#3233)
bkamins authored Dec 24, 2022
1 parent b240458 commit e0cd3b8
Showing 4 changed files with 653 additions and 82 deletions.
11 changes: 10 additions & 1 deletion NEWS.md
@@ -3,13 +3,22 @@
## New functionalities

* Add `Iterators.partition` support
([#3212](https://github.com/JuliaData/DataFrames.jl/pull/3212))
* Add `allunique` and allow transformations in `cols` argument of `describe`
and `nonunique` when working with `SubDataFrame`
([#3232](https://github.com/JuliaData/DataFrames.jl/pull/3232))
* Add support for `operator` keyword argument in `Cols`
to take a set operation to apply to passed selectors (`union` by default)
([#3224](https://github.com/JuliaData/DataFrames.jl/pull/3224))
* Joining functions now support `order` keyword argument allowing the user
to specify the order of the rows in the produced table
([#3233](https://github.com/JuliaData/DataFrames.jl/pull/3233))

## Bug fixes

* Passing a very large number of data frames to `innerjoin` and `outerjoin`
no longer leads to a stack overflow
([#3233](https://github.com/JuliaData/DataFrames.jl/pull/3233))

# DataFrames.jl v1.4.4 Patch Release Notes

251 changes: 235 additions & 16 deletions docs/src/man/joins.md
@@ -1,6 +1,10 @@
# Database-Style Joins

## Introduction to joins

We often need to combine two or more data sets together to provide a complete
picture of the topic we are studying. For example, suppose that we have the
following two data sets:

```jldoctest joins
julia> using DataFrames
@@ -22,7 +26,8 @@ julia> jobs = DataFrame(ID=[20, 40], Job=["Lawyer", "Doctor"])
2 │ 40 Doctor
```

We might want to work with a larger data set that contains both the names and
jobs for each ID. We can do this using the `innerjoin` function:

```jldoctest joins
julia> innerjoin(people, jobs, on = :ID)
@@ -34,21 +39,29 @@ julia> innerjoin(people, jobs, on = :ID)
2 │ 40 Jane Doe Doctor
```

In relational database theory, this operation is generally referred to as a
join. The columns used to determine which rows should be combined during a join
are called keys.

The following functions are provided to perform seven kinds of joins:

- `innerjoin`: the output contains rows for values of the key that exist in all
passed data frames.
- `leftjoin`: the output contains rows for values of the key that exist in the
first (left) argument, whether or not that value exists in the second (right)
argument.
- `rightjoin`: the output contains rows for values of the key that exist in the
second (right) argument, whether or not that value exists in the first (left)
argument.
- `outerjoin`: the output contains rows for values of the key that exist in any
of the passed data frames.
- `semijoin`: Like an inner join, but output is restricted to columns from the
first (left) argument.
- `antijoin`: The output contains rows for values of the key that exist in the
first (left) but not the second (right) argument. As with `semijoin`, output
is restricted to columns from the first (left) argument.
- `crossjoin`: The output is the cartesian product of rows from all passed data
frames.

See [the Wikipedia page on SQL joins](https://en.wikipedia.org/wiki/Join_(SQL)) for more information.

@@ -124,8 +137,10 @@ julia> crossjoin(people, jobs, makeunique = true)
4 │ 40 Jane Doe 60 Astronaut
```

## Joining on key columns with different names

To join data frames on keys that have different names in the left and
right tables, pass `left => right` pairs as the `on` argument:

```jldoctest joins
julia> a = DataFrame(ID=[20, 40], Name=["John Doe", "Jane Doe"])
@@ -198,6 +213,8 @@ julia> innerjoin(a, b, on = [:City => :Location, :Job => :Work])
9 │ New York Doctor 5 e
```

## Handling of duplicate keys and tracking source data frame

Additionally, notice that in the last join rows 2 and 3 had the same values on
`on` variables in both joined `DataFrame`s. In such a situation `innerjoin`,
`outerjoin`, `leftjoin` and `rightjoin` will produce all combinations of
@@ -248,3 +265,205 @@ julia> outerjoin(a, b, on=:ID, validate=(true, true), source=:source)

Note that this time we also used the `validate` keyword argument and it did not
produce errors as the keys defined in both source data frames were unique.
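
As a rough illustration of the opposite case, the sketch below passes a left
table with a duplicated key to `innerjoin` together with
`validate=(true, false)`, which requests a uniqueness check of the keys in the
left table only. The table names and contents are made up for this example;
the call is expected to throw rather than return a result.

```julia
using DataFrames

# Hypothetical tables: the key 1 appears twice in the left table.
left = DataFrame(ID=[1, 1, 2], x=1:3)
right = DataFrame(ID=[1, 2], y=["a", "b"])

# validate=(true, false) checks key uniqueness in the left table only.
threw = try
    innerjoin(left, right, on=:ID, validate=(true, false))
    false
catch
    true
end

println("validation failed: ", threw)
```

Dropping the `validate` argument makes the same call succeed and return all
combinations of the matching rows.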

## Renaming joined columns

Often you want to keep track of the source data frame of a given column.
This feature is supported with the `renamecols` keyword argument:

```jldoctest joins
julia> innerjoin(a, b, on=:ID, renamecols = "_left" => "_right")
1×3 DataFrame
Row │ ID Name_left Job_right
│ Int64 String String
─────┼─────────────────────────────
1 │ 20 John Lawyer
```

In the above example we added the `"_left"` suffix to the non-key columns from
the left table and the `"_right"` suffix to the non-key columns from the right
table.

Alternatively, you can pass a pair of functions that transform the column names:

```jldoctest joins
julia> innerjoin(a, b, on=:ID, renamecols = lowercase => uppercase)
1×3 DataFrame
Row │ ID name JOB
│ Int64 String String
─────┼───────────────────────
1 │ 20 John Lawyer
```

## Matching missing values in joins

By default, when you try to perform a join on a key that has `missing` values,
you get an error:

```jldoctest joins
julia> df1 = DataFrame(id=[1, missing, 3], a=1:3)
3×2 DataFrame
Row │ id a
│ Int64? Int64
─────┼────────────────
1 │ 1 1
2 │ missing 2
3 │ 3 3

julia> df2 = DataFrame(id=[1, 2, missing], b=1:3)
3×2 DataFrame
Row │ id b
│ Int64? Int64
─────┼────────────────
1 │ 1 1
2 │ 2 2
3 │ missing 3

julia> innerjoin(df1, df2, on=:id)
ERROR: ArgumentError: missing values in key columns are not allowed when matchmissing == :error
```

If you would prefer `missing` values to be treated as equal, pass
the `matchmissing=:equal` keyword argument:

```jldoctest joins
julia> innerjoin(df1, df2, on=:id, matchmissing=:equal)
2×3 DataFrame
Row │ id a b
│ Int64? Int64 Int64
─────┼───────────────────────
1 │ 1 1 1
2 │ missing 2 3
```

Alternatively, you might want to drop all rows with `missing` values in the
key. In this case pass `matchmissing=:notequal`:

```jldoctest joins
julia> innerjoin(df1, df2, on=:id, matchmissing=:notequal)
1×3 DataFrame
Row │ id a b
│ Int64? Int64 Int64
─────┼──────────────────────
1 │ 1 1 1
```

## Specifying row order in the join result

By default the order of rows produced by the join operation is undefined:

```jldoctest joins
julia> df_left = DataFrame(id=[1, 2, 4, 5], left=1:4)
4×2 DataFrame
Row │ id left
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 4 3
4 │ 5 4

julia> df_right = DataFrame(id=[2, 1, 3, 6, 7], right=1:5)
5×2 DataFrame
Row │ id right
│ Int64 Int64
─────┼──────────────
1 │ 2 1
2 │ 1 2
3 │ 3 3
4 │ 6 4
5 │ 7 5

julia> outerjoin(df_left, df_right, on=:id)
7×3 DataFrame
Row │ id left right
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 2 2 1
2 │ 1 1 2
3 │ 4 3 missing
4 │ 5 4 missing
5 │ 3 missing 3
6 │ 6 missing 4
7 │ 7 missing 5
```

If you would like the result to keep the row order of the left table, pass
the `order=:left` keyword argument:

```jldoctest joins
julia> outerjoin(df_left, df_right, on=:id, order=:left)
7×3 DataFrame
Row │ id left right
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 1 1 2
2 │ 2 2 1
3 │ 4 3 missing
4 │ 5 4 missing
5 │ 3 missing 3
6 │ 6 missing 4
7 │ 7 missing 5
```

Note that in this case keys missing from the left table are put after the keys
present in it.

Similarly `order=:right` keeps the order of the right table (and puts keys
not present in it at the end):

```jldoctest joins
julia> outerjoin(df_left, df_right, on=:id, order=:right)
7×3 DataFrame
Row │ id left right
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 2 2 1
2 │ 1 1 2
3 │ 3 missing 3
4 │ 6 missing 4
5 │ 7 missing 5
6 │ 4 3 missing
7 │ 5 4 missing
```
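
The release notes in this commit say that the joining functions in general
accept `order`, so the same keyword should also work outside of `outerjoin`.
The sketch below is a minimal check with `innerjoin`, assuming a DataFrames.jl
version that includes this change; the tables are hypothetical and chosen so
that every key occurs exactly once on both sides.

```julia
using DataFrames

# Hypothetical tables: every key appears once in each of them,
# so the inner join keeps all three rows.
left = DataFrame(id=[3, 1, 2], l=["c", "a", "b"])
right = DataFrame(id=[1, 2, 3], r=[10, 20, 30])

# Rows follow the order of the left table.
by_left = innerjoin(left, right, on=:id, order=:left)

# Rows follow the order of the right table.
by_right = innerjoin(left, right, on=:id, order=:right)

println(by_left)
println(by_right)
```

Requesting an explicit order is useful when later code relies on row
positions, since the default order is not guaranteed.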

## In-place left join

A common operation is adding data from a reference table to some main table.
It is possible to perform such an in-place update using the `leftjoin!`
function. In this case the left table is updated in place with matching rows from
the right table.

```jldoctest joins
julia> main = DataFrame(id=1:4, main=1:4)
4×2 DataFrame
Row │ id main
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
3 │ 3 3
4 │ 4 4

julia> leftjoin!(main, DataFrame(id=[2, 4], info=["a", "b"]), on=:id);

julia> main
4×3 DataFrame
Row │ id main info
│ Int64 Int64 String?
─────┼───────────────────────
1 │ 1 1 missing
2 │ 2 2 a
3 │ 3 3 missing
4 │ 4 4 b
```

Note that in this case the order and number of rows in the left table are not
changed. Therefore, in particular, it is not allowed to have duplicate keys
in the right table:

```
julia> leftjoin!(main, DataFrame(id=[2, 2], info_bad=["a", "b"]), on=:id)
ERROR: ArgumentError: duplicate rows found in right table
```
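
One possible way to handle this, sketched below, is to deduplicate the
reference table on the key before calling `leftjoin!`. The example assumes
that keeping the first row for each key is acceptable, which depends on your
data; the table names `main_tbl`, `ref` and `ref_unique` are hypothetical.

```julia
using DataFrames

main_tbl = DataFrame(id=1:4, main=1:4)

# Hypothetical reference table in which the key 2 appears twice.
ref = DataFrame(id=[2, 2, 4], info=["a", "b", "c"])

# Keep only the first row for each value of :id
# (an assumption about which duplicate should win).
ref_unique = unique(ref, :id)

# The right table now has unique keys, so the in-place join is allowed.
leftjoin!(main_tbl, ref_unique, on=:id)

println(main_tbl)
```

If the duplicates carry different information that must be preserved, a
regular `leftjoin`, which may increase the number of rows, is likely the
better fit.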
