improve error messages in joins #3349

Merged
merged 4 commits into from
Jun 30, 2023
57 changes: 32 additions & 25 deletions docs/src/man/joins.md
@@ -145,7 +145,15 @@ function. This is consistent with the `Set` and `Dict` types in Julia Base.
It is not recommended to use floating point numbers as keys: floating point
comparisons can be surprising and unpredictable. If you do use floating point
keys, note that by default an error is raised when keys include `-0.0`
(negative zero) or `NaN` values. This can be overridden by wrapping the key
(negative zero) or `NaN` values.
Here is an example:

```jldoctest joins
julia> innerjoin(DataFrame(id=[-0.0]), DataFrame(id=[0.0]), on=:id)
ERROR: ArgumentError: Currently for numeric values `NaN` and `-0.0` in their real or imaginary components are not allowed. Such value was found in column :id in left data frame. Use CategoricalArrays.jl to wrap these values in a CategoricalVector to perform the requested join.
```

This can be overridden by wrapping the key
values in a [categorical](@ref man-categorical) vector.

## Joining on key columns with different names
@@ -285,7 +293,7 @@ This feature is supported with the `renamecols` keyword argument:
```jldoctest joins
julia> innerjoin(a, b, on=:ID, renamecols = "_left" => "_right")
1×3 DataFrame
Row │ ID Name_left Job_right
Row │ ID Name_left Job_right
│ Int64 String String
─────┼─────────────────────────────
1 │ 20 John Lawyer
@@ -299,7 +307,7 @@ Alternatively it is allowed to pass a function transforming column names:
```jldoctest joins
julia> innerjoin(a, b, on=:ID, renamecols = lowercase => uppercase)
1×3 DataFrame
Row │ ID name JOB
Row │ ID name JOB
│ Int64 String String
─────┼───────────────────────
1 │ 20 John Lawyer
@@ -314,24 +322,24 @@ you get an error:
```jldoctest joins
julia> df1 = DataFrame(id=[1, missing, 3], a=1:3)
3×2 DataFrame
Row │ id a
│ Int64? Int64
Row │ id a
│ Int64? Int64
─────┼────────────────
1 │ 1 1
2 │ missing 2
3 │ 3 3

julia> df2 = DataFrame(id=[1, 2, missing], b=1:3)
3×2 DataFrame
Row │ id b
│ Int64? Int64
Row │ id b
│ Int64? Int64
─────┼────────────────
1 │ 1 1
2 │ 2 2
3 │ missing 3

julia> innerjoin(df1, df2, on=:id)
ERROR: ArgumentError: missing values in key columns are not allowed when matchmissing == :error
ERROR: ArgumentError: Missing values in key columns are not allowed when matchmissing == :error. `missing` found in column :id in left data frame.
```

If you would prefer `missing` values to be treated as equal pass
@@ -340,8 +348,8 @@ the `matchmissing=:equal` keyword argument:
```jldoctest joins
julia> innerjoin(df1, df2, on=:id, matchmissing=:equal)
2×3 DataFrame
Row │ id a b
│ Int64? Int64 Int64
Row │ id a b
│ Int64? Int64 Int64
─────┼───────────────────────
1 │ 1 1 1
2 │ missing 2 3
@@ -353,7 +361,7 @@ case pass `matchmissing=:notequal`:
```jldoctest joins
julia> innerjoin(df1, df2, on=:id, matchmissing=:notequal)
1×3 DataFrame
Row │ id a b
Row │ id a b
│ Int64? Int64 Int64
─────┼──────────────────────
1 │ 1 1 1
@@ -366,8 +374,8 @@ By default the order of rows produced by the join operation is undefined:
```jldoctest joins
julia> df_left = DataFrame(id=[1, 2, 4, 5], left=1:4)
4×2 DataFrame
Row │ id left
│ Int64 Int64
Row │ id left
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
@@ -376,8 +384,8 @@ julia> df_left = DataFrame(id=[1, 2, 4, 5], left=1:4)

julia> df_right = DataFrame(id=[2, 1, 3, 6, 7], right=1:5)
5×2 DataFrame
Row │ id right
│ Int64 Int64
Row │ id right
│ Int64 Int64
─────┼──────────────
1 │ 2 1
2 │ 1 2
@@ -387,7 +395,7 @@ julia> df_right = DataFrame(id=[2, 1, 3, 6, 7], right=1:5)

julia> outerjoin(df_left, df_right, on=:id)
7×3 DataFrame
Row │ id left right
Row │ id left right
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 2 2 1
@@ -405,7 +413,7 @@ the `order=:left` keyword argument:
```jldoctest joins
julia> outerjoin(df_left, df_right, on=:id, order=:left)
7×3 DataFrame
Row │ id left right
Row │ id left right
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 1 1 2
@@ -426,7 +434,7 @@ not present in it at the end):
```jldoctest joins
julia> outerjoin(df_left, df_right, on=:id, order=:right)
7×3 DataFrame
Row │ id left right
Row │ id left right
│ Int64 Int64? Int64?
─────┼─────────────────────────
1 │ 2 2 1
@@ -448,8 +456,8 @@ the right table.
```jldoctest joins
julia> main = DataFrame(id=1:4, main=1:4)
4×2 DataFrame
Row │ id main
│ Int64 Int64
Row │ id main
│ Int64 Int64
─────┼──────────────
1 │ 1 1
2 │ 2 2
@@ -460,12 +468,12 @@ julia> leftjoin!(main, DataFrame(id=[2, 4], info=["a", "b"]), on=:id);

julia> main
4×3 DataFrame
Row │ id main info
│ Int64 Int64 String?
Row │ id main info
│ Int64 Int64 String?
─────┼───────────────────────
1 │ 1 1 missing
1 │ 1 1 missing
2 │ 2 2 a
3 │ 3 3 missing
3 │ 3 3 missing
4 │ 4 4 b
```

@@ -477,4 +485,3 @@ in the right table:
julia> leftjoin!(main, DataFrame(id=[2, 2], info_bad=["a", "b"]), on=:id)
ERROR: ArgumentError: duplicate rows found in right table
```

19 changes: 13 additions & 6 deletions src/join/composer.jl
@@ -79,21 +79,28 @@ struct DataFrameJoiner
dfl_on = select(dfl, left_on, copycols=false)
dfr_on = select(dfr, right_on, copycols=false)
if matchmissing === :error
for df in (dfl_on, dfr_on), col in eachcol(df)
for (df_i, df) in enumerate((dfl_on, dfr_on)),
(col_name, col) in pairs(eachcol(df))
if any(ismissing, col)
throw(ArgumentError("missing values in key columns are not allowed " *
"when matchmissing == :error"))
throw(ArgumentError("Missing values in key columns are not allowed " *
"when matchmissing == :error. " *
"`missing` found in column :$col_name in " *
(df_i == 1 ? "left" : "right") * " data frame."))
end
end
elseif !(matchmissing in (:equal, :notequal))
throw(ArgumentError("matchmissing allows only :error, :equal, or :notequal"))
end
for df in (dfl_on, dfr_on), col in eachcol(df)
for (df_i, df) in enumerate((dfl_on, dfr_on)),
(col_name, col) in pairs(eachcol(df))
if any(x -> (x isa Union{Complex, Real}) &&
(isnan(x) || isequal(real(x), -0.0) || isequal(imag(x), -0.0)), col)
throw(ArgumentError("currently for numeric values NaN and `-0.0` " *
throw(ArgumentError("Currently for numeric values `NaN` and `-0.0` " *
"in their real or imaginary components are not " *
"allowed. Use CategoricalArrays.jl to wrap " *
"allowed. " *
"Such value was found in column :$col_name in " *
(df_i == 1 ? "left" : "right") * " data frame. " *
"Use CategoricalArrays.jl to wrap " *
"these values in a CategoricalVector to perform " *
"the requested join."))
end
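
The pattern used in the `composer.jl` change (enumerate the two key tables, iterate over name/column pairs, and name the offending column and side in the error) can be tried outside DataFrames.jl. The following is a minimal sketch, not the package's code: `check_join_keys` is a hypothetical helper, key columns are passed as plain `NamedTuple`s of vectors instead of data frames, and the messages are shortened for illustration.

```julia
# Hypothetical helper (not part of DataFrames.jl): scans the key columns of the
# left and right inputs and reports which column on which side holds a
# disallowed value. Inputs are NamedTuples mapping column names to vectors.
function check_join_keys(left::NamedTuple, right::NamedTuple)
    for (side_i, cols) in enumerate((left, right)),
        (col_name, col) in pairs(cols)
        side = side_i == 1 ? "left" : "right"
        # Same condition as the matchmissing == :error branch in the diff.
        if any(ismissing, col)
            throw(ArgumentError("`missing` found in column :$col_name in " *
                                "$side data frame."))
        end
        # Same predicate as in the diff: NaN or negative zero in the real or
        # imaginary component of a numeric value.
        if any(x -> (x isa Union{Complex, Real}) &&
                    (isnan(x) || isequal(real(x), -0.0) || isequal(imag(x), -0.0)), col)
            throw(ArgumentError("`NaN` or `-0.0` found in column :$col_name in " *
                                "$side data frame."))
        end
    end
    return nothing
end

# Usage: the right-hand :id column contains -0.0, so the message names it.
try
    check_join_keys((id = [1.0, 2.0],), (id = [0.0, -0.0],))
catch err
    println(err.msg)
end
```

Carrying the side index through `enumerate` and the column name through `pairs` is what lets the message point at a specific column; the loops that this change replaced iterated over bare columns and had no name to report.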